Statistics in Geophysics: Linear Regression


Setting the scene; Fitting a straight line by least squares; The analysis of variance; Interval estimation and tests for the parameters; Examining residuals

Statistics in Geophysics: Linear Regression
Steffen Unkel, Department of Statistics, Ludwig-Maximilians-University Munich, Germany
Winter Term 2013/14

Historical remarks
Sir Francis Galton (1822-1911) was responsible for the introduction of the word "regression".
Galton, F. (1886): Regression towards mediocrity in hereditary stature, The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15, pp. 246-263.
Regression equation: $\hat{y} = \bar{y} + \tfrac{2}{3}(x - \bar{x})$, where $y$ denotes the height of the child and $x$ is a weighted average of the mother's and father's heights.

Regression to the mean
Figure: Scatterplot of mid-parental height against child's height, and regression line (dark red line).

Relationship between two variables
We can distinguish predictor variables and response variables. Other names frequently seen are:
  • Predictor variable: input variable, X-variable, regressor, covariate, independent variable
  • Response variable: output variable, predictand, Y-variable, dependent variable
We shall be interested in finding out how changes in the predictor variables affect the values of a response variable.

Relationship between two variables:
Example
Figure: Plot of the minimum temperature (°F) observations at Ithaca and Canandaigua, New York, for January 1987. Horizontal axis: Ithaca minimum temperature in degrees Fahrenheit; vertical axis: Canandaigua minimum temperature in degrees Fahrenheit.

Model
In simple (multiple) linear regression, one (two or more) predictor variable(s) is (are) assumed to affect the values of a response variable in a linear fashion.
For the model of simple linear regression, we assume
$$y = f(x) + \varepsilon = \beta_0 + \beta_1 x + \varepsilon,$$
where $E(y \mid x) = f(x)$ is known as the systematic component and $\varepsilon$ is the random error term.
Inserting the data yields the $n$ equations
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n,$$
with unknown regression coefficients $\beta_0$ and $\beta_1$.

Assumptions
  • The systematic component $f$ is a linear combination of covariates, that is, $f$ is linear in the parameters.
  • Additivity of errors.
  • The error terms $\varepsilon_i$ ($i = 1, \ldots, n$) are random variables with $E(\varepsilon_i) = 0$ and constant variance $\sigma^2$ (unknown), that is, homoscedastic errors with $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
  • We assume that errors are uncorrelated, that is, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.
  • We often assume a normal distribution for the errors: $\varepsilon_i \sim N(0, \sigma^2)$.

Least squares (LS) fitting
The estimated values $\hat{\beta}_0$ and $\hat{\beta}_1$ are determined as minimizers of the sum of squared deviations
$$\sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
for given data $(y_i, x_i)$, $i = 1, \ldots, n$. This yields
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
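
The closed-form least squares estimates can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function name `fit_line` and the five data points are invented for the example (they are not the Ithaca and Canandaigua temperatures), and only the standard library is used.

```python
from statistics import fmean


def fit_line(x, y):
    """Return (b0, b1) minimising sum((y_i - (b0 + b1 * x_i)) ** 2)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    x_bar, y_bar = fmean(x), fmean(y)
    # Numerator and denominator of the slope estimate.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return b0, b1


# Toy data invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_line(x, y)
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
```

Because $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, the fitted line always passes through the point of means, which is a quick sanity check on any implementation.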
Least squares (LS) fitting II
An estimate for the error variance $\sigma^2$, called the residual variance, is
$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
where $\hat{\varepsilon}_i$ and $\hat{y}_i$ ($i = 1, \ldots, n$) are the residuals and fitted values, respectively.
The sum of squared residuals is divided by $n - 2$ because two parameters have been estimated.

LS fitting: Example
Figure: Minimum temperature (°F) observations at Ithaca and Canandaigua, New York, for January 1987, with fitted least squares line ($\hat{y}_i = 12.459 + 0.598 x_i$). Horizontal axis: Ithaca minimum temperature in degrees Fahrenheit; vertical axis: Canandaigua minimum temperature in degrees Fahrenheit.

Goodness-of-fit
How much of the variation in the data has been explained by the regression line?
Consider the identity
$$y_i - \hat{y}_i = y_i - \bar{y} - (\hat{y}_i - \bar{y}) \iff (y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i).$$
Decomposition of the total sum of squares:
$$\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{SSR}} + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{SSE}}$$

Coefficient of determination
Some of the variation in the data (SST) can be ascribed to the regression line (SSR) and some to the fact that the actual observations do not all lie on the regression line (SSE).
A useful statistic to check is the $R^2$ value (coefficient of determination):
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{SSE}}{\text{SST}},$$
for which it holds that $0 \le R^2 \le 1$ and which is often expressed as a percentage by multiplying by 100.
The square root of $R^2$ is the absolute value of the Pearson correlation between $x$ and $y$.

ANOVA table for simple linear regression
Source of variation    Degrees of freedom (df)    Sum of squares (SS)    Mean square (MS)                         F-value
Regression             $1$                        SSR                    MSR $= \text{SSR}/1$                     MSR$/\hat{\sigma}^2$
Residual               $n - 2$                    SSE                    $\hat{\sigma}^2 = \text{SSE}/(n - 2)$
Total                  $n - 1$                    SST

F-test for significance of regression
Suppose that the errors $\varepsilon_i$ are independent $N(0, \sigma^2)$ variables. Then it can be shown that if $\beta_1 = 0$, the ratio
$$F = \frac{\text{MSR}}{\hat{\sigma}^2}$$
follows an $F$-distribution with $1$ and $(n - 2)$ degrees of freedom.
Statistical test: $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$.
We compare the $F$-value with the $100(1 - \alpha)\%$ point of the tabulated $F(1, n - 2)$-distribution in order to determine whether $\beta_1$ can be considered nonzero on the basis of the data we have seen.
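
The decomposition SST = SSR + SSE, the coefficient of determination and the F-value can be checked numerically. The sketch below is an illustration under stated assumptions, not the lecturer's code: `anova_simple` is a hypothetical helper name, the toy data are invented, and the critical value still has to be looked up in an F-table because the Python standard library has no F-distribution.

```python
from statistics import fmean


def anova_simple(x, y):
    """Sums of squares, R^2 and F-value for a simple linear regression."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need equally long samples with n >= 3")
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual
    if sst == 0 or sse == 0:
        raise ValueError("degenerate data: R^2 or F is undefined")

    sigma2_hat = sse / (n - 2)       # residual variance, df = n - 2
    f_value = (ssr / 1) / sigma2_hat  # MSR / sigma2_hat, with MSR = SSR / 1
    return {"SST": sst, "SSR": ssr, "SSE": sse,
            "R2": ssr / sst, "sigma2_hat": sigma2_hat, "F": f_value}


# Toy data invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
table = anova_simple(x, y)
for name, value in table.items():
    print(f"{name:>10}: {value:.4f}")
```

For these five toy points the tabulated 95% point of the F(1, 3)-distribution is about 10.13, so the F-value computed here would have to exceed that for $H_0: \beta_1 = 0$ to be rejected at the 5% level.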
Confidence intervals
$(1 - \alpha) \times 100\%$ confidence intervals for $\beta_0$ and $\beta_1$:
$$\left[\hat{\beta}_j \pm \hat{\sigma}_{\hat{\beta}_j} \times t_{1-\alpha/2}(n - 2)\right],$$
where $j = 0, 1$,
$$\hat{\sigma}_{\hat{\beta}_1} = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \quad \text{and} \quad \hat{\sigma}_{\hat{\beta}_0} = \hat{\sigma} \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}}.$$
For sufficiently large $n$: Replace quantiles of the $t(n - 2)$-distribution by quantiles of the $N(0, 1)$-distribution.

Hypothesis tests
Example: Two-sided test for $\beta_1$: $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$.
Observed test statistic:
$$t = \frac{\hat{\beta}_1 - 0}{\hat{\sigma}_{\hat{\beta}_1}} = \frac{\hat{\beta}_1}{\hat{\sigma}_{\hat{\beta}_1}}.$$
Rejection region: $|t| > t_{1-\alpha/2}(n - 2)$.
Note that the variable $F(1, n - 2)$ is the square of the $t(n - 2)$ variable.

Prediction intervals
A prediction interval for a future observation $y_0$ at a location $x_0$ with level $(1 - \alpha)$ is given by
$$\hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{1-\alpha/2}(n - 2)\, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.$$
A confidence interval for the regression function $\beta_0 + \beta_1 x$ with level $(1 - \alpha)$ is given by
$$\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{1-\alpha/2}(n - 2)\, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.$$

Prediction intervals: Example
Horizontal axis: Ithaca minimum temperature in degrees Fahrenheit; vertical axis: Canandaigua minimum temperature in degrees Fahrenheit.
Figure:
95% prediction intervals (red crosses) and 95% confidence intervals (green dots) around the regression (thick black line) for the January 1987 temperature data. Data to which the regression was fit (black dots) are also shown.

Residuals versus fitted values
Figure: Scatterplot of the residuals as a function of the predicted value $\hat{y}_i$ ($i = 1, \ldots, n$) (left) and as a function of date (right), for the January 1987 temperature data. Horizontal axes: fitted values (left) and date, January 1987 (right); vertical axes: residuals.

Durbin-Watson test
A test for serial correlation of regression residuals is the Durbin-Watson test.
Observed test statistic:
$$d = \frac{\sum_{i=2}^{n} (\hat{\varepsilon}_i - \hat{\varepsilon}_{i-1})^2}{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}, \qquad 0 \le d \le 4.$$
If successive residuals are positively (negatively) serially correlated, $d$ will be near 0 (near 4).
The distribution of $d$ is symmetric around 2.
The critical values for Durbin-Watson tests vary depending on the sample size and the number of predictor variables.

Durbin-Watson test II
Compare $d$ (or $4 - d$, whichever is closer to zero) with the tabulated critical values $d_L$ and $d_U$.
If $d < d_L$, conclude that positive serial correlation is a possibility; if $d > d_U$, conclude that no serial correlation is indicated.
If $4 - d < d_L$, conclude that negative serial correlation is a possibility; if $4 - d > d_U$, conclude that no serial correlation is indicated.
If the $d$ (or $4 - d$)
value lies between $d_L$ and $d_U$, the test is inconclusive.

Durbin-Watson test: Example
    Durbin-Watson test
    data: linmodel1
    DW = 1.5554, p-value = 0.08104
    alternative hypothesis: true autocorrelation is greater than 0

Quantile-quantile plot
A graphical impression of whether the residuals follow a normal distribution can be obtained through a quantile-quantile (Q-Q) plot.
The residuals are plotted on the vertical axis, and the standard normal quantiles corresponding to the empirical cumulative probability of each residual are plotted on the horizontal axis.
Draw a straight line through the main middle bulk of the plot. If all the points lie on such a line, more or less, one would conclude that the residuals do not deny the assumption of normality of errors.

Quantile-quantile plot: Example
Figure: Gaussian Q-Q plot of the residuals obtained from the regression of the January 1987 temperature data. Horizontal axis: theoretical quantiles; vertical axis: sample quantiles.
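
The two residual diagnostics described above, the Durbin-Watson statistic and the coordinates of a normal Q-Q plot, can be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the residual sequences are invented, and the plotting positions $(i - 0.5)/n$ are one common convention that the slides do not specify. The critical values $d_L$ and $d_U$ still have to be taken from tables.

```python
from statistics import NormalDist


def durbin_watson(residuals):
    """d = sum_{i=2}^{n} (e_i - e_{i-1})^2 / sum_{i=1}^{n} e_i^2."""
    den = sum(e * e for e in residuals)
    if den == 0:
        raise ValueError("all residuals are zero; d is undefined")
    # Pair each residual with its successor: (e_1, e_2), (e_2, e_3), ...
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return num / den


def qq_points(residuals):
    """Pairs (theoretical normal quantile, ordered residual) for a Q-Q plot.

    Assumption: plotting positions p_i = (i - 0.5) / n.
    """
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(residuals)
    return [(std_normal.inv_cdf((i - 0.5) / n), e)
            for i, e in enumerate(ordered, start=1)]


# Hypothetical residual sequences, invented for illustration.
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # sign flips each step
d_persistent = durbin_watson([1.0, 1.0, -1.0, -1.0])   # neighbours mostly agree
points = qq_points([0.5, -1.0, 0.2])

print("d (alternating residuals):", d_alternating)
print("d (persistent residuals): ", d_persistent)
for q, e in points:
    print(f"theoretical {q:+.3f}  sample {e:+.3f}")
```

The alternating sequence gives a value above 2 and the persistent one a value below 2, which matches the statement that negative serial correlation pushes $d$ towards 4 and positive serial correlation pushes it towards 0.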
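
Returning to the interval formulas for the coefficients, the regression function and a future observation, a minimal sketch is given below. It rests on several assumptions: `regression_intervals` is a hypothetical name, the data are invented, and the quantile $t_{1-\alpha/2}(n - 2)$ is passed in by the caller from a table, because the Python standard library has no t-distribution. For large $n$ a standard normal quantile could be supplied instead, as the slides note.

```python
from math import sqrt
from statistics import fmean


def regression_intervals(x, y, x0, t_quantile):
    """Intervals for b0, b1, the mean response at x0 and a new observation."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need equally long samples with n >= 3")
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = sqrt(sse / (n - 2))

    # Standard errors of the two coefficients.
    se_b1 = sigma_hat / sqrt(sxx)
    se_b0 = sigma_hat * sqrt(sum(xi * xi for xi in x) / (n * sxx))

    centre = b0 + b1 * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    half_mean = t_quantile * sigma_hat * sqrt(spread)      # regression function
    half_pred = t_quantile * sigma_hat * sqrt(1 + spread)  # future observation
    return {
        "b0": (b0 - t_quantile * se_b0, b0 + t_quantile * se_b0),
        "b1": (b1 - t_quantile * se_b1, b1 + t_quantile * se_b1),
        "t_b1": b1 / se_b1,  # statistic for H0: beta_1 = 0
        "mean": (centre - half_mean, centre + half_mean),
        "prediction": (centre - half_pred, centre + half_pred),
    }


# Toy data invented for illustration; 3.182 is the tabulated t_{0.975}(3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
result = regression_intervals(x, y, x0=3.0, t_quantile=3.182)
for name, value in result.items():
    print(name, value)
```

The prediction interval is always wider than the confidence interval for the regression function at the same location, because of the extra 1 under the square root, and the squared t statistic for $\beta_1$ reproduces the F-value of the ANOVA table.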

Posted: 04/12/2015, 17:09

Table of contents

  • Setting the scene

  • Fitting a straight line by least squares

  • The analysis of variance

  • Interval estimation and tests for the parameters

  • Examining residuals
