Multiple Regression STAT E-150 Statistical Methods
Three percent of a man's body is essential fat, which is necessary for a healthy body. However, too much body fat can be dangerous. For men between the ages of 18 and 39, a healthy body fat percent is 8% to 19%. (For women it is 21% to 32%.)

It is not easy to measure body fat percent, but we can find a model for the relationship between body fat percent and waist size and use it to find the body fat percent associated with a given waist size.

The scatterplot indicates a positive linear relationship between waist size and body fat percent.

[Scatterplot: Pct BF against Waist]

The SPSS output shows a significant linear relationship between the two variables.

Coefficients
              Unstandardized            Standardized
Model 1       B          Std. Error     Beta          t          Sig.
(Constant)    -42.734    2.717                        -15.731    .000
Waist         1.700      .074           .824          22.875     .000
a. Dependent Variable: Pct BF

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .824    .678       .677                4.7126

R² = .678, so almost 68% of the variability in body fat percentage is accounted for by waist size.

What other variables might be used to predict body fat percentage? Can we improve the prediction by including additional variables?

The Multiple Linear Regression Model

We have n observations on k explanatory variables X1, X2, X3, …, Xk and a response variable, Y. The multiple regression model is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

where ε ~ N(0, σε) and the errors are independent of one another.

The predictor variables may be higher powers or other functions of quantitative variables, coded categorical variables, or interaction terms. The main restriction is that the model is linear in the coefficients; that is, each term is a constant multiple of a predictor.

Fitting a Multiple Linear Regression Model

As we did in simple linear regression, we will choose a possible set of predictors, estimate the coefficients based on sample data, and assess the fit.
We will again use the sum of squared residuals, where the residuals are the differences between the actual Y values and the Y values predicted by the prediction equation

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k$$

and use SPSS to determine the estimates of the coefficients βi that minimize the sum of the squared residuals.

We will test the hypotheses
H0: β1 = β2 = β3 = ⋯ = βk = 0
Ha: The slopes are not all zero.

Our assumptions are:
- The y-values are independent of each other
- Y has a constant variance for any combination of predictors
- The values of y are normally distributed for any fixed set of values of the explanatory variables
That is, the errors are independent values from a N(0, σε) distribution.

If the null hypothesis is rejected, then test a null hypothesis for each of the coefficients:
H0: βj = 0
Ha: βj ≠ 0
Note: If this null hypothesis is not rejected, it does not mean that the corresponding predictor variable has no relationship to y; it means that the predictor variable adds nothing significant to modeling y after allowing for all the other predictors.

The hypotheses for fitting a multiple linear regression model to predict body fat percentage based on waist size and height are
H0: βwaist = βheight = 0
Ha: The slopes are not both zero.

Here are the scatterplots using the individual predictors:

[Scatterplots: Pct BF against Waist; Pct BF against Height]

Although these suggest a linear relationship between waist size and body fat percentage, there doesn't appear to be a linear relationship between height and body fat percentage.

Here are some of the results for a multiple regression analysis with both height and waist as predictors:

Coefficients
              Unstandardized            Standardized
Model 1       B          Std. Error     Beta          t          Sig.
(Constant)    -3.110     7.687                        -.405      .686
Waist         1.773      .072           .859          24.768     .000
Height        -.601      .110           -.190         -5.470     .000
a. Dependent Variable: Pct BF

ANOVA
Model 1       Sum of Squares   df     Mean Square   F          Sig.
Regression    12216.077        2      6108.038      307.096    .000
Residual      4912.743         247    19.890
Total         17128.820        249
a. Dependent Variable: Pct BF
b. Predictors: (Constant), Height, Waist

The p-value for height is close to 0, so we have evidence that height does contribute to the multiple regression model.

The graph shown below is called a scatterplot matrix. It shows the scatterplots for all pairs of the variables we are using.

[Scatterplot matrix: Pct BF, Waist, Height]

Which pair of variables shows a strong linear relationship? Pct BF and Waist
Which pair of variables shows a weak linear relationship? Height and Waist
Which pair of variables shows no linear relationship? Pct BF and Height

Residual Analysis

[Residual plots]

These plots tell us that there is no particular pattern in the residuals, and that the distribution of the residuals is close to normal.

Use the SPSS output provided (the Coefficients table above and the Model Summary below) to answer the following questions:

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .845    .713       .711                4.4598
a. Predictors: (Constant), Height, Waist
b. Dependent Variable: Pct BF

What is the fitted regression equation?
%BodyFat = 1.773 waist - .601 height - 3.110

What does the value 1.773 tell you?
An increase of one inch in the waist measurement is associated with an increase of 1.773 percentage points in body fat, for men of a particular height.

What change in body fat percentage is associated with each additional inch of height?
An increase of one inch in height is associated with a decrease of .601 percentage points in body fat, for men of a particular waist size.

What is the value of R²? What does it tell you?
R² = .713, which tells us that height and waist size together account for about 71.3% of the variation in the body fat percentage for men.

Use the SPSS results in the ANOVA table above to complete the hypothesis test:
The value of the test statistic is F = 307.096, with p = 0+.
What can you conclude?
Since p is close to zero, the null hypothesis is rejected. These data indicate that there is a linear relationship between body fat percentage and the predictor variables waist and height.
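The mean squares, the F statistic, and R² in the SPSS output all follow from the sums of squares and degrees of freedom in the ANOVA table. The short Python sketch below is only an illustration of that arithmetic (the course itself uses SPSS); the function and variable names are made up for this example.

```python
# Sums of squares and degrees of freedom copied from the SPSS ANOVA table.
ss_regression = 12216.077
ss_residual = 4912.743
df_regression = 2      # k, the number of predictors (Waist and Height)
df_residual = 247      # n - k - 1


def anova_summary(ss_reg, ss_res, df_reg, df_res):
    """Return (mean square regression, mean square error, F, R squared).

    Hypothetical helper written for this illustration only.
    """
    ms_reg = ss_reg / df_reg                 # Mean Square for Regression
    ms_res = ss_res / df_res                 # Mean Square for Residual
    f_stat = ms_reg / ms_res                 # F = MSR / MSE
    r_squared = ss_reg / (ss_reg + ss_res)   # R^2 = SSR / SSTotal
    return ms_reg, ms_res, f_stat, r_squared


ms_reg, ms_res, f_stat, r_squared = anova_summary(
    ss_regression, ss_residual, df_regression, df_residual
)
print(f"MSR = {ms_reg:.3f}, MSE = {ms_res:.3f}")
print(f"F = {f_stat:.3f}, R^2 = {r_squared:.3f}")
```

Up to rounding, the printed values should agree with the Mean Square, F, and R Square entries in the tables.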
We also want to estimate the standard deviation of the error term, σε. As we add a new predictor to the model, we have a new coefficient to estimate, and so we lose one more degree of freedom. The estimate for the standard error of the multiple regression model with k predictors is

$$\hat\sigma_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}}$$

Use the SPSS output to find the standard error of this regression model:

ANOVA
Model 1       Sum of Squares   df     Mean Square   F          Sig.
Regression    12216.077        2      6108.038      307.096    .000
Residual      4912.743         247    19.890
Total         17128.820        249
a. Dependent Variable: Pct BF
b. Predictors: (Constant), Height, Waist

$$\hat\sigma_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{\frac{4912.743}{247}} = \sqrt{19.8896} = 4.4598$$

Assessing a Multiple Regression Model

Individual t-Tests for Coefficients in Multiple Regression

In order to determine whether any one of the predictor variables is helpful to include in the model, we test the coefficient for that predictor:
H0: βi = 0
Ha: βi ≠ 0
The test statistic is

$$t = \frac{\hat\beta_i - 0}{SE(\hat\beta_i)}$$

with n - k - 1 degrees of freedom.

It is important to remember that the meaning of each coefficient depends on all of the predictors in the regression model. If we fail to reject the null hypothesis, it means that the corresponding predictor variable adds nothing significant to the multiple regression model after allowing for all other predictors.
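As a rough check on these two formulas, the sketch below recomputes the standard error of the regression and the t-ratios for the waist and height coefficients from the values printed in the SPSS tables. It is a minimal illustration in Python, not part of the SPSS workflow; the helper names are hypothetical, and n = 250 is inferred from the total degrees of freedom (249 = n - 1).

```python
import math

n = 250          # observations, inferred from total df = n - 1 = 249
k = 2            # predictors: Waist and Height
sse = 4912.743   # residual sum of squares from the ANOVA table


def regression_standard_error(sse, n, k):
    """Estimate of sigma_epsilon: sqrt(SSE / (n - k - 1))."""
    return math.sqrt(sse / (n - k - 1))


def t_ratio(estimate, std_error, null_value=0.0):
    """t statistic for H0: beta = null_value."""
    return (estimate - null_value) / std_error


sigma_hat = regression_standard_error(sse, n, k)

# B and Std. Error as printed (rounded) in the Coefficients table.
t_waist = t_ratio(1.773, 0.072)
t_height = t_ratio(-0.601, 0.110)

print(f"standard error of the regression = {sigma_hat:.4f}")
print(f"t (waist)  = {t_waist:.3f}")
print(f"t (height) = {t_height:.3f}")
```

The t-ratios computed this way differ slightly from the SPSS values (24.768 and -5.470) because the table shows B and Std. Error rounded to three decimal places, while SPSS divides the unrounded values.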
Use the SPSS output to test the coefficients in our model:

Coefficients
              Unstandardized            Standardized
Model 1       B          Std. Error     Beta          t          Sig.
(Constant)    -3.110     7.687                        -.405      .686
Waist         1.773      .072           .859          24.768     .000
Height        -.601      .110           -.190         -5.470     .000
a. Dependent Variable: Pct BF

H0: βheight = 0
Ha: βheight ≠ 0
t = -5.47, p = 0+
What is your conclusion?
Since p is close to 0, we will reject the null hypothesis. There is evidence that the percent of body fat is related to the height. We can conclude that the body fat percentage changes as the height changes, for men with the same waist size.

H0: βwaist = 0
Ha: βwaist ≠ 0
t = 24.768, p = 0+
What is your conclusion?
Since p is close to 0, we will reject the null hypothesis. There is evidence that the percent of body fat is related to the waist size. We can conclude that the body fat percentage changes as the waist size changes, for men of the same height.

Can we do a one-tailed test?
H0: βwaist = 0
Ha: βwaist > 0
t = 24.768, p = .000/2 = 0+
Because the t statistic is positive, in the direction of Ha, the one-tailed p-value is half of the two-tailed value that SPSS reports.
What is your conclusion?
Since p is close to 0, we will reject the null hypothesis.
There is evidence that the percent of body fat is positively related to the waist size. We can conclude that the body fat percentage increases as the waist size increases, for men of the same height.

Adjusted R²

The adjusted R² is an adjustment to R² that takes the sample size and the number of parameters (βj) into consideration. Unlike R², which never decreases when a predictor is added to the model, the adjusted R² increases only when the new predictor improves the fit enough to offset the lost degree of freedom, and so it can be useful in comparing regression models with different numbers of predictor variables.

Creating a Scatterplot Matrix

Click on Graphs > Chart Builder. Select Scatter/Dot from the list of charts. Drag the Scatterplot Matrix to the window. Drag the matrix variables to the horizontal axis. Click on OK. The scatterplot matrix will appear in the Output Viewer.

Estimating the Model

Click on Analyze > Regression > Linear. Drag the dependent variable and all independent variables to the appropriate locations. Click on OK.

This will produce several tables:

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .845    .713       .711                4.4598
a. Predictors: (Constant), Waist, Height

Coefficients
              Unstandardized            Standardized
Model 1       B          Std. Error     Beta          t          Sig.
(Constant)    -3.110     7.687                        -.405      .686
Height        -.601      .110           -.190         -5.470     .000
Waist         1.773      .072           .859          24.768     .000
a. Dependent Variable: Pct BF

ANOVA
Model 1       Sum of Squares   df     Mean Square   F          Sig.
Regression    12216.077        2      6108.038      307.096    .000
Residual      4912.743         247    19.890
Total         17128.820        249
a. Predictors: (Constant), Waist, Height
b. Dependent Variable: Pct BF

If you click on Plots in the Linear Regression dialog box, you will get the Plots dialog box. Plot *ZRESID on the Y axis against *ZPRED on the X axis. You may also choose to create a Normal Probability Plot and/or histogram of the residuals.

Click on Continue and then OK. Here are the results:

[Residual plots]
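To make the adjusted R² discussed above concrete, the sketch below applies the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), to the two models in these notes. This is an illustrative Python calculation, not SPSS output; the function name is hypothetical, and it assumes both models were fit to the same 250 men (n is inferred from the total degrees of freedom in the ANOVA table).

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fit to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


n = 250  # assumption: same sample for both models (total df 249 = n - 1)

# R Square values as printed in the two Model Summary tables.
adj_waist_only = adjusted_r_squared(0.678, n, k=1)     # Waist
adj_waist_height = adjusted_r_squared(0.713, n, k=2)   # Waist and Height

print(f"Waist only:       adjusted R^2 = {adj_waist_only:.3f}")
print(f"Waist and Height: adjusted R^2 = {adj_waist_height:.3f}")
```

Under these assumptions the results round to the Adjusted R Square values in the SPSS tables (.677 and .711). The adjusted value is higher for the two-predictor model, which is consistent with height being a useful addition.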