MiniTab
STAT 350 (Fall 2014)
Author: Will A. Eagon
Lab 9: SAS Solution

Lab 9: Inference for Regression

Objectives: Perform inference and check the assumptions in linear regression.

A (30 points) Beer and Blood Alcohol (Data Set: ex10-23bac.txt)

How well does the number of beers a student drinks predict his or her blood alcohol content? Sixteen student volunteers at Ohio State University drank a randomly assigned number of 12-ounce cans of beer. Thirty minutes later, a police officer measured their blood alcohol content (BAC). Here are the data:

[Data table: see ex10-23bac.txt]

The students were equally divided between men and women and differed in weight and usual drinking habits. Because of this variation, many students don't believe that number of drinks predicts blood alcohol well.

Solution
File → Open Worksheet → Files of type: Text (*.txt) → ex10-23bac.txt → Open

The following steps are used for all answers in this problem except where explicitly mentioned otherwise.

Stat → Regression → Regression → Fit Regression Model → Response: BAC, Continuous predictors: Beers → Options (Confidence level for all intervals: 90, Type of confidence interval: Two-sided) → OK → Results: Display of results: Expanded tables → OK → OK

1. (5 pts) Make a scatterplot of the data (including the least-squares regression line). Briefly describe the relationship between blood alcohol content and the number of beers.

Solution:
Graph → Scatterplot → With Regression → OK → (Y variables: BAC, X variables: Beers) → OK

There appears to be a moderately strong, positive, linear relationship between BAC and number of beers.

2. (5 pts) Obtain the equation of the least-squares regression line for predicting blood alcohol from number of beers. What is r² for these data?

Solution:
Regression Equation
BAC = -0.0127 + 0.01796 Beers

Model Summary
S          R-sq    R-sq(adj)  PRESS      R-sq(pred)
0.0204410  79.98%  78.55%     0.0086171  70.51%

r² = 79.98%

3. (5 pts) Using the results from parts (1) and (2), briefly summarize what your data analysis shows. You should emphasize in this summary whether these data can be used for prediction purposes. Please comment on whether this is an SRS situation based on the information provided.

Solution:
Analysis of Variance
Source         DF  Seq SS    Contribution  Adj SS    Adj MS    F-Value  P-Value
Regression      1  0.023375   79.98%       0.023375  0.023375  55.94    0.000
  Beers         1  0.023375   79.98%       0.023375  0.023375  55.94    0.000
Error          14  0.005850   20.02%       0.005850  0.000418
  Lack-of-Fit   7  0.002802    9.59%       0.002802  0.000400   0.92    0.543
  Pure Error    7  0.003048   10.43%       0.003048  0.000435
Total          15  0.029225  100.00%

First, the sample can be treated as an SRS because an SRS of all students could plausibly contain equal numbers of men and women. The scatterplot shows a linear relationship, and the standard deviation is approximately constant. The R² value is close to 1 and the MSE is very small, so the points in the scatterplot are close to the regression line in an absolute sense. Therefore, these data can be used for prediction.

4. (10 pts) Is there significant evidence that drinking more beers increases blood alcohol on the average in the population of all students? Please perform the 4-step process (state hypotheses, give a test statistic and P-value, and state your conclusion).

Solution:
Coefficients
Term      Coef     SE Coef  90% CI              T-Value  P-Value  VIF
Constant  -0.0127  0.0126   (-0.0350, 0.0096)   -1.00    0.332
Beers     0.01796  0.00240  (0.01373, 0.02219)   7.48    0.000    1.00

Step 0: Definition of the terms
β₁ is the population slope.

Step 1: State the hypotheses
H0: β₁ = 0
Ha: β₁ > 0

Step 2: Find the test statistic, report DF.
t = 7.48
DF = 14

Step 3: Find the P-value:
P-value < 0.0005
Minitab reports the two-sided P-value (0.000, that is, less than 0.0005). To convert it to a one-sided P-value we divide it by 2, so the one-sided P-value is also less than 0.0005.
Step 4: Conclusion:
α = 0.05
Since 0.000 ≤ 0.05, we should reject H0.
The data provide sufficiently strong evidence (P-value = 0.000) that there is a positive linear association between BAC and number of beers.

5. (5 pts) Steve thinks he can drive legally 30 minutes after he drinks 5 beers. The legal limit is BAC = 0.08. Give and interpret a 90% prediction interval for Steve's BAC. Can he be confident he won't be arrested if he drives and is stopped? Note: It is still bad to drive when buzzed, that is, when your BAC is below 0.08.

Solution:
Stat → Regression → Regression → Predict → Response: BAC, enter individual values, Beers: 5 → Options (Confidence level: 90, Type of interval: Two-sided) → OK → Results → Be sure that both Regression equation and Prediction tables are checked → OK → OK

Variable  Setting
Beers     5

Fit        SE Fit     90% CI                  90% PI
0.0771182  0.0051300  (0.0680826, 0.0861538)  (0.0399988, 0.114238)

The 90% prediction interval for Steve's BAC is (0.0399988, 0.114238). We are 90% confident that the next value of BAC after drinking 5 beers is between 0.0399988 and 0.114238. Since values greater than 0.08 are in the interval, he should not be confident that he can drive legally 30 minutes after drinking 5 beers.

B (50 points) House Prices (Data Set: sales.txt - webpage)

Real estate is typically reassessed annually for property tax purposes. This assessed value, however, is not necessarily the same as the fair market value of the property. The data file summarizes an SRS of 30 properties recently sold in a Midwestern city. Both variables, Sales Price and Assessed Value, are measured in thousands of dollars.

Solution:
File → Open Worksheet → Files of type: Text (*.txt) → sales.txt → Open

The following steps are used for all answers in this problem except where explicitly mentioned otherwise.
Stat → Regression → Regression → Fit Regression Model → Response: SalesPrice, Continuous predictors: AssessedValue → Options (Confidence level for all intervals: 95, Type of confidence interval: Two-sided) → OK → Graphs (Residual Plots: Check Individual plots with Histogram and Normal plot, Residuals versus the variables: AssessedValue) → OK → Results: Display of results: Expanded tables → OK → OK

1. (4 pts) Inspect the data. How many have a selling price greater than the assessed value? Do you think this trend would be true for the larger population of all homes recently sold? Explain your answer.

Solution:
There are 17 houses that have a selling price greater than the assessed value. This is slightly more than half of the 30 houses. In a larger sample, we would still expect roughly half of the houses to have a selling price greater than the assessed value. This trend may not generalize to cities other than this Midwestern city, because real estate values within a city are not independent of one another.

2. (5 pts) Make a scatterplot with assessed value on the horizontal axis. Please include the regression line in your plot. Briefly describe the relationship between assessed value and selling price.

Solution:
Graph → Scatterplot → With Regression → OK → (Y variables: SalesPrice, X variables: AssessedValue) → OK

The two variables have a strong, positive, linear relationship. However, there appears to be an x-outlier with an AssessedValue of more than 300.

3. (5 pts) Obtain the residuals and plot them versus assessed value. Is there anything unusual to report? If so, explain.

Solution:
The residual plot shows no pattern, so the association appears to be linear. The plot also supports the assumption of constant standard deviation. Again, there appears to be an outlier with AssessedValue greater than 300.

4. (5 pts) Do the residuals appear to be approximately Normal? Explain your answer. Be sure to include the appropriate graph in your answer.

Solution:
The residuals appear to be Normal: on the QQ plot the points are close to the line, and the Normal curve on the histogram matches the bars without important deviation. Therefore the x-outlier does not affect the Normality of the residuals.

5. (5 pts) Based on your answers to parts (2), (3), and (4), do the assumptions for the linear regression analysis appear reasonable? Explain your answer.

Solution:
First, it is appropriate to treat our sample as an SRS. The three other assumptions are also met: linearity, constant standard deviation of the residuals, and Normality of the residuals. The only concern is the x-outlier; we would need to determine whether it is influential.

6. (5 pts) Obtain the least-squares regression line for predicting selling price from assessed value.

Solution:
Regression Equation
SalesPrice = 37.4 + 0.849 AssessedValue

7. (3 pts) Calculate the predicted selling prices for homes currently assessed at $155,000, $220,000, and $285,000. (This part may be done by hand.)

Solution:
Stat → Regression → Regression → Predict → Response: SalesPrice, enter individual values, AssessedValue: 155, 220, 285 → OK

AssessedValue  Fit      SE Fit   90% CI              90% PI
155            168.984  6.83222  (157.347, 180.621)  (121.872, 216.096)
220            224.160  5.90144  (214.108, 234.212)  (177.414, 270.906)
285            279.336  12.0943  (258.736, 299.936)  (229.251, 329.421)

OR

Sales Price 1 = 37.41025 + 0.84886 * 155 = 168.9836
Sales Price 2 = 37.41025 + 0.84886 * 220 = 224.1595
Sales Price 3 = 37.41025 + 0.84886 * 285 = 279.3354

8. (3 pts) Suppose these houses sold for $142,900, $224,000, and $286,000, respectively. Calculate the residual for each of these sales.
(This part may be done by hand.)

Solution:
Residual 1 = Observed value - Predicted value = 142.9 - 168.9836 = -26.08 (-$26,080)
Residual 2 = Observed value - Predicted value = 224 - 224.1595 = -0.16 (about -$160)
Residual 3 = Observed value - Predicted value = 286 - 279.3354 = 6.66 ($6,660)

9. (10 pts) Construct and interpret a 95% confidence interval for the slope and the intercept. Explain why inference on the intercept is not of interest in this problem.

Solution:
Coefficients
Term           Coef   SE Coef  95% CI          T-Value  P-Value  VIF
Constant       37.4   23.9     (-11.7, 86.5)   1.56     0.130
AssessedValue  0.849  0.121    (0.601, 1.097)  7.03     0.000    1.00

Slope: 95% CI (0.601, 1.097)
We are 95% confident that the population slope is between 0.601 and 1.097. In other words, each additional $1,000 of assessed value is associated with an increase in mean selling price of between $601 and $1,097.

Intercept: 95% CI (-11.7, 86.5)
We are 95% confident that the population y-intercept is between -11.7 and 86.5.

Since a house cannot have an Assessed Value of 0, the y-intercept is not relevant in this situation.
OR
Since the data points do not include an Assessed Value of 0, the y-intercept would be an extrapolated point and so should not be considered in the study.

10. (5 pts) Using the result from part (9), compare the estimated regression line with y = x, which says that, on average, the selling price is equal to the assessed value. Is there evidence that this model is not reasonable? In other words, is the selling price typically larger or smaller than the assessed value? Explain your answer. How does your answer compare to your response in part (1)?

Solution:
To answer this question, you need to look at the confidence intervals of both the slope and the y-intercept. If y = x is the regression line for the population, then 0 would be in the confidence interval for the y-intercept and 1 would be in the confidence interval for the slope. Since both of these occur here, there is no evidence to suggest that this model is not reasonable.
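The comparison with y = x described above can be sketched in a few lines of Python. This is only an illustration: the interval endpoints are copied from the Coefficients table in part 9, and the helper name `contains` is hypothetical and has nothing to do with Minitab.

```python
# 95% confidence intervals copied from the part 9 Coefficients table
# (units: thousands of dollars for the intercept, unitless for the slope).
intercept_ci = (-11.7, 86.5)
slope_ci = (0.601, 1.097)

def contains(interval, value):
    """Return True if value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high

# Under y = x the population intercept is 0 and the population slope is 1.
intercept_ok = contains(intercept_ci, 0.0)
slope_ok = contains(slope_ci, 1.0)

print("0 in intercept CI:", intercept_ok)
print("1 in slope CI:", slope_ok)
```

Both checks come back true for these intervals, which matches the conclusion that the data give no evidence against y = x.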
Note: It is usually not appropriate to remove the y-intercept from the model, because the methodology used here would then no longer be appropriate. This is consistent with what was stated in part (1), that is, nearly half of the selling prices were below the assessed values.
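The hand calculations in parts 7 and 8 of Problem B can be reproduced with a short Python sketch. The coefficients are the unrounded values used in the part 7 hand calculation (37.41025 and 0.84886), and the sale prices are those given in part 8. The function names `predict` and `residual` are illustrative only.

```python
# Least-squares coefficients as used in the part 7 hand calculation.
# All dollar amounts are in thousands of dollars.
INTERCEPT = 37.41025
SLOPE = 0.84886

def predict(assessed_value):
    """Predicted sales price for a given assessed value."""
    return INTERCEPT + SLOPE * assessed_value

def residual(observed, assessed_value):
    """Residual = observed value - predicted value."""
    return observed - predict(assessed_value)

assessed = [155, 220, 285]      # assessed values from part 7
sold = [142.9, 224.0, 286.0]    # actual sale prices from part 8

for x, y in zip(assessed, sold):
    print(f"assessed={x}: predicted={predict(x):.3f}, residual={residual(y, x):+.3f}")
```

A negative residual means the house sold for less than the regression line predicts. With these inputs the first two residuals are negative and the third is positive.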