Sample Midterm (Solutions)
Applied Regression Analysis 41100
Instructor: Federico M. Bandi

The allotted time is 1 hour and 30 minutes. The exam is divided into three parts. The first and second parts are true-false and multiple choice, respectively. Please answer the true-false and multiple choice questions on the exam by circling the best answer. There will be no partial credit for these questions. The third part of the exam consists of several problems. Please answer these problems in the space provided on the exam (you may use the back of the sheets if necessary). You will get partial credit for these problems provided that your answers are organized and legible so that your train of thought can be easily followed.

Note: You should answer all questions on the exam. The blue books will not be looked at. Please print your name in the space provided below and sign. Panicking is not allowed and will be penalized!

Name:

Please sign the following pledge: “I pledge my honor that I have not violated the Honor Code during this examination.”

Signature:

True/False         8 Points
Multiple Choice   18 Points
Question 1        13 Points
Question 2        27 Points
Total             66 Points

True or False (1 point each)

(1) The estimated least-squares residuals are not correlated with the X values but could be correlated with the fitted values. T F

False. The fitted values are perfectly correlated with the X values since Ŷ = b0 + b1 X. Therefore, the correlation between the fitted values and the residuals is also zero. See Chapter 1 in the notes.

(2) If the correlation between the estimated least-squares residuals and the X values were positive, then the estimated slope in linear regression analysis would be too flat. T F

True. If the slope is too flat, the residuals associated with values near the lower bound of the range of the X variable are negative on average, while those associated with values near the upper bound of the range of the X variable are positive on average.
Hence, the shape of the scatter plot of the residuals against the X variable is upward sloping. We have a similar discussion in Chapter 1 in the notes.

(3) In simple regression analysis, E(β0) = b0. T F

False. The opposite is true. The estimator is unbiased (not the actual parameter), therefore E(b0) = β0. See Chapter 3.

(4) The true variance of the residuals (σ²) will eventually decrease if enough observations are added to the sample. T F

False. The true variance is a population parameter (see Chapter 2). Only its estimate (s²) is affected by the sample size.

(5) If we fail to reject a certain null hypothesis about a certain parameter of interest (the slope in the SLR model, say) at the 5% level, then we might still reject at the 1% level. T F

False. The cut-off values for a 1% test are larger in absolute value than the cut-off values for a 5% test. Therefore, if you fail to reject at the 5% level you also fail to reject at the 1% level. See Chapter 4.

(6) It is possible to reject a certain null hypothesis about a certain parameter of interest (the slope in the SLR model, say) even when the conjectured null hypothesis is true. T F

True. The nature of the tests that we have been discussing is such that the probability of rejecting when the null hypothesis is true coincides with the level of the test (it is 5% for a 5% test). See Chapter 4.

(7) If the p-value is 3%, then we would always reject the null hypothesis. T F

False. For example, you would not reject the null when the level of the test is 1%. Of course, you would reject the null if the level of the test is 5%. See Chapter 4.

(8) When the sample size is very large, the interval (b0 + b1 Xf − 2s, b0 + b1 Xf + 2s) is a valid 95% predictive interval for Yf given a new Xf. T F

True. Just take a look at the formula for a predictive interval at the end of Chapter 4.
Notice that the t cut-off values can be replaced by 2 and −2 (the approximate cut-off values based on the normal distribution) because the t distribution tends to the normal as the sample size (the number of degrees of freedom) gets large. Notice also that the statement implies that a simple “plug-in” interval (one that you obtain by replacing the true quantities β0, β1 and σ with the estimated quantities b0, b1 and s) is a good way to predict Yf given a new Xf when the estimation uncertainty is not substantial (when n is large). In this case the sample quantities (b0, b1 and s, that is) are similar to the population quantities (β0, β1 and σ, that is) which are the quantities that you would use if you knew what the population looks like (i.e., if you knew the true data-generating process). (Compare the predictive interval based on the true population parameters in Chapter 2 to the predictive interval based on estimates of the population parameters in Chapter 4.)

Multiple Choice (3 points each)

[1] You run a linear least-squares regression on a data set of 20 observations. The sample average of the X values is 10 and the sample average of the Y values is 5. Suppose you add an observation that has X = 10 and Y = 5. Now you run a new least-squares regression on the sample of 21 observations. How would the slope estimate change (compared to the previous regression)?

(a) It would increase
(b) It would decrease
(c) It would not change
(d) Cannot tell based on the information given

It would not change. The terms that get added in the expressions for the numerator and the denominator in the formula for the slope estimate (see Chapter 1 in the notes) for the 21st observation are both equal to zero, since the new (X, Y) pair is at the point of means (X̄, Ȳ).

[2] Assume the same set-up as in the previous question. How would the R² change?
(a) It would increase
(b) It would decrease
(c) It would not change
(d) Cannot tell based on the information given

It would not change. Both the SSE and the SST in the definition of the coefficient of determination (see Chapter 1) remain unchanged. In fact, the term for the 21st observation in both summations is equal to zero. (To see this, recall that the point of means (X̄, Ȳ) is on the regression line.)

[3] Assume the same set-up as in the previous questions. How would the estimated standard deviation of the residuals (s) change?

(a) It would increase
(b) It would decrease
(c) It would not change
(d) Cannot tell based on the information given

It would decrease. SSE is unchanged, but n is larger, so s² = SSE / (n − 2) is smaller.

[4] Consider the following simple linear regression (SLR) model: Yi = 1 + 2Xi + εi, where εi ~ N(0, 1) i.i.d. The error term εi is independent of Xi for every i. Which of the following statements is WRONG?

(a) E(Y | X = 2) = 5
(b) The 95% predictive interval for Y given X = 2 is (3, 7)
(c) The 68% predictive interval for Y given X = 2 is (4, 6)
(d) If Xi ~ N(0, 1), then the variance of Y is equal to 6 (i.e., Var(Y) = 6)
(e) If Xi ~ N(0, 1), then the expected value of Y is equal to 1 (i.e., E(Y) = 1)

The answer is (d).
(a) E(Y | X = 2) = 1 + 2 ∗ 2 = 5,
(b) 95% predictive interval = (5 − 2 ∗ 1, 5 + 2 ∗ 1) = (3, 7),
(c) 68% predictive interval = (5 − 1 ∗ 1, 5 + 1 ∗ 1) = (4, 6),
(d) Var(Y) = 4 Var(X) + Var(ε) = 4 + 1 = 5,
(e) E(Y) = 1 + 2 E(X) + E(ε) = 1.

[5] Which of the following results in a LARGER confidence interval width for the intercept estimate in linear regression analysis (everything else being kept constant)?

(a) smaller estimated intercept
(b) smaller degree of confidence
(c) smaller sample size
(d) smaller estimated residual variance (s²)
(e) larger regressor variance (s²X)
(f) None of the above

The answer is (c). This is immediate from the formula for the confidence interval of the intercept in Chapter 4.
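The answers to questions [1] to [3] can be checked numerically. The sketch below is only an illustration and not part of the exam: the data set is simulated (the slope 0.5, the noise level and the helper name ols_summary are assumptions made for the example), and its sample means are not exactly 10 and 5. What matters is that the added 21st observation sits at the point of means of the first 20.

```python
import math
import random

def ols_summary(x, y):
    """Return (slope, R-squared, s) for a least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1, 1.0 - sse / sst, math.sqrt(sse / (n - 2))

# Hypothetical sample of 20 observations (values chosen only for illustration).
random.seed(0)
x = [random.gauss(10.0, 3.0) for _ in range(20)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Add one observation located exactly at the point of means.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
slope_20, r2_20, s_20 = ols_summary(x, y)
slope_21, r2_21, s_21 = ols_summary(x + [x_bar], y + [y_bar])

print("slope:", slope_20, slope_21)   # equal up to floating-point error
print("R-sq :", r2_20, r2_21)         # equal up to floating-point error
print("s    :", s_20, s_21)           # s_21 is smaller
```

Since SSE is unchanged while the divisor goes from 18 to 19, s shrinks by the factor sqrt(18/19) in this sketch.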
[6] One concern about the depletion of the ozone layer is that the increase in UV light will decrease crop yields. An experiment was conducted in a greenhouse where soybean plants were exposed to varying UV levels measured in Dobson units. At the end of the experiment the yield (kg) was measured. Using 100 observations, a linear regression analysis was performed with the following results:

             Estimate     Std error   t-ratio   P-value
Intercept    3.9800118    0.053774    74.01     < .0001
UV          −0.046285     0.010741    ??        0.0008

Which of the following statements is WRONG?

(a) An increase in UV light decreases crop yields
(b) The missing t-ratio is −4.309
(c) An approximate 95% confidence interval for the true slope is −0.046285 ± 2 ∗ 0.010741
(d) At the 5% level, we fail to reject the null hypothesis that the true slope is equal to −0.05
(e) None of the above

The answer is (e).
(a) yes, of course,
(b) −0.046285 / 0.010741 = −4.309,
(c) yes, of course,
(d) (−0.046285 − (−0.05)) / 0.010741 = 0.35, which is less than 2 in absolute value (fail to reject).

Long Problems

[1] (13 points) Suppose that the weekly sales (SALES) of a company i depend on advertising (AD) levels according to the following simple linear regression (SLR) model: SALESi = 10 + 5ADi + εi, where εi ~ N(0, 4). The variables and their units are:

SALES = amount of weekly sales (in thousands of dollars)
AD = weekly advertising expenditure (in thousands of dollars)

(a) (3 points) If no money is spent on advertising this week, what is the probability that SALES will be greater than 10 thousand dollars?

If ADi = 0, then SALESi = 10 + εi ~ N(10, 4). So, P(SALES > 10) = 0.5.

(b) (2 points) If the company spends one thousand dollars on advertising this week, what is the expected value of SALES?

If ADi = 1, then SALESi = 10 + 5 ∗ 1 + εi ~ N(15, 4). So, E(SALES) = 15.

(c) (2 points) If the company spends one thousand dollars on advertising this week, what is the standard deviation of SALES?

If ADi = 1, then SALESi = 10 + 5 ∗ 1 + εi ~ N(15, 4). So, Var(SALES) = 4 and Std(SALES) = 2.
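Parts (a) to (c) can be reproduced with the normal distribution in Python's standard library. This is a minimal sketch, not part of the exam; the helper name sales_dist is an assumption introduced for the example, and it simply encodes SALES given AD as N(10 + 5 ∗ AD, 4).

```python
from statistics import NormalDist

def sales_dist(ad):
    """Distribution of SALES given AD under SALES = 10 + 5*AD + eps, eps ~ N(0, 4)."""
    # NormalDist takes the standard deviation, so sigma = sqrt(4) = 2.
    return NormalDist(mu=10 + 5 * ad, sigma=2.0)

p_a = 1 - sales_dist(0).cdf(10)   # part (a): P(SALES > 10 | AD = 0)
mean_b = sales_dist(1).mean       # part (b): E(SALES | AD = 1)
sd_c = sales_dist(1).stdev        # part (c): Std(SALES | AD = 1)

print(p_a, mean_b, sd_c)
```

The same object answers any other probability question about SALES at a given advertising level through its cdf method.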
(d) (3 points) If the company spends one thousand dollars on advertising this week, what is the approximate probability that SALES will be between 13 thousand dollars and 17 thousand dollars?

If ADi = 1, then SALESi = 10 + 5 ∗ 1 + εi ~ N(15, 4).
So, P(13 < SALES < 17) = P((13 − 15)/2 < (SALES − 15)/2 < (17 − 15)/2) = P(−1 < Z < 1) ≈ 68%.

(e) (3 points) If the company spends one thousand dollars on advertising this week, what is the approximate probability that SALES will be greater than 19 thousand dollars?

If ADi = 1, then SALESi = 10 + 5 ∗ 1 + εi ~ N(15, 4).
So, P(SALES > 19) = P((SALES − 15)/2 > (19 − 15)/2) = P(Z > 2) ≈ 2.5%.

[2] (27 points) For 114 restaurants in NYC we have Zagat ratings on food as well as information on the price of a meal. We want to understand if there is a relationship between quality (as summarized by the Zagat ratings) and average price per meal. We run a simple linear regression of price on quality and obtain the following output:

             Estimate   Std error   t-ratio   P-value
Intercept   −18.154     6.553       −2.77     0.007
Slope         2.6253    0.3315       7.92     0.000

In addition, we know that s = 8.93 and R-sq = 35.9%.

(a) (2 points) Give an economic interpretation for the sign of the slope.

Quality has a positive impact on price. As the rating increases by one, the average price increases by about $2.63.

(b) (2 points) Find an approximate 95% confidence interval for the true slope.

2.6253 ± 2 ∗ 0.3315 = (1.96, 3.29)

(c) (2 points) Test the hypothesis that the slope is equal to zero at the 5% level. (You should be very precise here.)

P-value = 0.000 < 0.05 ⇒ reject.

(d) (3 points) Test the hypothesis that the slope is equal to 2 at the 5% level. (You should use our usual rule-of-thumb here.)

(2.6253 − 2) / 0.3315 = 1.89 < 2 ⇒ fail to reject.

(e) (3 points) You are planning to have dinner in Soho at the restaurant YumYum. You know that the Zagat rating is 24. How much do you expect to pay?
P̂ = −18.154 + 2.6253 ∗ 24 = $44.85

(f) (3 points) Assume the estimated values are the true model parameters and find a 95% predictive interval for the total price given a rating equal to 24.

$44.85 ± 2 ∗ s = $44.85 ± 2 ∗ (8.93) = ($26.99, $62.71)

(g) (3 points) Compare your result from (f) to the true predictive interval from Minitab which is ($26.831, $62.876). What do you notice? Why?

They are very similar. A simple “plug-in” interval (which implies acting as if the estimates were the true model parameters) is a good approximation to the correct predictive interval from Chapter 4 in the notes since we have a sufficiently large number of observations (notice what happens to the predictive interval from Chapter 4 in the notes when n gets very large).

(h) (2 points) Use the information in point (g) and the fact that the t cut-off value t112,0.025 = −1.98, to find the standard error of the predicted values (s_pred in the notes).

$62.876 = $44.85 + 1.98 ∗ s_pred ⇒ s_pred = ($62.876 − $44.85) / 1.98 = 9.10.

(i) (3 points) Use your result from part (h) and the value of s to find the standard error of the fitted values (s_fit in the notes).

s_pred² = s_fit² + s² ⇒ s_fit² = s_pred² − s² = 82.81 − 79.74 = 3.07. Hence, s_fit = √3.07 = 1.75.

(j) (2 points) Use your result from part (i) to find a confidence interval for the expected price given a rating of 24.

$44.85 ± 1.98 ∗ 1.75 = ($41.38, $48.32)

(k) (2 points) Do you think quality of food is sufficient to explain the price of a meal? Briefly explain your answer.

No, there is something else going on. Maybe decor, style and so on. The R-squared is quite low.
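The chain of calculations in parts (e) to (j) can be retraced in a few lines of Python. This is a rough sketch rather than the Minitab computation: it uses only the numbers quoted in the problem (the variable names are assumptions), keeps the unrounded slope 2.6253, and does not round intermediate results. Because s_fit² is a small difference between two much larger squares, its value is sensitive to rounding, so the output can differ from a hand calculation in the second decimal place.

```python
import math

# Quantities quoted in the problem statement.
b0, b1 = -18.154, 2.6253     # estimated intercept and slope
s = 8.93                     # estimated residual standard deviation
t_cut = 1.98                 # t cut-off with 112 degrees of freedom
rating = 24                  # Zagat rating of the restaurant
minitab_upper = 62.876       # upper end of the Minitab predictive interval

fitted = b0 + b1 * rating                              # part (e)
plug_in = (fitted - 2 * s, fitted + 2 * s)             # part (f)
s_pred = (minitab_upper - fitted) / t_cut              # part (h)
s_fit = math.sqrt(s_pred ** 2 - s ** 2)                # part (i)
ci_mean = (fitted - t_cut * s_fit, fitted + t_cut * s_fit)  # part (j)

print("fitted price      :", round(fitted, 2))
print("plug-in interval  :", tuple(round(v, 2) for v in plug_in))
print("s_pred, s_fit     :", round(s_pred, 2), round(s_fit, 2))
print("CI for mean price :", tuple(round(v, 2) for v in ci_mean))
```

The confidence interval for the expected price is much narrower than the predictive interval because it reflects only the estimation uncertainty in the fitted line, not the residual variation s.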