Introductory Econometrics Problem set 2 – solutions
Jan Zouhar, Department of Econometrics, University of Economics, Prague, [email protected]

Problem 2.1. Use the data in wage2.gdt for both this problem and the remaining problems below.

a) Estimate the equation

    log(wage) = β0 + β1 exper + β2 exper² + β3 educ + β4 female + β5 nonwhite + u,    (1)

and report the results using the usual format.

After loading the dataset and creating the variables l_wage = log(wage) and sq_exper = exper², we run the OLS regression to obtain

    Model 1: OLS, using observations 1-526
    Dependent variable: l_wage

                 coefficient    std. error   t-ratio   p-value
      ----------------------------------------------------------
      const      0.395402      0.103219       3.831    0.0001    ***
      exper      0.0389529     0.00482908     8.066    5.03e-015 ***
      sq_exper  -0.000687172   0.000107516   -6.391    3.66e-010 ***
      educ       0.0839129     0.00699065    12.00     1.80e-029 ***
      female    -0.337424      0.0363579     -9.281    4.53e-019 ***
      nonwhite  -0.0213127     0.0596998     -0.3570   0.7212

    Mean dependent var   1.623268   S.D. dependent var   0.531538
    Sum squared resid   89.03680    S.E. of regression   0.413793
    R-squared            0.399737   Adjusted R-squared   0.393966
    F(5, 520)           69.25751    P-value(F)           1.89e-55
    Log-likelihood     -279.2075    Akaike criterion   570.4150
    Schwarz criterion   596.0069    Hannan-Quinn       580.4354

The usual format of reporting the results is the equation form:

    log(wage)^ =  0.395 + 0.0390 exper − 0.000687 exper² + 0.0839 educ − 0.337 female − 0.0213 nonwhite,
                 (0.103) (0.00483)      (0.000108)        (0.00699)     (0.0364)       (0.0597)

    n = 526,  R² = 0.400.

b) Based on your results from part a, find the 99% confidence interval for β5. Is the (partial) effect of race statistically significant at the 1% level in your equation?

The easiest way to obtain the 99% CI is to use Gretl's built-in routines: Analysis → Confidence intervals for coefficients → α =
0.99; the result is

    99% CI for β5 = [−0.176, 0.133].

Alternatively, we can calculate the endpoints of the interval manually using the formula

    coefficient ± c · (standard error) = −0.02131 ± 2.585 · (0.05970),

where c = 2.585 is the 99.5th percentile of the t distribution with 526 − 5 − 1 = 520 degrees of freedom; the value can be found e.g. in Gretl's Tools → Statistical tables. The 99% CI contains zero, meaning that we cannot reject the null that β5 = 0 against the two-sided alternative at the 1% level. (We could also compare the p-value for nonwhite from the regression output, 0.721, with our significance level.) We conclude that the effect of race is not significant in our equation.

c) Use White's test and the Breusch–Pagan test (Tests → Heteroskedasticity → White's test / Breusch–Pagan) to show whether Assumption MLR.5 holds. What do you conclude? (Report the value of the test statistics and the resulting p-value along with your conclusions.) What does the test tell you about the results you obtained from the regression?

After running the tests, Gretl appends the following text to the original regression output:

    White's test for heteroskedasticity
    Null hypothesis: heteroskedasticity not present
    Test statistic: LM = 24.7849
    with p-value = P(Chi-square(17) > 24.7849) = 0.0996277

    Breusch-Pagan test for heteroskedasticity
    Null hypothesis: heteroskedasticity not present
    Test statistic: LM = 13.663
    with p-value = P(Chi-square(5) > 13.663) = 0.0178977

From the p-values we can immediately see that White's test fails to reject the null of homoskedasticity, while the Breusch–Pagan test does reject it at the conventional 5% level. As I mentioned in the lectures, the power of these tests is rather limited, and if either of them rejects, the conservative approach is to proceed as if heteroskedasticity were present in the model.
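The manual CI calculation in part b is a one-liner; the following Python sketch (not part of the original Gretl workflow) reproduces it, with the critical value c = 2.585 taken from Gretl's statistical tables as quoted above:

```python
# Manual 99% confidence interval for beta5 (coefficient on nonwhite),
# using the Model 1 estimates quoted in the text.
beta5_hat = -0.0213127   # OLS estimate
se5 = 0.0596998          # its standard error
c = 2.585                # 99.5th percentile of t(520), from Gretl's tables

lower = beta5_hat - c * se5
upper = beta5_hat + c * se5
contains_zero = lower < 0 < upper   # True -> cannot reject beta5 = 0 at 1%
print(f"99% CI for beta5: [{lower:.3f}, {upper:.3f}]")
```

Since the interval straddles zero, `contains_zero` comes out True, matching the conclusion above.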
Recall that under heteroskedasticity the MLR.5 assumption is violated, and our results related to standard errors, confidence intervals and hypothesis tests do not hold (i.e., the last three columns in Gretl's regression output are unusable). On the other hand, the OLS estimates of β0, …, β5 are still consistent and typically fairly efficient, so the coefficient column can be retained.

d) Using the approximation

    %Δwage ≈ 100 (β1 + 2 β2 exper) Δexper,

find the approximate return to the fifth year of experience. What is the approximate return to the twentieth year of experience?

For the fifth year of experience we have exper = 5 and Δexper = 1; plugging these values and our estimates into the above formula gives

    %Δwage ≈ 100 [0.0390 + 2(−0.000687) · 5] · 1 = 3.213,

i.e. the wage is expected to change by approximately 3.213 per cent as a result of increasing working experience from 4 to 5 years. Analogously, for the twentieth year of experience we have

    %Δwage ≈ 100 [0.0390 + 2(−0.000687) · 20] · 1 = 1.152,

showing that the increase in wage diminishes with additional experience.

e) At what value of exper does additional experience actually begin to lower predicted log(wage)? (Or, what is the turning point in the effect of experience?) How many people have more experience in this sample? (Hint: Sorting the data using Data → Sort data → exper might help you out with the last question.)

From the lectures, we know that the turning point can easily be obtained from the first-order conditions as

    turning point = − (coefficient on the linear term) / (2 · coefficient on the quadratic term)
                  = − 0.0390 / [2(−0.000687)] ≈ 28.4 years of experience.

It turns out that 121 people in the sample have at least 29 years of experience (exper is recorded as an integer in our data).

Problem 2.2. Based on (1), you want to predict the salary of a white male person with 5 years of work experience and 18 years of education.
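The arithmetic in parts d and e can be replicated with a short Python sketch, using the rounded coefficients from the text so the figures match the ones above:

```python
# Approximate percentage return to an extra year of experience,
# %dwage ~ 100*(b1 + 2*b2*exper)*d_exper, and the turning point
# of the quadratic, using the rounded Model 1 estimates.
b1 = 0.0390       # coefficient on exper (rounded)
b2 = -0.000687    # coefficient on sq_exper (rounded)

def pct_return(exper, d_exper=1):
    """Approximate % change in wage for a d_exper-year increase."""
    return 100 * (b1 + 2 * b2 * exper) * d_exper

fifth = pct_return(5)            # return to the fifth year
twentieth = pct_return(20)       # return to the twentieth year
turning_point = -b1 / (2 * b2)   # exper where the effect switches sign
```

Evaluating `pct_return` at 5 and 20 and the turning-point formula gives the 3.213, 1.152 and 28.4 figures quoted above.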
This prediction is made difficult by the presence of logarithms; read Wooldridge's section 'Predicting y when log(y) is the dependent variable'.

a) Find the prediction, assuming that u is normally distributed (conditional on all independent variables), i.e. that assumptions MLR.1 through MLR.6 hold.

First of all, it is convenient to re-estimate the equation with slightly modified variables: we replace exper, sq_exper and educ with

    exper_5 = exper − 5,
    sq_exper_5 = (exper − 5)²,
    educ_18 = educ − 18.

(Use Add → Define new variable… to create these variables in Gretl.) In the new equation,

    log(wage) = δ0 + γ1 exper_5 + β2 sq_exper_5 + β3 educ_18 + β4 female + β5 nonwhite + u,    (2)

the coefficients β2, …, β5 are the same as in (1); the intercept has changed (hence the different notation, δ0), and so has the coefficient on the linear experience term (γ1 = β1 + 10 β2, the marginal effect of experience at exper = 5), because the re-centred square absorbs part of the original linear effect. We can easily verify this by running OLS for (2) in Gretl – compare the results for (2) below with those for (1) above:

    Model 2: OLS, using observations 1-526
    Dependent variable: l_wage

                   coefficient    std. error   t-ratio   p-value
      ------------------------------------------------------------
      const        2.08342       0.0451186    46.18     4.08e-186 ***
      exper_5      0.0320812     0.00381271    8.414    3.84e-016 ***
      sq_exper_5  -0.000687172   0.000107516  -6.391    3.66e-010 ***
      educ_18      0.0839129     0.00699065   12.00     1.80e-029 ***
      female      -0.337424      0.0363579    -9.281    4.53e-019 ***
      nonwhite    -0.0213127     0.0596998    -0.3570   0.7212

    Mean dependent var   1.623268   S.D. dependent var   0.531538
    Sum squared resid   89.03680    S.E. of regression   0.413793
    R-squared            0.399737   Adjusted R-squared   0.393966
    F(5, 520)           69.25751    P-value(F)           1.89e-55
    Log-likelihood     -279.2075    Akaike criterion   570.4150
    Schwarz criterion   596.0069    Hannan-Quinn       580.4354

The reason why we transformed the variables is that in the new equation, the intercept tells us something about the wage of a white male person with 5 years of work experience and 18 years of education.
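The effect of re-centring can be illustrated on simulated data (a sketch with made-up numbers, not the wage2.gdt data): the slopes on the quadratic and on education carry over unchanged, while the slope on the re-centred linear term becomes β1 + 10β2.

```python
import numpy as np

# Simulated illustration of the re-centring in (2): regress the same y
# on the original and on the shifted regressors and compare slopes.
rng = np.random.default_rng(0)
n = 500
exper = rng.uniform(0, 40, n)
educ = rng.uniform(8, 20, n)
y = (0.4 + 0.04 * exper - 0.0007 * exper**2 + 0.08 * educ
     + rng.normal(0, 0.3, n))

X_orig = np.column_stack([np.ones(n), exper, exper**2, educ])
X_cent = np.column_stack([np.ones(n), exper - 5, (exper - 5)**2,
                          educ - 18])
b_orig, *_ = np.linalg.lstsq(X_orig, y, rcond=None)
b_cent, *_ = np.linalg.lstsq(X_cent, y, rcond=None)

# Quadratic and educ slopes unchanged; linear slope shifts by 10*b2.
same_quad = abs(b_orig[2] - b_cent[2]) < 1e-6
same_educ = abs(b_orig[3] - b_cent[3]) < 1e-6
shifted_lin = abs(b_cent[1] - (b_orig[1] + 10 * b_orig[2])) < 1e-6
```

Both design matrices span the same column space, so the two fits are exact reparameterisations of each other.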
For this person, (2) reduces to log(wage) = δ0 + u, or wage = e^(δ0 + u). For our prediction, we will use the expected wage of the person in question, which is

    E[wage] = E[e^(δ0 + u)] = e^(δ0) · E[e^u] = A · B.    (3)

The first term, A, can be consistently estimated by exponentiating the OLS intercept; in our case

    Â = e^2.0834 = 8.032.

If u ~ Normal(0, σ²), it can be shown that B = E[e^u] = e^(σ²/2); see the lectures. An estimate of σ is provided in the regression output under the name S.E. of regression. Therefore, we can estimate B as

    B̂ = e^(0.4138²/2) = 1.0894.

Altogether, our estimate of the person's wage is

    wage^ = Â · B̂ = 8.032 · 1.0894 = $8.75 per hour.

b) Save the residuals from (1) to a new variable uhat, and test for normality (Variable → Normality test); the null is that uhat is normally distributed. What do you conclude?

Gretl's output after running the tests is:

    Test for normality of uhat:
    Doornik-Hansen test = 10.6516, with p-value 0.00486434
    Shapiro-Wilk W = 0.991748, with p-value 0.00508462
    Lilliefors test = 0.0367591, with p-value ~= 0.08
    Jarque-Bera test = 10.5413, with p-value 0.00514034

The null hypothesis is that u is normally distributed. This null is rejected by the Doornik–Hansen, Shapiro–Wilk and Jarque–Bera tests at the 5% (and even the 1%) level, so we have to acknowledge that the assumption of normality of u was not justified in the previous task.

c) Find the prediction once again, this time using Duan's (1983) smearing estimate, described in the same section of Wooldridge's book. (Hint: you will need to create a new variable, calculated as exp(uhat), and find its mean, e.g. by displaying summary statistics.)

Once again, we will base our prediction on (3), but this time we will estimate B as the average of e^(ûᵢ) in the sample, i.e.

    B̂ = (1/n) Σᵢ e^(ûᵢ).

In Gretl, this can be done as follows. We have already saved the residuals in the uhat variable.
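The normality-based prediction in part a is just two exponentials; a quick Python check of the arithmetic, with the values taken from Model 2's output:

```python
import math

# Prediction of the wage under MLR.6: wage^ = A^ * B^, with
# A^ = exp(intercept) and B^ = exp(sigma^2 / 2).
delta0_hat = 2.08342   # intercept of Model 2
sigma_hat = 0.413793   # S.E. of regression

A_hat = math.exp(delta0_hat)        # estimate of A = exp(delta0)
B_hat = math.exp(sigma_hat**2 / 2)  # estimate of B = E[exp(u)]
wage_hat = A_hat * B_hat            # predicted hourly wage in dollars
```

Running this reproduces the 8.032, 1.0894 and $8.75 figures quoted above.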
Next, we create the exponentiated residuals and store them in a new variable, say expuhat: Add → Define new variable… → expuhat = exp(uhat). Now we can obtain B̂ as the mean of expuhat, e.g. via View → Summary statistics. This gives us B̂ = 1.0891, which is nearly the same as before, and does not change the predicted wage from 2.2a within the first 3 significant figures:

    wage^ = Â · B̂ = 8.032 · 1.0891 = $8.75 per hour.

Problem 2.3.

a) Estimate a modified version of (1) with the level, rather than the log, of wage as the dependent variable:

    wage = β0 + β1 exper + β2 exper² + β3 educ + β4 female + β5 nonwhite + u.    (4)

Gretl's output from the OLS regression is given below; note that the equation form of reporting your estimates is preferred.

    Model 3: OLS, using observations 1-526
    Dependent variable: wage

                 coefficient   std. error   t-ratio   p-value
      ---------------------------------------------------------
      const     -2.28278      0.746117      -3.060    0.0023    ***
      exper      0.255446     0.0349070      7.318    9.61e-013 ***
      sq_exper  -0.00444815   0.000777181   -5.723    1.77e-08  ***
      educ       0.554632     0.0505318     10.98     2.37e-025 ***
      female    -2.11579      0.262813      -8.051    5.64e-015 ***
      nonwhite  -0.157833     0.431539      -0.3657   0.7147

    Mean dependent var   5.896103   S.D. dependent var   3.693086
    Sum squared resid  4652.262     S.E. of regression   2.991096
    R-squared            0.350280   Adjusted R-squared   0.344033
    F(5, 520)           56.06904    P-value(F)           1.36e-46
    Log-likelihood    -1319.651     Akaike criterion  2651.302
    Schwarz criterion  2676.894     Hannan-Quinn      2661.322

b) Save the residuals (û) from (4) and find the sample correlation coefficients between û and all the explanatory variables (i.e., 5 correlation coefficients). Explain the results.

After saving the residuals, the correlations can be obtained through View → Correlation matrix. All the correlations are nearly zero. Actually, the fact that they are not exactly zero is only attributable to rounding errors, inherent in all computer calculations.
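Duan's smearing factor is simply the sample mean of the exponentiated residuals; here is a minimal Python sketch (the residual vector below is a made-up stand-in for Gretl's uhat, used only to exercise the formula):

```python
import math

def smearing_factor(residuals):
    """Duan's smearing estimate: B^ = (1/n) * sum(exp(uhat_i))."""
    return sum(math.exp(u) for u in residuals) / len(residuals)

# Hypothetical residuals; in Gretl these come from Save -> Residuals.
uhat = [-0.4, -0.1, 0.0, 0.2, 0.3]
B_hat = smearing_factor(uhat)
wage_hat = math.exp(2.08342) * B_hat   # A^ from Model 2 times B^
```

With the actual 526 residuals, this mean comes out as the 1.0891 reported above.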
We know that OLS residuals are uncorrelated with all explanatory variables by construction; see the slide entitled 'Three facts about the fitted values and residuals' in my 'Multiple regression' presentation. In particular, the fact that the residuals (û) are uncorrelated with the explanatory variables tells us nothing about our assumption MLR.4, which rules out correlation between the explanatory variables and the random error (u)!

c) Save the fitted values wage^ from (4) and find the sample correlation coefficient between wage^ and wage. Is there any relationship between this correlation coefficient and the R² from the regression model? (Hint: See Wooldridge; look for the origin of the term 'R-squared'.)

With all that we have learnt to do in Gretl so far, this one should be a breeze. The relationship with R-squared is as follows:

    R² = 0.35 = 0.592² = [corr(wage, wage^)]².

d) Based on (1), calculate the predicted wage for all people in the sample (wage2^), using Duan's estimate as in Problem 2.2. Find the squared correlation between wage and wage2^, and use the result to compare the goodness of fit of (1) and (4). (See Wooldridge, same section as in Problem 2.2, for a comparison of goodness of fit for models that combine dependent variables in the level and the log form.)

In order to obtain wage2^ as requested, the following steps need to be carried out:

(i) Open the window with the OLS regression output for either (1) or (2), and save the fitted values l_wage_hat = log(wage)^ in a new variable using Save → Fitted values.

(ii) Create the exponentiated fitted values: Add → Define new variable… → A = exp(l_wage_hat). This gives us an estimate of A from (3) for each individual in the sample.

(iii) Create a new variable wage2hat = wage2^ based on (3) using Add → Define new variable… → wage2hat = A * 1.0891, where 1.0891 was our estimate of B obtained using Duan's method in 2.2c.
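The identity behind part c, R² = [corr(y, ŷ)]², can be verified numerically on simulated data (a sketch, not the wage data; it holds for any OLS regression that includes an intercept):

```python
import numpy as np

# OLS on simulated data: R-squared equals the squared correlation
# between actual and fitted values (for a regression with intercept).
rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

ss_res = ((y - yhat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2_definition = 1 - ss_res / ss_tot
r2_correlation = np.corrcoef(y, yhat)[0, 1] ** 2
```

Both quantities agree up to floating-point noise, which is exactly what the 0.35 = 0.592² relationship above exploits.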
Finally, we can find the correlation between wage and wage2^; the result is corr(wage, wage2^) = 0.625. This is higher than the correlation of 0.592 between the fitted and actual values in 2.3c, where wage, rather than log(wage), was the dependent variable, implying that the model with the logarithmic wage has a better fit.

As an aside, note that if our only aim is to obtain the correlation between fitted and actual values, we needn't do step (iii) above, as multiplication by a positive constant does not affect correlations. In other words, in terms of the variable names created in Gretl in the above procedure, corr(wage, A) = corr(wage, wage2hat).
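The aside can be checked in a couple of lines (simulated stand-ins for wage and the variable A, since the actual series live in Gretl):

```python
import numpy as np

# Rescaling a variable by a positive constant leaves the sample
# correlation unchanged, so corr(wage, A) equals
# corr(wage, 1.0891 * A), i.e. step (iii) is optional here.
rng = np.random.default_rng(1)
wage = rng.exponential(6.0, size=100)
A = 0.8 * wage + rng.normal(0.0, 1.0, size=100)   # hypothetical fits

r_unscaled = np.corrcoef(wage, A)[0, 1]
r_scaled = np.corrcoef(wage, 1.0891 * A)[0, 1]
```

The two correlations coincide, since both the covariance and the standard deviation scale by the same positive factor.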