Economics 375: Introduction to Econometrics
Homework #4

This homework is due on February 28th.

1. One tool to aid in understanding econometrics is the Monte Carlo experiment. A Monte Carlo experiment allows a researcher to use the computer to set up a known population regression function (PRF), something we have assumed we can never observe, and then act like a normal econometrician: "forget" the PRF for the moment, estimate the regression by OLS, and see how closely the OLS estimate comes to the true and known PRF. Our experiment will demonstrate that OLS is unbiased (something that the Gauss-Markov Theorem should convince you of).

In Stata, this is easy. Start by opening Stata and creating a new variable titled x1. The easiest way to do this is to enter two commands in the command box:

    set obs 20
    gen x1 = _n in 1/20

The first command tells Stata that you are going to use a data set with 20 observations. The second sets the value of x1 equal to the index number (always starting at one and increasing by one for each observation).

To create a random normal variable, type the following in the command line:

    gen epsilon = rnormal()

This generates a new series of 20 observations titled "epsilon" where each observation is a random draw from a normal distribution with mean zero and variance one. Here gen is short for the generate command, and epsilon is the name of the new variable you are creating. The generate command is a commonly used command in Stata.
It might be worth reading the help menu on the generate command (type: help generate in the command line).

After creating epsilon, we are ready to create our dependent variable y. To do this, let's create a population regression where we know the true slope and intercept of the regression. Since my favorite football player was Dave Krieg of the Seattle Seahawks (#17) and my favorite baseball player was Ryne Sandberg (#23), we will use these numbers to generate our dependent variable. In Stata, use the gen command to create y where:

    yi = 17 + 23x1i + epsiloni

Your command will look something like:

    gen y = 17 + 23*x1 + epsilon

At this point, if you've done everything correctly, you should have 20 observations on x1, epsilon, and y.

Using your created data, use Stata's reg command to estimate the regression:

    yi = B0 + B1x1i

a. Why didn't you include epsilon in this regression?

Generally, econometricians do not observe the error term of any regression. If they did, they would not need to estimate the regression, since knowing the values of Y, X, and the error term would allow the econometrician to perfectly observe the PRF.

b. What are your estimates of the true slope coefficient and intercept? Perform a hypothesis test that B1 = 23. What do you find?

When I estimate my regression, I get:

. reg y x1

      Source |       SS           df       MS      Number of obs   =        20
-------------+----------------------------------   F(1, 18)        =         .
       Model |   352132.22         1   352132.22   Prob > F        =    0.0000
    Residual |   16.159413        18  .897745166   R-squared       =    1.0000
-------------+----------------------------------   Adj R-squared   =    1.0000
       Total |  352148.379        19  18534.1252   Root MSE        =    .94749

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   23.01135   .0367422   626.29   0.000     22.93416    23.08854
       _cons |   16.50865   .4401408    37.51   0.000     15.58395    17.43335
------------------------------------------------------------------------------

Note: this will differ from what you obtained because my epsilon will differ from yours.

    H0: B1 = 23
    HA: B1 ≠ 23
    t = (23.01135 - 23)/.0367422 = .309
    tc,18,95% = 2.101

I fail to reject the null and conclude that I do not have enough evidence to state that the slope differs from 23.

c. When you turn this homework in to me, I will ask the entire class to tell me their estimates of the true B0 and B1. I will then enter these estimates in a computer, order each from smallest to largest, and then make a histogram of each estimate. What will this histogram look like? Why?

I performed 10,000 different experiments exactly as described above. I found that the estimates of B0 and B1 are normally distributed around the true population values of 17 and 23. The Gauss-Markov theorem indicates that these distributions have the smallest variance among linear unbiased estimators, as long as our classical assumptions are correct. Are they in this case?

Interestingly, when I performed this Monte Carlo experiment, I found the standard deviation of my 10,000 estimates of B0 to equal .4612 and of B1 to equal .0386. For a moment, consider the variances of the slope and intercept we discovered in class, where $x_i = X_i - \bar{X}$:

$$\mathrm{Var}(\hat{B}_0) = \frac{\sigma^2 \sum X_i^2}{n \sum x_i^2}, \qquad \mathrm{Var}(\hat{B}_1) = \frac{\sigma^2}{\sum x_i^2}$$

In our data, $\sum x_i^2 = 665$. Since the variance of the regression error is equal to 1 (by virtue of setting up epsilon), $\mathrm{Var}(\hat{B}_1) = 1/665 = .0015$, and taking the square root gives a standard error of B1-hat of .0387, very close to the Monte Carlo estimate of .0386.

d. Use Stata to compute $\hat{\sigma}^2 = \sum e_i^2/(n-2)$. The square root of this is termed the "standard error of the regression." Does it equal what you would expect? Why or why not?

In my case, $\hat{\sigma}^2 = .897$. That is, my best guess at the variance of the (usually unknown) error term is .897. Since we know that the variance is actually equal to one, and that there is sampling error when Stata draws only 20 observations from a distribution with variance equal to one, my estimate turns out well.

2. On the class webpage, I have posted a Stata file entitled "2002 Freshmen Data". This data set contains all complete observations of the 2002 entering class of WWU freshmen (graduating class of around 2006).
The data definitions are:

aa: a variable equal to one if the incoming student previously earned an AA
actcomp: the student's comprehensive ACT score
acteng: the student's English ACT score
actmath: the student's mathematics ACT score
ai: the admissions index assigned by WWU office of admissions
asian, black, white, Hispanic, other, native: a variable equal to one if the student is that ethnicity
f03 and f04: a variable equal to one if the student was enrolled in the fall of 2003 or the fall of 2004
gpa: the student's GPA earned at WWU in fall 2002
summerstart: a variable equal to one if the student attended summerstart prior to enrolling in WWU
fig: a variable equal to one if the student enrolled in a FIG course
firstgen: a variable equal to one if the student is a first generation college student
housing: a variable equal to one if the student lived on campus their first year at WWU
hrstrans: the number of credits transferred to WWU at time of admission
hsgpa: the student's high school GPA
male: a variable equal to one if the student is male
resident: a variable equal to one if the student is a Washington resident
runstart: a variable equal to one if the student is a running start student
satmath: the student's mathematics SAT score
satverb: the student's verbal SAT score

Some of these variables (the 0/1 or "dummy" variables) will be discussed in the future.

Admissions officers are usually interested in the relation between high school performance and college performance. Consider the population regression function:

    gpai = β0 + β1hsgpai + εi

a. Use the "2002 Freshmen Data" to estimate this regression. How do you interpret your estimate of β1? Why does this differ from what you found in homework #3?

I find:

. reg gpa hsgpa

      Source |       SS           df       MS      Number of obs   =      2081
-------------+----------------------------------   F(1, 2079)      =    472.50
       Model |  195.606722         1  195.606722   Prob > F        =    0.0000
    Residual |  860.667118      2079  .413981298   R-squared       =    0.1852
-------------+----------------------------------   Adj R-squared   =    0.1848
       Total |  1056.27384      2080  .507823962   Root MSE        =    .64341

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hsgpa |   1.001795   .0460869    21.74   0.000     .9114138    1.092176
       _cons |  -.7431143   .1628574    -4.56   0.000    -1.062495   -.4237337
------------------------------------------------------------------------------

A one-unit increase in high school GPA is associated with an increase in college GPA of about 1.002 units.

    H0: β1 = 0
    HA: β1 ≠ 0
    t = (1.001795 - 0)/.0460869 = 21.74
    tc,95%,2079 = 1.96

Reject H0 and conclude that high school GPA does impact college GPA.

b. When I was in high school, my teachers told me to expect, on average, to earn one grade lower in college than what I averaged in high school. Based on the results of your regression, would you agree with my teachers?

If my teachers were correct, then the population regression function would be Fall02GPAi = -1 + 1×HSGPAi + εi. Note that only under this population regression function would students with any hsgpa earn, on average, a college GPA exactly one unit lower.

At first glance, one might look at our regression estimates and quickly conclude that the intercept is not equal to -1, so my teachers were incorrect. However, our estimated intercept of -.74 is an estimate; how likely an estimate of -.74 is when the true intercept is -1 is a question that can only be answered using a hypothesis test:

    H0: β0 = -1
    HA: β0 ≠ -1
    t = (-.743 - (-1))/.163 = 1.58
    tc,95%,2079 = 1.96

I would fail to reject this hypothesis and conclude that my intercept is statistically no different than -1, which is what I would need for my college GPA to be one unit less than my high school GPA. However, this too would be an incorrect approach, as it tests only one of the two needed requirements (note, the slope must equal one AND the intercept must equal -1). What I really need to test is:

    H0: β0 = -1 & β1 = 1
    HA: (β0 ≠ -1 & β1 ≠ 1) or (β0 = -1 & β1 ≠ 1) or (β0 ≠ -1 & β1 = 1)

In this case, the alternative hypothesis simply states all of the options not included in the null hypothesis.
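A joint null like this one is usually tested with an F-statistic that compares the residual sum of squares from a regression with the null imposed (restricted) against the one from the regression without it (unrestricted). As a reference sketch, with notation that is an assumption here rather than taken from the homework ($RSS_R$ and $RSS_U$ for the restricted and unrestricted residual sums of squares, $q$ for the number of restrictions, $n$ for the number of observations, and $k$ for the number of slope coefficients in the unrestricted model):

```latex
F = \frac{(RSS_R - RSS_U)/q}{RSS_U/(n - k - 1)}
\;\sim\; F_{q,\; n-k-1}
\quad \text{under } H_0 \text{ and the classical assumptions}
```

For this question $q = 2$ (one restriction on the intercept and one on the slope), $n = 2081$ and $k = 1$, so the denominator has 2079 degrees of freedom.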
To test this, let us impose the null hypothesis: Fall02GPAi = -1 + 1×HSGPAi + εi. This statement is something that I cannot estimate; after all, there are no coefficients in it. However, if the null hypothesis is true, then it must be that Fall02GPAi + 1 - HSGPAi = εi. If I square the left-hand side for each observation and add the squares up, I will have a restricted residual sum of squares. I obtain this in Stata by:

. gen restresid = gpa + 1 - hsgpa
(33 missing values generated)

. gen restresid2 = restresid^2
(33 missing values generated)

. total restresid2

Total estimation                  Number of obs   =       2081

--------------------------------------------------------------
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
  restresid2 |   1004.833   30.02165      945.9573    1063.708
--------------------------------------------------------------

I can produce an F-test using this information, taking the unrestricted residual sum of squares (860.667) and its degrees of freedom (2079) from the regression in part a:

$$F = \frac{(1004.833 - 860.667)/2}{860.667/2079} = 174.12$$

    Fc,2,2079 ≈ 3.00

In this case, I reject the null hypothesis and conclude my high school teachers were wrong.

c. Now, consider the multivariate regression:

    GPA = β0 + β1hsgpa + β2SatVerb + β3SatMath + β4Runningstart + β5Fig + β6FirstGen

d. What is your estimate of β1? How do you interpret your estimate of β1? Why does this differ from what you found in homework #3?

I find:

. reg gpa hsgpa satverb satmath runningstart fig firstgen

      Source |       SS           df       MS      Number of obs   =      2081
-------------+----------------------------------   F(6, 2074)      =    130.06
       Model |  288.775092         6   48.129182   Prob > F        =    0.0000
    Residual |  767.498748      2074  .370057256   R-squared       =    0.2734
-------------+----------------------------------   Adj R-squared   =    0.2713
       Total |  1056.27384      2080  .507823962   Root MSE        =    .60832

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hsgpa |   .8877086    .044665    19.87   0.000     .8001157    .9753015
     satverb |   .0021184   .0001952    10.86   0.000     .0017357    .0025012
     satmath |   .0006516    .000203     3.21   0.001     .0002536    .0010496
runningstart |  -.0800822    .036424    -2.20   0.028    -.1515137   -.0086507
         fig |   .2122322   .0386718     5.49   0.000     .1363926    .2880718
    firstgen |  -.0498571   .0286369    -1.74   0.082    -.1060172     .006303
       _cons |  -1.867454   .1747998   -10.68   0.000    -2.210255   -1.524652
------------------------------------------------------------------------------
In this case, the coefficient on hsgpa tells me that an increase in high school GPA of one unit increases college GPA by .888 units, holding SAT scores, running start, FIG participation and first generation status constant. This last phrase (holding…) is important because it recognizes that the impact of high school GPA on college GPA has been purged of the impact of these other variables. It is for this reason that it differs from the estimate found in part a.

e. Test to see if the variables hsgpa, SatVerb, SatMath, Runningstart, Fig and FirstGen predict a significant amount of the variation in WWU first quarter GPA.

    H0: β1 = β2 = β3 = β4 = β5 = β6 = 0 ↔ R2 = 0
    HA: R2 ≠ 0

This requires an F-test where the restricted model forces the null hypothesis to be true; that is, it forces all slope coefficients to be equal to zero. I can do that in Stata by:

. reg gpa

      Source |       SS           df       MS      Number of obs   =      2081
-------------+----------------------------------   F(0, 2080)      =      0.00
       Model |           0         0           .   Prob > F        =         .
    Residual |  1056.27384      2080  .507823962   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |  1056.27384      2080  .507823962   Root MSE        =    .71262

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   2.783631   .0156214   178.19   0.000     2.752996    2.814266
------------------------------------------------------------------------------

My F-statistic is

$$F = \frac{(1056.274 - 767.499)/6}{767.499/2074} = 130.06$$

    Fc,95%,6,2074 ≈ 2.10

I reject the null and conclude that hsgpa, satverb, satmath, runningstart, fig, and firstgen statistically explain some of the variation in college GPA. Said another way, I conclude R2 is not zero. Note that the F-statistic I find is exactly the same one Stata reports in the second line of the right column above the full regression results.

f. Do SatVerb and SatMath predict WWU first quarter GPA? Test this!

This is equivalent to testing

    H0: β2 = β3 = 0
    HA: (β2 = 0 & β3 ≠ 0) or (β2 ≠ 0 & β3 = 0) or (β2 ≠ 0 & β3 ≠ 0)

My restricted regression is:

. reg gpa hsgpa runningstart fig firstgen

      Source |       SS           df       MS      Number of obs   =      2081
-------------+----------------------------------   F(4, 2076)      =    130.29
       Model |   211.95788         4    52.98947   Prob > F        =    0.0000
    Residual |   844.31596      2076  .406703256   R-squared       =    0.2007
-------------+----------------------------------   Adj R-squared   =    0.1991
       Total |  1056.27384      2080  .507823962   Root MSE        =    .63773

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hsgpa |   1.014023   .0457579    22.16   0.000     .9242872    1.103759
runningstart |  -.0180178   .0379149    -0.48   0.635     -.092373    .0563375
         fig |   .1766598   .0403859     4.37   0.000     .0974587    .2558609
    firstgen |  -.1303211   .0294277    -4.43   0.000     -.188032   -.0726103
       _cons |  -.7631823   .1623353    -4.70   0.000    -1.081539   -.4448252
------------------------------------------------------------------------------

My F-statistic is

$$F = \frac{(844.316 - 767.499)/2}{767.499/2074} = 103.79$$

    Fc,95%,2,2074 ≈ 3.00

I reject the null hypothesis and conclude that satverb and satmath do predict fall quarter college GPA.

g. Offer two reasons why the coefficient on runningstart is negative. Is this coefficient statistically different than zero? Is it "economically" important?

From the full regression, the runningstart coefficient has a t-statistic of -2.20 and a p-value of 0.028, so I reject the null hypothesis that the coefficient is zero at the 5% level, but not at the 1% level. I conclude that, holding the other variables constant, students in the running start program do worse as freshmen than those entering as traditional students. Why might this be? There are a large number of reasons. Here I offer a few:

1. It might be that running start students enroll in more difficult courses their fall quarter (perhaps thinking that they are prepared for them given their prior history);
2. Running start students may be worse students than traditional students in some ways unmeasured by the included independent variables.