EC 252 Introduction to Econometric Methods
Multiple Regression Analysis: Estimation
Abhimanyu Gupta
February 9, 12, 2015
Outline

- More on interpretation
- The Expected Value of the OLS Estimators
  - Assumptions
  - Result: OLS is unbiased
  - Potential bias from misspecification
- The Variance of the OLS Estimators
  - Homoscedasticity assumption
  - Result
  - Interpretation
  - Variance in misspecified models
  - Estimating σ²
- Efficiency of OLS: The Gauss-Markov Theorem

Reading: Wooldridge (2009), Introductory Econometrics, Chapter 3.
Units of Measurement

What is the effect of changing the units of measurement of x or y on our results?

- If the dependent variable y is multiplied by some constant c, then the OLS intercept and slope estimates are also multiplied by c.
- If the independent variable x is multiplied by some nonzero constant c, then the OLS slope coefficient is divided by c (the intercept is not affected).
- What happens to R² when the unit of measurement of either the independent or the dependent variable changes?
Functional form: the log-transformation

- Often we consider a non-linear transformation of the dependent variable.
- Leading case: the log-transformation. Why is this useful?

Example (Wage effects of education)

log(wage) = β0 + β1 schooling + u

What is the effect of a one-unit change in schooling on the dependent variable?

∂ log(wage) / ∂ schooling = β1,  that is,  (1/wage) · ∂wage / ∂ schooling = β1

A one-year increase in schooling therefore induces an approximate percentage change (%Δ) in wages of 100β1. We write this as follows:

%Δwage ≈ (100β1) Δschooling
Example (Wage effects of education)

Re-estimating the wage equation with the dependent variable in logs, we obtain:

log(wage)ˆ = 0.58 + 0.08 schooling

with R² = 0.19.

- An additional year of schooling increases wages by about 8 percent.
- This model is called a log-level model (dependent variable in logs, independent variable in levels).
- Before, we assumed that each unit change of x has the same constant effect on y.
- Now we assume that each unit change of x has the same percentage effect on y.
- β1 is known in this case as a semi-elasticity.
- Note that this interpretation is approximate – it only works well for small changes in x.
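To make the approximation concrete, here is a minimal sketch (not part of the original slides) comparing the approximate semi-elasticity interpretation 100β1 with the exact percentage change implied by the fitted log-level equation above. The coefficients 0.58 and 0.08 come from the estimated equation; the baseline of 12 years of schooling is an arbitrary illustrative choice.

```python
import numpy as np

# Estimated log-level model from the slide: log(wage)^ = 0.58 + 0.08 * schooling
b0, b1 = 0.58, 0.08

def predicted_wage(schooling):
    """Predicted wage implied by the fitted log-level equation."""
    return np.exp(b0 + b1 * schooling)

s = 12                       # baseline years of schooling (illustrative value)
w0 = predicted_wage(s)       # predicted wage at s years
w1 = predicted_wage(s + 1)   # predicted wage at s + 1 years

approx_pct = 100 * b1                  # approximate %-change: 100 * beta1
exact_pct = 100 * (w1 - w0) / w0       # exact %-change: 100 * (exp(beta1) - 1)

print(f"approximate %-change in wage: {approx_pct:.2f}%")
print(f"exact %-change in wage:       {exact_pct:.2f}%")   # about 8.33%
```

The gap between 8% and 8.33% is exactly the "only works well for small changes in x" caveat above.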
The constant-elasticity model

Now consider the case where both x and y are transformed into logs:

log(y) = β0 + β1 log(x) + u

Here we have

β1 = ∂ log(y) / ∂ log(x) = (x/y) · ∂y/∂x

- This is known as the constant-elasticity model.
- β1 measures the percentage change of y in response to a one percent change in x.
Summary: different functional forms

Log-transformations of our variables affect the interpretation of the coefficient. The following table summarizes the different cases:

Model        Dependent variable   Independent variable   Interpretation of β1
level-level  y                    x                      Δy = β1 Δx
level-log    y                    log(x)                 Δy = (β1/100) %Δx
log-level    log(y)               x                      %Δy = (100β1) Δx  (semi-elasticity)
log-log      log(y)               log(x)                 %Δy = β1 %Δx  (elasticity)
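As an illustration of the log-log row, here is a small simulated sketch (not from the slides): data are generated with a known elasticity of 0.8, and the OLS slope from regressing log(y) on log(x) recovers it. All numbers are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data with a constant elasticity of 0.8: log(y) = 1.0 + 0.8*log(x) + u
n = 1000
x = rng.uniform(1.0, 10.0, size=n)
u = rng.normal(0.0, 0.1, size=n)
log_y = 1.0 + 0.8 * np.log(x) + u

# OLS of log(y) on a constant and log(x)
X = np.column_stack([np.ones(n), np.log(x)])
beta_hat, *_ = np.linalg.lstsq(X, log_y, rcond=None)

# beta_hat[1] estimates the elasticity: a 1% change in x changes y by about beta_hat[1] percent
print(f"estimated elasticity: {beta_hat[1]:.3f}")  # close to 0.8
```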
The Expected Value of the OLS Estimators

Assumption MLR.1 (Linear in Parameters)
The model in the population (the population model or true model) can be written as:

y = β0 + β1 x1 + β2 x2 + ... + βk xk + u

where β0, β1, ..., βk are the unknown parameters (constants) of interest and u is an unobservable random error or disturbance term.
Is the linearity assumption of MLR.1 restrictive?

- Often the relationship between economic variables is not perfectly linear. Example: the effect of effort on output might be characterized by decreasing marginal productivity.
- How can we account for non-linearity in a regression model?
  - The multiple regression model allows us to include polynomials:
    y = β0 + β1 x1 + β2 x1² + β3 x1³ + u
  - There are other ways of incorporating non-linearities, e.g. interactions (set x3 = x1 × x2).
- Thus, in principle the formulation is very general. Nonetheless:
  - We require the functional relation we end up choosing to be appropriate...
  - ...which is difficult to judge: economic theory rarely provides insights on what the exact functional form should be.
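A minimal simulated sketch of the polynomial idea (not from the slides): the regressors x1, x1² and x1³ simply enter the design matrix as extra columns, so the model stays linear in the parameters and can still be estimated by OLS. The coefficient values used to generate the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a non-linear (diminishing-returns) relationship and fit a cubic in x1
n = 500
x1 = rng.uniform(0.0, 5.0, size=n)
u = rng.normal(0.0, 0.5, size=n)
y = 2.0 + 3.0 * x1 - 0.4 * x1**2 + 0.02 * x1**3 + u

# Still linear in parameters: the regressors are 1, x1, x1^2, x1^3
X = np.column_stack([np.ones(n), x1, x1**2, x1**3])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated (beta0, beta1, beta2, beta3):", np.round(beta_hat, 3))
```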
The Expected Value of the OLS Estimators

Assumption MLR.2 (Random Sampling)
We have a random sample of n observations {(xi1, xi2, ..., xik, yi) : i = 1, 2, ..., n}, following the population model in Assumption MLR.1.

This assumption is often fine in cross-sectional data.
The Expected Value of the OLS Estimators

Assumption MLR.3 (No Perfect Collinearity)
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

- If an independent variable is an exact linear combination of the other independent variables, then the model suffers from perfect collinearity, and it cannot be estimated by OLS.
- Note: Assumption MLR.3 does allow the independent variables to be correlated; they just cannot be perfectly correlated.
When do collinearity problems crop up?

MLR.3 rules out the following cases:

- one variable is a constant multiple of another, e.g. x1 = γ x2;
- one variable can be expressed as an exact linear function of two or more of the other variables, e.g. x1 = γ0 + γ2 x2 + γ3 x3 + ... + γk xk.

Intuition:

- If there is an exact linear relationship, it is impossible to tell apart the effect of one variable from the other: we have no variation to separate out the effects.
- There are many combinations of (β0, β1, ..., βk) which all deliver the same value of the loss function (the sum of squared residuals).
- Given the exact restriction on the relationship between the different covariates, the ceteris paribus notion is meaningless.
Example (Age versus cohort effects)
Assume you are interested in estimating a wage regression for a sample of workers, using data collected during one year (2009). You believe that workers' productivity changes with age. You also believe that workers' productivity depends on their "vintage" (that is, their cohort as measured by year of birth). This motivates the following specification:

log w = β0 + β1 age + β2 cohort + u

Is Assumption MLR.3 satisfied in this case?
Example (continued)

- Note that the following relation is true for all individuals in the sample:
  cohort + age = 2009

Conclusion:

- We have to drop either the age effect or the cohort effect from the specification.
- That is, we effectively have to assume that either β1 = 0 or β2 = 0.

In this example, both age and vintage are potentially important:

- both factors may be genuinely relevant,
- but the data does not allow us to separate out the two effects.
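A short sketch of this situation (simulated data, not from the slides): when cohort + age = 2009 for every observation, the design matrix with an intercept, age and cohort is rank-deficient, so X'X cannot be inverted and OLS is not computable. The wage equation used to generate log w is a hypothetical illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Cross-section collected in 2009: cohort (year of birth) + age = 2009 for everyone
n = 200
age = rng.integers(20, 65, size=n).astype(float)
cohort = 2009.0 - age                      # exact linear relationship with age
log_w = 1.5 + 0.02 * age + rng.normal(0.0, 0.1, size=n)

# Design matrix with a constant, age and cohort: the cohort column equals
# 2009*(constant column) - (age column), so the matrix is rank-deficient
X = np.column_stack([np.ones(n), age, cohort])
rank = np.linalg.matrix_rank(X)

print(f"columns of X: {X.shape[1]}, rank of X: {rank}")  # rank 2 < 3: perfect collinearity
```

Dropping either age or cohort restores full column rank, which is exactly the conclusion above.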
The Expected Value of the OLS Estimators

Assumption MLR.4 (Zero Conditional Mean)
The error u has an expected value of 0 given any values of the independent variables. In other words:

E(u | x1, x2, ..., xk) = 0

- When Assumption MLR.4 holds, people often say the xj are exogenous explanatory variables.
- If xj is correlated with u for any reason, people often say xj is an endogenous explanatory variable.
Assumption MLR.4 is critical

Some cases in which Assumption MLR.4 can fail:

1. misspecified functional form (misspecification);
2. omission of a variable that is correlated with any of the independent variables (omitted variable problem);
3. certain forms of measurement error in an explanatory variable (measurement error problems);
4. one or more of the explanatory variables is determined jointly with y (simultaneity problems).
The Expected Value of the OLS Estimators

Theorem 3.1 (Unbiasedness of OLS)
Under Assumptions MLR.1 through MLR.4:

E(β̂j) = βj,   j = 0, 1, ..., k,

for any values of the population parameter βj. In other words, the OLS estimators are unbiased estimators of the population parameters.
Two remarks

1. Unbiasedness is a statement about the sampling distribution of an estimator.
   - If we kept drawing fresh samples from the population, what would the distribution of the estimator look like?
   - Unbiasedness says nothing about how a particular realization is related to the true parameter value.
   - In a particular sample, the estimated coefficient may be far away from the true value even though the estimator is unbiased.

2. Unbiasedness is a statement about the expected value, not about dispersion.
   - An estimator can be unbiased but still have a large dispersion around the true value.
   - Unbiasedness says nothing about the probability of being close to the true parameter value.
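The following Monte Carlo sketch (not part of the slides) illustrates both remarks: averaged over many fresh samples the OLS estimates centre on the true parameters, while any single sample's estimate can sit some distance away. The model, parameter values and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population model (satisfies MLR.1-MLR.4): y = 1 + 2*x1 - 1*x2 + u
beta = np.array([1.0, 2.0, -1.0])
n, n_reps = 100, 5000

estimates = np.empty((n_reps, 3))
for r in range(n_reps):
    x1 = rng.normal(0.0, 1.0, size=n)
    x2 = 0.5 * x1 + rng.normal(0.0, 1.0, size=n)   # correlated regressors (allowed by MLR.3)
    u = rng.normal(0.0, 1.0, size=n)
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + u
    X = np.column_stack([np.ones(n), x1, x2])
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

# The mean over repeated samples is close to the truth, even though any
# single draw can be far from it (unbiasedness says nothing about dispersion)
print("true beta:             ", beta)
print("mean of OLS estimates: ", np.round(estimates.mean(axis=0), 3))
print("std. dev. of estimates:", np.round(estimates.std(axis=0), 3))
```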
When does misspecification bias OLS estimates?

We now ask whether we can describe what happens in specific cases of misspecification:

- What is the effect of including irrelevant variables?
- What is the effect of excluding a relevant variable?
What is the effect of including irrelevant variables?

- Suppose we specify the model as:
  y = β0 + β1 x1 + β2 x2 + β3 x3 + u
  and this model satisfies Assumptions MLR.1 through MLR.4.
- Assume that x3 is irrelevant: β3 = 0.
- Note that then: E(y | x1, x2, x3) = E(y | x1, x2) = β0 + β1 x1 + β2 x2.
- Since we do not know that β3 = 0, we estimate the following equation:
  ŷ = β̂0 + β̂1 x1 + β̂2 x2 + β̂3 x3
- Key result: including an irrelevant variable does not bias the OLS estimators. However, it can affect their variance: there is a loss of efficiency, because we estimate an additional parameter (β3).
What is the effect of excluding a relevant variable?

- Suppose the true population model is:
  y = β0 + β1 x1 + β2 x2 + u
  and assume that this model satisfies Assumptions MLR.1 through MLR.4.
- However, due to data availability, we estimate instead the model excluding x2:
  ỹ = β̃0 + β̃1 x1
Example (Unobserved ability in a wage regression)
Suppose that wages are determined as a function of schooling and of ability. Thus, the appropriate model would be

(log w)ˆ = β̂0 + β̂1 schooling + β̂2 ability

If we cannot observe individuals' ability, we may have to restrict ourselves to

(log w)˜ = β̃0 + β̃1 schooling
- We use the following fact (Lecture 4, Slide 22):
  β̃1 = β̂1 + β̂2 δ̃1
  where β̂1 and β̂2 are the slope estimators from the multiple regression of yi on xi1 and xi2, and δ̃1 is the slope from the simple regression of xi2 on xi1.
- Now compute the expected value (noting that δ̃1 is fixed, conditional on the x's):
  E(β̃1) = E(β̂1 + β̂2 δ̃1) = E(β̂1) + E(β̂2) δ̃1 = β1 + β2 δ̃1
- This implies that the bias in β̃1 (the omitted variable bias) is:
  Bias(β̃1) = E(β̃1) − β1 = β2 δ̃1

Conclusion: There are two cases where β̃1 is unbiased:

1. if the unobserved covariate is irrelevant for y: β2 = 0;
2. if x1 and x2 are uncorrelated: δ̃1 = 0.
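A numerical sketch of the decomposition above (simulated data, not from the slides): with a hypothetical data-generating process in which x2 matters and is correlated with x1, the short-regression slope equals β̂1 + β̂2 δ̃1 exactly in any given sample, as the algebra states.

```python
import numpy as np

rng = np.random.default_rng(4)

# One sample from a model where x2 (e.g. ability) matters and is correlated with x1
n = 1000
x1 = rng.normal(0.0, 1.0, size=n)
x2 = 0.6 * x1 + rng.normal(0.0, 1.0, size=n)        # population slope of x2 on x1 is 0.6
u = rng.normal(0.0, 1.0, size=n)
y = 1.0 + 0.5 * x1 + 0.8 * x2 + u                   # true beta1 = 0.5, beta2 = 0.8

ones = np.ones(n)
# Long regression: y on (1, x1, x2)
b_long, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)
# Short regression: y on (1, x1) only (x2 omitted)
b_short, *_ = np.linalg.lstsq(np.column_stack([ones, x1]), y, rcond=None)
# Auxiliary regression: x2 on (1, x1); its slope is delta_tilde_1
d_aux, *_ = np.linalg.lstsq(np.column_stack([ones, x1]), x2, rcond=None)

beta1_hat, beta2_hat = b_long[1], b_long[2]
beta1_tilde, delta1_tilde = b_short[1], d_aux[1]

# Identity from the slide: beta1_tilde = beta1_hat + beta2_hat * delta1_tilde
print(f"beta1_tilde (short regression):  {beta1_tilde:.4f}")
print(f"beta1_hat + beta2_hat*delta1:    {beta1_hat + beta2_hat * delta1_tilde:.4f}")
print(f"estimated bias beta2*delta1:     {beta2_hat * delta1_tilde:.4f}")
```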
The sign of the bias in β̃1 depends on the signs of both β2 and δ̃1:

                     β2 > 0          β2 < 0
Corr(x1, x2) > 0     positive bias   negative bias
Corr(x1, x2) < 0     negative bias   positive bias

Terminology:

- If E(β̃1) > β1, then β̃1 has an upward bias.
- If E(β̃1) < β1, then β̃1 has a downward bias.
- The phrase biased towards zero refers to cases where E(β̃1) is closer to zero than β1:
  - if β1 > 0, then β̃1 is biased towards zero if it has a downward bias;
  - if β1 < 0, then β̃1 is biased towards zero if it has an upward bias.
Omitted variable bias – more general cases

- Suppose the true population model is:
  y = β0 + β1 x1 + β2 x2 + β3 x3 + u
  and this model satisfies Assumptions MLR.1 through MLR.4.
- However, we omit x3 and estimate:
  ỹ = β̃0 + β̃1 x1 + β̃2 x2
- It is difficult to determine the direction of the bias in general, because x1, x2 and x3 can all be pairwise correlated.
- Even if x2 and x3 are uncorrelated, if x1 is correlated with x3 then the coefficients on all included regressors, β̃1 and β̃2, will normally be biased.
- It is better to start by estimating a general model – and risk including irrelevant variables – than to omit relevant variables and risk bias.
The Variance of the OLS Estimators

Assumption MLR.5 (Homoscedasticity)
The error u has the same variance given any values of the explanatory variables:

Var(u | x1, ..., xk) = σ²

If Assumption MLR.5 fails, then the model exhibits heteroscedasticity:

- The variance of the error term, given the explanatory variables, is not constant, but varies with the regressors.
- Assumption MLR.5 is about the properties of the second moment of u. The assumption is distinct from MLR.4 (which deals with the first moment).
Gauss-Markov assumptions

- Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov assumptions.
- Assumptions MLR.1 and MLR.4 can be written as:
  E(y | x) = β0 + β1 x1 + β2 x2 + ... + βk xk
- Assumption MLR.5 can be written as:
  Var(y | x) = σ²
  where x is the set of all independent variables, (x1, ..., xk).
The Variance of the OLS Estimators

Theorem 3.2 (Sampling Variance of the OLS Slope Estimators)
Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables:

Var(β̂j) = σ² / [SSTj (1 − Rj²)]

for j = 1, 2, ..., k.

Notation: SSTj = Σ_{i=1}^n (xij − x̄j)² is the total sample variation in xj, and Rj² is the R-squared from regressing xj on all other independent variables (including an intercept).

The size of Var(β̂j) is important: a larger variance means a less precise estimator, which leads to larger confidence intervals and less powerful hypothesis tests, as we will see in Lecture 6 (Inference in the MLR model).
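To see the pieces of Theorem 3.2 in action, here is a minimal sketch (not from the slides) that computes SST1, R1² and the implied Var(β̂1) for simulated regressors, assuming the error variance σ² is known and set to 4.0 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrate Var(beta1_hat) = sigma^2 / (SST_1 * (1 - R_1^2))
n = 500
sigma2 = 4.0                                   # assumed known error variance (illustrative)
x1 = rng.normal(0.0, 1.0, size=n)
x2 = 0.9 * x1 + rng.normal(0.0, 0.5, size=n)   # fairly strong correlation with x1

# SST_1: total sample variation in x1
sst1 = np.sum((x1 - x1.mean()) ** 2)

# R_1^2: R-squared from regressing x1 on the other regressors (here just x2, with intercept)
Z = np.column_stack([np.ones(n), x2])
gamma, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ gamma
r1_sq = 1.0 - np.sum(resid**2) / sst1

var_beta1 = sigma2 / (sst1 * (1.0 - r1_sq))
print(f"SST_1 = {sst1:.1f},  R_1^2 = {r1_sq:.3f},  Var(beta1_hat) = {var_beta1:.5f}")
```

Pushing the correlation between x1 and x2 closer to 1 drives R1² towards 1 and the variance towards infinity, which is the multicollinearity problem discussed below.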
Interpretation

Theorem 3.2 tells us that the variance of the estimator β̂j depends on three factors:

1. The error variance σ² (+)
2. The extent of the linear relationship among the independent variables, Rj² (+)
3. The total sample variation in xj, SSTj (–)

1. The error variance σ²: a larger σ² means larger variances for the OLS estimators.
   - This reflects more noise in the data.
   - To reduce the error variance for a given y, add more explanatory variables to the equation.
Interpretation

2. The total sample variation in xj, SSTj: the larger the total variation in xj, the smaller Var(β̂j).
   - To increase the sample variation in each of the independent variables, increase the sample size.

3. The linear relationship among the independent variables, Rj²:
   - Consider a limiting case: Var(β̂j) → ∞ as Rj² → 1.
   - A close-to-perfect linear association between regressors means that separating out the independent effect of any one of them becomes increasingly hard.
   - High correlation between two or more independent variables is called multicollinearity (perfect correlation is ruled out by MLR.3).
Interpretation

Remarks:

- What ultimately matters for statistical inference is how big β̂j is in relation to its standard deviation.
- Also note that a high degree of correlation between certain independent variables can be irrelevant to how well we can estimate other parameters in the model.
- In practice, this means that inference about a particular parameter of interest does not rely on all the regressors in a model displaying a lot of independent variation: e.g. one can include multiple correlated proxies to control for variation in variables we are not primarily interested in.
Variance in misspecified models

In applications, we sometimes have little guidance on whether to include a particular covariate or not.

- There is typically a tradeoff between bias and variance.
- Suppose the true population model, which satisfies the Gauss-Markov assumptions, is:
  y = β0 + β1 x1 + β2 x2 + u     (1)
- Consider two estimators of β1:
  1. ŷ = β̂0 + β̂1 x1 + β̂2 x2,  where  Var(β̂1) = σ² / [SST1 (1 − R1²)]
  2. ỹ = β̃0 + β̃1 x1,  where  Var(β̃1) = σ² / SST1
Variance in misspecified models

Assuming that x1 and x2 are correlated, we can draw the following conclusions:

1. When β2 ≠ 0, β̃1 is biased, β̂1 is unbiased, and Var(β̃1) < Var(β̂1).
   ⟹ When β2 ≠ 0, there are two reasons for including x2 in the model:
   - Any bias in β̃1 does not shrink as the sample size grows, but the variance does;
   - The variance of β̃1 shown on the previous slide effectively ignores the inflation in the error variance that occurs when x2 is omitted. This omission is a consequence of treating the regressors as non-random.

2. When β2 = 0, β̃1 and β̂1 are both unbiased, and Var(β̃1) < Var(β̂1).
   ⟹ β̃1 is preferred if β2 = 0.

As we will see in Lecture 6, we can perform a statistical test of the hypothesis β2 = 0 after running the regression (1), which may lead us to re-estimate the model omitting x2.
Estimating the error variance σ²

To operationalize the expression for the variance of the OLS estimators given in Theorem 3.2, we need to form an estimate of σ².

- The unbiased estimator of σ² in the multiple regression case is:
  σ̂² = SSR / (n − k − 1) = Σ_{i=1}^n ûi² / (n − k − 1)
- The degrees of freedom is:
  df = n − (k + 1) = (no. of observations) − (no. of estimated parameters)
- Recall: the division by n − k − 1 comes from the fact that, in obtaining the OLS estimates, k + 1 restrictions are imposed on the OLS residuals, so that there are only n − k − 1 df in the residuals.
- For the SLR model (Lecture 3, Slide 42), we divided by n − 2, since in that case k = 1.
Theorem 3.3 (Unbiased Estimation of σ²)
Under the Gauss-Markov Assumptions MLR.1 through MLR.5,

E(σ̂²) = σ²

- σ̂ is called the standard error of the regression (SER), the standard error of the estimate, or the root mean squared error.
- It is an estimator of the standard deviation of the error term.
- Note that σ̂ can either decrease or increase when another independent variable is added to a regression. Why?
  - numerator: SSR goes down...
  - denominator: k increases, so n − k − 1 goes down...
  - ...so the overall effect is ambiguous.
  - But when n >> k, more likely than not σ̂ will fall.
- For constructing confidence intervals and conducting tests, we need to estimate the standard deviation of β̂j:
  sd(β̂j) = σ / [SSTj (1 − Rj²)]^{1/2}
- Since σ is unknown, we replace it with its estimator σ̂.
- This gives us a feasible estimator (i.e. one that depends on no unknown or unobserved quantities) of the standard error of β̂j:
  se(β̂j) = σ̂ / [SSTj (1 − Rj²)]^{1/2}
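The following sketch (simulated data, not from the slides) puts the last two slides together: it fits a two-regressor model by OLS, forms σ̂² = SSR/(n − k − 1), and then computes se(β̂1) with the formula above. All parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Fit y on (1, x1, x2) and compute sigma_hat and se(beta1_hat) with the slide formulas
n, k = 400, 2
x1 = rng.normal(0.0, 1.0, size=n)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 1.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat

# sigma_hat^2 = SSR / (n - k - 1)
sigma2_hat = np.sum(u_hat**2) / (n - k - 1)

# se(beta1_hat) = sigma_hat / sqrt(SST_1 * (1 - R_1^2))
sst1 = np.sum((x1 - x1.mean()) ** 2)
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
r1_sq = 1.0 - np.sum((x1 - Z @ g) ** 2) / sst1
se_beta1 = np.sqrt(sigma2_hat) / np.sqrt(sst1 * (1.0 - r1_sq))

print(f"sigma_hat = {np.sqrt(sigma2_hat):.3f}  (true sigma = 1.5)")
print(f"se(beta1_hat) = {se_beta1:.4f}")
```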
The Gauss-Markov Theorem: Introduction

Why should we choose OLS as our estimator? We have already seen that it is unbiased; but:

- What about its variance?
- How dispersed is it around the true parameter values?
- Are there other estimators which are less dispersed?

The Gauss-Markov Theorem tells us that, under a set of specific assumptions (the Gauss-Markov assumptions),

- OLS has smaller variance than any other estimator within the class of linear and unbiased estimators.

The Gauss-Markov Theorem justifies the use of OLS rather than competing linear estimators.
Efficiency of OLS

Gauss-Markov Theorem
Under the following set of assumptions:

- MLR.1 (Linear in parameters)
- MLR.2 (Random sampling)
- MLR.3 (No perfect collinearity)
- MLR.4 (Zero conditional mean)
- MLR.5 (Homoscedasticity)

the OLS estimators (β̂0, β̂1, ..., β̂k) are the best linear unbiased estimators (BLUE) of the population parameters (β0, β1, ..., βk).

Knowing this makes life easy: if any other linear unbiased estimator is proposed, and the assumptions above hold, that estimator will be no more efficient than OLS.
Interpretation: A comparison within a specific class

[Diagram: nested sets – all estimators of β ⊃ subset of linear estimators ⊃ subset of linear unbiased estimators ⊃ OLS]

- Estimator: a rule that can be applied to any sample of data to produce an estimate.
- Linear: an estimator is linear if, and only if, it can be expressed as a linear function of the data on the dependent variable:
  β̃j = Σ_{i=1}^n wij yi
  where each wij can be a function of the sample values of all the independent variables.
- Unbiased: β̃j is an unbiased estimator of βj if E(β̃j) = βj.
The criterion:

- Best: smallest variance.

⟹ Under Assumptions MLR.1-MLR.5, we have that for any estimator β̃j that is linear and unbiased,

Var(β̃j) ≥ Var(β̂j)

where β̂j is the OLS estimator.

Keep in mind: if any of the Gauss-Markov assumptions fail, then this theorem no longer holds.
Effects of heteroscedasticity

Under heteroscedasticity, the Gauss-Markov theorem no longer applies:

- Assumption MLR.5 does not hold any longer.
- The OLS estimator is still unbiased (MLR.5 is not required for Theorem 3.1).
- But the Gauss-Markov theorem does not apply, so a more efficient linear unbiased estimator may exist (it does!).

Intuition for why OLS may not be efficient:

- Heteroscedasticity means that some observations are more informative (contain less noise) than others...
- ...but the OLS objective function puts equal weight on all squared residuals ûi².
- OLS does not exploit the fact that we can extract more information from some observations than from others...
- ...so it is not surprising that there may be a more efficient estimator!

We will return to this important topic in Lecture 10.
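As a preview of that later material, here is a hedged simulation sketch (not from the slides) of the intuition. It assumes, purely for illustration, that the error-variance pattern is known; a weighted least squares estimator that down-weights the noisier observations then has a visibly smaller sampling variance than OLS, while both remain approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(7)

# Heteroscedastic simple regression: sd(u|x) = 0.5 + x (assumed known here for illustration)
n, n_reps = 200, 3000
beta0, beta1 = 1.0, 2.0

ols_b1 = np.empty(n_reps)
wls_b1 = np.empty(n_reps)
for r in range(n_reps):
    x = rng.uniform(0.0, 3.0, size=n)
    sd = 0.5 + x                                   # error standard deviation varies with x
    y = beta0 + beta1 * x + rng.normal(0.0, sd)
    X = np.column_stack([np.ones(n), x])

    # OLS: equal weight on every observation
    ols_b1[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

    # Weighted least squares with weights 1/sd: down-weights the noisier observations
    w = 1.0 / sd
    wls_b1[r] = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0][1]

# Both are close to unbiased, but the weighted estimator is less dispersed
print(f"OLS:  mean = {ols_b1.mean():.3f}, std = {ols_b1.std():.3f}")
print(f"WLS:  mean = {wls_b1.mean():.3f}, std = {wls_b1.std():.3f}")
```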
Next lecture: Multiple Regression Analysis: Inference.
Appendix: Numerical example to fix ideas (continued)
This continues the numerical exercise for the simple linear regression model (from week 3).

Example (Wage effects of education (continued))

- We estimated the coefficients as follows (n = 10):
  wageˆ = −3.569 + 0.8597 schooling
- We can predict the residuals ûi in the sample, and estimate the variance of the error term:
  σ̂² = (1/(n − 2)) Σ_{i=1}^n ûi² = 95.54 / (10 − 2) = 11.94

Table: Data

id   wage    schooling
1     6.00    8
2     5.30   12
3     8.75   16
4    11.25   18
5     5.00   12
6     3.60   12
7    18.18   17
8     6.25   16
9     8.13   13
10    8.77   12
Example (Wage effects of education (continued))

- We can also estimate R² from the residuals and the total sum of squares SST:
  R² = 1 − SSR/SST = 1 − 95.54/157.93 = 0.395
  Thus, our empirical model explains 39.5% of the observed variation in wages.
Example (Wage effects of education (continued))

- Plug in σ̂ to obtain estimates of the standard errors of β̂1 and β̂0:
  se(β̂1) = σ̂ / [Σ_{i=1}^n (xi − x̄)²]^{1/2} = √11.94 / √84.40 = 0.376
  se(β̂0) = σ̂ [(1/n) Σ_{i=1}^n xi²]^{1/2} / [Σ_{i=1}^n (xi − x̄)²]^{1/2} = (√11.94 · √193.4) / √84.40 = 5.23
- We can summarize what we have learned so far about this regression by writing
  wageˆ = −3.569 + 0.8597 schooling,    n = 10,  R² = 0.395
          (5.23)   (0.376)
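As a check on the calculations in this appendix, the following short sketch reproduces the coefficient estimates, σ̂², R² and the standard errors directly from the ten observations in the data table, using only the simple-regression formulas from the slides.

```python
import numpy as np

# Data from the table above (n = 10)
wage = np.array([6, 5.3, 8.75, 11.25, 5, 3.6, 18.18, 6.25, 8.13, 8.77])
schooling = np.array([8, 12, 16, 18, 12, 12, 17, 16, 13, 12], dtype=float)
n = len(wage)

# OLS slope and intercept for the simple regression of wage on schooling
sxx = np.sum((schooling - schooling.mean()) ** 2)             # 84.40
sxy = np.sum((schooling - schooling.mean()) * (wage - wage.mean()))
b1 = sxy / sxx                                                # 0.8597
b0 = wage.mean() - b1 * schooling.mean()                      # -3.569

# Residuals, error-variance estimate, and R-squared
u_hat = wage - (b0 + b1 * schooling)
ssr = np.sum(u_hat**2)                                        # 95.54
sst = np.sum((wage - wage.mean()) ** 2)                       # 157.93
sigma2_hat = ssr / (n - 2)                                    # 11.94
r_sq = 1 - ssr / sst                                          # 0.395

# Standard errors of the slope and the intercept (SLR formulas)
se_b1 = np.sqrt(sigma2_hat) / np.sqrt(sxx)                                    # 0.376
se_b0 = np.sqrt(sigma2_hat) * np.sqrt(np.mean(schooling**2)) / np.sqrt(sxx)   # 5.23

print(f"wage^ = {b0:.3f} + {b1:.4f} schooling")
print(f"sigma2_hat = {sigma2_hat:.2f}, R^2 = {r_sq:.3f}")
print(f"se(b1) = {se_b1:.3f}, se(b0) = {se_b0:.2f}")
```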