topic 2 - Victor Aguirregabiria

ECONOMETRICS II (ECO 2401)
Victor Aguirregabiria
Winter 2015
TOPIC 2: BINARY CHOICE MODELS
1. Introduction
2. BCM with cross sectional data
2.1. Threshold model
2.2. Interpretation in terms of utility maximization
2.3. Probit and Logit Models
2.4. Testing hypotheses on parameters
2.5. Measures of goodness of fit
2.6. Partial Effects and Average Partial Effects in BCM
2.7. BCM as a Regression Model
2.8. Misspecification of Binary Choice Models
2.9. Specification tests based on Generalized Residuals
2.10. Semiparametric methods
2.11. BCM with endogenous regressors
3. BCM with panel data
3.1. Static models
a) Fixed effects estimators
b) Random effects estimators
3.2. Dynamic models
a) Fixed effects estimators
b) Random effects estimators
1. INTRODUCTION
Econometric discrete choice models, or qualitative response models, are models where the dependent variable takes a discrete and finite set of values.
Many economic decisions involve choices among discrete alternatives.
(1) Labor economics: labor force participation; unionization; occupational choice; migration; retirement; job matching (hiring/firing workers); strikes.
(2) Population and family economics: number of children; contraceptive choice; marriage; divorce.
(3) Industrial organization: entry and exit in a market; product choice in a differentiated product market; purchase of durable goods; firms' choice of location.
(4) Education economics: going to college decision.
(5) Political Economy: voting
Some classifications of Discrete Choice Models (DCM) that have relevant implications for the econometric analysis are:
a) Type of data: Cross-section / Panel
b) Number of choice alternatives: Binomial or Binary / Multinomial
c) Speci…cation assumptions: Parametric / Semiparametric
d) Dynamic / Static;
e) Single-agent / games
2. BCM WITH CROSS SECTIONAL DATA
We are interested in the occurrence or non-occurrence of a certain event (e.g., "an individual is unemployed", "a worker is unionized", "a firm invests in R&D") and how this event depends on some explanatory variables X.
Define the binary variable Y such that:
Y = 1 if the event occurs
Y = 0 if it does not
Define the Conditional Choice Probability (CCP):
P(x) ≡ Pr(Y = 1 | X = x)
Note that:
E(Y | X = x) = P(x)
A BCM is a parametric model for the conditional expectation E(Y | X = x), which is also the CCP, P(x).
Reduced form Model for CCP
In some empirical applications, the researcher may be interested in the CCP P(x) just as a predictor of Y given X = x, not in a causal effect interpretation of the model.
In that case, the researcher can just choose a flexible specification of P(x). For instance:
P(x) = F(x'β)
where F(·) is a known function that maps the index x'β into the probability space [0, 1], e.g., F(·) is a CDF.
Model with explicit specification of unobservables
Many times we are interested in the causal effect of X on Y. Then, it is useful to consider a model that relates Y to X and to the variables that are unobservable to the researcher, ε, and that makes assumptions about the relationship between X and ε:
Y = g(X, β, ε)
Since Y is a discrete variable, it should respond in a discrete way (i.e., not continuously) to changes in (X, β, ε). That is, g(·) should be a function that maps continuous variables in β, ε, or X into the binary set {0, 1}.
In principle, this condition rules out the Linear Regression Model (i.e., Y = X'β + ε) as a valid model for a binary dependent variable. We will discuss this point in detail in Sections 2.5 and 2.6 below.
2.1. THRESHOLD MODELS
A popular specification of g(X, β, ε) that appears naturally in many economic applications is the threshold function:
Y = g(X, β, ε) = 1 if Y*(X, β, ε) ≥ 0, and Y = 0 if Y*(X, β, ε) < 0
Y*(X, β, ε) is a real-valued function that is called the latent variable. Note that setting the threshold at 0 is an innocuous normalization because Y*(X, β, ε) always includes a constant term.
A common specification of the latent threshold function is:
Y*(X, β, ε) = X'β − ε
where β is a K × 1 vector of parameters.
Therefore, the model is:
Y = 1 if ε ≤ X'β, and Y = 0 if ε > X'β
We can also represent the model using the Indicator Function 1{A}, where 1{A} = 1 if A is true and 1{A} = 0 if A is false:
Y = 1{ε ≤ X'β}
When " is independent of X and it has a CDF F ( ), we have that:
P (x) = Pr (Y
0 j X = x) = Pr("
x0 ) = F (x0 )
The relationship between the conditional probability P (x) and the index x0
depends on the distribution of ".
If " is N (0; 1):
If " is Logistic:
F (x0 ) =
F (x0
)=
x0
( x0
exp x0
)=
1 + exp (x0 )
Interpretation of the parameters
We know that in a linear regression model, Y = X'β + ε, when ε is (mean) independent of X we have that E(Y|X = x) = x'β and:
β_k = ∂E(Y|X = x)/∂x_k   if X_k is continuous
β_k = E(Y|X = x + Δ_k) − E(Y|X = x)   if X_k is discrete
with Δ_k a vector of 0s except at position k, where we have a 1.
In a BCM, we have that:
∂E(Y|X = x)/∂x_k = β_k f(x'β)   if X_k is continuous
E(Y|X = x + Δ_k) − E(Y|X = x) = F(x'β + β_k) − F(x'β)   if X_k is discrete
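As an illustration (not part of the original notes), the following Python sketch computes these two kinds of partial effects for a probit, using hypothetical parameter values and an arbitrary evaluation point:

import numpy as np
from scipy.stats import norm

beta = np.array([0.5, 1.0, -0.8])     # hypothetical: (constant, x1, dummy x2)
x = np.array([1.0, 0.3, 0.0])         # evaluation point (constant = 1, dummy = 0)

index = x @ beta
pe_continuous = beta[1] * norm.pdf(index)                 # beta_k * f(x'beta) for x1
x_plus = x.copy()
x_plus[2] = 1.0                                           # switch the dummy from 0 to 1
pe_discrete = norm.cdf(x_plus @ beta) - norm.cdf(index)   # F(x'beta + beta_k) - F(x'beta)

print(pe_continuous, pe_discrete)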
DCM and Models with Non-additive unobservables
Discrete choice models belong to a class of nonlinear econometric models where the unobservables (error term) enter the model in a non-additive form:
Y = g(X, ε)
where g(·,·) is a function that is not additive in ε, e.g., g(X, ε + c) ≠ g(X, ε) + c. In DCMs this non-additivity is a natural implication of the discrete nature of the dependent variable.
In this class of models, the "Average Partial Effect" is different from the "Partial Effect at the Average". A linear-regression approach typically provides estimates of the "Partial Effect at the Average". We will discuss why, for some empirical questions, we are interested in estimating the "Average Partial Effect" and not the "Partial Effect at the Average".
Interpretation of the parameters (2)
The main difference between the LRM and the BCM in the interpretation of ∂E(Y|X = x)/∂x_k is that in the LRM the Partial Effects are constant across x, while in the BCM they depend on the individual characteristics, and more specifically on the individual's propensity or probability of Y = 1 given X.
Taking into account that P(x) = F(x'β) and ∂E(Y|X = x)/∂x_k = β_k f(x'β), we have that:
As x'β → −∞:   P(x) → 0 and ∂E(Y|X = x)/∂x_k → 0
As x'β → +∞:   P(x) → 1 and ∂E(Y|X = x)/∂x_k → 0
2.2. INTERPRETATION IN TERMS OF UTILITY MAXIMIZATION
Example 1: Consider an individual who has to decide whether to purchase a certain durable good or not (e.g., an iPhone). Suppose that the purchased quantity is either one or zero: Y ∈ {0, 1}.
Y ∈ {0, 1} is the indicator of purchasing the durable good.
The utility function is U(C, Y), where C represents consumption of the composite good. More specifically:
U(C, Y) = u(C) + Y {Z'δ₁ − ε}
where δ₁ is a vector of parameters, u(·) is an increasing function, Z is a vector of characteristics observable to the econometrician, such as age and education, and ε is a zero mean random variable that is individual-specific.
The individual's decision problem is to maximize U(C, Y) subject to the budget constraint C + P·Y ≤ M, where P is the price of the good, and M is the individual's disposable income.
We can represent this decision problem as:
max { U(M, 0) , U(M − P, 1) }
Therefore, the optimal choice is Y = 1 iff U(M − P, 1) > U(M, 0), or:
Y = 1 ⟺ u(M − P) + Z'δ₁ − ε > u(M)
For instance, suppose that u(C) = α₁C − α₂C², with α₁ ≥ 0 and α₂ ≥ 0. Then,
Y = 1 ⟺ −α₁P + α₂P[M − P] + Z'δ₁ − ε > 0 ⟺ X'β − ε > 0
where X' = (−P, P[M − P], Z') and β' = (α₁, α₂, δ₁').
Conditional on {Z, P, M}, the probability that an individual purchases the product is:
P(Y = 1 | Z, P, M) = F( [−P, P(M − P), Z']β )
Example 2:
Y = Indicator of the event "individual goes to college".
X = { HS grades ; Family income ; Parents' Education ; Scholarships }
Let U₀ and U₁ be the utilities associated with choosing Y = 0 (no college) and Y = 1 (college), respectively.
Consider the following specification of these utility functions:
U₀ = X'β₀ + ε₀
U₁ = X'β₁ + ε₁
If the individual maximizes her utility, then:
{Y = 1} ⟺ {U₁ ≥ U₀} ⟺ ε ≤ X'β
where β = β₁ − β₀, and ε = ε₀ − ε₁.
2.3. PROBIT AND LOGIT MODELS
To complete the parametric specification of the model we should make an assumption about the distribution of the disturbance εᵢ. The most common assumptions in the literature are:
Probit model: ε ~ N(0, 1), then F(x'β) = Φ(x'β)
Logit model: ε ~ Logistic, then F(x'β) = exp(x'β) / [1 + exp(x'β)]
Maximum likelihood estimation. Let {yᵢ, xᵢ : i = 1, 2, ..., n} be a random sample of (Y, X).
The likelihood function is:
L(β) = Pr(y₁, y₂, ..., yₙ | x₁, x₂, ..., xₙ) = Πᵢ₌₁ⁿ Pr(yᵢ | xᵢ) = Π_{yᵢ=1} F(xᵢ'β) × Π_{yᵢ=0} [1 − F(xᵢ'β)]
The log-likelihood function:
l(β) = Σᵢ₌₁ⁿ lᵢ(β) = Σᵢ₌₁ⁿ yᵢ ln F(xᵢ'β) + (1 − yᵢ) ln[1 − F(xᵢ'β)]
This likelihood is continuous and twice differentiable if F(·) is.
For the Probit and Logit models, the likelihood is also globally concave.
The MLE is the value β̂ that solves the likelihood equations:
∂l(β̂)/∂β = Σᵢ₌₁ⁿ ∂lᵢ(β̂)/∂β = Σᵢ₌₁ⁿ [yᵢ − F(xᵢ'β̂)] · xᵢ f(xᵢ'β̂) / { F(xᵢ'β̂) [1 − F(xᵢ'β̂)] } = 0
∂lᵢ(β̂)/∂β is called the score of observation i.
For the Probit model the likelihood equations are:
Σᵢ₌₁ⁿ ∂lᵢ(β̂)/∂β = Σᵢ₌₁ⁿ [yᵢ − Φ(xᵢ'β̂)] · xᵢ φ(xᵢ'β̂) / { Φ(xᵢ'β̂) [1 − Φ(xᵢ'β̂)] } = 0
And for the Logit model the likelihood equations are:
Σᵢ₌₁ⁿ ∂lᵢ(β̂)/∂β = Σᵢ₌₁ⁿ xᵢ [ yᵢ − exp(xᵢ'β̂) / (1 + exp(xᵢ'β̂)) ] = 0
because for the logistic distribution f(ε) = F(ε)[1 − F(ε)].
Computation of the MLE
There is no closed-form expression for the MLE. We have to calculate β̂ numerically using an iterative algorithm.
The most common iterative algorithms to obtain the MLE are Newton-Raphson and BHHH. Given that the likelihood is globally concave, both algorithms converge to the unique maximum regardless of the initial value we use to initialize the algorithm.
Newton-Raphson iterations:
β̂_{K+1} = β̂_K − [ Σᵢ₌₁ⁿ ∂²lᵢ(β̂_K)/∂β∂β' ]⁻¹ [ Σᵢ₌₁ⁿ ∂lᵢ(β̂_K)/∂β ]
BHHH iterations:
β̂_{K+1} = β̂_K + [ Σᵢ₌₁ⁿ (∂lᵢ(β̂_K)/∂β)(∂lᵢ(β̂_K)/∂β') ]⁻¹ [ Σᵢ₌₁ⁿ ∂lᵢ(β̂_K)/∂β ]
Note that, at the true value of β in the population:
− plim_{n→∞} (1/n) Σᵢ₌₁ⁿ ∂²lᵢ(β)/∂β∂β' = plim_{n→∞} (1/n) Σᵢ₌₁ⁿ (∂lᵢ(β)/∂β)(∂lᵢ(β)/∂β')
i.e., Fisher's information matrix.
Asymptotic properties of the MLE
If the model is correctly specified,
√n (β̂ − β) →d N(0, V)
where
V = E[ (∂lᵢ(β)/∂β)(∂lᵢ(β)/∂β') ]⁻¹ = E[ f(X'β)² X X' / ( F(X'β)[1 − F(X'β)] ) ]⁻¹
A consistent estimate of V is obtained by substituting β by β̂ and E_X(·) by the sample mean, such that:
V̂ar(β̂) = V̂/n, with V̂ = [ (1/n) Σᵢ₌₁ⁿ f(xᵢ'β̂)² xᵢxᵢ' / ( F(xᵢ'β̂)[1 − F(xᵢ'β̂)] ) ]⁻¹
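The following Python sketch (simulated data and hypothetical parameter values, not from the notes) illustrates a Newton-Raphson implementation of the logit MLE together with the estimated asymptotic variance based on the information matrix:

import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])                         # hypothetical values
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(2)
for _ in range(50):                                       # Newton-Raphson loop
    p = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p)                                 # sum of scores
    hessian = -(X * (p * (1 - p))[:, None]).T @ X         # sum of Hessians
    step = np.linalg.solve(hessian, score)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

p = 1 / (1 + np.exp(-X @ beta))
var_hat = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)   # inverse information
print(beta, np.sqrt(np.diag(var_hat)))                        # estimates and s.e.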
2.4. TESTING HYPOTHESES ON PARAMETERS AND REPORTING ESTIMATION RESULTS
Wald, LM and LR tests as usual for MLE.
Reporting estimation results: For some applications the estimated partial effects can be more informative than the estimates of the parameters. The partial effects can be evaluated at the mean value of the regressors, x̄.
The estimated partial effect for explanatory variable k (evaluated at the sample mean x̄) is:
P̂E_k = β̂_k f(x̄'β̂)   if X_k is continuous
P̂E_k = F(x̄'β̂ + β̂_k) − F(x̄'β̂)   if X_k is discrete
However, in some applications we may be more interested in Average Partial Effects than in Partial Effects evaluated at the mean. We come back to this point in Sections 2.5 and 2.6 below.
Example: Default in the payment of college student loans. Knapp and Seaks (REStat, 1992).
- Sample: 1834 college students in Pennsylvania who got a student loan and left college in the academic year 1984-1985.
Variable                           β̂ (s.e.)          Partial Effect (in % points)
Graduation dummy                   -1.090 (0.121)     -9.9
Parent's income (in thousand $)    -0.018 (0.004)     -0.2
Loan amount (in thousand $)         0.026 (0.020)     +0.3
College cost (in thousand $)        0.085 (0.061)     +0.9
2.5. MEASURES OF GOODNESS OF FIT
Standard Residuals. In a BCM, after the estimation of β, we cannot obtain residuals for the unobservable ε. Note that the "standard" residual ε̂ᵢ is such that:
yᵢ = 1{xᵢ'β̂ − ε̂ᵢ ≥ 0}
If yᵢ = 1, we know that ε̂ᵢ ≤ xᵢ'β̂. If yᵢ = 0, we know that ε̂ᵢ > xᵢ'β̂. But we do not know the exact value of ε̂ᵢ.
This is a relevant issue for the identification of the distribution of ε and for the distributional assumptions. However, it is not an issue for obtaining goodness-of-fit measures from the estimated model.
2.5. MEASURES OF GOODNESS OF FIT
Define the following fitted values:
P̂ᵢ = F(xᵢ'β̂)   and   ŷᵢ = 0 if P̂ᵢ < 0.5, ŷᵢ = 1 if P̂ᵢ ≥ 0.5
Log likelihood function: l(β̂)
Number of wrong predictions: Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Pseudo-R-squares:
- Square of the correlation between yᵢ and P̂ᵢ.
- Square of the correlation between yᵢ and ŷᵢ.
Weighted RSS: Σᵢ₌₁ⁿ (yᵢ − P̂ᵢ)² / [ P̂ᵢ (1 − P̂ᵢ) ]
Likelihood Ratio Index (or McFadden's R-square):
LRI = 1 − l(β̂)/l₀
where l₀ is the log-likelihood when all parameters except the constant term are zero. It is simple to prove that
l₀ = n₀ ln(n₀) + n₁ ln(n₁) − n ln(n)
where n₀ = #obs with yᵢ = 0, and n₁ = #obs with yᵢ = 1.
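A minimal Python sketch of these goodness-of-fit measures, assuming the outcomes y and fitted probabilities P̂ are already available (the arrays below are hypothetical):

import numpy as np

def fit_measures(y, P_hat):
    y_hat = (P_hat >= 0.5).astype(float)                  # predicted outcomes
    wrong = np.sum((y - y_hat) ** 2)                      # number of wrong predictions
    pseudo_r2 = np.corrcoef(y, P_hat)[0, 1] ** 2          # squared correlation with P_hat
    wrss = np.sum((y - P_hat) ** 2 / (P_hat * (1 - P_hat)))
    loglik = np.sum(y * np.log(P_hat) + (1 - y) * np.log(1 - P_hat))
    n1, n0, n = y.sum(), (1 - y).sum(), y.size
    loglik0 = n0 * np.log(n0) + n1 * np.log(n1) - n * np.log(n)
    lri = 1 - loglik / loglik0                            # McFadden's R-square
    return wrong, pseudo_r2, wrss, lri

y = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)       # hypothetical data
P_hat = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.45, 0.9, 0.1])
print(fit_measures(y, P_hat))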
BCM as a Nonlinear regression model: Generalized Residuals
E(Y|X = x) = F(x'β)
var(Y|X = x) = F(x'β) [1 − F(x'β)]
Therefore, we can write:
Y = F(X'β) + u
where E(u|X = x) = 0, and var(u|X = x) = F(x'β)[1 − F(x'β)].
Given an estimate β̂, we can obtain the (generalized) residuals: ûᵢ = yᵢ − P̂ᵢ = yᵢ − F(xᵢ'β̂).
2.6. PARTIAL EFFECTS AND AVERAGE PARTIAL EFFECTS IN BCM
Is it reasonable (good econometric practice) to use a linear regression model when the dependent variable is binary? Under what conditions, or for which types of empirical questions?
To answer these questions, we first have to define the concepts of "Partial Effect", "Average Partial Effect", and "Partial Effect at the Average".
In econometrics, typically, we are interested in ceteris paribus effects: how Y changes when a variable X_k changes keeping constant the rest of the variables. This type of ceteris paribus effect is called the Partial Effect of X_k on Y.
Define PE_k(X₀, ε₀) as the Partial Effect given that the initial value of (X, ε) is (X₀, ε₀) and we change X to X₀ + Δ_k, where Δ_k is a vector of zeroes at every position except at position k, where we have a 1. In the general model, we have that:
PE_k(X₀, ε₀) = g(X₀ + Δ_k, β, ε₀) − g(X₀, β, ε₀)
The conditional Average Partial Effect APE_k(X₀) is defined as PE_k(X₀, ε₀) averaged over the distribution of the unobservables ε but conditional on X₀:
APE_k(X₀) = ∫ PE_k(X₀, ε) dF(ε)
The unconditional Average Partial Effect APE_k is defined as PE_k(X₀, ε₀) averaged both over the distribution of the unobservables ε and over the distribution of the observables X:
APE_k = ∫ PE_k(X, ε) dF(ε) dF_X(X)
It is important to distinguish between the Average Partial Effect APE_k and the Partial Effect at the average individual. The latter is:
PE_k(X₀ = E(X), ε₀ = E(ε)) = g(E(X) + Δ_k, β, E(ε)) − g(E(X), β, E(ε))
In a linear regression model (LRM), individuals are assumed to be homogeneous in terms of partial effects: i.e., g(X, β, ε) = X'β + ε, and therefore:
PE_k(X₀, ε₀) = APE_k(X₀) = APE_k = β_k
More precisely, in an LRM (without random coefficients) we can allow for interactions between observable variables such that Partial Effects may vary across individuals according to observable characteristics. However, Partial Effects do not depend on unobservables.
LRMs with random coefficients allow for unobserved heterogeneity in Partial Effects, and therefore in those models the Average Partial Effect is not equal to the Partial Effect at the average.
The BCM is a class of models where the difference between the Average Partial Effect and the Partial Effect at the average appears naturally as a result of the binary nature of the dependent variable.
In a BCM, we have that:
PE_k(X₀, ε₀) = 1{ε₀ ≤ [X₀ + Δ_k]'β} − 1{ε₀ ≤ X₀'β}
where 1{·} is the indicator function.
Partial effects at the individual level depend on the individual's X and ε. This is an important property of BCM. This property derives naturally from the discrete nature of the dependent variable.
In a BCM, the APE are:
APE_k(X₀) = F([X₀ + Δ_k]'β) − F(X₀'β)
The marginal partial effect is similar to the partial effect but when Δ_k represents a marginal change in a continuous variable X_k. In that case:
AMPE_k(X₀) = β_k f(X₀'β)
where f(·) is the PDF of ε.
The AMPE at the average individual is:
AMPE_k(X₀ = E(X)) = β_k f(E(X)'β)
In BCM, Partial Effects vary over individuals and, in general, the APE can be very different from the Partial Effect at the average.
This is a property that distinguishes the BCM from the Linear Regression Model.
When our main interest is to estimate the PE for the average individual, then we can use an LRM for the binary variable Y. For large samples, the estimates will not be very different from the estimate of the same effect from a BCM.
However, most of the time in economics we are interested in the APE and not in the PE for the average individual. If that is the case, using an LRM for a binary Y is a very bad choice because that model imposes the very implausible (even impossible!) restriction that PEs do not depend on the unobservables.
Example: School Attendance of Children from Poor Families.
Suppose that we are interested in the determinants of elementary school attendance of kids (Y) from poor families.
Y = Kids in the family attend (regularly) school
We are interested in evaluating the effects of a public program that tries to encourage school attendance by providing a subsidy that is linked to school attendance, e.g., the PROGRESA program in Mexico since 1997.
We have data on {Y, S, X} where: S is the amount of the subsidy, with S = 0 for families in the control group and S = $M for families in the experimental group. X contains family socioeconomic characteristics.
Example: School Attendance of Children from Poor Families (2)
We estimate the BCM:
Y = 1{ε ≤ αS + X'β}
Let α̂ and β̂ be the estimated parameters, and P̂ᵢ = F(α̂ sᵢ + xᵢ'β̂) the estimated probability of school attendance for family i.
The Partial Effect of receiving the subsidy for individual i is:
PE(xᵢ, εᵢ) = 1{εᵢ ≤ αM + xᵢ'β} − 1{εᵢ ≤ xᵢ'β}
If α ≥ 0, this effect can only be zero or one.
Example: School Attendance of Children from Poor Families (3)
Even if we knew the true values of α and β, we cannot obtain PEᵢ because we cannot estimate εᵢ.
However, we can estimate the average partial effect of the subsidy for a family with characteristics xᵢ, APE(xᵢ):
ÂPE(xᵢ) = F(α̂M + xᵢ'β̂) − F(xᵢ'β̂) ≃ f(xᵢ'β̂) α̂ M
And the estimated increase in the number of kids attending school because of the program:
Δ Kids Attending School = Σᵢ₌₁ⁿ 1{sᵢ > 0} ÂPE(xᵢ)
We could also estimate the "counterfactual" effect of the hypothetical application of the policy to a population of H families:
Δ Kids Attending School = H [ (1/n) Σᵢ₌₁ⁿ ÂPE(xᵢ) ] = H · ÂPE
Example: School Attendance of Children from Poor Families (4)
The effect of the policy for the average family is:
ÂPE(x̄) = F(α̂M + x̄'β̂) − F(x̄'β̂) ≃ f(x̄'β̂) α̂ M
If we use this partial effect for the average household to extrapolate the effect of the policy in the actual experiment, we get:
ÂPE(x̄) [ Σᵢ₌₁ⁿ 1{sᵢ > 0} ]
And if we make this same extrapolation for the hypothetical application of the policy to a population of H families, the predicted effect is:
H · ÂPE(x̄)
Example: School Attendance of Children from Poor Families (5)
In general, ÂPE and ÂPE(x̄) can be quite different. The magnitude of this difference depends on the variance or dispersion of xᵢ'β̂ (i.e., on the level of cross-family heterogeneity in the propensity to send kids to school), and on the magnitude of α̂M.
But even in the hypothetical case that ÂPE and ÂPE(x̄) were similar, we may be interested in estimating APE(x) for different groups of families according to x. For instance, suppose that ÂPE(xᵢ) is very close to zero for almost every family i, but it is very large for families with very low income, who represent only 1% of the population. This information is very useful to target the policy.
2.7. BCM AS A REGRESSION MODEL
A regression model is a statistical model that specifies how the conditional mean E(Y|X) depends on X, i.e., it specifies a function m(X, θ) for E(Y|X):
E(Y|X) = m(X, θ)
This implies that:
Y = m(X, θ) + u
where u is a disturbance or unobservable variable that, by construction, is mean independent of X, i.e., E(u|X) = 0.
When m(X, θ) = X'θ, we have a linear regression model. When m(X, θ) is nonlinear in the parameters, we have a nonlinear regression model, e.g., m(X, θ) = exp{X'θ}, or m(X, θ) = θ₁[X₁^θ₂ + X₁^θ₃]^θ₄.
When Y is binary, we have that:
E(Y|X) = 1·Pr(Y = 1|X) + 0·Pr(Y = 0|X) = Pr(Y = 1|X)
Therefore, a BCM for Pr(Y = 1|X) is also a Regression Model for E(Y|X).
According to the threshold BCM:
E(Y|X) = F(X'β)
And therefore,
Y = F(X'β) + u
where, by construction, u is mean independent of X.
Therefore, in this context, we can justify using a Linear Regression Model (LRM) for the binary dependent variable Y as a first-order (linear) approximation to the function F(X'β) for X around its mean E(X):
F(X'β) ≃ F(E(X)'β) + (X − E(X))'β f(E(X)'β)
Let X = (1, X₁), where 1 represents the constant term and X₁ the rest of the regressors. Then X'β = β₀ + X₁'β₁ and E(X)'β = β₀ + E(X₁)'β₁, and (X − E(X))'β = (X₁ − E(X₁))'β₁. Substituting these expressions into the equation above, we have:
F(X'β) ≃ α₀ + X₁'α₁
where α₀ = F(E(X)'β) − f(E(X)'β) E(X₁)'β₁ and α₁ = f(E(X)'β) β₁.
Note that α₁ = β₁ f(E(X)'β) is the AMPE for the average individual.
Therefore, we can use a Linear Regression Model for the binary variable Y. This type of model is called the Linear Probability Model:
Y = X'α + u
and the slopes α have a clear interpretation as Average Partial Effects for the average individual. OLS estimation of this LRM provides consistent estimates of α.
The main limitation of the Linear Probability Model is that it does not provide any information about the APE for individuals other than the average individual, or about the unconditional APE (which depends on the conditional APE for all the individuals).
This limitation is particularly serious in BCMs where the APEs F([X + Δ_k]'β) − F(X'β) vary significantly over X. In that case, the APEs of a significant group of individuals, and the unconditional APE, can be very different from the APE for the average individual.
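The following Python sketch (a simulated probit with hypothetical coefficients, not an example from the notes) illustrates how the APE and the PE at the average individual diverge as the dispersion of x'β grows:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 100000
beta = np.array([-1.0, 0.8])                    # hypothetical probit coefficients

for sd in (0.5, 2.0, 4.0):                      # dispersion of the regressor
    x1 = rng.normal(1.0, sd, size=n)
    X = np.column_stack([np.ones(n), x1])
    ape = np.mean(beta[1] * norm.pdf(X @ beta))                       # averaged over X
    pe_avg = beta[1] * norm.pdf(np.array([1.0, x1.mean()]) @ beta)    # at the average x
    print(sd, round(ape, 4), round(pe_avg, 4))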
2.8. MISSPECIFICATION OF BCM
Remember that in the linear regression model a necessary and sufficient condition for consistency of the OLS estimator is that E(ε|x) = 0. That is, heteroscedasticity, autocorrelation, and non-normality of the error term do not affect the consistency of the OLS estimator as long as E(ε|x) = 0.
However, in the context of discrete choice models, the consistency of the MLE depends crucially on our assumptions about εᵢ.
If εᵢ is heteroscedastic, or if it has a CDF that is not the one that we have assumed, then the MLE is no longer consistent.
The reason is that our assumption about εᵢ affects not only second and higher moments of yᵢ, but also its conditional mean:
Suppose that the true model is such that εᵢ is iid with CDF F. Then,
True model: yᵢ = F(xᵢ'β) + uᵢ, where E(uᵢ|xᵢ) = 0
Instead, we assume that εᵢ is iid N(0,1) [Probit]. Then,
Estimated Model: yᵢ = Φ(xᵢ'β) + uᵢ*, where uᵢ* = uᵢ + F(xᵢ'β) − Φ(xᵢ'β)
It is clear that, if F ≠ Φ, then E(uᵢ*|x) ≠ 0, and MLE using Φ(xᵢ'β) is inconsistent.
Suppose that the researcher is not particularly interested in the estimates of β but only in the estimated probabilities P(xᵢ). For instance, a car insurance company may be interested only in the probability of accident of an individual with characteristics xᵢ.
In this case, the main issue is the consistency of P̂(xᵢ), not the consistency of β̂. One might think that misspecification of F is not a big issue in this case.
However, that is not true. Misspecification of F(·) can generate important biases both in the estimator of β and in the estimator of P(·).
Horowitz's Monte Carlo experiment:
Suppose that the true probabilities are P(xᵢ) and the researcher estimates a logit model. How close are {P̂_Logit(x)} to the true {P(x)}?
Horowitz (Handbook of Statistics, 1993) performed a Monte Carlo study to answer this question. He considered different cases for P(x):
(1) Homoscedastic probit;
(2) Student-t;
(3) Uniform;
(4) Heteroscedastic logit;
(5) Heteroscedastic probit;
(6) Bimodal distribution of ε.
The main results are:
(a) The errors are small when the true distribution of ε is unimodal and homoscedastic.
(b) The errors can be very large when the true distribution of ε is bimodal or heteroscedastic.
Summary of Horowitz's Monte Carlo study:
True Model                 E_x(P(x) − P̂_Logit(x))    max_x |P(x) − P̂_Logit(x)|
Homosced. and unimodal     0.01                       0.02
Bimodal                    0.05                       0.20
Heteroscedastic            0.10                       0.30
2.9. SPECIFICATION TESTS BASED ON GENERALIZED RESIDUALS
In the LRM, we typically test assumptions on the error term ε by using the residuals ε̂ᵢ = yᵢ − xᵢ'β̂.
In the BCM we cannot obtain residuals for εᵢ, but we can get residuals for the error term uᵢ in the regression-like representation of the BCM:
yᵢ = F(xᵢ'β) + uᵢ
We can get the residuals:
ûᵢ = yᵢ − F(xᵢ'β̂)
and standardized residuals:
ũᵢ = [yᵢ − F(xᵢ'β̂)] / √( F(xᵢ'β̂) (1 − F(xᵢ'β̂)) )
Under the null hypothesis that the model is correctly specified, we have that
[yᵢ − F(xᵢ'β)] / √( F(xᵢ'β) (1 − F(xᵢ'β)) )
should be independent of xᵢ with zero mean.
By testing the independence of the residuals ũᵢ and xᵢ, we test the correct specification of the model.
GENERAL PURPOSE SPECIFICATION TEST
Given the standardized residuals ũᵢ and the estimated CCPs P̂ᵢ = F(xᵢ'β̂), we run the OLS regression:
ũᵢ = γ₀ + γ₁ P̂ᵢ + γ₂ (P̂ᵢ)² + ... + γ_q (P̂ᵢ)^q + eᵢ
Define the statistic LM = n·R², where R² is the R-square coefficient from the previous regression.
Under the null hypothesis (the model is correctly specified), LM is asymptotically distributed as χ²_{q−1}.
TEST OF HETEROSCEDASTICITY IN BCM
Consider the BCM Y = 1{X'β − ε ≥ 0} where:
ε | X ~ N( 0 , [exp(X̃'γ)]² )
and X̃ is the vector X without the constant term.
We are interested in testing the null hypothesis of homoscedasticity, which is equivalent to testing H₀: γ = 0.
A possible approach is to estimate β and γ by MLE. That approach is computationally demanding because the log-likelihood of this model is no longer globally concave in (β, γ).
Instead, we can estimate the standard probit model under the null hypothesis of γ = 0, and use an LM test for the null. The LM statistic is:
LM = [ ∂log L(β̂, γ = 0)/∂(β, γ) ]'  [ Var( ∂log L(β̂, γ = 0)/∂(β, γ) ) ]⁻¹  [ ∂log L(β̂, γ = 0)/∂(β, γ) ]
Under H₀, LM is asymptotically distributed as χ²_{dim(γ)}.
Davidson and MacKinnon (JE, 1984) show that this LM statistic can be obtained as the output of a simple auxiliary regression. LM = n·R², where R² is the R-square coefficient from the following regression:
ũᵢ = xᵢ*' b₁ + zᵢ*' b₂ + eᵢ
where:
ũᵢ = [yᵢ − Φ(xᵢ'β̂)] / √( Φ(xᵢ'β̂)(1 − Φ(xᵢ'β̂)) )
xᵢ* = xᵢ φ(xᵢ'β̂) / √( Φ(xᵢ'β̂)(1 − Φ(xᵢ'β̂)) )
zᵢ* = x̃ᵢ (xᵢ'β̂) φ(xᵢ'β̂) / √( Φ(xᵢ'β̂)(1 − Φ(xᵢ'β̂)) )
2.10. ADAPTIVE (SEMIPARAMETRIC) ESTIMATION OF BCM
The consistency of the ML estimator of Probit or Logit models relies on the correct specification of the probability distribution of the unobservable ε. That is, consistency of the MLE in BC models is not robust to misspecification of the CDF of ε.
This property contrasts with the consistency of OLS in the linear regression model: the OLS estimator is the MLE when ε is normally distributed, but it is also consistent when ε is not normal, and even asymptotically efficient (if ε is homoscedastic and not serially correlated). In econometrics, this type of robust estimator is called an ADAPTIVE ESTIMATOR.
Are there adaptive estimators of the BCM which are robust to different properties of the unobserved error term, such as heteroscedasticity, serial correlation, or the particular functional form of the distribution of the error?
We consider four adaptive estimators of the BCM:
We consider four adaptive estimators of the BCM:
(1) Least Absolute Deviations (LAD) estimator;
(2) Manski’s Maximum Score Estimator;
(3) Horowitz’s Smooth Maximum Score Estimator;
(4) Klein and Spady estimator.
1. Least Absolute Deviations (LAD) estimation
LAD is an estimation method that is adaptive for a very general class of econometric models.
Remember that Least Squares (LS) estimation (linear or nonlinear) is based on the following property of the mean. Let μ ≡ E(Y). Then,
μ = arg min_c E[(Y − c)²]
The LS estimator is based on the sample counterpart of this property of the mean:
μ̂ = arg min_c (1/n) Σᵢ₌₁ⁿ (yᵢ − c)²
We have that μ̂ →p μ.
Least Absolute Deviations (LAD) (3)
Similarly, LAD estimation is based on the following property of the median. If m ≡ median(Y), then
m = arg min_c E(|Y − c|)
The LAD estimator is based on the sample counterpart of this property of the median:
m̂ = arg min_c (1/n) Σᵢ₌₁ⁿ |yᵢ − c|
We have that m̂ →p m.
Least Absolute Deviations (LAD) (4)
Consider the general econometric model:
Y = f(X, β, ε)
where f is a known function; X is a vector of observable explanatory variables; ε is an unobservable variable; and β is a vector of parameters.
The assumptions that define this class of models are:
(A1) the function f is known and monotonic in ε;
(A2) median(ε|X) = 0.
Least Absolute Deviations (LAD) (5)
Under assumptions (A1) and (A2), we have that
median(Y|X) = f(X, β, 0)
Based on this condition, the true value of β satisfies the following condition:
β = arg min_c E( |Y − f(X, c, 0)| )
The LAD estimator is based on the sample counterpart of this property:
β̂_LAD = arg min_β Σᵢ₌₁ⁿ |yᵢ − f(xᵢ, β, 0)|
The LAD estimator minimizes the sum of absolute deviations of yᵢ with respect to its median f(xᵢ, β, 0).
Least Absolute Deviations (LAD) (6)
Under assumptions (A1) and (A2), the LAD estimator is consistent. Therefore, LAD is a general type of semiparametric estimator for nonlinear econometric models.
If the function f is continuously differentiable in β, then the LAD estimator is: (a) root-n consistent; (b) asymptotically normal; (c) it has a simple expression for its asymptotic variance that is easy to estimate; (d) we can use standard gradient optimization methods to compute β̂_LAD.
If the function f is NOT continuous in β, LAD is still consistent but, in general, properties (a) to (d) do not hold.
2. Manski's Maximum Score Estimator
Consider the BCM Y = 1{X'β − ε ≥ 0} where we assume that:
median(ε|X) = 0
That is, ε is median independent of X, and the median is zero.
Other than median(ε|X) = 0, no other assumption is made on the distribution of ε.
If we knew β, a "natural" predictor of Y would be 1{X'β ≥ 0} because:
(a) the support of 1{X'β ≥ 0} is the same as the support of Y: {0, 1};
(b) median(Y|X) = 1{X'β ≥ 0}.
Maximum Score Estimator (MSE) (2)
We have a correct prediction when:
either Y = 1 and 1{X'β ≥ 0} = 1, or Y = 0 and 1{X'β < 0} = 1.
Given a sample {yᵢ, xᵢ : i = 1, 2, ..., n}, consider the following sample criterion function:
S(β) = Σᵢ₌₁ⁿ yᵢ 1{xᵢ'β ≥ 0} + (1 − yᵢ) 1{xᵢ'β < 0}
This criterion function provides the number of correct predictions for a given value of β. We call it the Score function.
Maximum Score Estimator (MSE) (3)
The Maximum Score Estimator (MSE) is the value of β that maximizes the score function:
β̂_MSE = arg max_β S(β)
Under median(ε|X) = 0, the MSE is a consistent estimator of β.
Therefore, the MSE is an estimator that is robust to heteroscedasticity, serial correlation, and to any form of the distribution of ε.
In that sense, the MSE has similar properties as OLS in a linear regression model under the mean independence assumption E(ε|X) = 0.
Equivalence of LAD and MSE
Before we discuss other properties of the MSE, it is interesting to show that, for the BCM, the MSE and the LAD are identical estimators.
Let LAD(β) be the LAD criterion function, and let S(β) be the score function. We now show that LAD(β) = n − S(β), and therefore minimizing LAD(β) is equivalent to maximizing S(β), such that the MSE is the LAD estimator.
Equivalence of LAD and MSE (2)
LAD(β) = Σᵢ₌₁ⁿ |yᵢ − 1{xᵢ'β ≥ 0}|
= Σᵢ₌₁ⁿ 1{yᵢ = 1 and xᵢ'β < 0} + 1{yᵢ = 0 and xᵢ'β ≥ 0}
= Σᵢ₌₁ⁿ yᵢ 1{xᵢ'β < 0} + (1 − yᵢ) 1{xᵢ'β ≥ 0}
= Σᵢ₌₁ⁿ yᵢ [1 − 1{xᵢ'β ≥ 0}] + (1 − yᵢ) [1 − 1{xᵢ'β < 0}]
= n − Σᵢ₌₁ⁿ yᵢ 1{xᵢ'β ≥ 0} + (1 − yᵢ) 1{xᵢ'β < 0}
= n − S(β)
Properties of the MSE
Note that the score function S(β) is discontinuous and not differentiable in β. 1{xᵢ'β ≥ 0} is a step function, and this implies that S(β) is also a step function.
Example. Consider the model Y = 1{α + X ≥ 0}. We have a sample of n = 4 observations: (xᵢ, yᵢ) = {(x₁, 0), (x₂, 0), (x₃, 1), (x₄, 1)}, with 0 < x₁ < x₂ < x₃ < x₄.
The score function is:
S(α) = 1{α + x₁ < 0} + 1{α + x₂ < 0} + 1{α + x₃ ≥ 0} + 1{α + x₄ ≥ 0}
or
S(α) = 2 if α < −x₄
S(α) = 3 if α ∈ [−x₄, −x₃)
S(α) = 4 if α ∈ [−x₃, −x₂)
S(α) = 3 if α ∈ [−x₂, −x₁)
S(α) = 2 if α ≥ −x₁
There is not a single value of α that maximizes S(α) but a whole interval [−x₃, −x₂).
As the sample size increases, the amplitude of this interval gets smaller.
Properties of the MSE
In this case, the discontinuity of S(β) does not affect the consistency of the MSE, but it has several important implications:
(a) We cannot use standard gradient-based methods to search for the MSE.
(b) If the sample size is not large enough, there may not be a unique value of β that maximizes S(β). The maximizer of S(β) can be a whole (compact) set in the space of β.
(c) The MSE is not asymptotically normal. It has a non-standard asymptotic distribution.
(d) The rate of convergence of the MSE to the true β is lower than root-n. It is n^(1/3).
3. Horowitz's Smooth Maximum Score Estimator
Limitations (a) to (d) of the MSE motivate the use of the smooth-MSE proposed by Horowitz.
First, note that the score function S(β) can be written as follows:
S(β) = Σᵢ₌₁ⁿ yᵢ 1{xᵢ'β ≥ 0} + (1 − yᵢ) 1{xᵢ'β < 0}
= Σᵢ₌₁ⁿ yᵢ 1{xᵢ'β ≥ 0} + (1 − yᵢ) [1 − 1{xᵢ'β ≥ 0}]
= Σᵢ₌₁ⁿ (1 − yᵢ) + Σᵢ₌₁ⁿ (2yᵢ − 1) 1{xᵢ'β ≥ 0}
Smooth Maximum Score Estimator (2)
Therefore, maximizing S(β) is equivalent to maximizing Σᵢ₌₁ⁿ (2yᵢ − 1) 1{xᵢ'β ≥ 0}, and:
β̂_MSE = arg max_β Σᵢ₌₁ⁿ (2yᵢ − 1) 1{xᵢ'β ≥ 0}
Limitations (a)-(d) of the MSE are due to the fact that 1{xᵢ'β ≥ 0} is discontinuous in β.
Horowitz proposes to replace 1{xᵢ'β ≥ 0} by a function Φ(xᵢ'β / bₙ), where Φ(·) is the CDF of the standard normal, and bₙ is a bandwidth parameter such that: (1) bₙ → 0 as n → ∞; and (2) n·bₙ → ∞ as n → ∞. That is, bₙ goes to zero but more slowly than 1/n.
Smooth Maximum Score Estimator (3)
The Smooth-MSE is defined as:
β̂_SMSE = arg max_β Σᵢ₌₁ⁿ (2yᵢ − 1) Φ(xᵢ'β / bₙ)
As n → ∞ and bₙ → 0, the function Φ(xᵢ'β / bₙ) converges to 1{xᵢ'β ≥ 0}, and the criterion function converges to the Score function. This implies the consistency of β̂_SMSE.
Under the additional condition that n·bₙ → ∞ as n → ∞, this estimator is asymptotically normal, n^r consistent with r ∈ [2/5, 1/2], and it can be computed using standard gradient search methods because the criterion function is continuously differentiable.
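A rough Python sketch of the smoothed maximum-score criterion on simulated data; the scale normalization ||β|| = 1 and the crude grid search over directions are choices made for this illustration, not part of the method as described above:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n) * (1 + np.abs(X[:, 1]))          # heteroscedastic, median zero
beta_true = np.array([0.5, 1.0])
y = (X @ beta_true - eps >= 0).astype(float)

bn = n ** (-1 / 5)                                        # bandwidth, shrinks with n
best_b, best_val = None, -np.inf
for a in np.linspace(0, 2 * np.pi, 4000):                 # unit-norm candidate directions
    b = np.array([np.cos(a), np.sin(a)])
    val = np.sum((2 * y - 1) * norm.cdf(X @ b / bn))      # smoothed score
    if val > best_val:
        best_b, best_val = b, val

print(best_b, beta_true / np.linalg.norm(beta_true))      # compare directions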
Smooth MSE in STATA
See Blevins, J. R. and S. Khan (2013): "Distribution-Free Estimation of
Heteroskedastic Binary Response Models in Stata," Stata Journal 13, 588–602.
Blevins and Khan have created a command in Stata, dfbr (for distribution-free binary response), that implements the Smooth MSE and other methods for the estimation of BCM with a nonparametric specification of the distribution of ε.
NONPARAMETRIC IDENTIFICATION OF F(ε|X)
Once we have estimated the vector of parameters β using an adaptive method such as the smooth-MSE, we may want to estimate Average Partial Effects (APE) for different individuals in the sample or out of the sample (for different values of xᵢ). As shown above, to estimate APEs for individuals who are not the average individual in the sample (or for some other average or marginal individual) we need to estimate the distribution of ε.
Given β and our assumption that median(ε|X) = 0, is the CDF F(ε|X) nonparametrically identified? Not without further assumptions. More specifically, it is not identified if we only assume median independence between ε and X.
NONPARAMETRIC IDENTIFICATION OF F(ε|X)
Matzkin (Econometrica, 1992). A sufficient condition for the identification of F is:
(a) X'β = Z + X̃'β̃, where ε and Z are independent;
(b) Conditional on X̃, Z has variation over the whole real line;
(c) ε is median independent of X̃ (but we do not need full independence).
Proof: The CCP function P(z, x̃) = Pr(Y = 1 | Z = z, X̃ = x̃) is nonparametrically identified from the data at every (z, x̃). Suppose that β̃ has been identified/estimated (e.g., by the MSE estimator).
Given any x̃₀ and any ε₀ ∈ R, we can define the value z₀ = ε₀ − x̃₀'β̃. Then:
F(ε₀|x̃₀) = Pr(ε ≤ ε₀ | X̃ = x̃₀) = Pr(ε ≤ z₀ + x̃₀'β̃ | X̃ = x̃₀) = P(z₀, x̃₀)
That is, for any (x̃₀, ε₀) we can always define a value z₀ such that the empirical CCP P(z₀, x̃₀) gives us the CDF of ε, F(ε₀|x̃₀).
EFFICIENT SEMIPARAMETRIC ESTIMATION
Consider the BCM Y = 1{ε ≤ X'β} where:
(a) ε is not completely independent of X. Instead, Var(ε|X) = σ²(X'β), i.e., there may be heteroscedasticity;
(b) ε / σ(X'β) is independent of X, with CDF F(·) continuous and strictly increasing.
According to this model:
P(x) = Pr(Y = 1 | X = x) = F( x'β / σ(x'β) )
We define G(x'β) ≡ F( x'β / σ(x'β) ).
EFFICIENT SEMIPARAMETRIC ESTIMATION
Klein and Spady propose a semiparametric maximum likelihood estimator of β and the function G(·).
The log-likelihood function is:
l(β, G) = Σᵢ₌₁ⁿ yᵢ ln G(xᵢ'β) + (1 − yᵢ) ln[1 − G(xᵢ'β)]
And the KS estimator is defined as:
(β̂_KS, Ĝ_KS) = arg max_{β, G} l(β, G)
The difficult issue here is that G is not a finite-dimensional vector of parameters, but a real-valued function, i.e., an infinite-dimensional vector of parameters.
This is not a standard MLE, and both its computation and the derivation of its asymptotic properties are non-standard problems.
Under mild regularity conditions, Klein and Spady show that the estimator is consistent and asymptotically normal. The estimator of β is root-n consistent. Also, it is asymptotically efficient within the class of semiparametric estimators.
The procedure starts with an initial guess of the function G. Let Ĝ₀ be this initial guess. For instance, Ĝ₀ can be Φ, i.e., we postulate a Probit model with homoscedasticity.
Then, at every iteration K ≥ 1 we perform two steps.
Step 1: Estimate β given Ĝ_{K−1}:
β̂_K = arg max_β l(β, Ĝ_{K−1})
This is a standard MLE (or quasi-MLE).
Step 2: Given β̂_K, we obtain a new Ĝ_K using a kernel (Nadaraya-Watson) estimator:
Ĝ_K(z) = [ Σᵢ₌₁ⁿ yᵢ K( (xᵢ'β̂_K − z) / bₙ ) ] / [ Σᵢ₌₁ⁿ K( (xᵢ'β̂_K − z) / bₙ ) ]
where bₙ is a bandwidth parameter. This is a nonparametric estimator of E(Y | X'β̂_K = z), and we know that E(Y | X'β = z) = Pr(Y = 1 | X'β = z) = G(z).
The algorithm iterates until convergence, e.g., until ||β̂_K − β̂_{K−1}|| < 10⁻⁶.
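A compact Python sketch in the spirit of this algorithm (simulated data, a hypothetical bandwidth choice, and no leave-one-out correction): instead of alternating explicitly, it recomputes the Nadaraya-Watson estimate of G at each trial β inside the quasi-likelihood, with the first coefficient normalized to 1 for scale:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -0.5])
y = (X @ beta_true - rng.logistic(size=n) >= 0).astype(float)

bn = n ** (-1 / 5)                                        # bandwidth (hypothetical choice)

def G_hat(z, index, y):                                   # Nadaraya-Watson estimate of G
    w = norm.pdf((index[None, :] - z[:, None]) / bn)
    g = (w @ y) / np.clip(w.sum(axis=1), 1e-12, None)
    return np.clip(g, 1e-4, 1 - 1e-4)

def neg_quasi_loglik(b2):
    index = X @ np.array([1.0, b2[0]])                    # beta_1 normalized to 1
    G = G_hat(index, index, y)
    return -np.sum(y * np.log(G) + (1 - y) * np.log(1 - G))

res = minimize(neg_quasi_loglik, x0=np.array([0.0]), method="Nelder-Mead")
print(res.x)                                              # compare with -0.5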
2.11. BCM WITH CONTINUOUS ENDOGENOUS REGRESSORS
Rivers and Vuong (JE, 1988). Consider the model:
(1)   Y = 1{ X'β + αW + ε > 0 }
(2)   W = Z'δ + u
where ε and u are independent of X and Z, but cov(ε, u) ≠ 0, and therefore ε and W are not independent.
Suppose that (ε, u) are jointly normal. Then, we have that:
ε = λu + ξ
where (a) λ = σ_εu / σ²_u; (b) ξ is normally distributed as N(0, σ²_ε(1 − ρ²)), where ρ is the correlation between ε and u; (c) ξ is independent of u; (d) since ε is independent of X and Z, we have that ξ is independent of X, Z, and u, and therefore it is independent of W.
Then, we can write the probit model:
Y = 1{ X'β + αW + λu + ξ > 0 }
And given that ξ is normally distributed and independent of X, W, and u, we have that:
Pr(Y = 1 | X, W, u) = Φ( X'β + αW + λu )
We do not know u, but we can obtain a consistent estimate of u as the residual û = W − Z'δ̂.
Rivers and Vuong (1988) propose the following procedure:
Step 1. Estimate the regression of W on Z and obtain the residual û;
Step 2. Run a probit for Y on X, W and û.
This is in fact the method in the STATA command "ivprobit".
Using this procedure we obtain consistent estimates of β, α, and λ.
Note that λ ≠ 0 if and only if cov(ε, u) ≠ 0. Therefore, a t-test of H₀: λ = 0 is a test of the endogeneity of W.
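A minimal Python sketch of the two-step procedure on simulated data (not from the notes; the estimated coefficients are identified only up to the scale of the remaining error, so the printed values differ from the structural ones by a scale factor):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 5000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n)
eps = 0.6 * u + 0.8 * rng.normal(size=n)                  # cov(eps, u) != 0
W = Z @ np.array([0.5, 1.0]) + u
X = np.ones((n, 1))                                       # only a constant, for brevity
Y = (-0.2 * X[:, 0] + 1.0 * W + eps > 0).astype(float)

# Step 1: first-stage OLS of W on Z, keep the residual
delta_hat = np.linalg.lstsq(Z, W, rcond=None)[0]
u_hat = W - Z @ delta_hat

# Step 2: probit of Y on (X, W, u_hat) by maximum likelihood
R = np.column_stack([X, W, u_hat])
def negloglik(theta):
    p = np.clip(norm.cdf(R @ theta), 1e-10, 1 - 1e-10)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

res = minimize(negloglik, np.zeros(R.shape[1]), method="BFGS")
print(res.x)      # last coefficient: control-function term, basis of the endogeneity t-test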
Blundell and Powell (Review of Economic Studies, 2004), "Endogeneity in Semiparametric Binary Response Models"
Blundell and Powell (2004) extend the Rivers-Vuong method to models where the distribution of the unobservables, ε and u, is nonparametrically specified.
TBW
3. BCM WITH PANEL DATA
As in linear PD models, we distinguish static and dynamic PD BCMs.
(1) Static models: explanatory variables are strictly exogenous;
(a) Exogenous individual effects: Avery-Hansen-Hotz Quasi-MLE;
(b) Endogenous individual effects, FE methods: (b1) Manski's MSE; (b2) Chamberlain's Conditional Logit.
(c) Endogenous individual effects, RE methods: (c1) Chamberlain Correlated RE; and (c2) Heckman-Singer finite mixture model.
(2) Dynamic models:
(a) FE methods: (a1) Chamberlain's Conditional Logit; (a2) Honoré-Kyriazidou conditional logit.
(b) RE methods: (b1) Heckman-Singer finite mixture model; (b2) Arellano-Carrasco.
3.1. Static Binary Choice Models
Consider the Panel Data BCM:
Yᵢₜ = 1{ Xᵢₜ'β + ηᵢ − uᵢₜ ≥ 0 }
where uᵢₜ is independent of ηᵢ and of {Xᵢ₁, Xᵢ₂, ..., X_iT}, i.e., the regressors are strictly exogenous.
We have panel data with N individuals and T periods, where N is large and T is small. We want to estimate β.
We are concerned about the correlation of Xᵢₜ with the individual effect ηᵢ.
Avery-Hansen-Hotz Pseudo MLE
Suppose that the individual effect ηᵢ and Xᵢₜ are independently distributed. Then, we can use MLE to estimate β. Define the composite error εᵢₜ ≡ uᵢₜ − ηᵢ, so that yᵢₜ = 1 if and only if εᵢₜ ≤ xᵢₜ'β. The conditional log-likelihood function is l(β) = Σᵢ₌₁ᴺ lᵢ(β), where:
lᵢ(β) = ln Pr(yᵢ₁, yᵢ₂, ..., y_iT | xᵢ₁, xᵢ₂, ..., x_iT; β)
= ln Pr( (2yᵢₜ − 1) εᵢₜ ≤ (2yᵢₜ − 1) xᵢₜ'β  for t = 1, 2, ..., T )
Since the ε's are serially correlated (the individual effect ηᵢ is a common component in every period), these probabilities involve T-dimensional integrals. This is computationally costly.
Also, we have to specify the stochastic process of uᵢₜ and estimate the parameters of this process together with β.
Is there a way to avoid this multiple integration problem? Is there an "adaptive estimator" that is consistent regardless of the form of the serial correlation in εᵢₜ?
Avery-Hansen-Hotz (IER, 1983) provide a simple estimator that is robust to serial correlation. They show that a method that estimates β using a standard Probit (or Logit) model that ignores the serial correlation in εᵢₜ is root-N consistent and asymptotically normal.
Consider the pseudo log-likelihood function:
l(β) = Σᵢ₌₁ᴺ Σₜ₌₁ᵀ yᵢₜ ln Φ(xᵢₜ'β) + (1 − yᵢₜ) ln[1 − Φ(xᵢₜ'β)]
And let β̂_AHH be the value of β that maximizes this function. This is the Avery-Hansen-Hotz estimator.
If the distribution of εᵢₜ is normal with zero mean and constant variance (and the stochastic process of εᵢₜ satisfies some standard stationarity conditions), then this estimator is consistent and asymptotically normal, regardless of the serial correlation in εᵢₜ over time (or across individuals).
Why is the Avery-Hansen-Hotz estimator consistent even though it is an MLE based on a misspecified likelihood function? Because the likelihood equations that define the estimator are valid moment conditions regardless of the form of the serial correlation in εᵢₜ.
The likelihood equations are:
(1/N) Σₜ₌₁ᵀ Σᵢ₌₁ᴺ [yᵢₜ − F(xᵢₜ'β)] · xᵢₜ f(xᵢₜ'β) / { F(xᵢₜ'β)[1 − F(xᵢₜ'β)] } = 0
where F and f are the CDF and the PDF of εᵢₜ. As N goes to infinity, these equations converge to:
Σₜ₌₁ᵀ E[ z(xᵢₜ) ( yᵢₜ − F(xᵢₜ'β) ) ] = 0
where z(xᵢₜ) is the vector xᵢₜ f(xᵢₜ'β) / { F(xᵢₜ'β)[1 − F(xᵢₜ'β)] }.
If εᵢₜ and Xᵢₜ are independently distributed, we can show easily that E[ z(xᵢₜ) ( yᵢₜ − F(xᵢₜ'β) ) ] = 0, such that these moment conditions / likelihood equations hold.
Therefore, the AHH estimator can be seen as a GMM estimator based on valid moment conditions.
Note that we can use the same approach to estimate a BCM using Time Series data, Yₜ = 1{Xₜ'β − εₜ ≥ 0}, provided the variables satisfy some stationarity conditions.
The likelihood equations are:
(1/T) Σₜ₌₁ᵀ [yₜ − F(xₜ'β)] · xₜ f(xₜ'β) / { F(xₜ'β)[1 − F(xₜ'β)] } = 0
And under mild stationarity conditions, as T → ∞,
(1/T) Σₜ₌₁ᵀ z(xₜ) [yₜ − F(xₜ'β)] →p E[ z(xₜ) ( yₜ − F(xₜ'β) ) ]
which is equal to 0 because E(Yₜ|Xₜ) = F(Xₜ'β).
Bias of the MLE based on FD and WG transformations
Now, consider the more interesting case where ηᵢ and Xᵢₜ can be correlated.
In a BCM, the transformations of the model in First-Differences or Within-Groups do not eliminate the individual effect ηᵢ:
ΔYᵢₜ = 1{ Xᵢₜ'β + ηᵢ − uᵢₜ ≥ 0 } − 1{ Xᵢₜ₋₁'β + ηᵢ − uᵢₜ₋₁ ≥ 0 } ≠ 1{ ΔXᵢₜ'β − Δuᵢₜ ≥ 0 }
Therefore, an MLE based on the equation ΔYᵢₜ = 1{ ΔXᵢₜ'β − Δuᵢₜ ≥ 0 } provides an inconsistent estimator of β.
We will show later that Manski's Maximum Score Estimator can be used to obtain a consistent estimator of β that is somehow based on a first-difference transformation of the model, but not exactly on the transformation above.
Bias of ML-Dummy Variables Estimator
In the Static Linear PD model, we showed that the LSDV estimator is consistent (for fixed T) and equivalent to the WG estimator.
Unfortunately, that is not the case in the Static (or Dynamic) BCM.
The estimator is defined as:
(β̂, η̂) = arg max_{β, η} Σᵢ₌₁ᴺ lᵢ(β, ηᵢ)
where
lᵢ(β, ηᵢ) = Σₜ₌₁ᵀ yᵢₜ ln F(xᵢₜ'β + ηᵢ) + (1 − yᵢₜ) ln[1 − F(xᵢₜ'β + ηᵢ)]
Bias of ML-Dummy Variables Estimator (2)
The likelihood equations are:
With respect to β:   Σᵢ₌₁ᴺ ∂lᵢ(β̂, η̂ᵢ)/∂β = 0
With respect to ηᵢ:   ∂lᵢ(β̂, η̂ᵢ)/∂ηᵢ = 0   for every i
where
∂lᵢ(β̂, η̂ᵢ)/∂β = Σₜ₌₁ᵀ [yᵢₜ − F(xᵢₜ'β̂ + η̂ᵢ)] · xᵢₜ f(xᵢₜ'β̂ + η̂ᵢ) / { F(xᵢₜ'β̂ + η̂ᵢ)[1 − F(xᵢₜ'β̂ + η̂ᵢ)] }
∂lᵢ(β̂, η̂ᵢ)/∂ηᵢ = Σₜ₌₁ᵀ [yᵢₜ − F(xᵢₜ'β̂ + η̂ᵢ)] · f(xᵢₜ'β̂ + η̂ᵢ) / { F(xᵢₜ'β̂ + η̂ᵢ)[1 − F(xᵢₜ'β̂ + η̂ᵢ)] }
Bias of ML-Dummy Variables Estimator (3)
For instance, for the Logit model, f(xᵢₜ'β̂ + η̂ᵢ) / { F(xᵢₜ'β̂ + η̂ᵢ)[1 − F(xᵢₜ'β̂ + η̂ᵢ)] } = 1, such that the likelihood equations become:
Σᵢ₌₁ᴺ Σₜ₌₁ᵀ xᵢₜ [ yᵢₜ − exp(xᵢₜ'β̂ + η̂ᵢ) / (1 + exp(xᵢₜ'β̂ + η̂ᵢ)) ] = 0
Σₜ₌₁ᵀ [ yᵢₜ − exp(xᵢₜ'β̂ + η̂ᵢ) / (1 + exp(xᵢₜ'β̂ + η̂ᵢ)) ] = 0   for every i
We can use a BHHH method to compute (β̂, η̂). Greene (Econometrics Journal, 2004) has developed a computationally efficient method to calculate this estimator [in the spirit of the Within-Groups transformation, but in a sequential method]. In particular, we do not need to invert any matrix of dimension N + K to compute this estimator.
Bias of ML-Dummy Variables Estimator (4)
Though the computation of a Dummy Variables (Fixed Effects) estimator for the PD-BCM is computationally simple, the estimator of β is inconsistent as N → ∞ when T is fixed. It is only consistent when T also goes to infinity.
This estimator of β does not share the nice properties of the LSDV estimator in linear models.
The reason is that, in this model, as N → ∞ with T fixed,
cov(β̂, η̂ᵢ) ≠ 0
The estimator η̂ᵢ is asymptotically correlated with β̂, such that the asymptotic estimation error in η̂ᵢ contaminates the estimator β̂.
Bias of ML-Dummy Variables Estimator: Example
Consider an example with T = 2, only one explanatory variable xᵢₜ that is the dummy variable for t = 2, and a distribution F(·) that is symmetric around the median = 0:
Yᵢ₁ = 1{ ηᵢ − uᵢ₁ ≥ 0 }
Yᵢ₂ = 1{ β + ηᵢ − uᵢ₂ ≥ 0 }
For the Logit model, the likelihood equations are:
With respect to β:   Σᵢ₌₁ᴺ [ yᵢ₂ − exp(β̂ + η̂ᵢ) / (1 + exp(β̂ + η̂ᵢ)) ] = 0
With respect to ηᵢ:   (yᵢ₁ + yᵢ₂) − exp(η̂ᵢ) / (1 + exp(η̂ᵢ)) − exp(β̂ + η̂ᵢ) / (1 + exp(β̂ + η̂ᵢ)) = 0
Bias of ML-Dummy Variables Estimator: Example
For observations with (yᵢ₁, yᵢ₂) = (0, 0), the likelihood equation for ηᵢ is 0 − F(η̂ᵢ) − F(β̂ + η̂ᵢ) = 0, and this implies that: (a) η̂ᵢ → −∞; and (b) these observations do not contribute to the estimator β̂ because lᵢ(β̂, η̂ᵢ) → 0 for any β̂.
For observations with (yᵢ₁, yᵢ₂) = (1, 1), the likelihood equation is 1 − F(η̂ᵢ) + 1 − F(β̂ + η̂ᵢ) = 0, and this implies that: (a) η̂ᵢ → +∞; and (b) these observations do not contribute to the estimator β̂ because lᵢ(β̂, η̂ᵢ) → 0 for any β̂.
Bias of ML-Dummy Variables Estimator: Example
For observations with (yᵢ₁, yᵢ₂) = (0, 1) or with (yᵢ₁, yᵢ₂) = (1, 0), we have that
1 − F(η̂ᵢ) − F(β̂ + η̂ᵢ) = 0
which implies η̂ᵢ = −β̂/2, such that F(β̂/2) + F(−β̂/2) = 1.
Therefore, the concentrated log-likelihood function is:
l(β) = Σᵢ₌₁ᴺ 1{yᵢ₁ = 0, yᵢ₂ = 1} ln F(β/2) + 1{yᵢ₁ = 1, yᵢ₂ = 0} ln F(−β/2)
Bias of ML-Dummy Variables Estimator: Example
Define p ≡ F(β/2). The concentrated log-likelihood is maximized at:
p̂ = [ Σᵢ₌₁ᴺ 1{yᵢ₁ = 0, yᵢ₂ = 1} ] / [ Σᵢ₌₁ᴺ 1{yᵢ₁ + yᵢ₂ = 1} ]
And the MLE of β is:
β̂ = 2 F⁻¹(p̂) = 2 F⁻¹( [ Σᵢ₌₁ᴺ 1{yᵢ₁ = 0, yᵢ₂ = 1} ] / [ Σᵢ₌₁ᴺ 1{yᵢ₁ + yᵢ₂ = 1} ] )
Bias of ML-Dummy Variables Estimator: Example
Is this a consistent estimator of β? It is clear that:
plim_{N→∞} β̂ = 2 F⁻¹( plim_{N→∞} p̂ ) = 2 F⁻¹( Pr(Yᵢ₁ = 0, Yᵢ₂ = 1) / Pr(Yᵢ₁ + Yᵢ₂ = 1) )
In general, 2 F⁻¹( Pr(Yᵢ₁ = 0, Yᵢ₂ = 1) / Pr(Yᵢ₁ + Yᵢ₂ = 1) ) ≠ β, and this ML-DV estimator β̂ is inconsistent.
Bias of ML-Dummy Variables Estimator: Example
For instance, for the logit model, we can show that p = Pr(Yᵢ₁ = 0, Yᵢ₂ = 1) / Pr(Yᵢ₁ + Yᵢ₂ = 1) does not depend on ηᵢ and:
p = Pr(Yᵢ₁ = 0, Yᵢ₂ = 1) / Pr(Yᵢ₁ + Yᵢ₂ = 1) = exp(β) / (1 + exp(β)) = F(β)
Therefore, for the logit model:
plim_{N→∞} β̂ = 2 F⁻¹(F(β)) = 2β
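A quick Python simulation sketch (not part of the notes) that checks this result: with T = 2 and logistic errors, the closed-form ML dummy-variable estimator converges to 2β rather than β; the normal distribution for the ηᵢ is an arbitrary choice for the illustration:

import numpy as np

rng = np.random.default_rng(6)
N, beta = 500000, 0.7                                     # hypothetical true beta
eta = rng.normal(size=N)                                  # individual effects
y1 = (eta - rng.logistic(size=N) >= 0).astype(int)
y2 = (beta + eta - rng.logistic(size=N) >= 0).astype(int)

switch_01 = np.sum((y1 == 0) & (y2 == 1))
switchers = np.sum(y1 + y2 == 1)
p_hat = switch_01 / switchers
beta_hat = 2 * np.log(p_hat / (1 - p_hat))                # 2 * F^{-1}(p_hat) for the logit
print(beta_hat, 2 * beta)                                 # approximately equal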
Fixed Effects Estimators for Static Panel Data BCM
As in the case of linear panel data models, we distinguish two approaches:
(a) Fixed Effects approach: no assumption on the joint distribution of Xᵢ and ηᵢ.
(b) Random Effects approach: there is a parametric assumption on the joint distribution of Xᵢ and ηᵢ.
We consider two fixed effects estimators:
1. Chamberlain's Conditional Logit model.
2. Manski's MSE applied to Panel Data BCM.
Chamberlain Conditional Logit
Consider the BCM Yᵢₜ = 1{ Xᵢₜ'β + ηᵢ − uᵢₜ ≥ 0 } where uᵢₜ has a logistic distribution. Therefore,
Pr(Yᵢₜ = 1 | Xᵢₜ, ηᵢ) = exp(Xᵢₜ'β + ηᵢ) / [ 1 + exp(Xᵢₜ'β + ηᵢ) ]
And if uᵢₜ is independent over time:
Pr(Yᵢ₁, Yᵢ₂, ..., Y_iT | Xᵢ, ηᵢ) = Πₜ₌₁ᵀ exp( Yᵢₜ (Xᵢₜ'β + ηᵢ) ) / [ 1 + exp(Xᵢₜ'β + ηᵢ) ]
Define the random variable Sᵢ = Σₜ₌₁ᵀ Yᵢₜ, which represents the number of times that the binary event has occurred during the T sample periods.
Chamberlain Conditional Logit (2)
Let Yᵢ = {yᵢ₁, yᵢ₂, ..., y_iT}, and Xᵢ = {xᵢ₁, xᵢ₂, ..., x_iT}. The key result behind the Chamberlain conditional logit estimator is that:
Pr(Yᵢ | Xᵢ, Sᵢ, ηᵢ, β) = Pr(Yᵢ | Xᵢ, Sᵢ, β)
i.e., it does not depend on ηᵢ.
First, by the chain rule, it is clear that Pr(Yᵢ, Sᵢ | Xᵢ, ηᵢ) = Pr(Yᵢ | Xᵢ, Sᵢ, ηᵢ) Pr(Sᵢ | Xᵢ, ηᵢ), and therefore:
Pr(Yᵢ | Xᵢ, Sᵢ, ηᵢ) = Pr(Yᵢ, Sᵢ | Xᵢ, ηᵢ) / Pr(Sᵢ | Xᵢ, ηᵢ) = Pr(Yᵢ | Xᵢ, ηᵢ) / Pr(Sᵢ | Xᵢ, ηᵢ)
Given our logit model and that uᵢₜ is iid over time, the probability Pr(Yᵢ | Xᵢ, ηᵢ) is:
Pr(Yᵢ | Xᵢ, ηᵢ) = Πₜ₌₁ᵀ Pr(yᵢₜ | xᵢₜ, ηᵢ)
= Πₜ₌₁ᵀ exp( yᵢₜ [xᵢₜ'β + ηᵢ] ) / [ 1 + exp(xᵢₜ'β + ηᵢ) ]
= exp( Σₜ₌₁ᵀ yᵢₜ xᵢₜ'β + Sᵢ ηᵢ ) / Πₜ₌₁ᵀ [ 1 + exp(xᵢₜ'β + ηᵢ) ]
To derive the expression for Pr(Sᵢ | Xᵢ, ηᵢ) it is useful to define the set:
H(Sᵢ) = { D = (d₁, d₂, ..., d_T) ∈ {0,1}ᵀ : Σₜ₌₁ᵀ dₜ = Sᵢ }
Using this definition, we can write:
Pr(Sᵢ | Xᵢ, ηᵢ) = Σ_{D ∈ H(Sᵢ)} Pr(D | Xᵢ, ηᵢ)
= Σ_{D ∈ H(Sᵢ)} Πₜ₌₁ᵀ Pr(dₜ | xᵢₜ, ηᵢ)
= Σ_{D ∈ H(Sᵢ)} exp( Σₜ₌₁ᵀ dₜ xᵢₜ'β + Sᵢ ηᵢ ) / Πₜ₌₁ᵀ [ 1 + exp(xᵢₜ'β + ηᵢ) ]
Combining the previous expressions for Pr(Yᵢ | Xᵢ, ηᵢ) and Pr(Sᵢ | Xᵢ, ηᵢ) we have that:
Pr(Yᵢ | Xᵢ, Sᵢ, ηᵢ) = Pr(Yᵢ | Xᵢ, ηᵢ) / Pr(Sᵢ | Xᵢ, ηᵢ) = exp( Σₜ₌₁ᵀ yᵢₜ xᵢₜ'β ) / Σ_{D ∈ H(Sᵢ)} exp( Σₜ₌₁ᵀ dₜ xᵢₜ'β )
which does not depend on ηᵢ. Therefore, Pr(Yᵢ | Xᵢ, Sᵢ, ηᵢ) = Pr(Yᵢ | Xᵢ, Sᵢ).
The conditional log-likelihood function is:
l(β) = Σᵢ₌₁ⁿ log Pr(Yᵢ | Xᵢ, Sᵢ)
Using the expression for Pr(Yᵢ | Xᵢ, Sᵢ) obtained before, we have that
l(β) = Σᵢ₌₁ⁿ log [ exp( Σₜ₌₁ᵀ yᵢₜ xᵢₜ'β ) / Σ_{D ∈ H(Sᵢ)} exp( Σₜ₌₁ᵀ dₜ xᵢₜ'β ) ]
= Σᵢ₌₁ⁿ Σₜ₌₁ᵀ yᵢₜ xᵢₜ'β − Σᵢ₌₁ⁿ log [ Σ_{D ∈ H(Sᵢ)} exp( Σₜ₌₁ᵀ dₜ xᵢₜ'β ) ]
This function is globally concave in β.
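A small Python sketch (not from the notes) of this conditional log-likelihood for a single individual, enumerating the set H(Sᵢ) by brute force (feasible only for small T); the data below are hypothetical:

import numpy as np
from itertools import product

def cond_loglik_i(y, X, beta):
    S = int(y.sum())
    idx = X @ beta                                        # x_it'beta, t = 1, ..., T
    num = float(y @ idx)
    seqs = [np.array(d) for d in product((0, 1), repeat=len(y)) if sum(d) == S]
    denom = np.log(sum(np.exp(d @ idx) for d in seqs))    # sum over H(S_i)
    return num - denom

y = np.array([0.0, 1.0, 1.0])                             # hypothetical history, T = 3
X = np.array([[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]])        # hypothetical regressors
print(cond_loglik_i(y, X, np.array([0.5, -0.3])))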
MSE for Panel Data BCM. Manski (Econometrica, 1987)
Consider the BCM Yᵢₜ = 1{ Xᵢₜ'β + ηᵢ − uᵢₜ ≥ 0 }. The model implies that:
ΔYᵢₜ = 1{ Xᵢₜ'β + ηᵢ − uᵢₜ ≥ 0 } − 1{ Xᵢₜ₋₁'β + ηᵢ − uᵢₜ₋₁ ≥ 0 }
Therefore, conditional on ΔYᵢₜ ≠ 0, we have that:
ΔYᵢₜ = 1 if ΔXᵢₜ'β − Δuᵢₜ > 0
ΔYᵢₜ = −1 if ΔXᵢₜ'β − Δuᵢₜ < 0
If ΔYᵢₜ ≠ 0 then: either (a) Yᵢₜ₋₁ = 0 and Yᵢₜ = 1, and this implies that ΔYᵢₜ = 1 and that ΔXᵢₜ'β − Δuᵢₜ > 0; or (b) Yᵢₜ₋₁ = 1 and Yᵢₜ = 0, and this implies that ΔYᵢₜ = −1 and that ΔXᵢₜ'β − Δuᵢₜ < 0.
Assumption: Conditional on Xᵢₜ, Xᵢₜ₋₁, and ηᵢ, the variables uᵢₜ and uᵢₜ₋₁ have the same probability distribution, with support (−∞, +∞).
It is possible to show that this assumption implies that:
median( Δuᵢₜ | ΔXᵢₜ, ΔYᵢₜ ≠ 0 ) = 0
Therefore, we can apply the MSE to the model:
ΔYᵢₜ = 1 if ΔXᵢₜ'β − Δuᵢₜ > 0, and ΔYᵢₜ = −1 if ΔXᵢₜ'β − Δuᵢₜ < 0
with ΔYᵢₜ ≠ 0.
Given a sample {yᵢₜ, xᵢₜ}, the score function is:
S(β) = Σᵢ₌₁ᴺ Σₜ 1{Δyᵢₜ = 1} 1{ΔXᵢₜ'β > 0} + 1{Δyᵢₜ = −1} 1{ΔXᵢₜ'β < 0}
That is just the number of observations for which we score a correct prediction for the sign of Δyᵢₜ if we use the sign of ΔXᵢₜ'β as a predictor.
The MSE is the value of β that maximizes the score function:
β̂_MSE = arg max_β S(β)
This estimator has the same properties as in the cross-section case: N^(1/3) consistent, asymptotically non-normal, and possibly not uniquely defined in finite samples.
Following Horowitz, we can define a smooth-MSE for this estimator by replacing the discontinuous function 1{ΔXᵢₜ'β > 0} with a continuously differentiable function K_N(ΔXᵢₜ'β) such that K_N(ΔXᵢₜ'β) converges uniformly to 1{ΔXᵢₜ'β > 0} as N goes to infinity.
Correlated Random Effects Static Probit model
Suppose that ηᵢ and Xᵢ have a joint normal distribution. Then:
ηᵢ = Xᵢ₁'λ₁ + Xᵢ₂'λ₂ + ... + X_iT'λ_T + eᵢ = Xᵢ'λ + eᵢ
where eᵢ is normally distributed and independent of Xᵢ.
Substituting this expression into the equation of the BCM, we have:
Yᵢₜ = 1{ Xᵢₜ'β + Xᵢ'λ + eᵢ − uᵢₜ ≥ 0 } = 1{ Xᵢ'πₜ − u*ᵢₜ ≥ 0 }
where πₜ = (λ₁, ..., λₜ₋₁, β + λₜ, λₜ₊₁, ..., λ_T) and u*ᵢₜ = uᵢₜ − eᵢ.
Random Effects Static Probit model (2)
If uᵢₜ is normally distributed (the original model is a Probit model) and independent of Xᵢ, then u*ᵢₜ is also normally distributed and independent of Xᵢₜ. Then, Yᵢₜ = 1{ Xᵢ'πₜ − u*ᵢₜ ≥ 0 } is a standard Probit model, and we can estimate the parameters πₜ using MLE or the Pseudo-MLE of Avery-Hansen-Hotz.
Given these estimates of πₜ and of their variance matrix, we can estimate β and λ using a simple MD estimator. Given that the system of equations relating πₜ to β and λ is linear, the MD estimator has a simple closed-form expression for (β̂, λ̂) in terms of π̂ and V̂ar(π̂).
3.2. Dynamic Binary Choice Models
Chamberlain (1985): Conditional Logit model for autoregressive PD BCM.
Honoré and Kyriazidou (Econometrica, 2000): extension to include also strictly exogenous regressors.
Conditional MLE for Dynamic PD Logit
Consider the dynamic panel data logit model
Yᵢₜ = 1{ γ Yᵢ,ₜ₋₁ + ηᵢ − uᵢₜ > 0 }
where uᵢₜ has a logistic distribution.
In this model Sᵢ = Σₜ₌₁ᵀ yᵢₜ is not a sufficient statistic for ηᵢ. That is, it is not true that:
Pr(Yᵢₜ | Yᵢₜ₋₁, Sᵢ, ηᵢ) = Pr(Yᵢₜ | Yᵢₜ₋₁, Sᵢ)
However, fortunately, there is an alternative way to construct a sufficient statistic for ηᵢ by controlling for (Yᵢ₁, Y_iT, Sᵢ).
Conditional MLE for Dynamic PD Logit (2)
Suppose that T = 4 and let Yᵢ = {yᵢ₁, yᵢ₂, yᵢ₃, yᵢ₄} be the choice history for individual i. We distinguish four sets of choice histories:
A = {y₁, 1, 0, y₄}
B = {y₁, 0, 1, y₄}
C = {y₁, 1, 1, y₄}
D = {y₁, 0, 0, y₄}
Define Sᵢ = 1(Yᵢ ∈ A ∪ B). We will show that:
Pr(Yᵢ | 1(Yᵢ ∈ A ∪ B), ηᵢ, γ) = Pr(Yᵢ | 1(Yᵢ ∈ A ∪ B), γ)
We can construct a (Conditional) likelihood function based on the probabilities Pr(Yᵢ | 1(Yᵢ ∈ A ∪ B), γ), and the corresponding MLE is a consistent estimator of γ.
Conditional MLE for Dynamic PD Logit (3)
First, we obtain Pr(Yᵢ | ηᵢ, A ∪ B). By Bayes' rule we have that:
Pr(Yᵢ | ηᵢ, A ∪ B) = Pr(Yᵢ | ηᵢ) / [ Pr(A | ηᵢ) + Pr(B | ηᵢ) ]
Note that:
Pr(A | ηᵢ) = Pr(y₁ | ηᵢ) Pr(1 | y₁, ηᵢ) Pr(0 | 1, ηᵢ) Pr(y₄ | 0, ηᵢ)
= Pr(y₁ | ηᵢ) · [ exp(γ y₁ + ηᵢ) / (1 + exp(γ y₁ + ηᵢ)) ] · [ 1 / (1 + exp(γ + ηᵢ)) ] · [ exp(y₄ ηᵢ) / (1 + exp(ηᵢ)) ]
And
Pr(B | ηᵢ) = Pr(y₁ | ηᵢ) Pr(0 | y₁, ηᵢ) Pr(1 | 0, ηᵢ) Pr(y₄ | 1, ηᵢ)
= Pr(y₁ | ηᵢ) · [ 1 / (1 + exp(γ y₁ + ηᵢ)) ] · [ exp(ηᵢ) / (1 + exp(ηᵢ)) ] · [ exp(y₄ [γ + ηᵢ]) / (1 + exp(γ + ηᵢ)) ]
Conditional MLE for Dynamic PD Logit (4)
Therefore,
Pr(A | ηᵢ, A ∪ B) = Pr(A | ηᵢ) / [ Pr(A | ηᵢ) + Pr(B | ηᵢ) ] = exp(γ [y₁ − y₄]) / [ 1 + exp(γ [y₁ − y₄]) ]
The CMLE is the value of γ that maximizes the Conditional log-likelihood function:
lC(γ) = Σᵢ 1{yᵢ₂ = 1, yᵢ₃ = 0} ln Λ( γ [y₁ᵢ − y₄ᵢ] ) + 1{yᵢ₂ = 0, yᵢ₃ = 1} ln Λ( −γ [y₁ᵢ − y₄ᵢ] )
where Λ(·) is the logistic function.
Conditional MLE for Dynamic PD Logit (5)
In this simple model, with T = 4 and without exogenous covariates X, it is simple to show that the CMLE of γ is:
γ̂ = log( [ #{1,1,0,0} + #{0,0,1,1} ] / [ #{0,1,0,1} + #{1,0,1,0} ] )
where #{y₁, y₂, y₃, y₄} means the number of individuals in the sample with choice history {y₁, y₂, y₃, y₄}.
Interpretation (intuition).
- If time persistence in yᵢₜ is generated by individual heterogeneity, then for an individual we should have persistence in only one of the two states, either at 0 (if ηᵢ is small) or at 1 (if ηᵢ is large).
- If time persistence in yᵢₜ is generated by true state dependence (γ > 0), then we should have persistence in both states, 0 and 1.
Choice histories {1,1,0,0} and {0,0,1,1} are the only histories that provide evidence of persistence in both states. The larger the sample frequency of these histories, the stronger the evidence of structural state dependence and the larger the estimator of γ. The choice histories {0,1,0,1} and {1,0,1,0} are the only histories that provide evidence of no persistence in either of the two states. The larger the sample frequency of these histories, the smaller the estimator of γ.
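A short Python simulation sketch (not from the notes) of this closed-form CMLE: generate histories from a dynamic logit with fixed effects (the normal distribution for ηᵢ and the initial-period rule are arbitrary choices for the illustration) and compare γ̂ with the true γ:

import numpy as np

rng = np.random.default_rng(7)
N, gamma = 200000, 0.8                                    # hypothetical true gamma
eta = rng.normal(size=N)                                  # individual effects
Y = np.zeros((N, 4), dtype=int)
Y[:, 0] = (eta - rng.logistic(size=N) > 0).astype(int)    # initial period
for t in range(1, 4):
    Y[:, t] = (gamma * Y[:, t - 1] + eta - rng.logistic(size=N) > 0).astype(int)

def count(hist):
    return np.sum(np.all(Y == np.array(hist), axis=1))

num = count([1, 1, 0, 0]) + count([0, 0, 1, 1])
den = count([0, 1, 0, 1]) + count([1, 0, 1, 0])
gamma_hat = np.log(num / den)
print(gamma_hat, gamma)                                   # approximately equal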
Conditional MLE for Dynamic PD Logit (6)
It is possible to extend the previous result to Panel Data with any value of T ≥ 4, to obtain the following expression.
Let Yᵢ = {Yᵢ₁, Yᵢ₂, ..., Y_iT} and sᵢ = Σₜ₌₁ᵀ yᵢₜ. Then,
Pr(Yᵢ | ηᵢ, sᵢ, yᵢ₁, y_iT) = exp( γ Σₜ₌₂ᵀ yᵢₜ yᵢ,ₜ₋₁ ) / Σ_{d ∈ Cᵢ} exp( γ Σₜ₌₂ᵀ dₜ dₜ₋₁ )
where:
Cᵢ = { (d₁, d₂, ..., d_T) ∈ {0,1}ᵀ : Σₜ₌₁ᵀ dₜ = sᵢ, d₁ = yᵢ₁, d_T = y_iT }
Honoré and Kyriazidou (Econometrica, 2000)
Consider the dynamic panel data logit model
Yᵢₜ = 1{ γ Yᵢ,ₜ₋₁ + Xᵢₜ'β + ηᵢ − uᵢₜ > 0 }
where uᵢₜ has a logistic distribution, and Xᵢₜ are strictly exogenous regressors with respect to uᵢₜ.
For T = 4, they show that (sᵢ, yᵢ₁, yᵢ₄) are sufficient statistics for ηᵢ only if we condition on xᵢ₃ = xᵢ₄.
They propose a version of the CMLE that incorporates kernel weights that depend on the distance ||xᵢ₃ − xᵢ₄||.