NEW SEMIPARAMETRIC PAIRWISE DIFFERENCE ESTIMATORS FOR PANEL DATA SAMPLE SELECTION MODELS

María Engracia Rochina-Barrachina(1)

Abstract
In this paper, estimation of the coefficients in a "double-index" panel data sample selection model is considered under the assumption that the selection function depends on the conditional means of some observable variables. We present two methods. The first is a "weighted double pairwise difference estimator" because it is based on the comparison of individuals in time differences. The second is a "single pairwise difference estimator" because only differences over time for each individual are required. The finite sample properties of the estimators are investigated by Monte Carlo experiments. Their advantages are: no distributional assumptions or a parametric selection mechanism are needed, and heteroskedasticity over time is allowed for.
1. INTRODUCTION

* I am grateful to Bo Honoré, Myoung-jae Lee and Frank Windmeijer for useful comments and suggestions. Thanks are also owed to participants at the Econometric Society European Meeting (ESEM), August/September 1999, Santiago de Compostela, Spain. Financial support from the Spanish foundation "Fundación Ramón Areces" is gratefully acknowledged. The usual disclaimer applies.
(1) Department of Economics, Universidad de Valencia and University College London.
In a panel data sample selection model, where both the selection and the regression equation of interest may contain individual effects allowed to be correlated with the observable variables, Wooldridge (1995) proposed a method for correcting for selection bias. For a panel with two time periods, Kyriazidou (1997) proposes an estimator imposing weaker distributional assumptions. Also for two time periods, a more parametric approach that gets rid of given assumptions in the methods above has been developed by Rochina-Barrachina (1999). The last two estimators can easily be generalised to the case of more than two time periods.

The method by Wooldridge (1995), given that it is based on estimation of a model in levels, needs the assumption of a linear projection form of the individual effects in the equation of interest on the leads and lags of the explanatory variables. The other two methods overcome this problem by estimation of a model in differences over time for a given individual. Time differencing for the same individual will eliminate the individual effects from the regression equation. The work of Kyriazidou (1997) is the least parametric of the three methods, in the sense that the distribution of all unobservables is left unspecified and an arbitrary correlation between individual effects and regressors is allowed. The price we pay is in terms of another assumption, the so-called conditional exchangeability assumption for the errors in the model. This assumption allows for individual heteroskedasticity of unknown form but imposes homoskedasticity over time. The advantage of the estimator proposed by Rochina-Barrachina (1999) is that it allows the variance of the errors to vary over time; the assumption that the errors for a given individual are homoskedastic is then relaxed. The
price we pay for this is that we need to assume a trivariate normal distribution of the errors in differences in the main equation jointly with the composed errors in the selection rules for the two time periods we are pair differencing.

According to the results of a Monte Carlo investigation of the finite-sample properties of Wooldridge's (1995) and Kyriazidou's (1997) estimators (Rochina-Barrachina (1996)), we can conclude that important factors of bias or lack of precision in the estimates come from misspecification problems related to the individual effects in the main equation and violations of the conditional exchangeability assumption. Rochina-Barrachina's (1999) estimator gets rid of both factors, as can be seen in the Monte Carlo experiments presented in that work. However, the need to assume a trivariate normal distribution for the errors may question the robustness of the estimator against misspecification of the error distribution. The work in this paper has been developed with the aim of keeping the properties of Rochina-Barrachina's (1999) estimator but allowing for a free joint trivariate distribution.
In this paper, estimation of the coefficients in a "double-index" selectivity bias model is considered under the assumption that the selection correction function depends only on the conditional means of some observable selection variables. We will present two alternative methods. The first one follows the familiar two-step approach proposed by Heckman (1976, 1979) for selection models. The procedure will first estimate consistently and nonparametrically the conditional means of the selection variables. In the second step we will not only take pair differences for the same individual over time (to eliminate the individual effects, as in Kyriazidou (1997) and Rochina-Barrachina (1999)) but also, after this, we will take pairwise differences across individuals to eliminate the sample selection correction term (the idea of pairwise differencing across individuals in a cross section setting appears in Powell (1987) and Ahn and Powell (1993)). On the resulting model after this double differencing we will apply a weighted least squares regression, with decreasing weights for pairs of individuals with larger differences in their "double index" variables, and hence larger differences in the selection correction terms. The alternative method will need just pairwise differences over time for the same individual, but will include three steps. The first one will be identical to the corresponding one in the other method; that is, nonparametrically we will estimate the conditional means of the selection variables. In the second step we will estimate by nonparametric regression the conditional means of pairwise differences in explanatory variables and pairwise differences in dependent variables on the selection variables (the "double index") estimated in the first step. The third step will use these nonparametric regression estimators to write a model in the spirit of the semiparametric regression model of Robinson (1988), which will be estimated by OLS.
The paper is organised as follows. Section 2 describes the model, discusses some related identification issues, and reviews assumptions on the sample selection correction terms in the available difference estimators for panel data sample selection models. Section 3 presents the new estimators. Section 4 reports results of a small Monte Carlo simulation study of their finite sample performance. In Section 5 we show the link between both estimators. Section 6 gives concluding remarks, and the Appendices provide formulae for the asymptotic variance-covariance matrices.
2. THE MODEL AND THE AVAILABLE ESTIMATORS
Our case of study is a panel data sample selection model. In this model we are interested in the estimation of the regression coefficients \beta in the equation

y_{it} = x_{it}\beta + \alpha_i + \varepsilon_{it}; \quad i = 1,\dots,N; \quad t = 1,\dots,T,   (2.1)

d_{it}^{*} = f_t(z_i) - c_i - u_{it}; \quad d_{it} = 1[d_{it}^{*} \geq 0],   (2.2)

where z_i = (z_{i1},\dots,z_{iT}). x_{it} and z_i are vectors of explanatory variables (which may have components in common), \varepsilon_{it} and u_{it} are unobserved disturbances, \alpha_i are individual-specific effects allowed to be correlated with the explanatory variables x_i, and c_i are individual-specific effects uncorrelated with z_i. Whether or not observations for y_{it} are available is denoted by the dummy variable d_{it}.
In (2.2) there is no need to impose any parametric assumption about the form of the selection indicator index f_t(z_i). In fact, by assuming that it depends on all the leads and lags of an F-dimensional vector of conditioning variables z, we allow for an individual effects structure with correlation with the explanatory variables and/or for sample selection indices with a lagged endogenous variable as explanatory variable. This flexibility is convenient because, although the form of this function may not be derived from some underlying behavioural model, the set of conditioning variables which governs the selection probability may be known in advance. Like misspecification of the parametric form of the selection function, misspecification of the parametric form of the index function results in general in inconsistent estimators of the coefficients in the equation of interest, as pointed out by Ahn and Powell (1993).
Time differencing the observational equation (2.1) for those observations which have d_{it} = d_{is} = 1 (s \neq t), we get

y_{it} - y_{is} = (x_{it} - x_{is})\beta + (\varepsilon_{it} - \varepsilon_{is}).   (2.3)
It might be the case that we do not want to specify any selection indicator function, but just to assume that selection depends on a T \times F-vector z_i. In this case, by assuming that (\varepsilon_{it} - \varepsilon_{is}) is mean independent of x_{it}, x_{is}, z_i conditional on d_{it} = d_{is} = 1, the expectation of (\varepsilon_{it} - \varepsilon_{is}) conditional on selection (i.e. d_{it} = d_{is} = 1) is a function of only z_i, so that the expectation of (y_{it} - y_{is}) conditional on selection takes the form

E[y_{it} - y_{is} \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = (x_{it} - x_{is})\beta + E[\varepsilon_{it} - \varepsilon_{is} \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1]
= (x_{it} - x_{is})\beta + \lambda_{ts}(z_i),   (2.4)

and consequently, a selection corrected regression equation for (y_{it} - y_{is}) is given by

y_{it} - y_{is} = (x_{it} - x_{is})\beta + \lambda_{ts}(z_i) + (e_{it} - e_{is}),   (2.5)

where we have taken out from the error term (\varepsilon_{it} - \varepsilon_{is}) in (2.3) its conditional mean E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = \lambda_{ts}(z_i), driven by sample selection. Thus, E[(e_{it} - e_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = 0 by construction, and \lambda_{ts}(\cdot) is an unknown function of the T \times F-vector z_i.
Equation (2.5) provides insight concerning identification. Notice that if some linear combination (x_{it} - x_{is})\gamma of (x_{it} - x_{is}) were equal to any function of z_i, then there would be asymptotically perfect multicollinearity among the variables on the right-hand side of equation (2.5), and \beta could not be estimated from a regression of observed (y_{it} - y_{is}) on (x_{it} - x_{is}) and \lambda_{ts}(\cdot). The reason is that any approximation to the unknown function of z_i, \lambda_{ts}(\cdot), will also be able to approximate the linear combination of (x_{it} - x_{is}), resulting in asymptotic perfect multicollinearity. To guarantee that for any nontrivial linear combination (x_{it} - x_{is})\gamma there is no measurable function \Gamma(z_i) such that (x_{it} - x_{is})\gamma = \Gamma(z_i), we need to impose the following identification assumption:

Assumption 1: E\{d_{it} d_{is} [(x_{it} - x_{is}) - E(x_{it} - x_{is} \mid z_i)]' [(x_{it} - x_{is}) - E(x_{it} - x_{is} \mid z_i)]\} is non-singular, i.e. for any \gamma \neq 0 there is no measurable function \Gamma(z_i) such that (x_{it} - x_{is})\gamma = \Gamma(z_i).

Accordingly, identification of \beta requires the strong exclusion restriction that none of the components of (x_{it}, x_{is}) can be an exact linear combination of components of z_i. This implies that (x_{it}, x_{is}) and z_i cannot have any components in common.
As in sample selection models individual components of the vector z_i typically appear in the vector of regressors x_{it}, x_{is} in the main equation, we are interested in structures for the selection correction component that permit identification under this situation. If we do not want identification to rely on strong exclusion restrictions, we should impose more structure on \lambda_{ts}(z_i) for the stochastic restriction E[(e_{it} - e_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = 0 to identify \beta. In the literature there are different ways to impose this structure for models with sample selection. The restricted form of the selection correction in (2.5) is typically derived through imposition of restrictions on the behaviour of the indicator variables d_{i\tau} (\tau = t, s) given z_i; that is, the indicator variables d_{i\tau} are assumed to depend upon f_\tau(z_i) through the binary response model in (2.2). In what remains of this section we review this literature to understand the contribution of the methods proposed in Section 3. The following classification reflects different degrees of distributional assumptions for the unobservables in the model and whether or not a parametric form is imposed for the index function in the selection equation.

Different structures on the form of the selection correction \lambda_{ts}(z_i).
Case A.

One way of imposing more structure on the form of the selection correction \lambda_{ts}(z_i) is as follows:

\lambda_{ts}(z_i) = E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq f_t(z_i), c_i + u_{is} \leq f_s(z_i)]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq f(z_i, \gamma_t), c_i + u_{is} \leq f(z_i, \gamma_s)]
= \Lambda\{f(z_i, \gamma_t), f(z_i, \gamma_s); F_3[(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i]\}
= \Lambda\{f(z_i, \gamma_t), f(z_i, \gamma_s)\},   (2.6)

where the function \Lambda\{\cdot,\cdot\} is unknown and f(\cdot,\cdot) are scalar single index functions of known parametric form (which can be linear, but not necessarily). The joint conditional distribution function F_3 of the error terms (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i depends only upon the double index \{f(z_i, \gamma_t), f(z_i, \gamma_s)\}. A consequence of ignorance concerning the form of this distribution is that the functional form of \Lambda\{\cdot,\cdot\} is unknown.
The selection correction term \lambda_{ts}(z_i) can be written as in (2.6) when (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) are independent of x_{it}, x_{is}, z_i, or alternatively, when (\varepsilon_{it} - \varepsilon_{is}) is mean independent of x_{it}, x_{is}, z_i conditional on (c_i + u_{it}), (c_i + u_{is}), and (c_i + u_{it}), (c_i + u_{is}) are independent of x_{it}, x_{is}, z_i. The conditional mean independence assumption always holds if [(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is})] is independent of x_{it}, x_{is}, z_i, but we do not require (\varepsilon_{it} - \varepsilon_{is}) to be independent of x_{it}, x_{is}, z_i. Under any of the two alternative sets of assumptions, the expectation of (\varepsilon_{it} - \varepsilon_{is}) conditional on selection (i.e. d_{it} = d_{is} = 1) is a function of only \{f(z_i, \gamma_t), f(z_i, \gamma_s)\}, so that the expectation of (y_{it} - y_{is}) conditional on selection takes the form

E[y_{it} - y_{is} \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = (x_{it} - x_{is})\beta + \Lambda\{f(z_i, \gamma_t), f(z_i, \gamma_s)\}.   (2.7)

The selection corrected regression equation for (y_{it} - y_{is}) is given by

y_{it} - y_{is} = (x_{it} - x_{is})\beta + \Lambda\{f(z_i, \gamma_t), f(z_i, \gamma_s)\} + e_{its},   (2.8)

E[e_{its} \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = 0.
We need the following identification assumption for \beta to be identified in (2.8):

Assumption 2: E\{d_{it} d_{is} [(x_{it} - x_{is}) - E(x_{it} - x_{is} \mid f(z_i, \gamma_t), f(z_i, \gamma_s))]' [(x_{it} - x_{is}) - E(x_{it} - x_{is} \mid f(z_i, \gamma_t), f(z_i, \gamma_s))]\} is non-singular, i.e. for any \gamma \neq 0 there is no measurable function \Gamma(f(z_i, \gamma_t), f(z_i, \gamma_s)) such that (x_{it} - x_{is})\gamma = \Gamma(f(z_i, \gamma_t), f(z_i, \gamma_s)).

Now we incorporate more structure on \lambda_{ts}(z_i) by adding, as extra identifying information, that the distribution of the indicators d_{i\tau} (\tau = t, s) depends on the double index \{f(z_i, \gamma_t), f(z_i, \gamma_s)\}. The double index structure of the selection correction permits identification even when individual components of the conditioning vector z_i appear in the regressors x_{it}, x_{is}.
Case B.

A fully standard parametric approach applied to (2.6) leads to

\lambda_{ts}(z_i) = E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq f_t(z_i), c_i + u_{is} \leq f_s(z_i)]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq f(z_i, \gamma_t), c_i + u_{is} \leq f(z_i, \gamma_s)]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq z_i \gamma_t, c_i + u_{is} \leq z_i \gamma_s]
= \Lambda\{z_i \gamma_t, z_i \gamma_s; F_3[(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i]\}
= \Lambda\{z_i \gamma_t, z_i \gamma_s; \Phi_3[(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i]\},   (2.9)

where the f(\cdot) are scalar aggregators in the selection equation of a linear parametric form, and we have imposed strong stochastic restrictions by specifying the joint conditional distribution function F_3 of the error terms (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i as a trivariate normal distribution function \Phi_3. Under these parametric assumptions, the form of the selection term, to be added as an additional regressor to the differenced equation in (2.3), can be worked out (see Rochina-Barrachina (1999)). Under this fully parametric approach the estimation method developed in Rochina-Barrachina (1999) consists of a two-step estimator. The method eliminates the individual effects from the equation of interest by taking time differences, conditioning on observability of the individual in two time periods. Two correction terms, whose form depends upon the linear scalar aggregator function and the trivariate normal joint distribution function assumed for the unobservables in the model, are worked out. Given consistent first step estimates of these terms, simple least squares in the equation of interest can be used to obtain consistent estimates of \beta in the second step. Because of the linearity assumption for f(\cdot), the estimator under Case B corresponds to the so-called "More parametric new estimator" in Rochina-Barrachina (1999).
Case C.

Relaxing in Case B the parametric form for the index functions f(\cdot,\cdot), we get

\lambda_{ts}(z_i) = E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq f_t(z_i), c_i + u_{is} \leq f_s(z_i)]
= E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, c_i + u_{it} \leq F^{-1}[h_t(z_i)], c_i + u_{is} \leq F^{-1}[h_s(z_i)]]
= \Lambda\{F^{-1}[h_t(z_i)], F^{-1}[h_s(z_i)]; F_3[(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i]\}
= \Lambda\{\Phi^{-1}[h_t(z_i)], \Phi^{-1}[h_s(z_i)]; \Phi_3[(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i]\},   (2.10)

where the selection indicator indices f_\tau(\cdot), \tau = t, s, are unknown and of unrestricted form. We have still imposed, as in Case B, strong stochastic restrictions by specifying the joint conditional distribution of the errors (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i as trivariate normal. The values of these semiparametric indices in the selection equation are recovered by applying the inversion rule f_t(z_i) = \Phi^{-1}[h_t(z_i)] and f_s(z_i) = \Phi^{-1}[h_s(z_i)], where the conditional expectations h_\tau(z_i) = E(d_{i\tau} \mid z_i) for \tau = t, s are replaced with nonparametric estimators \hat{h}_\tau(z_i) = \hat{E}(d_{i\tau} \mid z_i), such as kernel estimators. Given the unrestricted treatment of the functions f_\tau(\cdot) in (2.2), the estimator under Case C corresponds to the three-step estimator called "Less parametric new estimator" in Rochina-Barrachina (1999).
Both for Case B and Case C, although Rochina-Barrachina's (1999) estimators are based upon an independence assumption where [(\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is})]' is independent of x_{it}, x_{is}, z_i with joint normality of the error terms, for Rochina-Barrachina's (1999) methods to work it is sufficient to have: a) marginal normality for (c_i + u_{it}), (c_i + u_{is}) and consequently joint normality of (c_i + u_{it}) and (c_i + u_{is}); b) independence of x_{it}, x_{is}, z_i for (c_i + u_{it}) and (c_i + u_{is}); c) a conditional mean independence assumption of (\varepsilon_{it} - \varepsilon_{is}) from x_{it}, x_{is}, z_i once we condition on (c_i + u_{it}) and (c_i + u_{is}); d) a linear projection of (\varepsilon_{it} - \varepsilon_{is}) on [(c_i + u_{it}), (c_i + u_{is})]. Furthermore, the normality of (c_i + u_{it}) and (c_i + u_{is}) could be relaxed under other distributional assumptions, but it can be difficult to give a closed form for the sample selection correction term as in the normal case. Under a), b), c) and d),

E[(\varepsilon_{it} - \varepsilon_{is}) \mid \xi_{its} = \{(c_i + u_{it}), (c_i + u_{is})\}'] = \psi' \xi_{its},   (2.11)

where \psi = (\psi_{ts}, \psi_{st})' = E^{-1}(\xi_{its} \xi_{its}') E(\xi_{its}(\varepsilon_{it} - \varepsilon_{is})). Then, the selection bias is

E[(\varepsilon_{it} - \varepsilon_{is}) \mid c_i + u_{it} \leq f_t(z_i), c_i + u_{is} \leq f_s(z_i)] = \psi' E(\xi_{its} \mid c_i + u_{it} \leq f_t(z_i), c_i + u_{is} \leq f_s(z_i)),   (2.12)

an expression which can be worked out with the results for a truncated normal distribution in Tallis (1961), and which leads to the same sample selection correction terms as in Rochina-Barrachina (1999) under full joint normality.

Rochina-Barrachina's (1999) estimators (Case B and Case C) technically do not require exclusion restrictions.
Case D.

\lambda_{ts}(z_{it}, z_{is}) = E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i, d_{it} = d_{is} = 1]
= E(\varepsilon_{it} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i, d_{it} = d_{is} = 1) - E(\varepsilon_{is} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i, d_{it} = d_{is} = 1)
= E(\varepsilon_{it} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i, u_{it} \leq z_{it}\gamma - \eta_i, u_{is} \leq z_{is}\gamma - \eta_i)
  - E(\varepsilon_{is} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i, u_{is} \leq z_{is}\gamma - \eta_i, u_{it} \leq z_{it}\gamma - \eta_i)
= \Lambda\{z_{it}\gamma - \eta_i, z_{is}\gamma - \eta_i; F_3[\varepsilon_{it}, u_{it}, u_{is} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i]\}
  - \Lambda\{z_{is}\gamma - \eta_i, z_{it}\gamma - \eta_i; F_3[\varepsilon_{is}, u_{is}, u_{it} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i]\} = 0,   (2.13)

where the equality to zero holds if z_{it}\gamma = z_{is}\gamma and F_3[\varepsilon_{it}, u_{it}, u_{is} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i] = F_3[\varepsilon_{is}, u_{is}, u_{it} \mid x_{it}, x_{is}, z_{it}, z_{is}, \alpha_i, \eta_i]. There are no prior distributional assumptions on the unobserved error components, but they are subject to the joint conditional exchangeability assumption above. The idea of imposing these conditions, under which first differencing for a given individual not only eliminates the individual effects in the main equation but also the sample selection effects, is exploited by the estimator developed by Kyriazidou (1997). Conditioning on a given individual, the estimation method is developed independently of the individual effects in the selection equation. For this reason we do not need to explicitly consider, parametrically or non-parametrically, the correlation between the individual effects in that equation and the explanatory variables.

In Kyriazidou's (1997) model, identification of \beta requires E[(x_t - x_s)'(x_t - x_s) d_t d_s \mid (z_t - z_s)\gamma = 0] to be finite and non-singular. Given that we require support of (z_t - z_s)\gamma at zero, nonsingularity requires an exclusion restriction on the set of regressors, namely that at least one of the variables in z_{it} is not contained in x_{it}.
Summarising, Rochina-Barrachina's (1999) approach imposes strong stochastic restrictions by specifying the joint conditional distribution of the error terms (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) as trivariate normal. Under this assumption, "sample selectivity regressors" that asymptotically purge the equation of interest of its selectivity bias can be computed, and the corrected model can be estimated by OLS on the selected subsample of individuals observed in the two time periods. However, if the joint distribution of the error terms is misspecified, then the estimator of \beta will in general be inconsistent. The semiparametric method developed by Kyriazidou (1997) relaxes the assumption of a known parametric form of the joint distribution but imposes a parametric form for the index function f_t(\cdot) and the so-called "joint conditional exchangeability" assumption for the time varying errors in the model. The two semiparametric methods for panel data sample selection models proposed in this paper will avoid the mentioned limitations of the available methods.
3. THE PROPOSED ESTIMATORS
3.1. Weighted Double Pairwise Difference Estimator (WDPDE)
We assume here that the conditional mean of the differenced error term (\varepsilon_{it} - \varepsilon_{is}) = (y_{it} - y_{is}) - (x_{it} - x_{is})\beta in (2.3) depends on x_{it}, x_{is}, z_i only through h_t(z_i), h_s(z_i), where the two indices h_t(z_i), h_s(z_i) are probabilities defined as

h_t(z_i) = \Pr(d_{it} = 1 \mid z_i) = E(d_{it} \mid z_i) = E(d_{it} \mid x_{it}, x_{is}, z_i),
h_s(z_i) = \Pr(d_{is} = 1 \mid z_i) = E(d_{is} \mid z_i) = E(d_{is} \mid x_{it}, x_{is}, z_i).   (3.1)
In this case the differenced error term of the main equation satisfies the mean double index restriction

E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = E[(\varepsilon_{it} - \varepsilon_{is}) \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1] = \theta[h_t(z_i), h_s(z_i)] a.s.   (3.2)

According to this assumption, we need the sample selection correction term, to be included in (2.3), to be a continuous function of the probabilities in (3.1). The regressors x_{it}, x_{is} and z_i should not enter separately into the correction term. Although it is not explicit in (3.2), other than in the case of independence between the error terms (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) and the regressors, it is unlikely that the assumption holds. Precisely, the assumption of a known parametric form z_i\gamma_\tau, \tau = t, s, for the indices in the selection equation can be relaxed in this approach because we assume this independence. Under the index restriction in (3.2) we consider estimation of the parameter vector \beta of a "double-index, partially linear" model of the form

y_{it} - y_{is} = (x_{it} - x_{is})\beta + \theta[h_t(z_i), h_s(z_i)] + (e_{it} - e_{is}),   (3.3)

where \theta(\cdot,\cdot) is an unknown, smooth function of two scalar, unobservable "indices" h_t(z_i), h_s(z_i). It is derived from (3.2) that the error term (e_{it} - e_{is}) has by construction conditional mean zero,

E[(e_{it} - e_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = E[(e_{it} - e_{is}) \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1] = 0 a.s.   (3.4)
We need some method for estimation of the unobservable conditional expectation terms in (3.1). A natural way is the use of the nonparametric kernel method

\hat{h}_t(z_i) = \frac{\sum_{l=1}^{N} K_{il} d_{lt}}{\sum_{l=1}^{N} K_{il}}, \quad \hat{h}_s(z_i) = \frac{\sum_{l=1}^{N} K_{il} d_{ls}}{\sum_{l=1}^{N} K_{il}}, \quad K_{il} \equiv K\left(\frac{z_i - z_l}{g_{1N}}\right).   (3.5)
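As an illustration of (3.5), the following minimal sketch computes the estimated probabilities with a product Gaussian kernel; the kernel choice, the helper name kernel_probabilities and the leave-one-out flag (used later, in Section 4) are our own assumptions, not the paper's implementation, which relies on the higher order kernels of Bierens (1987).

```python
import numpy as np

def kernel_probabilities(z, d, g, leave_one_out=False):
    """Sketch of (3.5): h_t(z_i) = sum_l K_il d_lt / sum_l K_il.

    z : (N, p) conditioning variables z_i (all leads and lags stacked)
    d : (N,) selection indicators d_lt for the period of interest
    g : first-step bandwidth g_{1N}
    """
    u = (z[:, None, :] - z[None, :, :]) / g       # (z_i - z_l) / g_{1N}
    K = np.exp(-0.5 * (u ** 2).sum(axis=2))       # product Gaussian K_il
    if leave_one_out:
        np.fill_diagonal(K, 0.0)                  # summations read l != i
    return (K @ d) / np.maximum(K.sum(axis=1), 1e-300)
```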
It is interesting to see what happens if we base estimation just on (3.2), (3.3) and (3.4), that is, if we develop an estimator that relies just on differences over time for a given individual. The result is that even under the new set-up, that is, conditioning on probabilities, we cannot avoid the "exchangeability" assumption in Kyriazidou's (1997) method. To see this, decompose the conditional mean of the differenced error in (3.2) in two terms:

E[(\varepsilon_{it} - \varepsilon_{is}) \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] = E[\varepsilon_{it} \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1] - E[\varepsilon_{is} \mid x_{it}, x_{is}, z_i, d_{it} = d_{is} = 1]
= \theta\{h_t(z_i), h_s(z_i); F[\varepsilon_{it}, (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i]\} - \theta\{h_s(z_i), h_t(z_i); F[\varepsilon_{is}, (c_i + u_{is}), (c_i + u_{it}) \mid x_{it}, x_{is}, z_i]\}
= \theta_{its} - \theta_{ist},   (3.6)

where for \theta_{its} = \theta_{ist} we need h_t(z_i) = h_s(z_i) and the "conditional exchangeability" assumption

F[\varepsilon_{it}, \varepsilon_{is}, (c_i + u_{it}), (c_i + u_{is}) \mid x_{it}, x_{is}, z_i] \equiv F[\varepsilon_{is}, \varepsilon_{it}, (c_i + u_{is}), (c_i + u_{it}) \mid x_{it}, x_{is}, z_i].   (3.7)

It is important to notice that this "conditional exchangeability" assumption implies for the first step estimator the conditional stationarity assumption

F_{(c_i + u_{it}) \mid x_{it}, x_{is}, z_i} \equiv F_{(c_i + u_{is}) \mid x_{it}, x_{is}, z_i}.   (3.8)

Estimation methods compatible with this condition are the conditional maximum score estimator (Manski (1987)), the conditional smoothed maximum score estimator (Kyriazidou (1994); Charlier, Melenberg, and van Soest (1995)), and the conditional maximum likelihood estimator (Chamberlain (1980)). All these methods are independent of the individual fixed effects in a structural sample selection equation, reason for which (3.8) can be rewritten as

F_{u_{it} \mid x_{it}, x_{is}, z_i, c_i} \equiv F_{u_{is} \mid x_{it}, x_{is}, z_i, c_i} \iff F_{u_{it} \mid x_{it}, x_{is}, z_{it}, z_{is}, \eta_i} \equiv F_{u_{is} \mid x_{it}, x_{is}, z_{it}, z_{is}, \eta_i}.   (3.9)

The use of these methods implies a linearity assumption for the index in the selection rule, according to which f_t(z_i) in (2.2) is assumed to be equal to z_{it}\gamma - \eta_i + c_i. According to Ahn and Powell (1993), if the latent regression function is linear, conditioning on probabilities is equivalent to conditioning on z_i\gamma_\tau, \tau = t, s. Given the known parametric form of the selection indices we do not need now to assume independence between the error terms and the regressors. Anticipating this result, we kept the regressors in the conditioning set of (3.6). Under the linearity assumption, identification will require some component of z to be excluded from x. On the contrary, if the true latent regression function is non-linear in z, we have identification even without exclusion restrictions, because these non-linear terms are implicitly excluded from the regression function of interest.
We ended up then with the method developed by Kyriazidou (1997), where (3.6) is now rewritten using as indices not the probabilities but the linear indices z_{it}\gamma, z_{is}\gamma, and both in (3.6) and (3.7) x_{it}, x_{is}, z_{it}, z_{is} appear in the conditioning set in place of x_{it}, x_{is}, z_i. In this setting it is necessary to assume that a root-n-consistent estimator \hat{\gamma} of the true \gamma is available, which is not going to be needed in our approach.

In sample selection models with cross section data, pairs of observations are constructed across individuals. Up to date, in panel data sample selection models they are constructed, not across individuals, but over time for the same individual (Kyriazidou (1997)). In our approach the pairs of observations will be constructed across individuals in differences over time. The motivation of the method is both to eliminate the individual effects and to get rid of sample selection problems. The drawback of Kyriazidou's (1997) estimator was given by the fact that elimination of the sample selection effects needed the so-called "joint conditional exchangeability assumption". In our method, given a pair of observations \{[(y_{it} - y_{is}), (x_{it} - x_{is})], [(y_{jt} - y_{js}), (x_{jt} - x_{js})]\} with d_{it} = d_{is} = 1, d_{jt} = d_{js} = 1 and characterised by the vector h_{its} \equiv (h_{it}, h_{is}) = (h_{jt}, h_{js}) \equiv h_{jts},

[(y_{it} - y_{is}) - (y_{jt} - y_{js})] = [(x_{it} - x_{is}) - (x_{jt} - x_{js})]\beta
+ \{\theta[E(d_t \mid z_i), E(d_s \mid z_i)] - \theta[E(d_t \mid z_j), E(d_s \mid z_j)]\} + [(e_{it} - e_{is}) - (e_{jt} - e_{js})]
= [(x_{it} - x_{is}) - (x_{jt} - x_{js})]\beta + [(e_{it} - e_{is}) - (e_{jt} - e_{js})],   (3.10)
where we have assumed

E[[(y_{it} - y_{is}) - (y_{jt} - y_{js})] - [(x_{it} - x_{is}) - (x_{jt} - x_{js})]\beta \mid (h_t(z_i), h_s(z_i)) = (h_t(z_j), h_s(z_j)), d_{it} = d_{is} = 1, d_{jt} = d_{js} = 1]
= \{\theta[E(d_t \mid z_i), E(d_s \mid z_i)] - \theta[E(d_t \mid z_j), E(d_s \mid z_j)]\},   (3.11)

and by construction

E[(e_{it} - e_{is}) - (e_{jt} - e_{js}) \mid (h_t(z_i), h_s(z_i)) = (h_t(z_j), h_s(z_j)), d_{it} = d_{is} = 1, d_{jt} = d_{js} = 1] = 0.   (3.12)
How close the vectors of conditional means are will be weighted by the bivariate kernel weights

\hat{\omega}_{ijts} \equiv \frac{1}{g_{2N}^2} k\left(\frac{\hat{h}_{its} - \hat{h}_{jts}}{g_{2N}}\right) d_{it} d_{is} d_{jt} d_{js}.   (3.13)

The estimator will be of the form

\hat{\beta} = [\hat{S}_{xx}]^{-1} \hat{S}_{xy},

\hat{S}_{xx} \equiv \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\omega}_{ijts} [(x_{it} - x_{is}) - (x_{jt} - x_{js})]' [(x_{it} - x_{is}) - (x_{jt} - x_{js})]   (3.14)

and

\hat{S}_{xy} \equiv \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\omega}_{ijts} [(x_{it} - x_{is}) - (x_{jt} - x_{js})]' [(y_{it} - y_{is}) - (y_{jt} - y_{js})].

Then our WDPDE will be defined with a closed form solution that comes from a weighted least squares regression of the distinct differences (y_{it} - y_{is}) - (y_{jt} - y_{js}) in dependent variables on the distinct differences (x_{it} - x_{is}) - (x_{jt} - x_{js}) in regressors, using \hat{\omega}_{ijts} as bivariate kernel weights. We only have to include pairs of observations for individuals observed in two time periods, and we have to exclude pairs of individuals for which h_{its} \neq h_{jts}.
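A minimal numerical sketch of (3.13)-(3.14) follows, assuming a bivariate Gaussian second-step kernel; the function name wdpde and the array layout are ours, and the loop over pairs is written for clarity rather than speed.

```python
import numpy as np
from itertools import combinations

def wdpde(dy, dx, h_ts, sel, g2):
    """Weighted double pairwise difference estimator, (3.13)-(3.14).

    dy   : (N,) time differences y_it - y_is
    dx   : (N, k) time differences x_it - x_is
    h_ts : (N, 2) estimated indices (h_t(z_i), h_s(z_i))
    sel  : (N,) 1 if d_it = d_is = 1, else 0
    g2   : second-step bandwidth g_{2N}
    """
    k_dim = dx.shape[1]
    Sxx = np.zeros((k_dim, k_dim))
    Sxy = np.zeros(k_dim)
    for i, j in combinations(range(len(dy)), 2):
        if not (sel[i] and sel[j]):
            continue                               # d_it d_is d_jt d_js = 0
        u = (h_ts[i] - h_ts[j]) / g2
        w = np.exp(-0.5 * u @ u) / g2 ** 2         # weight omega_ijts, (3.13)
        ddx = dx[i] - dx[j]                        # double difference in x
        Sxx += w * np.outer(ddx, ddx)
        Sxy += w * ddx * (dy[i] - dy[j])
    return np.linalg.solve(Sxx, Sxy)               # beta_hat = Sxx^{-1} Sxy
```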
The advantages of this estimator are the following. No distributional assumptions for the error terms are needed, compared with the estimators in Rochina-Barrachina (1999) or Wooldridge (1995), and no "time-reversibility" or "conditional exchangeability" assumption is needed, compared with Kyriazidou (1997). We do not need conditions for a given individual over time to eliminate the selection terms, but conditions among individuals in time differences. For comparability with Kyriazidou's (1997) notation we can write

F[(\varepsilon_{it} - \varepsilon_{is}), (\varepsilon_{jt} - \varepsilon_{js}), (c_i + u_{it}), (c_i + u_{is}), (c_j + u_{jt}), (c_j + u_{js}) \mid h_t(z_i), h_s(z_i), h_t(z_j), h_s(z_j)]
\equiv F[(\varepsilon_{jt} - \varepsilon_{js}), (\varepsilon_{it} - \varepsilon_{is}), (c_j + u_{jt}), (c_j + u_{js}), (c_i + u_{it}), (c_i + u_{is}) \mid h_t(z_j), h_s(z_j), h_t(z_i), h_s(z_i)].   (3.15)

We require (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) to be i.i.d. across individuals and independent of the individual-specific vector x_{it}, x_{is}, z_i. In other words, we cannot allow for the functional form of F or \theta to vary across individuals. This is crucial to our method for eliminating the sample selection effect. It is not required that (\varepsilon_{it} - \varepsilon_{is}), (c_i + u_{it}), (c_i + u_{is}) be i.i.d. across time for the same individual. The functional form of F or \theta can vary across time.
3.2. Single Pairwise Difference Estimator (SPDE)
We generalise Robinson (1988) to the case of panel data sample selection models. In a model like the one in (3.3),

y_{it} - y_{is} = (x_{it} - x_{is})\beta + \theta[h_t(z_i), h_s(z_i)] + (e_{it} - e_{is}),   (3.16)

we have already eliminated the individual effects in the main regression by taking time differences for a given individual. First, we can estimate the two indices h_t(z_i), h_s(z_i) that correspond to the probabilities defined in (3.1) with the same nonparametric kernel estimator of (3.5). Second, we take expectations conditional on the probability indices and observability in the two time periods to get

E(y_{it} - y_{is} \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1) = E(x_{it} - x_{is} \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)\beta + \theta[h_t(z_i), h_s(z_i)].   (3.17)

To get rid of the selection bias in (3.16) we subtract from (3.16) its conditional expectation in (3.17), and then we get the "centred" equation

(y_{it} - y_{is}) - E(y_{it} - y_{is} \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)
= \{(x_{it} - x_{is}) - E(x_{it} - x_{is} \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)\}\beta + (e_{it} - e_{is}).   (3.18)

In the second step we insert in (3.18) the nonparametric regression kernel estimators of E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) and E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1). Specifically, an estimated value of those conditional means can be constructed by fitting a kernel regression. Using the same kernel as in (3.5) above, the estimated conditional means are of the form

E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) = \frac{\sum_{j \neq i} \hat{\omega}_{ijts} (y_{jt} - y_{js})}{\sum_{j \neq i} \hat{\omega}_{ijts}},

E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) = \frac{\sum_{j \neq i} \hat{\omega}_{ijts} (x_{jt} - x_{js})}{\sum_{j \neq i} \hat{\omega}_{ijts}},   (3.19)

\hat{\omega}_{ijts} \equiv \frac{1}{g_{2N}^2} k\left(\frac{\hat{h}_{its} - \hat{h}_{jts}}{g_{2N}}\right) d_{it} d_{is} d_{jt} d_{js}.

Finally, in the third step, we apply least squares regression of the differences (y_{it} - y_{is}) - E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) on the differences in regressors (x_{it} - x_{is}) - E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) to get

\hat{\beta} = [\hat{S}_{xx}]^{-1} \hat{S}_{xy},

\hat{S}_{xx} \equiv \sum_{i=1}^{N} d_{it} d_{is} \{(x_{it} - x_{is}) - E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)\}' \{(x_{it} - x_{is}) - E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)\}   (3.20)

and

\hat{S}_{xy} \equiv \sum_{i=1}^{N} d_{it} d_{is} \{(x_{it} - x_{is}) - E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)\}' \{(y_{it} - y_{is}) - E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)\}.
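The three steps can be condensed into a short sketch, again with an assumed Gaussian weight and a hypothetical helper name; it partials out the kernel regression fits in (3.19) and runs the OLS step (3.20).

```python
import numpy as np

def spde(dy, dx, h_ts, sel, g2):
    """Single pairwise difference estimator, steps (3.19)-(3.20)."""
    u = (h_ts[:, None, :] - h_ts[None, :, :]) / g2
    w = np.exp(-0.5 * (u ** 2).sum(axis=2)) / g2 ** 2   # omega_ijts
    w *= np.outer(sel, sel)                              # selected pairs only
    np.fill_diagonal(w, 0.0)                             # sums run over j != i
    denom = np.maximum(w.sum(axis=1), 1e-300)
    Ey = (w @ dy) / denom                                # E_N(dy | h_t, h_s)
    Ex = (w @ dx) / denom[:, None]                       # E_N(dx | h_t, h_s)
    ry = (dy - Ey) * sel                                 # centred outcomes
    rx = (dx - Ex) * sel[:, None]                        # centred regressors
    return np.linalg.solve(rx.T @ rx, rx.T @ ry)         # OLS, (3.20)
```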
4. MONTE CARLO RESULTS
In this section we report the results of a small simulation study to illustrate the finite-sample performance of the proposed estimators. Each Monte Carlo experiment is concerned with estimating the scalar parameter \beta in the model

y_{it} = d_{it}[x_{it}\beta + \alpha_i + \varepsilon_{it}]; \quad d_{it}^{*} = z_{1it}\gamma_1 + z_{2it}\gamma_2 - \eta_i - u_{it}; \quad d_{it} = 1[d_{it}^{*} \geq 0]; \quad i = 1,\dots,N; \quad t = 1, 2,   (4.1)

where y_{it} is observed if d_{it} = 1. The true value of \beta, \gamma_1 and \gamma_2 is 1; z_{1it} and z_{2it} follow a N(0,1); x_{it} is equal to the variable z_{2it}. We use normalised and central \chi^2 distributions with 2 degrees of freedom for the random draws: \tilde{\chi}^2_2(0,1) denotes such a draw, recentred to mean 0 and rescaled to variance 1. The individual effects are generated as

\alpha_i = -[(z_{1i1} + z_{1i2})/2 + (z_{2i1} + z_{2i2})/2 + (z_{1i1} z_{2i1} + z_{1i2} z_{2i2})/2 + \tilde{\chi}^2_2(0,1) + 0.07] and
\eta_i = (x_{i1} + x_{i2})/2 + 2\tilde{\chi}^2_2(0,1) + 1.

The particular design of \alpha_i and \eta_i is driven by the fact that we do not need to restrict their correlation with the explanatory variables to be linear. f_t(z_i) in (2.2) coincides in our experiment design with z_{1it}\gamma_1 + z_{2it}\gamma_2 + [(z_{1i1} + z_{1i2})/2 + (z_{2i1} + z_{2i2})/2 + (z_{1i1} z_{2i1} + z_{1i2} z_{2i2})/2]. The time varying errors are u_{it} = \tilde{\chi}^2_2(0,1) and \varepsilon_{it} = 0.8 u_{it} + 0.6 \tilde{\chi}^2_2(0,1). The errors in the main equation are generated as a linear function of the errors in the selection equation, which guarantees the existence of non-random selection into the sample. Our estimators are distribution-free methods and therefore they are robust to any distributional assumption.
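For replication purposes, one draw from the design in (4.1), as reconstructed above, could look as follows; the helper name and the exact normalisation of the chi-squared draws are our assumptions.

```python
import numpy as np

def simulate_panel(N, rng):
    """One Monte Carlo sample from the design in (4.1), T = 2."""
    chi2 = lambda size: (rng.chisquare(2, size) - 2.0) / 2.0  # mean 0, var 1
    z1 = rng.standard_normal((N, 2))            # z_{1it}, t = 1, 2
    z2 = rng.standard_normal((N, 2))            # z_{2it}; x_it = z_{2it}
    x = z2
    alpha = -((z1[:, 0] + z1[:, 1]) / 2 + (z2[:, 0] + z2[:, 1]) / 2
              + (z1[:, 0] * z2[:, 0] + z1[:, 1] * z2[:, 1]) / 2
              + chi2(N) + 0.07)                 # individual effect alpha_i
    eta = (x[:, 0] + x[:, 1]) / 2 + 2 * chi2(N) + 1   # selection effect eta_i
    u = chi2((N, 2))                            # u_it
    eps = 0.8 * u + 0.6 * chi2((N, 2))          # eps_it, non-random selection
    d = (z1 + z2 - eta[:, None] - u >= 0).astype(float)  # gamma_1=gamma_2=1
    y = d * (x + alpha[:, None] + eps)          # beta = 1; y observed if d=1
    return y, x, z1, z2, d
```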
The results with 100 replications and different sample sizes are presented in Table 1 (WDPDE) and Table 2 (SPDE). It is fair to say that we will probably need bigger sample sizes than the ones included in the experiments to exploit the properties of these estimators. The tables report the estimated mean bias for the estimators and the small sample standard errors (SE); as not all the moments of the estimators may exist in finite samples, some measures based on quantiles, such as the median bias and the median absolute deviation (MAD), are also reported. In Panel A we report the finite sample properties of the estimator that ignores sample selection. The purpose in presenting these results is to make explicit the importance of the sample selection problem in our experiment design. In Table 1, this estimator is obtained by applying least squares to the model in double differences, where correction for sample selection has been ignored, for the sample of individuals who are observed in both time periods, i.e. those that have d_{i1} = d_{i2} = 1. In Table 2, it is obtained by applying least squares to the model in single differences over time for a given individual observed in two time periods.

In Panels B and C we implement second (R=1), fourth (R=3), and sixth (R=5) order bias reducing kernels of Bierens (1987). They correspond to a normal, to a mixture of two normals and to a mixture of three normals, respectively. The bandwidth sequence for the first step is(2) g_N = g \cdot N^{-1/[2(R+1)+2+T \cdot f]}, where T = 2 is the number of time periods and f = 2 is the per-period dimension of z_i. The first step probabilities h_1(z_i) and h_2(z_i) are estimated by leave-one-out kernel estimators (this is theoretically convenient), constructed as in (3.5) but without z_i being used in estimating \hat{h}_\tau(z_i); the summations in (3.5) should then read l \neq i. The bandwidth sequence for the weights in the second step of the WDPDE and the SPDE is g_N = g \cdot N^{-1/[2(R+1)+2+q]}, where q = 2 is the dimension of the vectors h_{ts}. The constant part of the bandwidth was chosen equal to 1, 0.5 or 3 in both steps. There was no serious attempt at optimal choice.

(2) Following the best uniform consistency rate in Bierens (1987) for multivariate kernels. If we were focused on convergence in distribution, the optimal rate would have been obtained by setting g_N = g \cdot N^{-1/[2(R+1)+T \cdot f]}.
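In code, the two bandwidth sequences amount to one line each; note that the exponents below follow our reconstruction of the garbled source and should be treated as an assumption.

```python
def bandwidth_first_step(g, N, R, T=2, f=2):
    # assumed form: g_N = g * N^(-1/(2(R+1)+2+T*f)), first step in (3.5)
    return g * N ** (-1.0 / (2 * (R + 1) + 2 + T * f))

def bandwidth_second_step(g, N, R, q=2):
    # assumed form: g_N = g * N^(-1/(2(R+1)+2+q)), weights in (3.13)/(3.19)
    return g * N ** (-1.0 / (2 * (R + 1) + 2 + q))
```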
From both tables we see that in Panels B and C the estimators are less biased than the estimator ignoring correction for sample selection. The biases are all positive, they increase as the kernel order increases, and they diminish with sample size. The best behaviour is found with the combination of R=1 and constant part of the bandwidth g=1. Some anomalous results for sample size 1000 may be calling for the use of some trimming to ensure that all the kernel estimators are well behaved. The SPDE performs slightly better than the WDPDE, which can have its origin in the extra differencing present in the latter method.
TABLE 1: Weighted Double Pairwise Difference Estimator (WDPDE)

u_{it} = \tilde{\chi}^2_2(0,1); \varepsilon_{it} = 0.8 u_{it} + 0.6 \tilde{\chi}^2_2(0,1)
\eta_i = (x_{i1} + x_{i2})/2 + 2\tilde{\chi}^2_2(0,1) + 1
\alpha_i = -[(z_{1i1} + z_{1i2})/2 + (z_{2i1} + z_{2i2})/2 + (z_{1i1} z_{2i1} + z_{1i2} z_{2i2})/2 + \tilde{\chi}^2_2(0,1) + 0.07]

PANEL A: Ignoring Correction For Sample Selection
N      Mean Bias   Median Bias   SE       MAD
250    0.1099      0.1186        0.1416   0.1194
500    0.0937      0.1005        0.1239   0.1024
750    0.0933      0.0911        0.1075   0.0911
1000   0.0912      0.0887        0.1015   0.0887

PANEL B
R=1 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0650      0.0735        0.1444   0.1143
500    0.0426      0.0496        0.1139   0.0747
750    0.0258      0.0248        0.0827   0.0580
1000   0.0282      0.0244        0.0782   0.0580

R=1 & g=0.5
N      Mean Bias   Median Bias   SE       MAD
250    0.0969      0.0842        0.1811   0.1194
500    0.0679      0.0806        0.1419   0.1101
750    0.0499      0.0398        0.1025   0.0655
1000   0.0570      0.0629        0.0991   0.0744

R=1 & g=3
N      Mean Bias   Median Bias   SE       MAD
250    0.0917      0.0913        0.1365   0.0966
500    0.0762      0.0816        0.1137   0.0861
750    0.0707      0.0777        0.0935   0.0777
1000   0.0700      0.0654        0.0882   0.0654

PANEL C
R=3 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0713      0.0754        0.1465   0.1055
500    0.0576      0.0605        0.1112   0.0888
750    0.0464      0.0550        0.0864   0.0709
1000   0.0531      0.0589        0.0844   0.0662

R=5 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0852      0.0879        0.1439   0.1051
500    0.0669      0.0722        0.1099   0.0808
750    0.0700      0.0612        0.0949   0.0625
1000   0.0735      0.0749        0.0907   0.0751
TABLE 2: Single Pairwise Difference Estimator (SPDE)

u_{it} = \tilde{\chi}^2_2(0,1); \varepsilon_{it} = 0.8 u_{it} + 0.6 \tilde{\chi}^2_2(0,1)
\eta_i = (x_{i1} + x_{i2})/2 + 2\tilde{\chi}^2_2(0,1) + 1
\alpha_i = -[(z_{1i1} + z_{1i2})/2 + (z_{2i1} + z_{2i2})/2 + (z_{1i1} z_{2i1} + z_{1i2} z_{2i2})/2 + \tilde{\chi}^2_2(0,1) + 0.07]

PANEL A: Ignoring Correction For Sample Selection
N      Mean Bias   Median Bias   SE       MAD
250    0.1090      0.1156        0.1412   0.1182
500    0.0940      0.1001        0.1242   0.1011
750    0.0930      0.0906        0.1074   0.0906
1000   0.0911      0.0889        0.1014   0.0889

PANEL B
R=1 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0448      0.0431        0.1373   0.1029
500    0.0165      0.0134        0.0942   0.0583
750    0.0074      0.0167        0.0704   0.0510
1000   0.0063      0.0053        0.0641   0.0431

R=1 & g=0.5
N      Mean Bias   Median Bias   SE       MAD
250    0.0910      0.1039        0.1497   0.1114
500    0.0494      0.0554        0.1028   0.0773
750    0.0441      0.0431        0.0792   0.0547
1000   0.0432      0.0470        0.0718   0.0597

R=1 & g=3
N      Mean Bias   Median Bias   SE       MAD
250    0.0705      0.0670        0.1255   0.0861
500    0.0616      0.0684        0.1037   0.0818
750    0.0443      0.0505        0.0753   0.0550
1000   0.0472      0.0409        0.0708   0.0495

PANEL C
R=3 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0749      0.0680        0.1550   0.1054
500    0.0459      0.0526        0.1277   0.0781
750    0.0471      0.0356        0.1277   0.0562
1000   0.0370      0.0379        0.0837   0.0578

R=5 & g=1
N      Mean Bias   Median Bias   SE       MAD
250    0.0764      0.0672        0.1354   0.0989
500    0.0552      0.0703        0.2379   0.0844
750    0.0704      0.0525        0.1934   0.0706
1000   0.0729      0.0690        0.1600   0.0746
5. RELATIONSHIP BETWEEN THE WDPDE AND THE SPDE
We have presented, for both methods, least squares estimation of \beta as a final step in the estimation procedures, but we can also derive instrumental variables estimation of \beta to make explicit the fact that no strict exogeneity is needed for the variables in the main equation. The exogenous variables z_i can be used to construct a k-dimensional vector (the dimension of x_{it}) of "instrumental variables" for (x_{it} - x_{is}). In particular, if we let the instruments be suitable functions of the conditioning variables z_i and z_j, algebraically these instruments are defined as Z_{its} \equiv Z_{ts}(z_i) for some function Z_{ts}: R^{F \cdot T} \to R^{K}. The estimator in (3.14) rewritten as a weighted instrumental variables estimator is given by the following expression:

\hat{\beta} = [\hat{S}_{Zx}]^{-1} \hat{S}_{Zy},

\hat{S}_{Zx} \equiv \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\omega}_{ijts} [Z_{its} - Z_{jts}]' [(x_{it} - x_{is}) - (x_{jt} - x_{js})]   (5.1)

and

\hat{S}_{Zy} \equiv \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \hat{\omega}_{ijts} [Z_{its} - Z_{jts}]' [(y_{it} - y_{is}) - (y_{jt} - y_{js})].

For the estimator in (3.20) we can also present an alternative to the least squares approach, given by a weighted instrumental variables version. As in some other applications of kernel regression estimators, E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) and E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) cause technical difficulties associated with their random denominators, which can be small (they need not be bounded away from zero). To avoid this problem a convenient choice of instrumental variables is the product of the original instruments Z_{its} with the sum in the denominators of E_N(y_{it} - y_{is} \mid \cdot) and E_N(x_{it} - x_{is} \mid \cdot); that is, the instruments \hat{Z}_{its} are defined as

\hat{Z}_{its} \equiv Z_{its} \cdot \sum_{j \neq i} \hat{\omega}_{ijts}.   (5.2)

With this definition, the coefficients of an instrumental variables regression of (y_{it} - y_{is}) - E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) on (x_{it} - x_{is}) - E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1), using the instrumental variables \hat{Z}_{its}, can be shown to be algebraically equivalent to \hat{\beta} defined in (5.1) above(3):

\hat{\beta} = [\hat{S}_{Zx}]^{-1} \hat{S}_{Zy},

\hat{S}_{Zx} \equiv \sum_{i=1}^{N} d_{it} d_{is} Z_{its}' \left(\sum_{j \neq i} \hat{\omega}_{ijts}\right) \{(x_{it} - x_{is}) - E_N(x_{it} - x_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)\}
= \sum_{i=1}^{N} d_{it} d_{is} Z_{its}' \left\{\left(\sum_{j \neq i} \hat{\omega}_{ijts}\right)(x_{it} - x_{is}) - \sum_{j \neq i} \hat{\omega}_{ijts} (x_{jt} - x_{js})\right\},

\hat{S}_{Zy} \equiv \sum_{i=1}^{N} d_{it} d_{is} Z_{its}' \left(\sum_{j \neq i} \hat{\omega}_{ijts}\right) \{(y_{it} - y_{is}) - E_N(y_{it} - y_{is} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)\}
= \sum_{i=1}^{N} d_{it} d_{is} Z_{its}' \left\{\left(\sum_{j \neq i} \hat{\omega}_{ijts}\right)(y_{it} - y_{is}) - \sum_{j \neq i} \hat{\omega}_{ijts} (y_{jt} - y_{js})\right\}.   (5.3)

(3) To show that they are equivalent we have to take into account the property of the kernel K_{ij} = K_{ji}.
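The equivalence rests only on the symmetry of the kernel weights, and can be checked numerically; the arrays below are arbitrary illustrative data, with all individuals treated as selected.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 40, 2
dx = rng.standard_normal((N, k))               # (x_it - x_is)
Z = rng.standard_normal((N, k))                # instruments Z_its
h = rng.uniform(0.2, 0.8, (N, 2))              # estimated indices (h_t, h_s)
u = (h[:, None, :] - h[None, :, :]) / 0.3
w = np.exp(-0.5 * (u ** 2).sum(axis=2))        # symmetric weights w_ij = w_ji
np.fill_diagonal(w, 0.0)

# (5.1): sum over distinct pairs of differenced instruments and regressors
S1 = sum(w[i, j] * np.outer(Z[i] - Z[j], dx[i] - dx[j])
         for i in range(N) for j in range(i + 1, N))

# (5.3): single sum with rescaled instruments Z_hat_its = Z_its * sum_j w_ij
S2 = sum(np.outer(Z[i], w[i].sum() * dx[i] - w[i] @ dx) for i in range(N))

assert np.allclose(S1, S2)                     # algebraic equivalence
```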
6. CONCLUDING REMARKS
In this paper, estimation of the coefficients in a "double-index" selectivity bias model is considered under the assumption that the selection correction function depends only on the conditional means of some observable selection variables. We present two alternative methods. The first is a "weighted double pairwise difference estimator" because it is based on the comparison of individuals in time differences. The second is a "single pairwise difference estimator" because only differences over time for each individual are required. Their advantages with respect to already available methods are that they are distribution-free methods, there is no need to assume a parametric selection mechanism, and heteroskedasticity over time is allowed for. The methods do not require strict exogeneity for the variables in the main equation, and they are equivalent under a special type of instrumental variables.

The finite sample properties of the estimators are investigated by Monte Carlo experiments. The results of our small Monte Carlo simulation study show the following. Both estimators are less biased than the estimator ignoring correction for sample selection. The biases are all positive, they increase as the kernel order increases, and they diminish with sample size. The best behaviour is found with the combination of R=1 and constant part of the bandwidth g=1. The SPDE performs slightly better than the WDPDE, which can have its origin in the extra differencing present in the latter method.
REFERENCES
- AHN, H. AND J. L. POWELL (1993), "Semiparametric estimation of censored selection models with a nonparametric selection mechanism", Journal of Econometrics, 58, 3-29.
- BIERENS, H. J. (1987), "Kernel estimators of regression functions", in Advances in Econometrics, Fifth World Congress, Volume I, Econometric Society Monographs, No. 13, ed. T. F. Bewley, Cambridge University Press.
- CHAMBERLAIN, G. (1980), "Analysis of covariance with qualitative data", Review of Economic Studies, XLVII, 225-238.
- CHARLIER, E., B. MELENBERG, AND A. H. O. VAN SOEST (1995), "A smoothed maximum score estimator for the binary choice panel data model with an application to labour force participation", Statistica Neerlandica, 49, 324-342.
- HECKMAN, J. (1976), "The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models", Annals of Economic and Social Measurement, 5, 475-492.
- HECKMAN, J. (1979), "Sample selection bias as a specification error", Econometrica, 47, 153-161.
- HOROWITZ, J. L. (1988), "Semiparametric M-estimation of censored linear regression models", Advances in Econometrics, 7, 45-83.
- KYRIAZIDOU, E. (1994), "Estimation of a panel data sample selection model", unpublished manuscript, Northwestern University.
- KYRIAZIDOU, E. (1997), "Estimation of a panel data sample selection model", Econometrica, Vol. 65, No. 6, 1335-1364.
- MANSKI, C. (1987), "Semiparametric analysis of random effects linear models from binary panel data", Econometrica, 55, 357-362.
- POWELL, J. L. (1987), "Semiparametric estimation of bivariate latent variable models", Working Paper No. 8704, revised April 1989, Social Systems Research Institute, University of Wisconsin, Madison, WI.
- ROBINSON, P. M. (1988), "Root-N-consistent semiparametric regression", Econometrica, Vol. 56, No. 4, 931-954.
- ROCHINA-BARRACHINA, M. E. (1996), "Small sample properties of two different estimators of panel data sample selection models with non-parametric components", unpublished paper.
- ROCHINA-BARRACHINA, M. E. (1999), "A new estimator for panel data sample selection models", Annales d'Économie et de Statistique, 55/56, 153-181.
- WOOLDRIDGE, J. M. (1995), "Selection corrections for panel data models under conditional mean independence assumptions", Journal of Econometrics, 68, 115-132.
Appendix I

The variance-covariance matrix for the WDPDE

One variation inside the class of "semiparametric M-estimators" termed by Horowitz (1988) defines the WDPDE of \beta as a minimizer of a second-order (bivariate) U-statistic,

\hat{\beta} = \arg\min_{\beta \in B} \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \{[(\Delta y_{its} - \Delta y_{jts}) - (\Delta x_{its} - \Delta x_{jts})\beta]^2 \hat{\omega}_{ijts}\} \equiv \arg\min_{\beta \in B} U_{0N}(\beta),   (I.1)

that will solve an approximate first order condition

\binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (\Delta x_{its} - \Delta x_{jts})' [(\Delta y_{its} - \Delta y_{jts}) - (\Delta x_{its} - \Delta x_{jts})\hat{\beta}] \hat{\omega}_{ijts} = 0,   (I.2)

where \Delta y_{its} = y_{it} - y_{is}, \Delta x_{its} = x_{it} - x_{is}, and \hat{\omega}_{ijts} is defined by expression (3.13) in the main text above. The empirical loss function in (I.1) and the estimating equations in (I.2) also depend upon an estimator of the nonparametric components h_{its} and h_{jts} defined in (3.1) and (3.5). To derive the influence function for an estimator satisfying (I.2), we first do an expansion of (I.2) around \beta and subsequently a functional mean-value expansion around (h_{its} - h_{jts}) to determine the effect on \hat{\beta} of estimation of (\hat{h}_{its} - \hat{h}_{jts}). Expanding (I.2) around \beta we get

0 = \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (\Delta x_{its} - \Delta x_{jts})' [(\Delta y_{its} - \Delta y_{jts}) - (\Delta x_{its} - \Delta x_{jts})\beta] \hat{\omega}_{ijts}
- \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (\Delta x_{its} - \Delta x_{jts})' (\Delta x_{its} - \Delta x_{jts}) \hat{\omega}_{ijts} (\hat{\beta} - \beta),   (I.3)

from where

\sqrt{N}(\hat{\beta} - \beta) = \{\hat{S}_{xx}\}^{-1} \sqrt{N} \hat{S}_{x\zeta}, \quad \hat{S}_{x\zeta} \equiv \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (\Delta x_{its} - \Delta x_{jts})' (\zeta_{its} - \zeta_{jts}) \hat{\omega}_{ijts},   (I.4)

where (\zeta_{its} - \zeta_{jts}) \equiv [(\Delta y_{its} - \Delta y_{jts}) - (\Delta x_{its} - \Delta x_{jts})\beta]. If we analyse the components of (I.4):

1) \hat{S}_{xx} \equiv \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (\Delta x_{its} - \Delta x_{jts})' (\Delta x_{its} - \Delta x_{jts}) \hat{\omega}_{ijts} \to_p S_{xx} = 2\Sigma_{xx}.   (I.5)

As S_{xx} is the limit of a bivariate U-statistic U_{1N}, by using U-statistics asymptotic theory we know that

U_{1N} \to_p 2 E\{E[(\Delta x_{its} - \Delta x_{jts})' (\Delta x_{its} - \Delta x_{jts}) \omega_{ijts} \mid \Delta x_{its}, h_{its}, d_{it}, d_{is}]\} = 2\Sigma_{xx}.   (I.6)

The matrix \Sigma_{xx} is easily handled, since (1/2)\hat{S}_{xx} consistently estimates it.

2) \sqrt{N}\hat{S}_{x\zeta}, expanded around (h_{its} - h_{jts}),

\sqrt{N}\hat{S}_{x\zeta} = \frac{1}{\sqrt{N}} \sum_{i=1}^{N-1} \frac{2}{N-1} \sum_{j=i+1}^{N} (\Delta x_{its} - \Delta x_{jts})' (\zeta_{its} - \zeta_{jts}) \frac{1}{g_{2N}^2} k\left(\frac{h_{its} - h_{jts}}{g_{2N}}\right) d_{its} d_{jts}
+ \frac{1}{\sqrt{N}} \sum_{i=1}^{N-1} \frac{2}{N-1} \sum_{j=i+1}^{N} \sum_{l=1}^{N} (\Delta x_{jts} - \Delta x_{lts})' (\zeta_{jts} - \zeta_{lts}) \frac{1}{g_{2N}^3} k'\left(\frac{h^{*}_{its} - h^{*}_{jts}}{g_{2N}}\right) d_{jts} d_{lts} K_{jl} \left(\sum_{l=1}^{N} K_{jl}\right)^{-1} [(d_{it}, d_{is})' - \hat{h}_{its}],   (I.7)

where d_{its} \equiv d_{it} d_{is}, k'(\cdot) is the derivative of the second-stage kernel k(\cdot), h^{*} denotes a mean value, and K_{jl} is defined as in (3.5). The expression (I.7) includes derivatives of the weights with respect to (h_{its} - h_{jts}) (kernel derivatives). The first term on the right hand side of (I.7) converges in probability to

\frac{1}{\sqrt{N}} \sum_{i=1}^{N} E\left[(\Delta x_{its} - \Delta x_{jts})' (\zeta_{its} - \zeta_{jts}) \frac{1}{g_{2N}^2} k\left(\frac{h_{its} - h_{jts}}{g_{2N}}\right) d_{its} d_{jts} \mid \Delta x_{its}, \zeta_{its}, h_{its}, d_{its}\right],   (I.8)

and the second to

\frac{1}{\sqrt{N}} \sum_{i=1}^{N} E\left[\sum_{l=1}^{N} (\Delta x_{jts} - \Delta x_{lts})' (\zeta_{jts} - \zeta_{lts}) \frac{1}{g_{2N}^3} k'\left(\frac{h_{its} - h_{jts}}{g_{2N}}\right) d_{jts} d_{lts} K_{jl} \left(\sum_{l=1}^{N} K_{jl}\right)^{-1} \mid h_{its}\right] [(d_{it}, d_{is})' - \hat{h}_{its}].   (I.9)

Substituting (I.5), (I.8) and (I.9) in (I.4), we get

\sqrt{N}(\hat{\beta} - \beta) =_p \{\Sigma_{xx}\}^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \{\psi_{its} + \Lambda_{its} [(d_{it}, d_{is})' - \hat{h}_{its}]\},   (I.10)

where \psi_{its} and \Lambda_{its} denote the two conditional expectations in (I.8) and (I.9). This is asymptotically normal,

\sqrt{N}(\hat{\beta} - \beta) \to_d N(0, \Sigma_{xx}^{-1} \Omega_{xx} [\Sigma_{xx}^{-1}]'),   (I.11)

where \Sigma_{xx} is estimated by (1/2)\hat{S}_{xx}, as in (I.5), and \Omega_{xx} can be estimated by

\hat{\Omega}_{xx} \equiv \frac{1}{N} \sum_{i=1}^{N} [\hat{\psi}_{its} + \hat{\Lambda}_{its}((d_{it}, d_{is})' - \hat{h}_{its})]' [\hat{\psi}_{its} + \hat{\Lambda}_{its}((d_{it}, d_{is})' - \hat{h}_{its})],   (I.12)

where

\hat{\psi}_{its} \equiv \frac{1}{N-1} \sum_{j=1}^{N} (\Delta x_{its} - \Delta x_{jts})' (\hat{\zeta}_{its} - \hat{\zeta}_{jts}) \frac{1}{g_{2N}^2} k\left(\frac{\hat{h}_{its} - \hat{h}_{jts}}{g_{2N}}\right) d_{its} d_{jts},

\hat{\Lambda}_{its} \equiv \frac{1}{N-1} \sum_{j=1}^{N} \sum_{l=1}^{N} (\Delta x_{jts} - \Delta x_{lts})' (\hat{\zeta}_{jts} - \hat{\zeta}_{lts}) \frac{1}{g_{2N}^3} k'\left(\frac{\hat{h}_{its} - \hat{h}_{jts}}{g_{2N}}\right) d_{jts} d_{lts} K_{jl} \left(\sum_{l=1}^{N} K_{jl}\right)^{-1}.   (I.13)

The general theory derived for minimizers of mth-order U-statistics can be applied to show \sqrt{N}-consistency and to obtain the large sample distribution of the WDPDE for panel data sample selection models. The variance-covariance matrix for this estimator depends upon the conditional variability of the errors in the regression equation and the deviations of the selection indicators from their conditional means, (d_{it}, d_{is})' - \hat{h}_{its}.
Appendix II

The variance-covariance matrix for the SPDE

We can define the SPDE of \beta as a minimizer of

\hat{\beta} = \arg\min_{\beta \in B} \frac{1}{N} \sum_{i=1}^{N} \{[\Delta y_{its} - E_N(\Delta y_{ts} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)] - [\Delta x_{its} - E_N(\Delta x_{ts} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)]\beta\}^2 d_{it} d_{is},   (II.1)

that will solve an approximate first order condition

-\frac{1}{N} \sum_{i=1}^{N} [\Delta x_{its} - E_N(\Delta x_{ts} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)]' \{[\Delta y_{its} - E_N(\Delta y_{ts} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)] - [\Delta x_{its} - E_N(\Delta x_{ts} \mid \hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1)]\hat{\beta}\} d_{it} d_{is} = 0.   (II.2)

In what follows, (\cdot) abbreviates the conditioning set (\hat{h}_t(z_i), \hat{h}_s(z_i), d_{it} = d_{is} = 1) for sample quantities and (h_t(z), h_s(z), d_t = d_s = 1) for population ones. Expanding (II.2) around \beta we get

0 = -\frac{1}{N} \sum_{i=1}^{N} [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)]' \{[\Delta y_{its} - E_N(\Delta y_{ts} \mid \cdot)] - [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)]\beta\} d_{it} d_{is}
+ \frac{1}{N} \sum_{i=1}^{N} [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)]' [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)] d_{it} d_{is} (\hat{\beta} - \beta),   (II.3)

from where

\sqrt{N}(\hat{\beta} - \beta) = \left\{\frac{1}{N} \sum_{i=1}^{N} [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)]' [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)] d_{it} d_{is}\right\}^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} [\Delta x_{its} - E_N(\Delta x_{ts} \mid \cdot)]' [(\varepsilon_{it} - \varepsilon_{is}) - E_N(\varepsilon_t - \varepsilon_s \mid \cdot)] d_{it} d_{is},   (II.4)

where (\varepsilon_{it} - \varepsilon_{is}) = \Delta y_{its} - \Delta x_{its}\beta, and

-E_N(\varepsilon_t - \varepsilon_s \mid \cdot) = -E_N(\Delta y_{ts} \mid \cdot) + E_N(\Delta x_{ts} \mid \cdot)\beta.   (II.5)

It can be shown that the inverted matrix in (II.4) is consistent for

A = E\{[\Delta x_{ts} - E(\Delta x_{ts} \mid h_t(z), h_s(z), d_t = d_s = 1)]' [\Delta x_{ts} - E(\Delta x_{ts} \mid h_t(z), h_s(z), d_t = d_s = 1)] d_t d_s\}.   (II.6)

We shall now analyse the second term in (II.4). We have to work out the effect of estimating four infinite dimensional conditional means, (h_t(z), h_s(z), E(\Delta x_{ts} \mid h_t(z), h_s(z), d_t = d_s = 1), E(\varepsilon_t - \varepsilon_s \mid h_t(z), h_s(z), d_t = d_s = 1)), on the asymptotic variance of our parameter of interest \beta. The moment condition for the summand can be written as

E\{m[h_t(z), h_s(z), E(\Delta x_{ts} \mid \cdot), E(\varepsilon_t - \varepsilon_s \mid \cdot)]\} = 0,   (II.7)

where

m[h_t(z), h_s(z), E(\Delta x_{ts} \mid \cdot), E(\varepsilon_t - \varepsilon_s \mid \cdot)] = [\Delta x_{ts} - E(\Delta x_{ts} \mid \cdot)]' [(\varepsilon_t - \varepsilon_s) - E(\varepsilon_t - \varepsilon_s \mid \cdot)] d_t d_s.   (II.8)

The following four derivatives are of interest:

\frac{\partial m}{\partial E(\Delta x_{ts} \mid \cdot)} = -[(\varepsilon_t - \varepsilon_s) - E(\varepsilon_t - \varepsilon_s \mid \cdot)] d_t d_s,   (II.9)

\frac{\partial m}{\partial E(\varepsilon_t - \varepsilon_s \mid \cdot)} = -[\Delta x_{ts} - E(\Delta x_{ts} \mid \cdot)]' d_t d_s,   (II.10)

\frac{\partial m}{\partial h_t(z)} = -\nabla_t E(\Delta x_{ts} \mid \cdot)' [(\varepsilon_t - \varepsilon_s) - E(\varepsilon_t - \varepsilon_s \mid \cdot)] d_t d_s - [\Delta x_{ts} - E(\Delta x_{ts} \mid \cdot)]' \nabla_t E(\varepsilon_t - \varepsilon_s \mid \cdot) d_t d_s,   (II.11)

\frac{\partial m}{\partial h_s(z)} = -\nabla_s E(\Delta x_{ts} \mid \cdot)' [(\varepsilon_t - \varepsilon_s) - E(\varepsilon_t - \varepsilon_s \mid \cdot)] d_t d_s - [\Delta x_{ts} - E(\Delta x_{ts} \mid \cdot)]' \nabla_s E(\varepsilon_t - \varepsilon_s \mid \cdot) d_t d_s,   (II.12)

where \nabla_t E(\cdot \mid h_t(z), h_s(z), d_t = d_s = 1) and \nabla_s E(\cdot \mid h_t(z), h_s(z), d_t = d_s = 1) are the derivatives of E(\cdot \mid h_t(z), h_s(z), d_t = d_s = 1) with respect to h_t(z) and h_s(z), respectively. For the moment condition in (II.7), a functional expansion around h_t(z) and h_s(z) gives

\frac{1}{\sqrt{N}} \sum_{i=1}^{N} m[\hat{h}_t(z_i), \hat{h}_s(z_i), E_N(\Delta x_{ts} \mid \cdot), E_N(\varepsilon_t - \varepsilon_s \mid \cdot)]
=_p \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \{ m[h_t(z_i), h_s(z_i), E(\Delta x_{ts} \mid \cdot), E(\varepsilon_t - \varepsilon_s \mid \cdot)]
+ E[\partial m / \partial E(\Delta x_{ts} \mid \cdot) \mid h_t(z), h_s(z), d_t = d_s = 1][\Delta x_{ts} - E(\Delta x_{ts} \mid \cdot)]
+ E[\partial m / \partial E(\varepsilon_t - \varepsilon_s \mid \cdot) \mid h_t(z), h_s(z), d_t = d_s = 1][(\varepsilon_t - \varepsilon_s) - E(\varepsilon_t - \varepsilon_s \mid \cdot)]
+ E[\partial m / \partial h_t(z) \mid h_t(z), h_s(z), d_t = d_s = 1][d_t - h_t(z)]
+ E[\partial m / \partial h_s(z) \mid h_t(z), h_s(z), d_t = d_s = 1][d_s - h_s(z)] \}.   (II.13)

For our estimator, the two means of \partial m / \partial E(\cdot \mid h_t(z), h_s(z), d_t = d_s = 1) conditional on h_t(z), h_s(z) and d_t = d_s = 1 are zero (see (II.9) and (II.10) above). Furthermore, the corresponding two terms E[\partial m / \partial h_t(z) \mid h_t(z), h_s(z), d_t = d_s = 1] and E[\partial m / \partial h_s(z) \mid h_t(z), h_s(z), d_t = d_s = 1], according to (II.11) and (II.12), are also zero, because E[(\varepsilon_t - \varepsilon_s) - E(\varepsilon_t - \varepsilon_s \mid \cdot) \mid h_t(z), h_s(z), d_t = d_s = 1] = 0 and E[\Delta x_{ts} - E(\Delta x_{ts} \mid \cdot) \mid h_t(z), h_s(z), d_t = d_s = 1] = 0. Hence, there is no effect of estimating the four infinite dimensional nuisance parameters on the asymptotic variance of \beta, given that the correction terms in braces in (II.13) are equal to zero. Therefore, we get

\sqrt{N}(\hat{\beta} - \beta) =_p A^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} [\Delta x_{its} - E(\Delta x_{ts} \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)]' [(\varepsilon_{it} - \varepsilon_{is}) - E(\varepsilon_t - \varepsilon_s \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)] d_{it} d_{is} = A^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \psi_i,   (II.14)

which is asymptotically normal,

\sqrt{N}(\hat{\beta} - \beta) \to_d N(0, A^{-1} E(\psi \psi') A^{-1}),   (II.15)

where

\psi_i \equiv [\Delta x_{its} - E(\Delta x_{ts} \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)]' [(\varepsilon_{it} - \varepsilon_{is}) - E(\varepsilon_t - \varepsilon_s \mid h_t(z_i), h_s(z_i), d_{it} = d_{is} = 1)] d_{it} d_{is}.   (II.16)

A can be estimated as in (II.4), while E(\psi \psi') is estimated by replacing all the conditional means involved, that is, (h_t(z), h_s(z), E(\Delta x_{ts} \mid h_t(z), h_s(z), d_t = d_s = 1), E(\varepsilon_t - \varepsilon_s \mid h_t(z), h_s(z), d_t = d_s = 1)), with nonparametric estimates.