Treatment evaluation in the presence of sample selection
Martin Huber
University of St. Gallen, Dept. of Economics
First draft: April 2008
This version: April 2009
Abstract: Sample selection is inherent to a range of treatment evaluation problems, such as the estimation of
the returns to schooling or of the effect of school vouchers on test scores of college admissions tests, when
some students abstain from the test in a non-random manner. Parametric and semiparametric estimators
tackling selectivity typically rely on restrictive functional form assumptions that are unlikely to hold in
reality. This paper proposes nonparametric weighting and matching estimators of average and quantile
treatment effects that are consistent under more general forms of sample selection and incorporate effect
heterogeneity with respect to observed characteristics. These estimators control for the double selection
problem (i) into the observed population (e.g., working or taking the test) and (ii) into treatment by
conditioning on nested propensity scores characterizing either selection probability. Weighting estimators
based on parametric propensity score models are shown to be √n-consistent and asymptotically normal.
Simulations suggest that the proposed methods yield decent results in scenarios when parametric estimators
are inconsistent.
Keywords: treatment effects, sample selection, inverse probability weighting, propensity score matching.
JEL classification: C13, C14, C21
I have benefited from comments by Joshua D. Angrist, Eva Deuchert, Markus Frölich, Michael Lechner, Blaise
Melly, Rudi Stracke, and by seminar/conference participants in St. Gallen, Engelberg, and Bern. Address for
correspondence: Martin Huber, SEW, University of St. Gallen, Varnbüelstrasse 14, 9000 St. Gallen, Switzerland,
[email protected].
1 Introduction
The sample selection problem, discussed by Gronau (1974), Heckman (1974), and Vella (1998),
among many others, arises whenever the outcome of interest is observable only for a subpopulation
and selection into that subpopulation is non-ignorable conditional on observed characteristics. Potential bias due to
selection is an issue for a range of evaluation problems, e.g., when estimating the returns to schooling¹ based
on a selective subpopulation of working individuals or the effect of school vouchers on college admissions tests², given
that some students abstain from the test in a non-random manner.
This paper discusses identification and estimation of treatment effects in the presence of sample selection, attrition, and survey non-response related to unobserved characteristics. It considers a sample
selection model of rather general form in which two forms of selection appear, firstly, sample selection as
discussed above and secondly, non-random treatment assignment which is selective with respect to observed characteristics. The literature that is based on the conditional independence assumption (see for
instance Lechner, 1999, and Imbens, 2004) assumes that treatment effects are identified conditional on
observed characteristics jointly related to the treatment (e.g., education) and the outcome (e.g., employment). However, in the framework considered here, we also need to tackle the sample selection problem.
Under certain conditions the latter can be controlled for by conditioning on the sample selection propensity score, i.e., the conditional probability to be selected into the observed population. This intuitive result
was acknowledged by Angrist (1997), among others, and underlies the estimator of Ahn & Powell (1993)
based on matching individuals with similar selection propensity scores. It is also used by Newey (2007) for
the identification of nonparametric choice models. It follows that treatment effects in our framework are
identified when conditioning both on the sample selection propensity score and on observed confounders.
This paper establishes assumptions sufficient to point identify unconditional average treatment effects
(ATEs) and quantile treatment effects (QTEs) for the observed population. Our nonparametric selection
model invokes considerably weaker restrictions than parametric and semiparametric specifications
encountered in the sample selection literature. In particular, the model allows for effect heterogeneity
with respect to the observed confounders and the sample selection propensity score.
This allows identifying heterogeneous QTEs at different ranks of the outcome distribution. Furthermore, additivity
between observed and unobserved terms is not imposed, in contrast to virtually all models used in empirical
applications. The main contribution of the paper is the proposition of nonparametric estimators which
‘kill two birds with one stone’ by controlling for selectivity bias (i) in the observed population (e.g.,
working or taking the test) and (ii) with respect to the treatment assignment, using a nested propensity
score characterizing either selection probability. The estimators rely on inverse probability weighting
(IPW) and propensity score matching, where the (first stage) sample selection propensity score is
included as additional covariate among other observed factors to compute the (second stage) propensity
to receive the treatment.
¹ See for instance Mulligan & Rubinstein (2008).
² See for instance Angrist, Bettinger & Kremer (2004).
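The nested two-stage idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the estimator defined later in the paper: it assumes a binary treatment, logit models for both propensity scores (fitted with a hand-written Newton-Raphson routine), and normalized weights. All function names (`fit_logit`, `predict_logit`, `nested_ipw_ate`) and the simulated data generating process are hypothetical.

```python
import numpy as np


def fit_logit(features, target, n_iter=50, ridge=1e-8):
    """Fit a logit model by Newton-Raphson; returns coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(target)), features])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (target - prob)
        hessian = (design * (prob * (1.0 - prob))[:, None]).T @ design
        # A tiny ridge term keeps the linear system well conditioned.
        step = np.linalg.solve(hessian + ridge * np.eye(len(beta)), gradient)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def predict_logit(beta, features):
    """Predicted probabilities from logit coefficients (intercept first)."""
    design = np.column_stack([np.ones(len(features)), features])
    return 1.0 / (1.0 + np.exp(-design @ beta))


def nested_ipw_ate(y, d, x, z, s):
    """Illustrative nested-propensity-score IPW estimate of the ATE among S = 1."""
    # First stage (all units): selection propensity score Pr(S=1 | D, X, Z).
    w = np.column_stack([d, x, z])
    p_sel = predict_logit(fit_logit(w, s), w)
    # Second stage (observed units only): treatment propensity given X and p_sel.
    obs = s == 1
    second_stage = np.column_stack([x[obs], p_sel[obs]])
    pi = predict_logit(fit_logit(second_stage, d[obs]), second_stage)
    # Normalized inverse probability weighting in the observed population.
    y_obs, d_obs = y[obs], d[obs]
    w1, w0 = d_obs / pi, (1 - d_obs) / (1 - pi)
    return np.sum(w1 * y_obs) / np.sum(w1) - np.sum(w0 * y_obs) / np.sum(w0)


# Hypothetical data: selection depends on an unobservable correlated with the outcome.
rng = np.random.default_rng(0)
n = 5000
x, z = rng.normal(size=n), rng.normal(size=n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)
s = (0.5 * d + 0.5 * x + z >= v).astype(int)
y = np.where(s == 1, d + x + u, np.nan)  # outcome missing when S = 0
print("nested IPW estimate:", nested_ipw_ate(y, d, x, z, s))
```

The sketch omits trimming of extreme weights and any inference. The linear logit index in the second stage is a convenience choice, so the estimate may be sensitive to that specification.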
The estimators invoke a minimum of assumptions required for point identification in the presence of
the double selection problem into the observed population and into treatment. Monte Carlo simulations
suggest that even for moderate sample sizes, IPW and matching are considerably more accurate than
parametric estimators with respect to bias and mean squared error when the data generating process is
nonlinear. This paper provides two empirical applications. Firstly, we estimate the wage differentials
between individuals with high school graduation and lower education. Selection bias stems from the fact
that wages are only observed for the subpopulation of working individuals. Secondly, we check the robustness of the
effects of school vouchers on test scores in a school voucher lottery. Selection bias is due to non-random test
attendance related to voucher possession. √n-consistency and asymptotic normality of the IPW estimator
are established in the appendix.
The remainder of the paper is organized as follows. Section 2 reviews the literature on sample selection
models and highlights its relations and distinctions to the framework discussed in this paper. Section 3
introduces a general sample selection model and discusses the identifying assumptions. Section 4 presents
IPW estimators for average and quantile treatment effects. Section 5 provides simulation results on the
finite sample properties of IPW and matching estimators relative to parametric benchmarks. In section 6,
the estimators are applied to labor market data and a school voucher lottery. Section 7 concludes.
2 Related literature
The estimation of wages constitutes a prominent example for the sample selection problem in labor economics and was first addressed by Gronau (1974). As unobserved individual characteristics are likely to
affect both the probability of working and the potential wage, observed wages, i.e. potential wages conditional on working, will be correlated with the likelihood of employment which gives rise to selectivity bias.
Heckman (1974) proposed a maximum likelihood (ML) estimator to tackle selectivity bias when covariates
are linear and additive in the selection and outcome equation and unobservables are homoscedastic and bivariate Gaussian. Still for the linear and homoscedastic case, Heckman (1976, 1979) suggested a two-step
estimation approach known as the two-step heckit estimator. In the first step, the conditional mean of the
selection indicator is estimated and used to correct for the selectivity bias in the second step estimation
of the outcome equation. From today’s perspective, these estimators are not very appealing as they
rely on overly restrictive parametric assumptions and are in general inconsistent when the unobservables’
distribution is misspecified.
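For readers less familiar with the two-step procedure, the following is a minimal sketch of a textbook heckit under the usual convention S = 1{index + v > 0} with jointly normal errors. It is an illustration only and not code from any of the cited papers. The function names, the Fisher scoring probit routine, and the simulated data are assumptions made for this example.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def norm_cdf(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + _erf(np.asarray(t, dtype=float) / math.sqrt(2.0)))


def norm_pdf(t):
    """Standard normal pdf."""
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)


def fit_probit(design, s, n_iter=100):
    """Probit coefficients by Fisher scoring; `design` must contain a constant."""
    gamma = np.zeros(design.shape[1])
    for _ in range(n_iter):
        index = design @ gamma
        prob = np.clip(norm_cdf(index), 1e-10, 1.0 - 1e-10)
        dens = norm_pdf(index)
        score = design.T @ ((s - prob) * dens / (prob * (1.0 - prob)))
        info = (design * (dens**2 / (prob * (1.0 - prob)))[:, None]).T @ design
        step = np.linalg.solve(info, score)
        gamma += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return gamma


def heckit(y, x, z, s):
    """Two-step estimate; returns (intercept, slope on x, coefficient on the Mills ratio)."""
    n = len(s)
    # Step 1: probit for the selection indicator on x and the excluded variable z.
    sel_design = np.column_stack([np.ones(n), x, z])
    index = sel_design @ fit_probit(sel_design, s)
    mills = norm_pdf(index) / norm_cdf(index)  # inverse Mills ratio
    # Step 2: OLS on the selected sample with the Mills ratio as extra regressor.
    obs = s == 1
    out_design = np.column_stack([np.ones(obs.sum()), x[obs], mills[obs]])
    coef, *_ = np.linalg.lstsq(out_design, y[obs], rcond=None)
    return coef


# Hypothetical linear, homoscedastic, bivariate normal design.
rng = np.random.default_rng(0)
n = 20000
x, z, v = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
u = 0.7 * v + math.sqrt(1.0 - 0.7**2) * rng.normal(size=n)
s = (0.5 + x + z + v > 0).astype(int)
y = np.where(s == 1, 1.0 + 2.0 * x + u, np.nan)
print("heckit coefficients:", heckit(y, x, z, s))
```

The correction term is only valid under the linear index and joint normality used to simulate the data. This is exactly the restriction the subsequent literature tries to relax.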
The subsequent literature aims at relaxing these restrictions in various directions³. Using ML estimation, Gallant & Nychka (1987) suggest approximating the bivariate density of the unobservables in
the selection and outcome equation to circumvent the joint normality assumption. Among two-step estimators, semiparametric alternatives have been proposed for first step and second step estimation, and
more recently for both. For the first step, Klein & Spady (1993) suggest an estimator for binary choice
models that is semiparametric in the sense that it does not put any restrictions on the distribution of
the unobservable term, while the parametric form of the single index still has to be known. The estimator attains the semiparametric efficiency bound of Chamberlain (1986) and Cosslett (1987) and allows for
heteroscedasticity of unknown form as long as it is related to the regressors only through the index. Alternatives, although asymptotically less efficient, are Manski’s (1975) maximum score estimator, Horowitz’s
(1992) smoothed maximum score estimator, Ichimura’s (1993) semiparametric least squares estimator, and
Han’s (1987) maximum rank correlation estimator, among others. Frölich (2006) discusses nonparametric
estimators of binary choice models along with their small sample properties. Results suggest that in many
empirical settings, local logit estimation is likely to be more appropriate than parametric and semiparametric specifications.
³ An excellent survey on improvements in the sample selection literature is provided by Vella (1998).
For the second step, various authors suggest allowing for a nonparametric specification of the bias
correction function (related to the sample selection bias) in the outcome equation. Cosslett (1991) proposes
a two-step procedure where the marginal distribution function of the selection bias is approximated by a
step function of J intervals in the first step. In the second step, the outcome equation is estimated by an
OLS regression on the regressors and the J indicator (or dummy) variables. The estimator is consistent
if J increases with the sample size. Newey (1999) suggests estimating the bias correction function by
series expansion after estimating the single index semiparametrically (e.g., using the Klein and Spady
estimator). He proposes GMM estimators and argues that efficiency gains can be obtained when additional
orthogonality conditions implied by the independence of errors in the bias corrected outcome equation and
the covariates are exploited. Pagan & Ullah (1999) discuss the optimal choice of such conditions with
respect to efficiency.
In contrast to Cosslett and Newey, Powell (1987) estimates the outcome equation conditional on the
estimated linear single index without estimating the bias correction function itself. Powell’s approach is
closely related to Robinson’s (1988) semiparametric and partially linear model and is based on the intuition
that if two observations have similar values in the first stage single index, subtracting one observation
from the other eliminates sample selection bias. This allows consistent estimation of the coefficients in
the outcome equation. Powell (1987) therefore suggests an estimator based on pairwise comparisons of all
observations in the sample, where the contribution of each comparison is weighted by the difference in the
single index values. Still, the parametric form of the single index related to the selection probability has
to be known and is assumed to be linear. Ahn & Powell (1993) extend the approach of Powell (1987) to
the nonparametric estimation of the index based on kernel regression.
Das, Newey & Vella (2003) suggest estimating both the selection and the outcome equation semiparametrically based on series approximation⁴. The outcome depends on a general function of the covariates
and the bias correction term. The selection probability underlying the correction term is a nonparametric
function and possibly dependent on multiple indices. However, additivity between the covariates’ function, the correction term, and the unobservables has to be imposed. In contrast, Newey (2007) discusses
identification in nonseparable sample selection models where the outcome is an unknown (and potentially
non-additive) function of covariates and unobservables. Likewise, the model presented in section 3 does
not impose additivity between covariates, the selection probability, and unobservables in order to identify
ATEs and QTEs⁵.
Semiparametric estimators have been used in a range of empirical studies and the excerpt given below
is far from being exhaustive. Gerfin (1996) applies various first-step estimators to German and Swiss labor
market data to estimate the labor force participation of married women. He uses a probit specification,
Horowitz’s smoothed maximum score estimator, the Klein and Spady estimator, and quasi-maximum
likelihood estimation as proposed by Gabler, Laisney & Lechner (1993). Specification tests reject the
benchmark probit model for the German data, but not for the Swiss data. Also Kumar (2006) considers
(first step) estimation of female labor supply using the 1986 wave of the Panel Study of Income Dynamics
(PSID). The author is among the very few who estimate the decision of working nonparametrically. He
employs kernel estimators and compares their predictive power to logit, probit, and Manski’s (1975)
maximum score estimator. His findings suggest that the predictive power of nonparametric estimators
(about 95%) is considerably better than in parametric and semiparametric specifications (around 76%)
which perform particularly poorly in predicting the outcome for non-participants.
Newey, Powell & Walker (1990) investigate data on married women’s hours worked previously analyzed
by Mroz (1987). They employ semiparametric two-step estimators using the methods of Klein & Spady
(1993), Ichimura (1993), Powell (1987), and Newey (1999) and obtain results which are quite similar to
parametric two-step estimation. Similarly, Melenberg & van Soest (1996) analyze vacation expenditures of
Dutch families and obtain almost identical results using parametric and semiparametric estimators based
on Klein and Spady first step estimation. Martins (2001) presents results on parametric and semiparametric
estimation of (first step) labor force participation decisions and (second step) wage equations for Portuguese
married women. Semiparametric estimation is based on Klein & Spady (1993) and Newey’s (1991) series
approximation. In contrast to Newey et al. (1990) and Melenberg & van Soest (1996), specification
tests indicate that the estimates obtained by parametric and semiparametric estimation are significantly
different. This is in line with Bhalotra & Sanhueza (2002) who investigate returns to schooling for women
in South Africa. The coefficient estimates on schooling are considerably higher when using semiparametric
estimation proposed by Newey (1999) and Klein & Spady (1993) than for the parametric specification
based on Heckman (1979). Thus, empirical evidence suggests that tight parametrization might imply
inconsistent estimation.
⁴ However, in their empirical application they use a partially linear model.
⁵ Note that all methods discussed so far are concerned with the estimation of conditional (i.e., local) effects,
whereas we are interested in unconditional ATEs and QTEs.
All studies discussed so far estimate conditional mean effects. Comparably few researchers have attempted
to estimate conditional quantile effects to identify effect heterogeneity at different points in the conditional
outcome distribution. Using data from the US March Current Population Survey, Buchinsky (1998, 2001)
estimates female wage equations applying the methods of Ichimura (1993) and Newey (1999) to the quantile regression framework⁶. The problem inherent to this literature is that independence between observed
covariates and unobservables conditional on the selection probability, an assumption on which the consistency of virtually all point estimators relies, implies that conditional quantile and mean effects are of the same
magnitude. Conditional quantile effects are by assumption equal to conditional mean effects and quantile
regression does not yield more or different information about the effects if independence holds. If effects
are found to differ across conditional quantiles in empirical applications, this merely points to the violation
of the independence assumption and to the inconsistency of the estimator.
If conditional independence is likely to fail or no exclusion restriction is at hand, partial identification⁷
represents a valuable alternative to point identification. Recent empirical applications of partial treatment
effect identification in the presence of sample selection are provided in Lee (2005) and Lechner & Melly
(2007). Similarly to the latter, our framework is characterized by the double selection problem (i.e., selection (i) into the observed subpopulation and (ii) into treatment), but the difference is that we impose
conditional independence and the availability of an exclusion restriction in order to obtain point identification. Note that sample selection poses similar problems to identification as non-ignorable sample attrition
and item non-response⁸ related to unobservables, see for instance Wooldridge (2002). Thus, our estimator can also be applied to attrition and non-response related to unobservables as discussed in Fitzgerald,
Gottschalk & Moffitt (1998).
To conclude this section, it is important to highlight the differences between the double selection
problem considered in this paper and the control function literature promoted by Heckman & Navarro-Lozano (2004), Heckman & Vytlacil (2005), and Heckman, Urzua & Vytlacil (2006), among others, which
constitutes an important strand of the evaluation literature devoted to the identification of local average
treatment effects (LATEs). In the latter case, sample selection models have been applied to problems
where treatment assignment is endogenous even conditional on observed covariates. The selection equation
characterizes the selection into treatment based on a continuous and exogenous instrument that shifts the
treatment probability but is independent of the unobservables. This allows identifying the LATE for the
subpopulation of compliers, which are defined as those individuals who switch their treatment status due
to a shift in the instrument⁹. Vytlacil (2002) shows that this approach is in principle equivalent to the
nonparametric LATE framework advocated by Imbens & Angrist (1994).
⁶ For a discussion of quantile regression, see Koenker and Bassett (1978, 1982) and Koenker (2005).
⁷ See Manski (1989, 1994) for a general discussion of partial identification in the presence of sample selection.
⁸ Identification in the presence of attrition and non-response is discussed in Robins & Rotnitzky (1995), Robins,
Rotnitzky & Zhao (1995), and Robins & Rotnitzky (1997), among others.
However, the problem considered in this paper is different, despite the use of sample selection models
in both set ups. After all, LATE identification is based on switching regression models where outcomes
are not censored and selection bias stems from the endogenous treatment decision. In this paper, it is
assumed that the treatment assignment would be ignorable conditional on the covariates, if all outcomes
were observed. Bias is, however, introduced by non-random selection into the observed subpopulation.
Thus, a selection model along with an exclusion restriction is used to characterize the probability of being
observed. Effect identification in the observed subpopulation is based on the conditional independence
assumption given the covariates and the selection probability. By conditioning on the covariates and the
sample selection propensity score, we are back in a quasi-random evaluation set up which is distinct from
the endogenous treatment assumption in the LATE related control function literature.
3 Model and identifying assumptions
In this section, we introduce a nonseparable sample selection model, where the latent outcome is an unknown function of two observed components, the treatment of interest and a vector of observed covariates,
and an unobserved term. $Y^*$ denotes the latent outcome that is only partially observed, conditional on
selection. Let D denote the treatment which might be discrete or continuous, and X, U the covariates
and the unobserved term, respectively. Throughout the paper we will assume to have an i.i.d. sample of
n units, indexed by i = 1, ..., n. The latent outcome equation can be written as
$$Y_i^* = \varphi(D_i, X_i, U_i), \tag{1}$$
where ϕ(·) is an unknown function. We observe $\{X_i, D_i\}$ for all units in the sample, whereas $Y_i^*$ is only
known for some non-random subsample. We denote the observed outcome as Y, which is
$$Y_i = Y_i^* \text{ if } S_i = 1 \text{ and not observed otherwise.} \tag{2}$$
S is a binary selection indicator determined by the unknown function ζ(·):
$$S_i = I\{\zeta(D_i, X_i, Z_i) \geq V_i\}. \tag{3}$$
I{·} denotes the indicator function. In this nonparametric selection equation, Z represents a one or multidimensional instrument which is observable for all units. Theoretically, Z can be continuous, discrete, or
both. In any case it has to be relevant in the sense that it contributes importantly to S such that a ceteris
paribus change in Z shifts the selection probability considerably¹⁰. The result that point identification
is not ruled out for a discrete Z may seem surprising and is therefore briefly discussed in appendix A.4.
V is an unobservable term that is not independent of U. By assumption, S is a function of at least one
element that is excluded from ϕ. Identification will crucially hinge on the availability of such an exclusion
restriction. However, the model is fairly general and does not impose parametric restrictions such as linearity
or additivity on D, X, and U. To identify the causal effects of D, we utilize the potential outcome
framework advocated by Rubin (1974), among others. In the subsequent discussion, we will focus on point
identification of treatment effects for the observed population, i.e., conditional on being selected. Denote
the potential outcome for individual i and some hypothetical treatment D = d as
$$Y_i^d = \varphi(d, X_i, U_i).$$
⁹ It therefore has to be assumed that the treatment varies monotonically with the instrument.
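To make the role of the correlated unobservables concrete, equations (1) to (3) can be simulated. The sketch below is purely illustrative: the functional forms chosen for ϕ and ζ, the error distributions, and all variable names are assumptions made for this example and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Observables: covariate X, instrument Z, and a binary treatment that depends on X only.
x = rng.normal(size=n)
z = rng.normal(size=n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))

# Unobservables: V enters selection, U enters the outcome; they are correlated (0.8).
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)


def phi(d, x, u):
    """Hypothetical nonseparable outcome function, cf. equation (1)."""
    return np.exp(0.3 * d + 0.2 * x) + (1.0 + 0.5 * d) * u


def zeta(d, x, z):
    """Hypothetical selection index, cf. equation (3)."""
    return 0.5 * d + 0.5 * x + z


y_star = phi(d, x, u)                   # latent outcome, equation (1)
s = (zeta(d, x, z) >= v).astype(int)    # selection indicator, equation (3)
y = np.where(s == 1, y_star, np.nan)    # observed outcome, equation (2)

mean_u_all = u.mean()
mean_u_selected = u[s == 1].mean()
print("share selected:          ", s.mean())
print("mean of U, all units:    ", mean_u_all)
print("mean of U, selected only:", mean_u_selected)
```

Because units with low V are more likely to be selected and U is positively correlated with V in this design, the mean of U among selected units lies below its population mean of zero. This is the source of the selection bias discussed in the text.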
In the labor market literature, D typically represents participation in a training program or years of
schooling whereas Y represents observed wages. We want to learn about the unconditional average
and quantile treatment effects¹¹ (ATE, QTE) of D on Y by considering the differences in potential
outcomes for distinct hypothetical treatments. Both ATEs and QTEs bear intuitive interpretations for
policy recommendations. The ATE represents the mean effect for some population and is simply the
average difference in potential outcomes for distinct treatments. The QTE is the effect evaluated at a
particular point in the population’s potential outcome distribution, e.g., at the median or rank 0.75.
It simply measures the horizontal distance between two potential outcome distributions for distinct
treatments at the predefined rank. The quantile framework also allows evaluating inequality treatment
effects (ITE), which are differences in inequality measures (e.g., the interquartile range) of the potential
outcome distributions. This allows analyzing whether a treatment increases or decreases inequality, see
Firpo (2007b).
To formalize the discussion, let $d'$, $d''$, with $d'' < d'$,¹² denote two distinct treatments. ATEs and QTEs in
the observed population are defined as
$$\Delta = E[Y^{d'}] - E[Y^{d''}],$$
$$\Delta_{d'} = E[Y^{d'}|D = d'] - E[Y^{d''}|D = d'],$$
$$\Delta^{\tau} = Q^{\tau}_{Y^{d'}} - Q^{\tau}_{Y^{d''}},$$
$$\Delta^{\tau}_{d'} = Q^{\tau}_{Y^{d'}|D=d'} - Q^{\tau}_{Y^{d''}|D=d'}.$$
$\Delta$ denotes the ATE for the observed population, $\Delta_{d'}$ is the average treatment effect on the treated (ATET),
i.e. conditional on being observed and receiving treatment $D = d'$. Analogously, $\Delta^{\tau}$, $\Delta^{\tau}_{d'}$ represent the QTE
and quantile treatment effect on the treated (QTET) in the observed subpopulation. $\tau \in [0, 1]$ denotes
the rank in the potential outcome distribution at which the effects are evaluated. E.g., τ = 0.5 yields the
median effect of the treatment. The unconditional quantiles are defined as $Q^{\tau}_{Y^d} = \inf\{y : \Pr(Y^d \leq y) \geq \tau\}$
and $Q^{\tau}_{Y^d|D=d} = \inf\{y : \Pr(Y^d \leq y|D = d) \geq \tau\}$. In the remainder of this paper, discussion will focus on the
ATE and QTE, as it is straightforward to obtain analogous results for the identification and estimation of
the ATET and the QTET.
¹⁰ Therefore, discrete and even binary instruments might in principle be used if changes from Z = 0 to Z = 1
largely affect the selection probability. However, powerful discrete instruments are most likely impossible to find
in reality.
¹¹ By ‘unconditional’ we mean the global effects for the whole population of interest. In contrast, treatment
effects for covariates given would be local, i.e., conditional on specific values of the covariates.
¹² In the binary treatment framework, e.g. participation vs non-participation in a training program, $d' = 1$ and
$d'' = 0$.
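The definitions above can be illustrated numerically. The sketch assumes, counterfactually, that samples of both potential outcomes are available, which is only possible in a simulation. The function names are hypothetical, and the quantile follows the infimum definition given in the text.

```python
import math

import numpy as np


def quantile_inf(sample, tau):
    """Empirical quantile inf{y : F_n(y) >= tau} for tau in [0, 1]."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    # Rounding guards against floating point noise in tau * n before taking the ceiling.
    k = max(math.ceil(round(tau * len(ordered), 10)) - 1, 0)
    return ordered[k]


def ate(y_high, y_low):
    """Difference in mean potential outcomes."""
    return np.mean(y_high) - np.mean(y_low)


def qte(y_high, y_low, tau):
    """Horizontal distance between the two outcome distributions at rank tau."""
    return quantile_inf(y_high, tau) - quantile_inf(y_low, tau)


def iqr_effect(y_high, y_low):
    """Inequality treatment effect measured by the interquartile range."""
    iqr_high = quantile_inf(y_high, 0.75) - quantile_inf(y_high, 0.25)
    iqr_low = quantile_inf(y_low, 0.75) - quantile_inf(y_low, 0.25)
    return iqr_high - iqr_low


# Toy potential outcomes under the two treatments (hypothetical numbers).
y_high = [1.0, 2.0, 3.0, 4.0]
y_low = [0.0, 0.0, 1.0, 1.0]
print(ate(y_high, y_low), qte(y_high, y_low, 0.5), iqr_effect(y_high, y_low))
```

In this toy example the treatment shifts the mean and the median by the same amount but also widens the interquartile range, so the ATE alone would hide the change in inequality.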
The problem inherent to any causal analysis is that only one state of the world, i.e. the realized
outcome and treatment, is observed. In the sample selection model, this is even conditional on being
selected. In the observed subpopulation the realized outcome is defined as $Y_i = \sum_{d \in \mathcal{D}} Y_i^d I\{D_i = d\}$
for discrete treatments and $Y_i = \int_{d \in \mathcal{D}} Y_i^d I\{D_i = d\}\,dd$ for continuous treatments, respectively, where $\mathcal{D}$
denotes the (nonnegative and finite) space of possible treatments or treatment doses. To
draw inference on the unobserved potential outcomes, any regression method needs to impose untestable identifying
assumptions. The assumptions proposed in this paper are less restrictive than those in the parametric and
semiparametric sample selection literature, but more restrictive than those in the treatment evaluation
literature based on partial identification, as our framework allows for point identification. Briefly speaking,
identification is based on three key assumptions: (i) Conditional independence between potential outcomes
and treatments in the total population, (ii) the availability of an exclusion restriction to identify the
selection probability into the observed population, and (iii) conditional independence of observables and
unobservables given the sample selection propensity score.
Assumption 1: Conditional independence in the total population.
(1a) $Y^{*d'}, Y^{*d''} \perp D \mid X = x$, $\forall x \in \mathcal{X}$ (conditional independence of the latent outcome),
(1b) $0 < \Pr(D = d|X) < 1$ $\forall d \in \mathcal{D}$ (common support of D in X),
(1c) stable unit treatment value assumption (SUTVA).
The conditional independence assumption (CIA) or selection on observables assumption is frequently
imposed in the treatment evaluation literature, see for instance Heckman, Ichimura & Todd (1997), Lechner
(1999), and Wunsch & Lechner (2008). (1a) states that the potential latent outcome is independent of the
treatment given the observed covariates X¹³. This implies that all factors jointly affecting the treatment
assignment and the latent outcome can be controlled for by conditioning on the covariates. The difference
to conventional evaluation studies relying on the CIA is that the outcome is not fully observed. (1b) is
the classical common support assumption and states that the treatment must not be perfectly
predicted conditional on the covariates. (1c) states that the potential outcome for any unit i is stable
in the sense that it always takes the same value, independent of treatment allocations in the rest of the
population, see Rubin (1990) for further details. Assumption 1 implies that
$$E[Y^{*d'}|D = d'', X = x] = E[Y^{*d'}|D = d', X = x] = E[Y^*|D = d', X = x],$$
$$E[Y^{*d''}|D = d', X = x] = E[Y^{*d''}|D = d'', X = x] = E[Y^*|D = d'', X = x].$$
Thus, the ATE for the total population conditional on X is $\Delta^*(x) = E[Y^{*d'}|X = x] - E[Y^{*d''}|X = x] = E[Y^*|D = d', X = x] - E[Y^*|D = d'', X = x]$. If the CIA holds, the effect of D on $Y^*$ could be
identified if $Y^*$ was fully observed. As this is not the case, we concentrate on the observed outcome Y.
To ease notation, we define $E[Y|D = d, X = x] = E[Y^*|D = d, X = x, S = 1]$. If selection is ignorable
conditional on X, then $E[Y|D = d, X = x] = E[Y^*|D = d, X = x]$ as $\int E[Y|D = d, X = x, U]\,dF_U = \int E[Y^*|D = d, X = x, U]\,dF_U$, where F denotes the cdf. This immediately implies that the treatment
effect conditional on X and S = 1 is identified by $\Delta(x) = E[Y^{d'}|X = x] - E[Y^{d''}|X = x]$. By integration
over X we obtain the ATE in the observed population, $\Delta = \int \Delta(x)\,dF_{X|S=1}$¹⁴. If unobservables V and U
are not independent conditional on X, the effect of D on Y is confounded in the observed sample. Point
identification requires the availability of an instrument Z that predicts selection S but is not related to
the potential outcomes conditional on D, X. We therefore make the following assumption.
¹³ In appendix A.4, we will briefly discuss identification under random treatment assignment, as in randomized
experiments and lotteries, such that one need not condition on X.
Assumption 2: Exclusion restriction.
(2a) $\mathrm{Cov}(Z, S|X, D) \neq 0$ and $Y^* \perp Z \mid D, X$ (exclusion restriction),
(2b) $0 < \Pr(S = 1|D = d) < 1$, $\forall d \in \mathcal{D}$ (common support of S in D),
(2c) $(U, V) \perp (D, X, Z) \mid \Pr(S = 1|D, X, Z)$ (conditional independence of unobservables and observables
given the selection probability),
(2d) $F_V(t)$, the cdf of V, is strictly monotonic in the argument t.
Assumption (2a) states that Z shifts S but is independent of the latent outcome given D, X. Therefore,
direct effects of Z on $Y^*$ are ruled out¹⁵. Together with assumption 1, this implies that $F_{Y^*|D=d,X=x} = F_{Y^*|D=d,X=x,Z}$ for all values of Z, where $F_{\cdot|\cdot}$ denotes the conditional cdf. (2b) rules out that the
treatment is a perfect predictor for sample selection. To see the usefulness of this assumption, consider the
case that latent realizations with D = d are never selected whatever values X, Z take. Obviously, treatment
d cannot be evaluated in the observed population and neither can any counterfactual population be defined
upon d. Likewise, perfect positive selection will cause identification problems, as discussed further below.
By (2c), we impose that D, X, Z are jointly independent of the unobservables U, V conditional on the
sample selection propensity score. Even though conditional heteroscedasticity of unknown form is still
allowed for in this framework, any dependence between observables and unobservables is restricted to be
captured by the sample selection propensity score. (2c) is for instance violated if U is related to D in
the total population. In this case, the selection bias cannot be controlled for by conditioning on $\Pr(S = 1|D, X, Z)$, as unobserved interaction terms of U and D drive the selection probability¹⁶. Assumptions
similar to (2c) are crucial for point identification in any selection model of both parametric and general
form. Its violation implies the inconsistency of virtually all point estimators proposed in the literature.
Nevertheless, (2c) is considerably weaker than most analogous assumptions made in the literature, as it
does not impose parametric restrictions on ζ. By monotonicity assumption (2d) it holds that $\Pr(S = 1|D, X, Z) = \Pr(\zeta(D, X, Z) \geq V) = F_V(\zeta(D, X, Z))$. Thus, the likelihood to be observed increases
monotonically in ζ. Note that monotonicity is implicitly assumed in any linear index restriction frequently
used in the sample selection literature. For notational ease, let $W \equiv (D, X, Z)$ and $\Pr(S = 1|D, X, Z) \equiv p(W)$.
¹⁴ The ATE for the total population is identified, too, given that $f_X$, the pdf of X, is observed for S = 0 and
that there is common support in $f_X$ and $f_{X|S=1}$. Then, $\Delta^* = \int \Delta(x)\,dF_X$.
¹⁵ A test for the validity of exclusion restrictions related to discrete instruments was recently proposed by Kitagawa
(2008).
If (2c) and (2d) hold, U and D are independent conditional on p(W ) in the observed population. This
can be shown by applying the proof of theorem 1 in Newey (2007). Let a(U ) denote any bounded function
of U. Note that {S = 1} = {F_V^{-1}(p(W)) ≥ V}. Then,
\[
\begin{aligned}
E[a(U) \mid D, p(W), S = 1]
&= E\big[ E[a(U) \mid V, D, X, Z] \,\big|\, D, p(W), F_V^{-1}(p(W)) \geq V \big] \\
&= E\big[ E[a(U) \mid V] \,\big|\, D, p(W), F_V^{-1}(p(W)) \geq V \big] \\
&= E\big[ E[a(U) \mid V] \,\big|\, p(W), F_V^{-1}(p(W)) \geq V \big] \\
&= E\big[ E[a(U) \mid V, p(W)] \,\big|\, p(W), S = 1 \big] = E[a(U) \mid p(W), S = 1].
\end{aligned}
\]
If assumptions (1) and (2) are satisfied, selection bias in the observed population can be corrected for by
conditioning on p(W ). Thus, the identification of treatment effects requires the inclusion of the sample
selection propensity score as additional conditioning variable. To see this, note that the treatment effect
in the observed population given X and p(W ) is defined as
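\[
\begin{aligned}
\Delta(x, p(w)) &= \int \varphi(d', x, p(w), U)\, dF_{U \mid X = x, p(W) = p(w), S = 1} - \int \varphi(d'', x, p(w), U)\, dF_{U \mid X = x, p(W) = p(w), S = 1} \\
&= E[Y^{d'} \mid X = x, p(W) = p(w)] - E[Y^{d''} \mid X = x, p(W) = p(w)].
\end{aligned}
\]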
E[Y^{d'} | X = x, p(W) = p(w)] is the expected potential outcome for a hypothetical treatment d' given X and p(W). By the independence of U and D given p(W) implied by (2c) and (2d), it holds that
\[
\begin{aligned}
E[Y^{d'} \mid X = x, p(W) = p(w)] &= \int \varphi(d', x, p(w), U)\, dF_{U \mid X = x, p(W) = p(w), S = 1} \\
&= \int \varphi(d', x, p(w), U)\, dF_{U \mid D = d', X = x, p(W) = p(w), S = 1} \\
&= E[Y \mid D = d', X = x, p(W) = p(w)].
\end{aligned}
\]
16 Huber & Melly (2008) provide a more detailed discussion of this issue in a semiparametric framework.
Hence, the expected potential outcome is equal to the expected conditional outcome given D = d'. The same applies to E[Y^{d''} | X = x] so that E[Y | D = d', X = x, p(W) = p(w)] − E[Y | D = d'', X = x, p(W) = p(w)] = ∆(x, p(w)) and
\[
\begin{aligned}
E[Y^{d'} \mid D = d'', X = x, p(W) = p(w)] &= E[Y^{d'} \mid D = d', X = x, p(W) = p(w)] = E[Y \mid D = d', X = x, p(W) = p(w)], \\
E[Y^{d''} \mid D = d', X = x, p(W) = p(w)] &= E[Y^{d''} \mid D = d'', X = x, p(W) = p(w)] = E[Y \mid D = d'', X = x, p(W) = p(w)].
\end{aligned}
\]
The ATE ∆ is identified by integrating over the marginal distributions of X and p(W).
\[
\begin{aligned}
&\int\!\!\int \big[ E[Y \mid D = d', X = x, p(W) = p(w)] - E[Y \mid D = d'', X = x, p(W) = p(w)] \big]\, dF_{X \mid p(W) = p(w), S = 1}\, dF_{p(W) \mid S = 1} \\
&\quad = \int\!\!\int \big[ E[Y^{d'} \mid X = x, p(W) = p(w)] - E[Y^{d''} \mid X = x, p(W) = p(w)] \big]\, dF_{X \mid p(W) = p(w), S = 1}\, dF_{p(W) \mid S = 1} \\
&\quad = E[Y^{d'}] - E[Y^{d''}] = \Delta.
\end{aligned}
\tag{4}
\]
Identification of QTEs is analogous, but requires that the conditional quantiles of interest are unique.
I.e., the density in the neighborhood of the quantiles must be bounded away from zero such that each
quantile corresponds to exactly one particular rank in the conditional distribution. Secondly, for an
intuitive interpretation of QTEs, the rank stability assumption has to be satisfied across treatments.
It states that individuals occupy the same rank in the respective conditional outcome distribution for
different treatments, see for instance Firpo (2007a) for further discussion. Let Q^τ_A denote the quantile at rank τ ∈ [0, 1] for some variable A, Q^τ_A = inf{a : F_A(a) ≥ τ}. Then, F_A^{-1}(τ) = Q^τ_A, i.e. the τth quantile of A is the inverse of its cdf evaluated at τ. Let Q^τ_{Y^{d'}}(x, p(w)) denote the τth conditional quantile of the potential outcome Y^{d'} given X = x, p(W) = p(w), and S = 1. By assumption 2,
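\[
\begin{aligned}
F_{Y \mid D, X, p(W)}(y \mid d', x, p(w)) &= \int I\{\varphi(d', x, p(w), U) \leq y\}\, dF_{U \mid D = d', X = x, p(W) = p(w), S = 1} \\
&= \int I\{\varphi(d', x, p(w), U) \leq y\}\, dF_{U \mid X = x, p(W) = p(w), S = 1} \\
&= \big(Q^{\tau}_{Y^{d'}}\big)^{-1}(x, p(w)).
\end{aligned}
\]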
The unconditional quantile of the potential outcome is identified as the inverse of the integration over the
marginal distributions of X and p(W ).
\[
\int\!\!\int \big(Q^{\tau}_{Y^{d'}}\big)^{-1}(x, p(w))\, dF_{X \mid p(W) = p(w), S = 1}\, dF_{p(W) \mid S = 1} = \big(Q^{\tau}_{Y^{d'}}\big)^{-1}.
\tag{5}
\]
The difference between the quantiles with distinct treatments yields the QTE, ∆τ = Q^τ_{Y^{d'}} − Q^τ_{Y^{d''}}.
Identification of ∆, ∆τ in the observed population hinges on common support of the treatment in X
and p(W ). We therefore make a further assumption:
Assumption 3: Common support in the selected sample.
(3a) c < Pr(D = d|X, p(W ), S = 1) < 1 − c ∀d ∈ D, c > 0 (common support of D in X and p(W )).
(3) implies that the treatment probability is bounded away from zero in the observed population conditional
on the selection probability and observed covariates. It is obvious that (2b) is a necessary condition for
(3) to hold. To see this point, consider the case that (2b) is violated by assuming that all individuals
receiving treatment D = d are selected, i.e. D = d implies p(W ) = 1, independent of X, Z. Furthermore,
let p(W ) < 1 for any D ≠ d. It follows that Pr(D = d|X = x, p(W ) = p(w)) = Pr(D = d|X = x, p(W ) =
1) = 1 ∀ x ∈ X , such that p(W ) = 1 perfectly predicts D and the common support assumption fails.
At this point, let us assume that (2b) and (3) are satisfied and consider the special case that there exist
some observations with p(W ) = 1. I.e., even though 0 < Pr(S = 1|D = d) < 1 holds, Pr(S = 1|W ) = 1 for
some triple(s) w = (d, x, z). Obviously, selection bias is not an issue for those observations and it follows
that E[Y |D = d, X = x, p(W ) = 1] = E[Y ∗ |D = d, X = x, p(W ) = 1]. This allows identifying local
treatment effects for the subpopulation with p(W ) = 1. It remains a priori unclear why this particular
population should be of any policy interest. However, if one is willing to impose the strong restriction of
treatment effect homogeneity across selection probabilities, i.e. ∆(x, p(w)) = ∆(x) ∀ p ∈ P, treatment
effects can be identified for other populations as well if there is common support in X. For instance, the ATE for the observed population is ∆ = ∫ ∆(x|p(W) = 1) dFX|S=1 for sufficient overlap in fX|p(W)=1 and fX|S=1. Identification based on p(W) = 1 is known as ‘identification at infinity’ and was discussed by Heckman (1990) and Andrews & Schafgans (1998). However, in empirical applications, observations with
selection probabilities close to one might be rare. Furthermore, effect homogeneity in p(W ) is a strong
assumption that might not hold in reality. We therefore concentrate on a more general identification
strategy using the whole distribution of p(W ).
After having established the identifying assumptions, we will now propose expressions for ∆, ∆τ based
on inverse probability weighting which can be used to build sample analogues required for estimation. Let
πd (X, p(W )) denote the treatment propensity score, i.e., the probability of receiving treatment D = d
conditional on X and p(W ), πd (X, p(W )) ≡ Pr(D = d|X, p(W ))17 . To control for selection into treatment,
we will henceforth condition on the πd (X, p(W )) instead of X and p(W ). Rosenbaum & Rubin (1983) have
shown that conditioning on the treatment propensity score is equivalent to conditioning on the covariates
directly, as both are balancing scores in the sense that they adjust the distributions of covariates in the
groups of treated and controls. However, conditioning on πd(X, p(W)) has the advantage that practical problems related to the nonparametric estimation using high dimensional covariates, e.g., empty cells for particular combinations of covariate values, can be circumvented.
17 For a binary treatment, the treatment propensity score is π1(X, p(W)) and the nontreatment propensity score is π0(X, p(W)) = 1 − π1(X, p(W)).
PROPOSITION 1 (Identification of mean effects).
Under assumptions 1, 2, and 3, the ATE in the observed subpopulation for two treatments d' ≠ d'' is identified by
\[
\begin{aligned}
\Delta &= E\left[\frac{S \cdot I\{D = d'\} \cdot Y^{*}}{p(W) \cdot \pi_{d'}(X, p(W))}\right] - E\left[\frac{S \cdot I\{D = d''\} \cdot Y^{*}}{p(W) \cdot \pi_{d''}(X, p(W))}\right] \\
&= E\left[\frac{I\{D = d'\} \cdot Y}{\pi_{d'}(X, p(W))}\right] - E\left[\frac{I\{D = d''\} \cdot Y}{\pi_{d''}(X, p(W))}\right].
\end{aligned}
\tag{6}
\]
Proof: See appendix A.1.
The ATE is obtained by reweighing the outcome of each individual in the observed population by the
inverse of the conditional treatment probability given X and p(W ). Similar results are obtained for QTEs,
as both parameters are functions of the distribution of Y .
PROPOSITION 2 (Identification of quantiles).
Under assumptions 1, 2, and 3, Q^τ_{Y^d} is an implicit function of
\[
E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_{d}(X, p(W))} \cdot I\{Y^{*} \leq Q^{\tau}_{Y^{d}}\}\right]
= E\left[\frac{I\{D = d\}}{\pi_{d}(X, p(W))} \cdot I\{Y \leq Q^{\tau}_{Y^{d}}\}\right]
= F_{Y^{d}}(Q^{\tau}_{Y^{d}}) = \tau.
\tag{7}
\]
Proof: See appendix A.2.
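It follows that
\[
Q^{\tau}_{Y^{d}} = \operatorname*{arg\,zero}_{y}\; E\left[\frac{I\{D = d\}}{\pi_{d}(X, p(W))} \cdot I\{Y < y\} - \tau\right],
\]
which is a first order condition to
\[
Q^{\tau}_{Y^{d}} = \operatorname*{arg\,min}_{y}\; E\left[\frac{I\{D = d\}}{\pi_{d}(X, p(W))} \cdot \rho_{\tau}(Y - y)\right].
\tag{8}
\]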
ρτ(·) is the check function, an asymmetric loss function, suggested by Koenker & Bassett (1978) for quantile estimation, ρτ(u) = u · (τ − I{u < 0}). It follows that ∆τ = Q^τ_{Y^{d'}} − Q^τ_{Y^{d''}}. Expressions (6) and (8) are quite similar to the identification results obtained by Hirano, Imbens & Ridder (2003)18 and Firpo (2007a),
respectively. The difference is, however, that the latter assume unconfoundedness of the treatment effect
conditional on the treatment propensity score with respect to X alone, whereas we have to condition
on both X and p(W ) to control for selection bias into the observed population and into treatment. We
therefore extend the approach of Hirano et al. (2003) and Firpo (2007a) to the case of sample selection by
including the selection probability as additional covariate in the treatment propensity score.
18 The IPW estimator analyzed by Hirano et al. (2003) was first proposed by Horvitz & Thompson (1952).
4 Estimation
Both p(W) and πd(X, p(W)) are unknown to the researcher and have to be estimated in order to be used in the weighting functions of the estimators of ∆, ∆τ. Let p̂(W), π̂(X, p̂(W)), ∆̂, ∆̂τ denote the estimates of the respective true parameters. Our estimation procedure can be described as follows:
1) Estimate p̂(W) by regressing S on D, X, Z,
2) estimate π̂d(X, p̂(W)) by regressing D on X and p̂(W),
3) estimate ∆̂, ∆̂τ by the sample analogues of (6) and (8).
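The three steps can be sketched in a few lines of Python for a binary treatment. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: both propensity scores are fitted by a hand-written logit (one of the parametric options mentioned in the text), the treatment propensity score is fitted on the selected sample, and the names `fit_logit` and `ipw_ate` are hypothetical.

```python
import numpy as np

def fit_logit(features, y, n_iter=50):
    """Fit a logit by Newton-Raphson and return fitted probabilities.
    Assumption: a logit stands in for the probit used in the simulations."""
    design = np.column_stack([np.ones(len(y)), features])
    coef = np.zeros(design.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ coef))
        grad = design.T @ (y - prob)
        hess = (design * (prob * (1.0 - prob))[:, None]).T @ design
        coef += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-design @ coef))

def ipw_ate(y, d, s, x, z):
    """IPW estimate of the ATE in the observed population (binary D).
    y is only read where s == 1; d and s are arrays of 0.0 / 1.0."""
    # Step 1: sample selection propensity score, S regressed on (D, X, Z).
    p_hat = fit_logit(np.column_stack([d, x, z]), s)
    sel = s == 1
    # Step 2: treatment propensity score, D regressed on (X, p_hat),
    # fitted on the selected sample (an assumption of this sketch).
    pi1_hat = fit_logit(np.column_stack([x[sel], p_hat[sel]]), d[sel])
    # Step 3: sample analogue of (6), i.e. the weighted difference in (10).
    w1 = d[sel] / pi1_hat
    w0 = (1.0 - d[sel]) / (1.0 - pi1_hat)
    return np.sum((w1 - w0) * y[sel]) / sel.sum()
```

Passing the selection score into the second model is what makes the propensity score "nested"; dropping `p_hat` from step 2 would give the standard IPW estimator without sample selection correction.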
The sample selection propensity score p(W) may be estimated by parametric (e.g., logit or probit), semiparametric (e.g., Klein and Spady, 1993, Ichimura, 1993), or nonparametric estimators. The latter seem attractive if the structural form of p(W) is not known (which is usually the case) and the dimension of the continuous elements in W is not too high. Hirano et al. (2003) suggest estimating p(W) by a logistic
power series approximation. I.e., they use a series of functions of W to approximate the log-odds ratio
of the selection probability. Another class of nonparametric estimators are kernel methods such as local
constant (Nadaraya-Watson) regression, see Ahn & Powell (1993) and Kumar (2006), or local logit19 .
All these methods are conditional mean estimators, but as pointed out by Li, Racine & Wooldridge
(2009), conditional probability estimators may also be used when dealing with binary outcomes. This is
obvious from the fact that
\[
E[S \mid W = w] = \Pr(S = 1 \mid W = w) = \frac{f_{S,W}(1, w)}{f_{W}(w)} = p(w),
\tag{9}
\]
where f(·) denotes the pdf. An estimator of the sample selection propensity score is f̂S,W(1, w) / f̂W(w), where f̂(·) is the estimated pdf. Following Hall, Racine & Li (2004), fS,W(s, w) / fW(w) can be consistently estimated by
\[
\hat{f}_{S,W}(s, w) = n^{-1} \sum_{i=1}^{n} \kappa(w, W_i, h_n)\, \Lambda(s, S_i, h_n)
\quad \text{and} \quad
\hat{f}_{W}(w) = n^{-1} \sum_{i=1}^{n} \kappa(w, W_i, h_n).
\]
κ(·) and Λ(·)
denote generalized kernel functions related to continuous and discrete variables, see Hall et al. (2004) for
more details. hn denotes the vector of bandwidths for the continuous and discrete elements in W and S,
respectively, and might be determined by least squares or maximum likelihood cross validation.
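As a rough illustration of the ratio estimator in (9), the following Python sketch evaluates f̂S,W(1, w) / f̂W(w) on a set of points. It makes simplifying assumptions that are not taken from the paper: all elements of W are treated as continuous and smoothed with a Gaussian product kernel, the discrete kernel for S is of the Aitchison-Aitken type with smoothing parameter `lam`, and bandwidths are supplied by the user rather than chosen by cross validation. The function name `kernel_selection_score` is hypothetical.

```python
import numpy as np

def kernel_selection_score(w_eval, w, s, bandwidth, lam=0.0):
    """Estimate Pr(S = 1 | W = w_eval) as f_hat_{S,W}(1, w) / f_hat_W(w).
    w_eval: (m, q) evaluation points, w: (n, q) data, s: (n,) binary,
    bandwidth: scalar or (q,) array, lam: smoothing for the binary S."""
    # Gaussian product kernel for W (kappa in the text); constants that
    # are common to numerator and denominator cancel in the ratio.
    diff = (w_eval[:, None, :] - w[None, :, :]) / bandwidth
    kappa = np.exp(-0.5 * np.sum(diff ** 2, axis=2))
    # Discrete kernel for S evaluated at s = 1 (Lambda in the text):
    # weight 1 - lam if S_i = 1 and lam otherwise.
    big_lambda = np.where(s == 1, 1.0 - lam, lam)
    joint = kappa @ big_lambda      # proportional to f_hat_{S,W}(1, w)
    marginal = kappa.sum(axis=1)    # proportional to f_hat_W(w)
    return joint / marginal
```

With `lam = 0` the expression reduces to local constant (Nadaraya-Watson) regression of S on W, which is the conditional mean estimator mentioned earlier in the text.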
Our framework explicitly allows for multiple treatments as discussed in Imbens (2000) and Lechner (2001) or different treatment doses of a continuous treatment as considered by Hirano & Imbens (2004). Let us assume that there is a finite set of discrete treatment choices, D ≡ {0, 1, ..., G} and G < ∞. The propensity scores π̂d(X, p̂(W)) for all d ∈ D might be estimated simultaneously by multinomial probit or logit20. Alternatively, Lechner (2001) suggests splitting estimation into a series of binomial models, where the propensity score of each treatment relative to every other treatment is estimated by several binary choice models. This procedure is computationally less costly than multinomial probit and also more robust, as a misspecification of one choice model does not spill over to all other specifications. Thus, the methods used for the estimation of p(·) might also be used for the estimation of πd(·).
19 Frölich (2001) investigates the finite sample properties of (global) logit, local constant, local linear, and local logit estimators for a binary outcome, 4 continuous covariates, and 10 binary regressors. Local logit appears to be substantially more appropriate than (global) logit, whenever the model specification is not encompassed by the logit model, whereas local constant and local linear estimation perform worse than logit in the specifications considered. In line with these results, Monte Carlo evidence in Frölich (2006) points to the superiority of local logit compared to Klein and Spady and local constant estimation, at least for the data generating processes considered.
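Finally, we use the sample analogue of expression (6) to estimate the ATE for d' > d'' by
\[
\begin{aligned}
\hat{\Delta} &= \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i \mid S = 1}^{n} \frac{I\{D_i = d'\} \cdot Y_i}{\hat{\pi}_{d'}(X_i, \hat{p}(W_i))} - \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i \mid S = 1}^{n} \frac{I\{D_i = d''\} \cdot Y_i}{\hat{\pi}_{d''}(X_i, \hat{p}(W_i))} \\
&= \sum_{i \mid S = 1}^{n} \hat{\omega}_{d', i} \cdot Y_i - \sum_{i \mid S = 1}^{n} \hat{\omega}_{d'', i} \cdot Y_i = \sum_{i \mid S = 1}^{n} \big[ (\hat{\omega}_{d', i} - \hat{\omega}_{d'', i}) \cdot Y_i \big],
\end{aligned}
\tag{10}
\]
where the weighting function ω̂d,i is defined as
\[
\hat{\omega}_{d, i} = \frac{1}{\sum_{j=1}^{n} S_j} \cdot \frac{I\{D_i = d\}}{\hat{\pi}_{d}(X_i, \hat{p}(W_i))}.
\]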
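Similarly, the QTE estimator is
\[
\hat{\Delta}^{\tau} = \hat{Q}^{\tau}_{Y^{d'}} - \hat{Q}^{\tau}_{Y^{d''}},
\]
where
\[
\hat{Q}^{\tau}_{Y^{d}} = \operatorname*{arg\,min}_{y} \sum_{i \mid S = 1}^{n} \hat{\omega}_{d, i} \cdot \rho_{\tau}(Y_i - y).
\tag{11}
\]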
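Thus, ∆̂τ can be written as
\[
\begin{aligned}
\hat{\Delta}^{\tau} &= \operatorname*{arg\,min}_{y} \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i \mid S = 1}^{n} \frac{I\{D_i = d'\} \cdot \rho_{\tau}(Y_i - y)}{\hat{\pi}_{d'}(X_i, \hat{p}(W_i))} - \operatorname*{arg\,min}_{y} \frac{1}{\sum_{j=1}^{n} S_j} \sum_{i \mid S = 1}^{n} \frac{I\{D_i = d''\} \cdot \rho_{\tau}(Y_i - y)}{\hat{\pi}_{d''}(X_i, \hat{p}(W_i))} \\
&= \operatorname*{arg\,min}_{y} \sum_{i \mid S = 1}^{n} \hat{\omega}_{d', i} \cdot \rho_{\tau}(Y_i - y) - \operatorname*{arg\,min}_{y} \sum_{i \mid S = 1}^{n} \hat{\omega}_{d'', i} \cdot \rho_{\tau}(Y_i - y).
\end{aligned}
\tag{12}
\]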
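The minimization in (11) does not require a numerical optimizer: a weighted check-function loss is minimized by a weighted sample quantile. The sketch below illustrates this for a binary treatment in the selected sample. It assumes that the estimated treatment propensity scores are already available as an array `pi1_hat`; the function names are hypothetical, and the common factor 1/Σ Sj is omitted because it does not affect the minimizer.

```python
import numpy as np

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - I{u < 0})."""
    return u * (tau - (u < 0))

def weighted_quantile(y, weights, tau):
    """Smallest y whose weighted cdf reaches tau; this value minimizes
    sum_i weights_i * rho_tau(y_i - y)."""
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], weights[order]
    cum = np.cumsum(w_sorted)
    idx = np.searchsorted(cum, tau * cum[-1])
    return y_sorted[min(idx, len(y) - 1)]

def ipw_qte(y, d, pi1_hat, tau):
    """QTE for a binary treatment; inputs refer to observations with S = 1."""
    treated = d == 1
    q1 = weighted_quantile(y[treated], 1.0 / pi1_hat[treated], tau)
    q0 = weighted_quantile(y[~treated], 1.0 / (1.0 - pi1_hat[~treated]), tau)
    return q1 - q0
```

For example, `ipw_qte(y_obs, d_obs, pi1_hat, 0.5)` would return the estimated median effect, where `y_obs` and `d_obs` are the outcomes and treatments of the selected observations.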
Again, (10) and (12) look similar to the estimators discussed in Hirano et al. (2003) and Firpo (2007a), for which √n-consistency, asymptotic normality, and semi-parametric efficiency were shown when using a nonparametrically estimated propensity score. The major difference is that here weighting is based on a nested propensity score that also accounts for the selection into the observed sample. Using a GMM framework, appendix A.3 establishes √n-consistency and asymptotic normality of the proposed estimators based on parametric propensity score estimation.
20 Caliendo & Kopeinig (2008) argue that multinomial probit is preferable as it relies on less restrictive assumptions.
Our estimation procedure includes the trimming function θ(n) that trims out π̂d(Xi, p̂(Wi)) which are close to the boundaries 0 and 1. I.e., estimation in (10) and (12) is based on
\[
\hat{\pi}^{\theta}_{d}(X_i, \hat{p}(W_i)) \equiv \max\big(\theta(n), \min(1 - \theta(n), \hat{\pi}_{d}(X_i, \hat{p}(W_i)))\big),
\]
where θ(n) is some ‘small’ number that decreases in the sample size n. θ(n) guarantees that no observation obtains an arbitrarily large or small weight due to a propensity score estimate close to the boundary, as this could seriously deteriorate the appropriateness of IPW methods in finite samples, see Khan & Tamer (2007) and Busso, DiNardo & McCrary (2008). The estimator remains consistent because θ(n) → 0 as n → ∞.
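The trimming rule is simple to state in code. The following Python sketch takes the trimming level as a plain number `theta`; how θ(n) should shrink with n is not specified here, so no particular schedule is assumed. The function name is hypothetical.

```python
import numpy as np

def trim_propensity(pi_hat, theta):
    """Censor estimated propensity scores to the interval [theta, 1 - theta],
    i.e. max(theta, min(1 - theta, pi_hat)) applied elementwise."""
    return np.maximum(theta, np.minimum(1.0 - theta, pi_hat))
```

After trimming, the inverse probability weight 1/π̂ of any observation is bounded by 1/θ, e.g. by 20 for θ = 0.05.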
Propensity score matching may be used as an alternative method to IPW as both methods rely on the
same identifying assumptions, see Lechner (2007). A third possibility consists of estimating the conditional
outcomes for various treatments locally, e.g., by local linear kernel regression, and integrating over the distribution of π̂d'(X, p̂(W)) to identify the unconditional effects. In the selection on observables framework
without sample selection, Heckman, Ichimura & Todd (1998) use this approach to estimate ATEs whereas
Melly (2006) estimates counterfactual distributions required for QTE estimation. All these methods allow
for effect heterogeneity in X and p(W ) and thus, for heterogenous QTEs across ranks τ .
In contrast, parametric and most semiparametric methods impose effect homogeneity in X and p(W ),
and thus, τ , by making restrictive linearity and additivity assumptions on the treatment, the covariates,
and the bias correction term:
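\[
\begin{aligned}
Y_i^{*} &= \alpha D_i + X_i'\beta + U_i, \\
S_i &= W_i'\delta + V_i, \\
E[Y \mid D, X, p(W)] &= \alpha D_i + X_i'\beta + \lambda(p(W_i)), \\
\lambda(p(W)) &= E[U \mid D, X, V > -W'\delta].
\end{aligned}
\]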
α denotes the treatment coefficients, β, δ the coefficients on X and W , respectively, and λ(p(W )) the bias
correction term. Hence, nonparametric estimators may also be used to construct tests for homogenous
effects in X, p(W ) by verifying whether QTEs are constant across τ . If QTE estimates differ significantly
at different points of the outcome distribution, parametric methods are inconsistent and we should therefore
rely on nonparametric estimators imposing less functional form assumptions21 .
21
Note that even though ∆τ is allowed to differ across τ in the nonparametric framework, the conditional QTE
∆τ (x, p(w)) is not. Non-constant ∆τ (x, p(w)) would point to the violation of U ⊥D|p(W ) and assumption (2c).
Then, the proposed estimators and virtually all point estimators suggested in the literature would be inconsistent.
The assumption of constant ∆τ (x, p(w)) is in principle testable, too, albeit very data hungry in a nonparametric
framework, in particular when X is high dimensional. Huber & Melly (2008) propose and apply such tests in a
semiparametric framework. In the presence of heterogenous ∆τ (x, p(w)), point identification is not feasible, but
5 Monte Carlo simulations
This section presents results of linear and nonlinear Monte Carlo simulations to examine the finite sample
properties of the proposed IPW and matching estimators relative to parametric maximum likelihood and
two-step estimators as well as to the naive estimator (i.e., the difference in the sample means of the observed
treated and observed nontreated observations). In all specifications, treatment D is binary and a function
of X. The first data generating process (DGP) represents a classical linear selection model with bivariate
normally distributed errors the covariance of which is 0.8.
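\[
\begin{aligned}
Y_i^{*} &= \alpha_1 D_i + \alpha_2 X_i + U_i, \\
S_i &= I\{\beta_1 D_i + \beta_2 X_i + \beta_3 Z_i + V_i > 0\}, \\
D_i &= I\{\gamma_1 X_i + \varepsilon_i > 0\}, \\
Y_i &= Y_i^{*} \ \text{if } S_i = 1, \\
X, Z &\sim N(0, 1), \quad U, V, \varepsilon \sim N(0, 2), \quad Cov(U, V) = 0.8, \quad Cov(U, \varepsilon) = 0, \\
\alpha_1 &= \alpha_2 = 1, \quad \beta_1 = \beta_2 = 0.25, \quad \beta_3 = \gamma_1 = 0.5.
\end{aligned}
\]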
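To make the first DGP concrete, the following Python sketch draws one sample and computes the naive estimator (the difference in observed mean outcomes). It is an illustration only: reading N(0, 2) as a variance of 2 is an assumption, as are the function names and the use of NumPy's default random generator.

```python
import numpy as np

def simulate_linear_dgp(n, seed=0):
    """Draw one sample from the linear selection model with Gaussian errors.
    Assumption: N(0, 2) denotes variance 2, with Cov(U, V) = 0.8."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    u, v = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 2.0]], size=n).T
    eps = rng.normal(scale=np.sqrt(2.0), size=n)
    d = (0.5 * x + eps > 0).astype(float)                     # gamma_1 = 0.5
    s = (0.25 * d + 0.25 * x + 0.5 * z + v > 0).astype(float)  # beta_1..beta_3
    y_star = 1.0 * d + 1.0 * x + u                             # alpha_1 = alpha_2 = 1
    y = np.where(s == 1, y_star, np.nan)                       # Y observed only if S = 1
    return y, d, s, x, z

def naive_estimate(y, d, s):
    """Difference in mean outcomes of observed treated and nontreated."""
    obs = s == 1
    return y[obs & (d == 1)].mean() - y[obs & (d == 0)].mean()
```

Because D depends on X and selection depends on D, X, Z and V, the naive estimate is expected to deviate clearly from the true effect of 1 in such a sample.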
We run 1000 Monte Carlo replications and estimate the median effect and the ATE by IPW for two
sample sizes (n = 700, 2800). The trimming factor is set to Tn = 0.05, 0.025 for the smaller and larger
sample, respectively. To estimate the standard errors of the IPW estimators, we draw 199 bootstrap
samples with replacement and set the bootstrap block size to the sample size n. In addition, the ATE is
estimated by two nearest neighbors matching using the R matching package developed by Sekhon (2007).
For the computation of standard errors we use the Abadie & Imbens (2006) estimator based on matching
observations within the same treatment group. This estimator is inconsistent as it does not account for
the uncertainty related to the estimation of the propensity scores, see also appendix A.3. We nevertheless
apply it to assess how severely its accuracy is affected by the inconsistency. The nested propensity scores
p(W ), π1 (X, p(W )) are estimated by probit specifications. We compare the IPW and matching results to
the parametric ML and heckit two-step estimators for sample selection models.
Table 5.1 displays the point estimates, standard errors (s.e.), and the mean squared errors (MSEs) of the estimators. As expected, the parametric benchmarks are superior to IPW and matching in terms of MSEs due to correct parametric specification. However, the performance of the nonparametric methods is satisfactory for both sample sizes. The IPW mean and matching estimators even outperform the parametric methods in terms of small sample bias. Taking a look at the s.e. estimates (σ̂), it appears that the bootstrap comes close to the true IPW standard errors in particular for n = 2800. The same applies to the analytical estimates of the ML and two-step standard errors. In contrast, the Abadie Imbens estimator considerably overestimates the matching standard error.
21 (cont.) treatment effects might still be bounded in the spirit of Manski (1989). Lee (2005) and Lechner & Melly (2007) present empirical applications of interval estimation of treatment effects in the presence of sample selection.
Table 5.1
Estimates and MSEs for the linear model with Gaussian errors

                         n=700                        n=2800
              ∆̂, ∆̂τ    MSE      σ̂         ∆̂, ∆̂τ    MSE      σ̂
IPW median    0.977    0.084    0.324      0.995    0.021    0.147
(s.e.)       (0.288)           (0.067)    (0.145)           (0.022)
IPW mean      0.997    0.063    0.267      1.000    0.015    0.127
(s.e.)       (0.251)           (0.053)    (0.122)           (0.021)
matching      0.997    0.064    0.481      0.999    0.015    0.332
(s.e.)       (0.252)           (0.047)    (0.121)           (0.015)
ML            0.982    0.040    0.202      0.997    0.010    0.101
(s.e.)       (0.200)           (0.012)    (0.100)           (0.003)
two-step      0.993    0.046    0.218      0.997    0.011    0.104
(s.e.)       (0.214)           (0.033)    (0.103)           (0.006)
naive         1.484    0.268               1.495    0.255
(s.e.)       (0.183)                      (0.096)
true          1.000                        1.000
We now consider the more interesting case of a nonlinear specification and treatment effect heterogeneity in X. The DGP is
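\[
\begin{aligned}
Y_i^{*} &= \alpha_1 X_i + \alpha_2 X_i^2 + \alpha_3 X_i^3 + U_i \ \text{if } D_i = 1, \\
Y_i^{*} &= \delta_1 X_i + \delta_2 X_i^2 + \delta_3 X_i^3 + U_i \ \text{if } D_i = 0, \\
S_i &= I\{\beta_1 D_i + \beta_2 X_i + \beta_3 Z_i + V_i > 0\}, \\
D_i &= I\{\gamma_1 X_i + \varepsilon_i > 0\}, \\
Y_i &= Y_i^{*} \ \text{if } S_i = 1, \\
X, Z &\sim N(0, 1), \quad \varepsilon \sim N(0, 1), \quad U, V \sim N(0, 2), \quad Cov(U, V) = 0.8, \quad Cov(U, \varepsilon) = 0, \\
\alpha_1 &= 2, \ \alpha_2 = 6, \ \alpha_3 = 2, \quad \delta_1 = \delta_2 = \delta_3 = 1, \quad \beta_1 = \beta_2 = 0.25, \quad \beta_3 = \gamma_1 = 0.5.
\end{aligned}
\]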
The outcome is a cubic function of X that differs for D = 0, 1. We would expect the parametric estimators to be severely biased due to their inconsistency related to model misspecification. In contrast, the semiparametric IPW and matching estimators should still yield decent results. Table 5.2 presents the results for n = 700, 2800. All estimates are normalized with respect to the true treatment effect, such that ∆ = 1. Again, we estimate p(W), π1(X, p(W)) by probit specifications. IPW and matching are considerably more accurate than the parametric benchmarks, the MSEs of which are more than 10 times larger for n = 2800. Obviously, the parametric estimators handle the nonlinearity of the outcome in X and D very poorly. The results demonstrate the caveats related to restrictive assumptions in sample selection models and demonstrate the merits of a more flexible model specification.
Table 5.2
Estimates and MSEs for the semi-nonlinear model with Gaussian errors

                         n=700                        n=2800
              ∆̂        MSE      σ̂         ∆̂        MSE      σ̂
IPW mean      1.016    0.038    0.214      1.007    0.010    0.114
(s.e.)       (0.194)           (0.059)    (0.102)           (0.036)
matching      1.017    0.062    0.221      1.019    0.005    0.147
(s.e.)       (0.248)           (0.042)    (0.069)           (0.014)
ML            0.650    0.168    0.221      0.655    0.128    0.113
(s.e.)       (0.215)           (0.052)    (0.092)           (0.007)
two-step      0.665    0.141    0.247      0.666    0.112    0.111
(s.e.)       (0.170)           (0.083)    (0.077)           (0.010)
naive         1.746    0.577               1.757    0.577
(s.e.)       (0.143)                      (0.070)
true          1.000                        1.000
Table 5.3 presents the results for the same DGP as before with the exception that the unobserved
terms U, V are now jointly t-distributed with four degrees of freedom and ε is t-distributed with four
degrees of freedom.
We therefore introduce misspecification with respect to the probit models of
p(W ), π1 (X, p(W )) where normally distributed errors are assumed. As before, the IPW and matching
estimators perform quite well and greatly outperform the parametric methods in terms of bias and
MSE. The misspecification of the nested propensity score does not seem to harm the accuracy of the
estimators. This is in line with Zhao (2008) who investigates the finite sample properties of propensity
score matching estimators and whose simulations suggest that ATE estimates are hardly affected (under
conditional independence) when matching on misspecified, but yet balancing propensity scores.
Table 5.3
Estimates and MSEs for the semi-nonlinear model with t-distributed errors

                         n=700                        n=2800
              ∆̂        MSE      σ̂         ∆̂        MSE      σ̂
IPW mean      0.999    0.029    0.174      0.996    0.008    0.095
(s.e.)       (0.169)           (0.032)    (0.091)           (0.028)
matching      0.894    0.022    0.173      0.964    0.005    0.128
(s.e.)       (0.105)           (0.019)    (0.057)           (0.008)
ML            0.841    0.054    0.168      0.820    0.039    0.098
(s.e.)       (0.170)           (0.075)    (0.083)           (0.006)
two-step      0.728    0.094    0.198      0.700    0.095    0.100
(s.e.)       (0.142)           (0.026)    (0.066)           (0.008)
naive         1.618    0.409               1.657    0.436
(s.e.)       (0.165)                      (0.068)
true          1.000                        1.000

6 Empirical applications
This section presents two applications. The first one is a classical wage regression using Italian survey data
from Ichino, Mealli & Nannicini (2008). The data set comprises 2030 individuals aged between 18 and
40 without stable jobs (i.e., open-ended contracts or self-employment) in January 2001 and was originally
investigated to assess the effectiveness of temporary work assignments. We use it to estimate the returns to
schooling for individuals that received secondary education or less. The dependent variable Y is log hourly
wage in November 2002 which is observed conditional on being employed. We are interested in the wage
effects of high school graduation (D = 1) vs. lower (secondary or primary) education (D = 0) for those
having received either the one or the other. 1115 individuals in the sample graduated from high school
and 637 have a lower education. Wages are observed (S = 1) for 747 individuals or 43%, of which 537
are treated and 210 are nontreated. Among other socio-economic information the data comprise labour
market experience, age, gender, regional dummies, and the grade obtained in the last degree (expressed
as a fraction of the highest mark), which may be considered as a proxy for unobserved ability. These
factors are potential confounders to education in the explanation of log hourly wages and are therefore
used as conditioning variables X in the outcome equation. They also enter the selection equation besides
education, as they are likely to affect the probability to work. In addition, the marital status and number
of children (along with interaction terms with gender), denoted as instruments Z, are included in the
selection equation but excluded in the outcome equation.
We estimate the ATE of high school graduation by IPW and two nearest neighbor caliper matching and use probit specifications for the sample selection and treatment propensity scores22. The IPW trimming factor is set to Tn = 0.05. But no treatment propensity score estimate π̂d(Xi, p̂(Wi)) actually has to be trimmed, as the maximum is 94.2% and the minimum is 8.2%. The histograms of π̂d(Xi, p̂(Wi)) for D = 1 and D = 0 presented in figure 6.1 show that the overlap in the treatment propensity scores across treatment states is quite satisfactory.
Figure 6.1
Estimated treatment propensity scores for D = 1 and D = 0
[Figure: two histograms, "Histogram of pi[d == 1]" and "Histogram of pi[d == 0]"; horizontal axes: pi[d == 1] and pi[d == 0]; vertical axes: Frequency.]
The caliper in the matching algorithm defines the maximally acceptable distance in any match’s
propensity score in order to eliminate those matches that are not comparable in terms of their treatment
probabilities, i.e., lie outside the support. We set the caliper to 1 standard deviation (of the estimated
treatment propensity score), but no observations have to be dropped. After-matching balance tests
indicate decent balance, suggesting that treated and nontreated matches are comparable with respect
to the distribution of X and the estimated sample selection propensity score pˆ. In addition to the
semiparametric procedures we also estimate the ATE nonparametrically by directly matching on X and
pˆ, where the latter is obtained by nonparametric conditional density estimation as discussed in Li et al.
(2009). The caliper is again 1 standard deviation and 107 observations (14%) are discarded due to a
lack of common support. We also estimate the QTE at the median using IPW. As in the simulations,
standard errors of IPW and matching estimators are based on bootstrapping (999 draws) and the Abadie
Imbens (2006) estimator, respectively.
22 The treatment specification includes age, age squared, years unemployed, age*years unemployed, a dummy for Sicily, and the grade obtained in the last degree. The selection equation additionally includes educational dummies, gender, marital status, dummies for 1, 2, and 3 children, and interaction terms between gender and children.
Table 6.1 provides the results for the non- and semiparametric estimators as well as for the parametric
ML and two-step (heckit) procedures. The estimates suggest that graduating from high school increases
the hourly wage on average by at least 6%. The median estimate is somewhat higher than the mean
estimates, but one would generally expect the QTE to diverge from the ATE if effects are heterogenous
with respect to X and p. Despite the limited sample size the IPW effects are significant at the 10% level.
Note that the parametric estimates are not too far away from the results obtained by semiparametric or
nonparametric estimation, but this need not be the case in other problems. It therefore seems advisable
to use both semi-/nonparametric and parametric estimators in empirical applications as the former are
more robust and the latter are generally more precise, given that the estimates obtained by both methods
are close.
Table 6.1
Average and median treatment effects (increase of hourly wage in %)

            IPW mean   match (probit)   direct match*   ML        two-step   IPW median
∆̂, ∆̂τ       0.073      0.055            0.066           0.087     0.080      0.105
(s.e.)      (0.038)    (0.036)          (0.037)         (0.037)   (0.046)    (0.034)
p-value     0.054      0.135            0.070           0.024     0.081      0.002

*107 observations (14%) dropped due to a lack of common support
The methods proposed in this paper may also be used as robustness checks, which we demonstrate in
the second application. Angrist et al. (2004) consider the effects of school vouchers on scores achieved in
a college admission test based on data from Colombia's PACES program, which covered half the cost of
private secondary schooling. Many vouchers were assigned by lottery, which suggests that treatment effects
can be evaluated by comparing the test scores of voucher winners and losers just like in an experiment.
Experimental results in Angrist et al. (2004) imply that vouchers increase reading test scores on average
by roughly 0.7 points and this effect is significant at the 5 % level. However, as only 35% of students in the
sample of voucher applicants took the test, selection bias is an issue if test taking is non-random, e.g., if
voucher winners were more likely to be tested. Therefore, Angrist et al. (2004) use censored regression and
nonparametric bounds$^{23}$ to account for potential sample selection. On balance, they still find substantial
gains from the PACES program. In what follows, we present an alternative way to check the robustness of the
effects by modeling the relationship between the likelihood of taking the test, the (potential) test score,
and the incidence of winning a voucher. Thus, the need for an exclusion restriction is replaced by
imposing more structure on the model. We assume that the probability of being tested is characterized (i)
by a linear probability model (LPM) or (ii) by a probit model. The linear model has the form
$$p = \Pr(S = 1) = \beta_1 Y^* + \beta_2 D + \eta, \qquad \eta \sim \mathrm{unif}(0.05, 0.45),$$
where $p$ is the probability of taking the test, $Y^*$ is the potential test score, and $D$ indicates winning ($D = 1$)
or not winning ($D = 0$) a voucher. $\eta$ is a randomly assigned baseline probability that is assumed to be
uniformly distributed between 5% and 45%. Similarly, the probit model is defined as
$$p = \Phi(\beta_1 Y^* + \beta_2 D + \eta),$$
where $\Phi(\cdot)$ is the normal cdf. Hence, $p$ is assumed to be related to both the test score and the school voucher.
The relation to the test score is due to the assumption that more able students with higher potential test
scores are also more likely to take the test. In addition, voucher winners might be encouraged more strongly by their
(more often private) schools to take the test, which is one potential reason why $p$ may be related to $D$.
The sample of test takers consists of 1223 observations$^{24}$ for which $p$ is computed. The sample average of
the test score is 47.356 and the test score's standard deviation is 5.588.
23 Note that there is no instrument for taking the test available in the data.
We assess the robustness of the voucher effect estimate for different values of β1 , β2 , and γ using
IPW. In a perfect experiment, the probability to receive a voucher is independent of p and X such that
the unconditional treatment probability Pr(D = 1) (63.7% in the sample) can be used for estimation. In
this case, IPW yields an effect estimate of 0.683 (standard error based on 199 bootstrap draws: 0.329).
This is the same result as obtained by taking the difference in mean test scores of treated and nontreated
or regressing the test score on a treatment dummy. To account for selection, we specify π1 (X, p), the
propensity score for having received a voucher, as a probit model with the test-taking probability p and
other covariates X, namely age and age squared, as explanatory variables. It is therefore assumed that p
is known and can be controlled for to consistently estimate the voucher effect. By changing β1 , β2 , and γ
over a range of plausible values, the robustness of the voucher effect estimate can be investigated. For example, in
the LPM, β1 = 0.001 implies that each additional point in the test comes with an increase in the likelihood
of taking the test by 0.1 percentage points, whereas β2 = 0.05 means that voucher winners have, ceteris paribus, a 5
percentage point higher probability of taking the test than losers.
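The statement that IPW with an unconditional treatment probability reproduces the difference in mean outcomes can be checked numerically. The sketch below assumes weights that are normalized to sum to one within each treatment group (a choice made here for illustration, not necessarily the exact weighting used in the paper), and it runs on synthetic data; `ipw_ate` is a hypothetical helper name.

```python
import numpy as np

def ipw_ate(y, d, pscore):
    """IPW mean difference between treated and nontreated with normalized weights."""
    y, d, pscore = (np.asarray(a, dtype=float) for a in (y, d, pscore))
    w1 = d / pscore                    # weights for treated observations
    w0 = (1.0 - d) / (1.0 - pscore)    # weights for nontreated observations
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Synthetic illustration (not the PACES data).
rng = np.random.default_rng(42)
n = 1000
d = rng.binomial(1, 0.637, size=n)
y = 47.0 + 0.7 * d + rng.normal(0.0, 5.5, size=n)

constant_pscore = np.full(n, d.mean())           # unconditional Pr(D = 1)
ipw_estimate = ipw_ate(y, d, constant_pscore)
mean_difference = y[d == 1].mean() - y[d == 0].mean()
```

With a constant propensity score the two numbers coincide up to floating point error. Replacing `constant_pscore` by fitted values of a model in $p$ and $X$ gives the selection-adjusted version discussed in the text.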
Results are provided in Table 6.2. As expected, $\hat{\Delta}$ decreases in $\beta_1$ and $\beta_2$, suggesting positive
selection bias. Still, the estimates remain positive for most combinations of parameter values, albeit not
significantly different from zero at conventional levels in most cases.$^{25}$
24 One observation was dropped due to a missing value in the reading test score.
25 Standard errors are based on 199 bootstrap draws.
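As a rough worked check of the significance statement, the point estimates and bootstrap standard errors reported in Table 6.2 can be turned into simple ratios. The sketch assumes a two-sided test based on the normal approximation with critical value 1.96, which is an assumption of this illustration and not necessarily the inference procedure behind the paper's statement.

```python
import numpy as np

# Rows: beta1 in (0.001, 0.003, 0.005, 0.007); columns: beta2 in (0.01, 0.03, 0.05, 0.07).
estimates = {
    "lpm": np.array([[0.632, 0.519, 0.418, 0.375],
                     [0.581, 0.388, 0.213, 0.102],
                     [0.523, 0.260, 0.019, -0.154],
                     [0.463, 0.140, -0.155, -0.382]]),
    "probit": np.array([[0.632, 0.515, 0.398, 0.321],
                        [0.580, 0.382, 0.189, 0.038],
                        [0.522, 0.253, -0.007, -0.221],
                        [0.461, 0.133, -0.180, -0.443]]),
}
std_errors = {
    "lpm": np.array([[0.322, 0.338, 0.348, 0.392],
                     [0.332, 0.353, 0.338, 0.409],
                     [0.318, 0.317, 0.350, 0.396],
                     [0.308, 0.306, 0.348, 0.387]]),
    "probit": np.array([[0.324, 0.332, 0.358, 0.386],
                        [0.325, 0.329, 0.353, 0.377],
                        [0.316, 0.315, 0.338, 0.385],
                        [0.307, 0.325, 0.332, 0.377]]),
}

z_ratios = {m: estimates[m] / std_errors[m] for m in estimates}
n_positive = {m: int(np.sum(estimates[m] > 0)) for m in estimates}
n_significant_5pct = {m: int(np.sum(np.abs(z_ratios[m]) > 1.96)) for m in estimates}
```

Under this approximation, 13 of the 16 LPM estimates and 12 of the 16 probit estimates are positive, and only the LPM estimate for β1 = 0.001, β2 = 0.01 exceeds the 5% critical value.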
Table 6.2
IPW based robustness checks, linear probability model and probit model

Linear probability model
                               β2 = 0.01   β2 = 0.03   β2 = 0.05   β2 = 0.07
β1 = 0.001   $\hat{\Delta}$      0.632       0.519       0.418       0.375
             (s.e.)             (0.322)     (0.338)     (0.348)     (0.392)
β1 = 0.003   $\hat{\Delta}$      0.581       0.388       0.213       0.102
             (s.e.)             (0.332)     (0.353)     (0.338)     (0.409)
β1 = 0.005   $\hat{\Delta}$      0.523       0.260       0.019      -0.154
             (s.e.)             (0.318)     (0.317)     (0.350)     (0.396)
β1 = 0.007   $\hat{\Delta}$      0.463       0.140      -0.155      -0.382
             (s.e.)             (0.308)     (0.306)     (0.348)     (0.387)

Probit model
                               β2 = 0.01   β2 = 0.03   β2 = 0.05   β2 = 0.07
β1 = 0.001   $\hat{\Delta}$      0.632       0.515       0.398       0.321
             (s.e.)             (0.324)     (0.332)     (0.358)     (0.386)
β1 = 0.003   $\hat{\Delta}$      0.580       0.382       0.189       0.038
             (s.e.)             (0.325)     (0.329)     (0.353)     (0.377)
β1 = 0.005   $\hat{\Delta}$      0.522       0.253      -0.007      -0.221
             (s.e.)             (0.316)     (0.315)     (0.338)     (0.385)
β1 = 0.007   $\hat{\Delta}$      0.461       0.133      -0.180      -0.443
             (s.e.)             (0.307)     (0.325)     (0.332)     (0.377)

7 Conclusion
This paper discusses point identification and estimation of average and quantile treatment effects in the
presence of sample selection, attrition, and non-response related to unobservables. It extends methods
discussed by Hirano et al. (2003) and Firpo (2007a) for treatment evaluation in a selection on observables
framework to the case of a non-randomly drawn subpopulation related to unobservables. The main contribution of the paper is the proposal of nonparametric estimators which ‘kill two birds with one stone’
by controlling for selectivity bias with respect to (i) sample selection and (ii) treatment assignment, using
a nested propensity score characterizing either selection probability. The estimators rely on inverse probability weighting (IPW) and propensity score matching, where the (first stage) sample selection propensity score is included as additional covariate among other observed factors to compute the (second stage)
propensity to receive the treatment.
In contrast to most parametric and semiparametric procedures, the proposed estimators apply to
selection models of rather general form and allow for effect heterogeneity in the covariates and in the
sample selection propensity score. They constitute an alternative to conventional approaches whenever
one is interested in the unconditional effects of a particular treatment variable rather than a broader set
of regressors. Neither exact knowledge of the structural relation between the selection probability and
the outcome, nor additivity of the unobserved term in the outcome equation is required for consistency.
However, as for virtually all methods yielding point identification, joint independence of the observed and
unobserved factors in the selection and outcome equations must hold conditional on the sample selection
propensity score. Monte Carlo results suggest that IPW and matching estimators are considerably more
appropriate than parametric alternatives when the data generating process is nonlinear. The paper also
provides two empirical applications, one to Italian labor market data (see Ichino et al., 2008) and one to a school
voucher lottery in Colombia previously analyzed by Angrist et al. (2004).
Further research might investigate the finite sample properties of the proposed estimators in more
detail and systematically evaluate their performance in terms of bias and mean squared error relative
to conventional parametric and semiparametric methods for various specifications of the selection and
outcome equations as well as the unobserved terms.
A Appendix
A.1 Proof of proposition 1
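$\Delta$, the ATE for the subpopulation with observed outcomes, is identified by
$$\Delta = E\left[\frac{S \cdot D \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))}\right] - E\left[\frac{S \cdot (1 - D) \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))}\right] = E\left[\frac{D \cdot Y}{\pi_{d'}(X, p(W))}\right] - E\left[\frac{(1 - D) \cdot Y}{\pi_{d''}(X, p(W))}\right].$$
Proof:
$$\begin{aligned}
& E\left[\frac{S \cdot I\{D = d'\} \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))}\right] - E\left[\frac{S \cdot I\{D = d''\} \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))}\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{S \cdot I\{D = d'\} \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))} - \frac{S \cdot I\{D = d''\} \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))} \,\Big|\, X, p(W)\right] \Big|\, p(W)\right]\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d'\} \cdot Y^*}{p(W) \cdot \pi_{d'}(X, p(W))} - \frac{I\{D = d''\} \cdot Y^*}{p(W) \cdot \pi_{d''}(X, p(W))} \,\Big|\, S = 1, X, p(W)\right] \cdot p(W) \,\Big|\, p(W)\right]\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d'\} \cdot Y}{\pi_{d'}(X, p(W))} - \frac{I\{D = d''\} \cdot Y}{\pi_{d''}(X, p(W))} \,\Big|\, X, p(W)\right] \Big|\, p(W)\right]\right] \\
&= E_{p(W)}\Bigg[E_X\Bigg[E\left[\frac{Y}{\pi_{d'}(X, p(W))} \,\Big|\, D = d', X, p(W)\right] \cdot \pi_{d'}(X, p(W)) \\
&\qquad\qquad - E\left[\frac{Y}{\pi_{d''}(X, p(W))} \,\Big|\, D = d'', X, p(W)\right] \cdot \pi_{d''}(X, p(W)) \,\Big|\, p(W)\Bigg]\Bigg] \\
&= E_{p(W)}\Big[E_X\Big[E\left[Y \mid D = d', X, p(W)\right] - E\left[Y \mid D = d'', X, p(W)\right] \Big|\, p(W)\Big]\Big] = E\left[Y \mid D = d'\right] - E\left[Y \mid D = d''\right] \\
&= E[Y^{d'}] - E[Y^{d''}] = \Delta.
\end{aligned}$$
A.2 Proof of proposition 2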
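For the identification of $\Delta^\tau$, the QTE for the subpopulation with observed outcomes, note that $Q^\tau_{Y^d}$, the
$\tau$th unconditional quantile of $Y^d$, is an implicit function of the following expression:
$$E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\}\right] = E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^\tau_{Y^d}\}\right] = F_{Y^d}(Q^\tau_{Y^d}) = \tau.$$
Proof:
$$\begin{aligned}
& E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\}\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{S \cdot I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\} \,\Big|\, X, p(W)\right] \Big|\, p(W)\right]\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d\}}{p(W) \cdot \pi_d(X, p(W))} \cdot I\{Y^* \leq Q^\tau_{Y^d}\} \,\Big|\, S = 1, X, p(W)\right] \cdot p(W) \,\Big|\, p(W)\right]\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{I\{D = d\}}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^\tau_{Y^d}\} \,\Big|\, X, p(W)\right] \Big|\, p(W)\right]\right] \\
&= E_{p(W)}\left[E_X\left[E\left[\frac{1}{\pi_d(X, p(W))} \cdot I\{Y \leq Q^\tau_{Y^d}\} \,\Big|\, D = d, X, p(W)\right] \cdot \pi_d(X, p(W)) \,\Big|\, p(W)\right]\right] \\
&= E_{p(W)}\Big[E_X\Big[E\left[I\{Y \leq Q^\tau_{Y^d}\} \mid D = d, X, p(W)\right] \Big|\, p(W)\Big]\Big] = E\left[I\{Y \leq Q^\tau_{Y^d}\} \mid D = d\right] = \tau.
\end{aligned}$$
$\Delta^\tau$ is identified by $Q^\tau_{Y^{d'}} - Q^\tau_{Y^{d''}}$.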
A.3 Asymptotic distribution of the IPW estimator using parametric propensity score models
This section shows $\sqrt{n}$-consistency and asymptotic normality of IPW estimators using parametric models
for the selection into the observed population and into treatment. The properties are discussed in a GMM
framework that is similar to the one considered by Lechner (2009) for dynamic treatment evaluation.
It is assumed that the nested propensity scores $p$, $\pi_d$ for sample selection and treatment assignment are
known up to a finite number of coefficients, i.e., $\beta \equiv (\beta_s, \beta_d)$, where $\beta_s$ denotes the coefficients on
$W \equiv (D, X, Z)$ in $p = p(W, \beta_s)$ and $\beta_d$ the coefficients on $X, p$ in $\pi_d = \pi_d(X, p(W, \beta_s), \beta_d)$. Furthermore,
there exists a $\sqrt{n}$-consistent, asymptotically normal estimator $\hat{\beta}$, for instance a two-step ML estimator of a
nested probit or logit model with likelihood functions $L_s(s, \beta_s)$, $L_d(d, \beta_d, \beta_s)$. Note that $\hat{\beta}_d$, the coefficient
estimates characterizing the treatment probability, are a function of the selection probability implied by $\hat{\beta}_s$
(which is $\sqrt{n}$-consistent) rather than the true value $\beta_s$. Murphy & Topel (1985) show that under certain
regularity conditions the two-step ML estimator $\hat{\beta}_d$ is $\sqrt{n}$-consistent and asymptotically normal.$^{26}$ Let
$k, g$ denote the score functions, i.e., the first derivatives of the likelihood functions with respect to the
coefficients:
$$\begin{pmatrix} k(x, z, s, d, \beta_s) \\ g(x, z, s, d, \beta) \end{pmatrix} = \begin{pmatrix} \partial L_s(s, \beta_s)/\partial \beta_s \\ \partial L_d(d, \beta_d, \beta_s)/\partial \beta_d \end{pmatrix}.$$
Using a GMM framework, the estimators of the unknown values of $\beta_d, \beta_s$ satisfy the conditions
$$\frac{1}{n} \sum_{i=1}^{n} k(X_i, Z_i, S_i, D_i; \hat{\beta}_s) = 0, \qquad \frac{1}{n} \sum_{i=1}^{n} g(X_i, Z_i, S_i, D_i; \hat{\beta}) = 0.$$
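26 Murphy & Topel (1985) prove that $\sqrt{n}(\hat{\beta}_d - \beta_d) \rightarrow N(0, \Sigma)$ with
$$\Sigma = R_2^{-1} + R_2^{-1}\left[R_3' R_1^{-1} R_3 - R_4' R_1^{-1} R_3 - R_3' R_1^{-1} R_4\right] R_2^{-1},$$
$$R_1 = -E\left[\frac{\partial^2 L_s(s, \beta_s)}{\partial \beta_s \partial \beta_s'}\right], \quad R_2 = -E\left[\frac{\partial^2 L_d(d, \beta_d, \beta_s)}{\partial \beta_d \partial \beta_d'}\right], \quad R_3 = -E\left[\frac{\partial^2 L_d(d, \beta_d, \beta_s)}{\partial \beta_s \partial \beta_d'}\right], \quad R_4 = E\left[\frac{\partial L_s(s, \beta_s)}{\partial \beta_s} \left(\frac{\partial L_d(d, \beta_d, \beta_s)}{\partial \beta_d}\right)'\right],$$
where $'$ denotes the transpose.
These conditions allow predicting the sample selection and treatment propensity scores and will serve as
one part of the final GMM estimator that will also incorporate a moment condition related to the treatment
effects. We therefore reconsider the ATE estimator which we defined as
$$\hat{\Delta} = \sum_{i \mid S = 1} \left[(\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \cdot Y_i\right] = \sum_{i=1}^{n} \left[S_i \cdot (\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \cdot Y_i\right],$$
with weights
$$\hat{\omega}_{d,i} = \frac{1}{\sum_{j=1}^{n} S_j} \cdot \frac{I\{D_i = d\}}{\pi_d(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_d)}.$$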
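It is straightforward to rewrite the estimator as
$$\hat{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i,$$
with
$$\begin{aligned}
\lambda_i(x, z, s, d, \hat{\beta}) &= n \cdot S_i \cdot (\hat{\omega}_{d',i} - \hat{\omega}_{d'',i}) \\
&= \frac{n}{\sum_{j=1}^{n} S_j} \cdot S_i \cdot \left(\frac{I\{D_i = d'\}}{\pi_{d'}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d'})} - \frac{I\{D_i = d''\}}{\pi_{d''}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d''})}\right) \\
&= \frac{S_i}{\hat{\Pi}} \cdot \left(\frac{I\{D_i = d'\}}{\pi_{d'}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d'})} - \frac{I\{D_i = d''\}}{\pi_{d''}(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_{d''})}\right),
\end{aligned}$$
where $\hat{\Pi}$ denotes the unconditional probability to be observed, $\hat{\Pi} \equiv (\sum_{j=1}^{n} S_j)/n$. This allows us to
formulate the estimator of $\Delta$ as the value $\hat{\Delta}$ satisfying
$$\frac{1}{n} \sum_{i=1}^{n} h(X_i, Z_i, S_i, D_i; \hat{\beta}, \hat{\Delta}) = \hat{\Delta} - \frac{1}{n} \sum_{i=1}^{n} \lambda_i(x, z, s, d; \hat{\beta}) \cdot Y_i = 0,$$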
which constitutes the second ingredient of the GMM estimator. As in Lechner (2009), one particularity of
this otherwise standard parametric GMM problem (see Hansen, 1982, and Newey and McFadden, 1994)
is that some of the moment conditions depend only on a subset of unknown parameters. I.e., the moment
conditions g related to β do not depend on ∆ and furthermore, Ls (s, βs ) does not depend on βd . The regularity conditions required for consistency and asymptotic normality in this framework of sequential estimators were established by Newey (1984): Data must be generated from stationary and ergodic processes,
the moment functions and the respective derivatives must exist and must be measurable and continuous,
the parameters must be finite and not at the boundary of the parameter space, and the derivatives of the
moment conditions w.r.t. the parameters must have full rank. Furthermore, the sample moments must
converge to their population counterparts with decreasing variances and to uniquely identified values of
the unknown parameters. Applying the results of Newey (1984) and using the partitioned inverse formula
on the matrix of derivatives (w.r.t. the unknown parameters βs , βd , ∆) of the moment conditions, the
asymptotic variance of the ATE estimator is equal to
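$$\begin{aligned}
\mathrm{asVar}(\sqrt{n}\hat{\Delta}) &= H_\Delta^{-1} E\Big[\big\{h(\cdot) + H_{\beta_d} G_{\beta_d}^{-1} g(\cdot) - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} k(\cdot)\big\} \\
&\qquad \times \big\{h(\cdot) + H_{\beta_d} G_{\beta_d}^{-1} g(\cdot) - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} k(\cdot)\big\}'\Big] H_\Delta^{-1} \\
&= V_{hh} + H_{\beta_d} G_{\beta_d}^{-1} V_{gh} - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} V_{kh} \\
&\quad + H_{\beta_d} G_{\beta_d}^{-1} V_{gg} G_{\beta_d}^{-1\prime} H_{\beta_d}' + V_{hg} G_{\beta_d}^{-1\prime} H_{\beta_d}' - (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} V_{kg} G_{\beta_d}^{-1\prime} H_{\beta_d}' \\
&\quad + (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s}) K_{\beta_s}^{-1} G_{\beta_d}^{-1} V_{kk} G_{\beta_d}^{-1\prime} K_{\beta_s}^{-1\prime} (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s})' \\
&\quad - V_{hk} G_{\beta_d}^{-1\prime} K_{\beta_s}^{-1\prime} (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s})' - H_{\beta_d} G_{\beta_d}^{-1} V_{gk} G_{\beta_d}^{-1\prime} K_{\beta_s}^{-1\prime} (H_{\beta_s} G_{\beta_d} - H_{\beta_d} G_{\beta_s})',
\end{aligned}$$
where $'$ denotes the transpose and
$$H_\Delta \equiv E\left[\frac{\partial h(\cdot)}{\partial \Delta}\right] = 1, \quad H_{\beta_d} \equiv -E\left[\frac{\partial h(\cdot)}{\partial \beta_d}\right] = -E\left[\frac{\partial \lambda_i(\cdot)}{\partial \beta_d} Y_i\right], \quad H_{\beta_s} \equiv -E\left[\frac{\partial h(\cdot)}{\partial \beta_s}\right] = -E\left[\frac{\partial \lambda_i(\cdot)}{\partial \beta_s} Y_i\right],$$
$$G_{\beta_d} \equiv E\left[\frac{\partial g(\cdot)}{\partial \beta_d}\right], \quad G_{\beta_s} \equiv E\left[\frac{\partial g(\cdot)}{\partial \beta_s}\right], \quad K_{\beta_s} \equiv E\left[\frac{\partial k(\cdot)}{\partial \beta_s}\right],$$
$$V_{hh} \equiv E[h(\cdot)^2] = \mathrm{Var}[\lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i], \quad V_{gg} \equiv E[g(\cdot) g(\cdot)'], \quad V_{kk} \equiv E[k(\cdot) k(\cdot)'],$$
$$V_{gh} \equiv E[g(\cdot) h(\cdot)], \quad V_{hg} \equiv V_{gh}', \quad V_{kh} \equiv E[k(\cdot) h(\cdot)], \quad V_{hk} \equiv V_{kh}', \quad V_{kg} \equiv E[k(\cdot) g(\cdot)'], \quad V_{gk} \equiv V_{kg}'.$$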
Ignoring the estimation of the nested propensity score would amount to assuming that $\partial \lambda_i(\cdot)/\partial \beta_d = 0$ and
$\partial \lambda_i(\cdot)/\partial \beta_s = 0$ such that $\mathrm{asVar}(\sqrt{n}\hat{\Delta}) = \mathrm{Var}[\lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i]$. Note that this is what the Abadie & Imbens
(2006) variance estimator does for the nearest neighbor matching estimator, which is why it is
inconsistent in the framework considered in this paper. As acknowledged by Lechner (2009), the full
variance might be smaller or larger than $\mathrm{Var}[\lambda_i(x, z, s, d, \hat{\beta}) \cdot Y_i]$, depending on $\partial \lambda_i(\cdot)/\partial \beta$ and on the correlation
of the moment conditions. A consistent estimator of the asymptotic variance is obtained by using the
sample analogues of the terms in the formula or by bootstrapping.
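The bootstrap route can be sketched as follows. The key point is that the propensity score is re-estimated within every bootstrap sample, so that its estimation error enters the standard error. Everything below is an illustrative assumption: the data are synthetic, the helper names are hypothetical, and a saturated cell-mean propensity model for a binary covariate stands in for the nested probit models discussed in the text.

```python
import numpy as np

def ipw_effect(data):
    """IPW effect with the propensity score re-estimated on `data` (columns: y, d, x)."""
    y, d, x = data[:, 0], data[:, 1], data[:, 2]
    pscore = np.empty_like(y)
    for value in np.unique(x):
        cell = x == value
        pscore[cell] = d[cell].mean()      # saturated model for a discrete covariate
    w1, w0 = d / pscore, (1.0 - d) / (1.0 - pscore)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def bootstrap_se(estimator, data, n_draws=199, seed=0):
    """Standard deviation of `estimator` across nonparametric bootstrap resamples of the rows."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    draws = np.array([estimator(data[rng.integers(0, n, size=n)]) for _ in range(n_draws)])
    return draws.std(ddof=1)

# Synthetic data with a true effect of 1 and treatment depending on x.
rng = np.random.default_rng(1)
n = 2000
x = rng.binomial(1, 0.5, size=n).astype(float)
d = rng.binomial(1, 0.4 + 0.3 * x).astype(float)
y = 1.0 * d + x + rng.normal(size=n)
data = np.column_stack([y, d, x])

estimate = ipw_effect(data)
se = bootstrap_se(ipw_effect, data)   # 199 draws, as in the applications
```

Passing the whole estimation routine to `bootstrap_se`, rather than resampling with fixed weights, is the design choice that distinguishes this from a variance estimate that ignores the first-stage estimation.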
We conclude this section by establishing a condition for the estimation of unconditional quantile
functions required to estimate QTEs. We defined the estimator of $Q^\tau_{Y^d}$ as
$$\hat{Q}^\tau_{Y^d} = \arg\min_y \sum_{i \mid S = 1} \hat{\omega}_{d,i} \cdot \rho_\tau(Y_i - y) = \arg\min_y \frac{1}{n} \sum_{i=1}^{n} \left[n \cdot S_i \cdot \hat{\omega}_{d,i} \cdot \rho_\tau(Y_i - y)\right].$$
This implies the first order condition
$$\frac{1}{n} \sum_{i=1}^{n} h_\tau(x_i, z_i, s_i, d_i; \hat{\beta}, \hat{Q}^\tau_{Y^d}) = \frac{1}{n} \sum_{i=1}^{n} \left[\frac{S_i}{\hat{\Pi}} \cdot \frac{I\{D_i = d\}}{\pi_d(X_i, p(W_i, \hat{\beta}_s), \hat{\beta}_d)} \cdot I\{Y_i < \hat{Q}^\tau_{Y^d}\} - \tau\right] = 0,$$
which immediately serves as condition for GMM estimation. The asymptotic variance of the asymptotically
normal estimator $\hat{Q}^\tau_{Y^d}$ can be obtained in a similar way as outlined for the ATE estimator. As a
consequence, the difference $\hat{\Delta}^\tau = \hat{Q}^\tau_{Y^{d'}} - \hat{Q}^\tau_{Y^{d''}}$ for distinct treatments $d' \neq d''$ is asymptotically normal,
too. $\hat{\Delta}^\tau$ involves independent terms, see for instance the argumentation in Firpo (2007a). Therefore, the
asymptotic variance of $\hat{\Delta}^\tau$ can be easily obtained from the asymptotic variances of $\hat{Q}^\tau_{Y^{d'}}$, $\hat{Q}^\tau_{Y^{d''}}$ as the
covariance term is zero.
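The weighted check-function problem has a simple closed-form solution: a minimizer is the smallest outcome value at which the cumulative normalized weight reaches $\tau$. The sketch below uses this fact; the helper names are hypothetical, the data are synthetic, and the treatment propensity is treated as known for brevity. Constant factors in the weights, such as $1/\sum_j S_j$, do not affect the minimizer and are therefore omitted.

```python
import numpy as np

def check_loss(q, y, w, tau):
    """Weighted check-function objective: sum_i w_i * rho_tau(y_i - q)."""
    u = y - q
    return np.sum(w * u * (tau - (u < 0)))

def weighted_quantile(y, w, tau):
    """Smallest y whose cumulative normalized weight is at least tau."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    keep = w > 0                      # zero-weight observations do not matter
    y, w = y[keep], w[keep]
    order = np.argsort(y)
    y, w = y[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = min(int(np.searchsorted(cdf, tau, side="left")), len(y) - 1)
    return y[idx]

# Synthetic illustration: median effect of a treatment that shifts outcomes by 1.
rng = np.random.default_rng(0)
n = 500
d = rng.binomial(1, 0.5, size=n)
s = rng.binomial(1, 0.7, size=n)             # selection indicator
y = rng.normal(size=n) + d
pi1 = np.full(n, 0.5)                        # treatment propensity, assumed known here
q1 = weighted_quantile(y, s * d / pi1, 0.5)
q0 = weighted_quantile(y, s * (1 - d) / (1 - pi1), 0.5)
qte_median = q1 - q0
```

In an application, `pi1` would be replaced by the fitted nested propensity score, and the quantile difference would be computed for each $\tau$ of interest.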
A.4 Identification in a randomized experiment with censored outcomes
Throughout this paper we assumed that treatment assignment is non-random and only unconfounded
conditional on observed covariates X and that X also affects selection. This is plausible in many interesting
evaluation problems such as wage equations, where factors such as tenure or experience are likely to affect both the
probability to work and the potential wage and may be confounders of the treatment ‘education’.
Let us now assume that the treatment is randomly assigned (i.e., independent of $X$) in the total
population$^{27}$ such that the treatment propensity score in the observed population is only a function of the
sample selection propensity score, i.e., $\pi_d(p(\cdot)) = \Pr(D = d \mid p(\cdot))$, and the latter is only a function of $Z$
and $D$, $p(D, Z) = \Pr(S = 1 \mid D, Z)$. This is useful for randomized experiments or lotteries with partially
observed outcomes where outcome censoring is non-random. Consider for instance the effect of school
vouchers assigned by a lottery on college admission test scores several years later. If only a subpopulation
takes the test and the participation probability is a function of the lottery win, point identification generally
requires an exclusion restriction to adjust for selection bias. Let $Z$ denote an instrument satisfying this
restriction. Without loss of generality, we will discuss identification for a binary treatment $D \in \{1, 0\}$. Let
$\Pr(S = 1 \mid D = d, Z) = p_d(Z)$ denote the selection probability conditional on $D = d$. Then,
$$\Pr(S = 1 \mid D, Z) = p_1(Z) \cdot D + p_0(Z) \cdot (1 - D).$$
We denote the treatment propensity in the observed population conditional on $p_d(Z)$ by $\Pr(D = 1 \mid p_d(Z)) = \pi_1(p_d(Z))$.
For a fixed $Z = z$, $\pi_1(p_1(z)) \neq \pi_1(p_0(z))$, otherwise $D$ is unrelated to $p_d(Z)$ and selection into
college admission tests is ignorable. Identification of treatment effects requires that treated and nontreated
observations with the same selection propensity score are available, which is obviously only feasible if $Z$
shifts the selection probability. I.e., it must hold that $\pi_1(p_1(z')) = \pi_1(p_0(z''))$ for some values $z' \neq z''$. In
general, $Z$ needs to be continuous for point identification. To gain some intuition, assume instead
that $Z$ is discrete and either 1 or 0. Let $\Pr(S = 1 \mid D = d, Z = z) = p_{dz}$. Then,
$$\begin{aligned}
\Pr(S = 1 \mid D, Z) &= p_1(Z) \cdot D + p_0(Z) \cdot (1 - D) \\
&= [p_{11} \cdot Z + p_{10} \cdot (1 - Z)] \cdot D + [p_{01} \cdot Z + p_{00} \cdot (1 - Z)] \cdot (1 - D).
\end{aligned}$$
For the identification of the ATET, all treated observations with $p_{11}$ and $p_{10}$, respectively, have to be
compared to non-treated units with equal selection probabilities. However, in general, $p_{11} \neq p_{10} \neq p_{01} \neq p_{00}$.
Let us consider two special cases where at least some combinations of $D$ and $Z$ yield equal sample
selection propensity scores among treated and nontreated. Firstly, let $D$ and $Z$ shift $p$ equally in the
same direction, e.g., increase the selection probability. Then, $p_{10} = p_{01}$, but $p_{11} \neq p_{00}$ (and $p_{11} \neq p_{01}$)
such that effects could only be point identified for a subpopulation. Secondly, let $D$ and $Z$ shift $p$ equally
in absolute terms, but in opposite directions. Then, $p_{11} = p_{00}$, but $p_{10} \neq p_{01}$ (and $p_{10} \neq p_{00}$). Thus, even
in special cases, point identification is infeasible for the entire observed population if $Z$ is binary. There
is, loosely speaking, an empty cells problem with respect to the sample selection propensity score.
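The empty cells problem for a binary instrument can be made concrete with a small enumeration. The additive specification and the numerical values below are purely illustrative assumptions, and the function names are hypothetical. The code lists which treated cells have a nontreated cell with the same selection probability.

```python
def selection_probabilities(base, shift_d, shift_z):
    """p_dz = Pr(S = 1 | D = d, Z = z) under a hypothetical additive specification."""
    return {(d, z): base + shift_d * d + shift_z * z for d in (0, 1) for z in (0, 1)}

def matched_cells(p, tol=1e-9):
    """Pairs (treated cell, nontreated cell) that share the same selection probability."""
    return [((1, z1), (0, z0))
            for z1 in (0, 1) for z0 in (0, 1)
            if abs(p[(1, z1)] - p[(0, z0)]) < tol]

# Case 1: D and Z shift p equally in the same direction, so only p10 = p01.
same_direction = matched_cells(selection_probabilities(0.3, 0.2, 0.2))

# Case 2: D and Z shift p equally but in opposite directions, so only p11 = p00.
opposite_direction = matched_cells(selection_probabilities(0.3, 0.2, -0.2))

# Generic case: unequal shifts leave no treated cell with a nontreated counterpart.
generic = matched_cells(selection_probabilities(0.3, 0.2, 0.1))
```

In both special cases one treated cell remains without a nontreated counterpart, which is the sense in which effects are point identified only for a subpopulation.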
This is not necessarily true for the scenario considered throughout this paper, where $X$ needs to be
conditioned on in $p(W) = \Pr(S = 1 \mid D, X, Z)$ and in $\Pr(D = d \mid X, p(W)) = \pi_d(X, p(W))$ for unconfoundedness.
If $X$ is continuous and its range is sufficiently large, there may be common support in $p(W)$ for
discrete $Z$. Even if this is not the case, there might still be common support in $\pi_d(X, p(W))$ if the continuous
$X$ is sufficiently powerful in shifting $\pi_d(X, p(W))$. In the latter case identification fails if we match
treated and nontreated observations directly on $X$ and $p(W)$ due to empty cells w.r.t. $p(W)$, but matching
on $\pi_d(X, p(W))$ is feasible. This result is related to the dimensionality reduction argument in the selection
on observables framework advocating propensity score matching rather than direct matching to avoid
empty cells for particular combinations of covariate values.
27 The author thanks Josh Angrist, Michael Lechner, and Blaise Melly for comments motivating the following discussion.
In any case, whether $Z$ is continuous or discrete, it needs to be ‘sufficiently’ relevant$^{28}$ for $p$. To see
this, reconsider the randomized framework with censored outcomes and assume the extreme case that $Z$
is not a relevant instrument for $p$ at all. Then, $\Pr(D = 1 \mid \Pr(S = 1 \mid D, Z)) = \Pr(D = 1 \mid \Pr(S = 1 \mid D))$
and nonparametric identification breaks down. The same holds true conditional on $X$, implying that
$\Pr(D = 1 \mid X, \Pr(S = 1 \mid D, X, Z)) = \Pr(D = 1 \mid X, \Pr(S = 1 \mid D, X))$.
28 Simulation methods may be used to investigate what ‘sufficiently relevant’ means in a particular scenario.
References
Abadie, A. & Imbens, G. (2006), ‘Large sample properties of matching estimators for average treatment
effects’, Econometrica 74, 235–267.
Ahn, H. & Powell, J. (1993), ‘Semiparametric estimation of censored selection models with a nonparametric
selection mechanism’, Journal of Econometrics 58, 3–29.
Andrews, D. & Schafgans, M. (1998), ‘Semiparametric estimation of the intercept of a sample selection
model’, Review of Economic Studies 65, 497–517.
Angrist, J. (1997), ‘Conditional independence in sample selection models’, Economics Letters 54, 103–112.
Angrist, J., Bettinger, E. & Kremer, M. (2004), ‘Long-term educational consequences of secondary school
vouchers: Evidence from administrative records in Colombia’, NBER Working Paper no. W10713.
Bhalotra, S. & Sanhueza, C. (2002), ‘Parametric and semi-parametric estimations of the return to schooling
in South Africa’, unpublished manuscript.
Buchinsky, M. (1998), ‘The dynamics of changes in the female wage distribution in the USA: A quantile
regression approach’, Journal of Applied Econometrics 13, 1–30.
Buchinsky, M. (2001), ‘Quantile regression with sample selection: Estimating women’s return to education
in the U.S.’, Empirical Economics 26, 87–113.
Busso, M., DiNardo, J. & McCrary, J. (2008), ‘Finite sample properties of semiparametric estimators of
average treatment effects’, unpublished manuscript.
Caliendo, M. & Kopeinig, S. (2008), ‘Some practical guidance for the implementation of propensity score
matching’, Journal of Economic Surveys 22, 31–72.
Chamberlain, G. (1986), ‘Asymptotic efficiency in semiparametric models with censoring’, Journal of
Econometrics 32, 189–218.
Cosslett, S. (1987), ‘Efficiency bounds for distribution-free estimators of the binary choice and censored
regression models’, Econometrica 55, 559–585.
Cosslett, S. (1991), Distribution-free estimator of a regression model with sample selectivity, in W. Barnett,
J. Powell & G. Tauchen, eds, ‘Nonparametric and semiparametric methods in econometrics and
statistics’, Cambridge University Press, Cambridge, UK, pp. 175–198.
Das, M., Newey, W. & Vella, F. (2003), ‘Nonparametric estimation of sample selection models’, Review of
Economic Studies 70, 33–58.
Firpo, S. (2007a), ‘Efficient semiparametric estimation of quantile treatment effects’, Econometrica
75, 259–276.
Firpo, S. (2007b), ‘Inequality treatment effects’, unpublished manuscript.
Fitzgerald, J., Gottschalk, P. & Moffitt, R. (1998), ‘An analysis of the impact of sample attrition on the
second generation of respondents in the Michigan Panel Study of Income Dynamics’, Journal of Human
Resources 33, 300–344.
Frölich, M. (2001), ‘Applied higher-dimensional nonparametric regression’, University of St. Gallen Discussion Paper no. 2001-12.
Frölich, M. (2006), ‘Non-parametric regression for binary dependent variables’, Econometrics Journal
9, 511–540.
Gabler, S., Laisney, F. & Lechner, M. (1993), ‘Seminonparametric estimation of binary-choice models with
an application to labor-force participation’, Journal of Business & Economic Statistics 11, 61–80.
Gallant, A. & Nychka, D. (1987), ‘Semi-nonparametric maximum likelihood estimation’, Econometrica
55, 363–390.
Gerfin, M. (1996), ‘Parametric and semi-parametric estimation of the binary response model of labour
market participation’, Journal of Applied Econometrics 11, 321–339.
Gronau, R. (1974), ‘Wage comparisons-a selectivity bias’, Journal of Political Economy 82, 1119–1143.
Hall, P., Racine, J. & Li, Q. (2004), ‘Cross-validation and the estimation of conditional probability densities’, Journal of the American Statistical Association 99, 1015–1026.
Han, A. (1987), ‘Non-parametric analysis of a generalized regression model: The maximum rank correlation
estimator’, Journal of Econometrics 35, 303–316.
Hansen, L. (1982), ‘Large sample properties of generalized method of moment estimators’, Econometrica
50, 1029–1054.
Heckman, J. J. (1974), ‘Shadow prices, market wages and labor supply’, Econometrica 42, 679–694.
Heckman, J. J. (1976), ‘The common structure of statistical models of truncation, sample selection and
limited dependent variables and a simple estimator for such models’, Annals of Economic and Social
Measurement 5, 475–492.
Heckman, J. J. (1979), ‘Sample selection bias as a specification error’, Econometrica 47, 153–161.
Heckman, J. J. (1990), ‘Varieties of selection bias’, American Economic Review, Papers and Proceedings
80, 313–318.
Heckman, J. J. & Navarro-Lozano, S. (2004), ‘Using matching, instrumental variables, and control functions
to estimate economic choice models’, The Review of Economics and Statistics 86, 30–57.
Heckman, J. J. & Vytlacil, E. (2005), ‘Structural equations, treatment effects, and econometric policy
evaluation’, Econometrica 73, 669–738.
Heckman, J. J., Ichimura, H. & Todd, P. (1997), ‘Matching as an econometric evaluation estimator:
Evidence from evaluating a job training programme’, Review of Economic Studies 64, 605–654.
Heckman, J. J., Ichimura, H. & Todd, P. (1998), ‘Matching as an econometric evaluation estimator’, Review
of Economic Studies 65, 261–294.
Heckman, J. J., Urzua, S. & Vytlacil, E. (2006), ‘Understanding instrumental variables in models with
essential heterogeneity’, The Review of Economics and Statistics 88, 389–432.
Hirano, K. & Imbens, G. W. (2004), The propensity score with continuous treatments, in A. Gelman &
X. L. Meng, eds, ‘Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives’, New York: Wiley, pp. 73–84.
Hirano, K., Imbens, G. W. & Ridder, G. (2003), ‘Efficient estimation of average treatment effects using
the estimated propensity score’, Econometrica 71, 1161–1189.
Horowitz, J. L. (1992), ‘A smoothed maximum score estimator for the binary response model’, Econometrica 60, 505–531.
Horvitz, D. G. & Thompson, D. J. (1952), ‘A generalization of sampling without replacement from a finite
universe’, Journal of the American Statistical Association 47, 663–685.
Huber, M. & Melly, B. (2008), ‘Quantile regression in the presence of sample selection’, unpublished
manuscript.
Ichimura, H. (1993), ‘Semiparametric least squares (sls) and weighted sls estimation of single-index models’,
Journal of Econometrics 58, 71–120.
Ichino, A., Mealli, F. & Nannicini, T. (2008), ‘From temporary help jobs to permanent employment:
what can we learn from matching estimators and their sensitivity?’, Journal of Applied Econometrics
23, 305–327.
Imbens, G. W. (2000), ‘The role of the propensity score in estimating dose-response functions’, Biometrika
87, 706–710.
Imbens, G. W. (2004), ‘Nonparametric estimation of average treatment effects under exogeneity: a review’,
The Review of Economics and Statistics 86, 4–29.
Imbens, G. W. & Angrist, J. (1994), ‘Identification and estimation of local average treatment effects’,
Econometrica 62, 467–475.
Khan, S. & Tamer, E. (2007), ‘Irregular identification, support conditions, and inverse weight estimation’,
unpublished manuscript.
Kitagawa, T. (2008), ‘Testing for exclusion restriction in the selection model’, unpublished manuscript.
Klein, R. W. & Spady, R. H. (1993), ‘An efficient semiparametric estimator for binary response models’,
Econometrica 61, 387–421.
Koenker, R. (2005), Quantile Regression, Cambridge University Press.
Koenker, R. & Bassett, G. (1978), ‘Regression quantiles’, Econometrica 46, 33–50.
Koenker, R. & Bassett, G. (1982), ‘Robust tests for heteroskedasticity based on regression quantiles’,
Econometrica 50, 43–62.
Kumar, A. (2006), ‘Nonparametric conditional density estimation of labour force participation’, Applied
Economics Letters 13, 835–841.
Lechner, M. (1999), ‘Earnings and employment effects of continuous off-the-job training in East Germany
after unification’, Journal of Business and Economic Statistics 17, 74–90.
Lechner, M. (2001), Identification and estimation of causal effects of multiple treatments under the conditional independence assumption, in M. Lechner & F. Pfeiffer, eds, ‘Econometric Evaluations of Active
Labor Market Policies in Europe’, Heidelberg: Physica.
Lechner, M. (2007), ‘A note on the relation of weighting and matching estimators’, University of St. Gallen
Discussion Paper no. 2007-34.
Lechner, M. (2009), ‘Sequential causal models for the evaluation of labor market programs’, Journal of
Business and Economic Statistics 27, 71–83.
Lechner, M. & Melly, B. (2007), ‘Earnings effects of training programs’, IZA Discussion Paper no. 2926.
Lee, D. S. (2005), ‘Training, wages, and sample selection: estimating sharp bounds on treatment effects’,
NBER Working Paper no. W11721.
Li, Q., Racine, J. & Wooldridge, J. (2009), ‘Efficient estimation of average treatment effects with mixed
categorical and continuous data’, forthcoming in the Journal of Business and Economic Statistics.
Manski, C. F. (1975), ‘Maximum score estimation of the stochastic utility model of choice’, Journal of
Econometrics 3, 205–228.
Manski, C. F. (1989), ‘Anatomy of the selection problem’, The Journal of Human Resources 24, 343–360.
Manski, C. F. (1994), The selection problem, in C. Sims., ed., ‘Advances in Econometrics: Sixth World
Congress’, Cambridge University Press, pp. 143–170.
Martins, M. F. O. (2001), ‘Parametric and semiparametric estimation of sample selection models: An
empirical application to the female labour force in Portugal’, Journal of Applied Econometrics 16, 23–39.
Melenberg, B. & van Soest, A. (1996), ‘Parametric and semi-parametric modelling of vacation expenditures’, Journal of Applied Econometrics 11(1), 59–76.
Melly, B. (2006), ‘Estimation of counterfactual distributions using quantile regression’, unpublished
manuscript.
Mroz, T. A. (1987), ‘The sensitivity of an empirical model of married women’s hours of work to economic
and statistical assumptions’, Econometrica 55, 765–799.
Mulligan, C. B. & Rubinstein, Y. (2008), ‘Selection, investment, and women’s relative wages over time’,
Quarterly Journal of Economics 123, 1061–1110.
Murphy, K. M. & Topel, R. H. (1985), ‘Estimation and inference in two-step econometric models’, Journal
of Business and Economic Statistics 3, 88–97.
Newey, W. K. (1984), ‘A method of moments interpretation of sequential estimators’, Economics Letters
14, 201–206.
Newey, W. K. (1999), ‘Two-step series estimation of sample selection models’, MIT Working Papers no.
99-04.
Newey, W. K. (2007), ‘Nonparametric continuous/discrete choice models’, International Economic Review
48, 1429–1439.
Newey, W. K. & McFadden, D. (1994), Large sample estimation and hypothesis testing, in R. Engle &
D. McFadden, eds, ‘Handbook of Econometrics’, Elsevier, Amsterdam.
Newey, W. K., Powell, J. L. & Walker, J. (1990), ‘Semiparametric estimation of selection models: Some
empirical results’, American Economic Review 80, 324–328.
Pagan, A. & Ullah, A. (1999), Nonparametric Econometrics, Cambridge University Press, Cambridge.
Powell, J. (1987), ‘Semiparametric estimation of bivariate latent variable models’, unpublished manuscript.
University of Wisconsin-Madison.
Robins, J. M. & Rotnitzky, A. (1995), ‘Semiparametric efficiency in multivariate regression models with
missing data’, Journal of the American Statistical Association 90, 122–129.
Robins, J. M. & Rotnitzky, A. (1997), ‘Analysis of semi-parametric regression models with non-ignorable
non-response’, Statistics in Medicine 16, 81–102.
Robins, J. M., Rotnitzky, A. & Zhao, L. P. (1995), ‘Analysis of semiparametric regression models for
repeated outcomes in the presence of missing data’, Journal of the American Statistical Association
90, 106–121.
Robinson, P. M. (1988), ‘Root-n-consistent semiparametric regression’, Econometrica 56, 931–954.
Rosenbaum, P. & Rubin, D. B. (1983), ‘The central role of the propensity score in observational studies
for causal effects’, Biometrika 70, 41–55.
Rubin, D. (1974), ‘Estimating causal effects of treatments in randomized and nonrandomized studies’,
Journal of Educational Psychology 66, 688–701.
Rubin, D. B. (1990), ‘Formal modes of statistical inference for causal effects’, Journal of Statistical Planning
and Inference 25, 279–292.
Sekhon, J. S. (2007), ‘Multivariate and propensity score matching software with automated balance optimization: The Matching package for R’, forthcoming in the Journal of Statistical Software.
Vella, F. (1998), ‘Estimating models with sample selection bias: A survey’, The Journal of Human Resources 33, 127–169.
Vytlacil, E. (2002), ‘Independence, monotonicity, and latent index models: An equivalence result’, Econometrica 70, 331–341.
Wooldridge, J. (2002), ‘Inverse probability weighted m-estimators for sample selection, attrition and stratification’, Portuguese Economic Journal 1, 141–162.
Wunsch, C. & Lechner, M. (2008), ‘What did all the money do? On the general ineffectiveness of recent
West German labour market programmes’, Kyklos 61, 134–174.
Zhao, Z. (2008), ‘Sensitivity of propensity score methods to the specifications’, Economics Letters 98, 309–319.