ApEc 8212 Econometric Analysis II -- Lecture #16
Transcription
Models of Sample Selection and Attrition
(Wooldridge, Chapter 19, Sections 1-6)

I. Introduction

So far in this class we have assumed that the data we have are a random sample from some underlying population. But sometimes the data collected are not a random sample, and sometimes the relationships of interest are not observed for some part of the population, although in theory the relationship exists for all members of the population. Examples are:

1. We are interested in estimating the determinants of women's wages, not just of women who are working but of all women (those who are not working would earn a wage if they did work).

2. We want to estimate the impact of some education policy on the test scores of high school students, but the data are collected only from students currently in school, and thus data do not exist for students who "drop out".

The first example is one of data censoring: we have a random sample of the population, but for some members of the sample (and perhaps some of the population) we are missing data on one or more of the variables. More specifically, we observe those variables only if they fall within a particular range.

The second example is data truncation. We don't have a random sample of the population of interest.

A third possibility is incidental truncation: certain variables are observed only if other variables take particular values. The first example fits this.

If the data are not a random sample, then we say that they are a selected sample, and that some kind of selection mechanism generates the selected sample.

II. Data Censoring

Let's start with an example. In some data sets, if the value of a variable is "too high" it is simply set at some large value. For example, for a household income variable, for any household with an annual income of $200,000 or more the income variable is set to $200,000. This is called top coding.

It is important to realize that this is different from a corner solution (Tobit) model where, say, there are a lot of values at zero (or some other number). In a Tobit the "real values" of the variable are in fact zero, whereas in data censoring the "real" values are not equal to the "top code" value. However, the estimation methods are the same, which sometimes confuses people when it is time to interpret the estimates.

Let's start with a simple linear model:

y = x′β + u, with E[u| x] = 0

Let w be the observed value of y. In the above top coding example, we would have:

w = min(y, 200,000)

Binary Censoring

Sometimes the w variable is simply a binary (dummy) variable. Suppose we want to estimate the willingness of the population in some community to pay for a public good, such as a public park. Let wtp represent this "willingness to pay". Since people may have trouble telling interviewers their precise willingness to pay, one approach is to randomly choose a value, call it "r" (which could vary for the people in the survey), and ask survey respondents a very simple question: Are you willing to pay r for a new public park?

Let yi be the willingness to pay for person i. Person i's answer to the above question, when asked for a "reference" value of ri, can be coded as:

wi = 1[yi > ri]

Assume that yi = xi′β + ui. How can we estimate β with such data? If we are willing to assume that:

ui| xi, ri ~ N(0, σ²)

then we can use probit estimation if we make the further assumption that ri is independent of ui (i.e., ri is independent of yi conditional on xi):

D(yi| xi, ri) = D(yi| xi)

This allows us to specify the probability that wi = 1, conditional on xi and ri, as:

Prob[wi = 1| xi, ri] = Prob[yi > ri| xi, ri] = Prob[ui/σ > (ri - xi′β)/σ| xi, ri] = 1 - Φ((ri - xi′β)/σ) = Φ((xi′β - ri)/σ)

Question: In the probit model, we could only estimate β/σ, not β or σ separately. Is that also the case here? [Hint: We know the values of the ri's.]

We can use maximum likelihood methods to estimate this model. However, as with the standard probit, if u is heteroscedastic or not normally distributed then our estimates will be inconsistent.
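To make this concrete, here is a minimal simulation sketch of the willingness-to-pay probit (assuming Python with numpy and scipy; all variable names and parameter values are illustrative, not from the lecture). Because the ri are observed, the likelihood identifies β and σ separately:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.normal(size=n)
    b0, b1, sigma = 1.0, 2.0, 1.5                      # assumed "true" values
    y = b0 + b1 * x + rng.normal(scale=sigma, size=n)  # latent WTP, never observed
    r = rng.uniform(-2.0, 6.0, size=n)                 # randomly assigned bid values
    w = (y > r).astype(float)                          # observed yes/no answers

    def negloglik(theta):
        c0, c1, log_s = theta
        s = np.exp(log_s)                              # keeps sigma positive
        p = norm.cdf((c0 + c1 * x - r) / s)            # Prob(w=1) = Phi((x'b - r)/sigma)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(w * np.log(p) + (1 - w) * np.log(1 - p))

    res = minimize(negloglik, x0=np.zeros(3), method="BFGS")
    print(res.x[0], res.x[1], np.exp(res.x[2]))        # approx 1.0, 2.0, 1.5

Unlike in a standard probit, scaling β and σ by a common constant changes Φ((xi′β - ri)/σ), because ri enters without being scaled; that is why both β and σ are identified here.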
If we could get people to tell us their willingness to pay (their value for y), we could use linear methods, but we probably should not use this specification if a substantial proportion report a willingness to pay of zero. In that case a (Type I) Tobit specification makes more sense. See p.782 of Wooldridge for further discussion and a couple of other ideas for specifying willingness to pay.

Interval Coding

Sometimes the value of y is not observed precisely; instead we observe only an "ordered" indicator variable that denotes which "interval" y falls into. The most common example is income: in some household surveys respondents are not asked their precise income, only what "range" it falls in. This type of data is called interval-coded (or interval censored) data.

Assume again that E[y| x] = x′β. Let the known interval limits be r1 < r2 < … < rJ. The censored variable, w, is related to y as follows:

w = 0 <=> y ≤ r1
w = 1 <=> r1 < y ≤ r2
…
w = J <=> y > rJ

In terms of estimation this is very much like an ordered probit or logit. If we assume that the error term is normally distributed we have an ordered probit, and the log likelihood function is given by:

ℓi(β, σ) = 1[wi = 0]ln[Φ((r1 - xi′β)/σ)] + 1[wi = 1]ln[Φ((r2 - xi′β)/σ) - Φ((r1 - xi′β)/σ)] + … + 1[wi = J]ln[1 - Φ((rJ - xi′β)/σ)]

Question: In the standard ordered probit, we can only estimate β/σ. Is that the case here? What is the intuition for your answer? What does it imply for estimating partial effects?

Additional comments on this model (see Wooldridge, p.784), with a code sketch of the likelihood after this list:

1. Sometimes the r's vary over different observations. That does not cause any problems for estimation.

2. If any x variables are endogenous, this can be fixed using the Rivers and Vuong (1988) method.

3. Panel data methods (random effects) can be used.
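A minimal sketch of this interval-coded maximum likelihood (again assuming numpy and scipy; the cut points, sample design, and parameter values are all illustrative):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n = 4000
    x = rng.normal(size=n)
    y = 0.5 + 1.0 * x + rng.normal(scale=2.0, size=n)  # latent y, never observed
    cuts = np.array([-2.0, 0.0, 2.0])                  # known limits r1 < r2 < r3
    w = np.digitize(y, cuts)                           # w in {0, 1, 2, 3}, so J = 3

    lower = np.concatenate(([-np.inf], cuts))          # lower bound of each interval
    upper = np.concatenate((cuts, [np.inf]))           # upper bound of each interval

    def negloglik(theta):
        b0, b1, log_s = theta
        s = np.exp(log_s)
        xb = b0 + b1 * x
        p = norm.cdf((upper[w] - xb) / s) - norm.cdf((lower[w] - xb) / s)
        return -np.sum(np.log(np.clip(p, 1e-300, None)))

    res = minimize(negloglik, x0=np.zeros(3), method="BFGS")
    print(res.x[0], res.x[1], np.exp(res.x[2]))        # approx 0.5, 1.0, 2.0

Because the cut points are known constants rather than estimated thresholds, β and σ are separately identified, so the partial effects on E[y| x] are simply the elements of β.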
Censoring from Above and Below

Recall the "top coding" example above. It is straightforward to estimate this using the Tobit approach discussed in the previous lecture. The important thing is to be careful when interpreting the results. If y really does have values beyond the "top code" (which it does), then this is "real" censoring and what we want to estimate is β; we are not interested in estimating the probability that an observation hits the "top code". In contrast, in "corner solution" models we are interested in that probability. For a more detailed discussion, see Wooldridge, pp.785-790. Wooldridge also provides an overview of sample selection, with two examples, on pp.790-792.

III. When is Sample Selection NOT a Problem?

Sample selection does not always lead to bias when standard methods (e.g. OLS) are applied to the selected sample. In general, if sample selection is based on exogenous explanatory variables, that is, on variables that are uncorrelated with the error term in the equation of interest, then there is no problem with applying standard methods to the selected sample.

Let's start with linear models, both OLS and 2SLS (IV). The model is:

y = β1x1 + β2x2 + … + βKxK + u = x′β + u, with E[zu] = 0

where z is a vector of L instruments for possible use in IV estimation. Note that x1 is just a constant. If x = z then E[xu] = 0, so we can use OLS:

E[y| x] = x′β

Returning to the general case where some elements of x may be correlated with u, let s be a binary (dummy variable) selection indicator. For any member of the population, s = 1 indicates that the observation is not "blocked" from being drawn into our sample, while s = 0 means that it is "blocked", i.e. cannot be in our sample. Suppose we draw a random sample of {xi, yi, zi, si} from some population; if si = 0 the observation cannot be in our sample. The 2SLS estimate of β using the observed data, denoted β̂2SLS, is:

β̂2SLS = [((1/N)Σi sixizi′)((1/N)Σi sizizi′)⁻¹((1/N)Σi sizixi′)]⁻¹ × ((1/N)Σi sixizi′)((1/N)Σi sizizi′)⁻¹((1/N)Σi siziyi)

where all sums run over i = 1, …, N. (Notice that when si = 0 the observation is dropped from the estimation.) Next, replace yi with xi′β + ui:

β̂2SLS = β + [((1/N)Σi sixizi′)((1/N)Σi sizizi′)⁻¹((1/N)Σi sizixi′)]⁻¹ × ((1/N)Σi sixizi′)((1/N)Σi sizizi′)⁻¹((1/N)Σi siziui)

Everything to the right of β "disappears" (in the probability limit) if E[siziui] = 0. More formally, we have the following theorem:

Theorem 19.1: Consistency of 2SLS (and OLS) under Sample Selection. Assume that E[u²] < ∞, E[xj²] < ∞ for all j = 1, …, K, and E[zj²] < ∞ for all j = 1, …, L. Assume also that:

E[szu] = 0
rank{E[zz′| s = 1]} = L
rank{E[zx′| s = 1]} = K

Then plim[β̂2SLS] = β, and β̂2SLS is asymptotically normally distributed.

The assumption that E[szu] = 0 is key, so it merits further discussion. Note first that E[zu] = 0 does not by itself imply that E[szu] = 0. However, if E[zu] = 0 and s is independent of z and u, then we have:

E[szu] = E[s]E[zu] = 0

The assumption that s is independent of z and u is very strong. In effect it assumes that the censored observations are "dropped randomly", which, if true, implies that censoring does not lead to bias. This can be called missing completely at random (MCAR).

A somewhat more realistic assumption is that the selection (censoring) is a function of the exogenous variables but not a function of u. That is, conditional on z, u and s are uncorrelated:

E[u| z, s] = 0

You should be able to show (applying iterated expectations) that this implies E[szu] = 0. This is sometimes called exogenous sampling. That is, after conditioning on (controlling for) z, s has no predictive power for (and so is uncorrelated with) u.

To see why selection that is a function only of the exogenous variables implies that E[u| z, s] = 0, "strengthen" the assumption E[zu] = 0 to E[u| z] = 0. Then when selection is a function only of z, which can be expressed as s = h(z), we have:

E[u| z] = 0 => E[u| z, h(z)] = 0, which implies E[u| z, s] = 0

More generally, if s is independent of z and u (which is true if s is independent of y, z and x), then E[u| z, s] = 0. Note that if we don't need to instrument for x, that is, if E[u| x, s] = 0, then we have:

E[y| x, s] = E[y| x] = x′β

Sometimes the assumption that E[y| x, s] = x′β is called missing at random (MAR).
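A small simulation can illustrate the distinction (a sketch assuming only numpy; the design and parameter values are invented for illustration). Selection based only on x leaves OLS consistent, while selection based on y, and hence on u, does not:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    y = 1.0 + 2.0 * x + u

    def ols_slope(s):
        return np.polyfit(x[s], y[s], 1)[0]   # OLS slope on the selected sample

    s_exog = x > 0      # selection depends only on x, so E[u| x, s] = 0
    s_endog = y > 1.0   # selection depends on y, and therefore on u

    print(ols_slope(s_exog))    # close to 2.0: exogenous selection, no bias
    print(ols_slope(s_endog))   # noticeably below 2.0: selection on y biases OLS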
If we make the more general assumption that, conditional on z, u and s are independent, that is D[s| z, u] = D[s| z], then:

Prob[s = 1| z, u] = Prob[s = 1| z]

If we add the homoscedasticity assumption that E[u²| z, s] = σ², then the standard estimate of the covariance matrix for β̂2SLS is valid (for details see p.796 of Wooldridge). If there is heteroscedasticity we can use the heteroscedasticity-robust standard errors for 2SLS (see pp.16-17 of Lecture 4). For OLS, analogous results hold; just replace z with x everywhere (including in the assumptions).

A final useful result occurs when s is a non-random function of x and some variable v not in x: s = s(x, v). Here we allow u and s to be correlated: E[u| s] ≠ 0. If the joint distribution of u and v is independent of x, then E[u| x, v] = E[u| v]. This implies that:

E[y| x, s(x, v)] = E[y| x, v] = x′β + E[u| v]

Assuming a particular functional form for E[u| v], such as the linear form E[u| v] = γv, implies:

E[y| x, s] = x′β + γv

Thus adding v as a regressor will give consistent estimates of β (and γ) even for a sample that includes only the observations with s = 1. Similar results hold for nonlinear models. See Wooldridge, pp.798-799.

IV. Selection Based on y (Truncated Regression)

Suppose that inclusion in the sample is based on the value of the y variable. One example is a 1970s study of the impact of a "negative income tax". This was an experimental study that excluded households whose income was more than 1.5 times the poverty line.

Suppose that y is a continuous variable, and that data are available for y (and x) only if the following holds:

a1 < yi < a2, which implies si = 1[a1 < yi < a2]

where 1[ ] is an indicator function that equals 1 if the condition inside the brackets holds and 0 otherwise. Continue to assume that E[yi| xi] = xi′β. We are interested in estimating β.

The selection rule depends on y, and thus depends on u. If we simply apply OLS to the sample for which si = 1, the estimates of β will be inconsistent (draw a picture to give intuition). Maximum likelihood estimation will give consistent estimates. This requires specification of the density function of yi conditional on xi: f(y| xi; β, γ), where γ is a vector of parameters describing the distribution of u.

It is easier to start with the cdf (cumulative distribution function) of yi conditional on xi and si = 1 (i.e. a1 < yi < a2):

P[yi ≤ y| xi, si = 1] = P[yi ≤ y, si = 1| xi] / P[si = 1| xi]

Because we are conditioning on si = 1, we need to divide the probability that both yi ≤ y and si = 1 by the probability that si = 1. The denominator is P[si = 1| xi] = P[a1 < yi < a2| xi] = F(a2| xi; β, γ) - F(a1| xi; β, γ). If a2 = ∞ (no upper limit on y) then F(a2| xi; β, γ) = 1; if a1 = -∞ (no lower limit on y) then F(a1| xi; β, γ) = 0.

What about the numerator? For any y between a1 and a2 we have:

P[yi ≤ y, si = 1| xi] = P[a1 < yi ≤ y| xi] = F(y| xi; β, γ) - F(a1| xi; β, γ)

The density of yi conditional on xi and si = 1 is obtained by differentiating this cdf with respect to y:

f(y| xi, si = 1) = f(y| xi; β, γ) / [F(a2| xi; β, γ) - F(a1| xi; β, γ)]

[Wooldridge's notation for this density is p(y| xi; si = 1).]

To estimate β we specify the distribution of u and then use (conditional) maximum likelihood methods (we are conditioning on si = 1); replace y in the above expression for f(y| xi, si = 1) with yi for each observed data point. Typically we assume that u is normally distributed with variance σ², which amounts to assuming that y ~ N(x′β, σ²). This is called the truncated Tobit (truncated regression) model; in this case γ is just σ².
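A sketch of this truncated-normal conditional MLE (assuming numpy and scipy; the truncation limits and parameter values are illustrative):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n = 50_000
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    a1, a2 = -1.0, 4.0
    keep = (y > a1) & (y < a2)        # only these observations reach the data set
    xt, yt = x[keep], y[keep]

    def negloglik(theta):
        b0, b1, log_s = theta
        s = np.exp(log_s)
        xb = b0 + b1 * xt
        log_f = norm.logpdf((yt - xb) / s) - np.log(s)            # log f(y| x)
        p_in = norm.cdf((a2 - xb) / s) - norm.cdf((a1 - xb) / s)  # P(s=1| x)
        return -np.sum(log_f - np.log(p_in))    # log of the conditional density

    res = minimize(negloglik, x0=np.array([0.0, 1.0, 0.0]), method="BFGS")
    print(res.x[0], res.x[1], np.exp(res.x[2]))   # approx 1.0, 2.0, 1.0

In this simulation, OLS of yt on xt should give a slope well below 2, which is exactly the inconsistency that the truncation causes and the conditional MLE repairs.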
Question: How is this different from the Tobit model discussed in Lecture 15?

Unfortunately, if the normality and homoscedasticity assumptions are false, we get inconsistent estimates.

V. A Probit Selection Equation (selection on "y2")

A more complicated model involves missing observations for y (and perhaps some of the associated x's) based on the value of some other endogenous variable, call it "y2". The most common example is that we do not observe wages for people who are not working. We can think of another equation, one for hours worked (y2): we do not observe y (wages) when y2 = 0 (the person is not working). The first paper that examined this in detail was Gronau (1974); this example is explained in Wooldridge, pp.802-803.

More simply, y2 can be a dummy variable that equals 0 when a person is not working and 1 if a person is working. The following two equations give the general model:

y1 = x1′β1 + u1
y2 = 1[x′δ2 + v2 > 0]

where again 1[ ] is an indicator function and x1 is a subset of the variables in x. These two equations are sometimes called the Type II Tobit model.

It is useful to state some assumptions that will be used in much of the rest of this lecture:

Assumption 19.1:
(a) x and y2 are always observed, but y1 is observed only if y2 = 1.
(b) Both u1 and v2 are independent of x, but perhaps not of each other.
(c) v2 ~ N(0, 1)
(d) E[u1| v2] = γ1v2

Note that (b) implies that if we observed all the observations we could estimate E[y1| x] using OLS. So the problem here is not endogenous x variables.

So, how can we estimate this thing? Any random "draw" from the population gives the following for observation i: {y1, y2, x, u1, v2}. The problem is that when y2 = 0 we do not observe y1, although it really does exist (e.g. if a person did choose to work he or she would earn some wage). We definitely observe, and thus can hope to estimate, E[y1| x, y2 = 1]. We can also estimate P[y2 = 1| x], which under the assumptions will give us an estimate of δ2. But with this can we estimate β1? The answer is: yes, we can.

To start, note that:

E[y1| x, v2] = x1′β1 + E[u1| x, v2] = x1′β1 + E[u1| v2] = x1′β1 + γ1v2

One useful implication of this expression is that when γ1 = 0, then E[y1| x, v2] = x1′β1 = E[y1| x]. In this case we can estimate β1 by running OLS on the observations for which y1 is observed.

For the case where γ1 ≠ 0 we can modify the above (conditioning on y2 is more useful than conditioning on v2, since we observe y2 but not v2):

E[y1| x, y2] = E[E[y1| x, v2]| x, y2] = x1′β1 + γ1E[v2| x, y2] = x1′β1 + γ1h(x, y2)

where h(x, y2) is defined as E[v2| x, y2]. [The first equality holds by the law of iterated expectations; (x, y2) is the "smaller information set" relative to (x, v2) because x and v2 together tell us what y2 is, but x and y2 together do not tell us what v2 is (they only give a range for v2).]

If we knew the functional form of h(x, y2), then by Theorem 19.1 we could construct the variable h(x, y2) from x and y2 and then regress y1 on x1 and h(x, y2) to obtain a consistent estimate of β1 using only the observations for which y1 is observed. In particular, if y1 is observed then we know that y2 = 1, so we need to know:

h(x, 1) = E[v2| x, y2 = 1] = E[v2| x, v2 > -x′δ2] = E[v2| v2 > -x′δ2] = λ(x′δ2)

where λ(x′δ2) = φ(x′δ2)/Φ(x′δ2) is the inverse Mills ratio (φ and Φ are the standard normal pdf and cdf). (The second-to-last equality follows from the independence of v2 and x.) Thus we have:

E[y1| x, y2 = 1] = x1′β1 + γ1λ(x′δ2)

This suggests the following procedure for estimating β1, introduced by Heckman (1976, 1979) and sketched in code below:

1. Estimate δ2 by running a probit of y2 on x.
2. Generate λ(x′δ̂2).
3. Regress y1 on x1 and on λ(x′δ̂2) to estimate β1 and γ1.
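A minimal sketch of the two-step procedure (assuming statsmodels is available; the data-generating values, including the exclusion of x2 from the outcome equation, are invented for illustration):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    n = 20_000
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)                        # appears only in the selection eq.
    v2 = rng.normal(size=n)
    u1 = 0.7 * v2 + rng.normal(scale=0.5, size=n)  # E[u1| v2] = 0.7*v2, so gamma1 = 0.7
    y2 = (0.5 + 1.0 * x1 + 1.0 * x2 + v2 > 0).astype(int)  # selection indicator
    y1 = 1.0 + 2.0 * x1 + u1                       # observed only when y2 = 1

    # Step 1: probit of y2 on x = (1, x1, x2), then the inverse Mills ratio
    X = sm.add_constant(np.column_stack([x1, x2]))
    delta2_hat = sm.Probit(y2, X).fit(disp=0).params
    xd = X @ delta2_hat
    imr = norm.pdf(xd) / norm.cdf(xd)              # lambda(x'delta2_hat)

    # Steps 2-3: OLS of y1 on x1 and lambda, selected sample only
    sel = y2 == 1
    X1 = sm.add_constant(np.column_stack([x1[sel], imr[sel]]))
    print(sm.OLS(y1[sel], X1).fit().params)        # approx (1.0, 2.0, 0.7)

The t-statistic on the λ term is the usual test for selection bias (γ1 = 0), although, as noted below, the standard errors reported here ignore the fact that δ̂2 is itself an estimate.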
This procedure gives estimates of β1 and γ1 that are consistent and asymptotically normal. You can test whether there is "sample selection bias" by testing whether γ1 = 0. When γ1 ≠ 0, things get a little messy because the second-step regression is heteroscedastic. The heteroscedasticity is not hard to fix; the harder problem is that the standard errors for β1 and γ1 need to be recalculated because δ̂2 is an estimate of δ2, and the variance of that estimate must be accounted for in calculating the covariance matrix of the estimates of β1 and γ1.

An important point: technically speaking, the assumption that v2 is normally distributed means that we do not need any "identifying" variables in x beyond what is in x1. But this sort of "identification from a functional form assumption" (the assumption that v2 is normally distributed) is not very credible. It is much more convincing to have some variable in x that is not in x1, and to have a sound (economic) theoretical reason for excluding that variable from x1.

Two final notes:

1. If you assume that u1 is also normally distributed, then you can use (conditional) maximum likelihood, which is more efficient than this two-step method.

2. Some recent methods have been developed that do not require either u1 or v2 to follow any particular distribution. References are Ahn and Powell (1993) and Vella (1998).

Endogenous Explanatory Variables

Suppose that one of the x variables is endogenous. This leads to the model:

y1 = z1′δ1 + α1y2 + u1
y2 = z′δ2 + v2
y3 = 1[z′δ3 + v3 > 0]

The errors u1, v2 and v3 are all assumed to be uncorrelated with z, but they could be correlated with each other. We are primarily interested in estimating δ1 and α1, the parameters of the structural equation of interest. The following assumption clarifies what data are observed:

Assumption 19.2:
(a) z and y3 are always observed; y1 and y2 are observed only if y3 = 1.
(b) u1 and v3 are independent of z.
(c) v3 ~ N(0, 1)
(d) E[u1| v3] = γ1v3
(e) E[zv2] = 0, and z′δ2 = z1′δ21 + z2′δ22, where δ22 ≠ 0.

Note that (e) is needed to identify the first equation. We can write the structural equation as:

y1 = z1′δ1 + α1y2 + g(z, y3) + e1

where g(z, y3) = E[u1| z, y3] and e1 ≡ u1 - E[u1| z, y3]. By definition, E[e1| z, y3] = 0. Note that g(z, y3) here plays the same role as h(x, y2) above: it controls for selection bias.

If we knew g(z, y3), we could just estimate this equation using 2SLS on the selected sample. Since we do not know it, we need to estimate it, just as we did above where we used OLS after generating λ( ). Thus we can use the following procedure (sketched in code below):

1. Estimate δ3 in the selection equation using a probit of y3 on z.
2. Generate the inverse Mills ratio λ(z′δ̂3).
3. Estimate the following equation on the selected sample, using 2SLS:

y1 = z1′δ1 + α1y2 + γ1λ(z′δ̂3) + error

The IVs are z and λ(z′δ̂3).
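A sketch of this procedure (statsmodels for the probit, with the 2SLS step computed directly in numpy; the design, including the two excluded variables z2 and z3, is invented for illustration):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    n = 50_000
    z1 = rng.normal(size=n)
    z2 = rng.normal(size=n)                    # excluded instrument for y2
    z3 = rng.normal(size=n)                    # excluded variable driving selection
    v2 = rng.normal(size=n)
    v3 = rng.normal(size=n)
    u1 = 0.5 * v3 + 0.5 * v2 + rng.normal(size=n)   # correlated with v2 and v3
    y2 = 0.3 + 0.8 * z1 + 1.2 * z2 + v2             # endogenous regressor
    y3 = (0.2 + 0.5 * z1 + 0.6 * z2 + 0.9 * z3 + v3 > 0).astype(int)  # selection
    y1 = 1.0 + 2.0 * z1 - 1.5 * y2 + u1             # structural equation

    # Steps 1-2: probit of y3 on all of z, then the inverse Mills ratio
    Z_all = sm.add_constant(np.column_stack([z1, z2, z3]))
    delta3_hat = sm.Probit(y3, Z_all).fit(disp=0).params
    imr = norm.pdf(Z_all @ delta3_hat) / norm.cdf(Z_all @ delta3_hat)

    # Step 3: 2SLS on the selected sample; instruments are (1, z1, z2, z3, lambda)
    s = y3 == 1
    X = np.column_stack([np.ones(s.sum()), z1[s], y2[s], imr[s]])        # regressors
    Z = np.column_stack([np.ones(s.sum()), z1[s], z2[s], z3[s], imr[s]]) # instruments
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)        # first-stage fitted values
    beta = np.linalg.solve(Xhat.T @ X, Xhat.T @ y1[s])  # 2SLS: (Xhat'X)^-1 Xhat'y
    print(beta)   # approx (1.0, 2.0, -1.5, 0.5); standard errors still need adjusting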
Some other points:

1. If γ1 = 0 then there is no sample selection bias, and you can just use 2SLS on the selected sample and ignore the selection equation.

2. As in the selection model with only exogenous explanatory variables, the standard errors need to be adjusted, since δ̂3 is an estimate of δ3.

3. y2 could be a dummy variable, and no assumptions are needed that u1 and/or v2 follow any particular distribution.

4. There should be at least two variables in z that are not in z1, since two variables must be instrumented (counting λ(z′δ̂3) as the instrument for g(z, y3)).

Returning to the case where all the explanatory variables are exogenous, Wooldridge explains (pp.813-814) how to estimate a model where both y1 and y2 are dummy variables.