Small Sample Properties of GMM and GEL Estimation and Inference
Transcription
Small Sample Properties of GMM and GEL Estimation and Inference
Small Sample Properties of GMM and GEL Estimation and Inference Procedures for Social Interaction Models Bryan S. Graham* (Initial Draft: April 17th , 2005) (This Draft: May 5th , 2005) by Abstract This paper studies the small sample properties of generalized method-of-moments (GMM) and generalized empirical likelihood (GEL) estimators and testing procedures. For linear IV-type models both two-step GMM and GEL achieve first order efficiency by replacing an infeasible optimal instrumental variable with a consistent estimate (c.f., Newey 1993). GEL, unlike two-step GMM, uses a semiparametrically efficient estimate of this optimal instrument. As a result the first order conditions for GEL are asymptotically mean zero, while those for two-step GMM are not. Many discrepancies in the small sample performance of the two estimators and their associated inference procedures are attributable to this difference. The paper uses the ‘linear-in-means’ model of social interactions as an organizing example. Conditional variance restrictions (moments) provide one means of identifying the strength of social interactions in this model (Graham 2005). The Monte Carlo designs, loosely calibrated to the peer effects and learning application pursed by Graham (2005), are also interpretable as being based upon linear simultaneous-equations models with heteroscedastic and right-skewed errors. This is a useful setting for comparing the small sample properties of GMM and GEL estimators. While first order asymptotically equivalent, for the designs considered here, higher order expansions suggest non-trivial differences in asymptotic bias, both between GMM and GEL, as well as among specific members of the GEL family (Newey and Smith 2004). The usefulness of higher order expansions for rationalizing differences in small sample bias is directly evaluated. The coverage properties of alternative approaches to constructing confidence intervals are also investigated. Including two new approaches to constructing confidence intervals specific to GEL. GEL estimators pose sizable computational challenges. As a result information on their small sample properties is limited and their use in actual empirical research even more so. This paper outlines and uses two alternative algorithms for computation. JEL Classification: C12, C15, C31. Key Words: Social Interactions, Excess Variance Contrasts, GMM, Generalized Empirical Likelihood (GEL), Quasi-Maximum Likelihood (QML), Bias, Confidence Intervals, Monte Carlo. * I would like to thank Gary Chamberlain for many extensive and very useful conversations. I have also found comments and suggestions from Caroline Hoxby, Dale Jorgenson, Larry Katz, and Jim Stock as well as feedback from seminar participants at the University of California - San Diego, Harvard-MIT and the Harvard Econometrics Lunch to be very useful. Financial support provided by the Program of Fellowships for Junior Scholars, the MacArthur Research Network on Social Interactions and Economic Inequality. All the usual disclaimers apply. Correspondence: Department of Economics, Harvard University, Littauer Building, Cambridge, MA 02138. e-mail: [email protected]. 1 1 Introduction It is now widely recognized that the small sample properties of the conventional two-step GMM estimator and associated inference procedures can be extraordinarily poor in empirical settings of substantive interest to economists (e.g., Tauchen 1986, Altonji and Segal 1996, Clark 1996, Hansen, Heaton and Yaron 1996). Motivated by these concerns several alternative estimators and testing procedures for statistical models defined by moment restrictions have been suggested. These include the empirical likelihood (EL) estimator of Imbens (1997) and Qin and Lawless (1994), the exponential tilting (ET) estimator of Kitamura and Stutzer (1997) and Imbens, Spady and Johnson (1998), as well as the continuously updating (CU) estimator of Hansen, Heaton and Yaron (1996). Newey and Smith (2004) show that these estimators are members of a common family — dubbed generalized empirical likelihood (GEL) estimators — and characterize their higher-order properties.1 Parallel work focuses on improving the coverage properties of confidence intervals for parameters identified by moment conditions. Alternatives are typically linked to various members of the GEL family of estimators. Examples include intervals formed by inverting the ‘S-statistic’ of Stock and Wright (2000), the minimum-χ2 ‘tilting’ parameter statistic of Imbens and Spady (2002), the score statistic of Guggenberger and Smith (2004) and Kleinbergen (2005), and the ‘many weak moments’ Wald statistic of Newey and Windmeijer (2005). Despite a substantial amount of promising theoretical research, little is known about the small sample properties of these estimators. An important reason for this lack of knowledge is that the most convenient representation of GEL parameter estimates is as the solution to a saddle-point (i.e., min-max) problem that is difficult to quickly and reliably solve (c.f., Mittelhammer, Judge and Schoenberg 2005). Monte Carlo experiments are correspondingly costly, with the few smallscale studies currently available being quite limited in their coverage.2 With a few exceptions, such as Imbens (1997, 2002), existing Monte Carlo studies are also not calibrated to specific economic data sets.3 The Monte Carlo experiments reported below directly respond to these concerns. The basic design uses the ‘linear-in-means’ model of social interactions with identification arising from conditional variance restrictions (Graham 2005). The design, calibrated to the Project STAR peer effects application of Graham (2004), is equivalent to one based on a linear simultaneous equations model with modest endogeneity as well as heteroscedastic and right-skewed first-stage and structural errors. This setting provides an interesting context in which to explore the small sample properties of 1 Imbens (2002) provides an accessible, practitioner-oriented, introduction to this literature. In some Monte Carlo studies the outer portion of the saddle-point problem is solved by a costly grid-search to ensure numerical accuracy (e.g., Hansen, Heaton and Yaron 1996, Stock and Wright 2000, Guggenberger 2005a, 2005b). In others the saddle-point representation is replaced by an expensive, but apparently reliable, penaltyfunction approach (Imbens, Spady and Johnson 1998, Ramalho 2005). For computational reasons, Guggenberger and Hahn (2005) even study the properties of a two-step approximation to the empirical likelihood estimator. 3 Most Monte Carlo studies are based upon the linear simultaneous equations model with homoscedastic errors and a single endogenous regressor (e.g., Guggenberger 2005a, 2005b, Guggenberger and Hahn 2005) or simple AR(1) dynamic panel data models (e.g., Bond, Bowsher and Windmeijer 2001, Newey and Windmeijer 2005). Others are based on stylized Monte Carlo designs inherited from older studies on the finite sample properties of GMM (e.g., Imbens, Spady and Johnson 1998, Imbens and Spady 2002, Ramalho 2005). 2 2 GMM and GEL estimators and test statistics. Previous research indicates that two-step GMM performs very poorly when the identifying moments are derived from variance-covariance restrictions and there is modest overidentification (Altonji and Segal 1996, Clark 1996, Imbens 1997, Horowitz 1998). Introducing endogeneity is unlikely to improve the situation. The higher order expansions of Newey and Smith (2004) also suggest substantively important differences both between GMM and GEL, as well as among specific GEL estimators in settings like those considered here. Section 2 outlines the ‘linear-in-means’ model and quickly reviews the key identification result of Graham (2004). The section also outlines the linear simultaneous-equations interpretation of the model. Sections 3 reviews the two-step GMM estimator and associated testing procedures. Section 4 defines the GEL estimator and reviews some of its main first order asymptotic properties. Section 5 discusses the connection between GEL and (semiparametrically) efficient estimates of expectations. In the context of linear IV-type models, both two-step GMM and GEL achieve first order asymptotic efficiency by replacing an infeasible optimal instrument with a consistent estimate. While GEL uses an efficient estimate of this instrument, two-step GMM does not. Section 6 discusses how this difference affects the higher order asymptotic properties of the two estimators and associated testing procedures. Section 7 outlines and discusses issues of inference with GEL estimators. Section 8 outlines the various Monte Carlo designs and reports simulation results. Newey and Smith’s (2004) asymptotic bias formulae are complex functions of the conditional variance and skewness of the moment vector. I calculate asymptotic bias analytically for each estimator and Monte Carlo design and compare it to the corresponding, small sample, Monte Carlo estimate of median bias. This provides a direct test of the usefulness of higher order asymptotic theory for understanding the bias properties of GMM and GEL estimators. While exercises of this type have been undertaken in the context of the linear simultaneous-equations model with homoscedastic and symmetric errors (e.g., Hahn, Hausman and Kuersteiner 2004), there is no previous work of this kind in the GMM/GEL context. Suggestions for applied researchers are given in Section 9. Proofs and derivations are collected in Appendices A to D. Appendix E outlines the computational methods used in the Monte Carlo work. The approach builds on the Newton-Raphson and quasi-Newton procedures suggested by Owen (2001) and Imbens (2002) respectively. I have found my implementation of this method, which uses a novel estimate of the initial Hessian matrix, to be fast and reliable in most settings. For weakly identified designs with few overidentifying restrictions, the profiled saddle-point criterion function can be very flat in the parameter of interest, making it difficult for standard quasi-Newton algorithms to maintain positive definiteness of the approximate Hessian. For this setting I outline a computationally convenient GEL analog of Fisher-scoring. Appendix F provides auxiliary details on the Monte Carlo experiments. 2 The linear-in-means model In this paper I work with the reduced form of the ‘linear-in-means’ model of social interactions yci = αc + (γ − 1) εc + εci , 3 (1) where c = 1, . . . , N indexes social groups and i = 1, . . . , Mc individuals within groups. I assume the presence of group-size variation (i.e., V ar (Mc ) > 0). In particular, the marginal distribution of P group sizes is multinomial with Mc ∈ M ≡ {m1 , . . . , mS }, Pr (Mc = ms ) = ps > 0 and Ss=1 ps = 1. Equation (1) is the reduced form of a simple structural model of social interactions reviewed by Graham (2005). In equilibrium, individual outcomes depend on group-level variables, αc , own characteristics, εci , and mean peer group characteristics, εc . The parameter of interest is γ, the ‘social multiplier’. Assume (1) is a reduced form model of student achievement as in Graham (2005) with αc equaling teacher quality, εci student ability, and εc the average ability of students in the cth classroom. If a low ability student is replaced with a high ability student average achievement in the classroom will increase for pure compositional reasons as well as because the high ability student raises peer quality, and hence positive learning spillovers, in the classroom. The ratio of the full equilibrium effect to the pure compositional effect equals γ. Since γ is a multiplier its null value is one. I assume that E [yci |Mc ] = E [αc |Mc ] = E [εci |Mc ] = 04 and consider the identifying assumptions £ ¤ E εc ε0c |Mc = (σ 2 (Mc ) − σ εε )IMc + σεε ιMc ι0Mc V ar (αc |Mc ) = σ 2α , (2) Cov (αc , εc |Mc ) = σαε ιMc , where εc = (εc1 , . . . , εcMc )0 and IMc and ιMc denote an Mc × Mc identity matrix and an Mc × 1 vector of ones respectively. Under (2) we can assume, without loss of generality, that σεε = σ αε = 0, which is done henceforth.5 The variance-covariance matrix of individual outcomes yc = (yc1 , . . . , ycMc )0 , conditional on group size, is then ¤ £ ¡ ¢ V ar (yc |Mc ) = Ω (φ; Mc ) = σ2 (Mc ) IMc + σ 2α + γ 2 − 1 σ 2 (Mc ) Mc−1 ιMc ι0Mc . (3) The goal is to estimate and conduct inference on the social interactions parameter γ 2 . Observe that (3) follows a one-way error component structure; this is motivated by the combination of within-group exchangeability of individuals and conditional exchangeability of social groups across groups of the same size. This suggests that the relevant unrestricted or reduced form model involves a total of 2S parameters corresponding to the two error components for each of the S group sizes. Identification comes from imposing restrictions on how these error components vary with group size. ¡ ¡ ¡ ¢0 ¢0 ¢0 Let φ = σ 20 , θ0 with σ 2 = σ21 , . . . , σ 2S where σ2s = σ 2 (ms ) , θ = σ 2α , γ 2 and dim (φ) = ¡ ¢ dim σ 2 + dim (θ) = S + K. The parameter space is φ∈Φ≡ 4 ©¡ 2 2 2 ¢ 2 ª σ , σα , γ : σ > RS+ , σ 2α ≥ 0, γ 2 > 0 . (4) This condition can be satisfied in practice by working with residuals from a preliminary nonparametric regression of yci on Mc . Since the support of Mc is discrete, no smoothing is required with a sufficiently large data set; one can simply work with deviations of yci from its group size specific mean. 5 For example, under (2) we can always redefine non-orthogonal εci and αc to equal ε∗ci = εci − E ∗ [εci |αc ] and α∗c = αc + E ∗ [εci |αc ] where E ∗ [εci |αc ] denotes the (minimum mean-square-error) linear predictor of αc for εci (c.f, Chamberlain 1984, p. 1249). 4 Identification requires that the conditional variance of yc is given by (3) for some choice of φ. Condition 1 For some φ0 ∈ Φ, V ar (yc |Mc ) = Ω (φ0 ; Mc ) . Condition 1 is a strong requirement. While the error-component structure of V ar (yc |Mc ) is well motivated by exchangeability arguments, the assumptions that V ar (αc |Mc ) , Cov (αc , εc |Mc ) and Cov (εci, εcj |Mc ) (for i 6= j) are all constant in group-size are substantive. Graham (2004) discusses the plausibility of these assumption in the context of a peer effects and learning application that uses data from Project STAR. In general their plausibility depends on the underlying technology of the phenomenon under consideration, as well as the process by which individuals are assigned to peers of a given ‘quality’ (εc ) and to ‘locations’ (αc ). Also required is: Condition 2 For all φ ∈ Φ, φ 6= φ0 E [kΩ (φ0 ; Mc ) − Ω (φ; Mc )k] = S X s=1 ps × kΩ (φ0 ; ms ) − Ω (φ; ms )k > 0, where expectations are taken with respect to the marginal distribution of Mc , which has probability mass ps at each of the s = 1, . . . , S points in its support. Condition 2 will hold as long as there is group-size variation (i.e., S ≥ 2). To see this consider the following transforms of the within- and between-group variation of the data: gcw = Mc−1 (Mc − 1)−1 XMc i=1 (yci − y c )2 , gcb = y 2c . (5) Conditioning on group-size and taking expectations yields E [gcw |Mc ] = (rc0 σ 2 )Mc−1 , E[gcb |Mc ] = σ 2α + γ 2 (rc0 σ2 )Mc−1 , (6) where rc is an S × 1 vector with a 1 in the sth element if Mc = ms and zeros elsewhere such that rc0 σ2 = σ2 (Mc ). Observe that E [gcw |Mc ] equals the amount of between-group variance in outcomes (across groups of size Mc ) that we would ‘expect’ to observe in the absence of both unobserved group-level heterogeneity (σ 2α = 0) and social interactions (γ 2 = 1). Under (2) the ratio of differences ¤ £ ¤ £ E gcb |Mc = m − E gcb |Mc = m0 , γ = E [gcw |Mc = m] − E [gcw |Mc = m0 ] 2 m, m0 ∈ M, m 6= m0 (7) identifies γ 2 . The strength of social interactions is estimated by the ratio of actual differences in between-group variation in outcomes across groups of different sizes to the difference in variation we would ‘expect’ to observe in the absence of social interactions. In the absence of group-size variation the contributions of σ2α and γ 2 (rc0 σ2 )Mc−1 to the between-group variance in the data are not separately identifiable. If group size variation is modest identification may be weak, in the sense that the denominator of (7) is small. 5 From (6) we have the conditional moment restriction E [u (gc , θ 0 ) |Mc ] = 0, (8) ¢0 ¡ where u (gc , θ) = gcb − σ 2α − γ 2 gcw with gc = gcw , gcb . In principle, (8) implies an infinite number of unconditional moment restrictions of the form E [f (Mc ) u (gc , θ 0 )] = 0 for different functions of group size f (Mc ). With Mc equalling one of only a finite number of values however, the number of unconditional moment restrictions implied by (8) is also finite (although it may be large in certain empirical applications). The full set of unconditional moments implied by (8) is thus E[ψ (zc , θ0 )] = E [rc u (gc , θ0 )] = 0, (9) where zc = (rc0 , gc0 )0 . I am interested in the properties of different estimators and associated inferential procedures based on (9). One direct implication of the discreteness of the group-size support is that an efficient estimator based on (9) attains the semiparametric efficiency bound for the estimation problem defined by (8) (c.f., Chamberlain 1987, 1992). In order to interpret the Monte Carlo experiments reported below in the light of previous research on the small sample properties of linear simultaneous-equations estimators (e.g., Hahn, Hausman and Kuersteiner 2004), it is convenient to recast the excess variance model in that form. ¢0 ¡ Let vcb = gcb − E[gcb |rc ], vcw = gcw − E [gcw |rc ] and Π = σ 21 /m1 , . . . , σ2S /mS . Using this notation the excess variance social interactions model is gcb = σ2α + γ 2 gcw + uc (10) gcw (11) = rc0 Π + vcw , where uc = vcb − γ 2 vcw . Since gcb and gcw are functions of the between- and within-group squares of the underlying data we would typically expect, in contrast to the ‘classical’ linear simultaneousequations model, uc and vcw to have skewed distributions in addition to being heteroscedastic with respect to Mc . To develop some further intuition about the properties of the model it is useful to consider the special case where yc |Mc is multivariate normal. In this setting the least squares regression of gcb on a constant and gcw is inconsistent with a probability limit of p γb2LS → γ 2 · κ, κ= V ar(σ2 (Mc ) Mc−1 ) . E[2σ 4 (Mc ) Mc−2 (Mc − 1)−1 ] + V ar(σ 2 (Mc ) Mc−1 ) (12) As highlighted by (12), the motivation for instrumenting gcw with group size is measurement error: gcw is a noisy measure of the true ‘expected’ between-group variance in outcomes, σ2 (Mc ) Mc−1 . The ¡ ¢ unconditional variance of gcw consists of two components. The first, equaling V ar σ 2 (Mc ) Mc−1 , is the actual amount of variation in ‘expected’ between-group variance in the sample. The second, i h −1 w 4 −2 equaling V ar (gc |Mc ) = E 2σ (Mc ) Mc (Mc − 1) , is measurement error. The attenuation bias in γ b2LS thus takes a standard signal-to-noise ratio form. 6 The correlation between uc and vcw , the disturbances of the structural (10) and first-stage (11) equations respectively, provides a measure of the degree of simultaneity. Under normality we have ρuv h¡ ¢2 i ⎤−1/2 E σ 2α + σ 2 (Mc ) Mc−1 i⎦ = − ⎣1 + h . E σ 4 (Mc ) Mc−2 (Mc − 1)−1 ⎡ The absolute level of simultaneity, |ρuv |, is decreasing in σ 2α , the of amount group-level heterogeneity, and is increasing in V ar (gcw |Mc ), or the measurement error portion of the variation in gcw . For a given marginal distribution of group size, the degree of endogeneity can be calibrated by choice of σ 2α . b will have a large sample distribution Under normality the estimated first-stage coefficients, Π, of √ D b − Π) → N(Π N (0, 2Vw ) , where Vw equals Vw = diag ½ 1 σ 4S 1 σ 41 1 1 , . . . , 2 2 p1 m1 m1 − 1 pS mS mS − 1 ¾ . The concentration parameter, a unitless measure of instrument strength therefore equals (c.f., Bound, Jaeger, and Baker 1995, Stock, Wright and Yogo 2002) μ2 = ¢−1 N 0 0¡ DΠ, Π D DVw D0 2 (13) where D equals the first S − 1 rows of the within-group operator. Note that (13) is the population Wald statistic6 corresponding to the null hypothesis that ‘expected’ between-group variance does not change with group size or non-identification (i.e., H0 : Π1 = . . . = ΠS = 0). Define the ‘first stage F-statistic’ as the feasible Wald statistic divided by the number of excluded instruments (which in this case is S − 1).7 Staiger and Stock (1997) show that in large samples E [F − 1] ' μ2 /(S − 1), suggesting that the F-statistic provides a useful measure of the strength of identification in the case of a single endogenous regressor. The Monte Carlo results reported in Section 8 are based on a total of twelve designs corresponding to different degrees of overidentification and instrument strength. For all designs N = 300 and σ 2α is chosen to keep ρuv fixed at approximately −0.20; these values are consistent with the Project STAR peer group effects application pursed in Graham (2004). I consider designs with S = 2, 4, 7 and 12 group sizes, corresponding to zero, two, five and ten overidentifying restrictions. The distribution of group sizes is calibrated in each case to keep μ2 / (S − 1) equal to 5 (‘weak instruments’), 20 (‘strong instruments’) or 50 (‘very strong instruments’). The data generating process (DGP) is (εc , αc )|Mc ∼ MV N with a variance-covariance matrix given by (3) where σ 2α is chosen as described above and σ 21 = . . . = σ2S = 1 and γ = 2. 6 The population Wald statistic uses the population values of Π and Vw in place of estimates. Many computer regression packages, such as Stata, refer to a Wald statistic normalized in this way as an ‘Fstatistic’ (see http://www.stata.com/support/faqs/stat/wald.html ). 7 7 3 Generalized method of moments Let ψ N (θ) = PN c=1 ψ(zc , θ)/N, the sample average of the identifying moments. A generalized method-of-moments (GMM) estimator chooses b θ such that a linear combination of ψN (θ) equals zero. That is b θ GM M satisfies θGM M ) = 0, (14) WN∗ · ψN (b p where WN∗ is a K ×S weight matrix that may depend on the data such that WN∗ → W0∗ . The asymptotic variance minimizing choice for the weight matrix is Γ00 Ω−1 0 (e.g., Hansen 1982, Chamberlain 1982, 1984), where £ ¤ Γ0 = E ∂ψ (zc , θ 0 ) /∂θ 0 , S×K £ ¤ Ω0 = E ψ (zc , θ 0 ) ψ (zc , θ 0 )0 . S×S In practice, the Jacobian and second moment terms, Γ0 and Ω0 respectively, are replaced with consistent estimates. For example, Hansen’s (1982) optimal two-step estimator can be interpreted as replacing Γ0 and Ω0 in (14) with ΓN (b θ) = N −1 XN c=1 ∂ψ(zc , b θ)/∂θ0 , ΩN (θ) = N −1 XN c=1 ψ(zc , θ)ψ(zc , θ)0 , where θ is an inefficient preliminary estimate of θ based upon a non-stochastic weight matrix. In empirical work b θ is typically defined as the minimizer of a quadratic form in ψN (θ): p 1 b θ = arg min ψ N (θ)0 WN−1 ψ N (θ), θ∈Θ 2 (15) where WN → W0 is an S × S weight matrix. Inspecting the first order conditions associated with (15) indicates that this definition of b θ is equivalent to setting WN∗ equal to ΓN (b θ)0 WN−1 in the representation of b θ given by (14) above. The asymptotic variance minimizing choice of WN is therefore any matrix which is consistent for Ω0 , the second moments of ψ(zc , θ). In addition to Hansen’s (1982) two-step estimator, the two-stage least squares (2SLS) estimator is frequently used in applied work when the identifying moments are linear in the parameters, as is the case in (9). Let xc = (1, gcw )0 = −∂u (gc , θ 0 ) /∂θ 0 ; the 2SLS estimator uses the weight matrix hP i hP i−1 N N 0 0 WN∗ = x r /N × r r /N . 2SLS is typically inefficient since the conditional varic c c c c=1 c=1 P p 0 0 ance of u (gc , θ 0 ) is not constant in group-size; N c=1 rc rc /N → E [rc rc ] is therefore not proportional to Ω0 .8 Despite is name, 2SLS is a ‘one-step’ GMM estimator since a preliminary estimate of θ is not required for weight matrix formation. In large samples GMM estimators converge in distribution to 8 √ D −1 N(b θ − θ 0 ) → N (0, A−1 0 B0 A0 ), A0 = Γ00 W0−1 Γ−1 0 , B0 = Γ00 W0−1 Ω0 W0−1 Γ0 (16) Homoscedasticity of u (gc , θ0 ) only holds for the special case where σ2 (Mc ) is proportional to group size, Mc (i.e., σ (Mc ) ≡ λ · Mc ). However, identification breaks down in this case since E [gcw |Mc ] = λ, which is constant in group size. 2 8 where W0 is the probability limit of the (quadratic form) weight matrix used for estimation. If this ¡ ¢−1 . matrix is consistent for Ω0 , then the variance expression simplifies to I(θ)−1 = Γ00 Ω−1 0 Γ0 −1 Among estimators based on (9), I(θ) is the asymptotic efficiency bound (c.f., Hansen 1982, Chamberlain 1987). Because Mc has a discrete multinomial distribution I(θ)−1 is also the efficiency bound for the problem defined by the conditional moment restriction (8). For this reason all GMM p estimators based on (9) which use a weight matrix WN∗ such that WN∗ → W0∗ ≡ Γ00 Ω−1 0 attain the semiparametric efficiency bound for the problem defined by (8). A direct application of Chamberlain’s (1987) results yields a semiparametric efficiency bound of n h io−1 £ ¤−1 E E [xc |Mc ]0 E u2c |Mc E [xc |Mc ] where uc = u (gc , θ 0 ). For ψ (zc , θ 0 ) given by (9) it is straightforward to show that ⎞ p1 E[xc |Mc = m1 ] ⎟ ⎜ .. ⎟, Γ0 = − ⎜ . ⎠ ⎝ pS E[xc |Mc = mS ] ⎛ © £ £ ¤ ¤ª Ω0 = diag p1 E u2c |Mc = m1 , . . . , pS E u2c |Mc = mS . (17) Multiplying out we have ∙ PS ¸−1 E [xc |Mc = ms ]0 E [xc |Mc = ms ] s=1 ps E [u2c |Mc = ms ] n h io−1 £ ¤−1 = E E [xc |Mc ]0 E u2c |Mc E [xc |Mc ] . ¡ 0 −1 ¢−1 Γ0 Ω0 Γ0 = Thus I (θ 0 ) also equals the Fisher Information associated with the conditional moment problem. While the first order asymptotic properties of two-step GMM are identical to those of the (infeasible) oracle estimator which uses Γ00 Ω−1 0 as the weight matrix, Newey and Smith (2004) show that its higher order properties are affected by weight matrix estimation. Altonji and Segal (1996) and Clark (1996), among others, provide Monte Carlo evidence consistent with this result. Using the same data to estimate the optimal weight matrix as well as θ can induce a correlation between the sampling error in WN∗ and ψN (θ). For a fixed sample size, sampling error in WN∗ = ΓN (b θ)0 ΩN (θ)−1 , and hence the bias of b θ, will be increasing in the number of overidentifying restrictions since this increases the dimension of WN∗ . Thus although the two-step GMM estimator based on (9) attains the semiparametric efficiency bound, its finite sample properties may be very poor since implementation requires estimating a weight matrix with a potentially large number of unique elements. Applied researchers typically ‘solve’ this problem by either arbitrarily reducing the number of identifying moments (e.g., Tauchen 1986) or by using inefficient one-step procedures such as the equally weighted minimum distance (EWMD) procedure introduced by Abowd and Card (1989). In applied dynamic panel data analysis the ‘one-step’ 2SLS estimator described above is often used instead of efficient two-step GMM (c.f., Blundell and Bond 1998, Bond, Bowsher and Windmeijer 2002). All of these procedures involve losses of asymptotic efficiency. 9 3.1 Inference with GMM estimators Consider hypotheses of the form H0 : c (θ 0 ) = 0, (18) with C0 = ∂c (θ 0 ) /∂θ 0 . It will be useful to define the following notation. Let b θ and e θ denote the unrestricted and restricted estimates of θ respectively (i.e., e θ minimizes (15) subject to (18)). e Furthermore, let θ and θ respectively denote consistent preliminary unrestricted and restricted estimates, where such estimates are required for efficient weight matrix estimation. Finally, for any function of θ, let Fb = F (b θ) etc. For any choice of weight matrix, WN∗ , we can conduct inference using either the Wald or score statistics: bA b−1 C b 0 )−1 c(b bA b−1 B W M (b θ) = N · c(b θ)0 (C θ), e 0 (C eA e−1 C e 0 )−1 C eA e−1 ψN (e e−1 C eA e−1 B S M (e θ) = N · ψ N (e θ)A θ), (19) (20) b and B b are constructed from estimates of their constituent elements as given in (16). Both where A statistics are χ2J random variables in large samples, where J = rank (C0 ) .9 For the efficient two-step GMM estimator simpler forms of both statistic are appropriate: b −1 b 0 −1 b b Γ bN Ω−1 Γ W (b θ) = N · c(b θ)0 (C( N N ) C ) c(θ), −1 −1 (21) −1 e Γ e e e 0 e e −1 e 0 e S(e θ) = N · ψN (e θ)0 Ω N N (ΓN ΩN ΓN ) ΓN ΩN ψ N (θ). (22) The statistic given by (22) is different from the score statistic given by Newey and West (1987). It differs from their formulation in that the variance-covariance matrix of the score vector is estimated under the null, as opposed to the alternative. Intuitively, this difference should improve the size properties of the test by improving the precision of the weight matrix estimate (c.f., Bond and Windmeijer, 2002). Regardless, the above formulation of the test statistic is more in the spirit of the score principle as applied in the maximum likelihood setting since it is based entirely on quantities computed under the null. The minimized two-step GMM criterion function (15) is also a χ2 random variable. This suggests an additional approach to inference. Let SN (θ, θ) = 12 ψN (θ)0 ΩN (θ)−1 ψN (θ). The Sargan-Hansen difference or ‘deviance’ statistic LR(e θ, b θ) = 2[SN (e θ, e θ) − SN (b θ, θ)], (23) θ, e θ) and SN (b θ, θ) is the GMM analog of a likelihood ratio. The weight matrices used to form SN (e are computed under the null and alternative respectively.10 All of the above test statistics can be inverted to form confidence sets. Let θ 1 be the scalar 9 Equations (19) and (20) follow from the standard theory of M-estimation. An excellent textbook exposition of this theory is given by Wooldridge (2002, Chapter 12). 10 Newey and West (1987) advocate the use the same weight matrix in both quadratic forms, ensuring the statistic is positive in finite samples. 10 parameter for which a confidence set is desired. Define e θ(θ1 ) = (θ1 , θ 2 (θ 1 )0 )0 to be the constrained GMM estimate of θ holding θ 1 fixed at a given null value. A (1 − α) × 100 percent confidence set based on the Sargan-Hansen difference statistic is o n , θ (θ 1 ) , b θ) ≤ χ2,1−α CIαLR (θ1 ) = θ1 |LR(e 1 where χ12,1−α is the 1 − α quantile of the χ21 distribution. Confidence sets based on other statistics are defined analogously. Newey and West (1987) show that for moment functions and hypotheses that are linear in θ the two-step GMM forms of W , S, and LR are numerically identical if the same estimate of Ω0 is used throughout. For example, if Ω0 is estimated without imposing the null restrictions, then all of the test statistics are identical to the standard Wald test. The three statistics as defined above, however, are distinct even in the linear case. This is because the Wald test computes Ω0 under the alternative, the score test under the null, while the Sargan-Hansen difference test uses both a restricted and unrestricted estimate of Ω0 . These differences may be practically important in finite samples, particularly when testing distant nulls as is done implicitly when forming confidence sets. Confidence sets constructed using the S and LR tests associated with the efficient GMM estimator need not be symmetric around b θ. The LR statistic is also invariant to differentiable reparameterizations of the null hypothesis. Under weak identification, in the sense described by Stock and Wright (2000), confidence intervals constricted using the above test statistics are invalid, since the test statistics in this case have non-standard limiting distributions. 4 Generalized empirical likelihood Newey and Smith (2004) provide a masterful synthesis of the generalized empirical likelihood (GEL) literature, Imbens (2002) a user-oriented introduction. The exposition in this section emphasizes the characterization of b θ GEL as the solution to a saddle-point (i.e., min-max) problem. This allows for the straightforward derivation of large sample properties using standard results on M-estimators (e.g., Newey and McFadden 1994, Wooldridge 2002, Chapter 12). ª def © Let V be an open interval containing zero and ΛN (θ) = λ : λ0 ψc (θ) ∈ V, c = 1, . . . , N . b θ GEL is the solution to the following saddle-point problem def b θ GEL = arg min sup lN (λ, θ) θ∈Θ (24) λ∈ΛN (θ) ¡ 0 ¢ P where Θ is a compact set of potential parameter values, lN (λ, θ) = N c=1 ρ λ ψ c (θ) the unprofiled GEL criterion function and ρ (v) a concave scalar function on V, normalized such that ρ1 (0) = ρ2 (0) = −1 (with ρj = ∂ j ρ/∂vj ). For the empirical likelihood (EL), exponential tilting (ET) and continuous updating (CU) estimators ρ (v) equals ln (1 − v) , −ev and −(1 + v)2 /2 respectively. By concentrating out the S × 1 vector of ‘tilting’ parameters, λ, we get a profiled objective function of def p eN (θ) , θ), lN (θ) = sup lN (λ, θ) = lN (λ (25) λ∈ΛN (θ) 11 eN (θ) denotes the maximizing value of the tilting parameter for a given sample size and where λ value of θ. b θGEL is the minimizer of (25). λN (θ) , θ), with each candidate value Calculating b θ GEL requires an outer minimization over lN (e eN (θ). While unavailable in closed form — except of θ requiring an inner maximization solving for λ eN (θ) is simple to calculate numerically. Imbens, Spady and Johnson (1998), Owen for CU — λ (2001), Mittelhammer, Judge and Schoenberg (2005) and Imbens (2002) discuss various algorithms. Two alternative methods of computing b θGEL that appear to work well in practice are outlined in Appendix E. The first is a specific implementation of the quasi-Newton procedure suggested by Imbens (2002) while the second is a GEL-analog of Fisher-scoring. The large sample properties of b θ GEL are well-known and summarized by the following Theorem (c.f., Imbens 1997, Qin and Lawless 1994, Newey and Smith 2004). Theorem 1 Under correct specification (i.e., (9) or E [ψ (zc , θ 0 )] = 0) and regularity conditions, p (b λ, b θ) → (0, θ 0 ) with a limiting distribution of √ N à b λ b θ − θ0 ! D →N à " 0, 0 P0 0 I (θ 0 )−1 #! , (26) ¢ ¡ −1 0 −1 −1 −1 Γ0 Ω0 . where I (θ0 ) = Γ00 Ω−1 0 Γ0 and P0 = Ω0 − Ω0 Γ0 I (θ 0 ) Proof. See Appendix B. A key implication of Theorem 1 is the first order equivalence of all GEL estimators with one another as well as with two-step GMM. It is also straightforward to characterize the asymptotic properties of b θ GEL under misspecification by viewing GEL as an M-estimator. Under misspecification there is no θ0 ∈ Θ such that E [ψ (zc , θ0 )] = 0. In this setting b θ is no longer consistent but (generally)11 converges to the pseudoeN (θ)0 ψ(zc , θ)] and, by standard large sample results for M-estimators true value θ∗ = arg min E[ρ(λ (c.f., Newey and McFadden 1994, Wooldridge 2002, Chapter 12), continues to have a limiting normal distribution of where √ D N(b θ − θ ∗ ) → N (0, Ap (θ ∗ )−1 B p (θ ∗ ) Ap (θ∗ )−1 ), Ap (θ) = E [H p (zc , θ)] , (27) B p (θ) = V ar (sp (zc , θ)) , with sp (zc , θ) and H p (zc , θ) equaling the cth observation’s contribution to the score and Hessian of the profiled GEL objective function (25) respectively. Analytic expressions for sp (zc , θ) and H p (zc , θ) are provided in Appendix A. As pointed out by Imbens (1997), under misspecification the limiting pseudo-true parameter value, (λ∗ , θ∗ ), varies with the specific GEL estimator. The limit distribution given by (26) is a special case of the general M-estimator result given by (27), as the next Theorem demonstrates. 11 Schennach (2003) shows that the asymptotic variance of the EL member of the GEL family of estimators may become undefined in the presence of misspecification due to the presence of a singularity in its influence function (see also Ragusa 2005). 12 Theorem 2 Under correct specification an information matrix equality of the form A (θ 0 ) = B (θ 0 ) = I (θ 0 ) (28) holds. Proof. See Appendix B. Although its proof is simple and straightforward, to my knowledge Theorem 2 is a new result.12 One of its implications is that the ‘observed information’ matrix is consistent for I (θ 0 ). That is, p p b (θ)/N → I (θ 0 ) . 5θθ lN (29) The Hessian is a direct function of the estimated parameter and realized sample, as well as an eN (b eN (b indirect function of them through their impact on the (profiled) tilting parameter λ θ). If λ θ) is instead replaced by its limiting (population) value of zero, it is straightforward to show that PN 0 −1 b b0 Ω b b b −1 b −1 the inverse ‘Hessian’ simplifies to (Γ N N ΓN ) /N where ΓN = N c=1 ∂ ψ c /∂θ and ΩN = PN b b 0 −1 /N and c=1 ψ c ψ c /N (c.f., equation (67) in Appendix A). This is a direct estimate of I (θ0 ) is loosely akin to the conditional Fisher information estimator used in the parametric maximum likelihood setting, taking a form similar to that of a conventional GMM variance estimate.13 4.1 Minimum discrepancy formulation An informative alternative characterization of b θGEL is as the solution to the following minimum discrepancy (MD) problem (b θ, π b ) = arg min θ∈Θ, π 1 ,...,π N XN c=1 h (π c ) , s.t. XN c=1 πc ψc (θ) = 0, XN c=1 πc = 1, (30) where π is an N × 1 vector of probability weights. Newey and Smith (2004, Theorem 2.2, p. θ GEL when h (πc ) is a member of the Cressie-Read family 224) show duality between b θ M D and b (η+1) η+1 (N πc ) −1 η 1 of discrepancies. The duality holds for h (πc ) = η(η+1) and ρ (v) = − (1+ηv) (with N η+1 14 expressions being interpreted as limits for η = 0 or −1). The minimum discrepancy formulation is important since it indicates that we can interpret GEL as simultaneously choosing a set of sample weights, π b , and a parameter value, b θ, such that in the re-weighted sample the moment conditions are mean zero, as is assumed to be the case in the P b c = 0). Subject to this restriction the weights are chosen to be as sampled population (i.e., N bcψ c=1 π close as possible to ιN /N = (1/N, . . . , 1/N )0 , the empirical measure of the data. Algorithmically, 12 Kleinbergen (2005) and Newey and Windmeijer (2005) show that V ar (sp (zc , θ0 )) = I (θ 0 ) for the special case of the continuously updating (CU) estimator and the entire GEL family of estimators respectively. Kitamura, Tripathi and Ahn (2005, p. 1678) — in the context of an extension of the EL estimator for conditional moment models — note that the Hessian of the (profiled) empirical likelihood criterion function, divided by the sample size, is consistent for I (θ0 ) . 13 b N with ΩN , where the The standard two-step GMM variance estimator for cross-section data typically replaces Ω − indicates evaluation at a preliminary consistent, but typically inefficient, estimate of θ. 14 In an interesting and wide-ranging working paper Ragusa (2005) shows that GEL-MD duality extends beyond the Cressie-Read family. 13 P for a candidate value of θ a re-weighting of the data, π e (θ), such that N e c (θ) ψc (θ) = 0 is c=1 π solved for; θ is then chosen to make the requisite re-weighting close to the empirical distribution. ‘Closeness’ is defined in terms of the empirical discrepancy measure, h (πc ), used. For EL and ET h (πc ) equals − ln (π c ) and πc ln (πc ) respectively. In both cases ‘closeness’ is defined in terms of the Kullback-Leibler information criterion (KLIC). For EL expectations are taken with respect to the empirical measure of the data such that b θ EL is chosen to minimize K (e π(θ), ιN /N ) = EFN ∙ ½ ¾¸ X N dFπ (z, θ) ln {ln(e π c (θ)) − ln (1/N)} (1/N), = c=1 dFN (z) subject to the restrictions in (30) and where FN (z) = XN c=1 1 (zc ≤ z) /N, Fπ (z, θ) = XN c=1 1 (zc ≤ z) π e c (θ), (31) are the empirical distribution function and the fitted cumulative distribution function (CDF) of z R respectively. An important property of the fitted CDF is that ψ(z, θ)dFπ (z, θ) = 0. EL chooses θ such that, among feasible distributions (i.e., those for which the moments are mean zero), the KLIC-discrepancy between π e (b θ) and ιN /N is smallest. ET reverses the roles of the fitted and empirical measure of the data, with b θ ET chosen to minimize ∙ ½ ¾¸ X N dFN (z) e (θ)) = EFπ ln {ln(1/N ) − ln(e πc (θ))} π ec (θ). = K (ιN /N, π c=1 dFπ (z, θ) For CU h (π c ) = (1/2) × [(Nπ c )2 − 1]/N ; therefore b θCU minimizes the Euclidean distance between b π e (θ) and ιN /N. To summarize, GEL chooses b θ to minimize the distance between the empirical distribution function FN (z) and the fitted CDF. This interpretation also makes clear what GEL does in the θ) converges to F ∗ (z), which, among the presence of misspecification. Under misspecification Fπ (z, b set of distributions for which the moments are mean zero, is ‘closest’, in the sense of the metric, h (·), to the true distribution function (Imbens 1997, p. 367).15 Since the relevant metric varies across different GEL estimators, under misspecification the limiting pseudo-true value θ ∗ also varies with the choice of estimator. For any given value of θ we can concentrate out the N × 1 vector of probability weights, π e (θ), leading to the saddle-point characterization of the GEL estimator given by (24). Newey and Smith (2004) show that the probability weights associated with (30) can be recovered directly from the saddle-point problem by ρ (e λ (θ)0 ψ c (θ)) π ec (θ) = PN 1 , e 0 c=1 ρ1 (λ (θ) ψ c (θ)) 15 c = 1, . . . , N. (32) It is also straightforward to show that the S×1 vector of tilting parameters, b λ, in the saddle-point This result is analogous to the behavior of maximum likelihood under misspecification (e.g., Gourieroux and Monfort 1993). 14 representation of the problem are proportional to the Lagrange multipliers on the first constraint in b has an intuitive interpretation the MD one (c.f., Imbens 2002, Newey and Smith 2004). Therefore λ as a measure of the tightness on the restriction that the moments be mean zero in the re-weighted sample. The following Theorem, due to Imbens (1997), shows that (31) is an efficient estimate of the CDF of z. As will be argued below, the characterization of the GEL probabilities as efficient estimates of the underlying distribution function of z — efficient because they take into account the prior knowledge that ψ (zc , θ0 ) is mean zero — is important for understanding the attractive higher order bias properties of b θ GEL . Theorem 3 Let ω A = Pr (z ∈ A). The estimate ω bA = ω e A (b θ) = p Z 1 (z ∈ A) dFeN (z, b θ) = XN c=1 1 (zc ∈ A) π e c (b θ) is consistent for ω A (i.e., ω b A → ω A ) and attains the semiparametric efficiency bound with a limiting distribution of √ N(b ω A − ω A ) → N (0, V (b ω A )) where V (b ωA ) = ω A (1 − ω A ) − ψ 0A P0 ψA with ψA = E [ψ (zc , θ0 ) × 1 (zc ∈ A)] = ω A × E [ψ (zc , θ 0 ) |zc ∈ A] . Proof. See Imbens (1997) and Appendix B. The first term is V (b ω A ) is simply the limiting variance of the sample frequency of the event zc ∈ A. The second term represents the asymptotic gain in precision from exploiting (9). Intuitively, R θ) follows directly from efficiency of b θ and because ψ(z, b θ)dFπ (z, b θ) = 0, as is efficiency of Fπ (z, b assumed to be true in the population from which the sample was drawn. For what follows it is convenient to define the additional set of weights: e k ∗ (e λ (θ)0 ψ c (θ)) e , kc (θ) = PN c e∗ e 0 c=1 kc (λ (θ) ψ c (θ)) c = 1, . . . , N, (33) ec (θ) = π where e kc∗ (v) = ρ1 (v)+1 . Note that for EL k ec (θ), but this does not hold in general. v Using the two sets of GEL weights, as well as the empirical measure, we can consider three def consistent estimators for the expectation of any function of z and θ. For μa = E [a(zc , θ 0 )] we P P P b b b aπ = N b c a(zc , b θ), and (3) b ak = N have: (1) b aN = N1 N c=1 a(zc , θ), (2) b c=1 π c=1 kc a(zc , θ). The next aπ is an efficient estimate section demonstrates that, when all that is known is that E [ψ(zc , θ 0 )] = 0, b of the mean of a(zc , θ 0 ). 15 5 Efficient semiparametric estimation of expectations Suppose we wish to estimate the mean of the scalar function a(zc , θ0 ).16 Although the distribution of zc is not parametrically specified, the problem is only semiparametric since prior information in the form of the moment restriction (9) is available. Brown and Newey (1998) consider this problem is some generality (c.f., Imbens and Lancaster (1994) for an interesting empirical application of related ideas). Assume b θ is an efficient estimate of θ; a common, obvious and consistent estimate of μa is the P b sample mean b aN = N c=1 a(zc , θ)/N, but this fails to utilize any of the information contained in (9). Alternatively one could use an estimate of the (minimum mean-square error) linear predictor b 0c evaluated at ψ b 0c = ψ b 0N (the sample mean of the moment vector): of ac given 1 and ψ 0 bN , μ ba = bb1 + bbψ ψ (34) b 0c : where b b1 and bbψ are the least squares coefficients from the regression of ac onto 1 and ψ à bb1 bbψ ! = à PN 1 b c /N ψ c=1 à ! !−1 à P ! PN b 0 N μ ψ /N a b /N p c a c → PNc=1b PNc=1b b 0 0 ac /N Ω−1 ψ ψ /N 0 Σaψ c=1 ψ c b c=1 c c (35) b N are consistent for μa aN , is consistent since bb1 and ψ with Σaψ = E[ac ψ0c ]. This estimator, like b 0 and zero respectively. If ac and ψc are correlated (i.e., bψ = Ω−1 0 Σaψ 6= 0) we might also expect (34) to be variance-reducing relative to sample mean estimator. However (34) is still inefficient since it fails to exploit the prior knowledge that ψ c is known to be mean zero in the population from which b N with zero gives the efficient estimator μ the sample was drawn. Replacing ψ b∗a = bb1 . By applying standard results on partitioned inverses to (35) we have μ b∗a equal to XN b −1 ψ b N )−1 b −1 ψ b c )/N b 0N Ω b aψ,N Ω μ b∗a = (1 − ψ (b ac − Σ N N c=1 XN −1/2 = (ac − Σaψ Ω−1 ). 0 ψ c )/N + op (N (36) c=1 Inspection of (36) reveals that μ b∗a is (asymptotically) uncorrelated with the sample average of the b∗a is an efficient estimator of μa — when the only prior information available moments, ψN . That μ is (9) — follows directly from this lack of correlation since any such correlation could be used to construct an asymptotically more efficient estimator. Theorem 4 μ b∗a attains the semiparametric efficiency bound for estimating μa = E[a(zc , θ0 )] when all that is known about the distribution of zc is (9) (i.e., that ψ(zc , θ) is mean zero for some θ = θ 0 ). p We have μ b∗a → μa with a limiting distribution of √ D N (b μ∗a − μa ) → N (0, V (b μ∗a )) 16 The generalization of what follows to matrix-valued functions is straightforward but comes at the expense of considerably more notation. Results for the general case are given in Theorem 6 below. 16 where −1 0 −1 0 −1 0 Γ0 Ω0 Σaψ . V (b μ∗a ) = Σaa − Σaψ Ω−1 0 Σaψ + Σaψ Ω0 Γ0 I (θ0 ) = Σaa − Σaψ P0 Σ0aψ . Proof. See Appendix B and also c.f. Brown and Newey (1998, Theorem 2, p. 456). Each component of V (b μ∗a ) has an interesting interpretation. Assume that θ0 is known; in this case the first element of V (μa ) is the asymptotic variance of the sample mean estimator, aN . The second element represents the asymptotic gain in precision from utilizing the prior information that ψ(zc , θ0 ) is mean zero. In practice θ0 is not known, the last term represents the asymptotic penalty θ. which arises from replacing θ 0 with the efficient estimate b To get some intuition for the efficiency result of Theorem 4 consider the alternative linear estimator μ b+ b ∗a +αψ N for the special case of a single moment restriction (i.e., dim(ψc ) = S = 1) a =μ and θ 0 known. The large sample variance of μ b+ a is V ar(b μ+ a) σ2a (1 − ρ2aψ ) α2 σ 2ψ + , = N N where ρaψ is the correlation between ac and ψ c . V ar(b μ+ b+ b ∗a . a ) is minimized at α = 0 or μ a = μ Choosing α 6= 0 introduces correlation between μ b+ a and ψ N which, when present, can be used to reduce variance. Note that the asymptotic relative efficiency of μ b∗a over aN — the sample mean of ac — is 1/(1−ρ2aψ ). The intuition is straightforward: when ac and ψc are correlated, the prior knowledge that ψc is mean zero is informative about μa . A straightforward but nvery important connection exists between Theorem 4 and the use of the o0 GEL probabilities π e (b θ) = π e1 (b θ), . . . , π eN (b θ) to estimate expectations. Manipulating (36) we can ∗ rewrite μ b a as the weighted sum, μ b∗a = Z a(z, b θ)dFπ (z, b θ CU ) = N X c=1 π ec (b θ CU )a(zc , b θCU ), eN (θ)0 ψc (θ) 1+λ , π ec (θ) = PN 0 e c=1 1 + λN (θ) ψ c (θ) (37) P where Fπ (z, b θ CU ) = N ec (b θ CU ) is the estimate of the CDF of z associated with the c=1 1 (zc ≤ z) π continuously updating (CU) estimator. In deriving (37) I have used the result that for ρ (v) = b N (see equation (77) in Appendix eN (b b −1 ψ −(1+v)2 /2 the vector of tilting parameters, λ θ), equals −Ω N C). b −1 > 0 b 0N Ω Given the prior knowledge that ψc is mean zero, the form of π e(b θCU ) is intuitive. If ψ N then ‘large’ realizations of ψc are over-represented in the sample and π e(b θ CU ) downweights such b −1 < 0 then ‘large’ realizations are under-represented and π b 0N Ω observations accordingly; if ψ e(b θCU ) N upweights such observations accordingly (c.f., Back and Brown 1993). R That μ b ∗a = a(z, b θ CU )dFπ (z, b θCU ) is efficient suggests that Fπ (z, b θCU ) is an efficient estimate of the CDF of z, but this is precisely what Theorem 3 shows to be true. Since any convex ρ (v) can be approximated by a quadratic function in the neighborhood of zero, replacing π ec (b θ CU ) with 17 the probability weights associated with any other GEL estimator also yields efficient estimators of the mean of a(zc , θ 0 ) (see Theorem 6 below). Equations (36) and (37) show that estimating the mean of a(zc , θ0 ) using the GEL probability weights is asymptotically equivalent to using the linear b 0c where ψ b c is evaluated at its population mean value of zero. predictor of ac given 1 and ψ PN e c (b θ)a(zc , b θ) is an efficient estimate of μa and that efficient estimates are That b aπ = c=1 π asymptotically uncorrelated with ψN , the sample average of the moments, is key to understanding the attractive asymptotic bias properties of GEL estimators derived by Newey and Smith (2004). 6 Anticipating the small sample properties of b θ GM M and b θ GEL Intuition regarding the finite sample bias GMM and GEL estimates of θ can be drawn from their respective first order conditions. Theorem 5 The first order conditions associated with the GMM and GEL are N 1 Xe hc u(gc , b θ) = 0 N c=1 (38) 2 where e hc is an estimate of the infeasible optimal instrument, hc = Γ0 Ω−1 0 rc = −E [xc |Mc ] /E[uc |Mc ]. For two-step GMM and GEL e hc equals M e b 0N Ω−1 =Γ hGM c N rc , e b0π Ω b −1 rc , hGEL =Γ c k (39) where ΩN is computed using the preliminary consistent, but typically inefficient, estimate θ. Proof. See Appendix C. Theorem 5 indicates that a key feature of GEL estimators, relative to the conventional two-step θ) and e kc (b θ), in the construction of the GMM estimator, is their use of the probability weights, π e c (b optimal instrumental variable. All GEL estimates use an efficient estimate of the Jacobian matrix in their construction of e hc and EL also uses an efficient estimate of the variance of the moments. It is useful to start by considering the asymptotic bias of two-step GMM. Theorem 5 implies that the bias of b θ GM M equals b θ − θ0 = µX N c=1 e hc x0c /N ¶−1 µX N c=1 e hc u (gc , θ0 ) /N ¶ . (40) First assume that xc is replaced with its conditional expectation, E [xc |Mc ]; this eliminates small sample bias due to endogeneity (i.e., from sampling error in the Jacobian matrix). In this setting bias arises solely from the correlation between the heteroscedasticity-correcting denominator of the estimated optimal instrument and the residual. That is, from the correlation between b 2c |Mc ] = E[u PN 2 i=1 uc · 1 (Mi = Mc ) P N i=1 1 (Mi = Mc ) 18 (41) and u (gc , θ0 ) . In group-size cells where some groups ‘happen to have’ unusually small or large average outcomes, y c , this heteroscedasticity-correcting weight will be small (since between-group variation in yci will be large). In these cells uc will also tend to be positive. This implies that less weight will be placed on positive realizations of the sampling error than on negative. Since uc is mean zero, the finite-sample bias of b θ will be downward.17 The direction of bias due to correlation between sampling variability in the Jacobian and uc operates in the same direction but for different reasons. In group-size cells that include groups with ‘usually large’ amounts of within-group outcome variation the numerator of the second element of e hc , PN w w i=1 gi · 1 (Mi = Mc ) b E [gc |Mc ] = P , (42) N i=1 1 (Mi = Mc ) will be large. In these same cells uc will tend to be negative. Again this implies that negative θ. realizations of uc receive more weight in estimation, inducing an additional downward bias in b b Examining the nature of bias in θGM M which arises from Jacobian variability clarifies how GEL estimators eliminate this source of bias asymptotically. For simplicity consider the case where Ω0 is known. Sampling error in the Jacobian term is therefore the main source of bias in b θ GM M . Evaluating (40) we have approximately E[b θGM M − θ 0 ] ' E "µ XN e hc x0c /N c=1 ¶−1 # ×E ∙µX N c=1 b N )−1 ] × E[Γ b 0N Ω−1 ψN ] b 0N Ω−1 Γ = E[(Γ 0 0 ¸ ∙ X N 1 −1 0 −1 xc rc Ω0 ψ N = I (θ 0 ) × E c=1 N £ ¤ I (θ 0 )−1 × E xc rc0 Ω−1 = 0 rc uc N ∙ ¸ Σxu (Mc ) I (θ 0 )−1 ×E = N p (Mc ) σ 2u (Mc ) I (θ 0 )−1 XS Σxu (ms ) = × , s=1 σ 2 N u (ms ) e hc uc /N ¶¸ (43) where p (Mc ) is the marginal probability of observing a group of size Mc (i.e., p (Mc ) = ps if Mc = ms ), Σxu (Mc ) = E [xc uc |Mc ] is the K ×1 vector of conditional covariances between xc and uc £ ¤ and σ 2u (Mc ) = E u2c |Mc is the conditional variance of uc . Note that (43) generally increases with S, the number of group sizes.18 Intuitively this is because increases in S enlarge the dimension of, and hence variability in, the estimated Jacobian. 17 This argument directly mirrors that given by Altonji and Segal (1996) for the bias of the optimal minimum distance (OMD) estimator in covariance-restriction models. −1 SΣxu 18 0) Under homoscedasticity of uc (43) simplifies to I(θN . Under the additional assumption that the (implied) σ2 u ‘first-stage’ errors, vc , are homoscedastic (as is assumed in the conventional linear simultaneous equations model), and that xc is scalar, the bias expression further simplifies to Σxu 1 , E[b θGM M − θ0 ] = 2 2 σ v μ /S + 1 19 GEL estimators use alternative estimates of Γ0 and Ω0 to construct e hc . Some simple algebra GEL equals shows that the second element of e hc e hGEL = c 1 ps − 1 ps PN bi i=1 π PN b i=1 ki · giw · 1 (Mi = Mc ) ·u b2i · 1 (Mi = Mc ) . kc — in e hGEL generate Theorem 6 provides intuition on how the presence of the GEL weights — π bc and b c b the attractive asymptotic bias properties of θGEL found by Newey and Smith (2004, Theorem 4.2, p. 228). θ) be a K × L matrix a(zc , θ 0 ) = (a1 (zc , θ0 ), . . . , aL (zc , θ 0 )) with Theorem 6 Let b ac = a(zc , b PN 0 p (l) b acl ψ c → Σaψ = E [al (zc , θ0 )ψ(zc , θ0 )0 ], the K ×S covariance between the lth column of a(zc , θ 0 ) c=1 b (1) (L) and ψ(zc , θ 0 ). Let Σaψ = (Σaψ , . . . , Σaψ ) be the K × LS matrix of the L individual covariance matrices. For the GEL probability weights π bc and b kc , given by (32) and (33) respectively, and the empirical weights 1/N we have the expansions XN c=1 XN c=1 XN c=1 π bc b ac = N −1 b kc b ac = N b ac /N = N XN c=1 XN −1 c=1 XN −1 c=1 −1/2 (ac − Σaψ (IL ⊗ Ω−1 ) 0 ψ c )) + op (N 1 −1/2 (ac + ρ3 Σaψ (IL ⊗ Ω−1 ), 0 ψ c )) + op (N 2 ac + op (N −1/2 ), and hence, to the order of the approximations, we have £ ¡ ¢¤ E b aπ IL ⊗ ψ0N = 0 £ ¡ ¢¤ 1 1 (1 + ρ3 Σaψ ) E b ak IL ⊗ ψ0N = N 2 £ ¡ ¢¤ 1 E b aN IL ⊗ ψ0N Σaψ . = N Proof. See Appendix C. Theorem 6 formalizes the claim made in Section 5 that using π b to estimate expectations is semiparametrically efficient when all that is known is (9). Asymptotic uncorrelatedness between b aπ and ψN follows directly from efficiency since any such correlation, if present, could be used to construct an alternative estimate of the mean of a(zc , θ0 ) with lower variance. A direct application of Theorem 6 demonstrates how GEL estimators, by using an efficient estimate of the Jacobian, asymptotically eliminate bias due to correlation of its sampling error with ψN . Let d denote an S 2 × 1 vector with a total of S ones appropriately placed such that that (IS ⊗ ψ 0c )d = ψc .19 Continuing to consider the case where Ω0 is known and evaluating (40) with where μ2 is the ‘concentration parameter’ measuring instrument strength (c.f., Stock, Wright and Yogo 2002). This is precisely the standard expression for the finite sample bias of the 2SLS estimator (e.g., Hahn and Hausman 2002). Under the above conditions two-step GMM with Ω0 known is 2SLS. 19 That is d has a total of S ones located in its first element, its (S + 1)th element, its (S + 2)th element and so on with all remaining elements equal to zero. 20 e b 0π Ω−1 rc we have: hc = Γ 0 E[b θGEL − θ 0 ] ' I (θ0 )−1 × E = = = = = I (θ0 )−1 N I (θ0 )−1 N I (θ0 )−1 N I (θ0 )−1 N I (θ0 )−1 N ∙X N π bc xc rc0 Ω−1 0 ψN c=1 ¸ (44) £ ¤ −1 × E (xc rc0 Ω−1 0 − ΣΓψ (IS ⊗ Ω0 ψ c ))ψ c £ ¤ −1 0 × E (xc rc0 Ω−1 0 − ΣΓψ (IS ⊗ Ω0 ψ c ))(IS ⊗ ψ c )d £ ¤ 0 −1 0 × E xc rc0 Ω−1 0 (IS ⊗ ψ c )d − ΣΓψ (IS ⊗ Ω0 ψ c ψ c )d 0 × ΣΓψ d − ΣΓψ (IS ⊗ Ω−1 0 E[ψ c ψ c ])d × ΣΓψ d − ΣΓψ d = 0, where the second line applies Theorem 6 with ac equal to the K ×S matrix xc rc0 Ω−1 0 and the fifth line 0 uses that fact that Σaψ = E[ac (IS ⊗ ψ c )]. The GEL probabilities asymptotically eliminate bias due to correlation between variability in the Jacobian matrix and the sample moments by ‘subtracting off’ that part of the variation in ΓN which is correlated with ψN . Unlike two-step GMM, asymptotic bias due to Jacobian variability is zero and therefore not increasing in the number of group sizes, S. Theorems 4.5 of Newey and Smith (2004) provides stochastic expansions for the GMM and GEL estimators that formalize the intuition driving the above examples. From their theorem we have E[b θ − θ0 ] = b(θ 0 )/N with b(θ 0 ) differing by estimator according to bORACLE (θ 0 ) = BO , ³ ρ ´ b (θ 0 ) = BO + 1 + 3 BΩ , 2 GM M b (θ 0 ) = BO + BΓ + BΩ + BW . (45) GEL Specializing Newey and Smith’s (2004) expressions for BO , BΓ , BΩ , and BW to the ‘linear-in-means’ model outlined in Section 2 we have h i BO = I (θ 0 )−1 E h0c hc I (θ0 )−1 Σxu (Mc ) , −1 BΩ = I (θ 0 ) S X E [x |m ] μ (m ) p c s 23u s3/2 , σ2u (ms ) [σ u (ms )] s=1 BΓ = I (θ 0 )−1 S X Σxu (ms ) s=1 σ2u (ms ) (46) £ ¤ BW = −2I (θ 0 )−1 E hc rc0 (H − HW )0 Σxu (Mc ) , ¢−1 ¡ −1 0 0 with H = Γ00 Ω−1 Γ0 Ω−1 Γ0 W0 , μ3u (Mc ) = E[u3c |Mc ] and the remaining 0 Γ0 0 , HW = (Γ0 W0 Γ0 ) terms as defined above. As noted by Newey and Smith (2004) each element of b(θ0 ) has an intuitive interpretation. BO /N equals the asymptotic bias of the infeasible oracle estimator which uses the true Jacobian and weight matrices to form the optimal instrument hc = Γ00 Ω−1 0 rc ; this component of bias is typically small and not sensitive to the degree of overidentification. BΓ /N equals the bias arising 21 from the correlation between sampling error in the estimated Jacobian and ψ N . All GEL estimators, because they use an efficient estimate of Γ0 when forming the optimal instrument, asymptotically eliminate this source of bias. In weakly identified settings, in the sense of Stock and Wright (2000), this source of bias can be substantial. BΩ /N equals the bias due to correlation between the sampling error in the estimated variancecovariance matrix of of the moments (i.e., the weight matrix) and ψ N . The EL member of the GEL family asymptotically eliminates bias due to weight matrix variability since it uses an efficient estimate of Ω0 , as well as Γ0 , when forming e hc . Other GEL estimators are sensitive to this source of bias depending on the sign and magnitude of ρ3 . For CU and ET ρ3 equals zero and one respectively, indicating that CU removes none of the bias due to weight matrix variability relative to two-step GMM, while ET removes about half of this source of bias. b N is a function of The BΩ /N bias term depends on the conditional skewness of u(gc , θ). Since Ω the sample second moments of uc , in the absence of skewness it will be uncorrelated with uc . For the linear-in-means model u(gc , θ) is a function of the within- and between-group squares of the original data, it is therefore likely a right-skewed random variable. This suggests that the BΩ /N bias component is negative in the current context. BW /N is the bias associated with constructing the optimal (quadratic form) weight matrix in two-step GMM using a preliminary consistent, but inefficient, estimate of θ. An additional iteration of the two-step GMM estimator, such that the weight matrix is estimated using an efficient estimate of θ, eliminates this source of bias. Equation (45) suggests a set of patterns to look for in the Monte Carlo results reported in Section 8 below. First, the small sample bias of GMM, CU, and ET should all be increasing in the number of distinct group sizes, S. ET should be least sensitive to S, two-step GMM most sensitive. The empirical likelihood (EL) member of the GEL family has the same asymptotic bias as the infeasible oracle estimator. Its small sample bias should be unaffected by the magnitude of S. Of course even higher order asymptotic approximations need not be a good guide to the sampling properties of these estimators in specific empirical settings. 7 Inference with GEL estimators As noted in the introduction, associated with GEL are an array of alternative test statistics. These include the score statistic of Kleinbergen (2005) and Guggenberger and Smith (2004), the S-statistic — a GMM analog of the Anderson-Rubin statistic — proposed by Stock and Wright (2000), likelihood ratio type statistics as in Smith (1997), and the tilting parameter tests of Imbens, Spady and Johnson (1998) and Imbens and Spady (2002). This section attempts to organize these various approaches to inference into a common framework. Virtually all of the ‘new’ test statistics suggested above are actually just members of the standard M-estimator versions of the classical ‘trinity’ of the Wald, Score or Likelihood Ratio (i.e., criterion function) type statistics. Most suggestions found in the literature can be categorized into one of two conceptual approaches. In the first, and by far the most common approach, direct hypotheses of the form 22 H0 : θ = θ 0 are evaluated using the Wald, Score or Likelihood Ratio principle as applied to the profiled GEL criterion function (e.g., Qin and Lawless 1994, Kitamura and Stutzer 1997, Smith 1997, Kleinbergen 2005, Guggenberger and Smith 2004). In the second, less common, approach hypotheses concerning θ are tested indirectly by assessing λ(θ 0 ) = 0 (e.g., the tilting parameter’s proximity to zero, i.e. hypotheses take the form H0 : e Corcoran, Davison and Spady 1995, Imbens 1997, Imbens, Spady and Johnson 1998, Imbens and Spady 2002). As with the first approach, indirect tests can be based on the Wald, Score, and Likelihood Ratio principles; although in practice the range of existing suggestions does not cover all these possibilities. As with bias, the asymptotic size of some GEL tests often depends critically on their use of the GEL probabilities. For simplicity as well as to focus ideas, in this section I assume that p ΩN (θ 0 ) → Ω0 sufficiently quickly such that the effects of weight matrix variability can be ignored. 7.1 The classical trinity Score statistics. The role played by the GEL probabilities in assuring accurate inference is easiest to see when comparing Newey and West’s (1987) score test for the two-step GMM estimator with the corresponding score test for GEL (e.g., Guggenberger and Smith 2004, Kleinbergen 2005). From Section 3, the score of the two-step GMM quadratic form criterion function evaluated at √ θ 0 and scaled by 1/ N is √ √ M (θ )/ N = − NΓN (θ 0 )0 Ω−1 sGM 0 N 0 ψ N (θ 0 ) + op (1). (47) Appendix D shows that the score of the profiled GEL criterion function can be written as √ √ spN (θ0 )/ N = − NΓπ (θ0 )0 Ω−1 0 ψ N (θ0 ) + op (1). (48) √ The main difference between the two scores is that spN (θ0 )/ N is formed using an efficient estimate of the expected Jacobian of the moments. A sequence of arguments identical to those use to derive (43) and (44) above give £ ¤ 1 XS Σxu (ms ) M E sGM (θ0 )/N = , N s=1 σ 2 N u (ms ) ¤ £ E spN (θ0 )/N = 0. (49) Because the GEL score is constructed using an efficient estimate of the Jacobian its (higher order) asymptotic mean is zero. The two-step GMM score uses an inefficient estimate of the Jacobian. The sampling variability in this estimate is correlated with ψN (θ 0 ), generating asymptotic bias in the score. The score of the two-step GMM criterion function is not mean zero even when evaluated at the true parameter, θ 0 . This suggests that in small samples, particularly with many overidentifying restrictions, tests based on the two score statistics may have very different size properties. The contrasts between the two statistics are even more dramatic when we consider their respective properties under weak identification. One way to model weak identification is to assume that √ the expectation of ψ(zc , θ) is of order 1/ N uniformly in θ (e.g., Stock and Wright 2000). This 23 captures the idea that the gradient of E[ψ(zc , θ)] is flat in θ and hence its poor identification. In particular, assume that ¸ Γ∗ ∂ψ(zc , θ 0 ) = Γ0N = 0ξ , E 0 N ∂θ ∙ with ‘strong identification’ ξ = 0 and Γ0N = Γ∗0 is a fixed full rank matrix. ‘Weak identification’ can be modelled by setting ξ = 1/2. In this case Γ0N will not be a full rank matrix in large samples, leading to ‘vanishing’ identification. In order to make further progress we require the following corollary to Theorem 6. √ Corollary 1 Let a(zc , θ 0 ) be a K ×1 vector obeying a central limit theorem such that N (aN (θ0 )− D μa ) → A with A ∼ N (0, Σaa ) . A direct corollary of Theorem 6 is asymptotic independence of ψN and aπ with à Ã" # " ! #! √ 0 ψN (θ0 ) 0 Ω0 D N →N , . 0 aπ (θ 0 ) − μa 0 Σaa − Σaψ Ω−1 0 0 Σaψ Following the general approach of Kleinbergen (2005, Theorem 1)20 we can apply a central limit theorem such that N ξ Γπ converges in distribution to G, a normal random variable with mean Γ∗0 .21 Note that in the case of strong identification ξ = 0 and we have N ξ Γπ converging in probability to √ the fixed matrix Γ∗0 . Let P ∼ N (0, Ω0 ) be the limiting distribution of Nψ N . By Corollary 1 G and P are independent random variables, therefore √ D N ξ Γπ (θ0 )0 Ω−1 N ψN (θ 0 ) → G0 Ω−1 0 0 P. −1/2 we have Conditioning on a specific realization of G and scaling by (G0 Ω−1 0 G) −1/2 0 −1 G Ω0 P |G = g ∼ N (0, IK ). (G0 Ω−1 0 G) However by virtue of independence G and P this result also holds regardless of the specific realization of G (i.e., unconditionally). Therefore √ D −1/2 0 −1 (Γ0π Ω−1 Γπ Ω0 Nψ N (θ0 ) → N (0, IK ). 0 Γπ ) The final step in this argument breaks down if Γπ (θ 0 ) is replaced with the inefficient ΓN (θ 0 ). Since variability in ΓN (θ0 ) will be correlated with ψ N across repeated samples the score’s large sample (unconditional) distribution will be non-normal and depend on nuisance parameters (Guggenberger and Smith 2004). Although the efficient estimate of the Jacobian is also very noisy in weakly identified settings, across repeated samples its variability will be independent of ψN . 20 See also Caner (2005, Theorem 5) and Guggenberger and Smith (2004, Theorem 4). In order to avoid the introduction of tedious notation involving the vec(·) operator, in what follows I assume that K = dim (θ0 ) = 1 and hence that Γπ is an S × 1 vector. 21 24 It is also illuminating to consider the joint null of H0 : λ = 0, θ = θ0 . (50) Note that (50) simultaneously considers the hypothesis that θ = θ 0 as well as the validity of the overidentifying moment restrictions at θ = θ 0 . It is straightforward to form GEL-based score tests for (50). The (S + K) × 1 score for the √ √ unprofiled GEL criterion function under (50) equals sN (0, θ 0 )/ N = N (ψN (θ 0 ), 0)0 with a variance of diag{Ω0 , 0} (see Appendix B and D). A feasible score statistic is therefore S AR = N · ψN (θ0 )0 Ωπ (θ 0 )−1 ψ N (θ 0 ), (51) which is essentially the ‘S-statistic’ of Stock and Wright (2000)22 . That is, S AR is a GEL-analog of the Anderson-Rubin statistic; it is a χ2 random variable with S degrees of freedom under (50). Forming a score test for validity of the moments (i.e., that λ = 0) conditional on a known value of θ 0 yields a interesting type of Sargan-Hansen overidentification statistic. Appendix D shows that the appropriate score statistic in this case equals SJ = N · ψN (θ0 )0 Pπ ψN (θ 0 ) (52) 0 −1 0 −1 −1 0 −1 = N · ψN (θ0 )0 Ω−1 π ψ N (θ0 ) − N · ψ N (θ 0 ) Ωπ Γπ (Γπ Ωπ Γπ ) Γπ Ωπ ψ N (θ0 ) = S AR − S K , with Γπ = Γπ (θ 0 ) and similarly for Ωπ . S J is a χ2 random variable with S − K degrees of freedom. From (51) and (52) we see that S AR = S J + S K , highlighting how S AR combines a score test for the null that θ = θ0 with a Sargan-Hansen type test for the validity of the overidentifying moments at θ = θ 0 . b If instead of conditioning on θ = θ0 we condition on b θ we have Γ0π Ω−1 π ψ N (θ) ' 0 by the GEL b −1 b θ)0 Ω first order conditions of Theorem 5 and hence (52) simplifies to N ·ψN (b π ψ N (θ), a conventional Sargan-Hansen type statistic. Since S J and S K are asymptotically independent at θ = θ 0 one can test (50) sequentially. First, test the validity of the overidentifying moment restrictions at θ = θ0 using S J then, conditional on failing to reject these restrictions, test the null that θ = θ0 using the score statistic, S K . This procedure, a version of which is suggested by Kleinbergen (2005) for the continuously updating (CU) estimator, avoids ‘spuriously’ accepting the θ = θ 0 null when θ 0 is in a region of the parameter space where the identifying moment restrictions are violated (e.g., near a maximum of the GEL criterion). The overall asymptotic size of this procedure equals αJ + αK + αJ αK , where αJ and αK denote the size of the individual tests. In the Monte Carlo experiments I consider the small sample properties of two versions of the score statistic. The first, which exploits the information equality, is 22 Their statistic replaces Ωπ (θ0 ) with ΩN (θ0 ). 25 e0π Ω e −1 e −1 p e S(e θ) = spN (e θ)0 (Γ π Γπ ) sN (θ)/N, (53) θ) = ∇θ lpN (e θ) is the score of the profiled GEL criterion function evaluated at e θ (c.f., where spN (e equation (65) in Appendix A). Recall that e θ denotes the GEL point estimates computed under the p b −1 e 0π Ω e −1 e −1 null hypothesis (18). One could also replace (Γ π Γπ ) /N in (53) with H (θ) . N Guggenberger and Smith (2004) consider an alternative variant of the score statistic given by e π (Γ e −1 e −1 e 0 e e ee e 0π Ω θ) = N · λ( θ)0 Γ S GS (e π Γπ ) Γπ λ(θ). (54) e0 Ω e −1 e e0 e −1 e −1 e 0 e −1 e S K (e θ) = N · ψ N π Γπ (Γπ Ωπ Γπ ) Γπ Ωπ ψ N , (55) e−1 e 0 e e−1 e e−1 0 −1 e e−1 p e θ) = spN (e θ)0 A S M (e p C (C Ap Bp Ap C ) C Ap sN (θ)/N, (56) While Kleinbergen (2005) and Newey and Windmeijer (2005) suggest the variant which is similar in form to the Newey and West (1987) score statistic for two-step GMM. The √ θ)/ N , the actual (scaled) score of the profiled latter two variants — (54) and (55) — replace spN (e GEL criterion function, with op (1) approximations. This suggests that the variant given by (53) is preferable and this is the form used in the Monte Carlo experiments.23 The second version of the score statistic examined in the Monte Carlo experiments follows directly from standard M-estimator results. Making no use of the information matrix equality suggests a robust statistic of PN p p 0 e ep is estimated by H p (e sc e sc /N. where A c=1 e N θ)/N and Bp by the outer-product-gradient estimator M The S score statistic has not been suggested in the context of GEL previously. The large sample distribution of the score statistic is χ2J under the maintained null (recalling that rank (C) = J). The score statistic retains its good properties under weak identification when some elements of θ are replaced with constrained GEL estimates as long as the K −J parameters (or ‘combinations’ of parameters) that are unconstrained by (18) are strongly identified (Guggenberger and Smith 2004, Kleinbergen 2005). Likelihood ratio statistics. It is also possible to base inference directly on the profiled GEL p b criterion function, lN (θ). Smith (1997) shows that the in large samples the minimized (and normalized) GEL criterion function is a χ2 random variable: LRJ (b θ) = 2 XN c=1 D eb [ρ(λ( θ)0 ψ c (b θ)) − ρ0 ] → χ2S−K , θ) measures the extent to which the data where S − K equals the degree of overidentification. LRJ (b need to be re-weighted in order for the moment restrictions to be satisfied. It can form the basis of an overidentification test. e π is replaced by Ω e N then the three statistics are identical for the continuously updating (CU) estimator If Ω (Guggenberger and Smith 2004). 23 26 Hypotheses like (18) can be tested using LR(b θ, e θ) = 2 XN c=1 ee eb [ρ(λ( θ)0 ψ c (e θ)) − ρ(λ( θ)0 ψc (b θ))]. The LR statistic is the EL analog of a log-likelihood ratio in a fully parametric model. In the absence of any moment restrictions the empirical distribution of the data places probability 1/N on each point. Overidentification necessitates a re-weighting of the data in order to ensure that the moment conditions are satisfied. As more and more restrictions are placed on the data, the weights will diverge from the empirical measure in a finite sample. The LR statistic measures the incremental effect of imposing the null hypothesis (18) on this divergence. The large sample distribution of LR is χ2J under the maintained null. The LR statistic does not have standard limiting distributions under weak identification (Guggenberger and Smith 2004). This is because b θ is inconsistent in this setting. However the following Anderson-Rubin type statistic LRAR (e θ) = 2 XN c=1 ee [ρ(λ( θ)0 ψc (e θ)) − ρ0 )], is a χ2 random variable in large samples — regardless of the strength of identification — with degrees of freedom equal to S − (K − J) (again assuming that the K − J parameters unconstrained by (18) are strongly identified). Note that LRAR (e θ) = LRJ (b θ) + LR(b θ, e θ). Wald statistics. The Monte Carlo experiments conducted below consider four variants of the Wald statistic. The first variant, analogous to the Wald statistic associated with two-step GMM, is based on a direct estimate of I (θ 0 )−1 : b −1 b −1 b −1 b b Γ b 0π Ω W (b θ) = N · c(b θ)0 [C( π Γπ ) C] c(θ). (57) −1 b −1 b b p (b θ) = c(b θ)0 [CH W H (b N θ) C] c(θ). (58) Alternatively, exploiting (29), we have a statistic based on the ‘observed information’ in the sample: Kitamura, Gautam and Ahn (2004) note that W H leads to asymptotically valid inference, but do not evaluate its small sample properties. A third Wald statistic, which although new is implicit in the work of Imbens (1997), treats GEL as an M-estimator, ignoring the information matrix equality given by Theorem 2: −1 b p (b W (b θ) = c(b θ)0 [CH N θ) M ½X N sp (b θ)spc (b θ)0 c=1 c ¾ p b −1 b −1 b HN (θ) C] c(θ). (59) Newey and Windmeijer (2005) suggest another Wald statistic, which like (59) is based on a sandwich estimate of variance-covariance matrix for b θ, but where the outer-product estimator for 27 the variance of the score is replaced by a direct estimate. Their statistic is p b −1 b −1 b −1 b0 b −1 b b p (b θ) = N · c(b θ)0 [CH W N W (b N θ) (Γπ Ωπ Γπ )HN (θ) C] c(θ). (60) Newey and Windmeijer (2005) derive (60) using an alternative ‘many weak moments’ asymptotics where the number of moments grows with the sample size. Under their asymptotics GEL remains consistent with a normal limiting distribution as long as the number of moments does not increase too quickly with the sample size. However — heuristically — Theorem 2 does not hold in this setting. The variance of the score is larger than the average Hessian under ‘many weak moments’ asymptotics and therefore the conventional variance estimator is too small. Fortunately, the sandwich variance estimator is consistent for the correct limiting variance in this setting. Since (59) and (60) are estimating the same quantity they should behave similarly in finite samples. It is interesting that a simple M-estimator argument leads to essentially the same test statistic as the more complicated ‘many weak moments’ asymptotics of Newey and Windmeijer (2005). In weakly identified settings, with the number of moments held fixed, none of these statistics will lead to valid inference (Stock and Wright 2000). This is because GEL is inconsistent with a non-standard limiting distribution in such settings. Tilting parameter tests. Imbens, Spady and Johnson (1998) suggest the following Wald test for the null of correct specification: eN (b be θ) = N · λ θ)0 R θ), λN (b IS J (b (61) b is a ‘robust’ sandwich estimate of the variance of the moments — Ω0 — equalling where R b= R ∙X N c=1 ¸ ∙X ¸−1 ∙X ¸ N N 0 0 2 b b0 b b b b b ρ2 ψc ψ c /N × b ρ1 ψc ψc /N × b ρ2 ψc ψc /N . c=1 c=1 (62) The use of an estimate of Ω0 as the weight matrix in a quadratic form of the tilting parameters is appropriate because P0 Ω0 P0 = P0 and hence Ω0 is a generalized inverse of P0 (c.f., Imbens, b as an estimate of Spady and Johnson 1998). A heuristic motivation for the specific choice of R Ω0 is provided in Appendix D. Imbens, Spady and Johnson (1998) and Imbens and Spady (2002) θ) has very good size properties. Presumably these report Monte Carlo experiments where IS J (b results are driven by the superior accuracy of the asymptotic normal approximation to the sampling distribution of e λN (b θ) relative to the corresponding approximation for b θ itself. A difference between (61) evaluated at the constrained and unconstrained parameter estimates provides a direct test of (18). Alternatively Imbens and Spady (2002) suggest the minimum-χ2 statistic D b e eN (b eN (e λN (e θ) − e λN (b θ))0 R( θ) − λ θ)) → χ2J , IS(b θ, e θ) = N · (λ for testing (18). The coverage properties of confidence intervals formed by inverting IS(b θ, e θ) are evaluated in the Monte Carlo experiments below. 28 8 Monte Carlo The goal of the Monte Carlo experiments is to evaluate the small sample properties of GMM and GEL estimators and associated testing procedures for the linear-in-means model in light of the theoretical predictions of Sections 6 and 7. The Monte Carlo experiments mimic key features of the Project STAR data set used in Graham (2004) to estimate peer group effects. Five estimators based on the conditional moment restriction (8) are evaluated: 1. GMM: This is Hansen’s (1982) optimal two-step estimator. One-step 2SLS estimates are used to construct the weight matrix. 2. ET: The ‘exponential tilting’ member of the GEL family of estimators. This estimator is advocated by Imbens, Spady and Johnson (1998). It is computationally attractive relative to other GEL estimators. 3. EL: The ‘empirical likelihood’ member of the GEL family of estimators. This estimator has the smallest asymptotic bias among all GEL estimators. 4. 2SLS: This is a ‘one-step’ GMM estimator which uses WN = ( matrix. It is inefficient, but computationally straightforward. P rc rc0 /N)−1 as a weight 5. QML: This is an M-estimator which use the log-likelihood appropriate if yc |Mc were multivariate normal as a criterion function. Normality is not required for the consistency of b θ QM L but it is only efficient in that case. In the designs considered below, the QML estimator is in fact the maximum likelihood (ML) estimator; however, the Wald and Score statistics used for hypothesis testing are of the QML/robust form (c.f., MaCurdy 1982, Gourieroux and Monfort 1993, Wooldridge 2002). It is also useful to compare the properties of the above estimators to those of the infeasible oracle estimator which uses the unknown optimal linear combination of the sample average of the moment vector — Γ00 Ω−1 0 — for estimation. The Γ0 and Ω0 matrices are given by (17). A closed form expression for Ω0 requires auxiliary distributional assumptions. Under normality Appendix F shows that Ω0 is diagonal with elements Ωss 0 ) ( ∙ ¸2 4 1 2 2 2 1 4 2σ s , = ps · 2 σ α + γ σs +γ ms m2s ms − 1 s = 1, . . . , S. b The oracle chooses b θ O to satisfy Γ00 Ω−1 0 ψ N (θO ) = 0. 8.1 Small sample bias Table 1 reports the theoretical values of each of the four asymptotic bias components — BO /N, BΓ /N, BΩ /N, and BW /N — for all twelve Monte Carlo designs. The bias values are divided by 1/2 ). Note that the standard the appropriate asymptotic standard error of b γ 2 (i.e., by (I (θ0 )−1 γγ /N) 29 error used for scaling varies across designs and hence the ranking of these scaled asymptotic biases across designs need not correspond to rankings based on absolute bias. The scaled bias components vary substantially across the twelve designs. Holding the population first stage F statistic, μ2 /(S − 1) constant the bias due to Jacobian and weight matrix variability increases with the degree of overidentification. Jacobian bias is greatest for the weakly identified designs of Panel A, the weight matrix bias for the very strongly identified designs of Panel C. Higher order theory suggests that the two-step GMM, ET and EL estimators vary in their sensitivity to the Jacobian and weight matrix bias components. The twelve designs therefore provide an interesting setting in which to compare small sample bias in the GMM, ET and EL estimates of γ 2 and to assess the ability of higher order asymptotic theory to rationalize any differences. The b asymptotic biases of these three estimators differ dramatically across some of the designs. The most extreme example is for the S = 12, μ2 /(S − 1) = 50 design where the asymptotic bias γb2 — as a fraction of its standard error — is -0.8369, -0.3721 and -0.0001 for the GMM, ET and EL estimators respectively (see Table 2). These are extraordinary differences for first order equivalent estimators. Table 2 reports Monte Carlo estimates of the median bias of b γ 2 for each estimator and design. Median bias is reported instead of mean bias since GEL estimators, like LIML, may lack finite moments in small samples (c.f., Guggenberger 2005a, 2005b, Guggenberger and Hahn 2005). Also reported is the asymptotic bias for each estimator and design calculated analytically using the formulae given in Section 6. The bias estimates, as well as their theoretical counterparts, are scaled q by I(θ 0 )−1 γγ /N . The results are based upon B = 5, 000 successful Monte Carlo draws for each design.24 The standard error of each of the scaled median bias estimates is approximately 0.0177.25 The qualitative bias ranking implied by Newey and Smith’s (2004) expansions are mirrored by the small sample median bias estimates. Among GMM, ET and EL median bias is greatest for the two-step GMM estimator and smallest for EL estimator. The quasi-maximum likelihood estimator, which is the ML estimator in the current setting, consistently has the lowest median bias among all feasible estimators. The ORACLE, QML, and EL estimators are higher order equivalent in terms of bias for these designs. However the EL estimator performs noticeably worse than either the QML or ORACLE estimators. Importantly the median bias of EL is sensitive to the number of moments, unlike its asymptotic counterpart. Overall, higher order expansions do a reasonable job of rationalizing differences in the small sample median bias of different estimates of γ 2 however, important discrepancies remain. For example for the S = 12, μ2 /(S − 1) = 50 design the median bias of the EL, QML and ORACLE estimates is -0.2232, -0.0447, and -0.0497 respectively, although the asymptotic biases of these three estimates are identical. Table 3 directly evaluates the usefulness of Newey and Smith’s (2004) asymptotic bias expressions for understanding the finite sample properties of the GMM and GEL estimators. Under the null hypothesis that an estimator’s asymptotic bias accurately characterizes its small sample 24 25 There were no convergence failures for any of the estimators across all draws and designs. The notes to Table 2 explain how this standard error is estimated. 30 median bias we have, for N and B large enough, √ D B(BiasM C − BiasA ) ' N (0, π/2), where BiasM C is scaled median bias and BiasA is the scaled asymptotic bias of the corresponding estimate; both of which are reported in Table 2. A weighted least squares regression of BiasM C p onto BiasA — with weights 1/ π/2B, the coefficient on BiasA constrained to equal 1 and no constant — should thus have a root mean square error of 1. Panel A of Table 3 reports minimum-χ2 tests for this null estimator-by-estimator. For all estimators median bias differs significantly from the values predicted by the higher order asymptotics of Newey and Smith (2004). The asymptotic bias predictions are most accurate for the ET and QML estimators and least so for the GMM and EL estimators. Panel B of Table 3 reports weighted least squares regressions based upon the idea that D ∗ ∗ ∗ + αΓ BΓ∗ + αΩ BΩ + αW BW , π/2B), BiasM C ' N (αO BO −1/2 ∗ ∗ ∗ = BNO × [I(θ 0 )−1 and similarly for BΓ∗ , BΩ and BW , and the values of αO , αΓ , with BO γγ /N ] αΩ and αW equal to those implied by (45) (again under the null that an estimator’s asymptotic ∗ ∗ and BW vary little across the bias accurately characterizes it small sample median bias). Since BO designs considered here, αO and αW are constrained to equal the values given in (45) while αΓ and αΩ are allowed to deviate from their theoretical values. For the two-step GMM estimator higher order theory suggests that αΓ = αΩ = 1, for the GEL estimators it suggests αΓ = 0 and αΩ = 1 + ρ3 /2, or 1/2 and 0 for ET and EL respectively. For QML, which is ML for these designs, it suggests αΓ = αΩ = 0. The actual regressions suggest that two-step GMM is noticeably less sensitive to Jacobian and weight matrix variability than would be predicted on the basis of asymptotic theory. For the ET and QML estimators, αΓ and αΩ are not significantly different from their predicted values. For EL, unlike asymptotic bias, small sample median bias does appear to be sensitive to weight matrix variability, albeit much less so than either the GMM or ET estimator. 8.2 Accuracy of the asymptotic approximation to the sampling distribution of b θ and confidence interval coverage Table 4 reports scaled median absolute error (MAE) values for each estimator/design. MAE is a robust measure of dispersion. Under the null that the first order asymptotic approximation to the sampling distribution of b γ 2 is accurate, the scaled MAE estimates should be close to one. Deviations from this benchmark could be due to either non-negligible bias or because the actual sampling distribution has thick or thin tails. Table 5 gives the ratio of a robust quantile-based estimate q of the standard deviation of the 2 sampling distribution of γb to is true asymptotic standard error, I(θ 0 )−1 γγ /N, as well as the ratio q of the median estimated asymptotic standard error relative to I(θ0 )−1 γγ /N . With the exception of the weakly identified designs, where there is evidence of thick tails in the GEL and QML estimators, 31 the Monte Carlo standard deviation of b γ 2 is close to the value predicted by the standard asymptotic normal approximation. However the ratio of the median estimated standard error to the true asymptotic standard error is well below one for the heavily overidentified designs (S = 7, 12). This is consistent with Γ0 and/or Ω0 being difficult to estimate precisely in small samples, biasing the estimated standard errors downwards. To see that this is indeed the case it is illuminating to examine the standard variance estimator PN 0 b b b 0 Ω−1 b for θ GM M — (Γ N N ΓN )/N — in detail. The Jacobian term, ΓN , equals − c=1 rc xc /N. Variability in this term is modest when θ is strongly identified. The sth row of the estimated Jacobian matrix b cw |Mc = ms ]) where equals (1, E[g b cw |Mc E[g ∙ ¸ XN XMg 1 1 2 = ms ] = − (ygi − yg ) × 1 (Mg = ms ) , g=1 i=1 ms ns − Ns P with Ns = N c=1 1 (Mc = ms ) . When ns = Ns ms is moderately large this term is estimated with reasonable precision. For the S = 12, μ2 /(S − 1) = 5 ‘weakly’ identified design, 5, 000 Monte Carlo b cw |Mc = m6 ] of (standard deviation in parentheses) draws give a bias for E[g E[{ΓN (θ 0 ) − Γ0 }6,2 ] = " 0.0000022 (0.0001623) # . Both the bias and standard deviation are small relative to the population value of −0.0029885. The estimated Jacobian, because it uses the individual-level within-group variation of the data, is typically quite precise in linear-in-means applications. The estimated variance-covariance matrix of the moments for two-step GMM, ΩN (θ), equals the outer-product ΩN (θ) = N −1 where XN c=1 Ns ps = , N © ª u(gc , θ)2 rc rc0 = diag p1 σ b2u (m1 ), . . . , pS σ b 2u (mS ) , σ b2u (ms ) = PN u(gcb , θ)2 · 1 (Mc = ms ) . PN c=1 1 (Mc = ms ) c=1 Since uc is a function of squares of yci , the elements in Ω0 are functions of the conditional fourth moments of the original data. Fourth moments are difficult to estimate with precision. For N = 300 and S = 12, each of the twelve distinct elements of ΩN (θ) are estimated using an average of only 25 observations! In small samples with modest overidentification, ΩN (θ) is thus a very imprecise estimate of Ω0 . Consider the S = 12, μ2 /(S − 1) = 50 ‘very strongly’ identified design. Even allowing Ω0 to be estimated using the ‘true’ residuals — u(gcb , θ 0 ) — we have for the 6th diagonal element of ΩN (θ0 ) a bias and standard deviation of " # © ª 18.1738 ]= , E[ ΩN (θ 0 )−1 − Ω−1 0 6,6 (32.2942) 32 which is very large relative to the population value of 38.7998. Although the asymptotic normal approximation to the sampling distribution of b γ 2 is reasonable accurate (with the exception of the weakly identified designs), consistently estimating I(θ)−1 appears to be challenging. Coverage rates for different 95 percent confidence intervals under weak, strong and very strong identification are given in Tables 6, 7 and 8 respectively. Several patterns consistently emerge. The ET and EL Wald intervals based on W H , which uses an ‘observed information’ variance estimate, performs significantly better than traditional intervals based on W . This is especially true in the weakly identified designs. The Wald intervals based upon either the Newey-Windmeijer or Mestimator sandwich variance estimators, W NW and W M , do even better, with W M typically doing best. The Score statistics consistently have coverage closest to 95 percent across all designs. Among the two Score statistics the standard M-estimator version, S M , which does not exploit the information equality result of Theorem 2, outperforms the efficient versions suggested by Kleinbergen (2005), Guggenberger and Smith (2004) and Newey and Windmeijer (2005). For the S = 12, μ2 /(S − 1) = 5 design for example, the traditional Wald intervals have an actual coverage of only 79.0 percent while the S M intervals have a coverage of 93.9 percent, very close to nominal coverage. The superior coverage of the GEL-based score statistic intervals is consistent with the analysis of Section 7. Recall that the EL first order conditions use a semiparametrically efficient estimate of the optimal instrumental variable. The instrument is constructed from an efficient estimates of both the Jacobian and weight matrix; therefore the EL score has an asymptotic bias of zero under the null. The GMM score, in contrast, is asymptotically biased for zero. Hence we would expect the EL score-based intervals to exhibit superior coverage. Construction of the score statistic requires an estimate of I(θ)−1 which, as argued above, is difficult to estimate. The contrast between the S and S M statistics in terms of size may reflect the fact that the latter effectively uses a superior estimate for I(θ)−1 . This is a topic for further research. The LR intervals, which base inference directly upon the GEL criterion function, perform similar to the W H intervals, which is not surprising as W H is a quadratic approximation to LR in the neighborhood of b θ GEL . The IS intervals, suggested by Imbens and Spady (2002), consistently perform worst of all. I attribute this poor performance to the quadratic form structure of the statistic where the weight matrix is equal to an estimate of Ω0 , which is very difficult to precisely estimate in the current setting. For the two-step GMM estimator the coverage of the intervals based on changes in quadratic form criterion function (23) perform quite well, with coverage typically close to the highest performing ET and EL S M intervals.26 For QML the W M and S M intervals perform similarly, both with actual coverage very close to 95 percent across all designs considered. Note that, although QML is ML in the current setting, W M and S M are M-estimator versions of the Wald and Score statistics that do not take into account simplifications arising from the information matrix equality. The two-stage least squares (2SLS) W M and S M intervals are consistently outperformed by their 26 Bond, Bowsher and Windmeijer (2001) find that these intervals perform well in strongly identified dynamic panel data settings where there are often many overidentifying restrictions. 33 ET, EL and QML counterparts. However the 2SLS W M intervals consistently have better coverage than the two-step GMM W intervals. 9 Suggestions for applied researchers The Monte Carlo results have an number of implications for applied researchers estimating models identified by moment conditions. Particularly for those interested in the social interactions application of Graham (2005). First, the bias properties of EL are consistently superior to those of two-step GMM and ET. In the designs considered here QML has the best bias properties among feasible estimators. However the QML estimator is also the maximum likelihood estimator for these designs. It will be interesting to examine the performance of QML relative to EL in settings where is not efficient.27 In weakly identified settings, where Jacobian variability is an important source of bias, GEL estimators outperform 2SLS as well as two-step GMM in terms of median bias. In settings where bias due to weight matrix variability is a concern, the comparison is less clear cut. 2SLS eliminates bias due to weight matrix variability by virtue of using a non-stochastic weight matrix. This comes, of course, with a loss in asymptotic efficiency. In strongly identified settings, where Jacobian bias is small, 2SLS tends to display lower median bias than both ET and EL. In such settings the improved performance in terms of bias may compensate for any efficiency losses. However, bias correction of either ET or EL provides an alternative which, in principle, involves no efficiency loss 28 . In terms of inference, the simple and newly suggested W M and S M statistics, derived by treating ET and EL as M-estimators, consistently outperform all other alternatives considered. For those researchers using the two-step GMM estimator inference based directly upon the quadratic form criterion function also performs well. For strongly identified designs an applied researcher uninterested in programming could report 2SLS point estimates, but conduct inference using the two-step GMM criterion function, and reasonably expect good bias and coverage properties, although this approach would not be advisable in weakly identified settings. Although these recommendations are guided by specific Monte Carlo designs, they are likely to be useful for a wide variety of linear instrumental variable models. In particular theory provides a strong case for privileging GEL over GMM and for conducting inference using the GEL score statistic. Qualitatively the Monte Carlo evidence supports these preferences (although I find that the sandwich score statistic performs better than the forms suggested by Guggenberger and Smith (2004) and Kleinbergen (2005)). Overall GEL estimators and inference procedures offer real improvements relative to two-step GMM in terms of bias and coverage; performing well across all the designs considered here. Unfortunately GEL estimators are computationally challenging and require special programming. They are therefore less convenient than both 2SLS and two-step GMM, which are readily available in standard software packages. For researchers interested in using the GEL estimators and testing procedures evaluated here, a self-contained MATLAB program for models defined by linear moment 27 28 Monte Carlo experiments where yc |Mc is multivariate Laplace are in progress. In fact, Newey and Smith (2004) prove that bias-corrected EL is higher order efficient. 34 conditions is available from the author. 35 Appendices A Calculating the score and Hessian of the profiled generalized empirical likelihood p objective function, lN (θ) This appendix provides analytic formulae for the score and Hessian of the profiled and unprofiled GEL criterion function. The expressions generalize and extend those given by Owen (2001) for the empirical likelihood estimator. It is convenient to express the score and Hessian of the profiled GEL objective function (25) def in terms of the first and second derivatives of the unprofiled objective function, lN (λ, θ) = PN 0 c=1 ρ(λ ψ c (θ)). The first derivatives of lN with respect to λ and θ are 5λ l N = J×1 N X c=1 ¡ ¢ ρ1 λ0 ψ c ψ c , 5θ lN K ×1 ¶ µ ¡ 0 ¢ ∂ψ c 0 = ρ1 λ ψ c λ, ∂θ 0 c=1 N X (63) while the second derivatives and cross-partial equal 5λλ lN J ×J 5θθ lN K×K 5θλ lN K×J = N X c=1 ¡ ¢ ρ2 λ0 ψ c ψ c ψ0c , (64) ⎤ ⎤⎡ ⎤0 ⎡ ⎡ J N J J (j) (j) 2 (j) X X X ¡ 0 ¢ X ¡ ¢ ∂ ψc ∂ψc ∂ψc ⎦ = ρ1 λ ψ c ⎣ ρ2 λ0 ψc ⎣ λj ⎦ ⎣ λj ⎦ 0 λj + ∂θ∂θ ∂θ ∂θ c=1 c=1 j=1 j=1 j=1 ⎤ ⎡ µ ¶ µ ¶ N J N (j) X X ¡ ¢ X ¡ 0 ¢ ∂ψc 0 0 ∂ψ c ∂ 2 ψc ⎦ + , = ρ1 λ0 ψ c ⎣ λ ρ ψ λλ λ 2 c 0 j 0 0 ∂θ∂θ ∂θ ∂θ c=1 j=1 c=1 N X ¶ ¶ µ µ N ¡ 0 ¢ ∂ψ c 0 X ¡ 0 ¢ ∂ψc 0 0 = ρ1 λ ψ c + ρ2 λ ψc λψc . ∂θ0 ∂θ 0 c=1 c=1 N X p p p The score (spN (θ) = 5θ lN (θ)) and Hessian (HN (θ) = 5θθ lN (θ)) can be written in terms of (63) and (64). By the chain and product rules the score of the profiled objective function (25) is spN (θ) = à ∂e λN (θ) ∂θ 0 !0 5λ lN + 5θ lN (65) = 5θ lN , where the second equality follows from the Envelope Theorem (i.e., since 5λ lN = 0 at the maximizer eN (θ 0 ) = arg sup lN (λ, θ0 )). The cth observation’s contribution to the score is λ λ∈ΛN (θ 0 ) " e ∂λ s (zc , θ) = ∂θ0 p #0 ∙ e ψ c )ψc + ρ1 (λ e ψc ) ∂ψc ρ1 (λ ∂θ0 0 0 36 ¸0 e λ, e on θ has been surpressed to simplify notation. where dependence of ψc , π c , and λ 0 0 0 e e ) 5λλ lN (∂ λ/∂θ ) + 2(∂ e λ/∂θ 0 )0 5λθ lN + 5θθ lN . However we can use The Hessian equals (∂ λ/∂θ the fact that ∂e λN (θ) = − (5λλ lN )−1 5λθ lN , (66) ∂θ0 to write the Hessian as p 0 HN (θ) = 5θθ lN − 5θλ lN (5λλ lN )−1 5λθ lN . (67) To understand (66) note that totally differentiating 5λ lN w.r.t. to θ we have " # XN e 0 0 ∂ψ c ∂ψ ∂ λ c 0 e ψc ) 0 = ρ (λ + ρ (e λ ψc )ψc e + ψc 0 λ c=1 1 c=1 2 ∂θ 0 ∂θ0 ∂θ ∙ ¸ XN XN 0 ∂e λ e0 ∂ψc + e0 ψc ) ∂ψc + ρ2 (λ e0 ψ c )ψ c λ = ρ 1 (λ ρ2 (e λ ψc )ψc ψ0c 0 , 0 0 c=1 c=1 ∂θ ∂θ ∂θ XN 0 and solving for ∂ e λ/∂θ 0 yields eN (θ) ∂λ ∂θ 0 = − ∙X N c=1 0 ρ2 (e λ ψc )ψc ψc 0 = − (5λλ lN )−1 5λθ lN , ¸−1 × ∙X N e0 ∂ψc e0 ψ c ) ∂ψ c + ρ2 (λ e0 ψc )ψc λ ρ (λ 0 c=1 1 ∂θ ∂θ 0 ¸ which equals (66). B B.1 Proof of asymptotic normality and related results 0 0 Proof of Theorem 1 (Asymptotic normality of (b λGEL , b θGEL )0 ) Asymptotic normality of b θ GEL has been shown by Imbens (1997) and Qin and Lawless (1994) for the case of empirical likelihood and by Kitamura and Stutzer (1997) for the case of exponential tilting. Newey and Smith (2004) give a rigorous and general proof for all members of the GEL family (Theorem 3.2, p. 226). The reader is referred to their proof for regularity conditions. Taking consistency as a given, a sketch of the derivation of the limiting distribution is as follows: from (63) and (64) above we have at (θ, λ) = (θ0 , 0) 5λ lN (0, θ 0 ) = − 5λλ lN (0, θ0 ) = − N X c=1 ψc ψ0c , XN c=1 ψc, 5θ lN (0, θ 0 ) = 0, 5θθ lN (0, θ0 ) = 0, 5θλ lN (0, θ0 ) = (68) N X ∂ψ c=1 ∂θ c 0 0 . Expanding each row of the stacked first order conditions w.r.t. λ and θ around (0, θ 0 ) in a mean-value expansion we have λ, b θ) = sN (0, θ0 ) + HN (λ, θ) 0 = sN (b 37 à b λ b θ − θ0 ! , where sN (λ, θ) = à 5λ lN (λ, θ) 5θ lN (λ, θ) ! , HN (λ, θ) = à 5λλ lN (λ, θ) 5λθ lN (λ, θ) 5θλ lN (λ, θ) 5θθ lN (λ, θ) ! (69) b b and each row of the Hessian matrix is evaluated at a different mean value between (λ, θ) and (0, θ 0 ). √ Multiplying through by 1/ N and rearranging we have √ N à b λ b θ − θ0 ! ∙ HN (λ, θ) =− N ¸−1 1 sN (0, θ 0 ) √ . N By (68) and the uniform weak law of large (UWLLN) numbers HN (λ, θ) p →− N à Ω0 −Γ0 −Γ00 0 ! def = A0 , (70) £ £ ¤ ¤ where Γ0 = E ∂ψ c (θ 0 ) /∂θ 0 and Ω0 = E ψc (θ) ψ c (θ)0 . Using the continuity of the inverse of a non-singular matrix and standard results on partitioned inverses gives ∙ HN (λ, θ) N ¸−1 p →− à −1 −Ω−1 P0 0 Γ0 I (θ0 ) −I (θ 0 )−1 Γ00 Ω−1 −I (θ0 )−1 0 ! , (71) ¢ ¡ −1 0 −1 −1 −1 where I (θ0 ) = Γ00 Ω−1 Γ0 Ω0 . 0 Γ0 and P0 = Ω0 − Ω0 Γ0 I (θ 0 ) Applying the central limit theorem to the normalized score vector gives √ D sN (0, θ0 )/ N → N (0, B0 ) , def B0 = " Ω0 0 0 0 # (72) which with Slutsky’s theorem and (71) gives (26). Q.E.D. B.2 Proof of Theorem 2 (Information Matrix Equality) p b (θ)/N)−1 By the continuity of the inverse of a non-singular matrix and the law of large numbers (HN p −1 converges to I (θ 0 ) . To derive the asymptotic variance of the score of lN (θ0 ) — the profiled GEL criterion function — we can adapt standard results on concentrated likelihood functions given by Amemiya (1985, pp. 125 - 126). By the Envelope Theorem the score evaluated at θ0 equals p (θ 0 ) = 5θ lN (e λN (θ 0 ) , θ0 ), spN (θ0 ) = 5θ lN eN (θ0 ) = arg sup lN (λ, θ 0 ) . A mean value expansion around e where λ λN (θ 0 ) = 0 gives λ∈ΛN (θ0 ) ¡ ¢ eN (θ 0 ) spN (θ 0 ) = 5θ lN (0, θ 0 ) + 5θλ lN λ, θ · λ ¢ ¡ = 5θλ lN λ, θ · e λN (θ 0 ) , 38 (73) ¡ ¢ eN (θ 0 ) , θ 0 ) and (0, θ 0 ) and with the second line following from (68) and where λ, θ lies between (λ varies from row-to-row. We also have, for fixed and known θ0 , √ eN (θ0 ) LD Nλ = − LD µ 1 5λλ lN (0, θ0 ) lim N →∞ N ¶−1 √ 5λ lN (0, θ 0 )/ N, (74) where the = denotes that both sides of (74) have the same limiting distribution. Rearranging (73) and using (74) as well as (71) above we have √ D spN (θ 0 ) / N → Γ00 Ω−1 0 · N (0, Ω0 ) . −1 and hence Therefore the asymptotic variance of the score equals Γ00 Ω−1 0 Γ0 = I (θ0 ) E [H p (zc , θ 0 )] = V ar (sp (zc , θ0 )) = I (θ0 ) , (75) which is an information matrix equality for GEL estimators. Q.E.D. B.3 Proof of Theorem 3 (Limiting distribution of ω bA = PN c=1 1 (zc ∈ A) π ec (b θ)) θ) is Consistency of ω b A for ω A = Pr (zc ∈ A) follows from the fact that in large samples Fπ (z, b an op (1) approximation to the empirical distribution function. By the Glivenko-Cantelli Theorem (e.g., van der Vaart 1998, p. 266) Fπ (z, b θ) is thus consistent for F0 (z). The limiting distribution of ω b A can be computed using a straightforward method-of-moments ¡ 0 0 ¢0 argument. Let β = ωA , θ and consider the augmented (1 + S) × 1 vector of moments m (zc , β) = ª0 © ωA − 1 (zc ∈ A) , ψ (zc , θ)0 . From Chamberlain (1987, 1992) the variance bound for β is −1 I (β) ¸ ¸¾−1 ½ ∙ ∙ ∂m (zc , β 0 ) 0 £ ∂m (zc , β 0 ) 0 ¤−1 = E E m (zc , β 0 ) m (zc , β 0 ) E , ∂β 0 ∂β 0 where # ¸ " 1 0 ∂m (zc , β) E = , ∂β 0 0 Γ0 ∙ £ ¤ E m (zc , β) m (zc , β)0 = " ω A (1 − ω A ) −ψ0A −ψA Ω0 # with ψ A = E [ψ (zc , θ) × 1 (zc ∈ A)] = ω A × E [ψ (zc , θ) |zc ∈ A] as stated in the Theorem. Since ω b A is efficient its limiting variance equals the relevant element of I (β)−1 . By standard results on partitioned inverses we have I (β) = à −1 −1 −1 0 (A − ψ0A Ω−1 −(A − ψ0A Ω−1 0 ψA ) 0 ψ A ) ψ A Ω0 Γ0 −1 0 0 −1 −1 −Γ00 Ω−1 Γ00 (Ω0 − ψA A−1 0 ψ A (A − ψ A Ω0 ψ A ) 0 ψ A ) Γ0 ! . Again applying the partitioned inverse formula we get the upper-left element of I (β)−1 equal to 0 −1 0 −1 0 −1 0 −1 I (β)−1 ωω = (A − ψ A Ω0 ψ A ) + ψ A Ω0 Γ0 [Γ0 (Ω0 − ψ A A0 ψ A ) Γ0 0 −1 −1 0 −1 −1 0 −1 −Γ00 Ω−1 0 ψ A (A − ψ A Ω0 ψ A ) ψ A Ω0 Γ0 ] Γ0 Ω0 ψ A , 39 the result follows by showing that −1 −1 −1 0 −1 0 −1 0 − Ω−1 (Ω0 − ψA A−1 0 ψA ) 0 ψ A (A − ψ A Ω0 ψ A ) ψ A Ω0 = Ω0 , −1 −1 −1 0 −1 0 −1 0 = Ω−1 which is true since (Ω0 − ψA A−1 0 ψA ) 0 + Ω0 ψ A (A − ψ A Ω0 ψ A ) ψ A Ω0 . Q.E.D. B.4 Proof of Theorem 4 (Efficient estimation of expectations) The semiparametric efficiency bound for estimating β = (μ0a , θ 0 )0 — when the only prior information is the moment restriction (9) — can be calculated directly using the results of Chamberlain (1987, 1992). Efficiency of μ∗a follows by showing that its asymptotic variance attains this bound. Similarly to the proof of Theorem 3 consider the augmented vector of moments m (zc , β) = ª0 © (ac − μa )0 , ψ (zc , θ)0 with an expected Jacobian and covariance matrix of ¸ " −IK ∂m (zc , β) = E 0 ∂β 0 ∙ 0 Γ0 # , £ ¤ E m (zc , β) m (zc , β)0 = " Σaa Σaψ Σ0aψ Ω0 # . where K = dim(ac ). Applying the bound from Chamberlain (1987) and solving for I (β)−1 aa gives −1 0 −1 0 −1 0 −1 I (β)−1 Γ0 Ω0 Σaψ . aa = Σaa − Σaψ Ω0 Σaψ + Σaψ Ω0 Γ0 I (θ 0 ) (76) If θ 0 is known the final term in I (β)−1 aa drops out with the variance bound simplifying to Σaa − −1 0 Σaψ Ω0 Σaψ . √ To derive the limiting variance of N μ∗a replace ψN (b θ) in (36) with a mean value expansion around θ 0 to yield μ∗a ½ ¶ µ X ¾ ∂ψ(zc , θ b 1 −1 b b (θ − θ 0 ) = aN − Σaψ,N ΩN ψN (θ 0 ) + N ∂θ 0 o n b −1 ψN (θ0 ) + ΓN (b b aψ,N Ω = aN − Σ θ − θ0) . N p b aψ,N → By the UWLLN and the continuity of the inverse of a non-singular matrix we have Σ Σaψ , √ p −1 p −1 b b ΩN → Ω0 , and ΓN → Γ0 . Replacing N (θ − θ0 ) with the asymptotically equivalent quantity √ N ψN (θ 0 ) (see the proof to Theorem 1 above) we have −I (θ0 )−1 Γ00 Ω−1 0 √ √ √ ∗ √ −1 0 −1 N μa = N aN − Σaψ Ω−1 NψN (θ 0 ) + Σaψ Ω−1 Γ0 Ω0 Nψ N (θ0 ) + op (1). 0 0 Γ0 I (θ 0 ) 40 Applying the central limit theorem and some tedious algebra we have √ √ √ −1 0 V ar( Nμ∗a ) = V ar( NaN ) + Σaψ Ω−1 0 V ar( Nψ N )Ω0 Σaψ √ −1 0 −1 −1 0 −1 0 +Σaψ Ω−1 Γ0 Ω0 V ar( N ψN )Ω−1 Γ0 Ω0 Σaψ 0 Γ0 I (θ 0 ) 0 Γ0 I (θ0 ) √ √ 0 √ √ −1 0 −1 −1 −1 −2Σaψ Ω0 Cov( NψN , N aN ) + 2Σaψ Ω0 Γ0 I (θ0 ) Γ0 Ω0 Cov( N ψN , Na0N ) √ √ −1 0 −1 0 −1 0 Γ0 Ω0 Σaψ −2Σaψ Ω−1 0 Cov( Nψ N , N ψ N )Ω0 Γ0 I (θ0 ) −1 0 −1 0 0 −1 = Σaa + Σaψ Ω−1 Γ0 Ω0 Σaψ 0 Σaψ + Σaψ Ω0 Γ0 I (θ 0 ) −1 0 −1 0 −1 0 −1 0 0 −1 −2Σaψ Ω−1 Γ0 Ω0 Σaψ − 2Σaψ Ω−1 Γ0 Ω0 Σaψ 0 Σaψ + 2Σaψ Ω0 Γ0 I (θ 0 ) 0 Γ0 I (θ0 ) −1 0 −1 0 0 −1 = Σaa − Σaψ Ω−1 Γ0 Ω0 Σaψ , 0 Σaψ + Σaψ Ω0 Γ0 I (θ 0 ) which equals the bound given by (76). Therefore μ∗a is efficient. Q.E.D. ‘Rao-Blackwell’ argument for efficiency. An alternative proof of the efficiency of μ∗a follows from a straightforward application of the Rao-Blackwell Theorem (e.g., Rao 1973, pp. 317 - 318). For simplicity consider the case where ψ(zc , θ0 ) is scalar and θ 0 is known. The unbiasedness/consistency restriction limits the class of potential estimators to those which can be written b+ b ∗a + αψN . as linear combinations of aN and ψN . Consider one such alternative estimator μ a = μ For any α ∈ [0, −2Cov(b μ∗a , ψN )/V ar(ψ N )] this estimator will have lower variance than μ b ∗a unless Cov(b μ∗a , ψN ) = 0 since ¡ +¢ μ∗a , ψN ) + α2 V ar(ψN ) ≤ V ar (b μ∗a ) V ar μ ba = V ar (b μ∗a ) + 2αCov(b μ∗a , ψN ) = 0 is a necessary condition for for all Cov(b μ∗a , ψN ) different from zero. Therefore Cov(b μ b∗a to be efficient. Sufficiency of this condition follows by observing that for μ b+ b ∗a + αψ N and a = μ Cov(b μ∗a , ψN ) = 0 we have E[b μ∗a (b μ∗a − μ b+ μ∗a ψ N ] = 0, a ] = −αE[b ¡ +¢ ¡ ∗ +¢ ¡ ∗ +¢ ba , we therefore have or equivalently that V ar (b μ∗a ) = Cov μ ba, μ ba . Let ρ2 = Cov μ ba , μ b a /V ar μ ¡ +¢ ¡ +¢ ∗ 2 V ar (b μa ) = ρ V ar μ ba ≤ V ar μ ba with the latter constraint holding with equality only when 0 ∗ μ ba = μ ba . Q.E.D. C C.1 Form of GEL estimated optimal instruments and probability expansions Proof of Theorem 5 (Form of estimated instrument, e hGEL ) c To derive the implied GEL estimates of the optimal instrumental variable, hc = −E [xc |Mc ] × E[u (gc , θ 0 )2 |Mc ]−1 , given by (39) it is useful to rearrange the GEL first order conditions as in Theorem 2.3 of Newey and Smith (2004, p. 224). Following their proof we can manipulate the first 41 order condition for λ, given by (63) above, to get N X 0 = b c )ψ b0 ψ bc ρ 1 (λ c=1 N h X = N i X 0 bc − b c) + 1 ψ bc ψ λψ ρ1 (b c=1 " N X = c=1 N b c) + 1 X 0 ρ1 (b λψ b b b c. b ψ ψ c ψc λ − b0 ψ bc λ c=1 P Solving for b λ and multiplying through by N c=1 b λ= c=1 # 0 "N X c=1 b0ψ b )+1 ρ1 (λ c /N 0 b b λψ then yields c b b cψ b 0c kc ψ #−1 " # N 1 Xb ψ , N c=1 c (77) where b kc is as defined in (33). Substituting (77) into (63), the first order condition for θ, and P b0 b dividing by N c=1 ρ1 (λ ψ c ) yields "N X c=1 π bc à bc ∂ψ ∂θ 0 !0 # " N X c=1 b b cψ b 0c kc ψ #−1 " # N 1 Xb ψc = 0, N c=1 where π bc is as defined in (32). ¡ ¢ b 0π Ω b −1 rc we therefore have For ψc = rc u (gc , θ) = rc gcb − x0c θ and e hc = Γ k N 1 X e GEL b h θ) = 0, (gc − x0cb N c=1 c which can be rearranged to give (38) and (40). Q.E.D. Proof of Theorem 6 (GEL probability expansions) PN The first result of the proof — the expansion for bc b ac — extends Newey’s (2002) result to c=1 π the case where a(zc , θ) is a matrix. Elements of this proof strategy are also used by Caner (2003, Theorem 5) and Guggenberger and Smith (2004, Theorem 4). The expansion for b kc is derived along the same lines. √ b bounded in probability (i.e., From Newey and Smith (2004, Theorem 3.1, p. 226) we have Nλ ¯ 0 ¯ p ¯ ¯ b = Op (N −1/2 )) and that max ¯b b c ¯ → 0. λ λψ C.2 {c:1,...,N } 42 b b b = 0 we have Taking a mean value expansion of 5λ lN (λ, θ) around λ ¶ 0 b b b0 b ψ ρ2 (λ ψ ) ψ c c c (λ − 0) c=1 c=1 µX ¶ XN N 0 b b b = − ψc /N − ψc ψ c /N b λ + op (N −1/2 ) 0 = − XN = − XN bc + ψ c=1 which solving for b λ yields Expanding PN c=1 c=1 µX N c=1 b c /N − Ω0 λ b + op (N −1/2 ), ψ −1/2 b b λ = −Ω−1 ). 0 ψ N + op (N 0 b c )/N around λ b = 0 we have ρ1 (b λψ XN c=1 0 b c )/N bψ ρ 1 (λ = −1 + µX N c=1 (78) 0 0 b c )ψ b c /N ρ2 (λ ψ b N + op (N −1/2 ) b0 ψ = −1 − λ ¶ b λ (79) = −1 + op (N −1/2 ). θ) be a K × L matrix a(zc , θ 0 ) = (a1 (zc , θ 0 ), . . . , aL (zc , θ0 )) with Σaψ as defined Let b ac = a(zc , b in the statement of the Theorem. P b0 b ac /N we have Expanding N c=1 ρ1 (λ ψ c )b XN c=1 0 b c )b ρ1 (b λψ ac /N ∙X N ¸ 0 b b = − b ac /N + ρ (λ ψ c )b ac (IL ⊗ ψ c )/N (IL ⊗ λ) (80) c=1 c=1 2 ∙X ¸ XN N b N ) + op (N −1/2 ) b 0c )/N (IL ⊗ Ω−1 ψ b ac /N + ac (IL ⊗ ψ b = − 0 XN c=1 = − = − XN c=1 XN c=1 = Op (1) . 0b c=1 −1/2 b b ac /N + Σaψ (IL ⊗ Ω−1 ) 0 ψ N ) + op (N −1/2 b (b ac − Σaψ (IL ⊗ Ω−1 ) 0 ψ c ))/N + op (N Combining (79) and (80) we have: XN c=1 π bc b ac = = = = as claimed. 0 bc) = Let b kc∗ (b λψ 0 b )+1 ρ1 ( b λψ c . 0 b b λψ PN b0 b ac /N c=1 ρ1 (λ ψ c )b PN b0 b c=1 ρ1 (λ ψ c )/N P ac − Σaψ (IL − N c=1 (b XN c=1 XN c=1 ¡ −1/2 ¢ b ⊗ Ω−1 0 ψ c ))/N + op N ¡ ¢ −1 + op N −1/2 −1/2 b (b ac − Σaψ (IL ⊗ Ω−1 ) 0 ψ c ))/N + op (N −1/2 (ac − Σaψ (IL ⊗ Ω−1 ), 0 ψ c ))/N + op (N Using l’Hopital’s rule and a slight abuse of notation we have a mean c 43 value expansion around b λ → 0 of XN c=1 0 b b c )/N kc∗ (b λψ à ! 0b b 0b b bc 1 XN −ψ 1 XN ρ2 (λ ψ ρ ψ ) ψ (λ ) ψ c c b = + − 1 0 c c λ 0b bc 2 c=1 ψ c=1 b N N λ ψc (λ ψ c ) à 0 0 ! b cψ b c 1 ρ3 ψ b cψ bc 1 XN ρ3 ψ b + op (N −1/2 ) = −1 + − λ c=1 b b N 2 ψc ψc 1 b0 b ψN + op (N −1/2 ) = −1 + ρ3 λ 2 = −1 + op (N −1/2 ), (81) where the shift from the first to the second line uses of a double application of l’Hopital’s rule29 to 0b b b → 0. Again using l’Hopital’s rule we have evaluate ρ1 (λ 0ψb c )2ψc as λ (λ ψ)c XN c=1 b b c )b b0 ψ kc∗ (λ ac /N = − XN c=1 µX b ac /N (82) ¶ 1 b 0 )/N (IL ⊗ λ) b + op (N −1/2 ) ac (IL ⊗ ψ ρ3 b c c=1 2 XN 1 −1/2 b = − (b ac + ρ3 Σaψ (IL ⊗ Ω−1 ). 0 ψ c ))/N + op (N c=1 2 + N Combining (81) and (82) we have XN c=1 0 b kc b ac = = XN 1 −1/2 b (b ac + ρ3 Σaψ (IL ⊗ Ω−1 ) 0 ψ c ))/N + op (N 2 1 −1/2 (ac + ρ3 Σaψ (IL ⊗ Ω−1 ), 0 ψ c ))/N + op (N c=1 2 c=1 XN b c ) is defined by (33). λψ where b kc = k(b £ ¡ ¢¤ For the second part of the Theorem it suffices to show that E b ak IL ⊗ ψ0N = (1 + 12 ρ3 Σaψ )/N PN b ac ). By the first part of the Theorem we have, to the order of the (recall that b ak = c=1 kcb 29 0b b (λ ψc )ψc Specifically l’Hopital’s rule is applied twice to − ρ1(λ . 0b ψ )2 0b b0 ρ (λ ψ )ψ − 22(λ0 ψbc ) c , which c 1 b 0 at λ = 0. ρ ψ 2 3 c c The first application gives − is undefined for λ = 0. The second application then gives 44 0b b b0 ρ3 (λ ψ c )ψc ψ c , b 2ψ c 0b b b0 ρ2 (λ ψ c )ψ c ψc 0b b 2(λ ψ c )ψ c = which evaluates to approximation, ∙X ¸ £ ¡ ¡ ¢¤ ¢ N 1 0 −1 0 E b ak IL ⊗ ψN (ac + ρ3 Σaψ (IL ⊗ Ω0 ψc ))/N IL ⊗ ψN = E c=1 2 ∙X ¸ ¡ ¡ ¢ 1 ¢ N 1 0 −1 0 E (ac IL ⊗ ψN + ρ3 Σaψ (IL ⊗ Ω0 ψ c ) IL ⊗ ψN ) = c=1 N 2 ¸ ∙X X N N 1 1 1 0 = Σaψ /N + E ψ ψ ) ρ3 Σaψ (IL ⊗ Ω−1 c N 0 c=1 c=1 2 N N 1 XN 1 1 XN ρ Σaψ (IL ⊗ IS )/N Σaψ /N + = c=1 c=1 2 3 N N 1 1 1 = Σaψ + ρ Σaψ Nµ N 2¶ 3 1 1 1 + ρ3 Σaψ , = N 2 as claimed. The remaining results follow analogously. Q.E.D. D D.1 Test statistics GEL Score statistic based on the unprofiled criterion Consider the joint null H0 : (λ0 , θ 0 )0 = (00 , θ 00 )0 ; this combines a test for moment validity (λ = 0) with a test on the parameter of interest (θ = θ 0 ). Let β = (λ0 , θ 0 )0 , the null hypothesis can be expressed as c(β 0 ) = 0 with c(β 0 ) = Cβ 0 − (00 , θ 0 )0 and C = IS+K . Using the standard M-estimator form of the score statistic given by Wooldridge (2002, p. 365) among others we have −g −1 −1 0 −g 2 S AR = sN (0, θ0 ) A−g 0 C (CA0 B0 A0 C) CA0 sN (0, θ0 ) /N ∼ χS , where A−g denotes a generalized inverse of A. Using (69), (70) and (72) to substitute in for sN (0, θ 0 ) , A0 and B0 respectively, fact that Ω0 is a generalized inverse of P0 and multiplying out yields S AR = N · ψ N (θ 0 )Ω−1 0 ψ N (θ0 ), which, with Ω0 replaced by Ωπ (θ 0 ), is the statistic given by (51). If we wish to test H0 : λ = 0 conditional θ = θ 0 we can express the null as c(β) = Cβ with −1 −1 −1 −1 0 C = (IS , 0). Using the fact that CA−1 0 = (P0 , − Ω0 Γ0 I (θ0 ) ) and CA0 B0 A0 C = P0 we have a score statistic of SJ = N · ψN (θ 0 )0 P0 ψN (θ 0 ) 0 −1 0 −1 −1 0 −1 = N · ψN (θ 0 )0 Ω−1 0 ψ N (θ 0 ) − N · ψ N (θ0 ) Ω0 Γ0 (Γ0 Ω0 Γ0 ) Γ0 Ω0 ψ N (θ 0 ) = S AR − S K ∼ χ2S−K which, replacing Γ0 and Ω0 with Γπ (θ0 ) and Ωπ (θ 0 ), gives the expression given in the text. 45 D.2 GEL Score statistic based on the profiled criterion From Theorem 2 we have 0 spN (e θ) spN (e θ) √ ∼ χ2J , I (θ 0 )−1 √ N N e 0π Ω e −1 e −1 gives (53). From (63) and (65) we have replacing I (θ0 )−1 with (Γ π Γπ ) θ) spN (e √ N à ! ec 0 0 1 XN ∂ ψ e c) e eψ = √ λ ρ (λ c=1 1 ∂θ 0 N à !0 ec √ √ XN ∂ψ e λ + Nop (N −1/2 ) π ec = − N 0 c=1 ∂θ √ e 0π e λ + op (1), = − NΓ PN −1/2 ), as given by (79) above. e0 e where the second line uses that c=1 ρ1 (λ ψ c )/N = −1 + op (N Substituting into (53) gives the Guggenberger and Smith (2004) form of the score statistic (54). √ √ p e √ e Using the fact that N e λ = − NΩ−1 0 ψ N + op (1) (c.f., (78) above) we also have sN (θ)/ N = √ b N + op (1); substituting into (53) and replacing Ω−1 with Ω e 0π Ω−1 ψ e −1 NΓ π givens the Kleinbergen 0 0 (2005) and Newey and Windmeijer (2005) form of the score statistic (55). D.3 Tilting parameter test To motivate the robust estimate of Ω0 suggested by Imbens, Spady and Johnson (1998) begin by taking a mean value expansion of the first order condition for λ — given by (63) above — around b = λ∗ , holding θ fixed at its estimated value. This yields after some rearranging λ ¸−1 X ∙ X √ √ N N 1 0 b b b0 b c )ψ e b bc/ N , N (λ(θ) − λ∗ ) = − ρ2 (λ ψc )ψc ψc ρ1 (λ0∗ ψ c=1 c=1 N suggesting that √ N (e λ(b θ) − λ∗ ) has an approximate variance of b−1 = R ∙ ¸−1 ∙ X ¸ ∙ X ¸−1 N N 0 0 1 1 XN 1 2b b b b b b b ρ ψψ × b ρ ψψ × b ρ ψ ψ , c=1 2 c c c=1 1 c c c=1 2 c c N N N eN (b where both λ and λ∗ have been replaced with λ θ) for the purposes of computation. Under correct specification λ∗ = 0; under this null the IS statistic, using the above variance formula is bλ eN (b eN (b θ)0 R θ) with an (approximate) χ2S−K distribution. Pagan and Robertson (1997, pp. 28 Nλ 30) discuss this derivation in more detail. The derivative of e λN (θ0 ) with respect to θ has an interesting interpretation. From (66) the eN (θ 0 ) = 0 equals Jacobian associated with the null restriction λ ¶ µ eN (θ 0 ) 5λλ lN −1 5λθ lN ∂λ . =− N N ∂θ 0 46 where 5λλ lN /N = Ω0 + op(N −1/2 ) and 5λθ lN (e λN (θ0 ), θ0 )/N = S×K = = = N 1 X e0 ] ∂ψ c e 0 ψ c ) + ρ 2 (λ e0 ψc )ψ c λ [ρ1 (λ N c=1 ∂θ0 ¸ N N ∙ 0 1 X e0 ∂ψc 1 X ∂ψ c 0 e ψc ) ρ (λ ρ (λ ψ c ) 0 + (IK ⊗ ψ c ) (IK ⊗ e λ) N c=1 1 N c=1 2 ∂θ ∂θ0 ¸ N N ∙ 1 X ∂ψc 1 X ∂ψc 0 −1/2 − + (IK ⊗ ψc ) (IK ⊗ Ω−1 ) 0 ψ N ) + op (N N c=1 ∂θ 0 N c=1 ∂θ 0 ¸ N ∙ 1 X ∂ψc −1 − − ΣΓψ (IK ⊗ Ω0 ψc ) + op(N −1/2 ), N c=1 ∂θ 0 ¤ £ −1/2 eN (θ 0 ) equals the product and hence 5θ e λN (θ 0 ) = −Ω−1 ). 5θ λ ΓN − ΣΓψ (IK ⊗ Ω−1 0 0 ψ N ) +op (N of −Ω−1 0 and an efficient estimate of the Jacobian matrix. E Computation Each of the suggested algorithms applies a gradient-based optimization procedure to the profiled GEL criterion function. This requires concentrating out the S × 1 vector of tilting parameters, λ, for each iteration of θ. Fortunately this is straightforward to do and can be done so quickly. eN (θ) to form lp (θ), the profiled GEL criterion function. For a given value of θ, compute λ N e For the continuous updating (CU) estimator λN (θ) exist in closed form, but otherwise it must be solved for numerically by using the first and second derivatives of lN (λ, θ) with respect to λ — given by (63) and (64) above — to implement a Newton-Raphson procedure. To get starting e∗N (θ) = values for this procedure take a Taylor expansion of (63) around e λN (θ) = 0 and solve for λ ´−1 ³P ³P ´ ∗ N N eN (θ) for CU since the quadratic − ψc (θ) ψc (θ)0 ψc (θ) . Note that e λ (θ) = λ c=1 N c=1 approximation is exact in this case. p The profiled GEL criterion function, lN (θ), can be combined with the expression for its score vector given by (65) to implement any of a number of gradient-based optimization procedures.30 I have found that the BFGS quasi-Newton method, as implemented by MATLAB 6.0’s fminunc() function, works well in practice and this is the method used in the Monte Carlo experiments undertaken for the paper. If the implementation allows for a user-specified initial approximation to the Hessian matrix one should use N × ΓN (θ)ΩN (θ)−1 ΓN (θ), where θ are the starting values (e.g., 2SLS estimates). One can also use the expression given for the Hessian, equation (67) above, to implement a Newton-Raphson procedure where the i + 1 updates to θ equal p θ (i+1) = θ(i) − [HN (θ (i) )]−1 spN (θ (i) ), p (θ (i) ) with N ×ΓN (θ (i) )ΩN (θ (i) )−1 ΓN (θ(i) ) with iteration continuing until θ (i+1) ' θ (i) . Replacing HN I have found that the version of spN (θ) that does note take advantage of the simplification due to the Envelope Theorem, given in the first line of (65), works best is practice. 30 47 p results in a GEL analog of Fisher-scoring. If HN (θ (i) ) is not positive definite at any point during the optimization procedure it can be replaced with N × ΓN (θ (i) )ΩN (θ (i) )−1 ΓN (θ(i) ) for a single iteration, resulting in a hybrid Newton-Raphson/Fisher-scoring algorithm that works quite well and is numerically stable in a variety of settings. For the empirical likelihood (EL) case, following Owen (2001, p. 235), I replace the ρ (v) function with ) ( −1 ≥ v ln (1 − v) 1 − N ρ∗ (v) = . ¡ ¢ ln N −1 − 1.5 + 2(1 − v)N − 12 (1 − v)2 N 2 1 − N −1 < v For max πc (b θ) ≤ 1, ρ∗ (v) will result in the same solution as using ρ (v) with the advantage that ρ∗ (v) has unrestricted domain, while using ρ (v) requires optimization to be conducted under maintained restriction that 1 − λ0 ψc (θ) > 0 for all c = 1, . . . , N. The derivatives of ρ (v) that enter the score vector are modified accordingly. For CU and ET ρ (v) is as given in the main text. F Additional details on the Monte Carlo experiments In order to form the optimal instrument for the case where yc |Mc is multivariate normal with a covariance matrix given by (3) the following results on scaled χ2p random variables are useful (these facts are easy to verify using Stein’s Lemma (Casella and Berger 1990, pp. 188 - 189). ¤ £ E λχ2p = pλ ¡ ¢ V ar λχ2p = 2pλ2 h¡ £ ¤¢3 i = 8pλ3 . E λχ2p − E λχ2p Under joint normality of αc and εc the within- and between-group ‘squares’, gcw and gcb respectively, are independent scaled χ2 random variables: σ 2 (Mc ) 1 χ2 , Mc Mc − 1 Mc −1 ¶ µ 2 2 2 σ (Mc ) χ21 . ∼ σα + γ Mc gcw |Mc ∼ gcb |Mc To calculate conditional moments of the residual — u (gc , θ) — it is convenient to rearrange u (gc , θ) = gcb − σ 2α − γ 2 gcw as u (gc , θ) = (gcb − E[gcb |Mc ]) − γ 2 (gcw − E [gcw |Mc ]) , ¤ £ 2 (M ) 2 (M ) c c and E [gcw |Mc ] = σ M . Using independence of gcw and gcb and where E gcb |Mc = σ 2α + γ 2 σ M c c the results on the centered moments of a scaled χ2 random variable given above we have: h 2 E u (gc , θ) |Mc h 3 E u (gc , θ) |Mc i i = 2 = 8 µ µ σ2α σ2α +γ +γ 2σ 2 (M ) c Mc 2 2 σ (Mc ) Mc 48 ¶2 ¶3 +2 +8 γ 4 σ 4 (Mc ) 1 Mc2 Mc − 1 γ 8 σ 8 (Mc ) 1 . 3 Mc (Mc − 1)2 Using the form of the optimal instrument given by Chamberlain (1987, 1992) and Newey (1993) for the conditional moment E [u (gc , θ) |Mc ] = 0 and the above results we have hc = − h E [xc |Mc ] E u (gc , θ)2 |Mc i i−1 h = − (1, E [gcw |Mc ])0 × E ρ (gc , θ)2 |Mc #−1 ¸0 " µ ¶2 ∙ 2 σ 2 (Mc ) γ 4 σ4 (Mc ) 1 2 2 σ (Mc ) = − 1, × 2 σα + γ +2 . Mc Mc Mc2 Mc − 1 References Abowd, John M. and David Card. (1989). “On the covariance structure of earnings and hours changes,” Econometrica 57 (2): 411 - 445. Altonji, Joseph G. and Lewis M. Segal. (1996). “Small-sample bias in GMM estimation of covariance structures,” Journal of Business and Economic Statistics 14 (3): 353 - 366. Amemiya, Takeshi. (1985). Advanced Econometrics. Cambridge, MA: Harvard University Press. Back, Kerry and David P. Brown. (1993). “Implied probabilities in GMM estimators,” Econometrica 61 (4): 971 - 975. Bound, John, David A. Jaeger and Regina M. Baker. (1995). “Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak,” Journal of the American Statistical Association 90 (430): 443 450. Blundell, Richard and Stephen Bond. (1998). “Initial conditions and moment restrictions in dynamic panel data models,” Journal of Econometrics 87 (1): 115 - 143. Bond, Stephen, Clive Bowsher and Frank Windmeijer. (2001). “Criterion-based inference for GMM in autoregressive panel data models,” Economic Letters 73 (3): 379 - 388. Bond, Stephen and Frank Windmeijer. (2002). “Finite sample inference for GMM estimators in linear panel data models,” CEMMAP Working Paper CWP04/02. Brown, Bryan W. and Whitney K. Newey. (1998). “Efficient semiparametric estimation of expectations,” Econometrica 66 (2): 453 - 464. Caner, Mehmet. (2003). “Exponential tilting with weak instruments: estimation and testing,” Mimeo, University of Pittsburgh. Casella, George and Roger L. Berger (1990). Statistical Inference. Belmont, CA: Duxbury Press. 49 Chamberlain, Gary. (1982). “Multivariate regression models for panel data,” Journal of Econometrics 18 (1): 5 - 46. Chamberlain, Gary. (1984). “Panel data," Handbook of Econometrics 2: 1247 - 1318 (Z. Griliches & M.D. Intriligator, Eds.). Amsterdam: North-Holland. Chamberlain, Gary. (1987). “Asymptotic efficiency in estimation with conditional moment restrictions,” Journal of Econometrics 34 (3): 305 - 334. Chamberlain, Gary. (1992). “Efficiency bounds for semiparametric regression,” Econometrica 60 (3): 567 - 596. Clark, Todd E. (1996). “Small-sample properties of estimators of nonlinear models of covariance structure,” Journal of Business and Economic Statistics 14 (3): 367 - 373. Corcoran, S.A., A.C. Davison and R.H. Spady. (1995). “Reliable inference from empirical likelihoods,” Mimeo, Nuffield College, Oxford University. Ferguson, Thomas S. (1996). A Course in Large Sample Theory. London: Chapman & Hall. Gourieroux, Christian and Alain Monfort. (1993). “Pseudo-likelihood methods,” Handbook of Statistics 11 : 335 - 362 (G.S. Maddala, C.R. Rao & H.D. Vinod, Eds.). Amsterdam: North-Holland. Graham, Bryan S. (2005). “Identifying social interactions through excess variance contrasts,” Mimeo, Harvard University. Guggenberger, Patrik. (2005a). “Monte-carlo evidence suggesting a no moment problem of the continuous updating estimator,” Economics Bulletin 3 (13): 1- 6. Guggenberger, Patrik. (2005b). “Finite-sample evidence suggesting a heavy tail problem of the generalized empirical likelihood estimator,” Mimeo, UCLA. Guggenberger, Patrik and Jinyong Hahn. (2005). “Finite sample properties of the 2-step empirical likelihood estimator,” Mimeo, UCLA. Guggenberger, Patrik and Richard J. Smith. (2004). “Generalized empirical likelihood estimators and tests under partial, weak and strong identification,” Mimeo, UCLA. Hahn, Jinyong and Jerry Hausman. (2002). “Notes on bias in estimators for simultaneous equations models,” Economic Letters 75 (2): 237 - 241. Hahn, Jinyong, Jerry Hausman, and Guido Kuersteiner. (2004). “Estimation with weak instruments: accuracy of higher-order bias and MSE approximations,” Econometrics Journal 7 (1): 272 - 306. Hansen, Lars Peter. (1982). “Large sample properties of generalized method of moments estimators," Econometrica 50 (4): 1029 - 1054. 50 Hansen, Lars Peter, John Heaton and Amir Yaron. (1996). “Finite-sample properties of some alternative GMM estimators,” Journal of Business and Economic Statistics 14 (3): 262 - 280. Horowitz, Joel L. (1998). “Bootstrap methods for covariance structures,” Journal of Human Resources 33 (1): 39 - 61. Imbens, Guido. W. (1997). “One-step estimators of over-identified generalized method of moments models,” Review of Economic Studies 64 (3): 359 - 383. Imbens, Guido. W. (2002). “Generalized method of moments and empirical likelihood,” Journal of Business and Economic Statistics 20 (4): 493 - 506. Imbens, Guido W. and Tony Lancaster. (1994). “Combining micro and macro data in microeconometric models,” Review of Economic Studies 61 (4): 655 - 680. Imbens, Guido W. and Richard Spady (2002). “Confidence intervals in generalized method of moments models,” Journal of Econometrics 107 (1): 87 - 98. Imbens, Guido W, Richard Spady and Phillip Johnson (1998). “Information theoretic approaches to inference in moment condition models,” Econometrica 66 (2): 333 - 357. Kitamura, Yuichi and Michael Stutzer. (1997). “An information-theoretic alternative to generalized method of moments estimation,” Econometrica 65 (4): 861 - 874. Kitamura, Yuichi, Gautam Tripathi and Hyungtaik Ahn. (2004). “Empirical likelihood-based inference in conditional moment restriction models,” Econometrica 72 (6): 1667 - 1714. Kleinbergen, Frank. (2005). “Testing parameters in GMM without assuming that they are identified,” Econometrica, forthcoming. MaCurdy, Thomas E. (1982). “The use of time series processes to model the error structure of earnings in a longitudinal data analysis,” Journal of Econometrics 18 (1): 83 - 114. Mittelhammer, Ron C., George Judge and Ron Schoenberg. (2005). “Empirical evidence concerning the finite sample performance of EL-type structural equation estimation and inference methods,” Identification and Inference for Econometric Models: A Festschrift in Honor of Thomas Rothenberg: 282 - 305 (D.W.K Andrews & J.H. Stock, Eds.). Cambridge: Cambridge University Press. Newey, Whitney K. (1993). “Efficient estimation of models with conditional moment restrictions,” Handbook of Statistics 11 : 419 - 454 (G.S. Maddala, C.R. Rao & H.D. Vinod, Eds.). Amsterdam: North-Holland. Newey, Whitney. (2002). “Notes on GMM and Generalized Empirical Likelihood,” Mimeo, MIT. 51 Newey, Whitney K. and Daniel McFadden. (1994). “Large sample estimation and hypothesis testing,” Handbook of Econometrics 4: 2111 - 2245 (R.F. Engle & D.F. McFadden, Eds.). Amsterdam: North-Holland. Newey, Whitney K. and Richard J. Smith. (2004). “Higher order properties of GMM and generalized empirical likelihood estimators,” Econometrica 72 (1): 219 - 255. Newey, Whitney K. and Kenneth D. West. (1987). “Hypothesis testing with efficient method of moments estimation,” International Economic Review 28 (3): 777 - 787. Newey, Whitney K. and Frank Windmeijer. (2005). “Many weak moment asymptotics for generalized empirical likelihood estimators,” Mimeo, MIT. Owen, Art. B. (2001). Empirical Likelihood. New York: Chapman & Hall/CRC. Pagan, Adrian R. and John Robertson. (1997). “GMM and its problems,” Mimeo, ANU. Qin, Jin and Jerry Lawless. (1994). “Empirical likelihood and general estimating equations,” Annals of Statistics 22 (1): 300 - 325. Ragusa, Giuseppe. (2005). “Alternatives to GMM: properties of minimum divergence estimators,” Mimeo, UCSD. Ramalho, Joaquim J.S. (2005). “Small sample bias of alternative estimation methods for moment condition models: Monte Carlo evidence for covariance structures,” Studies in Nonlinear Dynamics and Econometrics 9 (1) Article 3. Rao, C. Radhakrishna. (1973). Linear Statistical Inference and its Applications, 2 nd . Ed. New York: John Wiley and Sons. Schennach, Susanne M. (2003). “Exponentially tilted empirical likelihood,” Mimeo, University of Chicago. Smith, Richard J. (1997). “Alternative semi-parametric likelihood approaches to generalized method of moments estimation,” Economic Journal 107 (443): 503 - 519. Smith, Richard J. (2000). “Empirical likelihood estimation and inference,” Applications of Differential Geometry to Econometrics: 119 - 150 (P. Marriot & M. Salmon, Eds.). Cambridge: Cambridge University Press. Staiger, Douglas and James H. Stock. (1997). “Instrumental variables regression with weak instruments,” Econometrica 65 (3): 557 - 586. Stock, James H. and Jonathan H. Wright. (2000). “GMM with weak identification,” Econometrica 68 (5): 1055 - 1096. Stock, James H., Jonathan H. Wright and Motohiro Yogo. (2002). “A survey of weak instruments and weak identification in generalized method of moments,” Journal of Business and Economic Statistics 20 (4): 518 - 529. 52 Tauchen, George. (1986). “Statistical properties of generalized method-of-moments estimators of structural parameters obtained from financial market data,” Journal of Business and Economic Statistics 4 (4): 397 - 416. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press. Wooldridge, Jeffrey M. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: The MIT Press. 53 Table 1: Individual asymptotic bias components for γ 2 under weak, strong and very strong identification (multivate normal DGP) μ2 /(S − 1) = 5 S=2 S=4 S=7 S = 12 Panel B: μ2 /(S − 1) = 20 S=2 S=4 S=7 S = 12 Panel C: μ2 /(S − 1) = 50 S=2 S=4 S=7 S = 12 Panel A: ∗ BO BΓ∗ ∗ BΩ ∗ BW − − −0.0000 −0.0000 − − −0.2028 −0.2465 − − −0.0670 −0.1751 − − −0.0017 −0.0011 −0.0000 −0.0000 −0.0001 −0.0001 − −0.0553 −0.1092 −0.1347 − −0.0413 −0.2040 −0.4310 − −0.0003 −0.0028 −0.0032 −0.0000 −0.0001 −0.0001 −0.0001 − −0.0434 −0.0533 −0.0894 − −0.2200 −0.2658 −0.7440 − −0.0008 −0.0024 −0.0033 The table reports the asymptotic bias of γ b2 due to each of the four components given by (46) for each of the twelve Monte Carlo designs. The reported bias values are divided by the asymptotic standard 1 ∗ γ 2 , (i.e., BO = BNO error of b −1 1/2 etc.). Notes: [I(θ0 )γγ /N ] 54 Table 2: Median bias of competing estimates of γ 2 under weak, strong and very strong identification (multivate normal DGP) S Panel A: μ /(S − 1) = 5 2 2 BiasMC GMM ET EL QML Oracle 2SLS Panel B: μ2 /(S − 1) = 20 GMM ET EL QML Oracle 2SLS Panel C: μ2 /(S − 1) = 50 GMM ET EL QML Oracle 2SLS 4 BiasA BiasM C − 7 12 BiasA BiasM C −0.2106 −0.0281 −0.0004 −0.0267 0.0049 −0.1796 BiasA −0.2716 −0.0335 −0.0000 −0.0000 −0.0000 − BiasM C −0.3255 −0.0343 0.0002 −0.0012 0.0050 −0.2203 BiasA −0.4227 −0.0876 −0.0001 −0.0001 −0.0001 − − 0.0080 0.0080 0.0080 0.0080 0.0080 0.0080 −0.0000 −0.0000 −0.0000 −0.0000 −0.0000 −0.0000 −0.0901 −0.0334 −0.0240 −0.0198 −0.0181 −0.0732 −0.0970 −0.0207 −0.0000 −0.0000 −0.0000 − −0.2021 −0.0804 −0.0371 −0.0014 −0.0017 −0.1038 −0.3160 −0.1021 −0.0001 −0.0001 −0.0001 − −0.3867 −0.2149 −0.1412 −0.0378 −0.0359 −0.1540 −0.5689 −0.2156 −0.0001 −0.0001 −0.0001 − 0.0063 0.0063 0.0063 0.0063 0.0063 0.0063 −0.0000 −0.0000 −0.0000 −0.0000 −0.0000 −0.0000 −0.2014 −0.0885 0.0003 −0.0294 0.0333 −0.0459 −0.2642 −0.1100 −0.0001 −0.0001 −0.0001 − −0.2054 −0.1124 −0.0617 −0.0045 −0.0092 −0.0594 −0.3216 −0.1330 −0.0001 −0.0001 −0.0001 − −0.5468 −0.3622 −0.2232 −0.0447 −0.0479 −0.1484 −0.8369 −0.3721 −0.0001 −0.0001 −0.0001 − The table reports Monte Carlo estimates of scaled median bias (BiasM C ) for competing estimates of b γ 2 (i.e., the median bias of b γ 2 divided by −1 1/2 its asymptotic standard error, [I(θ0 )γγ /N ] ). Reported statistics are based on B = 5, 000 successful Monte Carlo draws. The column labelled BiasA Notes: −1 gives the asymptotic bias of γ b2 — again divided by [I(θ 0 )γγ /N] 1/2 — as calculated analytically using Newey and Smith’s (2004) expansions. The standard D 1/2 error of the median bias estimates are approximately [(π/2)/5000]1/2 ' 0.01772454 since for N and B large enough γ b2 /[I(θ0 )−1 'N(BiasA , 1) γγ /N ] and √ D B(BiasMC − BiasA )'N (0, π/2) (c.f., Ferguson 1996, Theorem 13, p. 89). 55 Table 3: (Monte Carlo) median bias relative to theoretical predictions of Newey and Smith (2004) (multivate normal DGP) GMM ET EL QML Panel A r.m.s.e. 8.0032 [pv = 0.00] 1.3356 [pv = 0.07] 5.4783 [pv = 0.00] 1.4655 [pv = 0.03] αΓ 0.8414 (0.0530) −0.1224 (0.0612) −0.1537 (0.1017) 0.0371 (0.0727) Panel B αΩ 0.6293 (0.0210) 0.4964 (0.0243) 0.3063 (0.0404) 0.0413 (0.0289) r.m.s.e. 0.9040 [pv = 0.56] 1.0447 [pv = 0.36] 1.7351 [pv = 0.01] 1.2401 [pv = 0.16] The table reports coefficients and root mean squared error (r.m.s.e.) from weighted least squares regressions of the BiasM C estimates reported in Table 2 onto the corresponding theoretical bias components reported in Table 1. The weights equal [(π/2)/5000]−1/2 (see the notes to Table 2 for a discussion of why this is a reasonable estimate of the inverse standard error of BiasM C ). Panel A reports the r.m.s.e. ∗ , B ∗ , B ∗ and B ∗ where the coefficients are constrained to the from a regression of BiasM C onto BO Γ Ω W ∗ to vary freely appropriate theoretical values given in (45). Panel B allows the coefficients on BΓ∗ and BΩ (with the remaining coefficients continuing to be constrained to their theoretical values). Constants are not included in any of the regressions. Regressions are performed estimator-by-estimator using data from the eight overidentified designs. Under the null that the theoretical bias expressions accurately capture the finite sample median bias properties of the given estimator r.m.s.e. should be close to one with RSS ∼ χ2DF , where RSS equals the residual sum-of-squares and DF the degrees of freedom in the regression. The p-value for the χ2 test is reported below the r.m.s.e. values. Notes: 56 Table 4: Median absolute error (MAE) of competing estimates of γ 2 under weak, strong and very strong identification (multivate normal DGP) S Panel A: μ2 /(S − 1) = 5 GMM ET EL QML Oracle 2SLS Panel B: μ2 /(S − 1) = 20 GMM ET EL QML Oracle 2SLS Panel C: μ2 /(S − 1) = 50 GMM ET EL QML Oracle 2SLS 2 MAE 4 MAE 7 MAE 0.9824 1.1610 1.1628 1.1053 1.0273 0.9681 12 MAE 0.9919 1.1951 1.1803 1.0885 0.9720 0.9379 − − 0.9994 0.9994 0.9994 0.9994 0.9994 0.9994 1.0249 1.0476 1.0433 1.0205 1.0095 1.0055 1.0319 1.0519 1.0420 1.0196 0.9962 0.9948 1.1127 1.0884 1.0605 1.0113 1.0025 1.0191 0.9913 0.9913 0.9913 0.9913 0.9913 0.9913 1.0295 1.0144 1.0211 1.0153 1.0055 0.9813 1.0540 1.0366 1.0304 0.9939 0.9926 0.9810 1.1779 1.0962 1.0531 0.9965 1.0042 1.0019 Notes: The table reports Monte Carlo estimates of the (scaled) median absolute error (MAE) of com1/2 b2 ; the MAE estimates are multiplied by 1.4826 and divided [I(θ0 )−1 . Reported peting estimates of γ γγ /N ] statistics based on B = 5, 000 successful Monte Carlo draws. If the first-order asymptotic normal approxiγ 2 is accurate, then the scaled MAE values should, up to simulation mation to the sampling distribution of b error, equal one. 57 Table 5: Standard deviation of competing estimates of γ 2 under weak, strong and very strong identification (multivate normal DGP) S Panel A: μ2 /(S − 1) = 5 GMM ET EL QML Oracle 2SLS Panel B: μ2 /(S − 1) = 20 GMM ET EL QML Oracle 2SLS Panel C: μ2 /(S − 1) = 50 GMM ET EL QML Oracle 2SLS 2 4 7 12 med(se b γb 2 ) std(b γ2 ) −1 1/2 −1 [I(θ 0 )γγ /N] [I(θ0 )γγ /N ]1/2 med(se b γb 2 ) std(b γ2) −1 1/2 −1 [I(θ 0 )γγ /N] [I(θ0 )γγ /N ]1/2 med(se b γb 2 ) std(b γ2) −1 1/2 −1 [I(θ0 )γγ /N ] [I(θ 0 )γγ /N]1/2 med(se b bγ2 ) std(b γ2) −1 1/2 −1 [I(θ0 )γγ /N ] [I(θ 0 )γγ /N]1/2 − − 0.9685 1.2283 1.2455 1.1628 1.0675 0.9664 0.8586 0.8550 0.8781 1.1127 1.0000 0.9154 0.9373 1.2701 1.2788 1.1418 1.0253 0.9307 0.7837 0.7662 0.7937 1.1225 1.0000 0.8937 1.0705 1.0705 1.0705 1.0705 1.0705 1.0705 1.0156 1.0156 1.0156 1.0156 1.0000 1.0156 1.0105 1.0485 1.0598 1.0294 1.0254 1.0086 0.9439 0.9410 0.9529 1.0157 1.0000 0.9738 1.0110 1.0673 1.0621 1.0270 1.0110 0.9813 0.8909 0.8821 0.9037 1.0183 1.0000 0.9571 1.0165 1.0844 1.0732 1.0233 1.0011 0.9747 0.8342 0.8107 0.8428 1.0066 1.0000 0.9472 1.0112 1.0112 1.0112 1.0112 1.0112 1.0112 0.9920 0.9920 0.9920 0.9920 1.0000 0.9920 1.0342 1.0414 1.0359 1.0265 1.0191 1.0020 0.9520 0.9508 0.9606 0.9918 1.0000 0.9726 1.0339 1.0371 1.0414 1.0047 1.0042 0.9928 0.9004 0.8901 0.9157 0.9887 1.0000 0.9673 1.0225 1.0467 1.0532 1.0083 0.9898 0.9731 0.8374 0.8092 0.8400 0.9905 1.0000 0.9493 1/2 Notes: The table reports the ratio between the actual sampling standard deviation of γb2 and its asymptotic standard error (std(b γ 2 )/[I(θ0 )−1 ). γγ /N ] −1 1/2 b γb 2 )/[I(θ0 )γγ /N] ). std(b γ 2 ) equals the Also reported is the ratio between the median estimated asymptotic standard error and the true one (med(se γ 2 ; this a robust estimate formed by subtracting the 5th quantile from the 95th quantile of the Monte standard deviation of the Monte Carlo estimates of b 2 γ and dividing by 2 × 1.645; med(se b bγ 2 ) denotes the median estimated asymptotic standard error of γb2 based on the conventional Carlo distribution of b variance-covariance estimator. 58 Table 6: Actual coverage of alternative 95 percent asymptotic confidence intervals for γ 2 under weak identification (multivate normal DGP) μ2 /(S − 1) = 5 Panel A: W WH W NW WM S SM LR IS Panel B: W WH W NW WM S SM LR IS S=2 GMM ET EL S=4 QML ORACLE 2SLS GMM ET − 0.918 − − − 0.926 − 0.917 − 0.874 0.908 0.931 0.945 0.929 0.960 0.885 0.817 EL QML ORACLE 2SLS − S=7 0.878 − 0.934 − 0.910 − − − 0.932 0.958 − 0.938 0.942 − − − 0.917 − 0.949 − 0.953 0.965 − 0.926 0.888 − 0.952 − 0.826 − − − 0.865 − − − 0.903 − 0.907 − 0.790 0.849 0.887 0.907 0.910 0.939 0.838 0.746 S=12 0.807 − 0.943 − 0.848 − − − 0.879 0.952 − 0.928 0.902 − − − 0.884 − 0.952 − 0.929 0.964 − 0.925 0.837 − 0.949 − 0.747 − − − Notes: Table reports coverage rates for nominal 95 percent intervals constructed by inverting the various test statistics outlined in the main text (i.e., Pr(γ 2 ∈ CI α (γ 2 , σ e2α (γ 2 ))). Coverage is estimated as the fraction of Monte Carlo repetitions where a test of H0 : γ 2 = γ 20 fails to reject, where γ 20 denotes the DGP value. Reported results are for the case of weak identification (μ2 /(S − 1) = 5). 59 Table 7: Actual coverage of alternative 95 percent asymptotic confidence intervals for γ 2 under strong identification (multivate normal DGP) μ2 /(S − 1) = 20 Panel A: W WH W NW WM S M S LR IS Panel B: W WH W NW WM S SM LR IS S=2 GMM ET 0.968 − − − 0.949 − 0.949 − 0.968 0.968 0.968 0.968 0.947 0.973 0.943 0.917 0.891 − − − 0.909 − 0.917 − 0.889 0.902 0.914 0.920 0.922 0.929 0.902 0.878 EL S=4 QML ORACLE 2SLS 0.968 − 0.968 − 0.968 0.968 0.968 − 0.945 − 0.962 0.964 0.943 − 0.923 − S=7 0.902 − 0.911 − 0.922 0.948 0.927 − 0.916 − 0.928 0.947 0.911 − 0.887 − GMM ET 0.927 − − − − 0.968 − − 0.950 − − 0.929 0.950 − − − 0.934 − − − 0.937 − 0.935 − 0.931 0.939 0.942 0.946 0.936 0.948 0.927 0.907 0.948 − − − − 0.938 − − 0.948 − − 0.923 0.951 − − − 0.856 − − − 0.894 − 0.906 − 0.842 0.867 0.887 0.886 0.902 0.901 0.874 0.843 EL QML ORACLE 2SLS 0.933 − 0.937 − 0.943 0.954 0.945 − 0.934 − 0.940 0.947 0.927 − 0.909 − S=12 0.863 − 0.866 − 0.874 0.948 0.883 − 0.875 − 0.887 0.949 0.870 − 0.850 − 0.939 − − − − 0.950 − − 0.948 − − 0.927 0.947 − − − 0.948 − − − − 0.929 − − 0.951 − − 0.902 0.953 − − − Notes: Table reports coverage rates for nominal 95 percent intervals constructed by inverting the various test statistics outlined in the main text (i.e., Pr(γ 2 ∈ CI α (γ 2 , σ e2α (γ 2 ))). Coverage is estimated as the fraction of Monte Carlo repetitions where a test of H0 : γ 2 = γ 20 fails to reject, where γ 20 denotes the DGP value. Reported results are for the case of strong identification (μ2 /(S − 1) = 20). 60 Table 8: Actual coverage of alternative 95 percent asymptotic confidence intervals for γ 2 under very strong identification (multivate normal DGP) μ2 /(S − 1) = 50 Panel A: W WH WM W NW S M S LR IS Panel B: W WH WM W NW S SM LR IS S=2 GMM ET 0.955 − − − 0.949 − 0.949 − 0.955 0.955 0.955 0.955 0.948 0.957 0.944 0.930 0.897 − − − 0.918 − 0.923 − 0.901 0.909 0.917 0.917 0.923 0.924 0.914 0.897 EL S=4 QML ORACLE 2SLS 0.955 − 0.955 − 0.955 0.955 0.955 − 0.944 − 0.944 0.949 0.944 − 0.938 − S=7 0.914 − 0.916 − 0.917 0.946 0.919 − 0.916 − 0.918 0.947 0.915 − 0.908 − GMM ET 0.948 − − − − 0.955 − − 0.953 − − 0.933 0.953 − − − 0.930 − − − 0.936 − 0.937 − 0.932 0.937 0.941 0.943 0.941 0.947 0.931 0.913 0.953 − − − − 0.939 − − 0.956 − − 0.888 0.953 − − − 0.836 − − − 0.875 − 0.886 − 0.844 0.864 0.880 0.876 0.892 0.887 0.868 0.839 EL QML ORACLE 2SLS 0.937 − 0.941 − 0.943 0.949 0.945 − 0.937 − 0.933 0.939 0.934 − 0.922 − S=12 0.879 − 0.881 − 0.878 0.951 0.889 − 0.877 − 0.889 0.947 0.878 − 0.854 − 0.943 − − − − 0.949 − − 0.947 − − 0.922 0.947 − − − 0.950 − − − − 0.937 − − 0.952 − − 0.886 0.955 − − − Notes: Table reports coverage rates for nominal 95 percent intervals constructed by inverting the various test statistics outlined in the main text (i.e., Pr(γ 2 ∈ CI α (γ 2 , σ e2α (γ 2 ))). Coverage is estimated as the fraction of Monte Carlo repetitions where a test of H0 : γ 2 = γ 20 fails to reject, where γ 20 denotes the DGP value. Reported results are for the case of very strong identification (μ2 /(S − 1) = 50). 61