ML Estimation and Hypothesis Testing
Maximum Likelihood and Hypothesis Testing

The previous discussion has provided some indication of how to use maximum likelihood (ML) techniques to obtain parameter estimates. Let's now talk about some post-estimation issues with respect to hypothesis testing under the maximum likelihood framework:
- Finite sample hypothesis tests
- Asymptotic hypothesis tests

Likelihood Ratio Test (finite sample)

As we discussed earlier, a standard test procedure is based on relative likelihood function values under a null hypothesis versus its value unrestricted:
- Compare the sample likelihood function value, l(·), under the assumption that the null hypothesis is true, l(Ω0), vs. its value with unrestricted parameter choice, l*(Ω).
- The null hypothesis (H0) could reduce the allowable set of parameter values. What does this do to the maximum likelihood function value?
- If the two resulting maximized likelihood function values are close enough → cannot reject H0.

Is this difference in likelihood function values large? Define the likelihood ratio, λ:

    λ = l*(Ω) / l(Ω0)    (unrestricted l(·) over restricted l(·))

- λ is a random variable, since the l(·)'s depend on the yi's, which are RVs.
- What are the possible values of λ? Since l(Ω0) ≤ l*(Ω), λ ≥ 1.

Likelihood Ratio (LR) Principle
- The null hypothesis defining Ω0 is rejected if λ is sufficiently greater than 1.
- Need to establish a critical level of λ, λC, that is unlikely to occur under H0 (e.g., is 1.1 far enough away from 1.0, given that λ is a RV?).
- Reject H0 if the estimated value of λ > λC.
- λ ≈ 1 → the null hypothesis does not significantly reduce the parameter space → H0 not rejected. The result is conditional on the sample.

General LR Test Procedure
- Choose the probability of a Type I error, α (i.e., the test significance level).
- Given α, find the value of λC that satisfies P(λ > λC | H0 is true) = α.
- Evaluate the test statistic λ = l*(Ω)/l(Ω0) based on sample information.
- Reject (fail to reject) the null hypothesis if λ > λC (λ < λC).

LR Test of the Mean of a Normal Distribution (μ), σ² Unknown

With y ~ N(μ, σ²), the LR-based statistic for H0: μ = μ0 is

    LR* = (μ̂ − μ0)² / σ̂u²,  where μ̂ = ȳ and σ̂u² = Σt (yt − μ̂)² / [T(T−1)] is the estimated variance of the mean.

This implies the following test procedures given LR*, assuming a normally distributed RV and testing for the mean:
- F-test:  LR* = (μ̂ − μ0)² / σ̂u²  ~ F(1, T−1)
- t-test:  √LR* = (μ̂ − μ0) / σ̂u  ~ t(T−1)

Note: these are finite sample tests given the linear model. There is an analogous LR test of a hypothesized value of σ².
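To make the procedure concrete, here is a minimal MATLAB sketch of the t-test (and its F-test equivalent) of H0: μ = μ0; the data vector and hypothesized mean are hypothetical placeholders.

    % Minimal sketch: finite-sample t/F test of H0: mu = mu0 for
    % y ~ N(mu, sigma^2) with sigma^2 unknown. Data are illustrative only.
    y   = [2.1; 1.4; 3.0; 2.6; 1.9; 2.2];     % hypothetical sample
    mu0 = 2.0;                                % hypothesized mean
    T   = numel(y);
    mu_hat   = mean(y);                       % estimate of mu
    var_mean = var(y)/T;                      % estimated variance of the mean
    t_stat = (mu_hat - mu0)/sqrt(var_mean);   % ~ t(T-1) under H0
    F_stat = t_stat^2;                        % LR* ~ F(1, T-1) under H0
    reject = abs(t_stat) > tinv(0.975, T-1);  % two-sided test, alpha = 0.05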
Maximum Likelihood and Asymptotic Hypothesis Tests

- The previous tests are based on finite samples. Use asymptotic tests when an appropriate finite sample test statistic is unavailable, for example with models whose functional form is nonlinear in the parameters.
- These tests rely on the Central Limit Theorem and the asymptotic normality of ML estimators.
- Three asymptotic tests are commonly used:
  - Asymptotic Likelihood Ratio Test
  - Wald Test
  - Lagrange Multiplier (Score) Test
- For a review: Greene, pp. 524-534; JHGLL, pp. 105-110; Buse article (on website).

Let's start simple, where we have a single parameter, θ. Assume you want to:
- undertake ML estimation of θ, and
- test the hypothesis H0: c(θ) = 0.

The logic of the three asymptotic test procedures can be obtained from the following figure, which plots, for alternative values of θ:
- the sample log-likelihood function, L(θ);
- the gradient of the sample log-likelihood function, dL(θ)/dθ; and
- the value of the constraint function, c(θ).

[Figure: L(θ), dL(θ)/dθ, and c(θ) plotted against θ, with θ_ML the unrestricted maximizer and θ_ML,R the restricted maximizer. One half of the Likelihood Ratio is the vertical distance L(θ_ML) − L(θ_ML,R); the Wald test is based on the horizontal distance of c(θ_ML) from 0; the Lagrange Multiplier test is based on the slope dL(θ)/dθ at θ_ML,R. Greene, p. 499.]

Summary of asymptotic hypothesis tests of H0: c(θ) = 0
- Asymptotic Likelihood Ratio Test: if c(θ) is valid, then imposing it should not impact L(θ) significantly → base the test on the difference in L(θ)'s across the two points θ_ML and θ_ML,R.
- Wald Test: if c(θ) is valid, then c(θ_ML) should be close to 0 → base the test on θ_ML and c(θ_ML); reject H0 if c(θ_ML) is significantly different from 0.
- Lagrange Multiplier Test: if c(θ) is valid, then θ_ML,R should be near the point that maximizes the likelihood function → the slope of the LLF should be close to 0 → base the test on the slope of the LLF at θ_ML,R.

These three tests are asymptotically equivalent under H0, but they can behave differently in small samples. Small sample properties are typically unknown → the choice among them is typically made on the basis of computational ease:
- the LR test requires calculation of both restricted and unrestricted estimates of the parameter vector;
- the Wald test requires only the unrestricted estimate of the parameter vector;
- the LM test requires only the restricted estimate of the parameter vector.

Notation. Let y1,…,yT be a set of RVs with joint PDF fT(y1,…,yT|Θ), where Θ is a (K×1) vector of unknown parameters with Θ ∈ Ω (the allowable parameter space). The sample likelihood and log-likelihood functions given our data are l(Θ|y1,…,yT) and L(Θ|y1,…,yT) ≡ ln l(Θ|y1,…,yT). As the sample size, T, becomes larger, let's consider testing the J joint hypotheses:

    R(Θ) = [R1(Θ), R2(Θ), …, RJ(Θ)]′ = 0 → Ω0 is the allowable parameter space under the J hypotheses.

Likelihood ratio:

    λ ≡ l*(Ω) / l(Ω0), i.e., l(Θ_ML)/l(Θ0)
    l*(Ω) = Max[l(Θ|y1,…,yT): Θ ∈ Ω]    (unrestricted likelihood function value)
    l(Ω0) = Max[l(Θ|y1,…,yT): Θ ∈ Ω0]   (restricted likelihood function value, assuming the null hypothesis is true)

Asymptotic Likelihood Ratio (LR):

    LR ≡ 2 ln(λ) = 2[L*(Θ) − L(Θ0)], where L(·) ≡ ln l(·)

What is the sign of LR? Since L*(Θ) ≥ L(Θ0), LR ≥ 0. LR ~ χ²(J) asymptotically, where J is the number of joint null hypotheses (restrictions) [Theorem 16.5, Greene, p. 500; p. 105, JHGLL].

[Figure: the log-likelihood function L(Θ), with the unrestricted maximum L(Θ_ML) and the restricted value L(Θ0) under H0: Θ ∈ Ω0; one half of LR is the vertical distance between the two.]

L(·) is evaluated at both Θ_ML and Θ0: Θ_ML generates the unrestricted maximum of L(·), while L(Θ0) is the value obtained under H0. Greene defines the above as −2[L(Θ0) − L*(Θ)]; the result is the same [Buse, p. 153; Greene, pp. 498-500]. As noted above, given H0 true, LR has an approximate χ² distribution with J DF. Reject H0 when LR > χ²c, where χ²c is the predefined critical value of the distribution given J DF and the desired Pr(Type I error).
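As a minimal sketch, the asymptotic LR test needs only the two maximized log-likelihood values (the numbers below are placeholders):

    % Minimal sketch: asymptotic LR test from the restricted and
    % unrestricted maximized log-likelihoods (placeholder values).
    LLF_u = -82.9;                % unrestricted maximized LLF (placeholder)
    LLF_r = -88.4;                % restricted maximized LLF under H0 (placeholder)
    J     = 1;                    % number of joint restrictions
    LR    = 2*(LLF_u - LLF_r);    % LR = 2[L*(Theta) - L(Theta_0)] >= 0
    pval  = 1 - chi2cdf(LR, J);   % reject H0 if pval < alpha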
In MATLAB we can generate the critical χ² value for testing H0:

    critical_value = chi2inv(1 - p_type_1, num_rest)

where p_type_1 is Pr(Type I error) and num_rest is the number of restrictions.

An example of an inappropriate use of the likelihood ratio: using the LR to test one distributional assumption against another (e.g., normal vs. logistic). The parameter spaces, and therefore the likelihood functions, are unrelated. To be appropriate, the restricted model needs to be nested within the original likelihood function → the alternative model must be obtained from the original model via parameter restrictions only, not via a change in functional form as in the above normal vs. logistic example.

To recap the asymptotic Likelihood Ratio (LR):

    LR ≡ 2 ln(λ) = 2[L*(Θ) − L(Θ0)], where L(·) ≡ ln l(·),
    L*(Θ) ≡ unrestricted LLF, L(Θ0) ≡ restricted LLF subject to c(Θ) = 0,

and LR ~ χ²(J) asymptotically, with J the number of joint null hypotheses (restrictions).

A size-corrected asymptotic Likelihood Ratio statistic. In our estimation of the error term variance we often use a correction factor that accounts for the number of parameters used in estimation; this improves the approximation to the sampling distribution of the statistic generated from its limiting χ² distribution. A similar correction factor has been suggested by Mizon (1977, p. 1237) and by Evans and Savin (1982, p. 742) to be applied to the asymptotic LR. These size-correction factors have been applied to the asymptotic LR to improve its small sample properties.

Mizon's (1977) size-corrected LR statistic:

    LR_C = [(T − K − 1 + J/2) / T] × LR

where K = the number of explanatory variables including the intercept term, J = the number of joint hypotheses, and LR = the traditionally calculated log-likelihood ratio statistic, 2[L*(Θ) − L(Θ0)].

Motivation for the asymptotic Wald test. Suppose Θ consists of 1 element, and H0: Θ = Θ0, or Θ − Θ0 = 0. Consider 2 samples that generate different likelihood function estimates but for which the same value, Θ_ML, maximizes both.

[Figure: two sample log-likelihood functions, L0 and L1, both maximized at Θ_ML, with H0: Θ = Θ0. The corresponding distances 0.5·LR0 and 0.5·LR1 differ across the two samples.]

0.5·LR will depend on two factors:
- the distance between Θ_ML and Θ0 (+), and
- the curvature of the LF (+).

The impact of curvature on LR shows the need for the Wald test. Let V(Θ) represent the LF curvature. For a single parameter:

    V(Θ) = −[d²L(Θ)/dΘ²]⁻¹ evaluated at Θ = Θ_ML    (don't forget the "−" sign)

The Wald test is based on the following statistic, given H0: Θ = Θ0 and assuming concavity:

    W = (Θ_ML − Θ0)² · V(Θ|Θ=Θ_ML)⁻¹

W ~ χ²(J) asymptotically. Note that V(·) is evaluated at Θ_ML, the unrestricted value.

The Wald statistic weights the squared distance, (Θ_ML − Θ0)², by the curvature of the LF instead of using the differences in LF values as in the LR test. Two sets of data may produce the same (Θ_ML − Θ0)² value but give different LR values because of curvature. The more curvature, the more likely H0 is not true (e.g., the test statistic is larger): with more curvature, −d²L(Θ)/dΘ² is larger, so V(Θ_ML) = −[d²L(Θ)/dΘ²]⁻¹ is smaller and W is larger. Greene, pp. 500-502 gives an alternative motivation; Buse, pp. 153-154.
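To illustrate the single-parameter Wald statistic, here is a minimal MATLAB sketch that measures the curvature with a numerical second derivative; the log-likelihood (a normal mean with known unit variance, whose ML estimate is the sample mean) and the data are hypothetical.

    % Minimal sketch: single-parameter Wald test,
    % W = (Theta_ML - Theta_0)^2 * V^{-1}, V = -[d2L/dTheta2]^{-1} at Theta_ML.
    y        = [1.3; 0.4; 2.1; 1.7; 0.9];    % illustrative data
    LLF      = @(th) -0.5*sum((y - th).^2);  % hypothetical LLF: N(theta,1) kernel
    theta_ml = mean(y);                      % unrestricted ML estimate
    theta_0  = 0;                            % hypothesized value
    h        = 1e-5;                         % finite-difference step
    d2L = (LLF(theta_ml+h) - 2*LLF(theta_ml) + LLF(theta_ml-h))/h^2;
    V   = -1/d2L;                            % note the "-" sign; here V = 1/T
    W   = (theta_ml - theta_0)^2 / V;        % ~ chi2(1) under H0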
The asymptotic covariance matrix of the ML estimator is based on the information matrix of the estimated parameters. Three estimators are available:
1. If the form of the expected values of the 2nd derivatives of the LLF is known:

    Σ(Θ̂_ML) = [I(Θ̂_ML)]⁻¹ = {−E[∂²L(Θ)/∂Θ∂Θ′]|Θ=Θ̂_ML}⁻¹

2. A 2nd estimator replaces the expectation with the observed Hessian (a measure of curvature):

    Σ̂(Θ̂_ML) = [Î(Θ̂_ML)]⁻¹ = [−∂²L(Θ)/∂Θ∂Θ′|Θ=Θ̂_ML]⁻¹

3. A 3rd (BHHH) estimator uses the sum of squares and cross products of the first derivatives:

    Σ̃(Θ̂_ML) = [Ĩ(Θ̂_ML)]⁻¹ = [Σ_{i=1}^T ĝi ĝi′]⁻¹ = [Ĝ′Ĝ]⁻¹,
    where ĝi = ∂L(yi|Θ)/∂Θ|Θ=Θ̂_ML is (K×1) and Ĝ = [ĝ1, ĝ2, …, ĝT]′.

Extending the Wald test to J simultaneous hypotheses, K parameters, and more complex H0's, let c(Θ) = q represent the J joint hypotheses, where q is the target:

    W = [c(Θ_ML) − q]′ [d(Θ_ML) I(Θ_ML)⁻¹ d(Θ_ML)′]⁻¹ [c(Θ_ML) − q]
        (1×J)          (J×K)   (K×K)      (K×J)        (J×1)

    d(Θ_ML) ≡ ∂c(Θ)/∂Θ′|Θ=Θ_ML;  W ~ χ²(J) asymptotically.

Note that c(·), d(·) and I(·) are evaluated at Θ_ML, the unrestricted value, and any of the 3 information matrix estimators above can be used. When the cj(Θ) are of the form Θj = Θj0, j = 1,…,K, then d(·) = IK, the (K×K) identity matrix, and in the single-parameter case W = (Θ_ML − Θ0)² I(Θ)|Θ=Θ_ML.

In summary (Theorem 16.6, Greene, p. 501): with the set of hypotheses represented by H0: c(θ) = q, the Wald statistic is

    W = [c(θ_ML) − q]′ {Asy.Var[c(θ_ML) − q]}⁻¹ [c(θ_ML) − q],
    Asy.Var[c(θ_ML) − q] = d(θ_ML) I(θ_ML)⁻¹ d(θ_ML)′.

The Wald test is based on measuring the extent to which the unrestricted estimates fail to satisfy the hypothesized restrictions: large values of W arise when large deviations of c(Θ) away from q are weighted by a matrix involving the curvature of the log-likelihood function. A MATLAB sketch follows the list of shortcomings below.

Shortcomings of the Wald test:
- It is a pure significance test against the null hypothesis, not necessarily against a specific alternative hypothesis.
- The test statistic is not invariant to the formulation of the restrictions. Consider a test that a function θ ≡ β/(1−γ) equals a specific value q. There are two ways to evaluate this expression:
  - determine the variance of the nonlinear function of β and γ and test θ − q = 0 directly, or
  - test β − q(1−γ) = 0, which is equivalent but is a linear restriction based on the two parameters, β and γ.
  The Wald statistics for these two tests could be different and might lead to different inferences.
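A minimal MATLAB sketch of the J-restriction Wald statistic; the unrestricted estimates, covariance matrix, and restriction handles below are placeholders (the single restriction c(Θ) = ρ − 1 anticipates the income/education example developed later in these notes).

    % Minimal sketch: Wald statistic W = (c-q)'[d Sigma d']^{-1}(c-q)
    % for H0: c(Theta) = q. All inputs are placeholders.
    theta_ml = [2.0; 3.1517];               % unrestricted estimates (beta placeholder; rho)
    Sigma    = [ 5.4940 -1.6520; ...
                -1.6520  0.6309];           % estimated covariance of theta_ml
    c_fun = @(th) th(2) - 1;                % c(Theta) - q, here rho - 1
    d_fun = @(th) [0 1];                    % Jacobian d(Theta) = dc/dTheta' (J x K)
    cval  = c_fun(theta_ml);                % (J x 1)
    D     = d_fun(theta_ml);
    W     = cval' / (D*Sigma*D') * cval;    % here (3.1517-1)^2/0.6309, approx. 7.34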
Summary of the Lagrange Multiplier (Score) Test. The test is based on the curvature of the log-likelihood function L(Θ), but this time at the restricted log-likelihood function value. At the unrestricted maximum,

    S(Θ_ML) ≡ dL/dΘ|Θ=Θ_ML = 0,

where S(Θ) ≡ dL/dΘ is the score of the likelihood function.

[Figure: the score function S(Θ) ≡ dL/dΘ for two samples A and B with log-likelihood functions LA and LB, under H0: Θ = Θ0. Both samples have the same gradient S(Θ0) at the hypothesized value Θ0, and S(Θ) = 0 at each sample's maximum; Θ0 is closer to the optimum under sample B, which has the greater curvature of L(·) when evaluated at Θ0.]

How much does S(Θ) depart from 0 when evaluated at the hypothesized value?
- Weight the squared slope (to get rid of the negative sign) by the curvature.
- The greater the curvature, the closer Θ0 will be to the maximum value.
- Weight by V(Θ) → a smaller test statistic the more curvature, in contrast to the Wald test, which uses V(Θ)⁻¹. For a single parameter:

    V(Θ) = −[d²L(Θ)/dΘ²]⁻¹ evaluated at Θ = Θ0

- Small values of the test statistic, LM, will be generated if the value of L(Θ0) is close to the maximum value, L(Θ_ML), e.g., a slope close to 0.
- When comparing samples A and B in the figure: sample B → a smaller test statistic, because Θ0 is nearer the maximum of its log-likelihood (e.g., S(Θ0) is closer to zero).

Derivation. Suppose we maximize the log-likelihood subject to the set of constraints c(Θ) − q = 0, where λ is the set of Lagrange multipliers associated with the J constraints (hypotheses):

    L*(Θ, λ) = L(Θ) + λ′[c(Θ) − q]

The solution to the constrained maximization problem must satisfy the following two sets of FOCs:

    (1) ∂L*/∂Θ = ∂L(Θ)/∂Θ + d(Θ)′λ = 0,  where d(Θ) ≡ ∂c(Θ)/∂Θ′
    (2) ∂L*/∂λ = c(Θ) − q = 0

If the restrictions are valid, then imposing them will not lead to a significant difference in the maximized value of the LF, i.e., L*(Θ)|Θ=Θ0 ≈ L(Θ)|Θ=Θ0 → the second term in the 1st first-order condition, d(Θ)′λ, will be small. Specifically, λ will be small, given that d(Θ) is the derivative of c(Θ), whose value will probably not be small. We could directly test whether λ = 0: d(Θ)′λ should be close to zero if the null hypothesis is true.

Alternatively, at the restricted maximum, from (1) the derivatives of the LF are:

    gR ≡ ∂L/∂Θ|Θ=Θ_ML,R = −d(Θ0)′λ

If the hypotheses are in fact true, then gR = 0 → the derivatives of the unrestricted LLF, L(Θ), evaluated at Θ0 ≈ 0:

    ∂L/∂Θ|Θ=Θ0 ≟ 0    (derivative of the original LLF, but at the restricted values)

This implies we need to determine whether the K slopes, evaluated at the restricted values, are jointly zero. The variance of the gradient vector of L(Θ) is the information matrix, I(Θ), which has been used in the evaluation of the parameter covariance matrix (this is property P.3 of L(Θ) stated in my introductory ML comments; Greene, p. 488). We need to evaluate the I(Θ) matrix at the restricted parameter vector: I(Θ)|Θ=Θ0 = the negative of the expected value of the LLF Hessian matrix at Θ = Θ0.

For a single parameter:

    LM = S(Θ0)² I(Θ0)⁻¹,  where S(Θ0) = dL/dΘ|Θ=Θ0 and I(Θ0) = −E[d²L/dΘ²]|Θ=Θ0  (Θ0 = restricted values)

Extending to multiple parameters:

    LM = S(Θ0)′ I(Θ0)⁻¹ S(Θ0),   (1×K)(K×K)(K×1),   LM ~ χ²(J) asymptotically

with S(Θ) ≡ dL/dΘ and Var[S(Θ_ML)] = −E(H_ML) = I(Θ) [Theorem 16.2, Greene, p. 488]. Theorem 16.7, Greene, p. 503 provides the LM statistic; Buse, pp. 154-155. A MATLAB sketch of the LM statistic follows below.

LR, W, and LM differ in the type of information required:
- LR requires both restricted and unrestricted parameter estimates (e.g., the LF is evaluated twice);
- W requires only unrestricted estimates;
- LM requires only restricted estimates.

If the log-likelihood is quadratic with respect to Θ, the 3 tests result in the same numerical values for large samples. All test statistics are distributed asymptotically χ² with J d.f. (the number of joint hypotheses). Theoretically, in finite samples, W ≥ LR ≥ LM → LM is more conservative in the sense of rejecting H0 less often.
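A minimal MATLAB sketch of the LM statistic; the restricted-estimate score vector and BHHH information matrix reproduced here are taken from the income/education example developed below.

    % Minimal sketch: LM = S(Theta_0)' I(Theta_0)^{-1} S(Theta_0), using the
    % score and BHHH information at the restricted estimates (example values).
    S0 = [0.0000; 7.9162];                  % score of L at restricted estimates
    I0 = [0.009944  0.2676; ...
          0.2676   11.1972];                % BHHH estimate of I(Theta_0)
    LM   = S0' / I0 * S0;                   % approx. 15.687; ~ chi2(J) under H0
    pval = 1 - chi2cdf(LM, 1);              % J = 1 restriction here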
Example: Income and Educational Attainment (Greene, p. 531)

Let's revisit the previous example where we examined the relationship between income and educational attainment. Previously we examined the following conditional exponential distribution:

    f(Inc_i | Edu_i, β) = [1/(β + Edu_i)] e^(−Inc_i/(β + Edu_i))

Greene, p. 531 extends the exponential density to a more general gamma distribution in which the exponential is nested. To save on notation, define β_t ≡ β + Edu_t. The general gamma distribution can be represented via the following:

    f(Inc_t | Edu_t, β, ρ) = [Inc_t^(ρ−1) / (β_t^ρ Γ(ρ))] e^(−Inc_t/β_t),
    Γ(ρ) = ∫₀^∞ μ^(ρ−1) e^(−μ) dμ

where ρ is an additional parameter and Γ(ρ) is the gamma function.

The total sample log-likelihood for the general gamma distribution is:

    L(β, ρ) = −ρ Σ_{t=1}^T ln(β + Edu_t) − T ln Γ(ρ) + (ρ−1) Σ_{t=1}^T ln Inc_t − Σ_{t=1}^T Inc_t/(β + Edu_t)

As before, the total sample log-likelihood for the exponential distribution is:

    L(β) = −Σ_{t=1}^T ln(β + Edu_t) − Σ_{t=1}^T Inc_t/(β + Edu_t)

Given the above sample log-likelihood, we have the following derivatives (Greene, p. 531):

    ∂L/∂β = −ρ Σ_{t=1}^T (1/β_t) + Σ_{t=1}^T Inc_t/β_t²
    ∂L/∂ρ = −Σ_{t=1}^T ln β_t − T ψ(ρ) + Σ_{t=1}^T ln Inc_t,  where ψ(ρ) ≡ d ln Γ(ρ)/dρ
    ∂²L/∂β² = ρ Σ_{t=1}^T (1/β_t²) − 2 Σ_{t=1}^T Inc_t/β_t³
    ∂²L/∂ρ² = −T ψ′(ρ),  where ψ′(ρ) ≡ d² ln Γ(ρ)/dρ²
    ∂²L/∂β∂ρ = −Σ_{t=1}^T (1/β_t)

Note: I do not use these in estimation, but in post-estimation evaluation of Hessians and hypothesis testing.

When ρ = 1, Γ(ρ) = 1 and the gamma density collapses to the exponential → the exponential distribution is nested within the above general distribution. We can therefore test the null hypothesis that the distribution of income with respect to educational attainment is exponential versus gamma:

    H0: ρ = 1 (exponential)    H1: ρ ≠ 1 (gamma)

MATLAB code (not reproduced in this transcription; see the sketch below) is used to estimate the β and ρ parameters:
- Unrestricted parameter estimates are implicitly obtained by setting the first derivatives equal to 0.
- Restricted parameter estimates are obtained by setting ρ = 1 and solving ∂L(β|ρ=1)/∂β = 0.
- Three estimates of the parameter covariance matrix are obtained:
  - NR (Hessian-based): Σ_NR = [−∂²L/∂Θ∂Θ′]⁻¹
  - GN (expected Hessian-based): Σ_GN = [−E(∂²L/∂Θ∂Θ′)]⁻¹, where E(Inc_i|Edu_i) = ρ(β + Edu_i)
  - BHHH (sum of squares and cross products): Σ_BH = [Σ_i (∂L_i/∂Θ)(∂L_i/∂Θ)′]⁻¹

With the parameters ordered as (β, ρ), the estimated information and covariance matrices are:

    I(Θ)|Θ=Θ_ML = −∂²L/∂Θ∂Θ′|Θ=Θ_ML = [0.85628 2.2423; 2.2423 7.4569]
    [−∂²L/∂Θ∂Θ′|Θ=Θ_ML]⁻¹ = [5.4940 −1.6520; −1.6520 0.6309]
    −∂²L/∂Θ∂Θ′|Θ=Θ_R = [0.021659 0.66885; 0.66885 32.8987]

where Θ_R denotes the restricted estimates (ρ = 1). The (ρ,ρ) element of the Hessian-based covariance matrix, 0.6309, is the estimated asymptotic variance of ρ̂ used in the tests below.
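The estimation code itself is not reproduced in the transcription. Below is a minimal MATLAB sketch, with assumed function and variable names, of the gamma log-likelihood and its analytic gradient from the formulas above, written in the negative-of-LLF form that a minimizer such as fminunc expects; restricted (ρ = 1) estimation would fix rho and minimize over beta alone.

    % Minimal sketch (assumed names): negative gamma LLF and gradient for
    % theta = [beta; rho], with data vectors Inc and Edu. Mirrors L(beta,rho) above.
    function [negL, negG] = gamma_negllf(theta, Inc, Edu)
        beta = theta(1);  rho = theta(2);
        bt   = beta + Edu;                                  % beta_t = beta + Edu_t
        T    = numel(Inc);
        L    = -rho*sum(log(bt)) - T*gammaln(rho) ...
               + (rho-1)*sum(log(Inc)) - sum(Inc./bt);      % gamma sample LLF
        dLdb = -rho*sum(1./bt) + sum(Inc./bt.^2);           % dL/dbeta
        dLdr = -sum(log(bt)) - T*psi(rho) + sum(log(Inc));  % dL/drho; psi = digamma
        negL = -L;
        negG = -[dLdb; dLdr];                               % negate for minimization
    end
    % Example use (assumed data): theta0 = [1; 1];
    % opts = optimoptions('fminunc','SpecifyObjectiveGradient',true);
    % theta_ml = fminunc(@(th) gamma_negllf(th,Inc,Edu), theta0, opts);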
Testing H0: ρ = 1 (exponential) vs. H1: ρ ≠ 1 (gamma)

Confidence interval: the 95% CI based on the unrestricted parameter results is

    3.1517 ± 1.96·√0.6309 → (1.5942, 4.7085)

which excludes ρ = 1.

Likelihood ratio test: with the unrestricted maximized LLF of −82.9144 and the restricted LLF of −88.4377,

    LR = 2[−82.9144 − (−88.4377)] = 11.046

With 1 DF, the critical value is 3.842, so H0 is rejected.

Wald test: remember, the Wald test is based on the unrestricted parameter estimates and the associated covariance matrix. Our null hypothesis is c(Θ) − q = ρ − 1 = 0, so c(ρ̂) = ρ̂ − 1 and Est.Asy.Var[c(ρ̂) − q] = Est.Asy.Var(ρ̂) = 0.6310. The Wald statistic is:

    W = (3.1517 − 1)² / 0.6310 = 7.3379

Lagrange Multiplier test: based on the restricted parameter and parameter covariance estimates. The BHHH estimate of the covariance matrix is typically used (especially when we restrict one of the coefficients to be at a certain value). Note that the information matrix is still (2×2) even though there is only 1 free parameter:

    LM = S(Θ0)′ I(Θ0)⁻¹ S(Θ0)
    S(Θ0) = [0.0000; 7.9162],  I_LM = [0.009944 0.2676; 0.2676 11.1972]
    LM = [0.0000 7.9162] × [0.009944 0.2676; 0.2676 11.1972]⁻¹ × [0.0000; 7.9162] = 15.6870

If we had instead used a Hessian-based estimate of the covariance matrix, LM = 5.1182.

In summary, the LR, Wald, and LM test statistics have different values in small samples, even though they test the same null hypothesis against the same critical value of 3.842. It is possible to reach different conclusions depending on which one is used: for example, the Hessian-based LM test undertaken with α = 0.01 instead of 0.05 (χ²c = 6.635) would fail to reject H0. In finite samples, such differences are expected. There are no clear rules on which test to rely on when such a result exists; it may suggest that more data are needed.
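For reference, the critical values quoted above can be reproduced with chi2inv, making the α = 0.05 vs. α = 0.01 reversal for the Hessian-based LM statistic explicit:

    % Critical chi-square values used above (J = 1 restriction).
    crit_05 = chi2inv(0.95, 1);   % 3.8415 (the slides' 3.842)
    crit_01 = chi2inv(0.99, 1);   % 6.6349 (the slides' 6.635)
    % Hessian-based LM = 5.1182: rejects H0 at alpha = 0.05 but not at 0.01.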