ML Estimation and Hypothesis Testing
Maximum Likelihood and Hypothesis Testing

The previous discussion has provided some indication of how to use maximum likelihood (ML) techniques to obtain parameter estimates. Let's now talk about some post-estimation issues with respect to hypothesis testing under the maximum likelihood framework:
- Finite sample hypothesis tests
- Asymptotic hypothesis tests

Likelihood Ratio Test (finite sample)

As we discussed earlier, a standard test procedure is based on relative likelihood function values under a null hypothesis versus its value unrestricted:
- Compare the sample likelihood function value, l(·), under the assumption that the null hypothesis is true, l(Ω0), vs. its value with unrestricted parameter choice, l*(Ω).
- The null hypothesis (H0) could reduce the allowable set of parameter values. What does this do to the maximum likelihood function value?
- If the two resulting maximized likelihood function values are close enough → cannot reject H0.

Is this difference in likelihood function values large? Define the likelihood ratio, λ:

    λ = l*(Ω) / l(Ω0)    (unrestricted l(·) over restricted l(·))

- λ is a random variable, since the l(·)'s depend on the yi's, which are RVs.
- What are the possible values of λ? Since l(Ω0) ≤ l*(Ω), λ ≥ 1.

Likelihood Ratio (LR) Principle
- The null hypothesis defining Ω0 is rejected if λ is sufficiently greater than 1.
- Need to establish a critical level of λ, λC, that is unlikely to occur under H0 (e.g., is 1.1 far enough away from 1.0, given that λ is a RV?).
- Reject H0 if the estimated value of λ > λC.
- λ ≈ 1 → the null hypothesis does not significantly reduce the parameter space → H0 not rejected. The result is conditional on the sample.

General LR Test Procedure
- Choose the probability of a Type I error, α (i.e., the test significance level).
- Given α, find the value of λC that satisfies P(λ > λC | H0 is true) = α.
- Evaluate the test statistic λ = l*(Ω)/l(Ω0) based on sample information.
- Reject (fail to reject) the null hypothesis if λ > λC (λ < λC).

LR Test of the Mean of a Normal Distribution (μ), σ² Unknown

With y ~ N(μ, σ²), the LR-based statistic for H0: μ = μ0 is

    LR* = (μ̂ − μ0)² / σ̂u²,  where μ̂ = ȳ and σ̂u² = Σt (yt − μ̂)² / [T(T−1)] is the estimated variance of the mean.

This implies the following test procedures given LR*, assuming a normally distributed RV and testing for the mean:
- F-test:  LR* = (μ̂ − μ0)² / σ̂u²  ~ F(1, T−1)
- t-test:  √LR* = (μ̂ − μ0) / σ̂u  ~ t(T−1)

Note: these are finite sample tests given the linear model. There is an analogous LR test of a hypothesized value of σ².
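To make the procedure concrete, here is a minimal MATLAB sketch of the t-test (and its F-test equivalent) of H0: μ = μ0; the data vector and hypothesized mean are hypothetical placeholders.

    % Minimal sketch: finite-sample t/F test of H0: mu = mu0 for
    % y ~ N(mu, sigma^2) with sigma^2 unknown. Data are illustrative only.
    y   = [2.1; 1.4; 3.0; 2.6; 1.9; 2.2];     % hypothetical sample
    mu0 = 2.0;                                % hypothesized mean
    T   = numel(y);
    mu_hat   = mean(y);                       % estimate of mu
    var_mean = var(y)/T;                      % estimated variance of the mean
    t_stat = (mu_hat - mu0)/sqrt(var_mean);   % ~ t(T-1) under H0
    F_stat = t_stat^2;                        % LR* ~ F(1, T-1) under H0
    reject = abs(t_stat) > tinv(0.975, T-1);  % two-sided test, alpha = 0.05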
Maximum Likelihood and Asymptotic Hypothesis Tests

- The previous tests are based on finite samples. Use asymptotic tests when an appropriate finite sample test statistic is unavailable, for example with models whose functional form is nonlinear in the parameters.
- These tests rely on the Central Limit Theorem and the asymptotic normality of ML estimators.
- Three asymptotic tests are commonly used:
  - Asymptotic Likelihood Ratio Test
  - Wald Test
  - Lagrange Multiplier (Score) Test
- For a review: Greene, pp. 524-534; JHGLL, pp. 105-110; Buse article (on website).

Let's start simple, where we have a single parameter, θ. Assume you want to:
- undertake ML estimation of θ, and
- test the hypothesis H0: c(θ) = 0.

The logic of the three asymptotic test procedures can be obtained from the following figure, which plots, for alternative values of θ:
- the sample log-likelihood function, L(θ);
- the gradient of the sample log-likelihood function, dL(θ)/dθ; and
- the value of the constraint function, c(θ).

[Figure: L(θ), dL(θ)/dθ, and c(θ) plotted against θ, with θ_ML the unrestricted maximizer and θ_ML,R the restricted maximizer. One half of the Likelihood Ratio is the vertical distance L(θ_ML) − L(θ_ML,R); the Wald test is based on the horizontal distance of c(θ_ML) from 0; the Lagrange Multiplier test is based on the slope dL(θ)/dθ at θ_ML,R. Greene, p. 499.]

Summary of asymptotic hypothesis tests of H0: c(θ) = 0
- Asymptotic Likelihood Ratio Test: if c(θ) is valid, then imposing it should not impact L(θ) significantly → base the test on the difference in L(θ)'s across the two points θ_ML and θ_ML,R.
- Wald Test: if c(θ) is valid, then c(θ_ML) should be close to 0 → base the test on θ_ML and c(θ_ML); reject H0 if c(θ_ML) is significantly different from 0.
- Lagrange Multiplier Test: if c(θ) is valid, then θ_ML,R should be near the point that maximizes the likelihood function → the slope of the LLF should be close to 0 → base the test on the slope of the LLF at θ_ML,R.

These three tests are asymptotically equivalent under H0, but they can behave differently in small samples. Small sample properties are typically unknown → the choice among them is typically made on the basis of computational ease:
- the LR test requires calculation of both restricted and unrestricted estimates of the parameter vector;
- the Wald test requires only the unrestricted estimate of the parameter vector;
- the LM test requires only the restricted estimate of the parameter vector.

Notation. Let y1,…,yT be a set of RVs with joint PDF fT(y1,…,yT|Θ), where Θ is a (K×1) vector of unknown parameters with Θ ∈ Ω (the allowable parameter space). The sample likelihood and log-likelihood functions given our data are l(Θ|y1,…,yT) and L(Θ|y1,…,yT) ≡ ln l(Θ|y1,…,yT). As the sample size, T, becomes larger, let's consider testing the J joint hypotheses:

    R(Θ) = [R1(Θ), R2(Θ), …, RJ(Θ)]′ = 0 → Ω0 is the allowable parameter space under the J hypotheses.

Likelihood ratio:

    λ ≡ l*(Ω) / l(Ω0), i.e., l(Θ_ML)/l(Θ0)
    l*(Ω) = Max[l(Θ|y1,…,yT): Θ ∈ Ω]    (unrestricted likelihood function value)
    l(Ω0) = Max[l(Θ|y1,…,yT): Θ ∈ Ω0]   (restricted likelihood function value, assuming the null hypothesis is true)

Asymptotic Likelihood Ratio (LR):

    LR ≡ 2 ln(λ) = 2[L*(Θ) − L(Θ0)], where L(·) ≡ ln l(·)

What is the sign of LR? Since L*(Θ) ≥ L(Θ0), LR ≥ 0. LR ~ χ²(J) asymptotically, where J is the number of joint null hypotheses (restrictions) [Theorem 16.5, Greene, p. 500; p. 105, JHGLL].

[Figure: the log-likelihood function L(Θ), with the unrestricted maximum L(Θ_ML) and the restricted value L(Θ0) under H0: Θ ∈ Ω0; one half of LR is the vertical distance between the two.]

L(·) is evaluated at both Θ_ML and Θ0: Θ_ML generates the unrestricted maximum of L(·), while L(Θ0) is the value obtained under H0. Greene defines the above as −2[L(Θ0) − L*(Θ)]; the result is the same [Buse, p. 153; Greene, pp. 498-500]. As noted above, given H0 true, LR has an approximate χ² distribution with J DF. Reject H0 when LR > χ²c, where χ²c is the predefined critical value of the distribution given J DF and the desired Pr(Type I error).
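As a minimal sketch, the asymptotic LR test needs only the two maximized log-likelihood values (the numbers below are placeholders):

    % Minimal sketch: asymptotic LR test from the restricted and
    % unrestricted maximized log-likelihoods (placeholder values).
    LLF_u = -82.9;                % unrestricted maximized LLF (placeholder)
    LLF_r = -88.4;                % restricted maximized LLF under H0 (placeholder)
    J     = 1;                    % number of joint restrictions
    LR    = 2*(LLF_u - LLF_r);    % LR = 2[L*(Theta) - L(Theta_0)] >= 0
    pval  = 1 - chi2cdf(LR, J);   % reject H0 if pval < alpha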
In MATLAB we can generate the critical χ² value for testing H0:

    critical_value = chi2inv(1 - p_type_1, num_rest)

where p_type_1 is Pr(Type I error) and num_rest is the number of restrictions.

An example of an inappropriate use of the likelihood ratio: using the LR to test one distributional assumption against another (e.g., normal vs. logistic). The parameter spaces, and therefore the likelihood functions, are unrelated. To be appropriate, the restricted model needs to be nested within the original likelihood function → the alternative model must be obtained from the original model via parameter restrictions only, not via a change in functional form as in the above normal vs. logistic example.

To recap the asymptotic Likelihood Ratio (LR):

    LR ≡ 2 ln(λ) = 2[L*(Θ) − L(Θ0)], where L(·) ≡ ln l(·),
    L*(Θ) ≡ unrestricted LLF, L(Θ0) ≡ restricted LLF subject to c(Θ) = 0,

and LR ~ χ²(J) asymptotically, with J the number of joint null hypotheses (restrictions).

A size-corrected asymptotic Likelihood Ratio statistic. In our estimation of the error term variance we often use a correction factor that accounts for the number of parameters used in estimation; this improves the approximation to the sampling distribution of the statistic generated from its limiting χ² distribution. A similar correction factor has been suggested by Mizon (1977, p. 1237) and by Evans and Savin (1982, p. 742) to be applied to the asymptotic LR. These size-correction factors have been applied to the asymptotic LR to improve its small sample properties.

Mizon's (1977) size-corrected LR statistic:

    LR_C = [(T − K − 1 + J/2) / T] × LR

where K = the number of explanatory variables including the intercept term, J = the number of joint hypotheses, and LR = the traditionally calculated log-likelihood ratio statistic, 2[L*(Θ) − L(Θ0)].

Motivation for the asymptotic Wald test. Suppose Θ consists of 1 element, and H0: Θ = Θ0, or Θ − Θ0 = 0. Consider 2 samples that generate different likelihood function estimates but for which the same value, Θ_ML, maximizes both.

[Figure: two sample log-likelihood functions, L0 and L1, both maximized at Θ_ML, with H0: Θ = Θ0. The corresponding distances 0.5·LR0 and 0.5·LR1 differ across the two samples.]

0.5·LR will depend on two factors:
- the distance between Θ_ML and Θ0 (+), and
- the curvature of the LF (+).

The impact of curvature on LR shows the need for the Wald test. Let V(Θ) represent the LF curvature. For a single parameter:

    V(Θ) = −[d²L(Θ)/dΘ²]⁻¹ evaluated at Θ = Θ_ML    (don't forget the "−" sign)

The Wald test is based on the following statistic, given H0: Θ = Θ0 and assuming concavity:

    W = (Θ_ML − Θ0)² · V(Θ|Θ=Θ_ML)⁻¹

W ~ χ²(J) asymptotically. Note that V(·) is evaluated at Θ_ML, the unrestricted value.

The Wald statistic weights the squared distance, (Θ_ML − Θ0)², by the curvature of the LF instead of using the differences in LF values as in the LR test. Two sets of data may produce the same (Θ_ML − Θ0)² value but give different LR values because of curvature. The more curvature, the more likely H0 is not true (e.g., the test statistic is larger): with more curvature, −d²L(Θ)/dΘ² is larger, so V(Θ_ML) = −[d²L(Θ)/dΘ²]⁻¹ is smaller and W is larger. Greene, pp. 500-502 gives an alternative motivation; Buse, pp. 153-154.
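To illustrate the single-parameter Wald statistic, here is a minimal MATLAB sketch that measures the curvature with a numerical second derivative; the log-likelihood (a normal mean with known unit variance, whose ML estimate is the sample mean) and the data are hypothetical.

    % Minimal sketch: single-parameter Wald test,
    % W = (Theta_ML - Theta_0)^2 * V^{-1}, V = -[d2L/dTheta2]^{-1} at Theta_ML.
    y        = [1.3; 0.4; 2.1; 1.7; 0.9];    % illustrative data
    LLF      = @(th) -0.5*sum((y - th).^2);  % hypothetical LLF: N(theta,1) kernel
    theta_ml = mean(y);                      % unrestricted ML estimate
    theta_0  = 0;                            % hypothesized value
    h        = 1e-5;                         % finite-difference step
    d2L = (LLF(theta_ml+h) - 2*LLF(theta_ml) + LLF(theta_ml-h))/h^2;
    V   = -1/d2L;                            % note the "-" sign; here V = 1/T
    W   = (theta_ml - theta_0)^2 / V;        % ~ chi2(1) under H0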
The asymptotic covariance matrix of the ML estimator is based on the information matrix of the estimated parameters. Three estimators are available:
1. If the form of the expected values of the 2nd derivatives of the LLF is known:

    Σ(Θ̂_ML) = [I(Θ̂_ML)]⁻¹ = {−E[∂²L(Θ)/∂Θ∂Θ′]|Θ=Θ̂_ML}⁻¹

2. A 2nd estimator replaces the expectation with the observed Hessian (a measure of curvature):

    Σ̂(Θ̂_ML) = [Î(Θ̂_ML)]⁻¹ = [−∂²L(Θ)/∂Θ∂Θ′|Θ=Θ̂_ML]⁻¹

3. A 3rd (BHHH) estimator uses the sum of squares and cross products of the first derivatives:

    Σ̃(Θ̂_ML) = [Ĩ(Θ̂_ML)]⁻¹ = [Σ_{i=1}^T ĝi ĝi′]⁻¹ = [Ĝ′Ĝ]⁻¹,
    where ĝi = ∂L(yi|Θ)/∂Θ|Θ=Θ̂_ML is (K×1) and Ĝ = [ĝ1, ĝ2, …, ĝT]′.

Extending the Wald test to J simultaneous hypotheses, K parameters, and more complex H0's, let c(Θ) = q represent the J joint hypotheses, where q is the target:

    W = [c(Θ_ML) − q]′ [d(Θ_ML) I(Θ_ML)⁻¹ d(Θ_ML)′]⁻¹ [c(Θ_ML) − q]
        (1×J)          (J×K)   (K×K)      (K×J)        (J×1)

    d(Θ_ML) ≡ ∂c(Θ)/∂Θ′|Θ=Θ_ML;  W ~ χ²(J) asymptotically.

Note that c(·), d(·) and I(·) are evaluated at Θ_ML, the unrestricted value, and any of the 3 information matrix estimators above can be used. When the cj(Θ) are of the form Θj = Θj0, j = 1,…,K, then d(·) = IK, the (K×K) identity matrix, and in the single-parameter case W = (Θ_ML − Θ0)² I(Θ)|Θ=Θ_ML.

In summary (Theorem 16.6, Greene, p. 501): with the set of hypotheses represented by H0: c(θ) = q, the Wald statistic is

    W = [c(θ_ML) − q]′ {Asy.Var[c(θ_ML) − q]}⁻¹ [c(θ_ML) − q],
    Asy.Var[c(θ_ML) − q] = d(θ_ML) I(θ_ML)⁻¹ d(θ_ML)′.

The Wald test is based on measuring the extent to which the unrestricted estimates fail to satisfy the hypothesized restrictions: large values of W arise when large deviations of c(Θ) away from q are weighted by a matrix involving the curvature of the log-likelihood function. A MATLAB sketch follows the list of shortcomings below.

Shortcomings of the Wald test:
- It is a pure significance test against the null hypothesis, not necessarily against a specific alternative hypothesis.
- The test statistic is not invariant to the formulation of the restrictions. Consider a test that a function θ ≡ β/(1−γ) equals a specific value q. There are two ways to evaluate this expression:
  - determine the variance of the nonlinear function of β and γ and test θ − q = 0 directly, or
  - test β − q(1−γ) = 0, which is equivalent but is a linear restriction based on the two parameters, β and γ.
  The Wald statistics for these two tests could be different and might lead to different inferences.
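A minimal MATLAB sketch of the J-restriction Wald statistic; the unrestricted estimates, covariance matrix, and restriction handles below are placeholders (the single restriction c(Θ) = ρ − 1 anticipates the income/education example developed later in these notes).

    % Minimal sketch: Wald statistic W = (c-q)'[d Sigma d']^{-1}(c-q)
    % for H0: c(Theta) = q. All inputs are placeholders.
    theta_ml = [2.0; 3.1517];               % unrestricted estimates (beta placeholder; rho)
    Sigma    = [ 5.4940 -1.6520; ...
                -1.6520  0.6309];           % estimated covariance of theta_ml
    c_fun = @(th) th(2) - 1;                % c(Theta) - q, here rho - 1
    d_fun = @(th) [0 1];                    % Jacobian d(Theta) = dc/dTheta' (J x K)
    cval  = c_fun(theta_ml);                % (J x 1)
    D     = d_fun(theta_ml);
    W     = cval' / (D*Sigma*D') * cval;    % here (3.1517-1)^2/0.6309, approx. 7.34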
Summary of the Lagrange Multiplier (Score) Test. The test is based on the curvature of the log-likelihood function L(Θ), but this time at the restricted log-likelihood function value. At the unrestricted maximum,

    S(Θ_ML) ≡ dL/dΘ|Θ=Θ_ML = 0,

where S(Θ) ≡ dL/dΘ is the score of the likelihood function.

[Figure: the score function S(Θ) ≡ dL/dΘ for two samples A and B with log-likelihood functions LA and LB, under H0: Θ = Θ0. Both samples have the same gradient S(Θ0) at the hypothesized value Θ0, and S(Θ) = 0 at each sample's maximum; Θ0 is closer to the optimum under sample B, which has the greater curvature of L(·) when evaluated at Θ0.]

How much does S(Θ) depart from 0 when evaluated at the hypothesized value?
- Weight the squared slope (to get rid of the negative sign) by the curvature.
- The greater the curvature, the closer Θ0 will be to the maximum value.
- Weight by V(Θ) → a smaller test statistic the more curvature, in contrast to the Wald test, which uses V(Θ)⁻¹. For a single parameter:

    V(Θ) = −[d²L(Θ)/dΘ²]⁻¹ evaluated at Θ = Θ0

- Small values of the test statistic, LM, will be generated if the value of L(Θ0) is close to the maximum value, L(Θ_ML), e.g., a slope close to 0.
- When comparing samples A and B in the figure: sample B → a smaller test statistic, because Θ0 is nearer the maximum of its log-likelihood (e.g., S(Θ0) is closer to zero).

Derivation. Suppose we maximize the log-likelihood subject to the set of constraints c(Θ) − q = 0, where λ is the set of Lagrange multipliers associated with the J constraints (hypotheses):

    L*(Θ, λ) = L(Θ) + λ′[c(Θ) − q]

The solution to the constrained maximization problem must satisfy the following two sets of FOCs:

    (1) ∂L*/∂Θ = ∂L(Θ)/∂Θ + d(Θ)′λ = 0,  where d(Θ) ≡ ∂c(Θ)/∂Θ′
    (2) ∂L*/∂λ = c(Θ) − q = 0

If the restrictions are valid, then imposing them will not lead to a significant difference in the maximized value of the LF, i.e., L*(Θ)|Θ=Θ0 ≈ L(Θ)|Θ=Θ0 → the second term in the 1st first-order condition, d(Θ)′λ, will be small. Specifically, λ will be small, given that d(Θ) is the derivative of c(Θ), whose value will probably not be small. We could directly test whether λ = 0: d(Θ)′λ should be close to zero if the null hypothesis is true.

Alternatively, at the restricted maximum, from (1) the derivatives of the LF are:

    gR ≡ ∂L/∂Θ|Θ=Θ_ML,R = −d(Θ0)′λ

If the hypotheses are in fact true, then gR = 0 → the derivatives of the unrestricted LLF, L(Θ), evaluated at Θ0 ≈ 0:

    ∂L/∂Θ|Θ=Θ0 ≟ 0    (derivative of the original LLF, but at the restricted values)

This implies we need to determine whether the K slopes, evaluated at the restricted values, are jointly zero. The variance of the gradient vector of L(Θ) is the information matrix, I(Θ), which has been used in the evaluation of the parameter covariance matrix (this is property P.3 of L(Θ) stated in my introductory ML comments; Greene, p. 488). We need to evaluate the I(Θ) matrix at the restricted parameter vector: I(Θ)|Θ=Θ0 = the negative of the expected value of the LLF Hessian matrix at Θ = Θ0.

For a single parameter:

    LM = S(Θ0)² I(Θ0)⁻¹,  where S(Θ0) = dL/dΘ|Θ=Θ0 and I(Θ0) = −E[d²L/dΘ²]|Θ=Θ0  (Θ0 = restricted values)

Extending to multiple parameters:

    LM = S(Θ0)′ I(Θ0)⁻¹ S(Θ0),   (1×K)(K×K)(K×1),   LM ~ χ²(J) asymptotically

with S(Θ) ≡ dL/dΘ and Var[S(Θ_ML)] = −E(H_ML) = I(Θ) [Theorem 16.2, Greene, p. 488]. Theorem 16.7, Greene, p. 503 provides the LM statistic; Buse, pp. 154-155. A MATLAB sketch of the LM statistic follows below.

LR, W, and LM differ in the type of information required:
- LR requires both restricted and unrestricted parameter estimates (e.g., the LF is evaluated twice);
- W requires only unrestricted estimates;
- LM requires only restricted estimates.

If the log-likelihood is quadratic with respect to Θ, the 3 tests result in the same numerical values for large samples. All test statistics are distributed asymptotically χ² with J d.f. (the number of joint hypotheses). Theoretically, in finite samples, W ≥ LR ≥ LM → LM is more conservative in the sense of rejecting H0 less often.
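A minimal MATLAB sketch of the LM statistic; the restricted-estimate score vector and BHHH information matrix reproduced here are taken from the income/education example developed below.

    % Minimal sketch: LM = S(Theta_0)' I(Theta_0)^{-1} S(Theta_0), using the
    % score and BHHH information at the restricted estimates (example values).
    S0 = [0.0000; 7.9162];                  % score of L at restricted estimates
    I0 = [0.009944  0.2676; ...
          0.2676   11.1972];                % BHHH estimate of I(Theta_0)
    LM   = S0' / I0 * S0;                   % approx. 15.687; ~ chi2(J) under H0
    pval = 1 - chi2cdf(LM, 1);              % J = 1 restriction here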
Example: Income and Educational Attainment (Greene, p. 531)

Let's revisit the previous example where we examined the relationship between income and educational attainment. Previously we examined the following conditional exponential distribution:

    f(Inc_i | Edu_i, β) = [1/(β + Edu_i)] e^(−Inc_i/(β + Edu_i))

Greene, p. 531 extends the exponential density to a more general gamma distribution in which the exponential is nested. To save on notation, define β_t ≡ β + Edu_t. The general gamma distribution can be represented via the following:

    f(Inc_t | Edu_t, β, ρ) = [Inc_t^(ρ−1) / (β_t^ρ Γ(ρ))] e^(−Inc_t/β_t),
    Γ(ρ) = ∫₀^∞ μ^(ρ−1) e^(−μ) dμ

where ρ is an additional parameter and Γ(ρ) is the gamma function.

The total sample log-likelihood for the general gamma distribution is:

    L(β, ρ) = −ρ Σ_{t=1}^T ln(β + Edu_t) − T ln Γ(ρ) + (ρ−1) Σ_{t=1}^T ln Inc_t − Σ_{t=1}^T Inc_t/(β + Edu_t)

As before, the total sample log-likelihood for the exponential distribution is:

    L(β) = −Σ_{t=1}^T ln(β + Edu_t) − Σ_{t=1}^T Inc_t/(β + Edu_t)

Given the above sample log-likelihood, we have the following derivatives (Greene, p. 531):

    ∂L/∂β = −ρ Σ_{t=1}^T (1/β_t) + Σ_{t=1}^T Inc_t/β_t²
    ∂L/∂ρ = −Σ_{t=1}^T ln β_t − T ψ(ρ) + Σ_{t=1}^T ln Inc_t,  where ψ(ρ) ≡ d ln Γ(ρ)/dρ
    ∂²L/∂β² = ρ Σ_{t=1}^T (1/β_t²) − 2 Σ_{t=1}^T Inc_t/β_t³
    ∂²L/∂ρ² = −T ψ′(ρ),  where ψ′(ρ) ≡ d² ln Γ(ρ)/dρ²
    ∂²L/∂β∂ρ = −Σ_{t=1}^T (1/β_t)

Note: I do not use these in estimation, but in post-estimation evaluation of Hessians and hypothesis testing.

When ρ = 1, Γ(ρ) = 1 and the gamma density collapses to the exponential → the exponential distribution is nested within the above general distribution. We can therefore test the null hypothesis that the distribution of income with respect to educational attainment is exponential versus gamma:

    H0: ρ = 1 (exponential)    H1: ρ ≠ 1 (gamma)

MATLAB code (not reproduced in this transcription; see the sketch below) is used to estimate the β and ρ parameters:
- Unrestricted parameter estimates are implicitly obtained by setting the first derivatives equal to 0.
- Restricted parameter estimates are obtained by setting ρ = 1 and solving ∂L(β|ρ=1)/∂β = 0.
- Three estimates of the parameter covariance matrix are obtained:
  - NR (Hessian-based): Σ_NR = [−∂²L/∂Θ∂Θ′]⁻¹
  - GN (expected Hessian-based): Σ_GN = [−E(∂²L/∂Θ∂Θ′)]⁻¹, where E(Inc_i|Edu_i) = ρ(β + Edu_i)
  - BHHH (sum of squares and cross products): Σ_BH = [Σ_i (∂L_i/∂Θ)(∂L_i/∂Θ)′]⁻¹

With the parameters ordered as (β, ρ), the estimated information and covariance matrices are:

    I(Θ)|Θ=Θ_ML = −∂²L/∂Θ∂Θ′|Θ=Θ_ML = [0.85628 2.2423; 2.2423 7.4569]
    [−∂²L/∂Θ∂Θ′|Θ=Θ_ML]⁻¹ = [5.4940 −1.6520; −1.6520 0.6309]
    −∂²L/∂Θ∂Θ′|Θ=Θ_R = [0.021659 0.66885; 0.66885 32.8987]

where Θ_R denotes the restricted estimates (ρ = 1). The (ρ,ρ) element of the Hessian-based covariance matrix, 0.6309, is the estimated asymptotic variance of ρ̂ used in the tests below.
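The estimation code itself is not reproduced in the transcription. Below is a minimal MATLAB sketch, with assumed function and variable names, of the gamma log-likelihood and its analytic gradient from the formulas above, written in the negative-of-LLF form that a minimizer such as fminunc expects; restricted (ρ = 1) estimation would fix rho and minimize over beta alone.

    % Minimal sketch (assumed names): negative gamma LLF and gradient for
    % theta = [beta; rho], with data vectors Inc and Edu. Mirrors L(beta,rho) above.
    function [negL, negG] = gamma_negllf(theta, Inc, Edu)
        beta = theta(1);  rho = theta(2);
        bt   = beta + Edu;                                  % beta_t = beta + Edu_t
        T    = numel(Inc);
        L    = -rho*sum(log(bt)) - T*gammaln(rho) ...
               + (rho-1)*sum(log(Inc)) - sum(Inc./bt);      % gamma sample LLF
        dLdb = -rho*sum(1./bt) + sum(Inc./bt.^2);           % dL/dbeta
        dLdr = -sum(log(bt)) - T*psi(rho) + sum(log(Inc));  % dL/drho; psi = digamma
        negL = -L;
        negG = -[dLdb; dLdr];                               % negate for minimization
    end
    % Example use (assumed data): theta0 = [1; 1];
    % opts = optimoptions('fminunc','SpecifyObjectiveGradient',true);
    % theta_ml = fminunc(@(th) gamma_negllf(th,Inc,Edu), theta0, opts);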
Testing H0: ρ = 1 (exponential) vs. H1: ρ ≠ 1 (gamma)

Confidence interval: the 95% CI based on the unrestricted parameter results is

    3.1517 ± 1.96·√0.6309 → (1.5942, 4.7085)

which excludes ρ = 1.

Likelihood ratio test: with the unrestricted maximized LLF of −82.9144 and the restricted LLF of −88.4377,

    LR = 2[−82.9144 − (−88.4377)] = 11.046

With 1 DF, the critical value is 3.842, so H0 is rejected.

Wald test: remember, the Wald test is based on the unrestricted parameter estimates and the associated covariance matrix. Our null hypothesis is c(Θ) − q = ρ − 1 = 0, so c(ρ̂) = ρ̂ − 1 and Est.Asy.Var[c(ρ̂) − q] = Est.Asy.Var(ρ̂) = 0.6310. The Wald statistic is:

    W = (3.1517 − 1)² / 0.6310 = 7.3379

Lagrange Multiplier test: based on the restricted parameter and parameter covariance estimates. The BHHH estimate of the covariance matrix is typically used (especially when we restrict one of the coefficients to be at a certain value). Note that the information matrix is still (2×2) even though there is only 1 free parameter:

    LM = S(Θ0)′ I(Θ0)⁻¹ S(Θ0)
    S(Θ0) = [0.0000; 7.9162],  I_LM = [0.009944 0.2676; 0.2676 11.1972]
    LM = [0.0000 7.9162] × [0.009944 0.2676; 0.2676 11.1972]⁻¹ × [0.0000; 7.9162] = 15.6870

If we had instead used a Hessian-based estimate of the covariance matrix, LM = 5.1182.

In summary, the LR, Wald, and LM test statistics have different values in small samples, even though they test the same null hypothesis against the same critical value of 3.842. It is possible to reach different conclusions depending on which one is used: for example, the Hessian-based LM test undertaken with α = 0.01 instead of 0.05 (χ²c = 6.635) would fail to reject H0. In finite samples, such differences are expected. There are no clear rules on which test to rely on when such a result exists; it may suggest that more data are needed.
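For reference, the critical values quoted above can be reproduced with chi2inv, making the α = 0.05 vs. α = 0.01 reversal for the Hessian-based LM statistic explicit:

    % Critical chi-square values used above (J = 1 restriction).
    crit_05 = chi2inv(0.95, 1);   % 3.8415 (the slides' 3.842)
    crit_01 = chi2inv(0.99, 1);   % 6.6349 (the slides' 6.635)
    % Hessian-based LM = 5.1182: rejects H0 at alpha = 0.05 but not at 0.01.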