Empirical Analysis I, Part A
Class by Professor Azeem M. Shaikh
Notes by Jorge L. García

(These notes were taken during the first five weeks of the first class of the quantitative core course, Empirical Analysis Part A, at the University of Chicago Economics Department Ph.D. program, taught by Professor Azeem M. Shaikh. They cover Large Sample Theory, Hypothesis Testing, Conditional Expectations, and Linear Regression in a rigorous way.)

1 Large Sample Theory

1.1 Basics
Definition 1.1 (Convergence in Probability) A sequence of random variables $\{x_n : n \geq 1\}$ is said to converge in probability to another random variable $x$ if $\forall \varepsilon > 0$,
$$\Pr\{|x_n - x| > \varepsilon\} \to 0 \text{ as } n \to \infty.$$

Notation 1.2 Generally, the above definition is referred to as
$$x_n \xrightarrow{p} x \text{ as } n \to \infty, \quad \text{i.e.} \quad \lim_{n \to \infty} \Pr\{|x_n - x| > \varepsilon\} = 0.$$
A basic result about convergence in probability is the Weak Law of Large Numbers.
1.2 Weak Law of Large Numbers
Theorem 1.3 (Weak Law of Large Numbers) Let $x_1, \ldots, x_n$ be an i.i.d. sequence of random variables with distribution $P$. Assume $E[x_i] = \mu(P)$ exists, i.e. $-\infty < \mu(P) < \infty$ or $E[|x_i|] < \infty$. Then the sample mean behaves as follows:
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i \xrightarrow{p} \mu(P). \tag{1}$$

Remark 1.4 Identically distributed means
$$\Pr\{x_i \leq t\} = \Pr\{x_j \leq t\} \quad \forall i, j, \tag{2}$$
for any constant $t \in \mathbb{R}$.
Remark 1.5 Independence means
$$\Pr\{x_1 \leq t_1, \ldots, x_k \leq t_k\} = \prod_{1 \leq j \leq k} \Pr\{x_j \leq t_j\}. \tag{3}$$
Theorem 1.6 (Chebyshev's Inequality) For any random variable $x$ with distribution $P$ and $\varepsilon > 0$,
$$\Pr\{|x - \mu(P)| > \varepsilon\} \leq \frac{\operatorname{Var}[x]}{\varepsilon^2}. \tag{4}$$

Proof. Obviously, if $g(x) \leq f(x)$ $\forall x$, then $E[g(x)] \leq E[f(x)]$. Pick
$$g(x) := I\{|x - \mu(P)| > \varepsilon\} \leq \frac{(x - \mu(P))^2}{\varepsilon^2} =: f(x) \quad \forall \varepsilon > 0 \tag{5}$$
(whenever the indicator equals one the right-hand side exceeds one, and the right-hand side is non-negative otherwise). Taking expectations on both sides,
$$\Pr\{|x - \mu(P)| > \varepsilon\} \leq \frac{\operatorname{Var}[x]}{\varepsilon^2}, \tag{6}$$
which completes the proof.
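The bound is easy to check numerically. Below is a minimal Monte Carlo sketch (my addition, not part of the notes) comparing the empirical tail probability $\Pr\{|x - \mu| > \varepsilon\}$ with the Chebyshev bound $\operatorname{Var}[x]/\varepsilon^2$; the exponential population, sample size, and $\varepsilon$ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# x ~ Exponential(1): mu = 1, Var = 1 (illustrative choice)
x = rng.exponential(scale=1.0, size=1_000_000)
mu, var = 1.0, 1.0

for eps in (0.5, 1.0, 2.0, 3.0):
    tail = np.mean(np.abs(x - mu) > eps)   # empirical Pr{|x - mu| > eps}
    bound = min(var / eps**2, 1.0)         # Chebyshev bound Var[x]/eps^2, capped at 1
    print(f"eps={eps}: empirical tail={tail:.4f} <= bound={bound:.4f}")
```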
Proof. (Weak Law of Large Numbers) The idea is to show that $\bar{x}_n \xrightarrow{p} \mu(P)$, i.e.
$$\forall \varepsilon > 0, \quad \Pr\{|\bar{x}_n - \mu(P)| > \varepsilon\} \to 0 \text{ as } n \to \infty. \tag{7}$$
By Chebyshev's Inequality (the argument as given uses the stronger assumption $\operatorname{Var}[x_i] < \infty$),
$$\Pr\{|\bar{x}_n - \mu(P)| > \varepsilon\} \leq \frac{\operatorname{Var}[\bar{x}_n]}{\varepsilon^2} = \frac{\operatorname{Var}[x_i]}{n \varepsilon^2}. \tag{8}$$
But note that $\frac{\operatorname{Var}[x_i]}{n \varepsilon^2} \to 0$ as $n \to \infty$, which completes the proof.
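A quick simulation (my addition) illustrates Theorem 1.3: the probability that the sample mean deviates from $\mu(P)$ by more than $\varepsilon$ shrinks as $n$ grows. The Bernoulli(0.3) population, tolerance, and grid of sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
q, eps, reps = 0.3, 0.05, 5_000   # P = Bernoulli(q), tolerance, Monte Carlo draws

for n in (10, 100, 1_000, 10_000):
    xbar = rng.binomial(n, q, size=reps) / n   # reps sample means of size n
    prob = np.mean(np.abs(xbar - q) > eps)     # Pr{|xbar_n - mu(P)| > eps}
    print(f"n={n:>6}: Pr{{|xbar - mu| > {eps}}} ~ {prob:.4f}")
```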
Definition 1.7 (Consistency) An estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if, $\forall \varepsilon > 0$,
$$\Pr\{|\hat{\theta}_n - \theta| > \varepsilon\} \to 0 \text{ as } n \to \infty. \tag{9}$$

Remark 1.8 Suppose $x_1, \ldots, x_n$ are i.i.d. with distribution $P$ and $\mu(P)$ exists. By the Weak Law of Large Numbers $\bar{x}_n \xrightarrow{p} \mu(P)$, so the natural consistent estimator of $\mu(P)$ is $\bar{x}_n$.
1.3 Existence of Moments

Notation 1.9 The $k$th raw moment of $x$ is $E[x^k]$ and it exists if $E[|x|^k] < \infty$.

Notation 1.10 The $k$th centered moment of $x$ is $E[(x - \mu(P))^k]$ and it exists if $E[|x - \mu(P)|^k] < \infty$.

Remark 1.11 The second centered moment is the variance by definition.
Theorem 1.12 (Jensen's Inequality) For any random variable $x$ and a convex function $g$,
$$E[g(x)] \geq g(E[x]). \tag{10}$$

Proof. Since $g$ is convex,
$$g(x) = \sup_{l \in L} l(x), \tag{11}$$
where
$$L = \{l : l \text{ linear (affine)}, \; l(x) \leq g(x) \; \forall x\}. \tag{12}$$
Then,
$$E[g(x)] = E\left[\sup_{l \in L} l(x)\right] \geq \sup_{l \in L} E[l(x)] = \sup_{l \in L} l(E[x]) = g(E[x]), \tag{13}$$
which completes the proof.
Claim 1.13 The existence of higher order moments implies the existence of lower order moments.

Proof. Take $j \leq k$ and the convex function $g(t) = t^{k/j}$ (with $k/j \geq 1$). Say that the higher order moment exists, i.e. $E[|x|^k] < \infty$. Then, by Jensen's inequality applied to $|x|^j$,
$$\left(E[|x|^j]\right)^{k/j} \leq E\left[\left(|x|^j\right)^{k/j}\right] = E[|x|^k] < \infty, \tag{14}$$
so $E[|x|^j] < \infty$, which completes the proof.
1.4 Further Illustrations of the Chebyshev Inequality

Suppose that $x_1, \ldots, x_n$ are i.i.d. with distribution $P = \text{Bernoulli}(q)$ for $0 < q < 1$. We want to form a confidence region for $q := \mu(P)$ of level $(1 - \alpha)$, i.e. $C_n(x_1, \ldots, x_n)$ such that $\Pr\{q \in C_n\} \geq (1 - \alpha)$ $\forall q$. By the Chebyshev Inequality,
$$\Pr\{|\bar{x}_n - q| > \varepsilon\} \leq \frac{\operatorname{Var}[\bar{x}_n]}{\varepsilon^2} = \frac{\operatorname{Var}[x_i]}{n \varepsilon^2} = \frac{q(1-q)}{n \varepsilon^2} \leq \frac{1}{n \varepsilon^2} \tag{16}$$
$\forall \varepsilon > 0$. Then,
$$\Pr\{|\bar{x}_n - q| \leq \varepsilon\} \geq 1 - \frac{1}{n \varepsilon^2} \geq 1 - \alpha \tag{17}$$
for the correct choice of $\varepsilon$ (e.g. $\varepsilon = 1/\sqrt{n\alpha}$). Hence
$$C_n = \{0 < q < 1 : |\bar{x}_n - q| \leq \varepsilon\} = [\bar{x}_n - \varepsilon, \bar{x}_n + \varepsilon]. \tag{18}$$
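The following sketch (my addition; the sample size, level, and Bernoulli(0.4) population are illustrative assumptions) builds the Chebyshev-based interval $[\bar{x}_n - \varepsilon, \bar{x}_n + \varepsilon]$ with $\varepsilon = 1/\sqrt{n\alpha}$ and checks its (conservative) coverage by simulation.

```python
import numpy as np

rng = np.random.default_rng(2)
q, n, alpha, reps = 0.4, 500, 0.05, 10_000

# epsilon chosen so that 1/(n * eps^2) = alpha, i.e. coverage >= 1 - alpha by (17)
eps = 1.0 / np.sqrt(n * alpha)

xbar = rng.binomial(n, q, size=reps) / n
covered = np.mean((xbar - eps <= q) & (q <= xbar + eps))
print(f"eps = {eps:.4f}, empirical coverage = {covered:.4f} (target >= {1 - alpha})")
```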
Claim 1.14 (Vector Generalization of Convergence in Probability) Marginal convergence in probability implies joint convergence in probability.

Let $x_n$ denote a sequence of random vectors and $x$ another random vector. Denote by $x_{n,j}$ the $j$th component of $x_n$. Given that $x_{n,j} \xrightarrow{p} x_j$ as $n \to \infty$ $\forall j = 1, \ldots, k$, then $x_n \xrightarrow{p} x$.

Proof. Notice that
$$\Pr\{\|x_n - x\| > \varepsilon\} = \Pr\left\{\sum_j (x_{n,j} - x_j)^2 > \varepsilon^2\right\} \leq \Pr\left\{\bigcup_{1 \leq j \leq k} \left\{|x_{n,j} - x_j| > \frac{\varepsilon}{\sqrt{k}}\right\}\right\} \leq k \max_j \Pr\left\{|x_{n,j} - x_j| > \frac{\varepsilon}{\sqrt{k}}\right\} \to 0 \text{ as } n \to \infty, \tag{19}$$
using the following fact:
$$\Pr\{a_1 + \cdots + a_n > \delta\} \leq \Pr\left\{\bigcup_{1 \leq i \leq n} \left\{a_i > \frac{\delta}{n}\right\}\right\} \leq n \max_{1 \leq i \leq n} \Pr\left\{a_i > \frac{\delta}{n}\right\}. \tag{20}$$
Then the proof is complete.
Theorem 1.15 (Continuous Mapping Theorem) Let $x_n$ be a sequence of random vectors and $x$ be another random vector taking values in $\mathbb{R}^k$ such that $x_n \xrightarrow{p} x$. Let $g : \mathbb{R}^k \to \mathbb{R}^d$ be continuous on a set $C$ such that $\Pr\{x \in C\} = 1$. Then
$$g(x_n) \xrightarrow{p} g(x) \text{ as } n \to \infty.$$
That is, $\forall \varepsilon > 0$,
$$\Pr\{|g(x_n) - g(x)| > \varepsilon\} \to 0 \text{ as } n \to \infty.$$

Proof. For $\delta > 0$, let $B_\delta = \{x \in \mathbb{R}^k : \exists y \text{ with } |x - y| < \delta \text{ and } |g(x) - g(y)| > \varepsilon\}$. Notice that if $x \notin B_\delta$, then
$$\forall y, \; |x - y| < \delta \Rightarrow |g(x) - g(y)| \leq \varepsilon, \tag{21}$$
so the event $\{|x_n - x| < \delta,\; x \notin B_\delta\}$ implies $|g(x_n) - g(x)| \leq \varepsilon$. Then,
$$\Pr\{|g(x_n) - g(x)| > \varepsilon\} = \Pr\{|g(x_n) - g(x)| > \varepsilon \,\cap\, x \in B_\delta\} + \Pr\{|g(x_n) - g(x)| > \varepsilon \,\cap\, x \notin B_\delta\} \leq \Pr\{x \in B_\delta \cap C\} + \Pr\{|x_n - x| \geq \delta\}. \tag{22}$$
By continuity of $g$ at each point of $C$, $B_\delta \cap C \downarrow \emptyset$ as $\delta \downarrow 0$, so the first term can be made arbitrarily small by choosing $\delta$ small; for each fixed $\delta$, the second term goes to $0$ as $n \to \infty$ because $x_n \xrightarrow{p} x$. This completes the proof.
Example 1.16 Let $x_1, \ldots, x_n$ be an i.i.d. sequence of random variables with distribution $P$. Suppose $\operatorname{Var}[x_i] = \sigma^2(P) < \infty$. Then the natural consistent estimator of $\sigma^2(P)$ is
$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}_n\right)^2. \tag{23}$$

(Proof) Notice
$$S_n^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}_n^2\right). \tag{24}$$
Now let $(x, y, z) := \left(\frac{n}{n-1},\; \frac{1}{n}\sum_{i=1}^{n} x_i^2,\; \bar{x}_n\right)$ and note that $x(y - z^2)$ is a continuous mapping and
$$\left(\frac{n}{n-1},\; \frac{1}{n}\sum_{i=1}^{n} x_i^2,\; \bar{x}_n\right) \xrightarrow{p} \left(1,\; E[x_i^2],\; E[x_i]\right) \text{ as } n \to \infty.$$
Then,
$$S_n^2 \xrightarrow{p} \sigma^2(P) \text{ as } n \to \infty,$$
which completes the proof.
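A short simulation (my addition; the chi-square population and sample sizes are arbitrary) shows $S_n^2$ approaching $\sigma^2(P)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
df = 3                      # x_i ~ chi-square(3), so sigma^2(P) = 2*df = 6

for n in (50, 500, 5_000, 50_000):
    x = rng.chisquare(df, size=n)
    s2 = x.var(ddof=1)      # S_n^2 with the 1/(n-1) normalization of (23)
    print(f"n={n:>6}: S_n^2 = {s2:.3f}  (sigma^2 = {2 * df})")
```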
Definition 1.17 (Convergence in $r$th Moment) A sequence of random vectors $\{x_n : n \geq 1\}$ converges in $r$th moment to another random vector $x$ if
$$E[|x_n - x|^r] \to 0 \text{ as } n \to \infty.$$

Claim 1.18 Convergence in $r$th moment implies convergence in probability.

Proof. Let $\{x_n : n \geq 1\}$ be a sequence of random vectors and $x$ be another random vector. Obviously, if $g(x) \leq f(x)$ $\forall x$, then $E[g(x)] \leq E[f(x)]$. Now,
$$g(x) := I\{|x_n - x| > \varepsilon\} \leq \frac{|x_n - x|^r}{\varepsilon^r} =: f(x). \tag{25}$$
Taking expectations on both sides,
$$\Pr\{|x_n - x| > \varepsilon\} \leq \frac{E[|x_n - x|^r]}{\varepsilon^r}, \tag{26}$$
which implies that
$$\Pr\{|x_n - x| > \varepsilon\} \to 0 \text{ as } n \to \infty$$
and completes the proof.
Claim 1.19 Convergence in probability does not imply convergence in $r$th moment.

Proof. One can show this by counterexample. Let $x_n = n$ with probability $\frac{1}{n}$ and $x_n = 0$ with probability $1 - \frac{1}{n}$. First, note that $\forall \varepsilon > 0$,
$$\Pr\{|x_n - 0| > \varepsilon\} = \frac{1}{n} \to 0 \text{ as } n \to \infty. \tag{27}$$
However, $E[|x_n - 0|] = n \cdot \frac{1}{n} = 1 \not\to 0$, so $x_n$ does not converge to $0$ in first moment.

Remark 1.20 A sufficient condition for convergence in probability to imply convergence in $r$th moment is that the $x_n$ are all bounded (uniformly in $n$).
1.5 Convergence in Distribution

Definition 1.21 (Convergence in Distribution) A sequence $\{x_n : n \geq 1\}$ of random vectors on $\mathbb{R}^k$ is said to converge in distribution to $x$ on $\mathbb{R}^k$ if
$$\Pr\{x_n \leq b\} - \Pr\{x \leq b\} \to 0 \text{ as } n \to \infty$$
for every $b \in \mathbb{R}^k$ at which the c.d.f. $b \mapsto \Pr\{x \leq b\}$ is continuous.

Notation 1.22 As defined above, convergence in distribution is denoted
$$x_n \xrightarrow{d} x.$$

Remark 1.23 In the definition of convergence in distribution, $x$ may be any random vector.
Lemma 1.24 (Portmanteau) Let a sequence $\{x_n : n \geq 1\}$ of random vectors on $\mathbb{R}^k$ converge to $x$ in distribution. Then

1. $E[f(x_n)] \to E[f(x)]$ for all continuous, bounded, real-valued functions $f$.

2. $E[f(x_n)] \to E[f(x)]$ for all bounded, Lipschitz, real-valued functions $f$ (see Definition 1.25 below).

3. For all non-negative, continuous, real-valued functions $f$,
$$\liminf_{n \to \infty} E[f(x_n)] \geq E[f(x)]. \tag{28}$$

4. For any closed set $F$,
$$\limsup_{n \to \infty} \Pr\{x_n \in F\} \leq \Pr\{x \in F\}. \tag{29}$$

5. For any open set $G$,
$$\liminf_{n \to \infty} \Pr\{x_n \in G\} \geq \Pr\{x \in G\}. \tag{30}$$

6. For all sets $B$ such that $\Pr\{x \in \partial B\} = 0$,
$$\Pr\{x_n \in B\} \to \Pr\{x \in B\},$$
where
$$\partial B = \operatorname{cl}(B) \setminus \operatorname{int}(B) = \operatorname{cl}(B) \cap (\operatorname{int}(B))^c. \tag{31}$$

Proof. Not shown in class.
Lemma 1.26 Let $\{x_n : n \geq 1\}$ be a sequence of random vectors such that $x_n \xrightarrow{d} x$, where $x$ is another random vector, and let $\{y_n : n \geq 1\}$ be another sequence of random vectors such that $x_n - y_n \xrightarrow{p} 0$. Then $y_n \xrightarrow{d} x$.

Proof. By Portmanteau, it suffices to show that for all bounded, Lipschitz, real-valued functions $f$,
$$E[f(y_n)] \to E[f(x)].$$
Notice that
$$E[f(y_n)] - E[f(x)] = \left(E[f(y_n)] - E[f(x_n)]\right) + \left(E[f(x_n)] - E[f(x)]\right),$$
and $E[f(x_n)] - E[f(x)] \to 0$ by Portmanteau because $x_n \xrightarrow{d} x$, so it is enough that $E[f(y_n)] - E[f(x_n)] \to 0$. Now,
$$|f(y_n) - f(x_n)| \leq |f(y_n) - f(x_n)|\, I\{|x_n - y_n| > \varepsilon\} + |f(y_n) - f(x_n)|\, I\{|x_n - y_n| \leq \varepsilon\} \leq 2B\, I\{|x_n - y_n| > \varepsilon\} + L|x_n - y_n|\, I\{|x_n - y_n| \leq \varepsilon\} \leq 2B\, I\{|x_n - y_n| > \varepsilon\} + L\varepsilon, \tag{32}$$
where $L$ is the Lipschitz constant and $B$ is a bound on $|f|$. Then,
$$|E[f(y_n)] - E[f(x_n)]| \leq E[|f(y_n) - f(x_n)|] \leq 2B \Pr\{|x_n - y_n| > \varepsilon\} + L\varepsilon, \tag{34}$$
which can be made arbitrarily small for the correct choice of $\varepsilon$ and $n$ large enough.

Definition 1.25 A function $f$ is Lipschitz if $\exists L < \infty$ such that $|f(x) - f(y)| \leq L|x - y|$.
Theorem 1.27 Convergence in probability implies convergence in distribution, i.e.
$$x_n \xrightarrow{p} x \Rightarrow x_n \xrightarrow{d} x.$$

Proof. Apply Lemma 1.26 with the constant sequence $x$ (which trivially converges in distribution to $x$) in place of $x_n$ and with $y_n := x_n$, since $x - x_n \xrightarrow{p} 0$.

Exercise 1.28 Show that the converse is generally false.
However, it is interesting that there is a case where the converse does hold.

Lemma 1.29 If $\{x_n : n \geq 1\}$ is a sequence of random variables such that $x_n \xrightarrow{d} c$ for a constant $c$, then $x_n \xrightarrow{p} c$.

Proof. We need to show that
$$\Pr\{|x_n - c| > \varepsilon\} \to 0 \text{ as } n \to \infty.$$
By Portmanteau,
$$\limsup_{n \to \infty} \Pr\{x_n \in F\} \leq \Pr\{c \in F\} \tag{35}$$
for any closed set $F$. Notice, taking the closed set $F = \{z : |z - c| \geq \delta\}$ for $0 < \delta < \varepsilon$,
$$\limsup_{n \to \infty} \Pr\{|x_n - c| > \varepsilon\} \leq \limsup_{n \to \infty} \Pr\{|x_n - c| \geq \delta\} \leq \Pr\{c \in (B_\delta(c))^c\} = 0, \tag{36}$$
which completes the proof.

In general, marginal convergence in distribution does not imply joint convergence in distribution, i.e.
$$x_n \xrightarrow{d} x, \; y_n \xrightarrow{d} y \;\not\Rightarrow\; (x_n, y_n) \xrightarrow{d} (x, y).$$

Remark 1.30 An exception to the prior statement is the case where $y = c$, a constant.
Lemma 1.31 Let $\{x_n : n \geq 1\}$ be such that $x_n \xrightarrow{d} x$ and $\{y_n : n \geq 1\}$ be such that $y_n \xrightarrow{d} c$, a constant. Then
$$(x_n, y_n) \xrightarrow{d} (x, c).$$

Proof. Since $y_n \xrightarrow{d} c$ implies $y_n \xrightarrow{p} c$ by the prior lemma,
$$\|(x_n, y_n) - (x_n, c)\| = \|y_n - c\| \xrightarrow{p} 0. \tag{38}$$
By Lemma 1.26, it is therefore enough to show
$$(x_n, c) \xrightarrow{d} (x, c).$$
By Portmanteau, this requires that
$$E[f(x_n, c)] \to E[f(x, c)]$$
for all bounded, continuous, real-valued $f$. But $f(\cdot, c)$ is itself bounded, continuous and real-valued, so this follows because $x_n \xrightarrow{d} x$, which completes the proof.

Exercise 1.32 Show that this does not hold when $c$ is replaced by an arbitrary random variable.
One of the basic results about convergence in distribution is the Central Limit Theorem.

Theorem 1.33 (Univariate Central Limit Theorem) Let $\{x_n : n \geq 1\}$ be an i.i.d. sequence of random variables with distribution $P$ and $\sigma^2(P) = \operatorname{Var}[x_i] < \infty$. Then
$$\sqrt{n}\left(\bar{x}_n - \mu(P)\right) \xrightarrow{d} N\left(0, \sigma^2(P)\right).$$

Proof. Not shown in class.

Remark 1.34 This result is true even when $\sigma^2(P) = 0$, with $N(0, 0)$ interpreted as the point mass at $0$.
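A minimal check of Theorem 1.33 (my addition; the exponential population, sample size, and evaluation points are illustrative): the simulated distribution of $\sqrt{n}(\bar{x}_n - \mu(P))$ is compared with the $N(0, \sigma^2(P))$ c.d.f.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, reps = 200, 20_000
mu, sigma = 1.0, 1.0                      # x_i ~ Exponential(1): mu(P)=1, sigma^2(P)=1

x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu)    # sqrt(n) * (xbar_n - mu(P))

for t in (-1.5, 0.0, 1.0, 2.0):
    print(f"t={t:+.1f}: Pr{{Z_n <= t}} = {np.mean(z <= t):.4f}   "
          f"Phi(t/sigma) = {norm.cdf(t / sigma):.4f}")
```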
Lemma 1.35 (Cramér-Wold) Let $\{x_n : n \geq 1\}$ be a sequence of random vectors and $x$ another random vector. Then
$$x_n \xrightarrow{d} x \iff t'x_n \xrightarrow{d} t'x \text{ for every fixed vector } t.$$

Proof. Not shown in class.

Theorem 1.36 (Multivariate Central Limit Theorem) Let $\{x_n : n \geq 1\}$ be an i.i.d. sequence of random vectors with distribution $P$ and $\Sigma(P) = \operatorname{Var}(x_i) < \infty$. Then
$$\sqrt{n}\left(\bar{x}_n - \mu(P)\right) \xrightarrow{d} N\left(0, \Sigma(P)\right).$$

Proof. By Cramér-Wold this is true iff, for every $t$,
$$t'\sqrt{n}\left(\bar{x}_n - \mu(P)\right) \xrightarrow{d} t'N\left(0, \Sigma(P)\right) = N\left(0, t'\Sigma(P)t\right)$$
$$\iff \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} t'x_i - t'\mu(P)\right) \xrightarrow{d} N\left(0, t'\Sigma(P)t\right),$$
which holds by the Univariate Central Limit Theorem applied to the i.i.d. scalars $t'x_i$.
Theorem 1.37 (Continuous Mapping Theorem) Let $x_n$ be a sequence of random vectors and $x$ be another random vector taking values in $\mathbb{R}^k$ such that $x_n \xrightarrow{d} x$. Let $g : \mathbb{R}^k \to \mathbb{R}^d$ be continuous on a set $C$ such that $\Pr\{x \in C\} = 1$. Then
$$g(x_n) \xrightarrow{d} g(x) \text{ as } n \to \infty.$$

Proof. By Portmanteau, it is enough to show that for any closed set $F$,
$$\limsup_{n \to \infty} \Pr\{g(x_n) \in F\} \leq \Pr\{g(x) \in F\}. \tag{39}$$
But notice that
$$\Pr\{g(x_n) \in F\} = \Pr\{x_n \in g^{-1}(F)\} \leq \Pr\{x_n \in \operatorname{cl}(g^{-1}(F))\}, \tag{40}$$
where $g^{-1}(F) = \{x : g(x) \in F\}$. Therefore, applying Portmanteau to the closed set $\operatorname{cl}(g^{-1}(F))$,
$$\limsup_{n \to \infty} \Pr\{g(x_n) \in F\} \leq \limsup_{n \to \infty} \Pr\{x_n \in \operatorname{cl}(g^{-1}(F))\} \leq \Pr\{x \in \operatorname{cl}(g^{-1}(F))\} \leq \Pr\{x \in g^{-1}(F) \cup C^c\} = \Pr\{x \in g^{-1}(F)\} = \Pr\{g(x) \in F\}. \tag{41}$$
To finish the proof, claim that
$$\operatorname{cl}(g^{-1}(F)) \subseteq g^{-1}(F) \cup C^c. \tag{42}$$
Take $z \in \operatorname{cl}(g^{-1}(F))$ and suppose $z \in C$; we need to show that $z \in g^{-1}(F)$. By the definition of the closure there exists $z_n \to z$ such that
$$z_n \in g^{-1}(F) \Rightarrow g(z_n) \in F \Rightarrow g(z) \in F,$$
where the last step uses the continuity of $g$ at $z \in C$ and the closedness of $F$. This completes the proof.
Corollary 1.38 (Slutsky's Theorem) Let
$$y_n \xrightarrow{p} c, \qquad x_n \xrightarrow{d} x.$$
Then
$$y_n x_n \xrightarrow{d} c x \quad \text{and} \quad y_n + x_n \xrightarrow{d} c + x.$$

Example 1.39 Let $\{x_n : n \geq 1\}$ be an i.i.d. sequence of random variables with distribution $P$ on $\mathbb{R}$ and $0 < \sigma^2(P) < \infty$. By the Central Limit Theorem,
$$\sqrt{n}\left(\bar{x}_n - \mu(P)\right) \xrightarrow{d} N\left(0, \sigma^2(P)\right).$$
By the Weak Law of Large Numbers,
$$S_n^2 \xrightarrow{p} \sigma^2(P).$$
By the Continuous Mapping Theorem,
$$S_n \xrightarrow{p} \sigma(P).$$
By Slutsky's Theorem,
$$\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} \xrightarrow{d} N(0, 1).$$
1.6 Hypothesis Testing

Consider a simple test where one wants to contrast the following hypotheses:
$$H_0 : \mu(P) \leq 0 \quad \text{vs.} \quad H_1 : \mu(P) > 0. \tag{43}$$
In this context:

1. Type I Error: rejecting $H_0$ when it should not be rejected.

2. Type II Error: failing to reject $H_0$ when it should be rejected.

Normally, $\alpha$ is defined as the amount of Type I Error that the test is supposed to tolerate.

Definition 1.40 A test is consistent in level if and only if
$$\limsup_{n \to \infty} \Pr\{\text{Type I Error}\} \leq \alpha \tag{44}$$
whenever $H_0$ is true.
Example 1.41 Let $x_1, \ldots, x_n$ be i.i.d. with distribution $P$ such that $0 < \sigma^2(P) < \infty$. Then
$$\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} \xrightarrow{d} N(0, 1).$$
If we want to test
$$H_0 : \mu(P) \leq 0 \quad \text{vs.} \quad H_1 : \mu(P) > 0 \tag{45}$$
at significance level $\alpha$, a natural choice of statistic is the $t$-statistic
$$t_n = \frac{\sqrt{n}\,\bar{x}_n}{S_n}, \tag{46}$$
with critical value $c_n = z_{1-\alpha} = \Phi^{-1}(1 - \alpha)$, where $\Phi$ is the standard normal c.d.f.
Claim 1.42 The test is consistent in level.

Proof. We need to show that
$$\limsup_{n \to \infty} \Pr\left\{\frac{\sqrt{n}\,\bar{x}_n}{S_n} > z_{1-\alpha}\right\} \leq \alpha \tag{47}$$
whenever $H_0$ is true, i.e. $\mu(P) \leq 0$. Notice
$$\limsup_{n \to \infty} \Pr\left\{\frac{\sqrt{n}\,\bar{x}_n}{S_n} > z_{1-\alpha}\right\} = \limsup_{n \to \infty} \Pr\left\{\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} + \frac{\sqrt{n}\,\mu(P)}{S_n} > z_{1-\alpha}\right\} \leq \limsup_{n \to \infty} \Pr\left\{\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} > z_{1-\alpha}\right\} = \limsup_{n \to \infty}\left(1 - \Pr\left\{\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} \leq z_{1-\alpha}\right\}\right) = 1 - \Phi(z_{1-\alpha}) = \alpha, \tag{48}$$
where the inequality uses $\mu(P) \leq 0$ under $H_0$, and the last steps follow because
$$\Pr\left\{\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} \leq z_{1-\alpha}\right\} \to \Phi(z_{1-\alpha}) \text{ as } n \to \infty,$$
since $\frac{\sqrt{n}\left(\bar{x}_n - \mu(P)\right)}{S_n} \xrightarrow{d} N(0, 1)$.
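To illustrate Claim 1.42 numerically (my addition; the centered exponential population at the boundary $\mu(P) = 0$ and the settings below are assumptions), the Monte Carlo rejection rate of the one-sided $t$-test is compared with $\alpha$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
alpha, reps = 0.05, 20_000
z_crit = norm.ppf(1 - alpha)                         # z_{1-alpha}

for n in (20, 100, 1_000):
    x = rng.exponential(1.0, size=(reps, n)) - 1.0   # mu(P) = 0: boundary of H0
    tn = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)
    print(f"n={n:>5}: rejection rate = {np.mean(tn > z_crit):.4f} (alpha = {alpha})")
```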
Exercise 1.43 Show that the analogous two-sided (two-tailed) test is consistent in level.
1.6.1 The p-value

The p-value is useful as a summary over all the levels $\alpha$ the tester might be willing to use: it is the smallest $\alpha$ for which the test rejects. For example, in the last test, the p-value is the following:
$$\begin{aligned}
p\text{-value} &= \inf\left\{\alpha \in (0, 1) : \frac{\sqrt{n}\,\bar{x}_n}{S_n} > z_{1-\alpha}\right\} \\
&= \inf\left\{\alpha \in (0, 1) : \Phi\left(\frac{\sqrt{n}\,\bar{x}_n}{S_n}\right) > \Phi(z_{1-\alpha})\right\} \\
&= \inf\left\{\alpha \in (0, 1) : \Phi\left(\frac{\sqrt{n}\,\bar{x}_n}{S_n}\right) > 1 - \alpha\right\} \\
&= \inf\left\{\alpha \in (0, 1) : \alpha > 1 - \Phi\left(\frac{\sqrt{n}\,\bar{x}_n}{S_n}\right)\right\} \\
&= 1 - \Phi(t_n). \tag{49}
\end{aligned}$$
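The closed form in (49) is straightforward to compute; the snippet below (my addition, with a made-up sample) evaluates the one-sided p-value $1 - \Phi(t_n)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.normal(loc=0.2, scale=1.0, size=100)   # illustrative sample

t_n = np.sqrt(len(x)) * x.mean() / x.std(ddof=1)
p_value = 1 - norm.cdf(t_n)                    # p-value = 1 - Phi(t_n), as in (49)
print(f"t_n = {t_n:.3f}, one-sided p-value = {p_value:.4f}")
```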
Example 1.44 Let $x_1, \ldots, x_n$ be a sequence of i.i.d. random variables with distribution $P = \text{Bernoulli}(q)$ with $0 < q < 1$. Arguing as above,
$$\frac{\sqrt{n}\left(\bar{x}_n - q\right)}{\sqrt{\bar{x}_n\left(1 - \bar{x}_n\right)}} \xrightarrow{d} N(0, 1).$$
The idea is to construct an interval $C_n = C_n(x_1, \ldots, x_n)$ such that $\Pr\{q \in C_n\} \to (1 - \alpha)$. Then,
$$\Pr\left\{z_{\alpha/2} < \frac{\sqrt{n}\left(\bar{x}_n - q\right)}{\sqrt{\bar{x}_n\left(1 - \bar{x}_n\right)}} < z_{1-\alpha/2}\right\} \to 1 - \alpha,$$
which implies
$$C_n = \left\{q : \bar{x}_n - \frac{\sqrt{\bar{x}_n\left(1 - \bar{x}_n\right)}}{\sqrt{n}}\, z_{1-\alpha/2} \leq q \leq \bar{x}_n + \frac{\sqrt{\bar{x}_n\left(1 - \bar{x}_n\right)}}{\sqrt{n}}\, z_{1-\alpha/2}\right\}.$$

Remark 1.45 For each fixed $q$: $\forall \varepsilon > 0$, $\exists N(\varepsilon, q)$ such that $\forall n > N(\varepsilon, q)$,
$$|\Pr\{q \in C_n\} - (1 - \alpha)| < \varepsilon. \tag{50}$$
This can be misleading, because the convergence is pointwise in $q$, not uniform. It can actually be true that
$$\inf_{0 < q < 1} \Pr\{q \in C_n\} = 0. \tag{51}$$
For instance, suppose $q = (1 - \varepsilon)^{1/n}$ for some $\varepsilon > 0$, which implies $\Pr\{x_1 = 1, \ldots, x_n = 1\} = 1 - \varepsilon$. If $x_1 = 1, \ldots, x_n = 1$, then $C_n = \{1\}$ and $q \notin C_n$, so $\Pr\{q \in C_n\} \leq \varepsilon$.
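The sketch below (my addition; the sample size, level, and $q$ grid are arbitrary) computes the Wald interval of Example 1.44 and shows that coverage is close to $1 - \alpha$ for moderate $q$ but can collapse when $q$ is very close to 1, in line with Remark 1.45.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, alpha, reps = 100, 0.05, 20_000
z = norm.ppf(1 - alpha / 2)

for q in (0.5, 0.1, 0.999):
    xbar = rng.binomial(n, q, size=reps) / n
    half = z * np.sqrt(xbar * (1 - xbar) / n)        # half-width of the Wald interval
    cover = np.mean((xbar - half <= q) & (q <= xbar + half))
    print(f"q={q}: coverage = {cover:.4f} (nominal {1 - alpha})")
```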
Example 1.46 Let $x_1, \ldots, x_n$ be a sequence of i.i.d. random vectors with distribution $P$ on $\mathbb{R}^k$. Say $\Sigma(P) = \operatorname{Var}(x_i) < \infty$ and that it is non-singular. By the Central Limit Theorem,
$$\sqrt{n}\left(\bar{x}_n - \mu(P)\right) \xrightarrow{d} N\left(0, \Sigma(P)\right).$$
Also, note that $z'\Sigma(P)^{-1}z \sim \chi^2_k$ where $z \sim N(0, \Sigma(P))$, by the Continuous Mapping Theorem. The natural estimator of $\Sigma(P)$ is
$$\hat{\Sigma}_n = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}_n\right)\left(x_i - \bar{x}_n\right)'.$$

Exercise 1.47 Show that $\hat{\Sigma}_n \xrightarrow{p} \Sigma(P)$.

Claim 1.48 $\hat{\Sigma}_n^{-1} \xrightarrow{p} \Sigma(P)^{-1}$.

Proof. By the Continuous Mapping Theorem (matrix inversion is continuous at non-singular matrices).
Exercise 1.49 Consider the following test:
$$H_0 : \mu(P) = 0 \quad \text{vs.} \quad H_1 : \mu(P) \neq 0,$$
with the statistic
$$t_n = \sqrt{n}\,\bar{x}_n'\,\hat{\Sigma}_n^{-1}\,\sqrt{n}\,\bar{x}_n \tag{52}$$
and critical value $c_n = c_{k, 1-\alpha}$, the $(1-\alpha)$ quantile of the $\chi^2_k$ distribution. Show that the test is consistent in level and find its p-value.
1.7 Delta Method

Theorem 1.50 (Delta Method) Let $\{x_n : n \geq 1\}$ be a sequence of random vectors on $\mathbb{R}^k$ and $x$ another random vector such that
$$t_n\left(x_n - c\right) \xrightarrow{d} x$$
for constants $t_n \to \infty$ as $n \to \infty$. Also, let $g : \mathbb{R}^k \to \mathbb{R}^d$ be differentiable at $c$, and let $Dg(c)$ be the Jacobian matrix of $g$ at $c$, so it has dimensions $d \times k$. Then,
$$t_n\left(g(x_n) - g(c)\right) \xrightarrow{d} Dg(c)\,x.$$

Proof. By Taylor's Theorem, $g(x) - g(c) = Dg(c)(x - c) + R(|x - c|)$, where $R(|0|) = 0$ and $R(|h|) = o(|h|)$, i.e. $R(|h|)/|h| \to 0$ as $|h| \to 0$. Then,
$$t_n\left(g(x_n) - g(c)\right) = Dg(c)\,t_n\left(x_n - c\right) + t_n R\left(|x_n - c|\right). \tag{53}$$
By the Continuous Mapping Theorem and Slutsky's Theorem,
$$Dg(c)\,t_n\left(x_n - c\right) \xrightarrow{d} Dg(c)\,x, \qquad t_n R\left(|x_n - c|\right) \xrightarrow{p} 0.$$
Note that there are no restrictions on the dimensions of $Dg(c)$.

Example 1.51 Let $x \sim N(0, \Sigma)$. Then
$$t_n\left(g(x_n) - g(c)\right) \xrightarrow{d} N\left(0, Dg(c)\,\Sigma\,Dg(c)'\right).$$
Exercise 1.52 (Delta Method) Let $\{x_i : i \geq 1\}$ be a sequence of i.i.d. random variables with distribution $P = \text{Bernoulli}(q)$, $0 < q < 1$. Notice that $\operatorname{Var}[x_i] = g(q) = q(1 - q)$ and, also,
$$\sqrt{n}\left(\bar{x}_n - q\right) \xrightarrow{d} N\left(0, g(q)\right),$$
so what is the limiting distribution, after centering and normalization, of $g(\bar{x}_n)$? By the Delta Method,
$$\sqrt{n}\left(g(\bar{x}_n) - g(q)\right) \xrightarrow{d} Dg(q)\,N\left(0, g(q)\right) = N\left(0, (1 - 2q)^2\, q(1 - q)\right).$$
Notice, however, that $g'(q) = 1 - 2q$, so if $q = 1/2$ the limiting variance is $0$. Then the idea of constructing a confidence interval from this normal approximation makes no sense.
Solution 1.53 (Second Order Approximation for the Delta Method) Use the idea in the proof of the Delta Method and derive another limiting distribution for $g(\bar{x}_n)$ at $q = 1/2$. Note that
$$g(\bar{x}_n) - g(q) = Dg(q)\left(\bar{x}_n - q\right) + \frac{D^2 g(q)}{2}\left(\bar{x}_n - q\right)^2 + R\left(|\bar{x}_n - q|\right), \tag{54}$$
where $R(|\bar{x}_n - q|) = o\left(|\bar{x}_n - q|^2\right)$. At $q = 1/2$, $Dg(q) = 0$ and $D^2 g(q) = -2$, so
$$n\left(g(\bar{x}_n) - g(q)\right) = -\left(\sqrt{n}\left(\bar{x}_n - q\right)\right)^2 + o_p(1) \xrightarrow{d} -\left(N\left(0, \tfrac{1}{4}\right)\right)^2 = -\frac{1}{4}\chi^2_1. \tag{55}$$
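A simulation (my addition; the $q$ values, $n$, and number of replications are arbitrary) compares the first-order Delta-Method variance $(1 - 2q)^2 q(1 - q)$ with the Monte Carlo variance of $\sqrt{n}(g(\bar{x}_n) - g(q))$, and shows the degenerate case $q = 1/2$ where the second-order limit $-\frac{1}{4}\chi^2_1$ (with mean $-1/4$) takes over under the scaling $n$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 2_000, 50_000
g = lambda p: p * (1 - p)

for q in (0.2, 0.5):
    xbar = rng.binomial(n, q, size=reps) / n
    z = np.sqrt(n) * (g(xbar) - g(q))
    print(f"q={q}: MC var of sqrt(n)(g(xbar)-g(q)) = {z.var():.5f}, "
          f"delta-method var = {(1 - 2 * q)**2 * q * (1 - q):.5f}")

# At q = 1/2 the first-order variance is 0; the right scaling is n, not sqrt(n):
q = 0.5
xbar = rng.binomial(n, q, size=reps) / n
w = n * (g(xbar) - g(q))          # approx -(1/4) * chi^2_1, whose mean is -1/4
print(f"q=0.5: mean of n(g(xbar)-g(q)) = {w.mean():.4f} (limit mean = -0.25)")
```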
Theorem 1.54 (Cauchy-Schwarz Inequality) Let $U, Y$ be any two random variables such that $E[U^2], E[Y^2] < \infty$. Then
$$\left(E[UY]\right)^2 \leq E[U^2]\,E[Y^2]. \tag{56}$$

Proof. Firstly, show that $E[UY]$ exists, i.e. $E[|UY|] < \infty$:
$$\left(|U| - |Y|\right)^2 \geq 0 \iff \frac{U^2}{2} + \frac{Y^2}{2} \geq |UY| \;\Rightarrow\; \frac{E[U^2]}{2} + \frac{E[Y^2]}{2} \geq E[|UY|].$$
Secondly, show the inequality:
$$0 \leq E\left[(U - \lambda Y)^2\right] = E[U^2] - 2\lambda E[UY] + \lambda^2 E[Y^2].$$
Picking the minimizer $\lambda = E[UY]/E[Y^2]$ (when $E[Y^2] > 0$) and rearranging yields the inequality; the case $E[Y^2] = 0$ is immediate.
Example 1.55 Let $\{(x_i, y_i) : i \geq 1\}$ be an i.i.d. sequence with distribution $P$. Also, let $E[x_i^2], E[y_i^2] < \infty$ and
$$\operatorname{Cov}[x_i, y_i] = \sigma_{x,y} = E\left[(x_i - E[x_i])(y_i - E[y_i])\right] = E[x_i y_i] - E[x_i]\,E[y_i].$$
$E[x_i y_i]$ exists by the Cauchy-Schwarz Inequality. If in addition $\operatorname{Var}(x_i), \operatorname{Var}(y_i) > 0$, then
$$\rho_{x,y} = \frac{\operatorname{Cov}[x_i, y_i]}{\sqrt{\operatorname{Var}(x_i)\,\operatorname{Var}(y_i)}}.$$
Also, $|\rho_{x,y}| \leq 1$ by the Cauchy-Schwarz Inequality. An obvious estimator of $\rho_{x,y}$ comes from replacing each of its components with a consistent estimator.
1.8 Tightness or Boundedness in Probability

Sometimes $x_n$ may not have a limiting distribution, but a weaker property such as tightness may hold.

Definition 1.56 (Tightness) $\{x_n : n \geq 1\}$ is tight if $\forall \varepsilon > 0$ $\exists B > 0$ such that
$$\inf_n \Pr\{|x_n| \leq B\} \geq 1 - \varepsilon.$$
This definition generalizes the idea of boundedness of a sequence of real numbers to a sequence of random variables. (That is why the concept is sometimes referred to as bounded in probability.)

Notation 1.57 If $\tau_n(\hat{\theta}_n - \theta_0)$ is tight, we say $\hat{\theta}_n$ is $\tau_n$-consistent.

Exercise 1.58 Show that $\tau_n$-consistency (with $\tau_n \to \infty$) implies consistency.

Exercise 1.59 Prove that if $x_n \xrightarrow{d} x$, then $x_n$ is tight.
Theorem 1.60 (Prohorov) Let $\{x_n : n \geq 1\}$ be a sequence of random variables. Assume $x_n$ is tight. Then there exists a subsequence $x_{n_j}$ and a random variable $x$ such that
$$x_{n_j} \xrightarrow{d} x.$$

1.8.1 Stochastic Order Symbols
Definition 1.61 If $\{a_n : n \geq 1\}$ is a sequence of real numbers such that $a_n \to 0$, then
$$a_n = o(1). \tag{57}$$

Definition 1.62 By analogy, if $\{x_n : n \geq 1\}$ is a sequence of random variables such that $x_n \xrightarrow{p} 0$, then
$$x_n = o_p(1). \tag{58}$$

Notation 1.63 If $\{x_n : n \geq 1\}$ is a sequence of tight random variables, then
$$x_n = O_p(1). \tag{59}$$

Notation 1.64 More generally, $x_n = o_p(R_n)$ means
$$x_n = R_n y_n \text{ for some } y_n = o_p(1). \tag{60}$$

Notation 1.65 Also, $x_n = O_p(R_n)$ means
$$x_n = R_n y_n \text{ for some } y_n = O_p(1). \tag{61}$$
Conjecture 1.66 Let $x_n = o_p(1)$ and $y_n = O_p(1)$. Then $x_n y_n = o_p(1)$.

Proof. Assume the statement does not hold, so that for some $\varepsilon > 0$,
$$\Pr\{|x_n y_n| > \varepsilon\} \not\to 0 \text{ as } n \to \infty.$$
Then $\exists \delta > 0$ and a subsequence $n_j$ such that
$$\Pr\{|x_{n_j} y_{n_j}| > \varepsilon\} \to \delta \text{ as } j \to \infty.$$
But $y_{n_j}$ is tight, so by Prohorov's Theorem there is a further subsequence $n_{j_l}$ such that
$$y_{n_{j_l}} \xrightarrow{d} y \quad \text{while} \quad x_{n_{j_l}} \xrightarrow{p} 0.$$
By Slutsky's Theorem,
$$y_{n_{j_l}} x_{n_{j_l}} \xrightarrow{d} 0, \text{ hence } \xrightarrow{p} 0,$$
which contradicts $\Pr\{|x_{n_{j_l}} y_{n_{j_l}}| > \varepsilon\} \to \delta > 0$ and completes the proof.
2 Conditional Expectations

Let $(y, x)$ be a random vector where $y$ takes values in $\mathbb{R}$ and $x$ takes values in $\mathbb{R}^k$. Suppose that $E[y^2] < \infty$ and define
$$M = \left\{m(x) \mid m : \mathbb{R}^k \to \mathbb{R},\; E[m^2(x)] < \infty\right\}. \tag{62}$$
Consider the following minimization problem:
$$\inf_{m(x) \in M} E\left[(y - m(x))^2\right].$$

Claim 2.1 A solution exists, i.e. $\exists m^*(x) \in M$ such that
$$E\left[(y - m^*(x))^2\right] = \inf_{m(x) \in M} E\left[(y - m(x))^2\right]. \tag{63}$$

Claim 2.2 Any two solutions $m^*(x)$ and $\tilde{m}(x)$ satisfy
$$\Pr\{m^*(x) = \tilde{m}(x)\} = 1. \tag{64}$$
Definition 2.3 Let $E[y|x]$ denote any solution $m^*(x)$. It is the best predictor of $y$ given $x$ (in the mean squared error sense).

Theorem 2.4 $m^*(x)$ is a solution to the minimization problem above iff
$$E\left[(y - m^*(x))\,m(x)\right] = 0 \quad \forall m(x) \in M. \tag{65}$$

Proof. For any $m(x) \in M$, note that
$$E\left[(y - m(x))^2\right] = E\left[(y - m^*(x) + m^*(x) - m(x))^2\right] = E\left[(y - m^*(x))^2\right] + 2E\left[(y - m^*(x))(m^*(x) - m(x))\right] + E\left[(m^*(x) - m(x))^2\right]. \tag{66}$$
Suppose orthogonality holds. Then the middle term equals zero (since $m^*(x) - m(x) \in M$), so
$$E\left[(y - m(x))^2\right] \geq E\left[(y - m^*(x))^2\right].$$
Conversely, suppose $m^*(x)$ solves the minimization problem. Let $m(x) \in M$ be given and consider $m^*(x) + \lambda m(x) \in M$. Then,
$$0 \geq E\left[(y - m^*(x))^2\right] - E\left[(y - m^*(x) - \lambda m(x))^2\right] = -\lambda^2 E\left[m^2(x)\right] + 2\lambda E\left[(y - m^*(x))\,m(x)\right] \quad \forall \lambda.$$
Since this holds for all $\lambda$, positive and negative and arbitrarily small, the cross term must be zero.

Exercise 2.5 Show that the solution is unique (in the sense of Claim 2.2).

The condition $E[y^2] < \infty$ is not necessary. All that is required is $E[|y|] < \infty$. In this case $E[y|x]$ is any $m(x)$ such that $E[|m(x)|] < \infty$ and $E\left[(y - m(x))\,I\{x \in B\}\right] = 0$ for essentially all sets $B$.
2.1 Properties of the Conditional Expectation

Assume all the relevant conditional expectations exist.

1) If $y = f(x)$, then $E[y|x] = f(x)$.

2) $E[y + z|x] = E[y|x] + E[z|x]$, since
$$E\left[(y + z - E[y|x] - E[z|x])\,I\{x \in B\}\right] = E\left[(y - E[y|x])\,I\{x \in B\}\right] + E\left[(z - E[z|x])\,I\{x \in B\}\right] = 0. \tag{67}$$

3) $E[f(x)\,y\,|\,x] = f(x)\,E[y|x]$.

4) If $\Pr\{y \geq 0\} = 1$, then $\Pr\{E[y|x] \geq 0\} = 1$.

5) $E[y] = E[E[y|x]]$. More generally (law of iterated expectations),
$$E\left[E[y|x_1, x_2]\,\middle|\,x_1\right] = E[y|x_1]. \tag{68}$$
We need to show that
$$E\left[(E[y|x_1, x_2] - E[y|x_1])\,I\{x_1 \in B\}\right] = 0.$$
Note that
$$E\left[(E[y|x_1, x_2] - E[y|x_1])\,I\{x_1 \in B\}\right] = E\left[E[y|x_1, x_2]\,I\{x_1 \in B\}\right] - E\left[E[y|x_1]\,I\{x_1 \in B\}\right] = E\left[y\,I\{x_1 \in B\}\right] - E\left[y\,I\{x_1 \in B\}\right] = 0, \tag{69}$$
using in each term the defining orthogonality property, with $I\{x_1 \in B\}$ viewed as a function of $(x_1, x_2)$ and of $x_1$ respectively.

6) If $x \perp y$, then
$$E[y|x] = E[y]. \tag{70}$$
Exercise 2.6 Show that independence implies mean independence.

Exercise 2.7 Show that mean independence implies uncorrelatedness.
3 Linear Regression

Let $(y, x, u)$ be a random vector such that $y$ and $u$ take values in $\mathbb{R}$ and $x$ takes values in $\mathbb{R}^{k+1}$. Also, suppose that $x' = (x_0, \ldots, x_k)$ with $x_0 = 1$. Then, $\exists \beta \in \mathbb{R}^{k+1}$ such that
$$y = x'\beta + u. \tag{71}$$
There are several possible interpretations of this model.

1. Linear Conditional Expectation

Assume $E[y|x] = x'\beta$ and define $u = y - E[y|x] = y - x'\beta$. By definition,
$$y = x'\beta + u. \tag{72}$$
Note that $E[u|x] = E[y|x] - E\left[E[y|x]\,\middle|\,x\right] = 0$ (which implies $E[ux] = 0$). In this case $\beta$ is a convenient way of summarizing a feature of the distribution of $(y, x)$, namely $E[y|x]$.

2. Best Linear Approximation to $E[y|x]$, or Best Linear Predictor of $y$ given $x$

$E[y|x]$ may not be linear, so the interest is in the best linear approximation, i.e.
$$\min_{b \in \mathbb{R}^{k+1}} E\left[(E[y|x] - x'b)^2\right].$$
Claim 3.1 The problem is equivalent to
$$\min_{b \in \mathbb{R}^{k+1}} E\left[(y - x'b)^2\right].$$

Proof. Let $V := E[y|x] - y$. Then
$$E\left[(E[y|x] - x'b)^2\right] = E\left[(E[y|x] - y + y - x'b)^2\right] = E[V^2] + 2E\left[V(y - x'b)\right] + E\left[(y - x'b)^2\right] = E[V^2] + 2E[Vy] - 2E[Vx'b] + E\left[(y - x'b)^2\right], \tag{73}$$
which completes the proof because $E[Vx'b] = 0$ (by the orthogonality property of the conditional expectation, $x'b$ being a function of $x$) and the first two terms do not depend on $b$.

Now, notice that
$$\min_{b \in \mathbb{R}^{k+1}} E\left[(y - x'b)^2\right]$$
has the following first-order condition at the solution $\beta$:
$$E\left[x(y - x'\beta)\right] = 0, \tag{74}$$
so if $u = y - x'\beta$, then
$$E[xu] = 0. \tag{75}$$

3. Causal Model of $y$ given $x$ (and other stuff!)

Assume $y = g(x, u)$, where $x$ are the observed determinants of $y$ and $u$ the unobserved ones. Such a $g(\cdot)$ is a model of $y$ as a function of $x$ and $u$. In this case the effect of $x_j$ on $y$, holding $x_{-j}$ and $u$ constant, is determined by $g(\cdot)$. Assume further that $g(x, u) = x'\beta + u$; then $\beta_j$ has a causal interpretation, but nothing guarantees that $E[ux] = 0$ or $E[u|x] = 0$.
3.1 Estimation

Let $(x, y, u)$ be a random vector, where $y$ and $u$ take values in $\mathbb{R}$ and $x$ in $\mathbb{R}^{k+1}$, and the first element of $x$ is equal to 1. Then, there exists $\beta \in \mathbb{R}^{k+1}$ such that $y = x'\beta + u$, with $E[xu] = 0$, $E[xx'] < \infty$, and no perfect collinearity in $x$. There is perfect collinearity in $x$ if $\exists c \in \mathbb{R}^{k+1}$ such that $c \neq 0$ and $\Pr\{c'x = 0\} = 1$.

To solve for $\beta$, notice that
$$E[xu] = 0 \;\Rightarrow\; E\left[x(y - x'\beta)\right] = 0 \;\Rightarrow\; E[xx']\,\beta = E[xy]. \tag{76}$$
Then, if $E[xx']$ is invertible,
$$\beta = \left(E[xx']\right)^{-1} E[xy]. \tag{77}$$

Lemma 3.2 $E[xx']$ is invertible iff there is no perfect collinearity in $x$.

Proof. ($\Rightarrow$) Suppose $E[xx']$ is invertible. Assume that there is perfect collinearity; then
$$\exists c \in \mathbb{R}^{k+1} : c \neq 0, \; \Pr\{c'x = 0\} = 1. \tag{78}$$
Then
$$E[xx']\,c = E[x(x'c)] = 0,$$
so $E[xx']$ is not invertible, which is a contradiction.

($\Leftarrow$) Suppose there is no perfect collinearity in $x$. Assume $E[xx']$ is not invertible; then
$$\exists c \neq 0 : E[xx']\,c = 0 \;\Rightarrow\; c'E[xx']\,c = 0 \;\Rightarrow\; E\left[(c'x)^2\right] = 0 \;\Rightarrow\; \Pr\{c'x = 0\} = 1, \tag{79}$$
which contradicts that there is no perfect collinearity.
Remark 3.3 If there is perfect collinearity, then there exist many solutions to $E[xx']\,\beta = E[xy]$. However, any two solutions $\hat{\beta}, \tilde{\beta}$ satisfy
$$\Pr\{x'\hat{\beta} = x'\tilde{\beta}\} = 1. \tag{80}$$
However, the interpretation of the two values may differ in the causal case.
3.1.1 Estimating Subvectors

Let
$$y = x_1'\beta_1 + x_2'\beta_2 + u. \tag{81}$$
From the above results,
$$\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \left(E\begin{bmatrix} x_1 x_1' & x_1 x_2' \\ x_2 x_1' & x_2 x_2' \end{bmatrix}\right)^{-1} E\begin{bmatrix} x_1 y \\ x_2 y \end{bmatrix}. \tag{82}$$
In this case the partitioned matrix inverse formula may help to get the best linear predictor. However, there is another method. For $A$ and $B$, denote by $BLP(A|B)$ the best linear predictor of $A$ given $B$ (defined component-wise when $A$ is a vector). Define
$$\tilde{y} = y - BLP(y|x_2) \tag{83}$$
and
$$\tilde{x}_1 = x_1 - BLP(x_1|x_2). \tag{84}$$
Also, consider
$$\tilde{y} = \tilde{x}_1'\tilde{\beta}_1 + \tilde{u}, \tag{85}$$
where $E[\tilde{x}_1 \tilde{u}] = 0$, as in the BLP interpretation.
Claim 3.4 (Population Counterpart of the Frisch-Waugh-Lovell Decomposition) $\tilde{\beta}_1 = \beta_1$.

Proof.
$$\begin{aligned}
\tilde{\beta}_1 &= \left(E[\tilde{x}_1 \tilde{x}_1']\right)^{-1} E[\tilde{x}_1 \tilde{y}] \\
&= \left(E[\tilde{x}_1 \tilde{x}_1']\right)^{-1} \left(E[\tilde{x}_1 y] - E\left[\tilde{x}_1\, BLP(y|x_2)\right]\right) \\
&= \left(E[\tilde{x}_1 \tilde{x}_1']\right)^{-1} \left(E[\tilde{x}_1 x_1']\,\beta_1 + E[\tilde{x}_1 x_2']\,\beta_2 + E[\tilde{x}_1 u]\right) \\
&= \left(E[\tilde{x}_1 \tilde{x}_1']\right)^{-1} E[\tilde{x}_1 \tilde{x}_1']\,\beta_1 \\
&= \beta_1. \tag{86}
\end{aligned}$$
Note that $E\left[\tilde{x}_1\, BLP(y|x_2)\right] = 0$ and $E[\tilde{x}_1 x_2'] = 0$ by the properties of the BLP, $E[\tilde{x}_1 u] = 0$ because $E[xu] = 0$, and $E[\tilde{x}_1 x_1'] = E\left[\tilde{x}_1\left(\tilde{x}_1 + BLP(x_1|x_2)\right)'\right] = E[\tilde{x}_1 \tilde{x}_1']$.

In the special case $x_2 = 1$, the result shows that with
$$\tilde{y} = y - E[y], \qquad \tilde{x}_1 = x_1 - E[x_1], \tag{87}$$
we get, for scalar $x_1$,
$$\beta_1 = \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}. \tag{88}$$
3.1.2 Omitted Variables Bias

Suppose
$$y = x_1'\beta_1 + x_2'\beta_2 + u \tag{89}$$
and
$$y = x_1'\tilde{\beta}_1 + \tilde{u}, \tag{90}$$
where $E[x_1 \tilde{u}] = 0$ (so $\tilde{\beta}_1$ is the coefficient of the short regression). Typically $\tilde{\beta}_1 \neq \beta_1$. In fact,
$$\tilde{\beta}_1 = \left(E[x_1 x_1']\right)^{-1} E[x_1 y] = \beta_1 + \left(E[x_1 x_1']\right)^{-1} E[x_1 x_2']\,\beta_2. \tag{91}$$

3.1.3 Ordinary Least Squares
Let $(x, y, u)$ be a random vector, where $y$ and $u$ take values in $\mathbb{R}$ and $x$ in $\mathbb{R}^{k+1}$. Also, let $P$ be the distribution of $(y, x)$, and let $\{(y_i, x_i) : i \geq 1\}$ be an i.i.d. sequence from $P$. The natural estimator of $\beta = \left(E[xx']\right)^{-1} E[xy]$ is therefore
$$\hat{\beta}_n = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1} \frac{1}{n}\sum_{i=1}^{n} x_i y_i. \tag{92}$$
Also, this is the solution to
$$\min_{b \in \mathbb{R}^{k+1}} \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i'b\right)^2,$$
whose first-order condition is
$$\frac{1}{n}\sum_{i=1}^{n} x_i\left(y_i - x_i'\hat{\beta}_n\right) = 0, \tag{93}$$
where $\hat{u}_i := y_i - x_i'\hat{\beta}_n$ is the $i$th residual. Also, note that
$$y_i = \hat{y}_i + \hat{u}_i \tag{94}$$
and by definition
$$\frac{1}{n}\sum_{i=1}^{n} x_i \hat{u}_i = 0. \tag{95}$$
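A minimal sketch (my addition; the data-generating process is made up) computes $\hat{\beta}_n$ from (92) and checks the first-order condition (95).

```python
import numpy as np

rng = np.random.default_rng(9)
n, beta = 1_000, np.array([1.0, 2.0, -0.5])

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # first regressor is 1
u = rng.normal(size=n)
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (sum x_i x_i')^{-1} sum x_i y_i
resid = y - X @ beta_hat
print("beta_hat =", beta_hat)
print("FOC (1/n) sum x_i u_hat_i =", (X.T @ resid) / n)      # approximately 0
```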
OLS as a Projection Now stack the observations as $Y = (y_1, \ldots, y_n)'$, $U = (u_1, \ldots, u_n)'$, $X = (x_1, \ldots, x_n)'$, and similarly for the hatted quantities. In this notation,
$$\hat{\beta}_n = (X'X)^{-1} X'Y \tag{96}$$
and $\hat{\beta}_n$ is equivalently the solution to
$$\min_b\, (Y - Xb)'(Y - Xb).$$

[Graph 1, "OLS as a Projection", is omitted here: it depicts $Y$, the column space of $X$, and the orthogonal projection $X\hat{\beta}_n$ of $Y$ onto that column space.]

As shown in Graph 1, $X\hat{\beta}_n$ is the projection of $Y$ onto the column space of $X$. The projection, in this case, is carried out by the matrix $P = X(X'X)^{-1}X'$. The matrix $M = I - P$ is also a projection: it projects onto the linear space orthogonal to the column space of $X$.

Exercise 3.5 Show that the matrix $M$ is idempotent.

Note that
$$Y = PY + MY = \hat{Y} + \hat{U}, \tag{97, 98}$$
which is why $M$ is called the residual maker.
Residual Regression Define $X_1$ and $X_2$ and let $P_1$ and $P_2$ be their respective projection matrices. Also, let
$$M_1 = I - P_1, \qquad M_2 = I - P_2. \tag{99}$$
One may write
$$Y = \hat{Y} + \hat{U} = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{U}. \tag{100}$$
Then,
$$M_2 Y = M_2 X_1\hat{\beta}_1 + M_2 X_2\hat{\beta}_2 + M_2\hat{U} = M_2 X_1\hat{\beta}_1 + 0 + \hat{U}, \tag{101}$$
and hence
$$\left((M_2 X_1)'(M_2 X_1)\right)^{-1} (M_2 X_1)'\, M_2 Y = \hat{\beta}_1,$$
which is the residual regression, or the (sample) FWL decomposition.
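The sample FWL result can be checked directly (my addition; the design below is invented): regressing $M_2 Y$ on $M_2 X_1$ reproduces the $X_1$ block of the full OLS fit.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.5, 2.0]) + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]             # full regression

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)       # annihilator of X2
beta1_fwl = np.linalg.lstsq(M2 @ X1, M2 @ y, rcond=None)[0]  # residual regression

print("beta_1 from full OLS:", beta_full[:2])
print("beta_1 from FWL     :", beta1_fwl)
```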
3.1.4 Properties of the OLS Estimator

Let the distributional setup be the same as above. Also, assume that $E[U|X] = 0$. Then
$$E[\hat{\beta}_n] = \beta. \tag{102}$$
In fact,
$$E[\hat{\beta}_n \mid x_1, \ldots, x_n] = \beta. \tag{103}$$
This follows because
$$\hat{\beta}_n = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + U) = \beta + (X'X)^{-1}X'U,$$
so
$$E[\hat{\beta}_n \mid x_1, \ldots, x_n] = \beta + (X'X)^{-1}X'E[U \mid x_1, \ldots, x_n] = \beta, \tag{104}$$
using the fact that $E[u_i \mid x_1, \ldots, x_n] = E[u_i \mid x_i] = 0$ because the observations are i.i.d.
Gauss-Markov Theorem

1. Assume $E[u|x] = 0$ and $\operatorname{Var}[u|x] = \operatorname{Var}[u] = \sigma^2$, i.e. the variance is homoskedastic. (In general $\operatorname{Var}[u|x] \neq \operatorname{Var}[u]$; homoskedasticity is an assumption.)

2. Consider linear estimators of $\beta$, i.e. estimators of the form $A'Y$ for some matrix $A = A(x_1, \ldots, x_n)$; the choice $A' = (X'X)^{-1}X'$ yields $\hat{\beta}_n$. Such an estimator is unbiased iff
$$E[A'Y \mid x_1, \ldots, x_n] = A'E[Y \mid x_1, \ldots, x_n] = A'X\beta = \beta \tag{105}$$
for all $\beta$, which means $A'X = I$.

3. "Best" means "smallest variance".

Remark 3.6 For two symmetric matrices, $V \geq V^*$ means $V - V^* \geq 0$, i.e. positive semi-definite.

Then, one needs to show that for all choices of $A$ with $A'X = I$, the matrix $A'A - (X'X)^{-1}$ is positive semi-definite (under homoskedasticity the conditional variances of the two estimators are $\sigma^2 A'A$ and $\sigma^2 (X'X)^{-1}$). To show this, define
$$C := A - X(X'X)^{-1}. \tag{106}$$
Then,
$$A'A - (X'X)^{-1} = \left(C + X(X'X)^{-1}\right)'\left(C + X(X'X)^{-1}\right) - (X'X)^{-1} = C'C + C'X(X'X)^{-1} + (X'X)^{-1}X'C + (X'X)^{-1} - (X'X)^{-1} = C'C \geq 0, \tag{107}$$
because $X'C = X'A - X'X(X'X)^{-1} = I - I = 0$ (using $A'X = I$), which completes the proof.
3.1.5 Consistency of OLS

Claim 3.7 Assume that $E[xx']$ and $E[xy]$ exist (and $E[xx']$ is invertible); then the OLS estimator is consistent.

Proof. Recall that
$$\beta = \left(E[xx']\right)^{-1}E[xy], \qquad \hat{\beta}_n = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i\right). \tag{108}$$
By the WLLN,
$$\frac{1}{n}\sum_{i=1}^{n} x_i x_i' \xrightarrow{p} E[xx'], \qquad \frac{1}{n}\sum_{i=1}^{n} x_i y_i \xrightarrow{p} E[xy],$$
and by the Continuous Mapping Theorem (matrix inversion and multiplication being continuous at the limit),
$$\hat{\beta}_n \xrightarrow{p} \beta \text{ as } n \to \infty,$$
which completes the proof.
3.1.6 Asymptotic Distribution of OLS

Note that
$$\hat{\beta}_n = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n} x_i y_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n} x_i\left(x_i'\beta + u_i\right) = \beta + \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n} x_i u_i. \tag{109}$$
Then,
$$\sqrt{n}\left(\hat{\beta}_n - \beta\right) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i u_i\right), \tag{110}$$
and if we assume $E[xx']$ exists (and is invertible) and $\operatorname{Var}(xu)$ exists,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i u_i \xrightarrow{d} N\left(0, \operatorname{Var}(xu)\right)$$
by the Central Limit Theorem (using $E[xu] = 0$). By the CMT and Slutsky's Theorem,
$$\sqrt{n}\left(\hat{\beta}_n - \beta\right) \xrightarrow{d} N(0, \Omega),$$
where
$$\Omega = \left(E[xx']\right)^{-1}\operatorname{Var}(xu)\left(E[xx']\right)^{-1}. \tag{111}$$
3.1.7 Consistent Estimation of the Variance

If one wants to estimate $\Omega = \left(E[xx']\right)^{-1}\operatorname{Var}(xu)\left(E[xx']\right)^{-1}$, the relevant part is to estimate $\operatorname{Var}(xu)$. Assume that $u$ is homoskedastic, i.e. $E[u|x] = 0$ and $\operatorname{Var}[u|x] = \operatorname{Var}[u] = \sigma^2$. Then,
$$\operatorname{Var}(xu) = E[xx'u^2] = E\left[xx'\,E[u^2|x]\right] = E[xx']\,\operatorname{Var}(u). \tag{112}$$
Then, what one needs to do is to estimate $\operatorname{Var}(u)$. A natural choice for this is
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2. \tag{113}$$

Claim 3.8 $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2$ is a consistent estimator of $\operatorname{Var}(u)$ when there is homoskedasticity.

Proof. Recall that
$$\hat{u}_i = y_i - x_i'\hat{\beta}_n = y_i - x_i'\beta + x_i'\beta - x_i'\hat{\beta}_n = u_i - x_i'\left(\hat{\beta}_n - \beta\right). \tag{114}$$
Then,
$$\hat{u}_i^2 = u_i^2 - 2u_i x_i'\left(\hat{\beta}_n - \beta\right) + \left(x_i'\left(\hat{\beta}_n - \beta\right)\right)^2, \tag{115}$$
which implies
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}u_i^2 - \frac{2}{n}\sum_{i=1}^{n}u_i x_i'\left(\hat{\beta}_n - \beta\right) + \frac{1}{n}\sum_{i=1}^{n}\left(x_i'\left(\hat{\beta}_n - \beta\right)\right)^2. \tag{116}$$
First, note that
$$\frac{1}{n}\sum_{i=1}^{n}u_i x_i'\left(\hat{\beta}_n - \beta\right) = O_p(1)\,o_p(1) = o_p(1),$$
since $\frac{1}{n}\sum_i u_i x_i' = O_p(1)$ and $\hat{\beta}_n - \beta = o_p(1)$. Also, notice that
$$\frac{1}{n}\sum_{i=1}^{n}\left(x_i'\left(\hat{\beta}_n - \beta\right)\right)^2 \leq \left(\frac{1}{n}\sum_{i=1}^{n}|x_i|^2\right)\left|\hat{\beta}_n - \beta\right|^2, \tag{117}$$
where $\frac{1}{n}\sum_{i=1}^{n}|x_i|^2 \xrightarrow{p} E[|x_i|^2] = O_p(1)$ and $\left|\hat{\beta}_n - \beta\right|^2 = o_p(1)$. Then,
$$\frac{1}{n}\sum_{i=1}^{n}\left(x_i'\left(\hat{\beta}_n - \beta\right)\right)^2 = o_p(1). \tag{118}$$
Finally, $\frac{1}{n}\sum_{i=1}^{n}u_i^2 \xrightarrow{p} \operatorname{Var}(u)$ by the WLLN, so
$$\hat{\sigma}^2 \xrightarrow{p} \operatorname{Var}(u),$$
which completes the proof.
Now, assume that the error term is heteroskedastic. Then the problem is how to estimate $E[xx'u^2] = \operatorname{Var}(xu)$. A natural estimator is
$$\hat{\Omega}_u = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'\,\hat{u}_i^2. \tag{119}$$
The idea is to show that this estimator is consistent. In order to do so, the following lemma is required.

Lemma 3.9 Let $z_1, \ldots, z_n$ be an i.i.d. sequence such that $E[|z_i|^r] < \infty$. Then
$$n^{-1/r}\max_{1 \leq i \leq n}|z_i| \xrightarrow{p} 0.$$

Proof. Not shown in class.

Claim 3.10 $\hat{\Omega}_u = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'\,\hat{u}_i^2$ is a consistent estimator of $\operatorname{Var}(xu)$ when there is heteroskedasticity.

Proof. Notice that
$$\hat{\Omega}_u = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'\,\hat{u}_i^2 = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'\,u_i^2 + \frac{1}{n}\sum_{i=1}^{n} x_i x_i'\left(\hat{u}_i^2 - u_i^2\right). \tag{120}$$
Now notice that
$$\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\,u_i^2 \xrightarrow{p} \operatorname{Var}(xu)$$
by the WLLN. Then, the proof is complete if $\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\left(\hat{u}_i^2 - u_i^2\right) = o_p(1)$ is shown. Note that (here $\|\cdot\|$ takes the element-wise absolute value of the matrix, so the inequalities hold element-wise)
$$\left\|\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\left(\hat{u}_i^2 - u_i^2\right)\right\| \leq \frac{1}{n}\sum_{i=1}^{n}\left\|x_i x_i'\right\|\left|\hat{u}_i^2 - u_i^2\right| \leq \left(\frac{1}{n}\sum_{i=1}^{n}\left\|x_i x_i'\right\|\right)\max_{1 \leq i \leq n}\left|\hat{u}_i^2 - u_i^2\right|. \tag{121}$$
If $E\left[\|xx'\|\right] < \infty$, then $\frac{1}{n}\sum_{i=1}^{n}\left\|x_i x_i'\right\| = O_p(1)$, and $\max_{1 \leq i \leq n}\left|\hat{u}_i^2 - u_i^2\right| = o_p(1)$ because
$$\left|\hat{u}_i^2 - u_i^2\right| \leq 2|u_i||x_i|\left|\hat{\beta}_n - \beta\right| + |x_i|^2\left|\hat{\beta}_n - \beta\right|^2 \tag{122}$$
and
$$\max_{1 \leq i \leq n}|u_i||x_i|\left|\hat{\beta}_n - \beta\right| = \left(n^{-1/2}\max_{1 \leq i \leq n}|u_i||x_i|\right)\left(\sqrt{n}\left|\hat{\beta}_n - \beta\right|\right) \xrightarrow{p} 0$$
by Lemma 3.9 (with $r = 2$) and the fact that $\sqrt{n}\left(\hat{\beta}_n - \beta\right) = O_p(1)$; the same argument applies to the term with $|x_i|^2\left|\hat{\beta}_n - \beta\right|^2$. Notice that this is true if $E\left[u^2|x|^2\right] < \infty$.
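A sketch of the heteroskedasticity-robust ("sandwich") variance estimator built from (119) and (111) (my addition; the heteroskedastic design below is invented):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))       # Var(u|x) depends on x
y = X @ np.array([1.0, 2.0]) + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ beta_hat

Sxx_inv = np.linalg.inv(X.T @ X / n)                   # (1/n sum x_i x_i')^{-1}
meat = (X * uhat[:, None]**2).T @ X / n                # (1/n) sum x_i x_i' uhat_i^2
Omega_hat = Sxx_inv @ meat @ Sxx_inv                   # estimate of Omega in (111)
se = np.sqrt(np.diag(Omega_hat) / n)                   # standard errors of beta_hat
print("beta_hat =", beta_hat, " robust s.e. =", se)
```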
3.2 Hypothesis Testing

Let $(x, y, u)$ be a random vector, where $y$ and $u$ take values in $\mathbb{R}$ and $x$ in $\mathbb{R}^{k+1}$. Recall that
$$\sqrt{n}\left(\hat{\beta}_n - \beta\right) \xrightarrow{d} N(0, \Omega),$$
where
$$\Omega = \left(E[xx']\right)^{-1}\operatorname{Var}(xu)\left(E[xx']\right)^{-1}. \tag{123}$$
This uses the following assumptions: $E[|x|^2] < \infty$, $E[u^2|x|^2] < \infty$, and $E[xu] = 0$; the last of these implies $\operatorname{Var}(xu) = E[xx'u^2]$. It is reasonable to assume that $\Omega$ is invertible. For this it suffices that there is some noise in the model to be estimated, i.e.
$$\Pr\left\{E[u^2|x] > 0\right\} = 1. \tag{124}$$
Then $\operatorname{Var}(xu)$, and hence $\Omega$, is invertible, i.e. positive definite. To see this, take $c \neq 0$ and note that
$$c'E[xx'u^2]\,c = E\left[(c'x)^2 u^2\right] = E\left[(c'x)^2 E[u^2|x]\right] > 0, \tag{125}$$
using the no-perfect-collinearity assumption ($\Pr\{c'x = 0\} < 1$).
3.2.1 Testing a Single Linear Restriction

In this case,
$$H_0 : r'\beta = 0 \quad \text{vs.} \quad H_1 : r'\beta \neq 0,$$
where $r$ is a $(k+1) \times 1$ vector. By Slutsky's Theorem (applied to the limit above),
$$\sqrt{n}\left(r'\hat{\beta}_n - r'\beta\right) \xrightarrow{d} N(0, r'\Omega r). \tag{127}$$
Now the $t$-statistic is the following:
$$t_n = \frac{\sqrt{n}\,r'\hat{\beta}_n}{\sqrt{r'\hat{\Omega}_n r}}, \tag{128}$$
where $\hat{\Omega}_n$ is the consistent estimator of $\Omega$. Note that, by Slutsky's Theorem,
$$\frac{\sqrt{n}\left(r'\hat{\beta}_n - r'\beta\right)}{\sqrt{r'\hat{\Omega}_n r}} \xrightarrow{d} N(0, 1),$$
so $t_n \xrightarrow{d} N(0, 1)$ under $H_0$, because $r'\hat{\Omega}_n r \xrightarrow{p} r'\Omega r$ (again by the Continuous Mapping Theorem). Now, the idea is to show that the test is consistent in level, i.e.
$$\lim_{n \to \infty}\Pr\left\{|t_n| > z_{1-\alpha/2}\right\} = \alpha \tag{129}$$
when $H_0$ is true. Then,
$$\lim_{n \to \infty}\Pr\left\{|t_n| > z_{1-\alpha/2}\right\} = \lim_{n \to \infty}\left[\Pr\left\{t_n > z_{1-\alpha/2}\right\} + \Pr\left\{t_n < -z_{1-\alpha/2}\right\}\right] = \left(1 - \Phi(z_{1-\alpha/2})\right) + \Phi(-z_{1-\alpha/2}) = \alpha, \tag{130}$$
where $z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$.
3.2.2 Testing Multiple Linear Restrictions

In this case,
$$H_0 : R\beta - r = 0 \quad \text{vs.} \quad H_1 : R\beta - r \neq 0, \tag{131}$$
where $R$ is a $p \times (k+1)$ matrix whose rows are linearly independent and $r$ is a $p \times 1$ vector. Also, note that $R\Omega R'$ is invertible (using the fact that $\Omega$ is invertible and $R$ has full row rank). By the CMT,
$$\sqrt{n}\left(R\hat{\beta}_n - R\beta\right) \xrightarrow{d} N(0, R\Omega R'),$$
and also, by the CMT,
$$R\hat{\Omega}_n R' \xrightarrow{p} R\Omega R'.$$
Then, under $H_0$,
$$\sqrt{n}\left(R\hat{\beta}_n - r\right)'\left(R\hat{\Omega}_n R'\right)^{-1}\sqrt{n}\left(R\hat{\beta}_n - r\right) \xrightarrow{d} \chi^2_p$$
by the CMT again, and the critical value is $c_{1-\alpha}$, the $(1-\alpha)$ quantile of the $\chi^2_p$ distribution.
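Finally, a sketch of the Wald statistic for $H_0 : R\beta = r$ (my addition; the design, $R$, and $r$ are invented), using a robust $\hat{\Omega}_n$ as in the previous subsection and $\chi^2_p$ critical values:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(12)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=n)   # H0 below is true here

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ beta_hat
Sxx_inv = np.linalg.inv(X.T @ X / n)
Omega_hat = Sxx_inv @ ((X * uhat[:, None]**2).T @ X / n) @ Sxx_inv

R = np.array([[0.0, 1.0, 0.0],            # H0: beta_1 = 0 and beta_2 = 0
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
diff = R @ beta_hat - r
W = n * diff @ np.linalg.solve(R @ Omega_hat @ R.T, diff)   # Wald statistic
p_val = 1 - chi2.cdf(W, df=R.shape[0])
print(f"W = {W:.3f}, p-value = {p_val:.4f}, 5% critical value = {chi2.ppf(0.95, 2):.3f}")
```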