Empirical Analysis I, Part A
Class by Professor Azeem M. Shaikh
Notes by Jorge L. García¹

¹ These notes were taken during the first five weeks of the first class of the quantitative core course (Empirical Analysis, Part A) of the Ph.D. program at the University of Chicago Department of Economics, taught by Professor Azeem M. Shaikh. They cover Large Sample Theory, Hypothesis Testing, Conditional Expectations, and Linear Regression in a rigorous way.

1 Large Sample Theory

1.1 Basics

Definition 1.1 (Convergence in Probability) A sequence of random variables $\{x_n : n \geq 1\}$ is said to converge in probability to another random variable $x$ if $\forall \varepsilon > 0$,
$$\Pr\{|x_n - x| > \varepsilon\} \to 0 \text{ as } n \to \infty.$$

Notation 1.2 The above definition is generally written as $x_n \overset{p}{\to} x$ as $n \to \infty$, i.e. $\lim_{n \to \infty} \Pr\{|x_n - x| > \varepsilon\} = 0$.

A basic result about convergence in probability is the Weak Law of Large Numbers.

1.2 Weak Law of Large Numbers

Theorem 1.3 (Weak Law of Large Numbers) Let $x_1, \dots, x_n$ be an i.i.d. sequence of random variables with distribution $P$. Assume $E[x_i] = \mu(P)$ exists, i.e. $E[|x_i|] < \infty$ (so $-\infty < \mu(P) < \infty$). Then the sample mean satisfies
$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x_i \overset{p}{\to} \mu(P). \quad (1)$$

Remark 1.4 Identically distributed means
$$\Pr\{x_i \leq t\} = \Pr\{x_j \leq t\} \quad \forall i, j \quad (2)$$
for any constant $t \in \mathbb{R}$.

Remark 1.5 Independent stands for
$$\Pr\{x_1 \leq t_1, \dots, x_k \leq t_k\} = \prod_{1 \leq j \leq k} \Pr\{x_j \leq t_j\}. \quad (3)$$

Theorem 1.6 (Chebyshev's Inequality) For any random variable $x$ with distribution $P$ and $\varepsilon > 0$,
$$\Pr\{|x - \mu(P)| > \varepsilon\} \leq \frac{\operatorname{Var}[x]}{\varepsilon^2}. \quad (4)$$

Proof. Obviously, if $g(x) \leq f(x)$ $\forall x$, then $E[g(x)] \leq E[f(x)]$. Pick
$$g(x) := I\{|x - \mu(P)| > \varepsilon\} \leq \frac{(x - \mu(P))^2}{\varepsilon^2} =: f(x) \quad (5)$$
$\forall \varepsilon > 0$. Taking expectations on both sides,
$$\Pr\{|x - \mu(P)| > \varepsilon\} \leq \frac{\operatorname{Var}[x]}{\varepsilon^2}, \quad (6)$$
which completes the proof.

Proof. (Weak Law of Large Numbers) The idea is to show that $\bar{x}_n \overset{p}{\to} \mu(P)$, i.e.
$$\forall \varepsilon > 0, \quad \Pr\{|\bar{x}_n - \mu(P)| > \varepsilon\} \to 0 \text{ as } n \to \infty. \quad (7)$$
By Chebyshev's Inequality,
$$\Pr\{|\bar{x}_n - \mu(P)| > \varepsilon\} \leq \frac{\operatorname{Var}[\bar{x}_n]}{\varepsilon^2} = \frac{\operatorname{Var}[x_i]}{n\varepsilon^2}. \quad (8)$$
But $\operatorname{Var}[x_i]/(n\varepsilon^2) \to 0$ as $n \to \infty$, which completes the proof. (Note that this argument assumes $\operatorname{Var}[x_i] < \infty$; the theorem itself only requires $E[|x_i|] < \infty$, but the proof under that weaker condition was not given in class.)

Definition 1.7 (Consistency) An estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if, $\forall \varepsilon > 0$,
$$\Pr\{|\hat{\theta}_n - \theta| > \varepsilon\} \to 0 \text{ as } n \to \infty. \quad (9)$$

Remark 1.8 Suppose $x_1, \dots, x_n$ are i.i.d. with distribution $P$ and $\mu(P)$ exists. By the Weak Law of Large Numbers $\bar{x}_n \overset{p}{\to} \mu(P)$, so the natural consistent estimator of $\mu(P)$ is $\bar{x}_n$.

1.3 Existence of Moments

Notation 1.9 The $k$-th raw moment of $x$ is $E[x^k]$, and it exists if $E[|x|^k] < \infty$.

Notation 1.10 The $k$-th centered moment of $x$ is $E[(x - \mu(P))^k]$, and it exists if $E[|x - \mu(P)|^k] < \infty$.

Remark 1.11 The second centered moment is the variance by definition.

Theorem 1.12 (Jensen's Inequality) For any random variable $x$ and a convex function $g$,
$$E[g(x)] \geq g(E[x]). \quad (10)$$

Proof. Since $g$ is convex,
$$g(x) = \sup_{l \in L} l(x), \quad (11)$$
where
$$L = \{l : l \text{ linear}, \ l(x) \leq g(x) \ \forall x\}. \quad (12)$$
Then,
$$E[g(x)] = E\Big[\sup_{l \in L} l(x)\Big] \geq \sup_{l \in L} E[l(x)] = \sup_{l \in L} l(E[x]) = g(E[x]), \quad (13)$$
which completes the proof.

Claim 1.13 The existence of higher-order moments implies the existence of lower-order moments.

Proof. Let $1 \leq j \leq k$ and suppose the higher-order moment exists, i.e. $E[|x|^k] < \infty$. The function $g(t) = t^{k/j}$ is convex on $[0, \infty)$ since $k/j \geq 1$. Then, by Jensen's Inequality,
$$\big(E[|x|^j]\big)^{k/j} \leq E[|x|^k] < \infty, \quad (14)$$
so that
$$E[|x|^j] < \infty, \quad (15)$$
which completes the proof.

1.4 Further Illustrations of Chebyshev's Inequality

Suppose that $x_1, \dots, x_n$ are i.i.d. with distribution $P = \text{Bernoulli}(q)$ for $0 < q < 1$. We want to form a confidence region for $q = \mu(P)$ of level $1 - \alpha$, i.e. $C_n = C_n(x_1, \dots, x_n)$ such that $\Pr\{q \in C_n\} \geq 1 - \alpha$ $\forall q$. By Chebyshev's Inequality,
$$\Pr\{|\bar{x}_n - q| > \varepsilon\} \leq \frac{\operatorname{Var}[\bar{x}_n]}{\varepsilon^2} = \frac{\operatorname{Var}[x_i]}{n\varepsilon^2} = \frac{q(1-q)}{n\varepsilon^2} \leq \frac{1}{4n\varepsilon^2} \quad (16)$$
$\forall \varepsilon > 0$, since $q(1-q) \leq 1/4$. Then,
$$\Pr\{|\bar{x}_n - q| \leq \varepsilon\} \geq 1 - \frac{1}{4n\varepsilon^2} \geq 1 - \alpha \quad (17)$$
for the correct choice of $\varepsilon$, namely $\varepsilon = 1/\sqrt{4n\alpha}$. Hence
$$C_n = \{0 < q < 1 : |\bar{x}_n - q| \leq \varepsilon\} = [\bar{x}_n - \varepsilon, \bar{x}_n + \varepsilon]. \quad (18)$$
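As a quick numerical illustration of the interval in (18), the following Python sketch (an addition to the notes, not part of the original lecture) simulates the Chebyshev-based confidence interval for a Bernoulli mean and checks its empirical coverage; the function name and parameter values are illustrative choices.

```python
import numpy as np

def chebyshev_ci(x, alpha=0.05):
    """Chebyshev confidence interval for a Bernoulli mean, eq. (18):
    eps = 1/sqrt(4*n*alpha), since q(1-q) <= 1/4."""
    n = len(x)
    eps = 1.0 / np.sqrt(4 * n * alpha)
    xbar = x.mean()
    return xbar - eps, xbar + eps

rng = np.random.default_rng(0)
q, n, alpha, reps = 0.3, 200, 0.05, 5_000
cover = 0
for _ in range(reps):
    x = rng.binomial(1, q, size=n)
    lo, hi = chebyshev_ci(x, alpha)
    cover += (lo <= q <= hi)
print(f"empirical coverage: {cover / reps:.3f} (guaranteed >= {1 - alpha})")
```

The empirical coverage is typically well above $1 - \alpha$, reflecting that Chebyshev's inequality is conservative.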
Claim 1.14 (Vector Generalization of Convergence in Probability) Marginal convergence in probability implies joint convergence in probability. Let $x_n$ denote a sequence of random vectors in $\mathbb{R}^k$ and $x$ another random vector, and denote by $x_{n,j}$ the $j$-th component of $x_n$. If $x_{n,j} \overset{p}{\to} x_j$ as $n \to \infty$ $\forall j = 1, \dots, k$, then $x_n \overset{p}{\to} x$.

Proof. Notice that
$$\Pr\{\|x_n - x\| > \varepsilon\} = \Pr\Big\{\sum_j (x_{n,j} - x_j)^2 > \varepsilon^2\Big\} \leq \Pr\Big\{\bigcup_{1 \leq j \leq k} \Big\{|x_{n,j} - x_j| > \tfrac{\varepsilon}{\sqrt{k}}\Big\}\Big\} \leq k \max_{1 \leq j \leq k} \Pr\Big\{|x_{n,j} - x_j| > \tfrac{\varepsilon}{\sqrt{k}}\Big\} \to 0 \quad (19)$$
as $n \to \infty$, using the following fact:
$$\Pr\{a_1 + \dots + a_n > \delta\} \leq \Pr\Big\{\bigcup_{1 \leq i \leq n} \Big\{a_i > \tfrac{\delta}{n}\Big\}\Big\} \leq n \max_{1 \leq i \leq n} \Pr\Big\{a_i > \tfrac{\delta}{n}\Big\}. \quad (20)$$
Then the proof is complete.

Theorem 1.15 (Continuous Mapping Theorem) Let $x_n$ be a sequence of random vectors and $x$ another random vector taking values in $\mathbb{R}^k$ such that $x_n \overset{p}{\to} x$. Let $g : \mathbb{R}^k \to \mathbb{R}^d$ be a function continuous on a set $C$ such that $\Pr\{x \in C\} = 1$. Then
$$g(x_n) \overset{p}{\to} g(x) \text{ as } n \to \infty,$$
that is, $\forall \varepsilon > 0$, $\Pr\{|g(x_n) - g(x)| > \varepsilon\} \to 0$ as $n \to \infty$.

Proof. For $\delta > 0$, let $B_\delta = \{x \in \mathbb{R}^k : \exists y \text{ with } |x - y| < \delta \text{ and } |g(x) - g(y)| > \varepsilon\}$. Notice that if $x \notin B_\delta$, then
$$|x_n - x| \leq \delta \ \Rightarrow \ |g(x_n) - g(x)| \leq \varepsilon. \quad (21)$$
Then,
$$\Pr\{|g(x_n) - g(x)| > \varepsilon\} \leq \Pr\{x \in B_\delta\} + \Pr\{|x_n - x| > \delta\} = \Pr\{x \in B_\delta \cap C\} + \Pr\{|x_n - x| > \delta\}. \quad (22)$$
The second term tends to $0$ as $n \to \infty$, and as $\delta \downarrow 0$ the first term tends to $\Pr(\emptyset) = 0$; the last step follows from the continuity of $g(x)$ $\forall x \in C$.

Example 1.16 Let $x_1, \dots, x_n$ be an i.i.d. sequence of random variables with distribution $P$. Suppose $\operatorname{Var}[x_i] = \sigma^2(P) < \infty$. Then the natural consistent estimator of $\sigma^2(P)$ is
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x}_n)^2. \quad (23)$$

(Proof) Notice
$$S_n^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}_n^2\Big). \quad (24)$$
Now let $g(x, y, z) := x(y - z^2)$; note that $g$ is a continuous mapping and
$$\Big(\frac{n}{n-1}, \ \frac{1}{n}\sum_{i=1}^n x_i^2, \ \bar{x}_n\Big) \overset{p}{\to} \big(1, E[x_i^2], E[x_i]\big) \text{ as } n \to \infty.$$
Then $S_n^2 \overset{p}{\to} E[x_i^2] - (E[x_i])^2 = \sigma^2(P)$ as $n \to \infty$, which completes the proof.

Definition 1.17 (Convergence in $r$-th Moment) A sequence of random vectors $\{x_n : n \geq 1\}$ converges in $r$-th moment to another random vector $x$ if
$$E[|x_n - x|^r] \to 0 \text{ as } n \to \infty.$$

Claim 1.18 Convergence in $r$-th moment implies convergence in probability.

Proof. Let $\{x_n : n \geq 1\}$ be a sequence of random vectors and $x$ another random vector. Obviously, if $g(x) \leq f(x)$ $\forall x$, then $E[g(x)] \leq E[f(x)]$. Now,
$$g(x) := I\{|x_n - x| > \varepsilon\} \leq \frac{|x_n - x|^r}{\varepsilon^r} =: f(x). \quad (25)$$
Taking expectations on both sides,
$$\Pr\{|x_n - x| > \varepsilon\} \leq \frac{E[|x_n - x|^r]}{\varepsilon^r}, \quad (26)$$
which implies that $\Pr\{|x_n - x| > \varepsilon\} \to 0$ as $n \to \infty$ and completes the proof.

Claim 1.19 Convergence in probability does not imply convergence in $r$-th moment.

Proof. One can show this with a counterexample. Let $x_n = n$ with probability $1/n$ and $x_n = 0$ with probability $1 - 1/n$. First, note that $\forall \varepsilon > 0$,
$$\Pr\{|x_n - 0| > \varepsilon\} = \frac{1}{n} \to 0 \text{ as } n \to \infty, \quad (27)$$
so $x_n \overset{p}{\to} 0$. However, $E[x_n] = 1 \neq 0$ for every $n$, so $x_n$ does not converge to $0$ even in first moment.

Remark 1.20 A sufficient condition for convergence in probability to imply convergence in $r$-th moment is that the $x_n$ are all (uniformly) bounded.
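A small simulation (added here as an illustration, not part of the original notes) makes the counterexample in Claim 1.19 concrete: the probability that $x_n$ differs from $0$ shrinks like $1/n$, yet the mean stays at $1$.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 100_000
for n in (10, 100, 1000, 10000):
    # x_n = n with probability 1/n, and 0 otherwise
    x = np.where(rng.random(reps) < 1.0 / n, n, 0.0)
    print(f"n={n:>6}  Pr(|x_n| > 0.5) ~ {np.mean(x > 0.5):.4f}   E[x_n] ~ {x.mean():.3f}")
# Pr(|x_n| > eps) -> 0 (convergence in probability to 0),
# but E[|x_n - 0|] stays near 1 (no convergence in first moment).
```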
1.5 Convergence in Distribution

Definition 1.21 (Convergence in Distribution) A sequence $\{x_n : n \geq 1\}$ of random vectors on $\mathbb{R}^k$ is said to converge in distribution to $x$ on $\mathbb{R}^k$ if
$$\Pr\{x_n \leq b\} - \Pr\{x \leq b\} \to 0 \text{ as } n \to \infty$$
at every $b \in \mathbb{R}^k$ at which $b \mapsto \Pr\{x \leq b\}$ is continuous.

Notation 1.22 As defined above, convergence in distribution is denoted $x_n \overset{d}{\to} x$.

Remark 1.23 In the definition of convergence in distribution, $x$ may be any random vector; only its distribution matters.

Lemma 1.24 (Portmanteau) Let a sequence $\{x_n : n \geq 1\}$ of random vectors on $\mathbb{R}^k$ converge to $x$ in distribution. Then:

1. $E[f(x_n)] \to E[f(x)]$ for all continuous, bounded, real-valued functions $f$.

2. $E[f(x_n)] \to E[f(x)]$ for all bounded, Lipschitz, real-valued functions $f$.²

3. $\liminf_{n \to \infty} E[f(x_n)] \geq E[f(x)]$ for all non-negative, continuous, real-valued functions $f$. (28)

4. For any closed set $F$, $\limsup_{n \to \infty} \Pr\{x_n \in F\} \leq \Pr\{x \in F\}$. (29)

5. For any open set $G$, $\liminf_{n \to \infty} \Pr\{x_n \in G\} \geq \Pr\{x \in G\}$. (30)

6. For all sets $B$ such that $\Pr\{x \in \partial B\} = 0$, $\Pr\{x_n \in B\} \to \Pr\{x \in B\}$, where $\partial B = \operatorname{cl}(B) \setminus \operatorname{int}(B) = \operatorname{cl}(B) \cap (\operatorname{int}(B))^c$. (31)

Proof. Not shown in class.

² Definition 1.25 A function $f$ is Lipschitz if $\exists L < \infty$ such that $|f(x) - f(y)| \leq L|x - y|$.

Lemma 1.26 Let $\{x_n : n \geq 1\}$ be a sequence of random vectors such that $x_n \overset{d}{\to} x$, where $x$ is another random vector, and let $\{y_n : n \geq 1\}$ be another sequence of random vectors such that $|x_n - y_n| \overset{p}{\to} 0$. Then $y_n \overset{d}{\to} x$.

Proof. By Portmanteau, it suffices to show that for all bounded, real-valued, Lipschitz functions $f$, $E[f(y_n)] \to E[f(x)]$. Notice that
$$|E[f(y_n)] - E[f(x)]| \leq |E[f(y_n)] - E[f(x_n)]| + |E[f(x_n)] - E[f(x)]|, \quad (32)$$
and $|E[f(x_n)] - E[f(x)]| \to 0$ by Portmanteau. For the first term,
$$|f(y_n) - f(x_n)| \leq |f(y_n) - f(x_n)|\,I\{|x_n - y_n| > \varepsilon\} + |f(y_n) - f(x_n)|\,I\{|x_n - y_n| \leq \varepsilon\} \leq 2B\,I\{|x_n - y_n| > \varepsilon\} + L\varepsilon, \quad (33)$$
where $L$ is the Lipschitz constant and $B$ is a bound on $|f|$. Then,
$$|E[f(y_n)] - E[f(x_n)]| \leq E[|f(y_n) - f(x_n)|] \leq 2B\Pr\{|x_n - y_n| > \varepsilon\} + L\varepsilon, \quad (34)$$
which can be made arbitrarily small: the first term tends to $0$ as $n \to \infty$, and $\varepsilon$ is arbitrary.

Theorem 1.27 Convergence in probability implies convergence in distribution, i.e. $x_n \overset{p}{\to} x \Rightarrow x_n \overset{d}{\to} x$.

Proof. Apply Lemma 1.26 with the constant sequence $x, x, \dots$ in place of the sequence converging in distribution (trivially, $x \overset{d}{\to} x$) and with $y_n := x_n$; the hypothesis $|x - x_n| \overset{p}{\to} 0$ is exactly $x_n \overset{p}{\to} x$.

Exercise 1.28 The converse is generally false. However, it is interesting that there is a case where the converse does hold.

Lemma 1.29 If $\{x_n : n \geq 1\}$ is a sequence of random variables such that $x_n \overset{d}{\to} c$ for a constant $c$, then $x_n \overset{p}{\to} c$.

Proof. Need to show that $\Pr\{|x_n - c| > \varepsilon\} \to 0$ as $n \to \infty$. By Portmanteau,
$$\limsup_{n \to \infty} \Pr\{x_n \in F\} \leq \Pr\{c \in F\} \quad (35)$$
for any closed set $F$. Notice, for $0 < \delta < \varepsilon$,
$$\limsup_{n \to \infty} \Pr\{|x_n - c| > \varepsilon\} \leq \limsup_{n \to \infty} \Pr\{|x_n - c| \geq \delta\} \quad (36)$$
$$\leq \Pr\{c \in (B_\delta(c))^c\} = 0, \quad (37)$$
where $B_\delta(c)$ is the open ball of radius $\delta$ around $c$, which completes the proof.

In general, marginal convergence in distribution does not imply joint convergence in distribution, i.e.
$$x_n \overset{d}{\to} x, \ y_n \overset{d}{\to} y \ \nRightarrow \ (x_n, y_n) \overset{d}{\to} (x, y).$$

Remark 1.30 An exception to the prior statement is $y = c$, a constant.

Lemma 1.31 Let $\{x_n : n \geq 1\}$ be such that $x_n \overset{d}{\to} x$ and $\{y_n : n \geq 1\}$ such that $y_n \overset{d}{\to} c$. Then $(x_n, y_n) \overset{d}{\to} (x, c)$.

Proof. By Lemma 1.29, $y_n \overset{p}{\to} c$, so
$$\|(x_n, y_n) - (x_n, c)\| = \|y_n - c\| \overset{p}{\to} 0.$$
By Lemma 1.26 it is therefore enough to show $(x_n, c) \overset{d}{\to} (x, c)$. By Portmanteau, this requires $E[f(x_n, c)] \to E[f(x, c)]$ for all bounded, Lipschitz, real-valued $f$. But $f(\cdot, c)$ is itself bounded, continuous and real-valued, so this follows from $x_n \overset{d}{\to} x$, which completes the proof. (38)

Exercise 1.32 Show that this does not hold when $y$ is an arbitrary random variable.

One of the basic results about convergence in distribution is the Central Limit Theorem.

Theorem 1.33 (Univariate Central Limit Theorem) Let $\{x_n : n \geq 1\}$ be a sequence of i.i.d. random variables with distribution $P$ and $\sigma^2(P) = \operatorname{Var}[x_i] < \infty$. Then
$$\sqrt{n}\,(\bar{x}_n - \mu(P)) \overset{d}{\to} N(0, \sigma^2(P)).$$

Proof. Not shown in class.

Remark 1.34 This result is true even when $\sigma^2(P) = 0$, with $N(0,0)$ interpreted as the distribution degenerate at $0$.

Lemma 1.35 (Cramér-Wold) Let $\{x_n : n \geq 1\}$ be a sequence of random vectors and $x$ another random vector. Then
$$x_n \overset{d}{\to} x \iff t'x_n \overset{d}{\to} t'x \text{ for every fixed vector } t.$$

Proof. Not shown in class.
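The following short simulation (an added illustration, not from the lecture) visualizes Theorem 1.33: standardized sample means of a skewed distribution are compared with the standard normal c.d.f.; the sample size and distribution are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 200, 20_000
# Exponential(1) draws: mu(P) = 1, sigma^2(P) = 1, clearly non-normal
x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0    # sqrt(n)(xbar - mu)/sigma

for b in (-1.0, 0.0, 1.0, 2.0):
    print(f"Pr(Z <= {b:+.0f}): simulated {np.mean(z <= b):.3f}  vs  Phi({b:+.0f}) = {norm.cdf(b):.3f}")
```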
Theorem 1.36 (Multivariate Central Limit Theorem) Let $\{x_n : n \geq 1\}$ be a sequence of i.i.d. random vectors with distribution $P$ and $\Sigma(P) = \operatorname{Var}(x_i)$ finite. Then
$$\sqrt{n}\,(\bar{x}_n - \mu(P)) \overset{d}{\to} N(0, \Sigma(P)).$$

Proof. By Cramér-Wold this is true if and only if, for every fixed $t$,
$$t'\sqrt{n}\,(\bar{x}_n - \mu(P)) = \sqrt{n}\Big(\frac{1}{n}\sum_{i=1}^n t'x_i - t'\mu(P)\Big) \overset{d}{\to} t'N(0, \Sigma(P)) = N(0, t'\Sigma(P)t),$$
which holds by the Univariate Central Limit Theorem applied to the i.i.d. sequence $\{t'x_i\}$.

Theorem 1.37 (Continuous Mapping Theorem) Let $x_n$ be a sequence of random vectors and $x$ another random vector taking values in $\mathbb{R}^k$ such that $x_n \overset{d}{\to} x$. Let $g : \mathbb{R}^k \to \mathbb{R}^d$ be continuous on a set $C$ such that $\Pr\{x \in C\} = 1$. Then
$$g(x_n) \overset{d}{\to} g(x) \text{ as } n \to \infty.$$

Proof. By Portmanteau, it is enough to show that for every closed set $F$,
$$\limsup_{n \to \infty} \Pr\{g(x_n) \in F\} \leq \Pr\{g(x) \in F\}. \quad (39)$$
But notice that
$$\Pr\{g(x_n) \in F\} = \Pr\{x_n \in g^{-1}(F)\} \leq \Pr\{x_n \in \operatorname{cl}(g^{-1}(F))\}, \quad (40)$$
where $g^{-1}(F) = \{x : g(x) \in F\}$. Therefore,
$$\limsup_{n \to \infty} \Pr\{g(x_n) \in F\} \leq \limsup_{n \to \infty} \Pr\{x_n \in \operatorname{cl}(g^{-1}(F))\} \leq \Pr\{x \in \operatorname{cl}(g^{-1}(F))\} \leq \Pr\{x \in g^{-1}(F) \cup C^c\} = \Pr\{g(x) \in F\}. \quad (41)$$
To finish the proof, claim that
$$\operatorname{cl}(g^{-1}(F)) \subseteq g^{-1}(F) \cup C^c. \quad (42)$$
Take $z \in \operatorname{cl}(g^{-1}(F))$ and suppose $z \in C$; we need to show $z \in g^{-1}(F)$. By the definition of the closure there exist $z_n \to z$ with $z_n \in g^{-1}(F)$, i.e. $g(z_n) \in F$; by continuity of $g$ at $z$ and closedness of $F$, $g(z) \in F$, which completes the proof.

Corollary 1.38 (Slutsky's Theorem) Let $y_n \overset{p}{\to} c$ and $x_n \overset{d}{\to} x$. Then
$$y_n x_n \overset{d}{\to} cx \quad \text{and} \quad y_n + x_n \overset{d}{\to} c + x.$$

Example 1.39 Let $\{x_n : n \geq 1\}$ be a sequence of i.i.d. random variables with distribution $P$ on $\mathbb{R}$ and $0 < \sigma^2(P) < \infty$. By the Central Limit Theorem,
$$\sqrt{n}\,(\bar{x}_n - \mu(P)) \overset{d}{\to} N(0, \sigma^2(P)).$$
By the Weak Law of Large Numbers (and Example 1.16), $S_n^2 \overset{p}{\to} \sigma^2(P)$. By the Continuous Mapping Theorem, $S_n \overset{p}{\to} \sigma(P)$. By Slutsky's Theorem,
$$\frac{\sqrt{n}\,(\bar{x}_n - \mu(P))}{S_n} \overset{d}{\to} N(0, 1).$$

1.6 Hypothesis Testing

Consider a simple test where one wants to contrast the following hypotheses:
$$H_0 : \mu(P) \leq 0 \quad \text{vs.} \quad H_1 : \mu(P) > 0. \quad (43)$$
In this context:

1. Type I Error: rejecting $H_0$ when it is true.

2. Type II Error: failing to reject $H_0$ when it is false.

Normally, $\alpha$ is defined as the amount of Type I Error that the test is supposed to tolerate.

Definition 1.40 A test is consistent in level if and only if
$$\limsup_{n \to \infty} \Pr\{\text{Type I Error}\} \leq \alpha \quad (44)$$
whenever $H_0$ is true.

Example 1.41 Let $x_1, \dots, x_n$ be i.i.d. with distribution $P$ such that $0 < \sigma^2(P) < \infty$. Then
$$\frac{\sqrt{n}\,(\bar{x}_n - \mu(P))}{S_n} \overset{d}{\to} N(0, 1).$$
Suppose we want to test
$$H_0 : \mu(P) \leq 0 \quad \text{vs.} \quad H_1 : \mu(P) > 0 \quad (45)$$
at significance level $\alpha$. A natural choice of statistic is the $t$-statistic
$$t_n = \frac{\sqrt{n}\,\bar{x}_n}{S_n}, \quad (46)$$
with critical value $c_n = z_{1-\alpha} = \Phi^{-1}(1-\alpha)$, where $\Phi$ is the standard normal c.d.f.; the test rejects when $t_n > z_{1-\alpha}$.

Claim 1.42 The test is consistent in level.

Proof. We need to show that
$$\limsup_{n \to \infty} \Pr\Big\{\frac{\sqrt{n}\,\bar{x}_n}{S_n} > z_{1-\alpha}\Big\} \leq \alpha \quad (47)$$
whenever $H_0$ is true, i.e. $\mu(P) \leq 0$. Notice
$$\limsup_{n \to \infty} \Pr\Big\{\frac{\sqrt{n}\,\bar{x}_n}{S_n} > z_{1-\alpha}\Big\} = \limsup_{n \to \infty} \Pr\Big\{\frac{\sqrt{n}\,(\bar{x}_n - \mu(P))}{S_n} + \frac{\sqrt{n}\,\mu(P)}{S_n} > z_{1-\alpha}\Big\} \leq \limsup_{n \to \infty} \Pr\Big\{\frac{\sqrt{n}\,(\bar{x}_n - \mu(P))}{S_n} > z_{1-\alpha}\Big\} = 1 - \Phi(z_{1-\alpha}) = \alpha, \quad (48)$$
where the inequality uses $\mu(P) \leq 0$ under $H_0$, and the last limit holds because $\sqrt{n}\,(\bar{x}_n - \mu(P))/S_n \overset{d}{\to} N(0,1)$.

Exercise 1.43 Show that the analogous two-sided test is consistent in level.

1.6.1 The p-value

The p-value is useful as a summary over all the possible levels $\alpha$ that the tester may be willing to use: it is the smallest $\alpha$ for which the test rejects. For the test above, the p-value is
$$\text{p-value} = \inf\Big\{\alpha \in (0,1) : \frac{\sqrt{n}\,\bar{x}_n}{S_n} > z_{1-\alpha}\Big\} = \inf\Big\{\alpha \in (0,1) : \Phi\Big(\frac{\sqrt{n}\,\bar{x}_n}{S_n}\Big) > 1 - \alpha\Big\} = 1 - \Phi(t_n). \quad (49)$$
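To illustrate Claim 1.42 and equation (49) numerically, this added sketch simulates the one-sided $t$-test at the boundary of $H_0$ (where $\mu(P) = 0$) and reports the empirical rejection rate, which should approach $\alpha$; the sample size and error distribution are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, alpha, reps = 500, 0.05, 10_000
z = norm.ppf(1 - alpha)                          # critical value z_{1-alpha}

rejections = 0
for _ in range(reps):
    x = rng.normal(loc=0.0, scale=2.0, size=n)   # mu(P) = 0: boundary of H0
    t_n = np.sqrt(n) * x.mean() / x.std(ddof=1)  # t-statistic, eq. (46)
    p_value = 1 - norm.cdf(t_n)                  # eq. (49)
    rejections += (p_value < alpha)              # equivalent to t_n > z
print(f"empirical rejection rate under H0: {rejections / reps:.3f} (target {alpha})")
```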
Example 1.44 Let $x_1, \dots, x_n$ be a sequence of i.i.d. random variables with distribution $P = \text{Bernoulli}(q)$, $0 < q < 1$. Arguing as above,
$$\frac{\sqrt{n}\,(\bar{x}_n - q)}{\sqrt{\bar{x}_n(1 - \bar{x}_n)}} \overset{d}{\to} N(0, 1).$$
The idea is to construct an interval $C_n = C_n(x_1, \dots, x_n)$ such that $\Pr\{q \in C_n\} \to 1 - \alpha$. Since
$$\Pr\Big\{-z_{1-\alpha/2} < \frac{\sqrt{n}\,(\bar{x}_n - q)}{\sqrt{\bar{x}_n(1 - \bar{x}_n)}} < z_{1-\alpha/2}\Big\} \to 1 - \alpha,$$
this implies
$$C_n = \Big[\bar{x}_n - z_{1-\alpha/2}\sqrt{\tfrac{\bar{x}_n(1 - \bar{x}_n)}{n}}, \ \bar{x}_n + z_{1-\alpha/2}\sqrt{\tfrac{\bar{x}_n(1 - \bar{x}_n)}{n}}\Big].$$

Remark 1.45 $\forall q$ and $\forall \varepsilon > 0$, $\exists N(\varepsilon, q)$ such that $\forall n > N(\varepsilon, q)$,
$$|\Pr\{q \in C_n\} - (1 - \alpha)| < \varepsilon. \quad (50)$$
This can be misleading: the guarantee is pointwise in $q$, and it can actually be true that
$$\inf_{0 < q < 1} \Pr\{q \in C_n\} = 0. \quad (51)$$
For instance, suppose $q = (1 - \varepsilon)^{1/n}$ for some $\varepsilon > 0$, which implies $\Pr\{x_1 = 1, \dots, x_n = 1\} = 1 - \varepsilon$. If $x_1 = 1, \dots, x_n = 1$, then $C_n = \{1\}$ and $q \notin C_n$, so the coverage probability is at most $\varepsilon$.

Example 1.46 Let $x_1, \dots, x_n$ be a sequence of i.i.d. random vectors with distribution $P$ on $\mathbb{R}^k$. Say $\Sigma(P)$ is finite and non-singular. By the Central Limit Theorem,
$$\sqrt{n}\,(\bar{x}_n - \mu(P)) \overset{d}{\to} N(0, \Sigma(P)).$$
Also, note that if $z \sim N(0, \Sigma(P))$, then $z'\Sigma(P)^{-1}z \sim \chi^2_k$, by the Continuous Mapping Theorem. The natural estimator of $\Sigma(P)$ is
$$\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)(x_i - \bar{x}_n)'.$$

Exercise 1.47 Show that $\hat{\Sigma}_n \overset{p}{\to} \Sigma(P)$.

Claim 1.48 $\hat{\Sigma}_n^{-1} \overset{p}{\to} \Sigma(P)^{-1}$.

Proof. By the Continuous Mapping Theorem, since matrix inversion is continuous at non-singular matrices.

Exercise 1.49 Consider the test
$$H_0 : \mu(P) = 0 \quad \text{vs.} \quad H_1 : \mu(P) \neq 0$$
with the statistic
$$t_n = \big(\sqrt{n}\,\bar{x}_n\big)'\hat{\Sigma}_n^{-1}\big(\sqrt{n}\,\bar{x}_n\big) \quad (52)$$
and critical value $c_n = c_{k, 1-\alpha}$, the $1-\alpha$ quantile of the $\chi^2_k$ distribution. Show that the test is consistent in level and find its p-value.

1.7 Delta Method

Theorem 1.50 (Delta Method) Let $\{x_n : n \geq 1\}$ be a sequence of random vectors on $\mathbb{R}^k$ and $x$ another random vector such that
$$\tau_n(x_n - c) \overset{d}{\to} x$$
for a sequence of constants $\tau_n \to \infty$ as $n \to \infty$. Also, let $g : \mathbb{R}^k \to \mathbb{R}^d$ be a differentiable function at $c$, and let $Dg(c)$ be the Jacobian matrix of $g$ at $c$, so it has dimensions $d \times k$. Then
$$\tau_n\big(g(x_n) - g(c)\big) \overset{d}{\to} Dg(c)\,x.$$

Proof. By Taylor's Theorem, $g(x) - g(c) = Dg(c)(x - c) + R(|x - c|)$, where $R(0) = 0$ and $R(|h|) = o(|h|)$.³ Then,
$$\tau_n\big(g(x_n) - g(c)\big) = Dg(c)\,\tau_n(x_n - c) + \tau_n R(|x_n - c|). \quad (53)$$
By the Continuous Mapping Theorem, $Dg(c)\,\tau_n(x_n - c) \overset{d}{\to} Dg(c)\,x$, while $\tau_n R(|x_n - c|) \overset{p}{\to} 0$ since $\tau_n(x_n - c)$ is tight and $R(|h|)/|h| \to 0$; the result follows by Slutsky's Theorem. Note that there are no restrictions on the dimensions of $Dg(c)$.

³ $o(|h|)$ means that $R(|h|)/|h| \to 0$ as $|h| \to 0$.

Example 1.51 If $x \sim N(0, \Sigma)$, then
$$\tau_n\big(g(x_n) - g(c)\big) \overset{d}{\to} N\big(0, Dg(c)\,\Sigma\,Dg(c)'\big).$$

Exercise 1.52 (Delta Method) Let $\{x_i : i \geq 1\}$ be a sequence of i.i.d. random variables with distribution $P = \text{Bernoulli}(q)$, $0 < q < 1$. Notice that $\operatorname{Var}[x_i] = g(q) = q(1-q)$ and, also,
$$\sqrt{n}\,(\bar{x}_n - q) \overset{d}{\to} N(0, g(q)),$$
so what is the limiting distribution, after centering and normalization, of $g(\bar{x}_n)$? By the Delta Method,
$$\sqrt{n}\big(g(\bar{x}_n) - g(q)\big) \overset{d}{\to} g'(q)\,N(0, g(q)) = N\big(0, (1-2q)^2 q(1-q)\big).$$
Notice, however, that $g'(q) = 1 - 2q$, so if $q = 1/2$ the limiting variance is $0$, and then this approximation gives no sensible way of constructing a confidence interval.

Solution 1.53 (Second-Order Approximation for the Delta Method) Use the idea of the proof of the Delta Method and derive another limiting distribution for $g(\bar{x}_n)$ at $q = 1/2$. A second-order expansion gives
$$g(\bar{x}_n) - g(q) = g'(q)(\bar{x}_n - q) + \tfrac{1}{2}g''(q)(\bar{x}_n - q)^2 + R(|\bar{x}_n - q|), \quad (54)$$
where $R(|\bar{x}_n - q|) = o(|\bar{x}_n - q|^2)$. At $q = 1/2$, $g'(q) = 0$ and $g''(q) = -2$, so
$$n\big(g(\bar{x}_n) - g(q)\big) = -n(\bar{x}_n - q)^2 + o_p(1) = -\big(\sqrt{n}\,(\bar{x}_n - q)\big)^2 + o_p(1) \overset{d}{\to} -\big(N(0, \tfrac{1}{4})\big)^2 = -\tfrac{1}{4}\chi^2_1. \quad (55)$$

Theorem 1.54 (Cauchy-Schwarz Inequality) Let $U, Y$ be any two random variables such that $E[U^2], E[Y^2] < \infty$. Then
$$\big(E[UY]\big)^2 \leq E[U^2]\,E[Y^2]. \quad (56)$$

Proof. Firstly, show that $E[UY]$ exists. From
$$0 \leq (|U| - |Y|)^2 = U^2 - 2|UY| + Y^2 \ \Rightarrow \ |UY| \leq \frac{U^2 + Y^2}{2} \ \Rightarrow \ E[|UY|] \leq \frac{E[U^2] + E[Y^2]}{2} < \infty.$$
Secondly, show the inequality. For any $\lambda$,
$$0 \leq E[(U - \lambda Y)^2] = E[U^2] - 2\lambda E[UY] + \lambda^2 E[Y^2].$$
Picking the minimizer $\lambda = E[UY]/E[Y^2]$ (when $E[Y^2] > 0$; the inequality is trivial otherwise) yields the inequality, with equality when $E[(U - \lambda Y)^2] = 0$.
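The following added simulation sketches Exercise 1.52 and Solution 1.53: for $q \neq 1/2$ the distribution of $\sqrt{n}(g(\bar{x}_n) - g(q))$ is approximately normal with the delta-method variance, while at $q = 1/2$ the right scaling is $n$ and the limit is $-\tfrac{1}{4}\chi^2_1$; the particular values of $q$, $n$ and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
g = lambda q: q * (1 - q)
n, reps = 2_000, 20_000

# Regular case: q = 0.3, first-order delta method applies
q = 0.3
xbar = rng.binomial(n, q, size=reps) / n
stat = np.sqrt(n) * (g(xbar) - g(q))
print(f"q=0.3: sample var {stat.var():.4f} vs delta-method var {(1 - 2*q)**2 * q*(1-q):.4f}")

# Degenerate case: q = 0.5, g'(q) = 0, so scale by n instead of sqrt(n)
q = 0.5
xbar = rng.binomial(n, q, size=reps) / n
stat = n * (g(xbar) - g(q))
print(f"q=0.5: mean of n(g(xbar)-g(q)) ~ {stat.mean():.4f} vs E[-chi2_1/4] = {-0.25:.4f}")
```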
Example 1.55 Let $\{(x_i, y_i) : i \geq 1\}$ be an i.i.d. sequence with distribution $P$, and suppose $E[x_i^2], E[y_i^2] < \infty$. Define
$$\operatorname{Cov}[x_i, y_i] = \sigma_{x,y} = E[(x_i - E[x_i])(y_i - E[y_i])] = E[x_iy_i] - E[x_i]E[y_i];$$
$E[x_iy_i]$ exists by the Cauchy-Schwarz Inequality. If in addition $\operatorname{Var}(x_i), \operatorname{Var}(y_i) > 0$, then
$$\rho_{x,y} = \frac{\operatorname{Cov}[x_i, y_i]}{\sqrt{\operatorname{Var}(x_i)\operatorname{Var}(y_i)}}.$$
Also, $|\rho_{x,y}| \leq 1$ by the Cauchy-Schwarz Inequality. An obvious consistent estimator of $\rho_{x,y}$ comes from replacing each of its components with consistent estimators.

1.8 Tightness or Boundedness in Probability

Sometimes $x_n$ may not have a limiting distribution, but a weaker property such as tightness may hold.

Definition 1.56 (Tightness) The sequence $\{x_n\}$ is tight if $\forall \varepsilon > 0$ $\exists B > 0$ such that
$$\inf_{n \geq 1} \Pr\{|x_n| \leq B\} \geq 1 - \varepsilon.$$
This definition generalizes the idea of boundedness of a sequence of real numbers to a sequence of random variables (that is why the concept is sometimes referred to as bounded in probability).

Notation 1.57 If $\sqrt{n}(\hat{\theta}_n - \theta_0)$ is tight, we say $\hat{\theta}_n$ is $\sqrt{n}$-consistent.

Exercise 1.58 Show that $\sqrt{n}$-consistency implies consistency.

Exercise 1.59 Prove that if $x_n \overset{d}{\to} x$, then $\{x_n\}$ is tight.

Theorem 1.60 (Prohorov) Let $\{x_n : n \geq 1\}$ be a sequence of random variables. Assume $\{x_n\}$ is tight. Then there exist a subsequence $x_{n_j}$ and a random variable $x$ such that $x_{n_j} \overset{d}{\to} x$.

1.8.1 Stochastic Order Symbols

Definition 1.61 If $\{a_n : n \geq 1\}$ is a sequence of real numbers such that $a_n \to 0$, then $a_n = o(1)$. (57)

Definition 1.62 By analogy, if $\{x_n : n \geq 1\}$ is a sequence of random variables such that $x_n \overset{p}{\to} 0$, then $x_n = o_p(1)$. (58)

Notation 1.63 If $\{x_n : n \geq 1\}$ is a sequence of tight random variables, then $x_n = O_p(1)$. (59)

Notation 1.64 A more general statement: $x_n = o_p(R_n)$ means $x_n = R_ny_n$ for some $y_n = o_p(1)$. (60)

Notation 1.65 Also, $x_n = O_p(R_n)$ means $x_n = R_ny_n$ for some $y_n = O_p(1)$. (61)

Conjecture 1.66 Let $x_n = o_p(1)$ and $y_n = O_p(1)$. Then $x_ny_n = o_p(1)$.

Proof. Assume the conclusion fails, so that $\Pr\{|x_ny_n| > \varepsilon\} \not\to 0$ as $n \to \infty$ for some $\varepsilon > 0$. Then $\exists \delta > 0$ and a subsequence $n_j$ such that $\Pr\{|x_{n_j}y_{n_j}| > \varepsilon\} \geq \delta$ for all $j$. But $y_{n_j}$ is tight, so by Prohorov's Theorem there is a further subsequence $n_{j_l}$ such that $y_{n_{j_l}} \overset{d}{\to} y$ for some random variable $y$. Since $x_{n_{j_l}} \overset{p}{\to} 0$, Slutsky's Theorem gives $x_{n_{j_l}}y_{n_{j_l}} \overset{d}{\to} 0 \cdot y = 0$, hence $x_{n_{j_l}}y_{n_{j_l}} \overset{p}{\to} 0$, which contradicts $\Pr\{|x_{n_{j_l}}y_{n_{j_l}}| > \varepsilon\} \geq \delta$ and completes the proof.

2 Conditional Expectations

Let $(y, x)$ be a random vector where $y$ takes values in $\mathbb{R}$ and $x$ takes values in $\mathbb{R}^k$. Suppose that $E[y^2] < \infty$ and define
$$M = \{m(x) \mid m : \mathbb{R}^k \to \mathbb{R}, \ E[m^2(x)] < \infty\}. \quad (62)$$
Consider the following minimization problem:
$$\inf_{m(x) \in M} E[(y - m(x))^2].$$

Claim 2.1 A solution exists, i.e. $\exists m^*(x) \in M$ such that
$$E[(y - m^*(x))^2] = \inf_{m(x) \in M} E[(y - m(x))^2]. \quad (63)$$

Claim 2.2 Any two solutions $m^*(x)$ and $\tilde{m}(x)$ satisfy
$$\Pr\{m^*(x) = \tilde{m}(x)\} = 1. \quad (64)$$

Definition 2.3 Let $E[y|x]$ denote any solution. It is the best predictor of $y$ given $x$ (in the mean-squared-error sense, among all functions of $x$ in $M$).

Theorem 2.4 $m^*(x)$ is a solution to the minimization problem above if and only if
$$E[(y - m^*(x))\,m(x)] = 0 \quad \forall m(x) \in M. \quad (65)$$

Proof. For any $m(x) \in M$, note that
$$E[(y - m(x))^2] = E[(y - m^*(x) + m^*(x) - m(x))^2] = E[(y - m^*(x))^2] + 2E[(y - m^*(x))(m^*(x) - m(x))] + E[(m^*(x) - m(x))^2]. \quad (66)$$
Suppose orthogonality holds; since $m^*(x) - m(x) \in M$, the middle term is equal to zero, so
$$E[(y - m(x))^2] \geq E[(y - m^*(x))^2].$$
Conversely, suppose $m^*(x)$ solves the minimization problem. Let $m(x) \in M$ be given and consider $m^*(x) + \lambda m(x) \in M$. Then,
$$0 \leq E[(y - m^*(x) - \lambda m(x))^2] - E[(y - m^*(x))^2] = \lambda^2 E[m^2(x)] - 2\lambda E[(y - m^*(x))m(x)] \quad \forall \lambda,$$
which forces the cross term $E[(y - m^*(x))m(x)]$ to be zero.

Exercise 2.5 Show that the solution is unique (in the sense of Claim 2.2).
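As an added numerical check of Definition 2.3 and Theorem 2.4, the sketch below simulates a nonlinear model and compares the mean-squared prediction error of the true conditional mean $E[y|x]$ with that of the best linear predictor; the conditional mean should never do worse. The model $y = \sin(2x) + $ noise is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=n)
y = np.sin(2 * x) + 0.5 * rng.normal(size=n)   # E[y|x] = sin(2x)

mse_cond = np.mean((y - np.sin(2 * x)) ** 2)   # best predictor E[y|x]

# Best linear predictor of y given (1, x), fit by least squares
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_blp = np.mean((y - X @ b) ** 2)

print(f"MSE of E[y|x]: {mse_cond:.4f}   MSE of best linear predictor: {mse_blp:.4f}")
```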
The condition $E[y^2] < \infty$ is not necessary. All that is required is $E[|y|] < \infty$. In this case $E[y|x]$ is any $m(x)$ such that $E[|m(x)|] < \infty$ and
$$E[(y - m(x))\,I\{x \in B\}] = 0$$
for (essentially) all sets $B$.

2.1 Properties of the Conditional Expectation

Assume all conditional expectations below exist.

1) If $y = f(x)$, then $E[y|x] = f(x)$.

2) $E[y + z|x] = E[y|x] + E[z|x]$, since
$$E[(y + z - E[y|x] - E[z|x])\,I\{x \in B\}] = E[(y - E[y|x])\,I\{x \in B\}] + E[(z - E[z|x])\,I\{x \in B\}] = 0. \quad (67)$$

3) $E[f(x)\,y|x] = f(x)\,E[y|x]$.

4) If $\Pr\{y \geq 0\} = 1$, then $\Pr\{E[y|x] \geq 0\} = 1$.

5) $E[y] = E[E[y|x]]$. More generally (law of iterated expectations),
$$E\big[E[y|x_1, x_2]\,\big|\,x_1\big] = E[y|x_1]. \quad (68)$$
We need to show that $E[(E[y|x_1, x_2] - E[y|x_1])\,I\{x_1 \in B\}] = 0$ for all sets $B$. Note that
$$E\big[E[y|x_1, x_2]\,I\{x_1 \in B\}\big] = E[y\,I\{x_1 \in B\}] = E\big[E[y|x_1]\,I\{x_1 \in B\}\big], \quad (69)$$
where both equalities use the defining property of the respective conditional expectation, applied with the set $\{x_1 \in B\}$ (a set determined by $x_1$, hence also by $(x_1, x_2)$).

6) If $x \perp y$, then
$$E[y|x] = E[y]. \quad (70)$$

Exercise 2.6 Show that independence implies mean independence.

Exercise 2.7 Show that mean independence implies uncorrelatedness.

3 Linear Regression

Let $(y, x, u)$ be a random vector such that $y$ and $u$ take values in $\mathbb{R}$ and $x$ takes values in $\mathbb{R}^{k+1}$. Also, suppose that $x' = (x_0, x_1, \dots, x_k)$ with $x_0 = 1$. Then $\exists \beta \in \mathbb{R}^{k+1}$ such that
$$y = x'\beta + u. \quad (71)$$
There are several possible interpretations of this model.

1. Linear Conditional Expectation. Assume $E[y|x] = x'\beta$ and define $u = y - E[y|x] = y - x'\beta$. By definition,
$$y = x'\beta + u. \quad (72)$$
Note that $E[u|x] = E[y|x] - E[E[y|x]\,|\,x] = 0$ (which implies $E[ux] = 0$). In this case $\beta$ is a convenient way of summarizing a feature of the distribution of $(y, x)$, namely $E[y|x]$.

2. Best Linear Approximation to $E[y|x]$, or best linear predictor of $y$ given $x$. $E[y|x]$ may not be linear, so the interest is in the best linear approximation, i.e.
$$\min_{b \in \mathbb{R}^{k+1}} E\big[(E[y|x] - x'b)^2\big].$$

Claim 3.1 The problem is equivalent to
$$\min_{b \in \mathbb{R}^{k+1}} E\big[(y - x'b)^2\big].$$

Proof. Let $V := y - E[y|x]$. Then
$$E[(y - x'b)^2] = E[(V + E[y|x] - x'b)^2] = E[V^2] + 2E[V(E[y|x] - x'b)] + E[(E[y|x] - x'b)^2]. \quad (73)$$
The middle term is zero by the orthogonality property of the conditional expectation ($E[y|x] - x'b$ is a function of $x$), and the first term does not depend on $b$, so the two problems have the same solutions, which completes the proof.

Now, notice that the first-order condition of $\min_{b \in \mathbb{R}^{k+1}} E[(y - x'b)^2]$ implies that the solution $\beta$ satisfies
$$E[x(y - x'\beta)] = 0; \quad (74)$$
then, if $u := y - x'\beta$,
$$E[xu] = 0. \quad (75)$$

3. Causal Model of $y$ given $x$ (and other stuff!). Assume $y = g(x, u)$, where $x$ are the observed characteristics determining $y$ and $u$ the unobserved ones. Such a $g(\cdot)$ is a model of $y$ as a function of $x$ and $u$. In this case the effect of $x_j$ on $y$, holding $x_{-j}$ and $u$ constant, is determined by $g(\cdot)$. Assume further that $g(x, u) = x'\beta + u$; then $\beta_j$ has a causal interpretation, but nothing is assumed about $E[ux]$ or $E[u|x]$.

3.1 Estimation

Let $(x, y, u)$ be a random vector, where $y$ and $u$ take values in $\mathbb{R}$ and $x$ in $\mathbb{R}^{k+1}$, and the first element of $x$ is equal to $1$. Then there exists $\beta \in \mathbb{R}^{k+1}$ such that $y = x'\beta + u$ with $E[xu] = 0$, $E[xx']$ finite, and no perfect collinearity in $x$. There is perfect collinearity in $x$ if $\exists c \in \mathbb{R}^{k+1}$ such that $c \neq 0$ and $\Pr\{c'x = 0\} = 1$.

To solve for $\beta$, notice that
$$E[xu] = 0 \ \Rightarrow \ E[x(y - x'\beta)] = 0 \ \Rightarrow \ E[xx']\beta = E[xy]. \quad (76)$$
Then, if $E[xx']$ is invertible,
$$\beta = (E[xx'])^{-1}E[xy]. \quad (77)$$
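An added sketch of equation (77) in sample form: with simulated data, the plug-in estimator $(\frac{1}{n}\sum_i x_ix_i')^{-1}\frac{1}{n}\sum_i x_iy_i$ is computed and the sample analogue of the moment condition (75) is checked; the coefficient values and design are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
beta = np.array([1.0, 2.0, -0.5])              # arbitrary true coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n)                          # E[xu] = 0 by construction
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X / n, X.T @ y / n)   # sample analogue of (77)
print("beta_hat:", np.round(beta_hat, 3))
print("sample E[x*uhat]:", np.round(X.T @ (y - X @ beta_hat) / n, 10))  # ~ 0, eq. (75)
```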
Lemma 3.2 $E[xx']$ is invertible if and only if there is no perfect collinearity in $x$.

Proof. ($\Rightarrow$) Suppose $E[xx']$ is invertible and assume that there is perfect collinearity, i.e. $\exists c \in \mathbb{R}^{k+1}$, $c \neq 0$, with $\Pr\{c'x = 0\} = 1$. Then
$$E[xx']c = E[x(x'c)] = 0, \quad (78)$$
so $E[xx']$ is not invertible, which is a contradiction.

($\Leftarrow$) Suppose there is no perfect collinearity in $x$ and assume $E[xx']$ is not invertible. Then $\exists c \neq 0$ such that $E[xx']c = 0$, so
$$c'E[xx']c = E[(c'x)^2] = 0 \ \Rightarrow \ \Pr\{c'x = 0\} = 1, \quad (79)$$
which contradicts the absence of perfect collinearity.

Remark 3.3 If there is perfect collinearity, then there are many solutions to $E[xx']\beta = E[xy]$. However, any two solutions $\hat{\beta}, \tilde{\beta}$ satisfy
$$\Pr\{x'\hat{\beta} = x'\tilde{\beta}\} = 1. \quad (80)$$
However, the interpretation of the two values may be different in the causal case.

3.1.1 Estimating Subvectors

Let
$$y = x_1'\beta_1 + x_2'\beta_2 + u. \quad (81)$$
From the above results,
$$\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \left(E\begin{bmatrix} x_1x_1' & x_1x_2' \\ x_2x_1' & x_2x_2' \end{bmatrix}\right)^{-1} E\begin{bmatrix} x_1y \\ x_2y \end{bmatrix}. \quad (82)$$
In this case the partitioned-matrix inverse formula may help to obtain $\beta_1$. However, there is another method. For random vectors $A$ and $B$, denote by $BLP(A|B)$ the best linear predictor of $A$ given $B$ (defined component-wise when $A$ is a vector). Define
$$\tilde{y} = y - BLP(y|x_2) \quad (83)$$
and
$$\tilde{x}_1 = x_1 - BLP(x_1|x_2). \quad (84)$$
Also, consider
$$\tilde{y} = \tilde{x}_1'\tilde{\beta}_1 + \tilde{u}, \quad (85)$$
where $E[\tilde{x}_1\tilde{u}] = 0$, as in the BLP interpretation.

Claim 3.4 (Population Counterpart to the Frisch-Waugh-Lovell Decomposition) $\tilde{\beta}_1 = \beta_1$.

Proof.
$$\tilde{\beta}_1 = \big(E[\tilde{x}_1\tilde{x}_1']\big)^{-1}E[\tilde{x}_1\tilde{y}] = \big(E[\tilde{x}_1\tilde{x}_1']\big)^{-1}\Big(E[\tilde{x}_1x_1']\beta_1 + E[\tilde{x}_1x_2']\beta_2 + E[\tilde{x}_1u] - E[\tilde{x}_1\,BLP(y|x_2)]\Big) = \big(E[\tilde{x}_1\tilde{x}_1']\big)^{-1}E[\tilde{x}_1\tilde{x}_1']\beta_1 = \beta_1. \quad (86)$$
Here $E[\tilde{x}_1x_2'] = 0$ and $E[\tilde{x}_1\,BLP(y|x_2)] = 0$ by the properties of the BLP, $E[\tilde{x}_1u] = 0$ because $E[xu] = 0$, and $E[\tilde{x}_1x_1'] = E[\tilde{x}_1(\tilde{x}_1 + BLP(x_1|x_2))'] = E[\tilde{x}_1\tilde{x}_1']$.

In the special case in which $x_2 = 1$, the result shows that with
$$\tilde{y} = y - E[y], \qquad \tilde{x}_1 = x_1 - E[x_1], \quad (87)$$
we obtain
$$\beta_1 = \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}. \quad (88)$$

3.1.2 Omitted Variables Bias

Suppose
$$y = x_1'\beta_1 + x_2'\beta_2 + u \quad (89)$$
and consider also the best linear predictor of $y$ given $x_1$ alone,
$$y = x_1'\tilde{\beta}_1 + \tilde{u}, \quad (90)$$
where $E[x_1\tilde{u}] = 0$. Typically $\tilde{\beta}_1 \neq \beta_1$. In fact,
$$\tilde{\beta}_1 = (E[x_1x_1'])^{-1}E[x_1y] = \beta_1 + (E[x_1x_1'])^{-1}E[x_1x_2']\beta_2. \quad (91)$$

3.1.3 Ordinary Least Squares

Let $(x, y, u)$ be a random vector, where $y$ and $u$ take values in $\mathbb{R}$ and $x$ in $\mathbb{R}^{k+1}$, and let $P$ be the distribution of $(y, x)$. Let $\{(y_i, x_i) : i \geq 1\}$ be an i.i.d. sequence from $P$. The natural estimator of $\beta = (E[xx'])^{-1}E[xy]$ is therefore
$$\hat{\beta}_n = \Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n x_iy_i. \quad (92)$$
Also, this is the solution to
$$\min_{b \in \mathbb{R}^{k+1}} \frac{1}{n}\sum_{i=1}^n (y_i - x_i'b)^2,$$
whose first-order condition is
$$\frac{1}{n}\sum_{i=1}^n x_i(y_i - x_i'\hat{\beta}_n) = 0, \quad (93)$$
where $\hat{u}_i := y_i - x_i'\hat{\beta}_n$ is the $i$-th residual. Also, note that
$$y_i = \hat{y}_i + \hat{u}_i \quad (94)$$
and, by definition,
$$\frac{1}{n}\sum_{i=1}^n x_i\hat{u}_i = 0. \quad (95)$$

OLS as a Projection. Now stack the observations: $Y = (y_1, \dots, y_n)'$, $U = (u_1, \dots, u_n)'$, and let $X$ be the $n \times (k+1)$ matrix with rows $x_i'$; hatted objects are defined analogously. In this notation,
$$\hat{\beta}_n = (X'X)^{-1}X'Y, \quad (96)$$
and $\hat{\beta}_n$ is equivalently the solution to
$$\min_b (Y - Xb)'(Y - Xb).$$
[Graph 1, "OLS as a Projection", is omitted in this transcription.] As shown in Graph 1, $X\hat{\beta}_n$ is the projection of $Y$ onto the column space of $X$. The projection, in this case, is carried out by the matrix $P = X(X'X)^{-1}X'$. The matrix $M = I - P$ is also a projection: it projects onto the linear space orthogonal to the column space of $X$.

Exercise 3.5 Show that the matrix $M$ is idempotent.

Note that
$$Y = PY + MY \quad (97)$$
$$= \hat{Y} + \hat{U}, \quad (98)$$
which is why $M$ is called the residual maker.
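The following added sketch checks Claim 3.4 (and anticipates the sample residual regression below) numerically: the coefficient on $x_1$ from the full regression equals the coefficient from regressing residualized $y$ on residualized $x_1$; variable names and coefficient values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])     # includes the constant
x1 = 0.8 * x2[:, 1] + rng.normal(size=n)                   # correlated with x2
y = 1.5 * x1 + x2 @ np.array([0.3, -1.0]) + rng.normal(size=n)

ols = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)

# Full regression of y on (x1, x2)
b_full = ols(np.column_stack([x1, x2]), y)

# FWL: residualize y and x1 on x2, then regress residual on residual
M2 = lambda v: v - x2 @ ols(x2, v)
b_fwl = ols(M2(x1).reshape(-1, 1), M2(y))

print(f"coefficient on x1: full = {b_full[0]:.4f}, FWL = {b_fwl[0]:.4f}")
```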
Residual Regression. Partition $X = (X_1, X_2)$ and let $P_1$ and $P_2$ be the corresponding projection matrices. Also, let
$$M_1 = I - P_1, \qquad M_2 = I - P_2. \quad (99)$$
One may write
$$Y = \hat{Y} + \hat{U} = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{U}. \quad (100)$$
Then,
$$M_2Y = M_2X_1\hat{\beta}_1 + M_2X_2\hat{\beta}_2 + M_2\hat{U} = M_2X_1\hat{\beta}_1 + 0 + M_2\hat{U}, \quad (101)$$
and therefore
$$\big((M_2X_1)'(M_2X_1)\big)^{-1}(M_2X_1)'M_2Y = \hat{\beta}_1$$
(using $(M_2X_1)'\hat{U} = X_1'M_2\hat{U} = X_1'\hat{U} = 0$), which is the residual regression, or the sample Frisch-Waugh-Lovell decomposition.

3.1.4 Properties of the OLS Estimator

Let the setup be as above, and assume additionally that $E[U|X] = 0$ (equivalently, $E[u_i|x_i] = 0$). Then
$$E[\hat{\beta}_n] = \beta. \quad (102)$$
In fact,
$$E[\hat{\beta}_n\,|\,x_1, \dots, x_n] = \beta. \quad (103)$$
This follows because
$$\hat{\beta}_n = (X'X)^{-1}X'Y = \beta + (X'X)^{-1}X'U, \quad \text{so} \quad E[\hat{\beta}_n\,|\,X] = \beta + (X'X)^{-1}X'E[U|x_1, \dots, x_n] = \beta, \quad (104)$$
using the fact that the $i$-th component of $E[U|x_1, \dots, x_n]$ equals $E[u_i|x_i] = 0$ because the observations are i.i.d.

Gauss-Markov Theorem.

1. Assume $E[u|x] = 0$ and $\operatorname{Var}[u|x] = \operatorname{Var}[u] = \sigma^2$, i.e. the errors are homoskedastic. (Remark: in general $\operatorname{Var}[u|x] \neq \operatorname{Var}[u]$; homoskedasticity is precisely the assumption that the conditional variance is constant.)

2. Linear estimators are of the form $A'Y$ for some matrix $A = A(x_1, \dots, x_n)$; the choice $A' = (X'X)^{-1}X'$ yields $\hat{\beta}_n$. A linear estimator is (conditionally) unbiased when
$$E[A'Y|x_1, \dots, x_n] = A'E[Y|x_1, \dots, x_n] = A'X\beta = \beta \text{ for all } \beta, \quad (105)$$
which means $A'X = I$.

3. "Best" means smallest variance.

Remark 3.6 For two covariance matrices, $V \leq V_0$ means $V_0 - V \geq 0$, i.e. $V_0 - V$ is positive semidefinite.

Then one needs to show that, for every admissible choice of $A$ (i.e. $A'X = I$),
$$A'A - (X'X)^{-1} \geq 0,$$
since $\operatorname{Var}[A'Y|X] = \sigma^2A'A$ and $\operatorname{Var}[\hat{\beta}_n|X] = \sigma^2(X'X)^{-1}$. To show this, let
$$C := A - X(X'X)^{-1}. \quad (106)$$
Then
$$A'A = \big(C + X(X'X)^{-1}\big)'\big(C + X(X'X)^{-1}\big) = C'C + C'X(X'X)^{-1} + (X'X)^{-1}X'C + (X'X)^{-1}, \quad (107)$$
but $C'X = A'X - (X'X)^{-1}X'X = I - I = 0$, so $A'A = C'C + (X'X)^{-1} \geq (X'X)^{-1}$, which completes the proof.

3.1.5 Consistency of OLS

Claim 3.7 Assume that $E[xx']$ and $E[xy]$ are finite and that $E[xx']$ is invertible. Then the OLS estimator is consistent.

Proof. Recall that
$$\beta = (E[xx'])^{-1}E[xy] \quad \text{and} \quad \hat{\beta}_n = \Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n x_iy_i. \quad (108)$$
By the WLLN,
$$\frac{1}{n}\sum_{i=1}^n x_ix_i' \overset{p}{\to} E[xx'], \qquad \frac{1}{n}\sum_{i=1}^n x_iy_i \overset{p}{\to} E[xy],$$
and by the Continuous Mapping Theorem (matrix inversion is continuous at invertible matrices) and Slutsky's Theorem, $\hat{\beta}_n \overset{p}{\to} \beta$ as $n \to \infty$, which completes the proof.

3.1.6 Asymptotic Distribution of OLS

Note that
$$\hat{\beta}_n = \Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n x_iy_i = \Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\Big)^{-1}\Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\beta + \frac{1}{n}\sum_{i=1}^n x_iu_i\Big) = \beta + \Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\Big)^{-1}\frac{1}{n}\sum_{i=1}^n x_iu_i. \quad (109)$$
Then,
$$\sqrt{n}\,(\hat{\beta}_n - \beta) = \Big(\frac{1}{n}\sum_{i=1}^n x_ix_i'\Big)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n x_iu_i, \quad (110)$$
and if $E[xx']$ exists (and is invertible) and $\operatorname{Var}(xu)$ exists,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n x_iu_i \overset{d}{\to} N(0, \operatorname{Var}(xu))$$
by the Central Limit Theorem (recall $E[xu] = 0$). By the CMT and Slutsky's Theorem,
$$\sqrt{n}\,(\hat{\beta}_n - \beta) \overset{d}{\to} N(0, \Omega), \quad \text{where} \quad \Omega = (E[xx'])^{-1}\operatorname{Var}(xu)(E[xx'])^{-1}. \quad (111)$$

3.1.7 Consistent Estimation of the Variance

If one wants to estimate $\Omega = (E[xx'])^{-1}\operatorname{Var}(xu)(E[xx'])^{-1}$, the relevant part is to estimate $\operatorname{Var}(xu)$. Assume that $u$ is homoskedastic, i.e. $E[u|x] = 0$ and $\operatorname{Var}[u|x] = \operatorname{Var}[u] = \sigma^2$. Then
$$\operatorname{Var}(xu) = E[xx'u^2] = E\big[xx'E[u^2|x]\big] = E[xx']\operatorname{Var}(u). \quad (112)$$
Then what one needs to do is estimate $\operatorname{Var}(u)$. A natural choice for this is
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \hat{u}_i^2. \quad (113)$$

Claim 3.8 $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \hat{u}_i^2$ is a consistent estimator of $\operatorname{Var}(u)$ under homoskedasticity.

Proof. Recall that
$$\hat{u}_i = y_i - x_i'\hat{\beta}_n = y_i - x_i'\beta + x_i'(\beta - \hat{\beta}_n) = u_i - x_i'(\hat{\beta}_n - \beta). \quad (114)$$
Then,
$$\hat{u}_i^2 = u_i^2 - 2u_ix_i'(\hat{\beta}_n - \beta) + \big(x_i'(\hat{\beta}_n - \beta)\big)^2, \quad (115)$$
which implies
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n u_i^2 - 2\Big(\frac{1}{n}\sum_{i=1}^n u_ix_i'\Big)(\hat{\beta}_n - \beta) + \frac{1}{n}\sum_{i=1}^n \big(x_i'(\hat{\beta}_n - \beta)\big)^2. \quad (116)$$
First, note that
$$\Big(\frac{1}{n}\sum_{i=1}^n u_ix_i'\Big)(\hat{\beta}_n - \beta) = O_p(1)\,o_p(1) = o_p(1).$$
Also, notice that
$$\frac{1}{n}\sum_{i=1}^n \big(x_i'(\hat{\beta}_n - \beta)\big)^2 \leq \Big(\frac{1}{n}\sum_{i=1}^n |x_i|^2\Big)|\hat{\beta}_n - \beta|^2, \quad (117)$$
where $\frac{1}{n}\sum_{i=1}^n |x_i|^2 \overset{p}{\to} E[|x_i|^2]$, so it is $O_p(1)$, and $|\hat{\beta}_n - \beta|^2 = o_p(1)$. Then,
$$\frac{1}{n}\sum_{i=1}^n \big(x_i'(\hat{\beta}_n - \beta)\big)^2 = o_p(1). \quad (118)$$
Hence, by the WLLN applied to the first term, $\hat{\sigma}^2 \overset{p}{\to} \operatorname{Var}(u)$, which completes the proof.
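As an added check of (111)-(113) under homoskedasticity, the simulation below compares the Monte Carlo variance of $\sqrt{n}(\hat{\beta}_n - \beta)$ with the plug-in estimate $\hat{\sigma}^2(\frac{1}{n}\sum_i x_ix_i')^{-1}$; all design choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 400, 5_000
beta = np.array([0.5, 1.0])

draws = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    u = rng.normal(scale=1.5, size=n)                 # homoskedastic errors
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    draws[r] = np.sqrt(n) * (b - beta)

# Plug-in asymptotic variance from the last replication: sigma2_hat * (X'X/n)^{-1}
sigma2_hat = np.mean((y - X @ b) ** 2)
V_hat = sigma2_hat * np.linalg.inv(X.T @ X / n)
print("Monte Carlo var of sqrt(n)(b - beta):\n", np.round(np.cov(draws.T), 3))
print("plug-in estimate sigma2_hat * (X'X/n)^{-1}:\n", np.round(V_hat, 3))
```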
Now assume that the error term is heteroskedastic. Then the problem is how to estimate $E[xx'u^2] = \operatorname{Var}(xu)$ directly. A natural estimator is
$$\hat{V}_n = \frac{1}{n}\sum_{i=1}^n x_ix_i'\hat{u}_i^2. \quad (119)$$
The idea is to show that this estimator is consistent. In order to do so, the following lemma is required.

Lemma 3.9 Let $z_1, \dots, z_n$ be an i.i.d. sequence such that $E[|z_i|^r] < \infty$. Then
$$n^{-1/r}\max_{1 \leq i \leq n}|z_i| \overset{p}{\to} 0.$$

Proof. Not shown in class.

Claim 3.10 $\hat{V}_n = \frac{1}{n}\sum_{i=1}^n x_ix_i'\hat{u}_i^2$ is a consistent estimator of $\operatorname{Var}(xu)$ under heteroskedasticity (given the moment conditions below).

Proof. Notice that
$$\hat{V}_n = \frac{1}{n}\sum_{i=1}^n x_ix_i'u_i^2 + \frac{1}{n}\sum_{i=1}^n x_ix_i'(\hat{u}_i^2 - u_i^2). \quad (120)$$
Now,
$$\frac{1}{n}\sum_{i=1}^n x_ix_i'u_i^2 \overset{p}{\to} \operatorname{Var}(xu)$$
by the WLLN. Then the proof is complete if $\frac{1}{n}\sum_{i=1}^n x_ix_i'(\hat{u}_i^2 - u_i^2) = o_p(1)$ is shown. Note that⁵
$$\Big\|\frac{1}{n}\sum_{i=1}^n x_ix_i'(\hat{u}_i^2 - u_i^2)\Big\| \leq \frac{1}{n}\sum_{i=1}^n \|x_ix_i'\|\,|\hat{u}_i^2 - u_i^2| \leq \Big(\frac{1}{n}\sum_{i=1}^n \|x_ix_i'\|\Big)\max_{1 \leq i \leq n}|\hat{u}_i^2 - u_i^2|. \quad (121)$$
If $E[\|xx'\|] < \infty$, then $\frac{1}{n}\sum_{i=1}^n \|x_ix_i'\| = O_p(1)$ by the WLLN, and $\max_{1 \leq i \leq n}|\hat{u}_i^2 - u_i^2| = o_p(1)$ because
$$\max_{1 \leq i \leq n}|\hat{u}_i^2 - u_i^2| \leq 2\max_{1 \leq i \leq n}|u_i||x_i|\,|\hat{\beta}_n - \beta| + \max_{1 \leq i \leq n}|x_i|^2\,|\hat{\beta}_n - \beta|^2 = o_p(1), \quad (122)$$
since
$$\Big(n^{-1/2}\max_{1 \leq i \leq n}|u_i||x_i|\Big)\Big(\sqrt{n}\,|\hat{\beta}_n - \beta|\Big) \overset{p}{\to} 0$$
by Lemma 3.9 (with $r = 2$ applied to $z_i = |u_i||x_i|$) and the $\sqrt{n}$-consistency of $\hat{\beta}_n$, and similarly for the term in $|x_i|^2$. Notice that this is true if $E[u^2|x|^2] < \infty$.

⁵ Here $\|\cdot\|$ takes the element-wise absolute value of the matrix, and the inequalities hold element-wise.

3.2 Hypothesis Testing

Let $(x, y, u)$ be a random vector, where $y$ and $u$ take values in $\mathbb{R}$ and $x$ in $\mathbb{R}^{k+1}$. Recall that
$$\sqrt{n}\,(\hat{\beta}_n - \beta) \overset{d}{\to} N(0, \Omega), \quad \text{where} \quad \Omega = (E[xx'])^{-1}\operatorname{Var}(xu)(E[xx'])^{-1}. \quad (123)$$
This uses the assumptions $E[|x|^2] < \infty$, $E[u^2|x|^2] < \infty$ and $E[xu] = 0$; note that this implies $\operatorname{Var}(xu) = E[xx'u^2]$. It is reasonable to assume that there is some noise in the model to be estimated, so that $\Omega$ is invertible. It suffices to assume that
$$\Pr\{E[u^2|x] > 0\} = 1. \quad (124)$$
Then $\Omega$ is invertible, i.e. positive definite. To see this, take $c \neq 0$ and note that
$$c'E[xx'u^2]c = E[(c'x)^2u^2] = E\big[(c'x)^2E[u^2|x]\big] > 0, \quad (125)\text{-}(126)$$
using the absence of perfect collinearity.

3.2.1 Testing a Single Linear Restriction

In this case,
$$H_0 : r'\beta = 0 \quad \text{vs.} \quad H_1 : r'\beta \neq 0, \quad (127)$$
where $r$ is a $(k+1)$-dimensional vector. By Slutsky's Theorem (and the CMT),
$$\sqrt{n}\,(r'\hat{\beta}_n - r'\beta) \overset{d}{\to} N(0, r'\Omega r).$$
Now the $t$-statistic is
$$t_n = \frac{\sqrt{n}\,r'\hat{\beta}_n}{\sqrt{r'\hat{\Omega}_nr}}, \quad (128)$$
where $\hat{\Omega}_n = \big(\frac{1}{n}\sum_i x_ix_i'\big)^{-1}\hat{V}_n\big(\frac{1}{n}\sum_i x_ix_i'\big)^{-1}$ is the consistent estimator of $\Omega$. Note that, by Slutsky's Theorem, since $r'\hat{\Omega}_nr \overset{p}{\to} r'\Omega r$,
$$t_n \overset{d}{\to} N(0, 1) \quad \text{under } H_0. \quad (129)$$
Now, the idea is to show that the test is consistent in level, i.e.
$$\lim_{n \to \infty} \Pr\{|t_n| > z_{1-\alpha/2}\} = \alpha \quad (130)$$
when $H_0$ is true, where $z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$. Indeed,
$$\lim_{n \to \infty} \Pr\{|t_n| > z_{1-\alpha/2}\} = \lim_{n \to \infty}\Big(\Pr\{t_n > z_{1-\alpha/2}\} + \Pr\{t_n < -z_{1-\alpha/2}\}\Big) = \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha.$$

3.2.2 Testing Multiple Linear Restrictions

In this case,
$$H_0 : R\beta - r = 0 \quad \text{vs.} \quad H_1 : R\beta - r \neq 0, \quad (131)$$
where $R$ is a $p \times (k+1)$ matrix whose rows are linearly independent and $r \in \mathbb{R}^p$. Also, note that $R\Omega R'$ is invertible (using the fact that $\Omega$ is invertible and $R$ has full row rank). By the CMT,
$$\sqrt{n}\,(R\hat{\beta}_n - R\beta) \overset{d}{\to} N(0, R\Omega R'),$$
and also
$$R\hat{\Omega}_nR' \overset{p}{\to} R\Omega R'.$$
Then, under $H_0$,
$$\big(\sqrt{n}\,(R\hat{\beta}_n - r)\big)'\big(R\hat{\Omega}_nR'\big)^{-1}\big(\sqrt{n}\,(R\hat{\beta}_n - r)\big) \overset{d}{\to} \chi^2_p$$
by the CMT, again. The critical value is $c_{p,1-\alpha}$, the $1-\alpha$ quantile of the $\chi^2_p$ distribution.
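To close, an added sketch implements the heteroskedasticity-robust pieces just described: the sandwich estimator built from (119) and the Wald statistic of Section 3.2.2, whose rejection rate under $H_0$ should be close to $\alpha$; the data-generating process is an arbitrary heteroskedastic design.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n, alpha, reps = 500, 0.05, 2_000
beta = np.array([1.0, 0.0, 0.0])          # H0: beta_1 = beta_2 = 0 is true
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])           # p = 2 restrictions, r = 0
crit = chi2.ppf(1 - alpha, df=R.shape[0])

rej = 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    u = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))      # heteroskedastic errors
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    uhat = y - X @ b
    Q_inv = np.linalg.inv(X.T @ X / n)
    V_hat = (X * uhat[:, None] ** 2).T @ X / n            # eq. (119)
    Omega_hat = Q_inv @ V_hat @ Q_inv                     # sandwich estimator
    w = np.sqrt(n) * (R @ b)
    wald = w @ np.linalg.inv(R @ Omega_hat @ R.T) @ w
    rej += (wald > crit)
print(f"empirical rejection rate under H0: {rej / reps:.3f} (target {alpha})")
```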