Two Sample: Elementary Multivariate Statistical Inference
Contents

1 Notation
2 General Algorithm
3 Paired Data
  3.1 Example
  3.2 Idea
  3.3 Assumptions
  3.4 Formulas
  3.5 Repeated Measure Design (RMD)
4 Comparing Mean Vectors From Two Populations: Unpaired Data
  4.1 Assumptions
  4.2 Example
  4.3 Analysis
  4.4 Example: Profile Analysis

II.G. Two-Sample Data

1 Notation

Both samples are p dimensional.

            Sample                         Summary             Population characteristics
sample 1    $X_{1,1}, \dots, X_{1,n_1}$    $\bar X_1,\ S_1$    $\mu_1,\ \Sigma_1$
sample 2    $X_{2,1}, \dots, X_{2,n_2}$    $\bar X_2,\ S_2$    $\mu_2,\ \Sigma_2$

2 General Algorithm

1. Can the data be paired? If so, use the paired data approach.
2. For unpaired data, are the samples large? If so, use the large sample approach.
3. If the unpaired data are small samples, check that the data are normal and that $\Sigma_1 = \Sigma_2$. If not, recode the data.

3 Paired Data

3.1 Example

Sales before and after a major advertising campaign. In general, comparison of two treatment responses.

          sales per day, before treatment    sales per day, after treatment
store     product A      product B           product A      product B
1         16             15                  25             10
2         28              9                  13             18
3         27             12                  22             14
...       ...            ...                 ...            ...
11        ...            ...                 ...            ...

The after-treatment columns form the 1st sample and the before-treatment columns form the 2nd sample:

\[
\underbrace{X_{11} = \begin{pmatrix} 25 \\ 10 \end{pmatrix},\ X_{12} = \begin{pmatrix} 13 \\ 18 \end{pmatrix},\ \dots}_{\text{1st sample}}
\qquad
\underbrace{X_{21} = \begin{pmatrix} 16 \\ 15 \end{pmatrix},\ X_{22} = \begin{pmatrix} 28 \\ 9 \end{pmatrix},\ \dots}_{\text{2nd sample}}
\]
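As a small illustration of how the paired layout is used, the sketch below forms the after-minus-before difference for each store. It uses only the three stores whose rows are shown in the table; the remaining rows are elided in the notes and are not reproduced. The variable names (`before`, `after`, `diffs`) are illustrative assumptions, not notation from the notes.

```python
# Rows for stores 1 to 3 as shown in the Section 3.1 table.
# Each pair is (product A, product B) sales per day.
before = [(16, 15), (28, 9), (27, 12)]   # 2nd sample: X_{2,1}, X_{2,2}, X_{2,3}
after = [(25, 10), (13, 18), (22, 14)]   # 1st sample: X_{1,1}, X_{1,2}, X_{1,3}

# Componentwise difference, 1st sample minus 2nd sample, one vector per store.
diffs = [
    (a_A - b_A, a_B - b_B)
    for (a_A, a_B), (b_A, b_B) in zip(after, before)
]

for store, d in enumerate(diffs, start=1):
    print(f"store {store}: difference = {d}")
```

Each store contributes one difference vector, so the two paired samples collapse into a single sample of p-dimensional observations.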
3.2 Idea

$D_i = X_{1,i} - X_{2,i}$. Treat $D_1, \dots, D_n$ as one-sample data and analyze it using the methods in II.A through II.F.

Note:
\[
\mu_D = \mu_1 - \mu_2 = \begin{pmatrix} \mu_{11} - \mu_{21} \\ \mu_{12} - \mu_{22} \end{pmatrix}.
\qquad \text{Eg. } D_1 = \begin{pmatrix} 9 \\ -5 \end{pmatrix}.
\]

3.3 Assumptions

$D_1, \dots, D_n$ are i.i.d. If $n - p$ is small, then we also need the normality assumption.

3.4 Formulas

1. Data transformation: Let $X_i = \begin{pmatrix} X_{1i} \\ X_{2i} \end{pmatrix}$ and set $C = (I_p \mid -I_p)_{p \times 2p}$. Then
\[
D_i = C X_i, \qquad \bar D = C \bar X = \bar X_1 - \bar X_2, \qquad S_D = C S C'.
\]

2. Hypothesis testing: $H_0: \mu_D = 0$ (no treatment effect) versus $H_1: \mu_D \neq 0$. Test statistic:
\[
T^2 = n \bar D' S_D^{-1} \bar D \sim \frac{(n-1)p}{n-p} F_{p,\,n-p} \quad \text{if } \mu_D = 0.
\]
Reject $H_0$ if $T^2 > T_\alpha^2$.
Eg. $T^2 = 13.6$, $T_{.05}^2 = 9.47$, so reject $H_0$. Evidently there is a treatment effect.

3. Confidence ellipsoids:
(a) For $\mu_D = \mu_1 - \mu_2$:
\[
n(\bar D - \mu_D)' S_D^{-1} (\bar D - \mu_D) \le T_\alpha^2 .
\]
Use the eigenvalues $\lambda_i$ and eigenvectors of $S_D$. The half-axis length is $\sqrt{\lambda_i T_\alpha^2 / n}$.
(b) Simultaneous confidence intervals: for all compounds (linear combinations) $\ell = a' \mu_D$, the simultaneous confidence intervals are
\[
a' \bar D \pm T_\alpha \sqrt{\frac{a' S_D a}{n}} .
\]

Eg. 95% simultaneous confidence interval for the change in sales per day for product A:
\[
a = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
\bar D = \begin{pmatrix} 13.27 \\ -9.36 \end{pmatrix}, \quad
S_D = \begin{pmatrix} 418.61 & 88.38 \\ 88.38 & 199.26 \end{pmatrix},
\]
\[
13.27 \pm T_{.05} \sqrt{\frac{418.61}{11}} = (-5.7,\ 32.3).
\]
For product B, $a = (0, 1)'$:
\[
-9.36 \pm T_{.05} \sqrt{\frac{199.26}{11}} = (-22.5,\ 3.7).
\]
Gain in competitive edge, i.e., change in market difference (product A minus product B):

            product A − product B
After       $\mu_{11} - \mu_{12}$
Before      $\mu_{21} - \mu_{22}$
Change      $(\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22})$

With $a = (1, -1)'$:
\[
13.27 - (-9.36) \pm T_{.05} \sqrt{\frac{418.61 + 199.26 - 2 \cdot 88.38}{11}} = (3.1,\ 42.1).
\]

3.5 Repeated Measure Design (RMD)

1. Definition: In an RMD each individual (experimental unit) receives each of q treatments over successive periods of time. The order in which an individual receives the treatments is randomized.

2. Notation: the responses on the ith individual and their mean are
\[
X_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{iq} \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{pmatrix},
\]
where $X_{i1}$ is the first treatment response, $X_{iq}$ is the qth treatment response, and $\mu_j$ is the average treatment j response.

3. Example of contrasts:
(a) Control or standard (i.e., $\mu_1$) contrasted with the other treatments:
\[
\begin{pmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \vdots \\ \mu_1 - \mu_q \end{pmatrix}_{(q-1) \times 1}
=
\begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
1 & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
1 & 0 & 0 & \cdots & -1
\end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{pmatrix}
=
\begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_{q-1}' \end{pmatrix} \mu = C\mu .
\]
$Y_i = C X_i$. Contrast matrix: the rows are linearly independent, and each row sums to 0.
(b) Successive treatments:
\[
\begin{pmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \\ \vdots \\ \mu_{q-1} - \mu_q \end{pmatrix}
=
\begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
0 & 1 & -1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & -1
\end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{pmatrix}
=
\begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_{q-1}' \end{pmatrix} \mu = C\mu .
\]
(c) Eg. Anesthetic testing. Response: milliseconds between heart beats.
Design (the cell entries are the treatment numbers):

                                 Factor 2: CO2
                                 high      low
Factor 1: Halothane   present    3         4
                      absent     1         2

Sleeping dog data (Table 6.2, page 282):

         Treatments
dog      1      2      3      4
1        426    609    556    600      ($x_1'$)
...      ...    ...    ...    ...
19       ...    ...    ...    ...      ($x_{19}'$)

Contrasts:
Halothane:   $(\mu_3 + \mu_4) - (\mu_1 + \mu_2)$ = present − absent
CO2:         $(\mu_1 + \mu_3) - (\mu_2 + \mu_4)$ = high − low
Interaction: $(\mu_1 + \mu_4) - (\mu_2 + \mu_3)$
\[
C = \begin{pmatrix} -1 & -1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}, \qquad Y_i = C X_i .
\]

4. Analysis: $H_0: \mu_Y = 0$, i.e., $C\mu = 0$ (all contrasts are zero).
\[
T^2 = n \bar Y' S_Y^{-1} \bar Y = 116 .
\]
Recall that $\bar Y = C \bar X$, $S_Y = C S C'$, and
\[
T_\alpha^2 = \frac{(n-1)(q-1)}{n-(q-1)} F_{q-1,\,n-(q-1)}(\alpha) = 10.94
\]
(note that the problem is $q - 1$ dimensional). The simultaneous 95% confidence interval for the ith contrast is
\[
c_i' \bar X \pm T_\alpha \sqrt{\frac{c_i' S c_i}{n}} .
\]

4 Comparing Mean Vectors From Two Populations: Unpaired Data

4.1 Assumptions

1. $X_{11}, \dots, X_{1n_1}$ are i.i.d. random vectors from a population with characteristics $\mu_1$ and $\Sigma_1$.
2. $X_{21}, \dots, X_{2n_2}$ are i.i.d. random vectors from a population with characteristics $\mu_2$ and $\Sigma_2$.
3. The samples are independent of each other and both populations are p dimensional.
4. If $n_1 - p$ or $n_2 - p$ is small, then assume that each population is normally distributed with the same covariance matrix $\Sigma = \Sigma_1 = \Sigma_2$.

Note: If $\Sigma_1 \approx \Sigma_2$, the procedure works acceptably. Write $\Sigma_1 = (\sigma_{1,ij})$ and $\Sigma_2 = (\sigma_{2,ij})$. There is trouble if $\sigma_{1,ii} > 4\sigma_{2,ii}$ for some i, or vice versa.

4.2 Example

Electricity consumption for houses with and without air conditioning (AC) during July in Wisconsin (Eg 6.4, page 289).
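Population 1: Consumption for homes with AC, $n_1 = 45$ homes.
Population 2: Consumption for homes without AC, $n_2 = 55$ homes.
$X_{i1}$ = total consumption during peak periods.
$X_{i2}$ = total consumption during off peak periods.

Data summary:
\[
\text{with AC:} \quad \bar X_1 = \begin{pmatrix} 204.4 \\ 556.6 \end{pmatrix}, \quad
S_1 = \begin{pmatrix} 13825 & 23823 \\ 23823 & 73107 \end{pmatrix}
\]
\[
\text{without AC:} \quad \bar X_2 = \begin{pmatrix} 130.0 \\ 355.0 \end{pmatrix}, \quad
S_2 = \begin{pmatrix} 8632 & 19616 \\ 19616 & 55964 \end{pmatrix}
\]

4.3 Analysis

1. The estimate of $\mu_D = \mu_1 - \mu_2$ is $\bar D = \bar X_1 - \bar X_2$, and its (standard error)$^2$ is $\mathrm{cov}(\bar D) = \Sigma_1/n_1 + \Sigma_2/n_2$, which describes the variability in our estimate.
Eg. $\bar D = (74.4,\ 201.6)'$.
$\Sigma_1/n_1 + \Sigma_2/n_2$ must be estimated. If $n_1 - p$ and $n_2 - p$ are both large, then this covariance is estimated by $S_1/n_1 + S_2/n_2$.
Eg. (above)
\[
\frac{S_1}{n_1} + \frac{S_2}{n_2} = \begin{pmatrix} 464.17 & 886.08 \\ 886.08 & 2642.15 \end{pmatrix} .
\]
So the estimate for the peak hours has standard error $\sqrt{464.17} = 21.5$.
For small samples the assumption $\Sigma = \Sigma_1 = \Sigma_2$ is required, so $\Sigma_1/n_1 + \Sigma_2/n_2 = (1/n_1 + 1/n_2)\Sigma$, which is estimated by $(1/n_1 + 1/n_2) S_{pooled}$. Here
\[
S_{pooled} = \frac{SS_1 + SS_2}{(n_1 - 1) + (n_2 - 1)} = \frac{\text{total SS}}{\text{total d.f.}} = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2} .
\]
Eg. (above)
\[
S_{pooled} = \frac{44 S_1 + 54 S_2}{98} = \begin{pmatrix} 10963.7 & 21505.5 \\ 21505.5 & 63661.3 \end{pmatrix} .
\]
Hence,
\[
\left( \frac{1}{n_1} + \frac{1}{n_2} \right) S_{pooled} = \begin{pmatrix} 442.98 & 868.91 \\ 868.91 & 2572.17 \end{pmatrix} .
\]

2. Definition:
\[
S = \begin{cases}
\dfrac{S_1}{n_1} + \dfrac{S_2}{n_2}, & \text{if large samples,} \\[2ex]
\left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right) S_{pooled}, & \text{if small samples.}
\end{cases}
\]

3. Hypothesis tests: $H_0: \mu_1 = \mu_2$ (i.e., $\mu_D = \mu_1 - \mu_2 = 0$) versus $H_1: \mu_1 \neq \mu_2$, with $\bar D = \bar X_1 - \bar X_2$. The test statistic is
\[
T^2 = (\bar D - 0)' S^{-1} (\bar D - 0) = \bar D' S^{-1} \bar D \ \overset{H_0}{\sim}\ \frac{(\mathrm{df})\, p}{\mathrm{df} - p + 1} F_{p,\ \mathrm{df} - p + 1},
\]
and the right-hand side is approximately $\chi_p^2$ for large samples. For small samples, $\mathrm{df} = n_1 + n_2 - 2$. Reject $H_0$ if $T^2 > T_\alpha^2$.
Eg. (above, using the large sample approach) $T^2 = 15.66$, $\alpha = .05$, $T_\alpha^2 \approx \chi_2^2(.05) = 5.99$, so reject $H_0$. If the small sample approach is used instead, $T^2 = 16.07$ and
\[
T_\alpha^2 = \frac{(98)(2)}{97} F_{2,97}(.05) = 6.26 .
\]

Fact: The compound most responsible for rejecting $H_0: \mu_1 = \mu_2$ is $\ell = a' \mu_D$, where the direction of a is estimated by $S^{-1} \bar D$.
Eg. (above)
\[
S^{-1} \bar D = \begin{pmatrix} .041 \\ .063 \end{pmatrix} .
\]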
That is, the difference in off peak electricity consumption between homes with AC and homes without AC contributes more (.063 versus .041) than the difference in on peak electricity consumption to the rejection of $H_0: \mu_1 = \mu_2$.

4. Confidence ellipsoid for $\mu_D = \mu_1 - \mu_2$. General formula:
\[
(\text{estimate} - \mu_D)' (\mathrm{cov})^{-1} (\text{estimate} - \mu_D) \le T_\alpha^2 .
\]
Here,
\[
(\bar D - \mu_D)' S^{-1} (\bar D - \mu_D) \le T_\alpha^2 .
\]
Use the eigenvalues and eigenvectors of S to describe the ellipsoid.

5. Simultaneous confidence interval for $a' \mu_D$:
\[
a' \bar D \pm T_\alpha \sqrt{a' S a} .
\]

                   peak diff                   off peak diff
                   $\mu_{11} - \mu_{21}$       $\mu_{12} - \mu_{22}$
                   $a = (1, 0)'$               $a = (0, 1)'$
small samples      (22, 127)                   (75, 329)
large samples      (22, 127)                   (76, 327)

4.4 Example: Profile Analysis

1. Idea: Compare two or more groups that are each given p treatments (tests, questions, etc.). All the responses for different groups are independent.

2. Example:
Q1: Describe your contributions to the marriage.
Q2: Describe your outcomes from the marriage.
The answers to Q1 and Q2 are on a scale from 1 (extremely negative) to 5 (extremely positive).
Q3: Level of passionate love for your spouse.
Q4: Level of companionate love for your spouse.
The answers to Q3 and Q4 are on a scale from 1 (none) to 5 (tremendous).
Population 1: married men, $n_1 = 30$. Population 2: married women, $n_2 = 30$.

[Figure: Data profile. Mean ratings (vertical axis, ticks from 4.0 to 4.6) against question (Q1 to Q4), with one profile for husbands and one for wives.]

3. Analysis

P1: Are the profiles parallel?
$H_0: \mu_{1i} - \mu_{1,i-1} = \mu_{2i} - \mu_{2,i-1}$, $i = 2, \dots, p$; equivalently $C\mu_1 = C\mu_2$, or $C(\mu_1 - \mu_2) = 0$, with
\[
C = \begin{pmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix} .
\]

P2: Are the profiles identical (given that they are parallel)?
$H_0: \mu_{1i} = \mu_{2i}$ for $i = 1, \dots, p$. If the profiles are parallel, this is equivalent to $1'\mu_1 = 1'\mu_2$, or $1'(\mu_1 - \mu_2) = 0$.

P3: Are the profiles level (given that they are identical and all questions are on the same scale)?
$H_0: \mu_{11} = \mu_{12} = \cdots = \mu_{1p} = \mu_{21} = \cdots = \mu_{2p}$.

Test statistic: $T^2 = (\text{estimate})' (\mathrm{cov})^{-1} (\text{estimate})$.
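P1: $T^2 = [C(\bar X_1 - \bar X_2)]' (C S C')^{-1} [C(\bar X_1 - \bar X_2)]$; reject $H_0$ if $T^2 > T_\alpha^2$.
P2: $T^2 = [1'(\bar X_1 - \bar X_2)]' (1' S 1)^{-1} [1'(\bar X_1 - \bar X_2)]$; reject $H_0$ if $T^2 > F_{1,\,n-2}(\alpha)$.
P3: $T^2 = n (C \bar X)' (C S_{pooled} C')^{-1} (C \bar X)$; reject $H_0$ if $T^2 > F_{p-1,\,n-p}(\alpha)$,

where $\bar X$ = grand average, $n = n_1 + n_2$,
\[
S = \left( \frac{1}{n_1} + \frac{1}{n_2} \right) S_{pooled}, \qquad
T_\alpha^2 = \frac{(n-2)(p-1)}{n-p} F_{p-1,\,n-p}(\alpha) .
\]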
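All of the statistics above share the form (estimate)' (cov)^{-1} (estimate), so the worked values in Sections 3.4 and 4.3 can be recomputed from the summary statistics printed there. The following is a minimal sketch in plain Python, restricted to the p = 2 case so that the 2x2 inverse can be written in closed form. The helper names `solve_quad_2x2` and `combine` are hypothetical and are not part of the notes; the inputs are the rounded summaries from the notes, so the last digit of a result may differ from a hand calculation.

```python
def solve_quad_2x2(d, S):
    """Return (S^{-1} d, d' S^{-1} d) for a 2-vector d and symmetric 2x2 S."""
    (a, b), (_, c) = S
    det = a * c - b * b
    # Closed-form inverse of a symmetric 2x2 matrix applied to d.
    w = ((c * d[0] - b * d[1]) / det, (-b * d[0] + a * d[1]) / det)
    return w, d[0] * w[0] + d[1] * w[1]


def combine(S1, w1, S2, w2):
    """Entrywise w1*S1 + w2*S2 for 2x2 matrices given as nested lists."""
    return [[w1 * S1[i][j] + w2 * S2[i][j] for j in range(2)] for i in range(2)]


# Paired example (Section 3.4): T^2 = n * Dbar' S_D^{-1} Dbar with n = 11.
n = 11
d_paired = (13.27, -9.36)
S_D = [[418.61, 88.38], [88.38, 199.26]]
_, q = solve_quad_2x2(d_paired, S_D)
t2_paired = n * q

# Unpaired example (Sections 4.2 and 4.3).
n1, n2 = 45, 55
xbar1, xbar2 = (204.4, 556.6), (130.0, 355.0)
S1 = [[13825, 23823], [23823, 73107]]
S2 = [[8632, 19616], [19616, 55964]]
d = (xbar1[0] - xbar2[0], xbar1[1] - xbar2[1])

# Large-sample covariance estimate: S1/n1 + S2/n2.
S_large = combine(S1, 1 / n1, S2, 1 / n2)
direction, t2_large = solve_quad_2x2(d, S_large)

# Small-sample estimate: (1/n1 + 1/n2) * S_pooled.
df = n1 + n2 - 2
S_pooled = combine(S1, (n1 - 1) / df, S2, (n2 - 1) / df)
scale = 1 / n1 + 1 / n2
S_small = [[scale * v for v in row] for row in S_pooled]
_, t2_small = solve_quad_2x2(d, S_small)

print("paired T^2:      ", round(t2_paired, 2))
print("large-sample T^2:", round(t2_large, 2))
print("pooled T^2:      ", round(t2_small, 2))
print("direction S^-1 D:", [round(v, 3) for v in direction])
```

Run as written, the three statistics come out near 13.6, 15.66 and 16.07, and the direction vector rounds to (.041, .063). Each is then compared with the corresponding critical value quoted in the notes (9.47, 5.99 and 6.26). The parallel-profile statistic P1 has the same form, with C applied to the mean difference and C S C' in place of S, but it is 3-dimensional for four questions, so it would need a general linear solver rather than this 2x2 helper.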