Stat 300C: Final Presentation
Correlation and Large-Scale Simultaneous Significance Testing, Bradley Efron, 2007, JASA
Leonid Pekelis
June 03, 2011

Main Points
- Correlation between test statistics can have varied effects on multiple hypothesis testing procedures, making it harder to trust FDR procedures that do not account for correlation.
- Under some assumptions, one can formalize a model that describes how correlations propagate to false discovery estimates.
- There is some evidence that this model is actually how the world works (at least for microarrays).
- It is straightforward to adjust FDR procedures to account for such correlations.

Effect of Correlations
1. Breast Cancer study (BC) compared gene activity between groups of patients observed to have one of two different genetic mutations known to increase breast cancer risk, "BRCA1" or "BRCA2", Hedenfalk et al. (2001)
   - 7 BRCA1, 8 BRCA2, 15 patients total
   - N = 3225 genes measured
2. HIV study, van't Wout et al. (2003)
   - 4 HIV positive, 4 HIV negative controls
   - N = 7680 genes per microarray

Ensemble Distribution
\[ z_i = \Phi^{-1}(G_0(t_i)) \sim N(0, 1), \qquad i = 1, 2, \ldots, N \]
\[ z_i^{BC} \sim N(-0.09,\ 1.55^2), \qquad z_i^{HIV} \sim N(-0.11,\ 0.75^2) \]

Outline of the talk
1. Count vector model
   1.1 Covariance of count vector under correlation
2. Poisson process model for counts
3. Numerical examples of model's accuracy
4. Conditional FDR estimates
5. Numerical simulation comparing conditional to traditional FDR
6. Data example, NBA

Counts Model
$K = 82$ bins of width $\Delta = 0.1$ from $-4.1$ to $4.1$, $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$
Count vector $y$, $y_k = \#\{z_i \text{ in } k\text{th bin}\}$
\[ \pi_k(i) = P(z_i \in \mathcal{Z}_k), \qquad \pi_{k\cdot} = \sum_{i=1}^{N} \pi_k(i)/N \doteq \Delta\,\phi(z[k]) \]
\[ \gamma_{kl}(i,j) = P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l), \qquad \gamma_{kl\cdot} = \frac{\sum_{i \neq j} \gamma_{kl}(i,j)}{N(N-1)} \]
\[ E(y) = N\pi, \qquad \mathrm{Cov}(y) = C_0 + C_1 \]
\[ C_0 = N\bigl(\mathrm{diag}(\pi) - \pi\pi'\bigr), \qquad C_1 = N(N-1)\,\mathrm{diag}(\pi)\,\delta\,\mathrm{diag}(\pi), \qquad \delta_{kl} = \frac{\gamma_{kl\cdot}}{\pi_{k\cdot}\,\pi_{l\cdot}} - 1 \]

Counts Model
Further assume bivariate normality, $\mathrm{Cov}(z_i, z_j) = \rho_{ij}$.
\[ \gamma_{kl}(i,j) = \int_{\mathcal{Z}_k}\int_{\mathcal{Z}_l} \psi_2(z_i, z_j, \rho_{ij})\,dz \doteq \frac{\Delta^2}{2\pi\sqrt{1-\rho_{ij}^2}} \exp\left\{ -\frac{1}{2}\,\frac{z[k]^2 - 2\rho_{ij}\,z[k]\,z[l] + z[l]^2}{1-\rho_{ij}^2} \right\} \]
\[ \delta_{kl} + 1 \doteq \frac{\sum_{i \neq j} P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l)}{\sum_i P(z_i \in \mathcal{Z}_k)\,\sum_j P(z_j \in \mathcal{Z}_l)} \doteq \int_{-1}^{1} \frac{1}{\sqrt{1-\rho^2}} \exp\left\{ -\frac{\rho\,\bigl(\rho\,z[k]^2 - 2\,z[k]\,z[l] + \rho\,z[l]^2\bigr)}{2(1-\rho^2)} \right\} g(\rho)\,d\rho = \int_{-1}^{1} R_{kl}(\rho)\,g(\rho)\,d\rho \]

Counts Model
Suppose $\rho \sim (0, \alpha^2)$, $\alpha^2 = \int_{-1}^{1} \rho^2 g(\rho)\,d\rho$; then a 2nd order Taylor approximation of $R_{kl}(\rho)$ around $\rho = 0$ gives
\[ \delta \doteq \alpha^2 q q', \qquad q_k = (z[k]^2 - 1)/\sqrt{2}. \]
Putting the previous results together (Theorem 1):
\[ \mathrm{Cov}(y) \doteq N\bigl(\mathrm{diag}(\pi) - \pi\pi'\bigr) + \frac{N(N-1)}{2}\,\alpha^2\, w w' \]
\[ w_k = \Delta\, w(z[k]), \qquad w(z) = \phi''(z) = \phi(z)(z^2 - 1) \]

Poisson Model
Suppose $y \mid u \sim \mathrm{Po}(u)$, $u \sim (v, \Gamma)$; will need $N \sim \mathrm{Po}(N_0)$.
Simplifies
\[ \mathrm{Cov}(y) \doteq N\,\mathrm{diag}(\pi) + \frac{N^2}{2}\,\alpha^2\, w w'. \]
Match with $y \sim (v, \mathrm{diag}(v) + \Gamma)$:
\[ \Rightarrow\quad y \sim \mathrm{Po}\Bigl(N\pi + A\,\frac{N}{\sqrt{2}}\,w\Bigr), \qquad A \sim (0, \alpha^2) \]

Numerical Examples, α = 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50 (one slide per value of α)

Numerical Examples
α       C1       Cnorm    Cpois
0.05    0.9958   0.1007   0.1074
0.10    0.9925   0.2776   0.2790
0.15    0.9828   0.4582   0.4563
0.20    0.9657   0.5962   0.5931
0.25    0.9291   0.6794   0.6765
0.30    0.8679   0.7059   0.7036
0.35    0.8085   0.6996   0.6978
0.40    0.7758   0.6938   0.6923
0.45    0.7748   0.7043   0.7028
0.50    0.8081   0.7390   0.7374
Table: Proportion of total variance explained by first eigenvector, as a function of α.

Conditional FDR
Given $A$, can approximate
\[ u = N\pi + A\,\frac{N}{\sqrt{2}}\,w = N\Delta f_A(z[k]), \qquad f_A(z) = \phi(z)\bigl(1 + A\,q(z)\bigr). \]
Matching moments, can approximate $u_k \doteq N\,\frac{1}{\sigma_A}\,\psi(x/\sigma_A)$, with $\sigma_A^2 = 1 + \sqrt{2}A$.
I took the 2nd term in the Edgeworth expansion,
\[ f_A(x) \doteq \frac{1}{\sigma_A}\,\psi(x/\sigma_A)\left(1 + \frac{\mu_4 - 3\sigma^4}{24\sigma^4}\,H_4(x)\right). \]

Conditional FDR
- Use a GLM to fit the distribution of $y_k \sim \mathrm{Po}\bigl(e^{\beta_0 + \beta_1 z[k] + \beta_2 z[k]^2}\bigr)$ for $k \in K_0$.
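The log-quadratic Poisson fit above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the slides: the helper names (`solve3`, `fit_log_quadratic_poisson`, `sigma_from_beta2`) are hypothetical, the central window is assumed to be the bins with midpoints in [-1, 1], and the fit uses plain Newton iterations rather than a library GLM routine. The spread is read off from the fact that a normal density with standard deviation σ has a log-density that is quadratic in z with coefficient -1/(2σ²). The demo counts are idealized (noise-free) expected counts, not real data.

```python
import math

def solve3(a, b):
    """Solve a 3x3 linear system a x = b (Gaussian elimination, partial pivoting)."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x

def fit_log_quadratic_poisson(z, y, iters=50, tol=1e-10):
    """Newton iterations for the Poisson model log E[y_k] = b0 + b1*z_k + b2*z_k^2."""
    beta = [math.log(sum(y) / len(y)), 0.0, 0.0]  # start from a flat fit
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        hess = [[0.0] * 3 for _ in range(3)]
        for zk, yk in zip(z, y):
            x = (1.0, zk, zk * zk)
            mu = math.exp(sum(b * xi for b, xi in zip(beta, x)))
            for i in range(3):
                grad[i] += (yk - mu) * x[i]          # score
                for j in range(3):
                    hess[i][j] += mu * x[i] * x[j]   # Fisher information
        step = solve3(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def sigma_from_beta2(beta2):
    """Spread implied by the quadratic coefficient: sigma = (-2*beta2)^(-1/2)."""
    if beta2 >= 0:
        raise ValueError("beta2 must be negative for a normal-shaped fit")
    return (-2.0 * beta2) ** -0.5

# Idealized demo (assumed values): expected counts from N(0, 2^2) in central bins.
delta = 0.1
centers = [-0.95 + delta * k for k in range(20)]   # bin midpoints in [-1, 1]
n_total, sigma_true = 1535, 2.0
counts = [n_total * delta * math.exp(-0.5 * (z / sigma_true) ** 2)
          / (sigma_true * math.sqrt(2.0 * math.pi)) for z in centers]
beta = fit_log_quadratic_poisson(centers, counts)
sigma_hat = sigma_from_beta2(beta[2])
```

With real binned counts the same call applies, but the estimate then carries sampling noise, and a wider or narrower central window is a judgment call.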
- Using the normal approximation for $u_k$, with $p_0$ the proportion of nulls, gives $E(y_k) = p_0 u_k$, hence $\hat\sigma_A = (-2\hat\beta_2)^{-1/2}$.
- Estimate $p_0$ by $\hat p_0 = \hat P_0 / P_0(\hat\sigma_A)$, where $P_0(\sigma) = 2\Phi(x_0; \sigma) - 1$ and $\hat P_0 = Y_0/N$.
- $\mathrm{Fdr}(x \mid \hat\sigma_A) = N\,\hat p_0\,\bar\Phi(x; \hat\sigma_A)/T(x)$

Simulation

Data Example: NBA
1. What professional basketball players can really be called exceptional?
2. Data from http://www.databasebasketball.com/
3. 1946-2009, stats on every player, each year, ≈ 22,000 entries
4. Will focus on ppm = (points scored in season) / (minutes played in season)
5. Idea: get z-value for each player, apply BH procedure to determine non-null players
6. Can hypothesise there is some correlation between players' ppm scores.
7. Cleaned data (year > 1950, minutes ≥ 10)

Data Example: NBA
- Detrend: year effect, shot clock (1954), 3 pointer (1979), center
- Aggregate years by players, keep only careers ≥ 5 years
- Gives N = 1535 players
- Calculate $t_k = \dfrac{\sum_{i=1}^{c_k} \mathrm{ppm}_i / c_k}{SE}$, where $c_k$ is career length
- Convert to z values, $z_k = \Phi^{-1}\bigl(T_{c_k - 1}(t_k)\bigr)$

Data Example: NBA
Max = 6.74 (Kareem Abdul-Jabbar '69-'89), Min = -6.43 (E.C. Coleman '94-'00)
Wilt Chamberlain ('59-'72) = 3.31, Michael Jordan ('84-'02) = 6.49

Data Example: NBA
- Naive BH(.10) procedure gives 891 rejections
- Est. correlation from central spread Poisson glm, $z_{null} \sim N(0, 2^2)$
- Trying BH(.10) with correlated null gives 1 rejection
- Third approach: estimate $\hat p_0 = \hat P_0 / P_0(1.92) \approx 0.588$, $\hat P_0 = Y_0(1)/N$
- Conditional Fdr estimates: Fdr(naive|2) = .347, Fdr(cor|2) = 0.673
- Both > .10!
- $x^* = \arg\max\{x : \mathrm{Fdr}(x \mid 2) \le 0.10\}$, gives 36 rejections
- Actually used $x^* = \arg\min_x \mathrm{Fdr}(x \mid 2)$, since $\min_x \mathrm{Fdr}(x \mid 2) = .121 > .10$.
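The conditional Fdr formula and the arg-min fallback used in the NBA example can be sketched as below. This is a minimal illustration, not the code behind the slides. Several pieces are assumptions: T(x) is taken to be the number of z-values at or above x (right tail only), Y_0 is taken to be the count of z-values with |z| ≤ x_0, x_0 defaults to 1, and all function names are hypothetical.

```python
import math

def normal_cdf(x, sigma=1.0):
    """Phi(x; sigma): CDF of N(0, sigma^2)."""
    return 0.5 * math.erfc(-x / (sigma * math.sqrt(2.0)))

def normal_tail(x, sigma=1.0):
    """Phi-bar(x; sigma) = 1 - Phi(x; sigma), computed stably via erfc."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2.0)))

def estimate_p0(z, sigma, x0=1.0):
    """p0_hat = P0_hat / P0(sigma), with P0(sigma) = 2*Phi(x0; sigma) - 1.

    Assumption: Y0 counts the z-values in the central window |z| <= x0.
    """
    y0 = sum(1 for zi in z if abs(zi) <= x0)
    return (y0 / len(z)) / (2.0 * normal_cdf(x0, sigma) - 1.0)

def conditional_fdr(z, x, sigma, p0):
    """Fdr(x | sigma) = N * p0 * Phi-bar(x; sigma) / T(x).

    Assumption: T(x) = #{z_i >= x}. Returns None when nothing exceeds x.
    """
    t = sum(1 for zi in z if zi >= x)
    if t == 0:
        return None
    return len(z) * p0 * normal_tail(x, sigma) / t

def argmin_fdr(z, grid, sigma, p0):
    """Grid point with the smallest estimated Fdr (the fallback rule)."""
    best_x, best_fdr = None, math.inf
    for x in grid:
        fdr = conditional_fdr(z, x, sigma, p0)
        if fdr is not None and fdr < best_fdr:
            best_x, best_fdr = x, fdr
    return best_x, best_fdr
```

A typical call would pass the observed z-values, the spread estimated from the central Poisson fit, and a grid of candidate cutoffs; widening sigma raises the estimated Fdr at any fixed cutoff, which is the effect the example illustrates.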
Data Example: NBA
Theoretical Null Dist $N(0, 1)$, Correlated Null Dist $N(0, 2^2)$

Data Example: NBA (Best Players)
 [1] "Kareem , Abdul-jabbar" "Tim , Duncan"          "Shaquille , O'neal"
 [4] "Michael , Jordan"      "Karl , Malone"         "Julius , Erving"
 [7] "Walter , Davis"        "Glenn , Robinson"      "Jerry , West"
[10] "Dominique , Wilkins"   "Tim , Thomas"          "Calvin , Murphy"
[13] "Bob , Pettit"          "Eddie , Johnson"       "Sam , Cassell"
[16] "James , Worthy"        "George , Gervin"       "John , Drew"
[19] "Allen , Iverson"       "Dan , Issel"

Data Example: NBA (Worst Players)
 [1] "Charles , Jones"    "Tree , Rollins"  "Ben , Wallace"
 [4] "Nate , Mcmillan"    "Greg , Kite"     "Manute , Bol"
 [7] "Harvey , Catchings" "Paul , Mokeski"  "Don , Buse"
[10] "Adonal , Foyle"     "Kurt , Rambis"   "Bo , Outlaw"
[13] "Matt , Guokas"      "Bruce , Bowen"   "George , Johnson"
[16] "Chris , Dudley"