Stat 300C: Final Presentation

Correlation and Large-Scale Simultaneous
Significance Testing, Bradley Efron, 2007, JASA
Leonid Pekelis
June 03, 2011
Main Points
- Correlation between test statistics can have varied effects on
  multiple hypothesis testing procedures, making it harder to
  trust FDR procedures that don’t account for correlation.
- Under some assumptions, one can formalize a model that
  describes how correlations propagate to false discovery
  estimates.
- There is some evidence that this model describes how the
  world actually works (at least for microarrays).
- It is straightforward to adjust FDR procedures to account for
  such correlations.
Effect of Correlations
1. Breast Cancer study (BC): compared gene activity between groups of
   patients observed to have one of two different genetic
   mutations known to increase breast cancer risk, “BRCA1” or
   “BRCA2”, Hedenfalk et al. (2001)
   - 7 BRCA1, 8 BRCA2, 15 patients total
   - N = 3225 genes measured
2. HIV study, van’t Wout et al. (2003)
   - 4 HIV positive, 4 HIV negative controls
   - N = 7680 genes per microarray
Ensemble Distribution
$$z_i = \Phi^{-1}(G_0(t_i)) \sim N(0, 1), \qquad i = 1, 2, \ldots, N$$
$$z_i^{BC} \sim N(-0.09,\ 1.55^2), \qquad z_i^{HIV} \sim N(-0.11,\ 0.75^2)$$
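The transformation $z_i = \Phi^{-1}(G_0(t_i))$ can be sketched in a few lines. This is an illustration only, not code from the talk or the paper: it assumes $G_0$ is a Student t cdf (13 degrees of freedom would correspond to a two-sample comparison of 15 patients), and the helper names `t_pdf`, `t_cdf` and `z_value` are hypothetical. The t cdf is computed by Simpson's rule so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """Student t cdf: Simpson's rule on [0, |t|], then symmetry (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if t > 0 else 0.5 - half


def z_value(t, df):
    """z = Phi^{-1}(G0(t)), with G0 assumed to be the t cdf with df degrees of freedom."""
    p = t_cdf(t, df)
    p = min(max(p, 1e-15), 1 - 1e-15)  # keep inv_cdf away from 0 and 1
    return NormalDist().inv_cdf(p)


# Example: a t statistic of 2.0 on 13 degrees of freedom (assumed null).
print(z_value(2.0, 13))
```

Because the t distribution has heavier tails than the normal, the resulting z-value is smaller in magnitude than the original t statistic.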
Outline of the talk
1. Count vector model
1.1 Covariance of count vector under correlation
2. Poisson process model for counts
3. Numerical examples of model’s accuracy
4. Conditional FDR estimates
5. Numerical simulation comparing conditional to traditional
FDR
6. Data example, NBA
Counts Model
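K = 82 bins of width $\Delta = 0.1$ from $-4.1$ to $4.1$, $Z = \bigcup_{k=1}^{K} Z_k$
Count vector $y$, $y_k = \#\{z_i \text{ in } k\text{th bin}\}$
$$\pi_k(i) = P(z_i \in Z_k), \qquad \pi_{k\cdot} = \sum_{i=1}^{N} \pi_k(i)/N \doteq \Delta\,\phi(z[k])$$
$$\gamma_{kl}(i,j) = P(z_i \in Z_k \cap z_j \in Z_l), \qquad \gamma_{kl\cdot} = \sum_{i \neq j} \frac{\gamma_{kl}(i,j)}{N(N-1)}$$
$$E(y) = N\pi, \qquad \mathrm{Cov}(y) = C_0 + C_1$$
$$C_0 = N(\mathrm{diag}(\pi) - \pi\pi'), \qquad C_1 = N(N-1)\,\mathrm{diag}(\pi)\,\delta\,\mathrm{diag}(\pi), \qquad \delta_{kl} = \frac{\gamma_{kl\cdot}}{\pi_{k\cdot}\,\pi_{l\cdot}} - 1$$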
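The count vector and the null bin probabilities can be illustrated as follows. This is a sketch under assumptions: $z[k]$ is taken to be the midpoint of bin $k$, the z-values are simulated as independent N(0, 1) draws purely for demonstration, and the names `count_vector` and `null_bin_probs` are hypothetical.

```python
import math
import random
from statistics import NormalDist

K, DELTA, LO = 82, 0.1, -4.1
# Assumed convention: z[k] is the midpoint of the k-th bin.
CENTERS = [LO + DELTA * (k + 0.5) for k in range(K)]


def count_vector(z):
    """y_k = number of z-values falling in bin k; values outside [-4.1, 4.1) are dropped."""
    y = [0] * K
    for v in z:
        k = math.floor((v - LO) / DELTA)
        if 0 <= k < K:
            y[k] += 1
    return y


def null_bin_probs():
    """pi_k approximated by Delta * phi(z[k]) under the N(0, 1) null."""
    phi = NormalDist().pdf
    return [DELTA * phi(c) for c in CENTERS]


random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(3225)]  # simulated, independent z-values
y = count_vector(z)
pi = null_bin_probs()
expected = [len(z) * p for p in pi]  # E(y) = N * pi
print(sum(y), round(sum(pi), 5))
```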
Counts Model
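Further assume bivariate normality, $\mathrm{Cov}(z_i, z_j) = \rho_{ij}$.
$$\gamma_{kl}(i,j) = \int_{Z_k}\int_{Z_l} \psi_2(z_i, z_j, \rho_{ij})\,dz \doteq \frac{\Delta^2}{2\pi\sqrt{1-\rho_{ij}^2}} \exp\left\{-\frac{1}{2}\,\frac{z[k]^2 - 2\rho_{ij}\,z[k]\,z[l] + z[l]^2}{1-\rho_{ij}^2}\right\}$$
$$\delta_{kl} + 1 = \frac{\gamma_{kl\cdot}}{\pi_{k\cdot}\,\pi_{l\cdot}} = \frac{\frac{1}{N(N-1)}\sum_{i \neq j} P(z_i \in Z_k \cap z_j \in Z_l)}{\frac{1}{N}\sum_i P(z_i \in Z_k)\;\frac{1}{N}\sum_j P(z_j \in Z_l)}$$
$$\doteq \int_{-1}^{1} \frac{1}{\sqrt{1-\rho^2}} \exp\left\{-\frac{\rho}{2(1-\rho^2)}\left(\rho\,z[k]^2 - 2\,z[k]\,z[l] + \rho\,z[l]^2\right)\right\} g(\rho)\,d\rho = \int_{-1}^{1} R_{kl}(\rho)\,g(\rho)\,d\rho$$
where $g(\rho)$ is the distribution of the $\rho_{ij}$ over pairs $i \neq j$.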
Counts Model
Suppose $\rho \sim (0, \alpha^2)$, $\alpha^2 = \int_{-1}^{1} \rho^2 g(\rho)\,d\rho$;
then a 2nd order Taylor approximation of $R_{kl}(\rho)$ around $\rho = 0$
gives
$$\delta \doteq \alpha^2 q q', \qquad q_k = (z[k]^2 - 1)/\sqrt{2}.$$
Putting the previous results together (Theorem 1):
$$\mathrm{Cov}(y) \doteq N(\mathrm{diag}(\pi) - \pi\pi') + \frac{N(N-1)}{2}\,\alpha^2\, w w'$$
$$w_k = \Delta\, w(z[k]), \qquad w(z) = \phi''(z) = \phi(z)(z^2 - 1)$$
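The Taylor step can be checked numerically. The sketch below takes $R_{kl}(\rho)$ to be the ratio of the bivariate normal density (unit variances, correlation $\rho$) to the product of the two standard normal marginals, and, as an assumption made only for this check, uses a two-point distribution $g$ that puts mass 1/2 on each of $\rho = \pm\alpha$, so that the mean is 0 and the variance is $\alpha^2$. Function names are hypothetical.

```python
import math


def density_ratio(x, y, rho):
    """Bivariate normal density with correlation rho, divided by phi(x) * phi(y)."""
    s = 1.0 - rho * rho
    expo = -rho * (rho * x * x - 2.0 * x * y + rho * y * y) / (2.0 * s)
    return math.exp(expo) / math.sqrt(s)


def q(x):
    """q(x) = (x^2 - 1) / sqrt(2)."""
    return (x * x - 1.0) / math.sqrt(2.0)


def delta_exact(x, y, alpha):
    """delta under the assumed two-point g: rho = +alpha or -alpha, each with probability 1/2."""
    return 0.5 * (density_ratio(x, y, alpha) + density_ratio(x, y, -alpha)) - 1.0


def delta_taylor(x, y, alpha):
    """Second-order approximation: alpha^2 * q(x) * q(y)."""
    return alpha ** 2 * q(x) * q(y)


x, y, alpha = 1.5, -0.5, 0.05
print(delta_exact(x, y, alpha), delta_taylor(x, y, alpha))
```

For small $\alpha$ the two values agree closely; the gap grows with $\alpha$, since the omitted terms are of order $\alpha^4$ for a symmetric $g$.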
Poisson Model
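Suppose $y \mid u \sim \mathrm{Po}(u)$, $u \sim (v, \Gamma)$; will need $N \sim \mathrm{Po}(N_0)$.
Simplifies $\mathrm{Cov}(y) \doteq N\,\mathrm{diag}(\pi) + \frac{N^2}{2}\,\alpha^2\, w w'$. Match with
$y \sim (v, \mathrm{diag}(v) + \Gamma) \Rightarrow$
$$y \sim \mathrm{Po}\left(N\pi + A\,\frac{N}{\sqrt{2}}\,w\right), \qquad A \sim (0, \alpha^2)$$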
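A small simulation sketch of this Poisson model follows. Several choices here are assumptions for illustration: bin midpoints are used for $z[k]$, $N = 3000$ and $A = 0.2$ are arbitrary values, intensities are clamped at zero because the approximation can go negative in the far tails when $A < 0$, and the Poisson sampler is Knuth's multiplication method (adequate for the intensities involved here). Names are hypothetical.

```python
import math
import random
from statistics import NormalDist

K, DELTA, LO = 82, 0.1, -4.1
CENTERS = [LO + DELTA * (k + 0.5) for k in range(K)]  # assumed bin midpoints z[k]


def intensity(n, a):
    """u_k = N*pi_k + A*(N/sqrt(2))*w_k, with pi_k = Delta*phi(z[k]), w_k = Delta*phi''(z[k])."""
    phi = NormalDist().pdf
    u = []
    for c in CENTERS:
        pi_k = DELTA * phi(c)
        w_k = DELTA * phi(c) * (c * c - 1.0)
        # Clamp at zero: the approximation can turn negative in the tails for A < 0.
        u.append(max(n * pi_k + a * n * w_k / math.sqrt(2.0), 0.0))
    return u


def poisson_draw(lam, rng):
    """Knuth's multiplication method; fine for lam up to a few hundred."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


rng = random.Random(1)
A = 0.2  # one fixed value of A; in the model A has mean 0 and variance alpha^2
u = intensity(3000, A)
y = [poisson_draw(lam, rng) for lam in u]
print(sum(y), round(sum(u), 2))
```

Since $w$ sums to approximately zero, a positive $A$ leaves the expected total count almost unchanged while moving mass from the centre of the histogram to the tails.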
Numerical Examples, α = 0.05, 0.10, ..., 0.50
[Figures: one plot for each value of α, in steps of 0.05]
Numerical Examples

α      C1      Cnorm   Cpois
0.05   0.9958  0.1007  0.1074
0.10   0.9925  0.2776  0.2790
0.15   0.9828  0.4582  0.4563
0.20   0.9657  0.5962  0.5931
0.25   0.9291  0.6794  0.6765
0.30   0.8679  0.7059  0.7036
0.35   0.8085  0.6996  0.6978
0.40   0.7758  0.6938  0.6923
0.45   0.7748  0.7043  0.7028
0.50   0.8081  0.7390  0.7374

Table: Proportion of total variance explained by first eigenvector, as a
function of α.
Conditional FDR
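Given A, can approximate
$$u_k \doteq N\pi_k + A\,\frac{N}{\sqrt{2}}\,w_k = N\Delta f_A(z[k]), \qquad f_A(z) = \phi(z)(1 + A\,q(z)), \quad q(z) = (z^2 - 1)/\sqrt{2}.$$
Matching moments, can approximate $u_k \doteq N\Delta\,\frac{1}{\sigma_A}\,\psi(z[k]/\sigma_A)$, with
$$\sigma_A^2 = 1 + \sqrt{2}\,A.$$
I took 2nd term in Edgeworth expansion,
$$f_A(x) \doteq \frac{1}{\sigma_A}\,\psi(x/\sigma_A)\left[1 + \frac{\mu_4 - 3\sigma^4}{24\sigma^4}\,H_4(x)\right].$$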
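The moment-matching variance $\sigma_A^2 = 1 + \sqrt{2}A$ follows from the second moment of $f_A$. A short derivation, using only $E(z^2) = 1$ and $E(z^4) = 3$ for a standard normal $z$, and $q(z) = (z^2 - 1)/\sqrt{2}$ as defined for the counts model:

```latex
\begin{aligned}
\int f_A(z)\,dz &= \int \phi(z)\,dz + \frac{A}{\sqrt{2}}\int (z^2 - 1)\,\phi(z)\,dz = 1 + \frac{A}{\sqrt{2}}(1 - 1) = 1, \\
\int z\,f_A(z)\,dz &= 0 \quad \text{(odd integrand)}, \\
\sigma_A^2 = \int z^2 f_A(z)\,dz &= \int z^2\phi(z)\,dz + \frac{A}{\sqrt{2}}\int (z^4 - z^2)\,\phi(z)\,dz
  = 1 + \frac{A}{\sqrt{2}}(3 - 1) = 1 + \sqrt{2}\,A .
\end{aligned}
```

So $f_A$ integrates to one, has mean zero, and its variance exceeds 1 exactly when $A > 0$.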
Conditional FDR
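Use GLM to fit distribution of $y_k \sim \mathrm{Po}\left(e^{\beta_0 + \beta_1 z[k] + \beta_2 z[k]^2}\right)$ for
$k \in K_0$.
Using the normal approximation, with $p_0$ the proportion of nulls, gives
$E(y_k) = p_0 u_k$, hence
$$\hat\sigma_A = (-2\hat\beta_2)^{-.5}$$
Estimate $p_0$ by $\hat p_0 = \hat P_0 / P_0(\hat\sigma_A)$, $P_0(\sigma) = 2\Phi(x_0; \sigma) - 1$,
$\hat P_0 = Y_0/N$
$$\mathrm{Fdr}(x \mid \hat\sigma_A) = N\,\hat p_0\,\bar\Phi(x; \hat\sigma_A)/T(x)$$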
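The GLM step can be sketched without any statistics package. The code below fits the quadratic log-linear Poisson model by iteratively reweighted least squares (IRLS) and converts $\hat\beta_2$ into $\hat\sigma_A = (-2\hat\beta_2)^{-1/2}$. It is an illustration under assumptions: the central set $K_0$ is taken to be the bins with $|z[k]| \le 1.5$, the "counts" are noise-free expected counts from an N(0, 1.5²) null so that the answer is known in advance, and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for j in range(c, 4):
                m[r][j] -= f * m[c][j]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][j] * x[j] for j in range(r + 1, 3))) / m[r][r]
    return x


def fit_poisson_quadratic(zs, ys, iters=25):
    """IRLS for y_k ~ Po(exp(b0 + b1*z + b2*z^2)); returns [b0, b1, b2]."""
    rows = [[1.0, z, z * z] for z in zs]
    mu = [y + 0.1 for y in ys]  # standard starting values
    beta = [0.0, 0.0, 0.0]
    for _ in range(iters):
        xtwx = [[0.0] * 3 for _ in range(3)]
        xtwz = [0.0] * 3
        for x, y, m in zip(rows, ys, mu):
            work = math.log(m) + (y - m) / m  # working response, weight m
            for i in range(3):
                xtwz[i] += m * x[i] * work
                for j in range(3):
                    xtwx[i][j] += m * x[i] * x[j]
        beta = solve3(xtwx, xtwz)
        mu = [math.exp(sum(b * v for b, v in zip(beta, x))) for x in rows]
    return beta


def sigma_hat(beta2):
    """sigma_A estimate from the quadratic coefficient: (-2 * beta2) ** -0.5."""
    return (-2.0 * beta2) ** -0.5


# Noise-free demonstration: expected counts under an assumed N(0, 1.5^2) null.
N, DELTA, SIGMA = 3000, 0.1, 1.5
centers = [-4.05 + DELTA * k for k in range(82)]
central = [c for c in centers if abs(c) <= 1.5]  # assumed choice of K0
phi = NormalDist().pdf
counts = [N * DELTA * phi(c / SIGMA) / SIGMA for c in central]
beta = fit_poisson_quadratic(central, counts)
print(sigma_hat(beta[2]))
```

With real, noisy counts the same fit applies; only the central bins are used so that non-null cases in the tails have little influence on $\hat\beta_2$.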
Simulation
Data Example: NBA
1. What professional basketball players can really be called
exceptional?
2. Data from http://www.databasebasketball.com/
3. 1946-2009, stats on every player, each year, ≈ 22, 000 entries
4. Will focus on ppm = (points scored in season) / (minutes played in season)
5. Idea: get z-value for each player, apply BH procedure to
determine non-null players
6. Can hypothesise there is some correlation between players’
ppm scores.
7. Cleaned data (year > 1950, minutes ≥ 10)
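The idea in item 5 (z-values followed by the BH procedure) can be sketched as below. This is a generic Benjamini-Hochberg step-up implementation, not the code used for the talk; the `null_sd` parameter is an assumption added here so that the same function can be run against a theoretical N(0, 1) null or a wider null, and the z-values in the example are made up.

```python
from statistics import NormalDist


def bh_rejections(z, q=0.10, null_sd=1.0):
    """Benjamini-Hochberg at level q with two-sided p-values from a N(0, null_sd^2) null.

    Returns the sorted indices of the rejected hypotheses.
    """
    null = NormalDist(0.0, null_sd)
    pvals = [2.0 * (1.0 - null.cdf(abs(v))) for v in z]
    order = sorted(range(len(z)), key=lambda i: pvals[i])
    n = len(z)
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / n:
            k_max = rank  # largest rank meeting the step-up condition
    return sorted(order[:k_max])


# Made-up z-values: two large ones among eight unremarkable ones.
z = [0.1, -0.3, 5.0, 0.7, -4.5, 1.0, 0.2, -0.8, 0.4, -0.1]
print(bh_rejections(z, q=0.10, null_sd=1.0))
print(bh_rejections(z, q=0.10, null_sd=2.0))
```

Widening the null shrinks every p-value's significance, so the number of rejections can drop sharply even though the data are unchanged.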
Data Example: NBA
- Detrend: year effect, shot clock (1954), 3 pointer (1979),
  center
- Aggregate years by players, keep only careers ≥ 5 years
- Gives N = 1535 players
- Calculate $t_k = \dfrac{\sum_{i=1}^{c_k} ppm_i / c_k}{SE}$, $c_k$ - career length
- Convert to z values, $z_k = \Phi^{-1}(T_{c_k - 1}(t_k))$
Data Example: NBA
Max = 6.74 (Kareem , Abdul-jabbar ’69-’89), Min = -6.43 (E.c.
Coleman ’94-’00)
Wilt Chamberlain (’59-’72) = 3.31, Michael Jordan (’84-’02) =
6.49
Data Example: NBA
- Naive BH(.10) procedure gives 891 rejections,
- Est. correlation from central spread Poisson glm,
  $z_{null} \sim N(0, 2^2)$
- Trying BH(.10) with correlated null gives 1 rejection,
- Third approach: estimate $\hat p_0 = \hat P_0 / P_0(1.92) \approx 0.588$,
  $\hat P_0 = Y_0(1)/N$
- Conditional Fdr estimates Fdr(naive|2) = .347,
  Fdr(cor|2) = 0.673
- Both > .10!
- $x^* = \arg\max\{x : \mathrm{Fdr}(x \mid 2) \le 0.10\}$, gives 36 rejections
- Actually used $x^* = \arg\min \mathrm{Fdr}(x \mid 2)$, since
  $\min \mathrm{Fdr}(x \mid 2) = .121 > .10$.
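The estimates $\hat p_0 = \hat P_0 / P_0(\sigma)$ and $\mathrm{Fdr}(x \mid \sigma) = N \hat p_0 \bar\Phi(x; \sigma)/T(x)$ can be sketched as follows. The slides do not spell out $Y_0$ and $T(x)$, so this sketch assumes $Y_0$ is the number of z-values in $[-x_0, x_0]$, $T(x)$ is the number of z-values at or above $x$, and $\Phi(\cdot\,; \sigma)$ is the N(0, σ²) cdf. The data are synthetic (exact quantiles of an all-null N(0, 2²) population), not the NBA data, and the function names are hypothetical.

```python
from statistics import NormalDist


def p0_hat(z, x0, sigma):
    """p0 estimate: (Y0 / N) / P0(sigma), with P0(sigma) = 2 * Phi(x0; sigma) - 1."""
    y0 = sum(1 for v in z if abs(v) <= x0)  # assumed meaning of Y0
    big_p0 = 2.0 * NormalDist(0.0, sigma).cdf(x0) - 1.0
    return (y0 / len(z)) / big_p0


def conditional_fdr(z, x, sigma, p0):
    """Fdr(x | sigma) = N * p0 * (1 - Phi(x; sigma)) / T(x), with T(x) = #{z_i >= x}."""
    t_x = sum(1 for v in z if v >= x)  # assumed meaning of T(x)
    if t_x == 0:
        return float("nan")
    tail = 1.0 - NormalDist(0.0, sigma).cdf(x)
    return len(z) * p0 * tail / t_x


# Synthetic all-null data: exact quantiles of N(0, 2^2).
n = 1000
wide_null = NormalDist(0.0, 2.0)
z = [wide_null.inv_cdf((i + 0.5) / n) for i in range(n)]

p0 = p0_hat(z, x0=1.0, sigma=2.0)
print(p0, conditional_fdr(z, 3.0, 2.0, p0))
print(conditional_fdr(z, 3.0, 1.0, 1.0))  # same cutoff judged against N(0, 1)
```

On these all-null data the estimate under the wide null is close to 1 (every case past the cutoff is a false discovery), while the same cutoff judged against N(0, 1) reports a small Fdr, which is the kind of overconfidence the conditional estimate is meant to correct.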
Data Example: NBA
[Figure legend: Theoretical Null Dist $N(0, 1)$, Correlated Null Dist $N(0, 2^2)$]
Data Example: NBA (Best Players)
[1] "Kareem , Abdul-jabbar" "Tim , Duncan" "Shaquille , O'neal"
[4] "Michael , Jordan" "Karl , Malone" "Julius , Erving"
[7] "Walter , Davis" "Glenn , Robinson" "Jerry , West"
[10] "Dominique , Wilkins" "Tim , Thomas" "Calvin , Murphy"
[13] "Bob , Pettit" "Eddie , Johnson" "Sam , Cassell"
[16] "James , Worthy" "George , Gervin" "John , Drew"
[19] "Allen , Iverson" "Dan , Issel"
Data Example: NBA (Worst Players)
[1] "Charles , Jones" "Tree , Rollins" "Ben , Wallace"
[4] "Nate , Mcmillan" "Greg , Kite" "Manute , Bol"
[7] "Harvey , Catchings" "Paul , Mokeski" "Don , Buse"
[10] "Adonal , Foyle" "Kurt , Rambis" "Bo , Outlaw"
[13] "Matt , Guokas" "Bruce , Bowen" "George , Johnson"
[16] "Chris , Dudley"