Information content

2013-07-31
Lecture 5
Information Content
Jurg Ott
http://lab.rockefeller.edu/ott/
http://www.jurgott.org/PekingU/
Curvature
• Curves of log(likelihood) with maxima at θ = 0.10
• At the maximum, the curve with n = 10 is flatter than that with n = 20
Measures of “Informativeness”
• Fisher information (Fisher RA (1925) Theory of statistical estimation. Proc Camb Phil Soc 22:700-725)
o I(θ) measures the precision with which the parameter θ can be estimated. Based on curvature (2nd derivative) of loge[L(θ)] curve.
o S(θ) = loge[L(θ)] also called support for parameter or hypothesis
o 1/I(θ) = variance of estimate
o Binomial situation: I(θ) = n/[θ(1 – θ)]
o “Information” = technical statistical term
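As a rough numerical check of the curvature idea, the sketch below estimates the negative second derivative of the support S(θ) by central differences and compares it with n/[θ(1 – θ)] at the maximum. The function names and the example counts (k = 1 of n = 10, k = 2 of n = 20, both with maximum at θ = 0.10) are illustrative assumptions, not taken from the slides.

```python
import math

def support(theta, n, k):
    """Support S(theta) = ln L(theta) for k recombinants in n (constant omitted)."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def curvature_information(theta, n, k, h=1e-4):
    """Negative second derivative of the support, by central differences."""
    second = (support(theta + h, n, k)
              - 2.0 * support(theta, n, k)
              + support(theta - h, n, k)) / h**2
    return -second

def binomial_information(theta, n):
    """Closed form from the slide: I(theta) = n / [theta (1 - theta)]."""
    return n / (theta * (1.0 - theta))

# Assumed example data: both samples have their maximum at theta = 0.10
for n, k in [(10, 1), (20, 2)]:
    theta_hat = k / n
    print(n, round(curvature_information(theta_hat, n, k), 2),
          round(binomial_information(theta_hat, n), 2))
```

Doubling n doubles the information, which is the "flatter curve for n = 10" statement in numbers.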
“Informativeness” in Linkage Analysis
• Less interested in precision of θ estimate than in amount of
information for detecting linkage
• Suitable quantity = expected lod score, ELOD
• Consider n = 3 offspring of phase-known mating; k = 0…3 recombinants
• Lod score:
$$Z(\theta) = \log_{10}\frac{\theta^{k}(1-\theta)^{n-k}}{(1/2)^{n}} = n\log_{10}(2) + k\log_{10}(\theta) + (n-k)\log_{10}(1-\theta)$$
• Distinguish r = true value of recombination fraction (only known to
statistician) and θ = formal parameter in expression for likelihood
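A minimal sketch of the lod score for the n = 3 example, tabulating one curve per value of k. The function name `lod` and the chosen θ grid are assumptions for illustration; the handling of θ = 0 is a convention (0^0 = 1) rather than something stated on the slide.

```python
import math

def lod(theta, n, k):
    """Lod score Z(theta) = log10[theta^k (1 - theta)^(n-k) / (1/2)^n], 0 <= theta < 1."""
    if theta == 0.0:
        # theta^k equals 1 for k = 0 and 0 otherwise
        return n * math.log10(2.0) if k == 0 else float("-inf")
    return (n * math.log10(2.0)
            + k * math.log10(theta)
            + (n - k) * math.log10(1.0 - theta))

n = 3
for k in range(n + 1):
    row = [round(lod(t, n, k), 3) for t in (0.0, 0.1, 0.3, 0.5)]
    print(k, row)
```

Every curve passes through 0 at θ = 0.5, and only the k = 0 curve stays finite at θ = 0.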
Expected Lod Score, ELOD
• Each of k = 0…3 furnishes a lod score curve.
• At given θ, the weighted average over these curves leads to the expected (average) lod score curve, weights = prob. of occurrence (depends on r).
• Often only its max. is called the ELOD
• Max. occurs at true value of r (Rao CR [1973] Linear statistical inference and its applications. New York: Wiley)
Computing ELOD
• Probability of k recombinants (usually unknown):
$$P(k; r) = \binom{n}{k} r^{k}(1-r)^{n-k},$$
• so that for ELOD curve:
$$E[Z(\theta)] = \sum_{k=0}^{n} P(k; r)\, Z_k(\theta)$$
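The weighted average can be sketched directly: binomial weights P(k; r) times the lod curve for each k, summed over k. The sketch below also searches a θ grid to illustrate that the curve peaks at θ = r. The values n = 3 and r = 0.10, the grid spacing, and all function names are assumptions for illustration.

```python
import math

def lod(theta, n, k):
    """Lod score for k recombinants among n phase-known offspring, 0 < theta < 1."""
    return (n * math.log10(2.0)
            + k * math.log10(theta)
            + (n - k) * math.log10(1.0 - theta))

def prob_k(k, n, r):
    """Binomial probability of k recombinants when the true recombination fraction is r."""
    return math.comb(n, k) * r**k * (1.0 - r)**(n - k)

def elod_curve(theta, n, r):
    """Expected lod score at theta: sum over k of P(k; r) * Z_k(theta)."""
    return sum(prob_k(k, n, r) * lod(theta, n, k) for k in range(n + 1))

n, r = 3, 0.10                               # assumed example values
grid = [i / 1000 for i in range(1, 500)]     # 0.001 ... 0.499
best = max(grid, key=lambda t: elod_curve(t, n, r))
print(best, round(elod_curve(best, n, r), 4))
```

Because the lod is linear in k, the sum collapses to n[log(2) + r log(θ) + (1 – r) log(1 – θ)], which is an easy cross-check on the loop.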
Comparing ELODs
• Compare data types or strategies in terms of ELOD
o Thompson EA (1975) Ann Hum Genet 39, 173
o Ott (1985) Analysis of Human Genetic Linkage, 1st edition
• ELOD (MELOD) approximately corresponds to a power of 50%
Phase‐known Double Backcross
• Consider two loci: (1) with alleles A and B, (2) with alleles 1 and 2.
• Mating A1/B2 × A1/A1, allows counting recombinants and nonrecombinants.
• Curve E[Z(r; θ)]: r log(2θ) + (1 – r) log[2(1 – θ)].
• ELOD curve is maximum for θ = r, find:
o For r → 0, ELOD = 0.30 per offspring.
o On average, to obtain lod score of 3, need to count 10 offspring.
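Substituting θ = r into the curve above gives the per-offspring ELOD, from which the number of offspring needed for an expected lod of 3 follows by division. The sketch below is illustrative only: the function name and the r values tabulated are assumptions, and the r = 0 case uses the limit 0 × log(0) = 0.

```python
import math

def elod_per_offspring(r):
    """E[Z(r; theta)] at theta = r for one phase-known double-backcross offspring."""
    if r == 0.0:
        return math.log10(2.0)  # limit as r -> 0
    return r * math.log10(2.0 * r) + (1.0 - r) * math.log10(2.0 * (1.0 - r))

for r in (0.0, 0.01, 0.05, 0.10, 0.20):
    e = elod_per_offspring(r)
    # offspring needed, on average, for an expected lod score of 3
    print(r, round(e, 3), math.ceil(3.0 / e))
```

The count of 10 holds only in the limit of very tight linkage; the required number grows quickly as r increases.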
Phase‐unknown Double Backcross with 2 offspring
• Mating (A1/B2 or A2/B1) × A1/A1, cannot count recombinants and nonrecombinants.
• Assume equal phase probabilities
• Single offspring, A1/A1: Prob = 0.5r × 0.5 + 0.5(1 – r) × 0.5 = 0.25 → no information on linkage
• Two offspring, for example, A1/A1 each: Probability is f1 = [r² + (1 – r)²]/8, both are recombinants given one phase and nonrecombinants given the other phase.
• For given phase, if one is a recombinant and the other a nonrecombinant, then probability is f2 = r(1 – r)/4.
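The probabilities above can be reproduced by averaging over the two equally likely phases. In this sketch each offspring is represented by the haplotype it receives from the doubly heterozygous parent (the other parent always contributes A1); that representation, the names, and r = 0.1 are assumptions made for illustration.

```python
from itertools import product

# For each phase of the heterozygous parent: which gametes are nonrecombinant
PHASES = {"A1/B2": ("A1", "B2"), "A2/B1": ("A2", "B1")}
GAMETES = ("A1", "B2", "A2", "B1")

def gamete_prob(gamete, phase, r):
    """Probability of a gamete given the phase: (1 - r)/2 if nonrecombinant, r/2 otherwise."""
    return (1.0 - r) / 2.0 if gamete in PHASES[phase] else r / 2.0

def single_prob(gamete, r):
    """One offspring, equal phase probabilities."""
    return sum(0.5 * gamete_prob(gamete, ph, r) for ph in PHASES)

def pair_prob(g1, g2, r):
    """Two offspring share the parental phase, so they are not independent."""
    return sum(0.5 * gamete_prob(g1, ph, r) * gamete_prob(g2, ph, r) for ph in PHASES)

r = 0.1                                   # assumed example value
f1 = (r**2 + (1.0 - r)**2) / 8.0
f2 = r * (1.0 - r) / 4.0
for g1, g2 in product(GAMETES, repeat=2):
    p = pair_prob(g1, g2, r)
    print(g1, g2, "f1" if abs(p - f1) < 1e-12 else "f2", round(p, 5))
```

Of the 16 ordered pairs, 8 have probability f1 and 8 have probability f2, so only two classes of pairs need to be distinguished.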
Phase‐unknown Double Backcross with 2 offspring
• List of all possible offspring genotypes:
• Define an offspring as Type 1 if nonrecombinant given some phase, and Type 2 if recombinant given this phase.
• Combine classes with equal probability
List of Type 1 and 2 Offspring
• Offspring are not independent!
• Again, combine classes with equal probabilities
Two Classes of Offspring Pairs
• ELOD (θ = r):
E[Z(r)] = 2r(1 – r) log[4r(1 – r)] + [r² + (1 – r)²] log[2r² + 2(1 – r)²].
Comparing ELODs
• DB = phase-known double backcross, 1 offspring
• 2 kids = phase-unknown double backcross, 1 offspring pair
• “Lose 1 offspring for not knowing phase”
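To put numbers on this comparison, the sketch below evaluates the ELOD of one phase-known offspring (DB) and of one phase-unknown offspring pair (2 kids), using the two formulas from the slides. Function names and the r values are assumptions; the r = 0 cases use the limit 0 × log(0) = 0.

```python
import math

def elod_db(r):
    """ELOD of one phase-known double-backcross offspring (theta = r)."""
    if r == 0.0:
        return math.log10(2.0)
    return r * math.log10(2.0 * r) + (1.0 - r) * math.log10(2.0 * (1.0 - r))

def elod_pair(r):
    """ELOD of one phase-unknown offspring pair (theta = r)."""
    discordant = 2.0 * r * (1.0 - r)        # one Type 1 and one Type 2 offspring
    concordant = r**2 + (1.0 - r)**2        # both offspring of the same type
    total = concordant * math.log10(2.0 * concordant)
    if discordant > 0.0:
        total += discordant * math.log10(2.0 * discordant)
    return total

for r in (0.0, 0.05, 0.10, 0.20):
    print(r, round(elod_db(r), 4), round(elod_pair(r), 4))
```

In this sketch the pair carries the same ELOD as a single phase-known offspring as r → 0, which is the sense of "losing one offspring"; for larger r the pair falls below that.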
Conditional ELODs
• In practice, want to find ELOD or maximum possible lod score in family with given disease phenotypes.
• Conditional calculations complicated. Computer programs: SIMLINK (Boehnke 1986), SLINK (Weeks 1990)
• Off the cuff considerations often done assuming full penetrance and no recombination:
o “In a genome screen, it will always be possible to find a marker very close to the disease locus. Therefore, we should take as our measure of informativeness the maximum lod score (at θ = 0) obtained under the assumption of no recombinants.”
Conditional ELODs: Simple Family
• Reasoning: Genotype of II-3 will reveal phase in mother (“one child sets phase”)
• Remaining 3 children can be scored as nonrecombinants, each providing a lod score of log(2) = 0.301
• Thus, max. lod score possible = 3 × 0.301 = 0.903.
• Correct for full penetrance, no phenocopies, mother always doubly heterozygous.
Conditional ELODs: Larger Family
• Dominant disease
• With full penetrance and no recombinants, expect up to 7 non-recombinants, lod score of 7 × 0.301 = 2.107.
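The back-of-the-envelope rule used for both families is simply "number of scorable nonrecombinant meioses times log10(2)". A small sketch, with the helper name `max_lod_no_recombinants` being an assumption for illustration:

```python
import math

def max_lod_no_recombinants(n_scorable):
    """Maximum lod score at theta = 0 when all n scorable meioses are nonrecombinant."""
    return n_scorable * math.log10(2.0)

print(round(max_lod_no_recombinants(3), 3))  # simple family: 3 children after one sets phase
print(round(max_lod_no_recombinants(7), 3))  # larger family: up to 7 nonrecombinants
```

The rule presumes full penetrance, no phenocopies, and fully informative parents, so it gives an upper bound rather than an expectation.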
Conditional ELODs: Incomplete Penetrance
• Assume true r = 0.01
• Compute lod scores with SLINK program (2000 replicates)
n = number of equally frequent marker alleles
Cost of Incomplete Penetrance: ELODs
• Measure difference in ELOD for full versus incomplete penetrance
• Consider phase-known mating, D1/d2 × d1/d1, D > d (D dominant), and phenotypes A (affected) and U (unaffected).
• Obtain ELOD as a weighted sum of log of last line, weights = P(x)
Cost of Incomplete Penetrance: Results
• With tight linkage, f = 0.90 reduces ELOD to about 76% (0.758) of its value at f = 1
• May compensate for this loss by increasing sample size by factor 1/0.758 = 1.32, that is, need roughly 30% more data.
• Analogously, compensating for f = 0.50 requires about 3 times the sample size
Ratio of ELOD relative to ELOD with full penetrance
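The ratio can be sketched for the phase-known mating D1/d2 × d1/d1 by enumerating the four offspring outcomes x = (phenotype, marker allele received from the D1/d2 parent). The cell probabilities below are a reconstruction under stated assumptions, not copied from the slide's table: D/d offspring are affected with probability f, d/d offspring are never affected (no phenocopies), and r = 0 stands in for tight linkage.

```python
import math

def cell_probs(theta, f):
    """P(phenotype, marker allele) for recombination fraction theta and penetrance f."""
    return {
        ("A", 1): f * (1.0 - theta) / 2.0,                        # D1, nonrecombinant
        ("U", 1): (1.0 - f) * (1.0 - theta) / 2.0 + theta / 2.0,  # D1 unaffected, or d1
        ("A", 2): f * theta / 2.0,                                # D2, recombinant
        ("U", 2): (1.0 - f) * theta / 2.0 + (1.0 - theta) / 2.0,  # D2 unaffected, or d2
    }

def elod(r, f):
    """Weighted sum of log10 likelihood ratios (theta = r versus 0.5), weights = P(x)."""
    true_p, null_p = cell_probs(r, f), cell_probs(0.5, f)
    return sum(p * math.log10(p / null_p[x]) for x, p in true_p.items() if p > 0.0)

def elod_ratio(r, f):
    """ELOD with penetrance f relative to ELOD with full penetrance."""
    return elod(r, f) / elod(r, 1.0)

for f in (1.0, 0.9, 0.5):
    ratio = elod_ratio(0.0, f)
    print(f, round(ratio, 3), round(1.0 / ratio, 2))  # ratio and sample-size factor
```

Under these assumptions the sketch gives a ratio close to 0.758 for f = 0.90 and a sample-size factor a little above 3 for f = 0.50.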
Inconsistencies
Ott (1978) Cytogenet Cell Genet 22, 702‐705
• Maximum likelihood estimates (MLEs) are generally consistent
(asymptotically unbiased, variance → 0).
• Ascertainment bias may lead to inconsistency
• Consider two codominant loci and phase-known mating A1/B2 × A1/B2
(common in CEPH families)
• Offspring genotypes:
Inconsistencies
• Mating A1/B2 × A1/B2, offspring phenotypes and their probabilities of occurrence; i = 3 is ambiguous
Inconsistencies
• Mating A1/B2 × A1/B2, collecting offspring phenotypes with same probabilities into one class each leads to:
Ascertainment Strategies
Book chapter 11
• Use all data. ELOD:
$$E_0[Z(\theta)] = \sum_{l=1}^{4} p_l(r)\, Z_l(\theta)$$
• Computing dE0/dθ (r is a constant!), setting it to 0, and solving this equation leads to θ = r.
• Analyze only unambiguous data (l = 1…3): ELOD curve has a maximum at θ̃ = r + b, b = r(1 – r)(1 – 2r)/[1 + 2r(1 – r)] ≈ r for small r. Recombination fraction estimate is twice its true value!
• For small r, most l = 4 individuals will be nonrecombinants: Assume that all of them are nonrecombinant → negative asymptotic bias.
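The first two strategies can be compared numerically. The class probabilities in this sketch are a reconstruction from the mating A1/B2 × A1/B2 (classes with 0, 1, and 2 recombinant gametes, plus the ambiguous double heterozygote as l = 4) and are an assumption, since the slide's table is not reproduced here. The expected log-likelihood differs from the ELOD only by a term that does not depend on θ, so both have the same maximizer.

```python
import math

def class_probs(t):
    """Assumed intercross classes: l = 1..3 unambiguous (0, 1, 2 recombinants), l = 4 ambiguous."""
    return [(1.0 - t)**2 / 2.0,
            2.0 * t * (1.0 - t),
            t**2 / 2.0,
            (t**2 + (1.0 - t)**2) / 2.0]

def expected_log_lik(theta, r, classes):
    """Sum over the analyzed classes of p_l(r) * log10 p_l(theta)."""
    p_true, p_model = class_probs(r), class_probs(theta)
    return sum(p_true[l] * math.log10(p_model[l]) for l in classes)

def argmax_theta(r, classes, steps=5000):
    """Grid search over 0 < theta < 0.5 with spacing 1/(2 * steps)."""
    grid = [i / (2.0 * steps) for i in range(1, steps)]
    return max(grid, key=lambda th: expected_log_lik(th, r, classes))

def bias(r):
    """Asymptotic bias b from the slide for the unambiguous-only analysis."""
    return r * (1.0 - r) * (1.0 - 2.0 * r) / (1.0 + 2.0 * r * (1.0 - r))

r = 0.05                                         # assumed example value
all_data = argmax_theta(r, classes=(0, 1, 2, 3))
unambiguous = argmax_theta(r, classes=(0, 1, 2))
print(all_data, unambiguous, round(r + bias(r), 4))
```

With all four classes the maximum sits at θ = r; dropping the ambiguous class moves it to about r + b, close to 2r when r is small.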
Ascertainment Strategies: Summary
Observed Information
• Maximum lod score, Z(θ̂).
• Testing for linkage: X² = 2 × ln(10) × Z(θ̂) approximately follows a chi-square (1 df) distribution in the absence of linkage (α = 0.0001).
• Lander & Kruglyak (1995) Nat Genet 11, 244:
• Equivalent number k of recombinants and (n – k) of nonrecombinants: Assume lod score curve was obtained on all phase-known data (book ch. 4).
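The conversion from a maximum lod score to an approximate p-value can be sketched with the standard library alone, using the identity that the upper tail of a chi-square (1 df) variable at x equals erfc(√(x/2)). Halving the tail for the one-sided alternative θ < 0.5 is an assumption of this sketch, as are the function names.

```python
import math

def lod_to_chisq(z_max):
    """X^2 = 2 * ln(10) * Z(theta_hat)."""
    return 2.0 * math.log(10.0) * z_max

def chisq1_upper_tail(x):
    """P(chi-square with 1 df > x), via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))

x2 = lod_to_chisq(3.0)
p_two_sided = chisq1_upper_tail(x2)
p_one_sided = p_two_sided / 2.0   # assumed one-sided convention (theta < 0.5 only)
print(round(x2, 2), p_two_sided, p_one_sided)
```

For a lod score of 3 the one-sided value comes out near the α = 0.0001 quoted above.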
Equivalent Numbers of Observations
• Edwards’ method:
• Two-point method:
• This furnishes k = Z(0.0010) – Z(0.0001) and n = 3.322 [Z(0.0010) + 3k]
• These numbers have no statistical meaning; may be useful to judge
information content of data.
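A round-trip check of the two-point expressions for k and n: generate a lod curve from known phase-known counts, then recover the counts from Z(0.0010) and Z(0.0001). The counts n = 20, k = 2 and the function names are assumptions for illustration; the recovery is approximate because log(1 – θ) is treated as 0 at these small θ values.

```python
import math

def lod(theta, n, k):
    """Lod score for k recombinants among n phase-known meioses, 0 < theta < 1."""
    return (n * math.log10(2.0)
            + k * math.log10(theta)
            + (n - k) * math.log10(1.0 - theta))

def equivalent_numbers(z_001, z_0001):
    """k = Z(0.0010) - Z(0.0001); n = 3.322 * [Z(0.0010) + 3k], with 3.322 = 1/log10(2)."""
    k = z_001 - z_0001
    n = 3.322 * (z_001 + 3.0 * k)
    return n, k

n_true, k_true = 20, 2          # assumed example counts
n_eq, k_eq = equivalent_numbers(lod(0.001, n_true, k_true),
                                lod(0.0001, n_true, k_true))
print(round(n_eq, 2), round(k_eq, 3))
```

The same two evaluations can be applied to any lod curve, which is what makes the result usable as a rough gauge of information content.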
Exact Tests for Linkage
• Binomial situation: Count k recombinants in n meioses.
• Result significant when Z(θ̂) > 3
• r = 0.5: significance level, α (r < 0.5: power)
[Figure: significance level α plotted against n]
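The exact calculation can be sketched by enumerating k = 0…n: compute the maximum lod score for each k, then add up the binomial probabilities of those outcomes that exceed 3. Restricting the estimate to θ̂ ≤ 0.5 and the function names are assumptions of this sketch.

```python
import math

def max_lod(n, k):
    """Maximum lod score for k recombinants in n meioses, with theta_hat = min(k/n, 0.5)."""
    theta_hat = min(k / n, 0.5)
    if theta_hat == 0.5:
        return 0.0
    z = n * math.log10(2.0) + (n - k) * math.log10(1.0 - theta_hat)
    if k > 0:
        z += k * math.log10(theta_hat)
    return z

def prob_significant(n, r, threshold=3.0):
    """P(Z(theta_hat) > threshold) when the true recombination fraction is r."""
    return sum(math.comb(n, k) * r**k * (1.0 - r)**(n - k)
               for k in range(n + 1) if max_lod(n, k) > threshold)

for n in (10, 15, 20, 30):
    alpha = prob_significant(n, 0.5)    # r = 0.5: significance level
    power = prob_significant(n, 0.05)   # r < 0.5: power
    print(n, alpha, round(power, 3))
```

Because k is discrete, the attained α changes in jumps as n grows rather than smoothly.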
Power and Significance Level
• Binomial situation: Count k recombinants in n meioses.
• Result significant when Z(θ̂) > 3
• n = number of families (1 offspring each when phase known)
• m = number of offspring in phase unknown families
[Figure: horizontal axis r]