Carolin Kosiol

Model Comparison and
Hypothesis Testing
Carolin Kosiol
<[email protected]>
Institute of Population Genetics
Vetmeduni Vienna
Carolin Kosiol
Spezielle Bioinformatik in der Biomedizin
Likelihood and Maximum Likelihood
Recall that for model M, parameters $\theta$ and data D, the likelihood is
$L(M, \theta \mid D) = \Pr(D \mid M, \theta)$
and that maximum likelihood inference consists of finding the value $\hat{\theta}$ that makes the likelihood as large as possible:
find $\hat{\theta}$ so that $L(M, \hat{\theta} \mid D) \ge L(M, \theta \mid D)$ for all other $\theta$
Likelihood and hypotheses
Now suppose we have some hypothesis H regarding the model and the parameters. Similarly, the likelihood of the hypothesis is:
$L(H \mid D) = \Pr(D \mid H)$
Perhaps the hypothesis fully defines the likelihood (no free parameters), or perhaps there are some free parameters in the hypothesis, in which case we again maximize the likelihood to find the best value under the hypothesis.
“Biased and Unbiased dice”
hypothesis
Imagine we have a bag containing lots of dice and
we know that there are seven types in it: one type is
the normal, i.e. fair or “unbiased”, die…
Six types of dice are strongly biased: they
give one number with prob. 50% and the
other five numbers with prob. 10% each.
Imagine that you take one die from the
bag and throw it 8 times, obtaining:
The problem is: what kind of die did you roll?
“Biased and unbiased dice (cont.)”
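In general the likelihood of a hypothesis is defined as the probability of the data, assuming that hypothesis:
$\Pr(\text{data} \mid \text{hypothesis})$
The likelihood that this is an unbiased die is:
$L_u = \Pr(\text{rolls} \mid \text{unbiased die}) = (1/6)^8 \approx 0.6 \times 10^{-6}$
The likelihood that this is the die biased towards 6 is:
$L_6 = \Pr(\text{rolls} \mid \text{die biased towards 6}) = 0.1 \times 0.5 \times 0.1 \times 0.5 \times 0.5 \times 0.1 \times 0.1 \times 0.1 = 1.25 \times 10^{-6}$
Similarly,
$L_3 = 0.25 \times 10^{-6}$, $L_2 = 0.01 \times 10^{-6}$ and $L_1 = L_4 = L_5 = 0.05 \times 10^{-6}$
The type biased towards 6 has the largest likelihood, so it is our maximum likelihood estimate (MLE) for the type of die we rolled.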
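The dice comparison can be reproduced numerically. The following is a minimal Python sketch and not part of the original slides. The picture of the eight rolls is not reproduced in this transcript, so the face counts used here (one 1, no 2, two 3s, one 4, one 5, three 6s) are an assumption inferred from the likelihood values quoted above. The names `counts`, `likelihood` and `hypotheses` are illustrative.

```python
from math import prod

# Face counts for the 8 rolls. Assumption: inferred from the likelihood
# values quoted in the slide, because the image of the rolls is missing.
counts = {1: 1, 2: 0, 3: 2, 4: 1, 5: 1, 6: 3}


def likelihood(face_probs, counts):
    """Pr(rolls | die type), assuming independent rolls."""
    return prod(face_probs[face] ** n for face, n in counts.items())


# Seven hypotheses: the unbiased die plus one biased die per face.
hypotheses = {"unbiased": {face: 1 / 6 for face in range(1, 7)}}
for favoured in range(1, 7):
    hypotheses[f"biased towards {favoured}"] = {
        face: 0.5 if face == favoured else 0.1 for face in range(1, 7)
    }

likelihoods = {name: likelihood(p, counts) for name, p in hypotheses.items()}
mle = max(likelihoods, key=likelihoods.get)

for name, value in likelihoods.items():
    print(f"{name:18s} {value:.3e}")
print("maximum likelihood estimate:", mle)
```

Only the counts of each face matter here, not the order of the rolls, because the rolls are treated as independent.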
More examples: "Fair coin" hypothesis
We toss a coin 100 times, and observe 65 Heads and 35
Tails. Our hypothesis H0 is that each throw is
independent, with probability 0.5 of giving Heads.
What is the likelihood of this hypothesis?
$L(H_0 \mid D) = \Pr(D \mid H_0) = \binom{100}{65} \times 0.5^{65} \times 0.5^{35} = 0.000864$
or $\ln L(H_0) = \ln(0.000864) = -7.054$
"Possibly unfair coin" hypothesis
We toss a coin 100 times, and observe 65 Heads and 35 Tails. Our hypothesis H1 is that each throw is independent, with unknown probability p of giving Heads.
What is the likelihood of this hypothesis? Now we have a free parameter p. The maximum likelihood estimate $\hat{p}$ is exactly the proportion of Heads, i.e. 65/100.
$L(H_1 \mid D) = \Pr(D \mid H_1) = \binom{100}{65} \times 0.65^{65} \times 0.35^{35} = 0.0834$
or $\ln L(H_1) = \ln(0.0834) = -2.484$
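The two coin likelihoods can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_likelihood` is illustrative and not from the slides.

```python
from math import comb, log


def binomial_likelihood(heads, tosses, p):
    """Pr(heads out of tosses | probability p of Heads), independent tosses."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)


heads, tosses = 65, 100

# H0: fair coin, no free parameters.
L0 = binomial_likelihood(heads, tosses, 0.5)

# H1: one free parameter p, set to its maximum likelihood estimate.
p_hat = heads / tosses
L1 = binomial_likelihood(heads, tosses, p_hat)

print(f"L(H0) = {L0:.6f}, ln L(H0) = {log(L0):.3f}")
print(f"L(H1) = {L1:.6f}, ln L(H1) = {log(L1):.3f}")
```

Working with log-likelihoods rather than raw likelihoods is the usual choice, since likelihoods of realistic data sets quickly become too small to represent as floating-point numbers.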
Comparison of hypotheses
$L(H_0) = 0.000864$, $\ln L(H_0) = -7.054$
$L(H_1) = 0.0834$, $\ln L(H_1) = -2.484$
Obviously H1 fits better than H0, but is it significantly better?
Nested Hypotheses
Terminology:
Hypothesis H0 is nested within hypothesis H1 if forcing
a particular choice of the parameters of H1 makes it the
same as H0.
For coin tossing, forcing the unknown probability of Heads in H1 to equal 0.5 gives us exactly H0. Thus H0 is nested in H1.
Comparison of general hypotheses
Traditionally, statistical hypothesis testing compares a "null hypothesis" H0 with an "alternative hypothesis" H1. Usually H0 is nested in H1, and we treat H0 as valid unless the evidence in favour of H1 is much stronger.
Comparison of general hypotheses (cont.)
How large a value of $2\delta$ (twice the difference in log-likelihood between H1 and H0) is big enough? Traditionally, we say that if $2\delta$ is bigger than expected by chance in 95% (or 99%, or 99.9%) of cases when H0 is correct, then we favour H1 over H0.
Theorem:
Suppose H0 is nested in H1, and the two differ by d free parameters. Then, if H0 is correct, $2\delta$ has (asymptotically) a chi-square distribution with d degrees of freedom:
$2\delta = 2(\ln L(H_1) - \ln L(H_0)) \sim \chi^2_d$
If $2\delta$ is greater than the 95% (or 99%, or 99.9% …) point of the $\chi^2_d$ distribution, we reject H0 in favour of H1.
Chi-square distribution
Comparison of general hypotheses (cont.)
These statistical tests are called Likelihood Ratio Tests, or LRTs.
They are a very powerful class of statistical tests, with very broad applicability.
Comparison of coin hypotheses revisited:
$L(H_0) = 0.000864$, $\ln L(H_0) = -7.054$
$L(H_1) = 0.0834$, $\ln L(H_1) = -2.484$
$2\delta = 2(\ln L(H_1) - \ln L(H_0)) = 2 \times (-2.484 - (-7.054)) = 2 \times (-2.484 + 7.054) = 9.14$
H0 and H1 differ by 1 free parameter (the probability of observing Heads), so the number of degrees of freedom is d = 1.
We compare $2\delta = 9.14$ with the $\chi^2_1$ distribution, and observe a P-value < 0.005. H1 (probability of Heads not necessarily equal to 0.5) is preferred to H0 (fair coin).
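The likelihood ratio test for the coin can be carried out without statistical tables. The sketch below is an illustration using only the Python standard library. It relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper tail is `erfc(sqrt(x / 2))`. The function names are illustrative.

```python
from math import erfc, sqrt


def lrt_statistic(lnL0, lnL1):
    """Likelihood ratio test statistic: 2 * (ln L(H1) - ln L(H0))."""
    return 2.0 * (lnL1 - lnL0)


def chi2_sf_1df(x):
    """Upper tail P(X > x) of a chi-square distribution with 1 d.f."""
    return erfc(sqrt(x / 2.0))


stat = lrt_statistic(lnL0=-7.054, lnL1=-2.484)
p_value = chi2_sf_1df(stat)
print(f"LRT statistic = {stat:.2f}, P-value = {p_value:.4f}")
```

For other degrees of freedom a general chi-square tail function is needed, for example from a statistics library.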
Many sequence evolution models are nested
Jukes-Cantor (JC) vs Kimura 2-parameter (K2P)
• JC: 1 rate of change applies to all substitutions
• K2P: 2 rates of change, transitions different to transversions
H0: unknown tree relating the sequences, JC model of substitution
H1: unknown tree relating the sequences, K2P model of substitution
The difference in parameters is just the transition-transversion rate ratio, a single number. Fixing it equal to 1 in K2P gives us the JC model → nested models!
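The nesting of JC in K2P can be made concrete by writing out the rate matrix. The sketch below is an illustration, not taken from the slides. It assumes one common parameterisation in which transversions have rate 1 and transitions have rate kappa, and it leaves the matrix unnormalised. The value kappa = 4 is hypothetical.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_rate_matrix(kappa):
    """Unnormalised K2P rate matrix as nested dicts, Q[x][y] for bases x, y.

    Transversions have rate 1, transitions have rate kappa, and each
    diagonal entry makes its row sum to zero.
    """
    Q = {}
    for x in BASES:
        row = {y: (kappa if (x, y) in TRANSITIONS else 1.0)
               for y in BASES if y != x}
        row[x] = -sum(row.values())
        Q[x] = row
    return Q


jc_like = k2p_rate_matrix(kappa=1.0)  # kappa fixed to 1: every change equal
k2p = k2p_rate_matrix(kappa=4.0)      # hypothetical kappa, for illustration

off_diagonal = {jc_like[x][y] for x in BASES for y in BASES if x != y}
print("distinct off-diagonal rates with kappa = 1:", off_diagonal)
print("A->G vs A->C with kappa = 4:", k2p["A"]["G"], k2p["A"]["C"])
```

With kappa = 1 all twelve off-diagonal rates coincide, which is exactly the single-rate JC model.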
Kimura 2-parameter (K2P) vs Hasegawa, Kishino, Yano (HKY)
• K2P: 2 rates of change, transitions different to transversions
• HKY: 2 rates of change, transitions different to transversions AND 3 base frequencies free to vary
H0: unknown tree relating the sequences, K2P model of substitution
H1: unknown tree relating the sequences, HKY model of substitution
The difference in parameters is the 3 free base frequencies. Fixing all base frequencies equal to 1/4 in HKY gives us the K2P model → nested models!
Hasegawa, Kishino, Yano (HKY) vs general time reversible (GTR or REV) model
• HKY: 2 rates of change, transitions different to transversions AND 3 base frequencies free to vary
• GTR: 6 rates of change AND 3 base frequencies free to vary
H0: unknown tree relating the sequences, HKY model of substitution
H1: unknown tree relating the sequences, GTR model of substitution
The difference in parameters is 4 relative rates of change. Fixing them in the appropriate ratios in GTR gives us the HKY model → nested models!
GTR model vs. GTR + Gamma model
• GTR: 6 rates of change AND 3 base frequencies
• GTR+$\Gamma$: as above and a rate heterogeneity parameter $\alpha$
H0: GTR
H1: GTR+$\Gamma$
Difference: 1 parameter, $\alpha$. The models are nested: $\alpha \to \infty$ recovers GTR. But this is a special case, where the parameter tested in H1 must be at the limit of what is permitted in order to recover H0.
In this case we take a mixture of $\chi^2$ distributions for hypothesis testing.
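The boundary case can be sketched in code. The slides do not state the mixture weights; the sketch below assumes the commonly used 50:50 mixture of $\chi^2_0$ (a point mass at zero) and $\chi^2_1$ for a single parameter on the boundary. The statistic value 3.0 is hypothetical and the function names are illustrative.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Upper tail P(X > x) of a chi-square distribution with 1 d.f."""
    return erfc(sqrt(x / 2.0))


def boundary_lrt_p_value(stat):
    """P-value under a 50:50 mixture of chi2_0 (point mass at 0) and chi2_1.

    Assumption: equal mixture weights, as commonly used when one tested
    parameter sits on the boundary of its permitted range under H0.
    """
    if stat <= 0.0:
        return 1.0
    return 0.5 * chi2_sf_1df(stat)


stat = 3.0  # hypothetical LRT statistic for GTR vs GTR+Gamma
print("naive chi2_1 P-value:", chi2_sf_1df(stat))
print("mixture P-value:     ", boundary_lrt_p_value(stat))
```

Under this assumption the P-value is half the naive $\chi^2_1$ value, so the 5% critical value drops from about 3.84 to about 2.71.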
Summary: models for DNA
• Allow for transition/transversion bias (red)
• Allow for unequal base frequencies (yellow)
Substitution models for amino acids
4x4 DNA model
20x20 AA model: Dayhoff (PAM)
(Jukes & Cantor, 1969; Dayhoff et al., 1978; Kimura, 1980; Felsenstein, 1981 & 1984; Hasegawa, Kishino, & Yano, 1985)
Mechanistic and Empirical models
• Empirical models
summarise the substitution
patterns from large quantities
of data.
• Useful when patterns of
evolution are expected to be
similar between datasets.
• Mechanistic models
use parameters to model
factors of protein evolution.
• Estimates of parameters
provide insight about the
evolution of specific sequences.
Nucleotide models are often mechanistic and amino acid models are mostly empirical.
Examples of AA substitution models…
• Average process of evolution for globular proteins: PAM (1972, 1978), JTT (1992), WAG (2001), LG (2008) matrices.
• Functional models: chloroplast-derived amino acid replacement matrix (Adachi, 1996). Replacement matrices derived from mitochondrially-encoded proteins (Adachi and Hasegawa 2000, Yang 1998).
• Structural models: amino acid replacement matrices for alpha-helices, beta-sheets, turns and loops, with each category further classified by whether it is buried or exposed (e.g. Goldman et al., 1998, Overington et al., 1990).
+F method
• AA frequencies vary a lot between different data sets, and thus also between the database the substitution matrix was estimated from and the data set you want to analyse.
• The +F method allows you to replace the AA frequencies of the entire database with the AA frequencies of the specific data set analysed.
Substitution models for codons
4x4 DNA model
20x20 AA model: Dayhoff (PAM)
61x61 codon model: M0
The Universal Genetic Code
Codon sequence evolution
[Figure: aligned codon sequences (e.g. AGT ATC CGG ATT ...) changing along an axis of evolutionary time, annotated with the questions "Transitions or transversions?", "Codon frequencies" and "Synonymous or nonsynonymous?" (e.g. ATC, coding for I, changing to GTC, coding for V)]
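The codon model M0
$$
Q_{ij} =
\begin{cases}
0 & \text{if } i \to j \text{ requires more than 1 nucleotide substitution, or } j \text{ is a stop codon} \\
\pi_j & \text{if } i \to j \text{ is a synonymous transversion} \\
\pi_j \kappa & \text{if } i \to j \text{ is a synonymous transition} \\
\pi_j \omega & \text{if } i \to j \text{ is a nonsynonymous transversion} \\
\pi_j \kappa \omega & \text{if } i \to j \text{ is a nonsynonymous transition}
\end{cases}
$$
where
$\kappa$: transition/transversion rate ratio
$\pi_j$: equilibrium frequency of codon j
$\omega$: nonsynonymous/synonymous rate ratio
(Goldman & Yang, 1994; Yang et al., 2000)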
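The case distinction in M0 translates directly into code. The following is a minimal sketch, not the implementation used in any particular package. It builds the universal genetic code from the standard compact table in TCAG order. The parameter values in the usage lines (kappa = 2, omega = 0.5, equal codon frequencies of 1/61) are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Universal genetic code: codon -> amino acid ('*' marks a stop codon).
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}


def m0_rate(codon_i, codon_j, kappa, omega, pi_j):
    """Off-diagonal rate Q_ij of the M0 codon model (codon_i != codon_j)."""
    # Stop codons are not states of the 61x61 model, so their rate is 0.
    if "*" in (GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]):
        return 0.0
    diffs = [(x, y) for x, y in zip(codon_i, codon_j) if x != y]
    if len(diffs) != 1:  # more than one nucleotide substitution
        return 0.0
    x, y = diffs[0]
    rate = pi_j
    if (x in PURINES) == (y in PURINES):  # A<->G or C<->T: transition
        rate *= kappa
    if GENETIC_CODE[codon_i] != GENETIC_CODE[codon_j]:  # nonsynonymous
        rate *= omega
    return rate


kappa, omega, pi = 2.0, 0.5, 1 / 61  # hypothetical values
print(m0_rate("ATC", "GTC", kappa, omega, pi))  # I -> V, transition
print(m0_rate("CGA", "CGG", kappa, omega, pi))  # R -> R, transition
print(m0_rate("ATT", "ATA", kappa, omega, pi))  # I -> I, transversion
print(m0_rate("ATC", "GTA", kappa, omega, pi))  # two changes
```

A full rate matrix would call this function for all 61 x 61 pairs of sense codons and set each diagonal entry so that the row sums to zero.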
M0: 1 nt change at the 1st, 2nd or 3rd position of the codon
Detecting positive selection
$\omega < 1$ purifying selection
$\omega = 1$ neutral evolution
$\omega > 1$ positive selection
Variation of the nonsynonymous/synonymous rate ratio $\omega$ among sites:
[Figure: distribution of $\omega$ among sites under Model 0, Model 1a and Model 2a] (Bielawski, 2005)
Test for positive selection
Model 1a (2 classes of sites):
Each site evolves with $\omega_0 < 1$ or $\omega_1 = 1$
Model 2a (3 classes of sites):
Each site evolves with $\omega_0 < 1$, $\omega_1 = 1$ or $\omega_2 > 1$
$\omega < 1$ purifying selection
$\omega = 1$ neutral evolution
$\omega > 1$ positive selection
(Branch-site models, Yang & Nielsen, 1998)
Positive selection in six mammalian genomes
6 high-quality genomes of eutherian mammals
17489 orthologous genes: human / chimp / macaque / mouse / rat / dog
544 genes identified to be under positive selection (PSGs) using codon models
[Figure: phylogenetic tree of human, chimp, macaque, mouse, rat and dog; scale bar 0.05]
Kosiol et al., PLoS Genetics, 2008
Complement component C7
M1a: nearly neutral
$p_0 = 0.67$, ($p_1 = 1 - p_0 = 0.33$)
$\omega_0 = 0.05$, ($\omega_1 = 1$)
$\log L_1 = -6530.84$
M2a: selection
$p_0 = 0.68$, $p_1 = 0.31$, ($p_2 = 1 - p_0 - p_1 = 0.02$)
$\omega_0 = 0.06$, ($\omega_1 = 1$), $\omega_2 = 10.51$
$\log L_2 = -6520.02$
$2 \times (\log L_2 - \log L_1) = 2 \times (-6520.02 - (-6530.84)) = 2 \times 10.82 = 21.64$
$\chi^2$ (df = 2, 0.05) = 5.99; P-value = 2.01e-05
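The C7 test can be checked numerically. The sketch below is an illustration using only the standard library. It takes the lower log-likelihood, -6530.84, for the null model M1a and the higher, -6520.02, for M2a, as nesting requires and as the arithmetic on the slide uses. For 2 degrees of freedom the chi-square upper tail has the closed form exp(-x/2).

```python
from math import exp


def chi2_sf_2df(x):
    """Upper tail P(X > x) of a chi-square distribution with 2 d.f."""
    return exp(-x / 2.0)


lnL_M1a = -6530.84  # nearly neutral model (H0)
lnL_M2a = -6520.02  # selection model (H1)

stat = 2.0 * (lnL_M2a - lnL_M1a)
p_value = chi2_sf_2df(stat)
critical_5_percent = 5.99  # chi-square, df = 2, as quoted on the slide

print(f"LRT statistic = {stat:.2f}, P-value = {p_value:.2e}")
print("reject M1a at the 5% level:", stat > critical_5_percent)
```

The two extra parameters of M2a relative to M1a ($p_1$ and $\omega_2$) give the 2 degrees of freedom.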
Co-evolution in the complement pathway
[Figure: genes of the complement pathway, marked by significance level: P < 0.05, FDR < 0.05]
Kosiol et al., PLoS Genetics, 2008
What if the assumptions of the LR test are not satisfied?
Often the hypotheses do not meet the conditions of the theorem, or the sample size is not big enough for the asymptotic assumptions to be valid.
We then cannot use a $\chi^2$ distribution table.
Instead, use a simulation method to obtain the required distribution for significance testing.
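One way to obtain such a distribution is to simulate data sets under H0 and recompute the test statistic for each of them. The sketch below illustrates the idea on the coin-toss example (65 Heads in 100 tosses) rather than on sequence data. The number of replicates, the seed and the function names are arbitrary choices for illustration.

```python
import random
from math import log


def log_lik(heads, tosses, p):
    """Binomial log-likelihood, omitting the constant binomial coefficient."""
    tails = tosses - heads
    value = 0.0
    if heads:
        value += heads * log(p)
    if tails:
        value += tails * log(1.0 - p)
    return value


def lrt_statistic(heads, tosses, p0=0.5):
    """2 * (ln L(H1) - ln L(H0)), with H1 fitted at p_hat = heads / tosses."""
    p_hat = heads / tosses
    return 2.0 * (log_lik(heads, tosses, p_hat) - log_lik(heads, tosses, p0))


def simulated_p_value(heads, tosses, replicates=2000, seed=1):
    """Fraction of data sets simulated under H0 with a statistic at least
    as large as the observed one."""
    rng = random.Random(seed)
    observed = lrt_statistic(heads, tosses)
    hits = 0
    for _ in range(replicates):
        sim_heads = sum(rng.random() < 0.5 for _ in range(tosses))
        if lrt_statistic(sim_heads, tosses) >= observed - 1e-9:
            hits += 1
    return hits / replicates


p_value = simulated_p_value(heads=65, tosses=100)
print("simulated P-value:", p_value)
```

For sequence data the same scheme applies, with sequences simulated on the tree and model estimated under H0 and both hypotheses refitted to every simulated alignment.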
Models that are not nested
We can use the Akaike Information Criterion (AIC):
$\mathrm{AIC}(\text{model}) = 2k - 2 \ln L(\text{model})$
where k is the number of free parameters. The model with the smallest AIC value is considered to be the best.
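The AIC is straightforward to compute once the maximised log-likelihood and the number of free parameters are known. The sketch below reuses the coin-toss values from earlier in the lecture purely as an illustration, because those are the log-likelihoods in these slides whose parameter counts are stated. The AIC can be computed for any fitted model, nested or not. The names `aic`, `models` and `scores` are illustrative.

```python
def aic(ln_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * ln_likelihood


# Coin-toss hypotheses: (maximised ln L, number of free parameters).
models = {"H0 (fair coin)": (-7.054, 0), "H1 (free p)": (-2.484, 1)}

scores = {name: aic(lnL, k) for name, (lnL, k) in models.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:15s} AIC = {score:.3f}")
print("preferred model:", best)
```

Each extra parameter has to improve the log-likelihood by more than 1 unit to lower the AIC.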
Models that are not nested
The Akaike Information Criterion (AIC) is suitable for comparing empirical models of protein evolution.
For example:
JTT vs WAG
or
WAG+F vs LG+F
AIC and BIC
$\mathrm{AIC}(\text{model}) = 2k - 2 \ln L(\text{model})$
Prefer a model with a small AIC.
Both the LRT and the AIC favour parameter-rich models too often.
$\mathrm{BIC}(\text{model}) = k \ln(n) - 2 \ln L(\text{model})$
where n is the sample size (sequence length).
The BIC penalizes parameter-rich models more severely.
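The difference between the two criteria lies only in the penalty per parameter: 2 for the AIC and ln(n) for the BIC. The sketch below is an illustration. The sample sizes in the loop are hypothetical, and the final comparison reuses the coin-toss hypothesis H1 (ln L = -2.484, k = 1, n = 100 tosses) from earlier in the lecture.

```python
from math import log


def aic(ln_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * ln_likelihood


def bic(ln_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return k * log(n) - 2 * ln_likelihood


# Penalty per extra parameter for a few hypothetical sample sizes.
for n in (5, 100, 1000):
    print(f"n = {n:4d}: AIC penalty = 2.00, BIC penalty = {log(n):.2f}")

# Coin-toss hypothesis H1: one free parameter, 100 tosses.
lnL, k, n = -2.484, 1, 100
print(f"AIC = {aic(lnL, k):.3f}, BIC = {bic(lnL, k, n):.3f}")
```

Because ln(n) exceeds 2 as soon as n is 8 or more, the BIC is the stricter criterion for any realistic sequence length.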
Summary: Hypothesis Testing
ML gives a likelihood value to each tree, given that the sequences evolved according to that tree and a given model of substitution:
$L = \mathrm{Prob}(\text{data} \mid \text{tree}, \text{model}) = \mathrm{Prob}(\text{data} \mid \text{tree}, \theta)$
We find the model that maximises L, i.e. makes the data as probable as possible: "maximum likelihood".
The aim is to find a model which describes the data significantly better than another model.
Probabilistic models can be used to test hypotheses about the evolutionary process acting on a set of sequences.
Literature list
• Substitution models (review papers)
- Whelan S, Lio P, Goldman N, 2001. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in Genetics 17: 262-272.
- Huelsenbeck JP and Rannala B, 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276: 227-232.
- Anisimova M and Kosiol C, 2009. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26: 255-271.
• PAML
- http://abacus.gene.ucl.ac.uk/software/paml.html
- Yang Z, 2006. Computational Molecular Evolution. Oxford University Press.