Carolin Kosiol
Model Comparison and Hypothesis Testing
Carolin Kosiol
Institute of Population Genetics, Vetmeduni Vienna
Spezielle Bioinformatik in der Biomedizin

Likelihood and Maximum Likelihood
Recall that for model $M$, parameters $\theta$ and data $D$, the likelihood is
$L(M, \theta \mid D) = \Pr(D \mid M, \theta)$
and that maximum likelihood inference consists of finding the $\hat{\theta}$ that makes the likelihood as large as possible: find $\hat{\theta}$ so that
$L(M, \hat{\theta} \mid D) \ge L(M, \theta \mid D)$ for all other $\theta$.

Likelihood and hypotheses
Now suppose we have some hypothesis $H$ regarding the model and the parameters. Similarly, the likelihood of the hypothesis is:
$L(H \mid D) = \Pr(D \mid H)$
Perhaps the hypothesis fully defines the likelihood (no free parameters), or perhaps there are some free parameters in the hypothesis, in which case we again maximize the likelihood to find the best value under the hypothesis.

"Biased and unbiased dice" hypothesis
Imagine we have a bag containing lots of dice and we know that there are seven types in it. One type is the normal, i.e. fair or "unbiased", die. The other six types are strongly biased: each gives one number with probability 50% and the other five numbers with probability 10% each.
Imagine that you take one die from the bag and throw it 8 times, obtaining: [the eight rolls were shown as images on the slide]
The problem is: what kind of die did you roll?

"Biased and unbiased dice" (cont.)
In general, the likelihood of a hypothesis is defined as the probability of the data, assuming that hypothesis: Pr(data | hypothesis).
The likelihood that this is an unbiased die is:
$L_u = \Pr(\text{data} \mid \text{unbiased die}) = (1/6)^8 \approx 0.6 \times 10^{-6}$
The likelihood that this is a die biased towards 6 is:
$L_6 = \Pr(\text{data} \mid \text{die biased towards 6}) = 0.1 \times 0.5 \times 0.1 \times 0.5 \times 0.5 \times 0.1 \times 0.1 \times 0.1 = 1.25 \times 10^{-6}$
Similarly, $L_3 = 0.25 \times 10^{-6}$, $L_2 = 0.01 \times 10^{-6}$ and $L_1 = L_4 = L_5 = 0.05 \times 10^{-6}$.
The die biased towards 6 has the highest likelihood, so it is our maximum likelihood estimate (MLE) for the type of die we rolled.
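The dice comparison can be reproduced with a few lines of Python. This is only a sketch: the slide showed the rolls as images, so the `rolls` list below is a hypothetical sequence. Its face counts (three 6s, two 3s, one each of 1, 4 and 5) are inferred from the likelihood values quoted above, and the order of the non-6 faces is an assumption that does not affect the result. The function name `die_likelihood` is likewise made up for illustration.

```python
# Hypothetical roll order: the face counts are inferred from the quoted
# likelihoods; the positions of the faces 3, 1, 4 and 5 are assumed.
rolls = [3, 6, 1, 6, 6, 4, 3, 5]


def die_likelihood(rolls, favoured_face=None):
    """Pr(rolls | die type).

    favoured_face=None means the fair die (each face has probability 1/6).
    Otherwise the die shows favoured_face with probability 0.5 and each of
    the other five faces with probability 0.1.
    """
    likelihood = 1.0
    for face in rolls:
        if favoured_face is None:
            likelihood *= 1 / 6
        elif face == favoured_face:
            likelihood *= 0.5
        else:
            likelihood *= 0.1
    return likelihood


# Evaluate all seven hypotheses (one fair die, six biased types).
likelihoods = {"unbiased": die_likelihood(rolls)}
for face in range(1, 7):
    likelihoods[f"biased towards {face}"] = die_likelihood(rolls, face)

# The maximum likelihood estimate is the hypothesis with the largest value.
best_type = max(likelihoods, key=likelihoods.get)

for die_type, value in likelihoods.items():
    print(f"{die_type:18s} L = {value:.3e}")
print("Maximum likelihood die type:", best_type)
```

Because every hypothesis here has no free parameters, "maximizing" simply means picking the largest of seven numbers.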
More examples: "Fair coin" hypothesis
We toss a coin 100 times, and observe 65 Heads and 35 Tails. Our hypothesis H0 is that each throw is independent, with probability 0.5 of giving Heads. What is the likelihood of this hypothesis?
$L(H_0 \mid D) = \Pr(D \mid H_0) = \binom{100}{65} \times 0.5^{65} \times 0.5^{35} = 0.000864$
or $\ln L(H_0) = \ln(0.000864) = -7.054$

"Possibly unfair coin" hypothesis
We toss a coin 100 times, and observe 65 Heads and 35 Tails. Our hypothesis H1 is that each throw is independent, with unknown probability $p$ of giving Heads. What is the likelihood of this hypothesis?
Now we have a free parameter $p$. The maximum likelihood estimate $\hat{p}$ is exactly the proportion of Heads, i.e. 65/100.
$L(H_1 \mid D) = \Pr(D \mid H_1) = \binom{100}{65} \times 0.65^{65} \times 0.35^{35} = 0.0834$
or $\ln L(H_1) = \ln(0.0834) = -2.484$

Comparison of hypotheses
$L(H_0) = 0.000864$, $\ln L(H_0) = -7.054$
$L(H_1) = 0.0834$, $\ln L(H_1) = -2.484$
Obviously H1 is better than H0, but is it significantly better?

Nested Hypotheses
Terminology: Hypothesis H0 is nested within hypothesis H1 if forcing a particular choice of the parameters of H1 makes it the same as H0.
For coin tossing, forcing the unknown probability of Heads in H1 to equal 0.5 gives us exactly H0. Thus H0 is nested in H1.

Comparison of general hypotheses
Traditionally, statistical hypothesis testing compares a "null hypothesis" H0 with an "alternative hypothesis" H1.
Usually H0 is nested in H1, and we treat H0 as valid unless the evidence in favour of H1 is much stronger.

Comparison of general hypotheses (cont.)
How large a value of $2\Delta = 2(\ln L(H_1) - \ln L(H_0))$ is big enough?
Traditionally, we say that if $2\Delta$ is bigger than expected by chance in 95% (or 99%, or 99.9%) of cases when H0 is correct, then we favour H1 over H0.
Theorem: Suppose H0 is nested in H1. Then, if H0 is correct, $2\Delta$ has a chi-square distribution with $d$ degrees of freedom, where $d$ is the number of free parameters by which H1 and H0 differ:
$2\Delta = 2(\ln L(H_1) - \ln L(H_0)) \sim \chi^2_d$
If $2\Delta$ is greater than the 95% (or 99%, or 99.9%, ...) point of the $\chi^2_d$ distribution, we reject H0 in favour of H1.

Chi-square distribution
[Figure: chi-square distribution]

Comparison of general hypotheses (cont.)
These statistical tests are called Likelihood Ratio Tests or LRTs. They are a very powerful class of statistical tests, with very broad applicability.

Comparison of coin hypotheses revisited
$L(H_0) = 0.000864$, $\ln L(H_0) = -7.054$
$L(H_1) = 0.0834$, $\ln L(H_1) = -2.484$
$2\Delta = 2(\ln L(H_1) - \ln L(H_0)) = 2 \times (-2.484 - (-7.054)) = 2 \times (-2.484 + 7.054) = 9.14$
H0 and H1 differ by 1 free parameter (the probability of observing Heads), so the degrees of freedom are $d = 1$.
We compare $2\Delta = 9.14$ with the $\chi^2_1$ distribution, and observe a P-value < 0.005.
H1 (probability of Heads not necessarily equal to 0.5) is preferred to H0 (fair coin).
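The coin calculation above can be checked numerically. The sketch below uses only the Python standard library; the helper names `binomial_log_likelihood` and `chi2_sf_1df` are illustrative, not from the slides. For one degree of freedom the chi-square upper-tail probability has the closed form erfc(sqrt(x/2)), which avoids needing a statistics package.

```python
import math


def binomial_log_likelihood(heads, tosses, p):
    """ln Pr(heads | tosses, p) for independent tosses with Pr(Heads) = p."""
    tails = tosses - heads
    return (math.log(math.comb(tosses, heads))
            + heads * math.log(p)
            + tails * math.log(1 - p))


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


heads, tosses = 65, 100

# H0: fair coin, no free parameters.
lnL_H0 = binomial_log_likelihood(heads, tosses, 0.5)

# H1: one free parameter p, set to its maximum likelihood estimate.
p_hat = heads / tosses
lnL_H1 = binomial_log_likelihood(heads, tosses, p_hat)

# Likelihood ratio test statistic and its P-value (d = 1).
lrt = 2 * (lnL_H1 - lnL_H0)
p_value = chi2_sf_1df(lrt)

print(f"ln L(H0) = {lnL_H0:.3f}")
print(f"ln L(H1) = {lnL_H1:.3f}")
print(f"2*Delta  = {lrt:.2f}")
print(f"P-value  = {p_value:.4f}")
```

Working on the log scale throughout is the usual choice, since raw likelihoods for realistic data sets quickly underflow floating-point arithmetic.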
Many sequence evolution models are nested

Jukes-Cantor (JC) vs Kimura 2-parameter (K2P)
JC: 1 rate of change applies to all substitutions
K2P: 2 rates of change, transitions different to transversions
H0: unknown tree relating the parameters, JC model of substitution
H1: unknown tree relating the parameters, K2P model of substitution
The difference in parameters is just the transition/transversion rate ratio, a single number. Fixing it equal to 1 in K2P gives us the JC model: nested models!

Kimura 2-parameter (K2P) vs Hasegawa, Kishino, Yano (HKY)
K2P: 2 rates of change, transitions different to transversions
HKY: 2 rates of change, transitions different to transversions AND 3 base frequencies free to vary
H0: unknown tree relating the parameters, K2P model of substitution
H1: unknown tree relating the parameters, HKY model of substitution
The difference in parameters is the 3 base frequencies. Fixing them equal to 1/4 in HKY gives us the K2P model: nested models!

Hasegawa, Kishino, Yano (HKY) vs general time reversible (GTR or REV) model
HKY: 2 rates of change, transitions different to transversions AND 3 base frequencies free to vary
GTR: 6 rates of change AND 3 base frequencies
H0: unknown tree relating the parameters, HKY model of substitution
H1: unknown tree relating the parameters, GTR model of substitution
The difference in parameters is 4 relative rates of change. Fixing them in the appropriate ratio in GTR gives us the HKY model: nested models!

GTR model vs. GTR + Gamma model
GTR: 6 rates of change AND 3 base frequencies
GTR+$\Gamma$: as above and a rate heterogeneity parameter $\alpha$
H0: GTR
H1: GTR + $\Gamma$
Difference: 1 parameter, $\alpha$. Models are nested: $\alpha \to \infty$ recovers GTR.
But this is a special case where the parameter tested in H1 must be on the limit of what is permitted to recover H0. In this case we take a mixture of $\chi^2$ distributions for hypothesis testing.

Summary models for DNA
[Figure: overview of DNA substitution models. Red: allow for transition/transversion bias. Yellow: allow for unequal base frequencies.]

Substitution models for amino acids
[Figure: 4x4 DNA model and 20x20 AA model, Dayhoff (PAM)]
(Jukes & Cantor, 1969; Dayhoff et al., 1978; Kimura, 1980; Felsenstein, 1981 & 1984; Hasegawa, Kishino & Yano, 1985)

Mechanistic and Empirical models
• Empirical models summarise the substitution patterns from large quantities of data.
• Useful when patterns of evolution are expected to be similar between datasets.
• Mechanistic models use parameters to model factors of protein evolution.
• Estimates of parameters provide insight about the evolution of specific sequences.
Nucleotide models are often mechanistic and amino acid models are mostly empirical.

Examples of AA substitution models
Average process of evolution for globular proteins: PAM (1972, 1978), JTT (1992), WAG (2001), LG (2008) matrices.
Functional models: Chloroplast-derived amino acid replacement matrix (Adachi, 1996). Replacement matrices derived from mitochondrially-encoded proteins (Adachi and Hasegawa 2000, Yang 1998).
Structural models: Amino acid replacement matrices for alpha-helices, beta-sheets, turns and loops, with each category further classified by whether it is buried or exposed (e.g. Goldman et al., 1998, Overington et al., 1990).

+F method
AA frequencies vary a lot between different data sets, and thus also between the database the substitution matrix was estimated from and the data set you want to analyse.
The +F method allows you to replace the AA frequencies of the entire database with the AA frequencies of the specific data set analysed.

Substitution models for codons
[Figure: 4x4 DNA model, 20x20 AA model (Dayhoff, PAM), 61x61 codon model (M0)]

The Universal Genetic Code
[Figure: the universal genetic code]

Codon sequence evolution
[Figure: codon sequences changing over evolutionary time, annotated with "Transitions or transversions?", "Codon frequencies" and "Synonymous or nonsynonymous?"]

The codon model M0
For codons $i \ne j$:
$$
Q_{ij} =
\begin{cases}
0 & \text{if } i \to j \text{ is more than 1 nucleotide substitution or } j \text{ is a stop codon} \\
\pi_j & \text{if } i \to j \text{ is a synonymous transversion} \\
\kappa \pi_j & \text{if } i \to j \text{ is a synonymous transition} \\
\omega \pi_j & \text{if } i \to j \text{ is a nonsynonymous transversion} \\
\omega \kappa \pi_j & \text{if } i \to j \text{ is a nonsynonymous transition}
\end{cases}
$$
where
$\kappa$: transition/transversion rate ratio
$\pi_j$: equilibrium frequency of codon $j$
$\omega$: nonsynonymous/synonymous rate ratio
(Goldman & Yang, 1994; Yang et al., 2000)

M0
[Figure: M0 rate matrix, showing 1-nucleotide changes at the 1st, 2nd and 3rd position of the codon]

Detecting positive selection
$\omega < 1$: purifying selection
$\omega = 1$: neutral evolution
$\omega > 1$: positive selection
Variation of the nonsynonymous/synonymous rate ratio among sites:
[Figure: site classes under Model 0, Model 1a and Model 2a (Bielawski, 2005)]

Test for positive selection
Model 1a (2 classes of sites): each site evolves with $\omega_0 < 1$ or $\omega_1 = 1$
Model 2a (3 classes of sites): each site evolves with $\omega_0 < 1$, $\omega_1 = 1$ or $\omega_2 > 1$
$\omega < 1$: purifying selection
$\omega = 1$: neutral evolution
$\omega > 1$: positive selection
(Branch-site models; Yang & Nielsen, 1998)

Positive selection in six mammalian genomes
6 high-quality genomes of eutherian mammals
17489 orthologous genes: human / chimp / macaque / mouse / rat / dog
544 genes identified to be under positive selection (PSGs) using codon models
[Figure: phylogeny of human, chimp, macaque, mouse, rat and dog; scale bar 0.05]
Kosiol et al., PLoS Genetics, 2008

Complement component C7
M1a (nearly neutral): $p_0 = 0.67$ ($p_1 = 1 - p_0 = 0.33$), $\omega_0 = 0.05$ ($\omega_1 = 1$), $\log L_1 = -6530.84$
M2a (selection): $p_0 = 0.68$, $p_1 = 0.31$ ($p_2 = 1 - p_0 - p_1 = 0.02$), $\omega_0 = 0.06$ ($\omega_1 = 1$), $\omega_2 = 10.51$, $\log L_2 = -6520.02$
$2 \times (\log L_2 - \log L_1) = 2 \times (-6520.02 - (-6530.84)) = 2 \times 10.82 = 21.64$
$\chi^2$ (df = 2, 0.05) = 5.99
P-value = 2.01e-05

Co-evolution in the complement pathway
[Figure: genes of the complement pathway, highlighted at P < 0.05 and FDR < 0.05]
Kosiol et al., PLoS Genetics, 2008

What if the assumptions of the LR test are not satisfied?
Often the hypotheses are unsuitable, or the sample size is not big enough, for the asymptotic assumptions to be valid. Then we can't use a $\chi^2$ distribution table.
Instead, use a simulation method to obtain the required distribution for significance testing.
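As a rough illustration of the simulation idea, the sketch below reuses the coin-toss data from earlier in the lecture (65 Heads in 100 tosses). It simulates data sets under H0, recomputes the likelihood ratio statistic for each, and uses the fraction of simulated statistics at least as large as the observed one as the P-value. This is one simple Monte Carlo reading of "simulation method", not necessarily the exact procedure intended on the slide; the function names, the number of replicates and the random seed are all assumptions.

```python
import math
import random


def lrt_statistic(heads, tosses, p0=0.5):
    """2 * (ln L(H1) - ln L(H0)) for a coin; the binomial coefficient cancels."""
    stat = 0.0
    for count, prob0 in ((heads, p0), (tosses - heads, 1 - p0)):
        if count > 0:  # convention: 0 * ln(0) = 0
            stat += count * math.log((count / tosses) / prob0)
    return 2 * stat


def simulated_p_value(observed_heads, tosses, p0=0.5,
                      replicates=10_000, seed=1):
    """Estimate the P-value from the simulated null distribution of the statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = lrt_statistic(observed_heads, tosses, p0)
    exceed = 0
    for _ in range(replicates):
        # Simulate one data set under H0: each toss is Heads with probability p0.
        heads = sum(rng.random() < p0 for _ in range(tosses))
        if lrt_statistic(heads, tosses, p0) >= observed:
            exceed += 1
    return exceed / replicates


p_sim = simulated_p_value(65, 100)
print(f"Observed 2*Delta : {lrt_statistic(65, 100):.2f}")
print(f"Simulated P-value: {p_sim:.4f}")
```

For phylogenetic models the same scheme applies in principle, but each replicate requires simulating an alignment under the fitted H0 model and re-estimating both models, which is far more expensive than tossing simulated coins.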
Models that are not nested
We can use the Akaike Information Criterion (AIC):
$\mathrm{AIC}(\text{model}) = 2k - 2 \ln L(\text{model})$
where $k$ is the number of free parameters. The model with the smallest AIC value is considered to be best.

Models that are not nested
The Akaike Information Criterion (AIC) is suitable for comparing empirical models of protein evolution.
For example: JTT vs WAG, or WAG+F vs LG+F

AIC and BIC
$\mathrm{AIC}(\text{model}) = 2k - 2 \ln L(\text{model})$
Prefer a model with small AIC.
Both LRT and AIC favour parameter-rich models too often.
$\mathrm{BIC}(\text{model}) = k \ln(n) - 2 \ln L(\text{model})$
where $n$ is the sample size (sequence length).
BIC penalizes parameter-rich models more severely.

Summary: Hypothesis Testing
ML gives a likelihood value to each tree, given that the sequences evolved according to that tree and a given model of substitution:
$L = \Pr(\text{data} \mid \text{tree}, \text{model}) = \Pr(\text{data} \mid \text{tree}, \theta)$
We find the model that maximises $L$, i.e. makes the data as probable as possible: "maximum likelihood".
The aim is to find a model which describes the data significantly better than another model.
Probabilistic models can be used to test hypotheses about the evolutionary processes acting on a set of sequences.

Literature list
Substitution models (Review papers)
- Whelan S, Lio P, Goldman N, 2001. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in Genetics 17: 262-272.
- Huelsenbeck JP and Rannala B, 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276: 227-232.
- Anisimova M and Kosiol C, 2009. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26: 255-271.
PAML
- http://abacus.gene.ucl.ac.uk/software/paml.html
- Yang Z, 2006. Computational Molecular Evolution. Oxford University Press.
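As a closing illustration of the AIC and BIC formulas from the lecture, the sketch below scores the two coin hypotheses using the log-likelihoods quoted in the coin example. Treating the 100 tosses as the sample size $n$ (the analogue of sequence length) and counting $k = 0$ free parameters for the fair coin are assumptions made for this example; the function and variable names are illustrative.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Log-likelihoods from the coin example; k = number of free parameters.
models = {
    "H0 (fair coin)": {"lnL": -7.054, "k": 0},
    "H1 (free p)": {"lnL": -2.484, "k": 1},
}
n = 100  # assumed sample size: the number of tosses

scores = {
    name: {"AIC": aic(m["lnL"], m["k"]), "BIC": bic(m["lnL"], m["k"], n)}
    for name, m in models.items()
}

# Smaller is better for both criteria.
best_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_bic = min(scores, key=lambda name: scores[name]["BIC"])

for name, s in scores.items():
    print(f"{name:15s} AIC = {s['AIC']:.3f}  BIC = {s['BIC']:.3f}")
print("Preferred by AIC:", best_aic)
print("Preferred by BIC:", best_bic)
```

With $n = 100$ the BIC penalty per parameter is $\ln(100) \approx 4.6$ instead of 2, which shows how BIC penalizes extra parameters more heavily than AIC. Unlike the LRT, neither criterion requires the models to be nested.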