Notes - desdevises
Phylogenetic Reconstruction
Yves Desdevises
Université Pierre et Marie Curie (Paris 6), Observatoire Océanologique de Banyuls, France
http://desdevises.free.fr
http://desdevises.free.fr/Phylogenetic_reconstruction

References
• Felsenstein J. 2004. Inferring phylogenies. Sinauer.
• Lemey P., Salemi M. & Vandamme A.-M. 2009. The phylogenetic handbook. Second edition. Cambridge University Press.
• Hall B. 2007. Phylogenetic trees made easy. Third edition. Sinauer.
• Page R. & Holmes E. 1998. Molecular evolution: a phylogenetic approach. Blackwell.
• Nei M. & Kumar S. 2000. Molecular evolution and phylogenetics. Oxford University Press.

• Goal: propose a hypothesis of relationships between several taxa
• Phylogeny = tree (≠ ladder)
• Speciation: binary
• Based on homology: similarity inherited from a common ancestor
  • Indicates the existence of a common ancestor
  • Identified from a phylogenetic tree, and the basis on which the tree is built!

Phylogenetic trees
[Figure: example trees for a set of fish species (Labrus, Symphodus, Cheilinus, Halichoeres, Thalassoma, ..., Pagrus major); the same species appear in several tree drawings]

• Cladogram
  • No branch lengths
  • Clades
• Phylogram
  • Branch lengths
[Figure: ultrametric tree and additive tree]

[Figure: tree of taxa A to J with labels: leaves (= terminal taxa), terminal branches, internal branches, node, root, clade, polytomy]

• Speciation
[Figure: hypothesis of relationships between A, B and C]

Rooting
• Gives the branching order
• Use of an outgroup
• Rest = ingroup
[Figure: an unrooted tree becomes a rooted tree once an outgroup is added]

• Outgroup: sister taxon (or taxa) of the ingroup
• Characters shared between outgroup and ingroup = ancestral characters
• Sometimes there is no outgroup: root at equal distance from the tree tips (needs branch lengths) = midpoint rooting
[Figure: two trees of taxa A to F]

• Groups
  • Monophyletic (clade): natural group
    • Mammals
  • Paraphyletic
    • Reptiles
  • Polyphyletic
    • Algae, protozoans

Characters
• Organisms are composed of different features
• These features differ among taxa: character states
• All the states of a feature together form a character
• These states are produced by heritable changes
• Phylogenetic inference is based on differences between character states

• We want to establish the ancestor-descendant link from the presence/absence of character states
• We look for the appearance of new character states in descendants
• The different character states are homologies
• Taxa sharing the new (derived) character state form clades
  • Example: hair in mammals
• Characters can be differentially weighted

• Homology
• Homoplasy

• Ancestral characters: plesiomorphies
• Shared ancestral characters: symplesiomorphies
• Derived characters: apomorphies
• Shared derived characters: synapomorphies
  • Ideally, these identify clades
• Non-shared derived characters = particular to a given taxon: autapomorphies

Homology
• Homologies are expected to show similarities in:
  • position
  • structure
  • development
• A recognized criterion supporting homology is congruence with other characters
[Figure: hair (absent/present) mapped on a tree of dog, lizard, frog and human, with the change marked]

Homoplasy
• Non-homologous similarities
• Results from independent evolution
  • Convergence
  • Parallelism
  • Reversal
• Blurs phylogenetic signal: may lead to false evolutionary relationships
[Figure: parallelism, convergence and reversal]
[Figure: tail (absent/present) mapped on two alternative trees of lizard, human, frog and dog]

• Without homoplasy, phylogenetic inference would be easy
• Main problem of phylogenetic reconstruction: discriminating homoplasy (noise) from homology (signal)
• Data quality (“good” phylogenetic signal) is more important than the method used

• If there is only one correct tree and characters support different trees, at least one of them contains homoplasies
[Figure: hair and tail (absent/present) mapped on two conflicting trees of dog, lizard, frog and human]

Congruence
• The chosen tree is the one maximising the number of congruent characters
[Figure: tree of dog, human, lizard and frog; changes in hair, milk, ... support the mammals]

Case of molecular data
• Homoplasy is more common with molecular than with morphological data
  • Few states (4 for DNA: A, G, C, T)
  • Chemically close
  • Evolutionary rates can be high
  • Homoplasy cannot be identified via structure or development

Data
• Fossils: rare
• Morphological characters
• Molecular characters: DNA, proteins, ...
  • By far the most used now: models, numerous characters, less subjective, ...
  • But... it is the phylogeny of the DNA fragment (≠ taxa)
  • Future: genomes ➙ phylogenomics
• Others (behaviour, hosts, habitat, ...)
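The dog/lizard/frog/human example from the homoplasy and congruence slides can be turned into a small step count. The sketch below uses Fitch's counting rule for unordered characters on two candidate trees. The two topologies, the tail states (present in dog and lizard) and all function names are assumptions made for illustration, since the original figures did not survive the transcription.

```python
# Minimal sketch: count character steps on two candidate rooted trees
# with Fitch's rule. Trees are nested 2-tuples of taxon names.
# ASSUMPTIONS: topologies and the tail states are illustrative only.

def fitch_steps(tree, states):
    """Return (possible ancestral states, number of steps) for one character."""
    if isinstance(tree, str):              # leaf: observed state, no step
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch_steps(left, states)
    rset, rsteps = fitch_steps(right, states)
    common = lset & rset
    if common:                             # children agree: no extra step
        return common, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1  # conflict: one more step

def tree_length(tree, characters):
    """Sum of steps over all characters (equal weights)."""
    return sum(fitch_steps(tree, s)[1] for s in characters.values())

characters = {
    "hair": {"Dog": 1, "Human": 1, "Lizard": 0, "Frog": 0},
    "milk": {"Dog": 1, "Human": 1, "Lizard": 0, "Frog": 0},
    "tail": {"Dog": 1, "Human": 0, "Lizard": 1, "Frog": 0},  # assumed states
}

tree_a = (("Dog", "Human"), ("Lizard", "Frog"))   # groups the mammals
tree_b = (("Dog", "Lizard"), ("Human", "Frog"))   # groups the tailed taxa

print(tree_length(tree_a, characters))  # 4: hair 1 + milk 1 + tail 2
print(tree_length(tree_b, characters))  # 5: hair 2 + milk 2 + tail 1
```

Under these assumptions the tree grouping dog and human is shorter, so the tail is read as the homoplastic character, in line with the congruence argument.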
Morphological data
• Homology is hard to identify
• Characters are often few: a problem when studying many taxa, especially closely related ones
• Some subjective decisions
• Evolutionary processes poorly known: limits the choice of method
• Requires coding
  • Sometimes difficult
  • Implies hypotheses on character evolution

Coding
• Binary: absence/presence coded 0/1
• Multiple states (ordered or not): the number of steps between states must be defined
  • Additive binary coding: e.g. 00, 01, 10, 11
  • Linear coding: e.g. 0, 1, 2
• Both can be combined

Molecular data
• Nucleotides or amino acids (for ancient divergences)
• Characters = base (or AA) positions
• Character states = base (or AA) identity
• Important step: alignment
  • Sometimes manual
  • Automated methods: manual editing required
  • No test: no null hypothesis
  • Can use information on secondary structure or coding nature

• Nucleotides: only 4 states (in 2 types)
  • Evolution can be modelled
  • Homoplasy arises “easily”

• Amino acids
  • 20 states
  • 5 categories
  • Evolution much more difficult to model
• Codons
  • 61 states!

• Gene tree ≠ species tree
• Genes: orthologous or paralogous
[Figure: gene tree in which an ancestral gene duplicates; copies within each resulting lineage (a, b, c and A, B, C) are orthologs, copies across the duplication are paralogs]

Alignment
<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **
• Hypothesis of positional homology between nucleotides or AA
• Methods
  • Manual (Seaview, BioEdit, Se-Al, ...)
  • Automated (ClustalX, MAFFT, POY, MUSCLE, TCoffee, ...)
  • Combination (what we do)

• Alignment may be easy or not
• Coding sequence or not
  • Use AA (codons) for alignment
  • Consider AA types (size, polarity, hydrophobicity)
• Sequences may be more or less divergent
• Homology can vary among regions
• Alignment is performed by adding insertion-deletion events (indels) as gaps, limited by penalties (except at sequence ends)

• Goal of automated alignment: maximise the alignment score
• Example (dot plot): GATTC vs GAATTC
• We define: match = +1, mismatch = 0, indel = -1

GATTC-
GAATTC
Column scores: 1 1 0 1 0 -1, score = 2

GA-TTC
GAATTC
Column scores: 1 1 -1 1 1 1, score = 4

G-ATTC
GAATTC
Column scores: 1 -1 1 1 1 1, score = 4

The last two are the 2 optimal alignments.

• Need to define a gap opening penalty (GOP) and a gap extension penalty (GEP); the GEP is generally lower (favours extending existing gaps over opening new ones everywhere in the alignment)
• GOP and GEP may vary along sequences, depending on gaps already present and on biochemical features (e.g. hydrophilic AA)
• Substitutions can be differentially weighted (some are easier than others; e.g. for AA: BLOSUM 62 matrix)

• Analytically complex problem: the “best” alignment cannot be guaranteed when the number of sequences rises (multiple alignment)
• Progressive alignment (e.g. Clustal)
  • Estimation of a guide tree (NJ) from pairwise alignments
  • The closest sequences are aligned first, and so on
  • Fast but no optimality criterion

• Global or local alignment
  • Global: considers the whole sequence length. Good when divergence is low and sequences have similar sizes
  • Local: by region. Better if there are variable regions
  • Hybrid (semiglobal or glocal)

• Informative regions can be selected automatically after the alignment, by removing badly aligned parts
  • GBlocks
  • Several options are available to modify the stringency of selection

Saturation
• Multiple hits
• Multiple substitutions at the same site
• At fast evolving sites
[Figure: multiple substitutions at one site between Seq 1 (AGCGAG) and Seq 2 (GCGGAC)]

• rRNA small subunit
• 3 observable changes
• 12 actual changes

• Detection
  • Plot transitions (Ti) vs transversions (Tv)
  • Plot % differences between sequences vs time (if available)
  • Plot uncorrected vs corrected distances
[Figure: plots contrasting saturation with no saturation (Jukes-Cantor)]

• Correction
  • Use an evolutionary model to correct the divergence between sequences
  • Remove fast evolving sites (e.g. third codon position)
  • Use different weights for Ti and Tv
  • Use only Tv
  • Use more slowly evolving sequences

Bias
• Long branch attraction
  • Arises if the method assumes that all sites change at the same rate
[Figure: true tree and inferred tree for taxa A, B, C, D with long (p) and short (q) branches; the two long branches are grouped in the inferred tree]

• Codon usage bias: some codons are used more than others for the same AA

• Base compositional differences among lineages (LogDet, heterogeneous ML)
• Example: % G+C in thermophilic bacteria
[Figure: true tree and inferred tree for Aquifex (73% G+C), Thermus (72%), Bacillus (50%) and Deinococcus (52%)]

Optimality criteria
• Used to choose the “best tree”
• Hypothesis on how evolution works
• Differ among methods
  • Number of steps
  • Sum of branch lengths
  • Likelihood

Several methods. The best?
• Parsimony
• Distance
• Maximum likelihood
• Bayesian inference
• If there is an optimality criterion, topologies must be compared to find the best one

Topologies: number
• Number of unrooted trees (for n taxa): $\prod_{i=3}^{n} (2i-5) = (2n-5)(2n-7)\cdots(3)(1)$
• Number of rooted trees (for n taxa): $\prod_{i=2}^{n} (2i-3) = (2n-3)(2n-5)\cdots(3)(1)$
• Examples
  • 5 taxa: 105 rooted trees
  • 8 taxa: 135 135 rooted trees
  • 10 taxa: 34 459 425 rooted trees
  • 50 taxa: about 3 × 10^74 unrooted trees, an astronomically large number often compared to the number of atoms in the universe

• Algorithms to explore treespace
  • Exhaustive search if few taxa (10-12 for parsimony): examines all topologies
  • Branch and bound: explores only part of treespace but still guarantees the optimal tree (about 20 taxa in parsimony)
  • Heuristic search: faster but with no guarantee of finding the optimum; builds a “good” tree via a driven agglomeration procedure and rearranges it to find a better tree

Treespace
[Figure: “treespace” landscape with starting trees, suboptimal islands of trees and the global optimum]

• Rearrangements:
  • NNI = Nearest Neighbour Interchange
    • Faster but less rigorous than other techniques
  • SPR = Subtree Pruning Regrafting
  • TBR = Tree Bisection Reconnection
    • More rigorous but slower
• Launch several independent searches with the exhaustive algorithm

Parsimony

Cladistics
• Two lineages are more closely related to each other than to a third if they share a more recent common ancestor
• Phylogenetic hypothesis = hypothesis of a common ancestor
• Associated with reconstruction via parsimony
• MP = Maximum Parsimony

Parsimony
• “Ockham’s razor”: Pluralitas non est ponenda sine necessitate
• Favour the simplest solution
• Choose between competing phylogenetic hypotheses
• Maximise congruence and minimise homoplasy
• Assess character fit to trees
• Method based on individual characters

Character fit
• Minimum number of steps (from one state to another) required to explain the observed distribution of character states
• Determined by character optimisation (mapping) via parsimony
• Optimisation differs among trees
• Changes may be placed in more than one way on a single tree with a given number of steps: branch lengths may not be defined
[Figure: example with the character hair (absent/present) for bird, bat, human, crocodile, kangaroo and frog: 1 step on one tree, 2 steps on another]

Parsimony analysis
• For a set of characters, determine the fit (number of steps) of each character to the tree
• The sum over all characters (multiplied by their weights, if any) is called the tree length
• The most parsimonious trees (MPT) are those with the smallest length
• Informative character: at least 2 states, each found in at least 2 taxa
• Optimality criterion (= objective function): number of steps = tree length

• Several MPT may be obtained
• Several trees: consensus
• Trees give hypotheses on character evolution
• Branch lengths: number of changes. Generally underestimated. Not the objective in MP
• Several indices assess the fit between tree and data (Consistency Index, Retention Index, ...)

Consensus
• Strict
• Semi-strict
• Majority-rule

Character types
• Different costs for state changes
• Wagner (ordered, additive): morphology, 0→1→2
• Fitch (unordered, non-additive, equal costs): DNA, AA, morphology
• Sankoff (generalized): e.g. A ↔ G or T ↔ C costs 1 step, any other change costs 5 steps
  • Typical example: different weights for transitions and transversions
  • Symmetrical or asymmetrical costs

Stepmatrices
From \ To   A   C   G   T
A           0   5   1   5
C           5   0   5   1
G           1   5   0   5
T           5   1   5   0
• Purines (Pu): A, G; pyrimidines (Py): C, T
• Transitions (Ti): Pu ↔ Pu or Py ↔ Py; the easiest changes
• Transversions (Tv): Pu ↔ Py; more numerous

Generalized parsimony
• = Weighted parsimony
• Different costs for different changes
• Minimise the sum of costs = global cost

• Problem: defining the costs
• Knowledge of molecular evolution is used to define costs
  • Transitions/transversions (Ti/Tv, numbers or rate)
  • Substitution rate heterogeneity, e.g. among codon positions

Algorithms
1. Calculate topologies
2. Optimise all characters and calculate the length
• Slow if there are many taxa
• Algorithms
  • Exhaustive search if few taxa (about 10): examines all topologies
  • Branch and bound: explores only part of treespace but still guarantees the optimal tree, for about 20 taxa
  • Heuristic search: faster but with no guarantee of finding the optimum; finds “good” trees and rearranges them

Parsimony - Advantages
• Simple
• No explicit evolutionary model
• Gives tree and character evolution
• Good if homoplasy is rare
• Good for morphological characters

Parsimony - Drawbacks
• Problems if homoplasy is frequent, or concentrated in some regions
• Long branch attraction (Felsenstein zone)
• Underestimates branch lengths
• Implicit evolutionary model: behaviour may not be clear
• More justified on philosophical than on numerical grounds

Maximum likelihood
• Maximum Likelihood = ML
• Method based on individual characters
• Uses an explicit evolutionary model
• MP is sometimes considered a special case of ML
• The most computationally demanding method
• The model is very important: only for molecular data

Principle
• Answers the question: what is the probability of observing the data given the evolutionary model (process and tree)?
• Pr(D|T)
• The model parameter values are estimated so as to maximise this probability: the likelihood
• Of course, we look for the tree (topology and branch lengths)
• Computing the likelihood for all topologies is not feasible: heuristic algorithm

Nucleotides
• Given a tree with tips A, B, C, D, a substitution matrix P and base frequencies π = [A, C, G, T]
• Probability of observing A: AACG, B: ACCG, C: AACA, D: AATG
• P (rows and columns in the order A, C, G, T):
$P = \begin{pmatrix} a & b & c & d \\ b & a & e & f \\ c & e & a & g \\ d & c & f & a \end{pmatrix}$

Parameters
• Base frequencies: π = [A, C, G, T]
  • Sum = 1
• Substitution rates: P matrix
  • Row sum = 1
  • Function of bases and time (branch lengths)
  • Heterogeneity: Γ
• Tree
  • Topology
  • Branch lengths

Substitution rate heterogeneity
• Parameter: α (shape of the Γ distribution)
  • high α: rate close to 1 at all sites
  • small α (0.5): few changes at most sites
  • α close to 0: rates very different among sites
• In practice, a discrete distribution with 4 classes gives good results

• The probability of observing a given sequence is the product of the base frequencies (composition) and the substitution probabilities (which depend on branch lengths)
• Example: CCAT and CCGT separated by a branch of length b, with π = [0.1, 0.4, 0.2, 0.3] and, for this branch length,
$P = \begin{pmatrix} 0.976 & 0.01 & 0.007 & 0.007 \\ 0.002 & 0.983 & 0.005 & 0.01 \\ 0.003 & 0.01 & 0.979 & 0.007 \\ 0.002 & 0.013 & 0.005 & 0.979 \end{pmatrix}$
• Likelihood $= \pi_C P_{C \to C}\, \pi_C P_{C \to C}\, \pi_A P_{A \to G}\, \pi_T P_{T \to T} = 0.4 \times 0.983 \times 0.4 \times 0.983 \times 0.1 \times 0.007 \times 0.3 \times 0.979 \approx 0.00003$

• The likelihood L changes with branch length
[Figure: likelihood L (axis from 0 to 0.0002) plotted against branch length b (0 to 0.6); the maximum likelihood is reached for a branch length of 0.330614]

• Very small number: compute log(L)
  • Additivity: log(AT) = log(A) + log(T)
  • Negative number (0 < L < 1)
• Do the same thing for the whole tree
  • for all topologies and branch lengths
  • for all sequences of a fixed length, and sequences at the nodes (ancestral sequences)
  • while estimating the best parameters
• Very long...

• ...and changes do not happen the same way at the same places
  • Constraints in structure
  • Codon position
  • Active site
  • etc.
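The single-branch likelihood example above (CCAT and CCGT, with the given π and P) can be recomputed in a few lines. This is a minimal sketch: the variable and function names are illustrative assumptions, and the matrix values are copied as transcribed from the slide.

```python
import math

BASES = "ACGT"
# Base frequencies pi = [A, C, G, T] from the example
PI = dict(zip(BASES, [0.1, 0.4, 0.2, 0.3]))
# Substitution probabilities for the given branch length b (row = from, column = to)
P = {
    "A": dict(zip(BASES, [0.976, 0.010, 0.007, 0.007])),
    "C": dict(zip(BASES, [0.002, 0.983, 0.005, 0.010])),
    "G": dict(zip(BASES, [0.003, 0.010, 0.979, 0.007])),
    "T": dict(zip(BASES, [0.002, 0.013, 0.005, 0.979])),
}

def site_likelihoods(seq_from, seq_to):
    """Per-site terms pi[x] * P[x][y] for two aligned sequences."""
    return [PI[x] * P[x][y] for x, y in zip(seq_from, seq_to)]

sites = site_likelihoods("CCAT", "CCGT")
likelihood = math.prod(sites)                      # product over sites
log_likelihood = sum(math.log(s) for s in sites)   # same quantity on the log scale

print(f"L = {likelihood:.2e}")        # about 3e-05, as on the slide
print(f"log L = {log_likelihood:.2f}")
```

Working with the sum of per-site logarithms rather than the raw product avoids numerical underflow once many sites are multiplied together.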
• And the substitution rate at a given position varies with time: heterotachy

• We can add a proportion of invariant sites (it can be estimated via ML: one more parameter)
• Compute α for variable sites and/or use different models for different positions
  • Codon position
  • Alpha-helix

Basic models for DNA (α: transitions, β: transversions)
Model                              Base frequencies       Substitution rates
Jukes-Cantor (JC)                  πA = πC = πG = πT      α = β
Kimura 2 parameters (K2P)          πA = πC = πG = πT      α ≠ β
Felsenstein 81 (F81)               πA ≠ πC ≠ πG ≠ πT      α = β
Kimura 3 parameters (K3P)          πA = πC = πG = πT      α ≠ β1 ≠ β2
Hasegawa-Kishino-Yano 85 (HKY85)   πA ≠ πC ≠ πG ≠ πT      α ≠ β
Symmetric (SYM)                    πA = πC = πG = πT      6 different rates
Tamura-Nei (TrN)                   πA ≠ πC ≠ πG ≠ πT      α1 ≠ α2 ≠ β
General Time Reversible (GTR)      πA ≠ πC ≠ πG ≠ πT      6 different rates

Coding sequences
• Different constraints on different codon positions
• Partition the sequence according to codon position and assign different models/parameters. Various options:
  • SRD06 (Shapiro et al. 2006)
    • Link positions 1 and 2
    • Position 3 can have a different rate, Ti/Tv, Γ
  • Use a codon model

• Use the information in the genetic code: codon model
• Computationally intensive
• GY94 (Goldman & Yang 1994, Muse & Gaut 1994) (MrBayes, HyPhy, PAML)
• New parameter ω = ratio of nonsynonymous to synonymous substitutions

Proteins (amino acids)
• Model: probability of change from one AA to another (PhyML, PhyloWin, Puzzle, Phylip)
• 20 AA: many more possibilities than for nucleotides, estimation is difficult
• Many empirical models (Dayhoff, JTT, WAG, Blosum, ...), from sequence pairs or tree-based comparisons on big datasets
• Some models are based on codons (REV)
• They take AA characteristics into account

Model choice
• The more parameters a model has
  • The better it fits the data
  • The longer the computing time
  • The more uncertain the estimation (= variance increases = degrees of freedom decrease)

• A compromise is needed
• Beyond a certain point, choosing a more complex model does not significantly increase the likelihood
• Solution: hLRT or AIC (Modeltest, MrModelTest, ProtTest)
  • hLRT (hierarchical likelihood ratio test): compares models (which must be nested)
  • AIC (Akaike information criterion): estimates model fit to the data
    • AIC = 2k - 2 log L, where k is the number of free parameters
    • Choose the model with the lowest AIC

• Estimating parameters while estimating the topology takes very long
• If the tree is roughly correct, parameter estimation is stable
• Estimate parameters on a fixed tree (rapidly constructed via e.g. MP or NJ)
• Use these parameters to estimate the topology

Likelihood Ratio Test
• Can be used to test many kinds of hypotheses
• Comparison of two nested hypotheses: one (H0) is a special case of the other (H1)
• Statistic Δ = log L1 - log L0
• If there is no difference, 2Δ follows a χ² distribution with degrees of freedom equal to the difference in the number of parameters between the two hypotheses
• Comparison of models, topologies (KH and SH tests), branch lengths (molecular clock), ...

ML - Advantages
• Accounts for saturation
• Reliable branch lengths
• Consistent: with a good model, converges toward the right tree as the amount of data increases
• With a good model, not affected by LBA
• Uses all the data (not only “informative sites”)
• Gives the evolutionary process and ancestral sequences
• Quite robust

ML - Drawbacks
• Inconsistent with a wrong model
• Even the most complex model simplifies reality
• Still computationally intensive: needs heuristics, hence compromises

Bayesian inference
• Recent and now widely used method (MrBayes, PhyloBayes, BayesPhylogenies)
• Same models as ML (MrModelTest)
• Gives the posterior probability of parameters (among which topology and branch lengths), based on previous knowledge about the data: prior probability (controversial)

• What is the probability of the model/theory/tree given the data?
• Pr(T|D) = Pr(T) Pr(D|T) / Pr(D), where Pr(T|D) is the posterior, Pr(T) the prior, Pr(D|T) the likelihood and Pr(D) the probability of the data

• Bayes' formula combines prior probability and likelihood to yield the posterior probability: when the prior is chosen as non-informative (e.g. flat), the posterior probability (pp) mainly depends on the likelihood

• BI does not search for “the” best tree (and parameters), but explores treespace with a Markov chain Monte Carlo (MCMC) and samples trees once a plateau is reached (i.e. high-probability trees): confidence intervals, assessment of clade support (pp)
• No validation step needed: a high number of trees is produced, and a consensus based on a sample of trees gives the probabilities of clade support (if the model is good): faster than ML
• Problem: running the chains long enough. Several chains are used to explore treespace better (MCMCMC = Metropolis-coupled MCMC) and to avoid getting stuck on local hills

[Figure: traditional approach (ML, MP) compared with Bayesian inference; the MCMC tends to choose trees with higher pp and, after a delay (long!), samples trees with high pp]

• Bayesian analyses estimate marginal (Tree B) rather than joint (Tree A) probability
• ML selects Tree A (highest peak)
• BI chooses Tree B (most voluminous peak)

• Example: MrBayes output
[Figure: MrBayes “Rough plot of parameter LnL” over generations 1 to 100000; LnL rises from -72924.41 and quickly reaches a plateau at about -47216.46]
• 100000 iterations (generations)
• One tree sampled every 100 generations: 1000 trees
• Burn-in: discard the first 200 trees (keep only trees from the plateau) and build a consensus

• Many applications beyond phylogenetic reconstruction: ancestral character states, divergence time estimates, ...
• Clade pp tend to be higher than bootstrap values (e.g. from ML): is precision overestimated?
• Maybe not: pp should not be interpreted as bootstrap proportions

Distances
• Assessment of the mean number of changes between two taxa
• Based on distances, not on individual characters
• Data are sometimes available only as distances (DNA/DNA hybridization, serology, morphometry, ...)
• If not, the data are transformed into a distance matrix
• Mainly for molecular data

• The percentage of differences between sequences (p-distance, Hamming distance) generally underestimates the true distance because of saturation
• Especially true if sequences are distantly related
• Use a model to correct distances: the parameters reflect the way we think molecular evolution works (same models as ML: JC, K2P, GTR, ...)
• These models can account for substitution rate heterogeneity (Γ)
• The LogDet distance allows for different base frequencies among sequences

• Coding DNA: substitutions are either synonymous (silent: they do not change the AA) or non-synonymous
• The evolutionary rate is higher for synonymous substitutions
• Ka = non-synonymous distance = non-synonymous substitutions / non-synonymous sites
• Ks = synonymous distance = synonymous substitutions / synonymous sites
• These distances account for Ti and Tv (K2P)
• Close sequences: only Ks is informative
• Distant sequences: Ks is saturated, Ka is informative

Algorithms
• Main: Neighbor-Joining (NJ)
  • Additive trees
  • Derived methods: BioNJ, Weighbor, ...
• Sometimes (not anymore): UPGMA
  • Ultrametric trees (molecular clock)

• NJ: starts with a star tree and forms the pairs that minimise tree length (sum of branch lengths)
[Figure: star tree of taxa 1 to 8 and the tree obtained after joining a first pair]
• Tends to generate the shortest tree, but there is no optimisation during the agglomeration procedure (which is very fast)

Parameters
• The model must fit the data: we must find the right parameters
  • Number of invariant sites
  • Heterogeneous substitution rate along the sequence alignment
  • Different substitution rates for different types of change

• Starting distances ≠ patristic distances (computed from the tree)
• Otherwise it would be easy, because the tree is additive
[Figure: additive tree with A (0.1) and B (0.3) on one side of an internal branch (0.1), C (0.2) and D (0.6) on the other]
     A     B     C     D
A    -     0.4   0.4   0.8
B    0.4   -     0.6   1.0
C    0.4   0.6   -     0.8
D    0.8   1.0   0.8   -

• Things are different in real life
  • Stochastic errors even with a perfect model
  • The model is never perfect (evolutionary model and algorithm)
• A criterion is needed to assess the fit of the original data to the tree (topology and branch lengths)
  • Fitch-Margoliash: least squares
  • Minimum evolution (ME): minimise tree length
• The algorithm itself does not guarantee that the criterion is reached, even if NJ is a good approximation: better to add an optimisation step

Distances - Advantages
• Fast: the only method available if the number of taxa is very high
• Many models, which can be tested via ML
• LogDet is very useful when base composition varies, but does not account for substitution rate heterogeneity (remove invariant sites)

Distances - Drawbacks
• Information loss: sequences cannot be recovered from distances
• No scenarios of character evolution
• Generally performs less well than ML (simulations)
• Poor for old divergences

Validation
• Any data set yields a tree, even without phylogenetic signal
• No way to test whether this is “the” right tree: no interesting null hypothesis
• But we can assess the confidence we have in a tree
• Many methods are based on randomisation (destruction or alteration of phylogenetic signal)
• Most methods are independent of the tree reconstruction method

Bootstrap (non-parametric)
• Resampling technique
• Create new datasets (100, 1000, ...) from the original: random selection of characters (columns) with replacement (without replacement: jackknife)
• Noise in the phylogenetic structure = estimation of sampling variance
• Build a tree from each dataset
• Compute the majority-rule consensus of all trees
• Percentage of clade occurrence = support

• Widely used
• Assumes that characters are independent
• Assumes that they are identically distributed
• Not a statistical test
• Often too conservative
• Requires many characters: usually not good for morphology

Parametric bootstrapping
• Select a model from the data (ModelTest)
• Estimate the topology
• Use model and topology to generate data via simulation (SeqGen)
• Analyse the variation among simulated datasets: topology, confidence intervals (dating, ...), topology comparison tests (SOWH, ...)

Permutation Tail Probability
• Statistical test. H0: no phylogenetic structure
• Measure a statistic on the tree (e.g. length)
• Destroy the original data structure via random permutations (randomisation)
• Generate a distribution of the statistic under H0
• PTP: proportion of randomised datasets whose statistic is as good as or better than the observed one (e.g. trees as short or shorter)

Randomisation
• Keeps the number of taxa, characters and character states
Original matrix (taxa in rows, characters 1 to 8 in columns):
Taxa   1 2 3 4 5 6 7 8
R-P    R P R P R P R P
A-E    A E A E A E A E
N-R    N R N R N R N R
D-M    D M D M D M D M
O-U    O U O U O U O U
M-T    M T M T M T M T
L-E    L E L E L E L E
Y-D    Y D Y D Y D Y D
Randomised matrix (states permuted among taxa within each character):
Taxa   1 2 3 4 5 6 7 8
R-P    N U D E R T O U
A-E    R E A P L E A D
N-R    M R M M A D N P
D-M    L T R E Y M D R
O-U    D E Y U D E Y M
M-T    O M O T O U L T
L-E    Y D N D M P M E
Y-D    A P L R N R R E

[Figure: frequency distribution of a measure of data quality (e.g. tree length, ML, pairwise incompatibilities) from good to bad, with a 95% cutoff; data beyond the cutoff on the good side pass the test (reject the null hypothesis), the rest fail]

• Phylogenetic signal (* = original data)
Tree length   Replicates      Tree length   Replicates
1222*         1               1686          8
1669          1               1687          7
1671          1               1688          6
1672          1               1689          8
1673          1               1690          6
1674          1               1691          3
1675          2               1692          2
1676          2               1693          3
1678          1               1694          3
1679          2               1695          3
1680          4               1696          3
1681          5               1697          2
1682          8               1699          2
1683          4               1702          1
1684          4               1704          2
1685          2               1705          1

• No signal (* = original data)
Tree length   Replicates      Tree length   Replicates
1924          3               1940          6
1926          1               1941          7
1927          4               1942          4
1928          1               1943          2
1929          2               1944          1
1930          8               1945          1
1931          6               1946          1
1932          5               1947          1
1933          4               1950          3
1934          4               1952          1
1935          5               1953          1
1936          1               1955          1
1937          8               1958          1
1938*         11
1939          7

• H0 is easily rejected: PTP identifies only very poor data
• Does not identify where the structure is in the data

Bremer index
• BI = decay index (TreeRot)
• Only for parsimony
• A strong clade should still appear in trees slightly longer than the MPT
• BI = number of extra steps needed to “break” a clade
• For a tree = sum of BI over all clades

• The better a group is supported, the higher its BI
• BI > 0 only for clades occurring in the MPT
• BI is not standardised (≠ bootstrap): interpretation may not be simple
• Generally in agreement with bootstrap

Data combination
• Several datasets (genes, morphology, ...): several trees
• Important issue because of the increasing use of genomes (many genes!) in phylogenetics
• What should we do if they are not congruent?
• Compare trees or combine them via a consensus
• Combine data (total evidence) and build a new tree
• Conditional combination: before combining, test data homogeneity and/or difference between trees

• Consensus

• Combination (total evidence)

Partition homogeneity test
• ILD test (Incongruence Length Difference)
• Principle
  - For the same data, compare tree length (or ML) for observed and random partitions
  - If the difference is not significant, the data are homogeneous: combine
  - If the difference is significant: keep separate trees or discard the taxa generating the conflict

sp1 TACATAAACAAGCCTAAAATGCGACACTACGTTCACTGTTACGCTCTCCACTGCCTAGACGAAGAAGCTTCA
sp2 TACATAAACAAGCCCAAAATGCGACACTACGTCCACTGTTATGCTCTCCACTGCCTAGACGAAGACGCTTCA
sp3 TACATAAACAAGCCCAAAATGCGACACTACGTCCACTGTTACGCTCTTCACTGCCTAGACGAGGATGCCTCG
sp4 TACATAAATAAGCCAAAAATGCGACACTACGTTCATTGTTACGCACTCCATTGCCTCGACGAAGAAGCTTCA
sp5 TACATAAACAAACCCAAAATGCGACACTACGTCCACTGTTATGCTCTCCACTGTCTAGACGAAGACGCTTCG
sp6 TACATAAACAAGCCCAAGATGCGTCACTACGTCCACTGCTACGCCCTCCACTGTCTCGACGAGGAGGCCTCG
sp7 TACATAAACAAACCAAAAATGCGACACTACGTCCATTGTTACGCCCTACACTGCCTAGACGAAGACGCTTCA
sp8 TACATAAACAAACCAAAAATGCGACACTACGTCCATTGTTACGCCCTACACTGCCTAGACGAAGACGCTTCA

The same alignment is shown twice on the slide, split into two partitions in two different ways:
First split: Partition 1: L = 12, Partition 2: L = 9, sum L = 21
Second split: Partition 1: L = 14, Partition 2: L = 11, sum L = 25

Sum of tree lengths   Number of replicates   Sum of tree lengths   Number of replicates
1661                   1                     1672                  10
1662                   2                     1673                   7
1663                   1                     1674                   4
1665*                  9                     1675                   4
1666                   8                     1676                   1
1667                   9                     1677                   4
1668                   5                     1678                   2
1669                  11                     1679                   1
1670                  10                     1680                   1
1671                   9                     1683                   1

* = sum of lengths for original partition
P value = 1 - (87/100) = 0.13

Assessing difference between trees
• Templeton test
  - One of the earliest approaches (Templeton 1983)
  - Comparison of topologies with different lengths: is this difference significantly different from 0?
  - List characters with different lengths
  - Do a Wilcoxon test (signed rank, non parametric)

• Symmetric difference (PAUP)
  - Statistic: number of different partitions between trees (topologies only)
  - Assess the observed statistic against a null distribution generated from random topologies

• ML-based tests
• Kishino-Hasegawa test (1989) (PAUP)
  - Statistic: difference in lnL (likelihood ratio) or length (steps) between trees (around 0 if not significant)
  - Trees must be selected a priori (not best ML tree against suboptimal tree)
  - Null distribution from differences between sites or generated from pseudoreplicates (bootstrapping) because of non-normality

• Test observed difference against null distribution
[Figure: distribution of step/likelihood differences at each site, from sites favouring tree A to sites favouring tree B, showing the mean and the expected value (0).]

• Shimodaira-Hasegawa test (1999) (PAUP)
  - In most cases, trees are selected a posteriori, from the phylogenetic analysis: KH is not appropriate
  - In such cases, the SH test corrects the bias in H0 rejection by the KH test, but follows the same principle
  - Comparison of multiple topologies
• Approximately Unbiased test (Shimodaira, 2002) (Consel)
  - Like KH and SH tests, it is a winning sites test
  - Less conservative than the SH test, because of a better way to generate pseudoreplicates

• Swofford-Olsen-Waddell-Hillis (SOWH) test
  - Uses parametric bootstrapping
  - H0: topology A (hypothetical) is not different from B (observed, e.g. ML tree from the data)
  - Use a statistic assessing differences between A and B: likelihood ratio, number of steps, ...
  - Compute the best model with A and simulate data on this topology (SeqGen)
  - From each simulated dataset, find the likelihood for topology A and compute the ML tree
  - Compute Δ (if LRT) for each pair of trees

  - Do this many times: distribution of the statistic Δ to assess the significance of the observed value
  - If observed Δ > 95 % of the simulated values of Δ, reject H0
  - More power than KH, SH, and AU tests, but depends on the model, which has to be correct
• Bayesian methods: computationally highly demanding (still quite infeasible)

Supertrees

• Combine trees with partially overlapping taxa
• Bigger tree
• Several methods (at least 17)
  - Indirect: matrix constructed from the trees, then analysed with an optimality criterion (e.g. MRP, MRD, MRC, MRF)
  - Direct: combination of topologies in a consensus-like way (e.g. MinCut, Modified MinCut)

Matrix Representation with Parsimony (MRP)
• Most used technique
• Reconstruct a matrix from the trees (RadCon, Rainbow) and analyse it via parsimony (PAUP): can be very long
• Clade coding (nodes), can be weighted (e.g. by bootstrap support in the source trees)
• Classical validation indices can be used

MinCut
• Direct analysis: no optimality criterion
• Fast
• No supertree validation
• Good with compatible source trees

• Uses of supertrees
  - Combining trees from different data/studies
  - Phylogenomics: genes are often unequally present in the taxa under study
  - Metagenomics: taxa partially and unequally represented in sequences
➡ Many gaps in the matrix:
  - Supermatrix (as is)
  - Design several complete sub-matrices, compute subtrees, build supertree

• e.g. Sargasso Sea environmental sequences
[Figure: phylogenetic tree of environmental and reference sequences with support values; tip labels not recoverable.]

• Sometimes too many taxa/data to perform the analysis
• Design well-chosen sub-datasets, analyse them individually, then combine: divide-and-conquer
• Supermatrix (in addition to the above-mentioned problem): many missing data, which increases computing time

• Possible conflicts between supertree and supermatrix

Phylogenomics

• Genomes: more accurate and precise phylogenies? Not so simple...
• Very large dataset: computation difficult
• Genomes are plastic: duplications (total, partial), fusions, chromosome fissions, LGT, ...
• No good model of genomic evolution
• Still difficult: be very careful to control biases

• Stochastic (random) error is reduced only by increasing the number of characters
• The possibility of systematic error still remains, for example caused by a wrong choice of method or model

• 3 main biases
  - Composition bias: sequences with the same composition tend to cluster
    - Check from sequences
  - Long branch attraction
    - Good taxon sampling
  - Heterotachy: change in substitution rate through time at fixed positions
    - Hard to detect and correct

Genomes
• More characters
• New character types: gene order, gene content, nucleotide signature (DNA strings), rare genomic changes
• 2 main approaches
  - Classical: sequences (gene concatenation) and phylogeny (supermatrix or supertree)
  - Whole genome features: gene order, gene content, DNA string
  - + 1: rare genomic changes

Classical methods

• Resolution of difficult phylogenetic problems (e.g. Tree of Life, Eukaryotes, Bilateria)
• Evolution of gene groups: mutations, selective pressure
• Identification of lateral gene transfer

• Example: Tree of Life (Nature, 2005)
  - Purple: identified by genomics
  - Yellow: confirmed by genomics

• Example: Tree of Life (Science, 2006)

• Example: Eukaryote phylogeny

• Example: Classical picture of deuterostome evolution

• Genomic data (Nature, 2006)
  - 146 genes
  - Classical methods: sequences
  - Bias control

Summary
[Flowchart, read from top to bottom:]
• Data: DNA, AA, morphology, ...
• Alignment: software + eye
• Characters or distances
• Data quality: saturation, homogeneity, ...
• Method (depends on data type, taxa number)
  - Characters: Bayesian inference (BI) or ML (model?), MP (weighting? sites, changes)
  - Distances (model?): optimality criterion? Yes: ME...; No: NJ...
• Tree(s)
• Validation: bootstrap, PTP, Bremer, ...

Software
• Plenty!!... often free!
• ...but almost all for molecular data, with various methods (MEGA, SeaView, DAMBE, FastDNAml, PhyML, MrBayes, Phylobayes, Tree-Puzzle, ...)
• For morphological data (and molecular): Phylip (free but not simple), PAUP (the best, but not free), which contains many methods and tests
• Numerous programs to read and edit trees (TreeView, TreeEdit, NJ-Plot, FigTree, TreeDyn, ...)
• For consensus (RadCon, PAUP, Component, ...), for supertrees (RadCon, Rainbow, Clann, SuperTree, ...)
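
Worked example: tail probability for a randomisation test
The PTP and ILD tests above both end with the same arithmetic: count how many replicates give a statistic at least as good as the observed one, and divide by the total number of replicates. The sketch below redoes that calculation for the ILD table (sum of tree lengths, observed value 1665). The function name `tail_probability` and the variable names are illustrative only and do not come from PAUP or any other program. It assumes that a shorter length is "better" and that ties count towards the P value.

```python
from typing import Dict


def tail_probability(replicates: Dict[int, int], observed: int) -> float:
    """Proportion of replicates whose statistic is <= observed.

    Assumption: smaller values are better (tree length in steps),
    and replicates that tie with the observed value are counted.
    """
    total = sum(replicates.values())
    as_good = sum(n for value, n in replicates.items() if value <= observed)
    return as_good / total


# Sum of tree lengths -> number of replicates, copied from the ILD table.
# The class marked * in the table (1665) is the original partition.
ILD_REPLICATES = {
    1661: 1, 1662: 2, 1663: 1, 1665: 9, 1666: 8, 1667: 9, 1668: 5,
    1669: 11, 1670: 10, 1671: 9, 1672: 10, 1673: 7, 1674: 4, 1675: 4,
    1676: 1, 1677: 4, 1678: 2, 1679: 1, 1680: 1, 1683: 1,
}
OBSERVED_SUM = 1665

p_value = tail_probability(ILD_REPLICATES, OBSERVED_SUM)
print(f"P = {p_value:.2f}")  # 13 of 100 replicates are <= 1665
```

This gives the same value as the table's "P value = 1 - (87/100)": 87 replicates are longer than 1665, so 13 of 100 are as short or shorter. Applied to a PTP table, the same function takes tree lengths instead of sums of lengths.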