n - HyPhy
RECOMBINATION. CO-EVOLUTION. HYPHY.
SERGEI L KOSAKOVSKY POND
DIVISIONS OF INFECTIOUS DISEASE AND BIOMEDICAL INFORMATICS, DEPARTMENT OF MEDICINE, UNIVERSITY OF CALIFORNIA SAN DIEGO
spond@ucsd.edu
WORKSHOP ON MOLECULAR EVOLUTION, NORTH AMERICA 2011
Monday, August 1, 2011

http://www.hyphy.org/wiki/HyPhy

I. RECOMBINATION
Affects a large variety of organisms, from viruses to mammals (e.g. gene family evolution)
Manifests itself by incongruent phylogenetic signal
This can be exploited to detect which sequence regions recombined and which sequences were involved

DUAL INFECTION IN HIV-1
AS IF THE FIRST TIME WASN'T BAD ENOUGH?
[Figure: timelines contrasting COINFECTION (HIV strains A+B) with SUPERINFECTION (HIV strain A, followed later by HIV strain B).]

VIRAL RECOMBINATION
Recombination during dual infection allows the virus to rapidly generate escape variants; this can lead to viral rebound and treatment failure.
[Figure: recombination between strains A and B.]
EVALUATION OF THREE RECOMBINATION BREAKPOINT ANALYSIS PROGRAMS, USING HIV-1 AS AN EXAMPLE GENOME. 39.769 PRESENTATION - ALLISON M. LAND

RECOMBINATION: DISCORDANT PHYLOGENETIC SIGNAL
[Figure: genetic distance against sliding window midpoint (bp) for Patient 1, comparing the early and late consensus sequences, together with phylogenies (scale: 10% divergence, 100% support) that place the putative recombinant relative to the original strain and the superinfecting strain.]

DETECTING RECOMBINATION
Number of breakpoints
Location of breakpoints
Sequences involved in recombination
What if 'parental' strains are not in the sample?
Confounding processes:
  Strong rate variation (e.g. unusually conserved fragments)
  Convergent evolution

SCREENING FOR RECOMBINATION
Should be included in the data analysis pipeline (e.g. the PARRIS analysis)
Affects:
  Tree reconstruction
  Evolutionary process inference
  Selection analyses

SINGLE BREAKPOINT METHOD (SBP)
Consider the null model: one phylogeny is adequate to describe relatedness (no recombination).
Next consider a family of alternative models: two independent trees (each with its own branch length parameters), with the breakpoint moving through every variable site.
Compare the (non-nested) models using 3 information criteria (AIC, AIC-c and BIC).
Select the model with the best score. If it is an alternative model, report evidence of recombination.
SERGEI L. KOSAKOVSKY POND, DAVID POSADA, MICHAEL B. GRAVENOR, CHRISTOPHER H. WOELK, AND SIMON D.W. FROST "AUTOMATED PHYLOGENETIC DETECTION OF RECOMBINATION USING A GENETIC ALGORITHM" MBE 23(10):1891-1901

PERFORMS REMARKABLY WELL
POSADA AND CRANDALL (2001) TESTED 14 METHODS ON SIMULATED DATA
[Figure: Fig. 1 of Posada and Crandall (2001). Power (left) and rate of false positives (right) for 14 recombination detection methods against increasing levels of recombination and nucleotide diversity. R(n): 0, 2.83, 11.32, 45.26, 181.05.]

EXAMPLE: SBP
SINGLEBREAKPOINTRECOMB.BF APPLIED TO A TEST ALIGNMENT OF HIV REFERENCE SEQUENCES (SUBTYPES B AND C) AND 2 BC RECOMBINANTS
HTTP://WWW.HIV.LANL.GOV/CONTENT/SEQUENCE/HIV/CRFS/CRFS.HTML
PARTIAL OUTPUT
BREAKPOINT AT POSITION 612. DAIC = 30.79 DAICC = 29.35 DBIC = -179.59
BREAKPOINT AT POSITION 614. DAIC = 30.50 DAICC = 29.07 DBIC = -179.88
BREAKPOINT AT POSITION 618. DAIC = 30.35 DAICC = 28.92 DBIC = -180.03
BREAKPOINT AT POSITION 620. DAIC = 30.67 DAICC = 29.24 DBIC = -179.71
BREAKPOINT AT POSITION 624. DAIC = 30.34 DAICC = 28.90 DBIC = -180.04
BREAKPOINT AT POSITION 626. DAIC = 31.68 DAICC = 30.25 DBIC = -178.70
BREAKPOINT AT POSITION 629. DAIC = 30.01 DAICC = 28.58 DBIC = -180.37
AIC
BEST SUPPORTED BREAKPOINT IS LOCATED AT POSITION 260
AIC = 8019.03 : AN IMPROVEMENT OF 58.6488 AIC POINTS
AIC-C
BEST SUPPORTED BREAKPOINT IS LOCATED AT POSITION 260
AIC = 8020.99 : AN IMPROVEMENT OF 57.2153 AIC POINTS
BIC
THERE SEEMS TO BE NO RECOMBINATION IN THIS ALIGNMENT

[Figure: phylogenies inferred for alignment positions 1-260 and 261-1323. Each tree contains the subtype B and subtype C reference sequences and the two BC recombinants, 07_BC_CN_97_CN54_AX149771 and 08_BC_CN_97_97CNGX_6F_AY008715.]

GARD / GENETIC ALGORITHMS FOR RECOMBINATION DETECTION
For a fixed number of breakpoints B:
  Try placing B breakpoints somewhere in the sequence
  Reconstruct trees for each fragment between breakpoints
  Compute goodness of fit
  Select a model with the best fit (using a GA to move breakpoints around)
Change B and try again
If B>0, verify phylogenetic discordance, compute model averaged breakpoint support.
SERGEI L. KOSAKOVSKY POND, DAVID POSADA, MICHAEL B. GRAVENOR, CHRISTOPHER H. WOELK, AND SIMON D.W. FROST "AUTOMATED PHYLOGENETIC DETECTION OF RECOMBINATION USING A GENETIC ALGORITHM" MBE 23(10):1891-1901

GARD EXAMPLE

SELECTION/RECOMBINATION
Recombination can influence or even mislead selection detection methods.
Using an incorrect tree to analyze a segment of a recombinant alignment can bias dS and dN estimation.
The basic intuition is that an incorrect tree will generally break up identity by descent and hence make it appear as if more substitutions took place than did in reality.

Figure 4.2: The effect of recombination on inferring diversifying selection. Reconstructed evolutionary history of codon 516 of the Cache Valley Fever virus glycoprotein alignment is shown according to GARD inferred segment phylogeny (left) or a single phylogeny inferred from the entire alignment (right). Ignoring the confounding effect of recombination causes the number of nonsynonymous substitutions to be overestimated. A fixed effects likelihood (FEL, Kosakovsky Pond and Frost (2005)) analysis infers codon 516 to be under diversifying selection when recombination is ignored (p = 0.02), but not when it is corrected for using a partitioning approach (p = 0.28).
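The three information criteria that SBP and GARD use to rank breakpoint models can be written out in a few lines. The sketch below is in Python rather than HBL and is only an illustration: the log likelihoods, parameter counts and sample size are hypothetical numbers chosen for the example, not values from the HIV alignment above, and the function name is not part of HyPhy. It shows how AIC and AIC-c can favour a two-tree model while BIC, which penalizes extra parameters more heavily, still prefers the single tree.

```python
import math

def information_criteria(log_l, k, n):
    """Return AIC, AICc and BIC for a model with log likelihood log_l,
    k estimated parameters and n observations (here: alignment sites)."""
    aic = 2 * k - 2 * log_l
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
    bic = k * math.log(n) - 2 * log_l
    return {"AIC": aic, "AICc": aicc, "BIC": bic}

# Hypothetical fits (NOT taken from the slides): a single-tree null model and a
# two-tree alternative that adds 17 branch length parameters.
n_sites = 1323
null_model = information_criteria(log_l=-4030.0, k=20, n=n_sites)
alt_model = information_criteria(log_l=-3990.0, k=37, n=n_sites)

# Positive delta = the alternative (breakpoint) model has the better (lower) score.
delta = {name: null_model[name] - alt_model[name] for name in null_model}

for name, value in delta.items():
    verdict = "breakpoint supported" if value > 0 else "no recombination"
    print(f"delta {name:4s} = {value:8.2f}  ({verdict})")
```

With these made-up inputs the AIC and AIC-c differences are positive and the BIC difference is negative, the same kind of disagreement between criteria seen in the SBP output above.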
[Figure: GARD results for the example alignment. Panel BREAKPOINT LOCATIONS: model averaged support against breakpoint location (0 to 1400). Panel TREE LENGTHS: tree length against breakpoint location (0 to 1400). Below, the phylogenies inferred for the resulting segments.]

ACCOUNTING FOR RECOMBINATION
First screen the alignment to find putative non-recombinant fragments
Apply a model-based test (SLAC, FEL, MEME or REL) using multiple phylogenies (one per fragment), but inferring other parameters (e.g. kappa and base frequencies) from the entire alignment
This is the approach taken by PARRIS (corrected REL), and corrected SLAC, FEL and MEME

UNCORRECTED ERROR RATES / CORRECTED ERROR RATES
[Figure: FIG. 3. False-positive error rates for the FEL test for selected (both positively and negatively) sites. Panel A shows the error rates for the uncorrected (single partition) FEL and panel B for the corrected (2 partitions) FEL. Tabulated error rates are presented for the first 400 codons (evolved under one tree), the last 100 codons, and the joint error rate for all 500 codons, averaged over 100 replicates.]

Table 4. Effect of correcting for recombination when using fixed effects likelihood to detect positively selected sites.

Virus and gene | Positively selected codons, uncorrected FEL | Positively selected codons, corrected FEL
Cache Valley G | 212, 516, 546, 551 | None
Canine Distemper H | 158, 179, 264, 444 | 179, 264, 444, 548
Crimean Congo hemm. fever NP | 195 | 9, 195
Hantaan G2 | None | None
Human Parainfluenza (1) HN | 37, 91, 358, 556 | 91, 358
Influenza A (human H2N2) HA | 87, 166, 252, 358 | 87, 147, 252, 358
Influenza B NA | 42, 106, 345, 436 | 42, 106, 345, 436
Mumps F | 57, 480 | 57, 480
Mumps HN | 399 | None
Newcastle disease F | 1, 4, 5, 7, 16, 18, 108, 516 | 1, 5, 7, 16, 108, 493, 505
Newcastle disease HN | 2, 54, 58, 228, 262, 284, 306, 471 | 2, 58, 228, 262, 284, 306, 471
Newcastle disease N | 425, 430, 466 | 425, 430, 462, 466
Newcastle disease P | 12, 56, 65, 174, 179, 188, 189, 204, 208, 213, 217, 218, 239, 306, 332 | 56, 65, 146, 153, 174, 179, 189, 193, 204, 208, 213, 218, 261, 306, 332
Puumala NP | 79 | None

Test p < 0.1 was used to classify sites as selected. Codon sites found under selection by both methods are shown in bold in the original table.

II. DETECTING CO-EVOLUTION BETWEEN SITES USING BAYESIAN GRAPHICAL MODELS
The fundamental assumption of many computational models, independent evolution of sites, is often violated:
  Compensatory (fitness restoring) mutations, e.g. in HIV to restore replicative fitness following the acquisition of drug resistance
  Complex phenotypes, i.e. those which depend upon many alleles (epistasis)
  Evolution of motifs (e.g. N-linked glycosylation sites)

DETECTING CO-EVOLUTION
Apart from building (very computationally intensive and not yet very tractable) models of co-evolving sites, one can seek sites in an alignment which could be interacting in a post hoc fashion.
Sites which accumulate substitutions along the same branches can be hypothesized to interact, i.e. substitutions at one site increase the probability of a substitution at another site.
Approach: assemble a large collection of homologous sequences and study associations in substitution patterns among sites.

DETECTING INTERACTIONS FROM SEQUENCES
A very straightforward approach:
  Generate an alignment of homologous amino-acid sequences;
  Look for statistical associations between residues at every pair of positions in the alignment.
'N' occurs at site 2 with frequency 74% if there is a 'V' at site 1, but only with frequency 21% if there is an 'I' at site 1.
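A pairwise association test of this kind can be sketched with Fisher's exact test on a 2x2 table of residue counts. The code below is in Python, uses only the standard library, and is an illustration rather than the procedure used in any of the cited studies. The counts (105, 24, 38, 91) are illustrative, and the function name is hypothetical. Note that this is exactly the naive test: it treats sequences as independent samples and ignores phylogeny.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Illustrative counts: rows are the residue at site 1 (V, I),
# columns are the residue at site 2 (N, S).
v_n, v_s, i_n, i_s = 105, 24, 38, 91

freq_n_given_v = v_n / (v_n + v_s)   # frequency of N at site 2 when site 1 is V
freq_n_given_i = i_n / (i_n + i_s)   # frequency of N at site 2 when site 1 is I
p_value = fisher_exact_two_sided(v_n, v_s, i_n, i_s)

print(f"P(N | V) = {freq_n_given_v:.2f}, P(N | I) = {freq_n_given_i:.2f}")
print(f"Fisher's exact test p-value = {p_value:.3g}")
```

A very small p-value from such a test says only that the two residues are associated across the sampled sequences; it does not by itself show that the sites interact, because related sequences share residues by descent.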
[Figure: an ALIGNMENT is summarized as a 2x2 table for PAIRWISE ASSOCIATION TESTS. Rows V and I against columns N and S: V = 105, 24; I = 38, 91.]

DETECTING INTERACTIONS FROM SEQUENCES
EXAMPLES OF EARLY STUDIES OF CORRELATED RESIDUES IN PROTEIN SEQUENCES.
[Figures: HIV-1 V3 LOOP (KORBER 1993); MYOGLOBINS (NEHER 1994).]

THE EFFECTS OF PHYLOGENY
But applying statistical tests directly to sequence variation is plagued with major issues!
A genotype is not a random sample of alleles from a population.
Certain genotypes may be over-represented because they are jointly inherited from the same common ancestor, i.e., identical by descent.

THE EFFECTS OF PHYLOGENY
Suppose there are two continuous characters (X and Y) that are evolving at random along this tree.
If we measure X and Y in the individuals in the present, they will appear to be significantly correlated (co-evolving).
[Figures from The American Naturalist. FIG. 5: A "worst case" phylogeny for 40 species, in which there prove to be 2 groups each of 20 close relatives. FIG. 7: The same data set, with the points distinguished to show the members of the 2 monophyletic taxa. It can immediately be seen that the apparently significant relationship of fig. 6 is illusory.]
IMAGES FROM J FELSENSTEIN (1985). AM NAT 125: 5-6.

THE EFFECTS OF PHYLOGENY
We're trying to make a statement about how X and Y evolve, but most of the action happened in the divergence of group 1 from group 2.
Our correlation is essentially based on two independent data points!
IMAGES FROM J FELSENSTEIN (1985). AM NAT 125: 5-6.

GENETIC INTERACTIONS IN HIV-1 V3
Simulate the evolution of V3 along the fixed tree under the binary-character model with a known set of interactions.
Compare false- and true-positive rates of a pairwise association test (Fisher's exact test) and a binary analog of the evolutionary-network model.
With no interactions, Fisher's exact test finds that 40 out of 511 pairs have significant associations. The evolutionary-network method reduces false positives to about 2 pairs.
[Figure: curves for Evol-Net and Fisher plotted against false positive rate.]
Image from AFY Poon et al. (2007), PLoS Comput Biol 3(1): e11.

BAYESIAN NETWORK INFERENCE
The structure of a Bayesian network is the set of edges connecting nodes to represent a conditional dependence between the corresponding variables.
P(A,B,C,D,E,F) = P(F|E,D) P(D|E) P(E|B,C) P(B|A) P(C|A) P(A)
[Figure: example network with nodes A, B, C, D, E, F; NODE and EDGE are labelled.]

BAYESIAN NETWORK INFERENCE
Advantages of using Bayesian networks:
  A natural graphical representation of complex systems.
  Reduces the number of parameters, making it possible to learn structure from relatively small data sets.
  Models interactions among all variables simultaneously, including higher-order interactions.
[Figure: example network with nodes A, B, C, D, E, F.]

EVOLUTIONARY NETWORKS
Every branch in the tree is converted into a string of 1's and 0's for the presence or absence of a non-synonymous substitution.
Doing this for every site in the alignment yields a binary matrix.
[Figure: a tree with each branch labelled 0 or 1; the branch on which AGT (Ser) changes to AAT (Asn) is labelled 1. The per-branch strings (e.g. 011, 100, ..., 010, 010, 000) form the binary matrix.]

EVOLUTIONARY NETWORKS
Coincident substitution events on the same branches are evidence of interactions between sites.
The distribution of substitution events throughout the tree becomes the target for our analysis.
Replace correlations of residue compositions with correlations of substitution patterns.
[Figure: two sites mapped on the same tree; one changes from AGT (Ser) to AAT (Asn) and the other from CCT (Pro) to TCT (Ser) on the same branch.]

INTERACTING SITES (Pr = 0.9)
[Figure: tree showing the branches on which the two sites have substitutions.]

VARIABLE, BUT NOT INTERACTING (Pr = 0.04)
[Figure: tree showing the branches on which the two sites have substitutions.]

EVOLUTIONARY NETWORKS
[Figure: workflow. ALIGNMENT leads directly to PAIRWISE ASSOCIATION TESTS (e.g. the 2x2 table 105, 24 / 91, 38); alternatively, ALIGNMENT, TREE and MODEL give an EVOLUTIONARY MAP (binary strings such as 000, 010, 010, 100, 000, 011, 011), which is analyzed with BAYESIAN NETWORKS.]

EVOLUTIONARY NETWORKS
Problems with using pairwise association tests:
  significant pairs are never tested in the context of other sites
  missing out on higher-order interactions
  no clear procedure for assembling the "big picture" from a list of significant pairs; difficult to interpret!
Use Bayesian networks.
Image from PB Gilbert et al. (2005) AIDS Res Hum Retrovir 21: 1020.
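The encoding step behind evolutionary networks, turning substitutions mapped onto branches into a binary matrix, can be sketched as follows. This is a Python illustration only: the reconstructed ancestral and descendant amino acids are hypothetical, the names are invented for the example, and the sketch covers just the construction of the branch-by-site matrix and a count of coincident substitutions. It does not perform the Bayesian network inference itself.

```python
# Hypothetical mapped states: for every branch, the (parent, child) amino acid
# at each site. Real input would come from ancestral state reconstruction.
branches = [
    {"site1": ("S", "N"), "site2": ("P", "S")},   # both sites change
    {"site1": ("S", "S"), "site2": ("P", "P")},   # no change
    {"site1": ("N", "N"), "site2": ("S", "P")},   # only site2 changes
    {"site1": ("S", "N"), "site2": ("P", "S")},   # both sites change
    {"site1": ("N", "N"), "site2": ("S", "S")},   # no change
]
sites = ["site1", "site2"]

def substitution_matrix(branches, sites):
    """One row per branch, one column per site: 1 if the amino acid differs
    between the two ends of the branch (non-synonymous change), else 0."""
    return [[int(branch[site][0] != branch[site][1]) for site in sites]
            for branch in branches]

def coincident_substitutions(matrix, i, j):
    """Number of branches on which both site i and site j changed."""
    return sum(1 for row in matrix if row[i] and row[j])

matrix = substitution_matrix(branches, sites)
shared = coincident_substitutions(matrix, 0, 1)

for row in matrix:
    print("".join(str(cell) for cell in row))
print("branches with substitutions at both sites:", shared)
```

The columns of such a matrix are the variables whose dependencies a Bayesian network is then asked to explain.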
GENETIC INTERACTIONS IN HIV-1 V3
Obtained 1,154 full-length HIV-1 env sequences from the Los Alamos National Laboratory HIV Sequence Database (excluding recombinants, mostly subtypes B and C).
Reconstructed phylogeny from nucleotide sequences excluding all variable domains (V1/V2, V3, V4, and V5).
Minimize influence of alignment uncertainty and convergent evolution.
[Figure: the reconstructed phylogeny (scale 0.1, 0.2, 0.263235).]
AFY Poon et al. (2007), PLoS Comput Biol 3(1): e11.

GENETIC INTERACTIONS IN HIV-1 V3
Network edges correspond to well-characterized interactions:
  N-linked glycosylation motif at sites N5 and T7.
  '11-25' rule of co-receptor usage.
[Figure: inferred network over V3 residues (T1, N4, N5, N6, T7, R8, S10, I11, R12, I13, P15, Q17, A18, F19, Y20, G23, D24, I25, I26, D28, R30, Q31, H33) with edge support values; the N-linked glycosylation and 11-25 rule edges are annotated.]
Image from AFY Poon et al. (2007), PLoS Comput Biol 3(1): e11.

All of the analyses presented during the previous two lectures have been powered by the HyPhy package.
What is HyPhy?

III. HYPHY: [HY]POTHESIS TESTING USING [PHY]LOGENIES
A SCRIPTABLE SEQUENCE ANALYSIS PLATFORM
HTTP://WWW.HYPHY.ORG/

HI-FI NOT HIGH-FEE
THE OTHER HYPHY (WE STARTED IN 1997)
HYPHY (PRONOUNCED HIGH-FEE; IPA: [ˈhaɪfiː]) IS A STYLE OF MUSIC AND DANCE ASSOCIATED WITH SAN FRANCISCO BAY AREA HIP HOP CULTURE. IT BEGAN TO EMERGE IN EARLY 2000 AS A RESPONSE FROM BAY AREA RAPPERS AGAINST COMMERCIAL HIP HOP FOR NOT ACKNOWLEDGING THE BAY FOR SETTING TRENDS IN THE HIP HOP INDUSTRY. ALTHOUGH THE "HYPHY MOVEMENT" HAS JUST RECENTLY SEEN LIGHT IN MAINSTREAM AMERICA, IT HAS BEEN A LONG STANDING AND EVOLVING CULTURE IN THE BAY AREA. THE TERM IS A COMBINATION OF THE WORDS "HYPE" AND "FLY". IT IS DISTINGUISHED BY GRITTY, POUNDING RHYTHMS, AND IN THIS SENSE CAN BE ASSOCIATED WITH THE BAY AS CRUNK MUSIC IS TO THE SOUTH. AN INDIVIDUAL IS SAID TO "GET HYPHY" WHEN THEY ACT OR DANCE IN AN OVERSTATED AND RIDICULOUS MANNER. MANY IN THE BAY AREA WOULD DESCRIBE THIS AS ACTING "RETARDED", "RIDING THE YELLOW BUS" OR "GOING DUMB".
HTTP://EN.WIKIPEDIA.ORG/WIKI/

HYPHY
[Figure: architecture of the package. HyPhy GUI: data view and exploration, tree viewer/editor, charting, substitution model editor, model parameters/hypothesis setup. Scripts: HyPhy batch language. Computational backend (MP, MPI). Text based I/O for pipelines and terminal execution.]

THE HYPHY PACKAGE
12 years in development together with Spencer Muse, Wayne Delport and Art Poon
Open source, runs on all major platforms natively, supports multiprocessor (OpenMP), distributed (MPI) and gpGPU (OpenCL, next release) systems
Disclaimer: the developers all use Macs, so the Mac OS version is generally the most polished/stable
~7000 users and 500 citations
Has ~100 prepackaged analyses
Fully featured graphical interface for Mac, Windows and X11 (based on GTK)

WHAT DOES HYPHY DO WELL?
Inference about the evolutionary process, e.g.
  Selection
  Recombination
  Model selection
  Evolutionary rate tests (relative rate/ratio)
HyPhy can infer phylogenies using a variety of algorithms, but this is not its primary function. There exist much faster/more comprehensive packages for inference (e.g. PAUP*, Garli, PhyML). However, HyPhy can accommodate the widest range of evolutionary models in inference.
HyPhy does not really do sequence alignment, even though we use it for some customized low divergence sequence alignments and NGS processing.

HYPHY DESIGN PHILOSOPHY: TO EACH THEIR OWN
Prepackaged point and click analyses
  Graphical user interface (Mac OS, Windows, X11)
  An easy mechanism to design own analyses graphically
  Visualization and result processing
Flexible scripting language (HBL) for writing complex custom analyses
  Unparalleled flexibility
  Pipeline integration, web services (e.g. datamonkey.org, GALAXY http://g2.trac.bx.psu.edu, MEGA 5)

NOT JUST PHYLOGENETIC LIKELIHOOD
Because HyPhy is the primary platform for our research, it has grown to include a number of new features.
  Hidden Markov Models: spatial correlation in rates, phyloHMM
  Bayesian Graphical Models: epistatic interactions, phenotypes
  Stochastic Context Free Grammars: probabilistic models for structured data (RNA secondary structure, tree shape)
  Genetic algorithms: complex feature selection
  454 data analysis: specialized error correction and read mapping routines for rapidly evolving pathogens (HIV, HCV)

HYPHY DOCUMENTATION
Is not nearly as good as it should be!
Several book chapters:
  Package description and tutorial: http://www.hyphy.org/docs/HyPhyDocs.pdf
  Selection analyses: http://www.hyphy.org/pubs/hyphybook2007.pdf
  Recombination, epistasis and directional evolution: http://www.hyphy.org/pubs/methods2011.pdf
Wiki (just started, work in progress): http://www.hyphy.org/wiki/Main_Page
Message boards (the best place for specific questions): http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl

WHY FIT MODELS?
Models describe our mechanistic understanding of the evolutionary process, e.g. the dichotomy between synonymous and non-synonymous substitutions.
Using formal models, we can estimate biologically relevant parameters such as branch lengths and divergence times, substitution rates, measures of selection.
Each model can be assigned a goodness of fit (usually derived from its log L score and complexity).
Models can be compared to decide which parameters are important, or to test what values biological quantities may take (hypothesis testing).
Sequence data can be 'mined' for pattern discovery using collections of substitution models.

A VERY BASIC EXAMPLE
Given a nucleotide alignment and a tree, fit a simple substitution model and interpret its output.
Data file: p51.nex (located in data/ within the HyPhy distribution folder).
Three alternative ways to do it:
  Standard analysis
  Graphical user interface
  Directly in the batch language

STANDARD ANALYSES

______________READ THE FOLLOWING DATA______________
8 species: {B_FR_83_HXB2_ACC_K03455,B_US_83_RF_ACC_M17451,B_US_86_JRFL_ACC_U63632,B_US_90_WEAU160_ACC_U21135,D_CD_83_ELI_ACC_K03454,D_CD_83_NDK_ACC_M27323,D_CD_84_84ZR085_ACC_U88822,D_UG_94_94UG114_ACC_U88824};
Total Sites:1320;
Distinct Sites:118
A tree was found in the data file:
((((D_CD_83_ELI_ACC_K03454,D_CD_83_NDK_ACC_M27323),D_UG_94_94UG114_ACC_U88824),D_CD_84_84ZR085_ACC_U88822),B_US_83_RF_ACC_M17451,((B_FR_83_HXB2_ACC_K03455,B_US_86_JRFL_ACC_U63632),B_US_90_WEAU160_ACC_U21135))
Would you like to use it:(Y/N)?y
______________RESULTS______________
Time taken = 0.23 seconds
AIC Score = 6682.51
Log Likelihood = -3327.25252976199;
Shared Parameters:
R=0.111274
Tree givenTree=(B_US_83_RF_ACC_M17451:0.0262012,((B_FR_83_HXB2_ACC_K03455:0.0116675,B_US_86_JRFL_ACC_U63632:0.0178118)Node4:0.0022271,B_US_90_WEAU160_ACC_U21135:0.0209889)Node3:0.00507481,(((D_CD_83_ELI_ACC_K03454:0.0188158,D_CD_83_NDK_ACC_M27323:0.0100127)Node10:0.0105018,D_UG_94_94UG114_ACC_U88824:0.053071)Node9:0.00391151,D_CD_84_84ZR085_ACC_U88822:0.0283077)Node8:0.0234111);

IN THE GUI
[Figure: HyPhy GUI windows showing the PARAMETERS, the fitted TREE for the 8 sequences (scale 0 to 0.06869) and the MODEL.]

IN HBL
    DataSet nucleotideSequences = ReadDataFile ("data/p51.nex");
    DataSetFilter filteredData = CreateFilter (nucleotideSequences,1);
    HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);
    global R = 1;
    HKY85RateMatrix =
        {{*,trvs,R*trvs,trvs}
         {trvs,*,trvs,R*trvs}
         {R*trvs,trvs,*,trvs}
         {trvs,R*trvs,trvs,*}};
    Model HKY85 = (HKY85RateMatrix, observedFreqs);
    Tree givenTree = DATAFILE_TREE;
    LikelihoodFunction theLnLik = (filteredData, givenTree);
    Optimize (paramValues, theLnLik);
    fprintf (stdout, theLnLik);

HBL OBJECTS
Data sets (alignments)
Data filters (partitioned alignments)
Models (matrices of parameters, constraints)
Trees (topologies+models+parameter values)
Likelihood functions (datafilters+trees+models+structure)
Stochastic Context Free Grammars
SQL databases (via SQLite)
Strings, numbers, matrices, associative arrays, expressions (formulas), regexps, 100+ built in functions ...

TESTING HYPOTHESES
Model (+constraints on parameters) = hypothesis
To test hypotheses, we fit several models of varying complexity to the data and compare their goodness-of-fit.
The model that fits the data significantly better than all others is chosen as the best explanation of how the data have arisen.
The simplest case has two models:
  Null (simple)
  Alternative (more complex)

RELATIVE RATE TESTS
[Figure label: MRCA of all 3 (can't estimate location with reversible models).]
For many models, we can't decouple substitution rates and evolutionary times; they are confounded in the 'expected substitutions/site' measure.
However, sometimes it is possible to factor out the time to directly compare evolutionary rates.
One of the first tests to do that was the relative rate test: using an outgroup to 'polarize' substitutions. It is then possible to directly compare branch lengths and make statements about evolutionary rates.
[Figure: three-taxon tree with an OUTGROUP and ingroup taxa A and B, marking the MRCA of A and B. L(A) = T * rate(A); L(B) = T * rate(B).]
If L(A) != L(B), this means that rate(A) != rate(B)
TESTING FOR EQUALITY OF EVOLUTIONARY RATES. S. V. MUSE AND B. S. WEIR. Genetics, (132), 269-276

FIRST, THE OUTPUT
0). UNCONSTRAINED MODEL:
LOG LIKELIHOOD = -944.001000454328;
TREE THREETAXATREE=(A:0.0850301,B:0.0733855,C:0.165368);
1). WITH THE OUTGROUP AT TAXON #1, THE P-VALUE IS:0.00215511
LOG LIKELIHOOD = -948.707253306001;
TREE THREETAXATREE=(A:0.0823004,B:0.119437,C:0.119437);
2). WITH THE OUTGROUP AT TAXON #2, THE P-VALUE IS:0.0083535
LOG LIKELIHOOD = -947.479039420142;
TREE THREETAXATREE=(A:0.12514,B:0.0712726,C:0.12514);
3). WITH THE OUTGROUP AT TAXON #3, THE P-VALUE IS:0.666914
LOG LIKELIHOOD = -944.093617010496;
TREE THREETAXATREE=(A:0.0792154,B:0.0792154,C:0.16536);

LIKELIHOOD RATIO TESTING
The alternative model (unconstrained rates) yielded log L = -944.00, while the simpler null model (two rates equal, outgroup at A) returned log L = -948.71.
Lower likelihood = worse fit
However, because the alternative model has one more parameter than the null, it should always beat (or at least match) the score of the null, even if the latter were the correct model.
Is the improvement in fit (twice the log likelihood ratio, 2 log LR = 9.4) large enough to be significant?
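The statistic in question can be checked by hand from the log likelihoods printed above. The sketch below is in Python rather than HBL; it recomputes twice the log-likelihood difference for each of the three constrained fits and converts it to a chi-squared tail probability with 1 degree of freedom, using the identity P(X >= x) = erfc(sqrt(x/2)) that holds for that case. The function and variable names are illustrative and are not part of HyPhy.

```python
import math

def lrt_one_df(lnl_alternative, lnl_null):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value from the
    chi-squared distribution with 1 degree of freedom."""
    statistic = max(0.0, 2.0 * (lnl_alternative - lnl_null))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Log likelihoods copied from the program output on the slide.
lnl_unconstrained = -944.001000454328
lnl_constrained = {
    "outgroup at taxon #1": -948.707253306001,
    "outgroup at taxon #2": -947.479039420142,
    "outgroup at taxon #3": -944.093617010496,
}

results = {label: lrt_one_df(lnl_unconstrained, lnl)
           for label, lnl in lnl_constrained.items()}

for label, (statistic, p_value) in results.items():
    print(f"{label}: 2 log LR = {statistic:.3f}, p = {p_value:.6f}")
```

The p-values obtained this way can be compared with the ones in the output listing; only the third constraint (A and B evolving at the same rate, with C as the outgroup) is not rejected.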
Perform a Likelihood Ratio Test (LRT), comparing the statistic 2*log(LR) with the tail of the chi-squared distribution with as many degrees of freedom as there are additional parameters in the alternative model.
In this case, the p-value (the probability that LR >= observed value if the null model is correct) is effectively 0, hence there is very strong evidence that transitions happen at higher rates than transversions in our data set.

THE HBL FILE. READ THE DATA
    /* 1. Read in the data and store the result in a DataSet variable. */
    DataSet nucleotideSequences = ReadDataFile ("data/3.seq");

    /* 2. Filter the data, specifying that all of the data is to be used
       and that it is to be treated as nucleotides. */
    DataSetFilter filteredData = CreateFilter (nucleotideSequences,1);

    /* 3. Collect observed nucleotide frequencies from the filtered data.
       observedFreqs will store the vector of frequencies. */
    HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);

DEFINE THE MODEL / TREE
    /* 4. Define the F81 substitution matrix. '*' is defined to be
       -(sum of off-diag row elements); the elements of the matrix reflect
       substitution rates from one nucleotide to another. The nucleotides are
       ordered alphabetically -- a,c,g,t, thus, for example the entry at
       row 3, column 4 (mu) supplies the rate from G to T */
    F81RateMatrix = {{*,mu,mu,mu}
                     {mu,*,mu,mu}
                     {mu,mu,*,mu}
                     {mu,mu,mu,*}};

    /* 5. Define the F81 model, by combining the substitution matrix with the
       vector of observed (equilibrium) frequencies; this is done by default
       to avoid having to multiply the rate matrix by pi_j terms. */
    Model F81 = (F81RateMatrix, observedFreqs);

    /* 6. Now we can define the simple three taxa tree.
       Since there is only 1 three sequence tree, we turn on
       ALLOW_SEQUENCE_MISMATCH to tell hyphy to map the first
       sequence in the data to leaf 'a', the 2nd - to leaf 'b'
       and the third - leaf 'c'. */
    ALLOW_SEQUENCE_MISMATCH = 1;
    Tree threeTaxaTree = (a,b,c);

CONSTRUCT AND OPTIMIZE LF
    /* 7. Since all the likelihood function ingredients (data, tree,
       equilibrium frequencies) have been defined we are ready to construct
       the likelihood function. */
    LikelihoodFunction theLnLik = (filteredData, threeTaxaTree);

    /* 8. Maximize the likelihood function, storing parameter values in the
       matrix paramValues. We also store the resulting ln-lik. */
    Optimize (paramValues, theLnLik);
    unconstrainedLnLik = paramValues[1][0];

    /* 9. Print the tree with optimal branch lengths to the console. */
    fprintf (stdout, "\n0).UNCONSTRAINED MODEL:", theLnLik);

PERFORM THE RR TEST
    /* 10. We now constrain the rate of evolution to be equal along the
       branches leading to c and b and repeat the optimization. */
    threeTaxaTree.b.mu := threeTaxaTree.c.mu;
    Optimize (paramValues, theLnLik);

    /* 11. Now we compute the ln-lik ratio statistic and the P-Value, using
       the Chi^2 dist'n with 1 degree of freedom. */
    lnlikDelta = 2*(unconstrainedLnLik-paramValues[1][0]);
    pValue = 1-CChi2 (lnlikDelta, 1);
    fprintf (stdout, "\n\n1). With the outgroup at taxon #1, the P-value is:", pValue, "\n", theLnLik);

REPEAT WITH THE OTHER 2 TAXA AS OUTGROUP

MODEL EXAMPLES
HyPhy permits three basic model parameter types; mixing them in a single analysis permits unique flexibility.
LOCAL: Attached to an individual branch in a tree, e.g.
  branch length
  branch dN/dS ratio
  branch rate
GLOBAL: Shared by many branches in a single tree (or multiple trees), e.g.
  base frequencies
  transition/transversion ratios
  hyperparameters (shape of the gamma distribution, mixing coefficients)
CATEGORY: Variables which are integrated out, e.g.
  the distribution of substitution rates across sites
  a model-mixture distribution

MODELS WITH LOCAL / GLOBAL PARAMETERS
The most common example: lineage specific selection.
Consider HIV sequences from two epidemiologically linked patients: the source and the recipient.
Biologically, the virus experiences at least THREE different selective environments, hence there is reason to believe that selection (measured by dN/dS) will vary across lineages.
[Figure label: TRANSMISSION]
FROST SDW ET AL JOURNAL OF VIROLOGY, MAY 2005, P. 6523–6527

THE SETUP IN HYPHY
Define the most general model first, i.e.
the model where each branch has its own dN/dS ratio, or to be more precise, a pair of rates (synonymous and non-synonymous rates).
Fit this model as the alternative.
Constrain the parameters (dN/dS ratios) to be the same within groups of branches to define nulls.
Fit the null model(s) and perform LRT.

CONSTRAINTS IN HYPHY
    X = 1;    // ASSIGN VALUE
    Y := X;   // CONSTRAIN Y TO BE ALWAYS EQUAL TO X
    X :< 5;   // LIMIT THE RANGE OF X
    GLOBAL OMEGA_SOURCE=1;   // A GLOBAL PARAMETER
    TREEID.BRANCHID.PARAMETERNAME;   // THE NAME OF A TYPICAL MODEL PARAMETER IN HYPHY
    REPLICATECONSTRAINT("THIS1.?.NONSYNRATE:=OMEGA_SOURCE*THIS2.?.SYNRATE",HIVTRANSMISSION_TREE,HIVTRANSMISSION_TREE);
    // A CONVENIENCE FUNCTION TO APPLY THE SAME CONSTRAINT TO ALL BRANCHES IN A TREE

MIXED DATA
Example: Test for equality of the transition/transversion ratio in an exon-intron-exon data set between exons and the intron.

READ / FILTER THE DATA
    LoadFunctionLibrary ("chooseGeneticCode");
    DataSet       sequences    = ReadDataFile ("intronexon.nex");
    DataSetFilter intron       = CreateFilter (sequences,1,"0-88,275-551");
    filterString               = "89-139,146-151,179-184,199-274";
    // GeneticCodeExclusions = stop codons which are not valid states (TAA, TAG, TGA)
    DataSetFilter exon         = CreateFilter (sequences,3,filterString,"",GeneticCodeExclusions);
    HarvestFrequencies (exonBaseFreqs, exon, 3, 1, 1);       // 4x3 matrix
    HarvestFrequencies (intronBaseFreqs, intron, 1, 1, 1);   // 4x1 matrix

DEFINE AND CONSTRAIN THE MODELS
    // nucleotide model; explicit definition
    global kappaInv = 1.0;
    HKY85RateMatrix = {{*,t*kappaInv,t,t*kappaInv}
                       {t*kappaInv,*,t*kappaInv,t}
                       {t,t*kappaInv,*,t*kappaInv}
                       {t*kappaInv,t,t*kappaInv,*}};
    Model HKY85 = (HKY85RateMatrix, intronBaseFreqs);

    // codon model; read from a 'standard' file
    ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" +
                  DIRECTORY_SEPARATOR + "2RatesAnalyses" +
                  DIRECTORY_SEPARATOR + "MG94xREV.mdl");
    PopulateModelMatrix ("MGRateMatrix",exonBaseFreqs,0);
    codonFreqs = BuildMGCodonFrequencies (exonBaseFreqs);
    Model MG94 = (MGRateMatrix, codonFreqs, 0);
    AT:=AC; CG:=1; CT:=AC; GT:=AC;

TREES / LIKELIHOOD FUNCTION
    treeString = "(((HKL5:0,HKL8:0.0108295)Node2:0.0143961,PHA:0.0761833)Node1:0.076599,PTR7:0.00325449,HSA1:0.0140364)";
    UseModel (MG94);
    Tree exonTree = treeString;
    UseModel (HKY85);
    Tree intronTree = treeString;
    fprintf (stdout, "[1. FITTING SEPARATE TV/TS RATIOS TO EXON/INTRON]\n");
    LikelihoodFunction theLnLik = (exon, exonTree, intron, intronTree);
    Optimize (paramValues, theLnLik);
    fprintf (stdout, theLnLik);

Output:
    [1. FITTING SEPARATE TV/TS RATIOS TO EXON/INTRON]
    Log Likelihood = -1043.89674669368;
    Shared Parameters:
    AC=0.961309
    R=3.16948
    kappaInv=0.30696
    AT=AC=0.961309
    CG=1=1
    CT=AC=0.961309
    GT=AC=0.961309
    Tree exonTree=(((HKL5:0,HKL8:0)Node2:0.0522098,PHA:0.0465591)Node1:0.223343,PTR7:0,HSA1:0.0390297);
    Tree intronTree=(((HKL5:0,HKL8:0.0150867)Node2:0.00512057,PHA:0.0829941)Node1:0.0280879,PTR7:0.00298343,HSA1:0.00587939);

CONSTRAIN AND FIT THE NULL MODEL
    fprintf (stdout, "\n[2. FITTING SINGLE TV/TS RATIO TO EXON/INTRON] \n");
    kappaInv := AC;
    Optimize (paramValues2, theLnLik);
    fprintf (stdout, theLnLik);

Output:
    [2. FITTING SINGLE TV/TS RATIO TO EXON/INTRON]
    Log Likelihood = -1046.64481338653;
    Shared Parameters:
    AC=0.486803
    R=2.91845
    AT=AC=0.486803
    CG=1=1
    CT=AC=0.486803
    GT=AC=0.486803
    kappaInv=AC=0.486803
    Tree exonTree=(((HKL5:0,HKL8:0)Node2:0.0532969,PHA:0.0475588)Node1:0.228066,PTR7:0,HSA1:0.0390151);
    Tree intronTree=(((HKL5:0,HKL8:0.0149572)Node2:0.00517383,PHA:0.0818904)Node1:0.0277913,PTR7:0.00292384,HSA1:0.00590144);

PERFORM THE LRT
    fprintf (stdout, "\n[3. CONDUCTING THE LRT FOR TV/TS EQUALITY]\n");
    LR = 2*(paramValues[1][0]-paramValues2[1][0]);
    DF = (paramValues[1][1]-paramValues2[1][1]);
    pV = 1-CChi2(LR,DF);
    fprintf (stdout, "LR Statistic = ", Format (LR,8,2),
                     "Constraints = ", Format (DF,8,0),
                     "p-value = ", Format (pV,8,4), "\n");

Output:
    [3. CONDUCTING THE LRT FOR TV/TS EQUALITY]
    LR Statistic = 5.50 Constraints = 1 p-value = 0.0191

SIMULATION
Simulation is an invaluable tool of statistical modeling.
HyPhy permits the user to simulate data from any likelihood function object (parametrically) with a single command.
Non-parametric bootstrap (sample with replacement) is equally easy.
    // parametric
    DataSet simulatedDataSet = SimulateDataSet (likelihoodFunction);
    // non-parametric, treat data as nucleotides
    DataSetFilter simulatedDataFilter = Bootstrap (filteredData,1);

AUTOMATING ANALYSES
HyPhy provides a mechanism to write 'wrapper' files to call other HBL files with predefined parameters.
This feature makes it possible
  to carry out the same analysis over a large collection of files
  to perform a multistep analysis (pipeline, e.g.
  as in Datamonkey.org)
    inputRedirect = {};
    inputRedirect["01"]="Universal";
    inputRedirect["02"]="/Users/sergei/Desktop/MyFiles/somealignment.nex";
    inputRedirect["03"]="MG94CUSTOM";
    inputRedirect["04"]="Local";
    inputRedirect["05"]="012232";
    inputRedirect["06"]="y";
    LoadFunctionLibrary ("AnalyzeCodonData.bf", inputRedirect);

PYTHON AND R BINDING
HyPhy can be compiled as a Python or R module and called directly from those languages.
    # import the HyPhy library
    # and standard OS utilities
    import os, HyPhy

    # first, create a HyPhy interface instance (class _THyPhy)
    # the first argument defines the root directory for HyPhy
    # and the second - how many threads the computational core
    # should spawn
    hyphyInstance = HyPhy._THyPhy (os.getcwd(),2)

    # the basic interface command is 'ExecuteBF' which
    # executes HyPhy batch language commands in HyPhy
    # and returns a string representation of the return value
    # (if any) from HYPHY
    # The returned object is of type _THyPhyString with
    # sData and sLength fields
    # HyPhy will take care of disposing of the memory needed
    # to store the result
    hyphyResult = hyphyInstance.ExecuteBF ("return 2+2;");
    print "Testing a trivial HyPhy command. 2+2 = ", hyphyResult.sData

    hyphyResult = hyphyInstance.ExecuteBF ("ExecuteAFile(\"../HBL/HKY85.bf\")");
    print "Log-L = ", retrieveValueByKey ("LogL", HyPhy.THYPHY_TYPE_NUMBER, hyphyInstance).nValue;
    print "kappa = ", retrieveValueByKey ("kappa", HyPhy.THYPHY_TYPE_NUMBER, hyphyInstance).nValue;
    print "tree string = ", retrieveValueByKey ("Tree", HyPhy.THYPHY_TYPE_STRING, hyphyInstance).sData;
    bl = retrieveValueByKey ("Branch lengths", HyPhy.THYPHY_TYPE_MATRIX, hyphyInstance);
    print "retrieved ", bl.mCols-1, "branch lengths"
    for i in range(0,bl.mCols-1):
        print "Branch ", i+1, " has length ", bl.MatrixCell(0,i)

Output:
    Log-L = -1165.73404336
    kappa = 2.90308047463
    tree string = (((317:0.0369387,6767:0.0674453)Node2:0.0116625,((135:0.019967,(529:0.0330648,105r:0.0129175)Node8:0.0257932)Node6:0.0057794,(719:0.0290622,136:0.00806843)Node11:0.0470372)Node5:0.0153195)Node1:0.0137487,6760:0.068227,((113:0.0213909,9939:0.0737467)Node16:0.00489231,(256:0.011979,(822:0.0288374,159:0.00462985)Node21:0.00351714)Node19:0.0211093)Node15:0.0229936)
    retrieved 23 branch lengths
    Branch 1 has length 0.0369386905354
    Branch 2 has length 0.0674453438678
    Branch 3 has length 0.0116625398497
    Branch 4 has length 0.0199669752233
    Branch 5 has length 0.0330648421312
    Branch 6 has length 0.0129175106238
    Branch 7 has length 0.0257932085606
    Branch 8 has length 0.00577940003498
    Branch 9 has length 0.0290622269958
    Branch 10 has length 0.00806843337064
    Branch 11 has length 0.0470371687369
    Branch 12 has length 0.0153195087307
    Branch 13 has length 0.0137487198541
    Branch 14 has length 0.0682270304125
    Branch 15 has length 0.0213909255649
    Branch 16 has length 0.0737467133295
    Branch 17 has length 0.00489231290247
    Branch 18 has length 0.0119789870549
    ....
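The tree string in the output above is in Newick format, so the branch lengths can also be read straight from it. The following is a minimal sketch in plain Python (standard library only, not part of the HyPhy bindings); the helper name `branch_lengths` is made up for this illustration, and the tree string is copied from the output above with the line-wrap spaces removed.

```python
import re

# Tree string copied from the output above (line-wrap spaces removed).
tree_string = (
    "(((317:0.0369387,6767:0.0674453)Node2:0.0116625,"
    "((135:0.019967,(529:0.0330648,105r:0.0129175)Node8:0.0257932)Node6:0.0057794,"
    "(719:0.0290622,136:0.00806843)Node11:0.0470372)Node5:0.0153195)Node1:0.0137487,"
    "6760:0.068227,"
    "((113:0.0213909,9939:0.0737467)Node16:0.00489231,"
    "(256:0.011979,(822:0.0288374,159:0.00462985)Node21:0.00351714)Node19:0.0211093)"
    "Node15:0.0229936)"
)

def branch_lengths(newick):
    """Return every ':<number>' branch length in the order it appears in the string."""
    pattern = r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
    return [float(value) for value in re.findall(pattern, newick)]

lengths = branch_lengths(tree_string)
print("retrieved", len(lengths), "branch lengths")
print("total tree length =", sum(lengths))
```

Because the lengths are taken in string order, the first entries line up with Branch 1, Branch 2, ... in the listing above, at the reduced precision printed in the tree string.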