RECOMBINATION.
CO-EVOLUTION.
HYPHY.
SERGEI L KOSAKOVSKY POND
DIVISIONS OF INFECTIOUS DISEASE
AND BIOMEDICAL INFORMATICS
DEPARTMENT OF MEDICINE
UNIVERSITY OF CALIFORNIA SAN DIEGO
[email protected]
WORKSHOP ON MOLECULAR EVOLUTION, NORTH AMERICA 2011
Monday, August 1, 11
SERGEI L KOSAKOVSKY POND [[email protected]]
http://www.hyphy.org/wiki/HyPhy
I. RECOMBINATION
Affects a large variety of organisms, from viruses to
mammals (e.g. gene family evolution)
Manifests itself by incongruent phylogenetic signal
This can be exploited to detect which sequence
regions recombined and which sequences were
involved
DUAL INFECTION IN HIV-1
AS IF THE FIRST TIME WASN’T BAD ENOUGH?
[Figure: timelines for COINFECTION (HIV strains A+B) and SUPERINFECTION (HIV strain A, HIV strain B), drawn against TIME.]
VIRAL RECOMBINATION
Recombination during dual infection allows the virus to rapidly generate
escape variants; this can lead to viral rebound and treatment failure.
EVALUATION OF THREE RECOMBINATION BREAKPOINT ANALYSIS PROGRAMS, USING HIV-1 AS AN EXAMPLE GENOME. 39.769 PRESENTATION - ALLISON M. LAND
RECOMBINATION: DISCORDANT PHYLOGENETIC SIGNAL
[Figure: Patient 1. Genetic distance (0 to 0.3) against sliding window midpoint (bp, 0 to 2500), with series labelled Consensus Late and Consensus Early and putative recombinant regions marked. Accompanying trees (labels: 10% divergence, 100%) separate the original strain from the superinfecting strain.]
DETECTING RECOMBINATION
Number of breakpoints
Location of breakpoints
Sequences involved in recombination
What if ‘parental’ strains are not in the sample?
Confounding processes:
Strong rate variation (e.g. unusually conserved fragments)
Convergent evolution
SCREENING FOR RECOMBINATION
Should be included in the data analysis pipeline (e.g. the PARRIS
analysis)
Affects
Tree reconstruction
Evolutionary process inference
Selection analyses
SINGLE BREAKPOINT METHOD (SBP)
Consider the null model: a single phylogeny is adequate to describe
relatedness (no recombination).
Next consider a family of alternative models: two independent trees
(each with its own branch length parameters), with the breakpoint moved
through every variable site.
Compare the (non-nested) models using three information criteria (AIC,
AICc and BIC). Select the model with the best score. If it is an
alternative model, report evidence of recombination.
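The three criteria can be written out explicitly. This is a minimal Python sketch, not HyPhy's implementation: the log-likelihoods and parameter counts are hypothetical values chosen for illustration, and using the number of alignment columns as the sample size is an assumption.

```python
import math

def aic(log_l, k):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * log_l

def aicc(log_l, k, n):
    """Small-sample corrected AIC; n is the sample size (assumed: number of sites)."""
    return aic(log_l, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(log_l, k, n):
    """Bayesian information criterion: k ln(n) - 2 log L."""
    return k * math.log(n) - 2 * log_l

# Hypothetical fits: one tree (null) versus two trees split at a breakpoint.
n_sites = 1323
null_fit = {"log_l": -4020.0, "k": 18}
alt_fit = {"log_l": -3975.0, "k": 35}

# Positive differences mean the two-tree model scores better.
delta_aic = aic(**null_fit) - aic(**alt_fit)
delta_aicc = aicc(n=n_sites, **null_fit) - aicc(n=n_sites, **alt_fit)
delta_bic = bic(n=n_sites, **null_fit) - bic(n=n_sites, **alt_fit)
print(f"dAIC = {delta_aic:.2f}, dAICc = {delta_aicc:.2f}, dBIC = {delta_bic:.2f}")
```

With these made-up numbers AIC and AICc prefer the two-tree model while BIC, which penalizes each extra parameter by ln(n) rather than 2, prefers the single tree. The criteria can therefore disagree on the same data.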
SERGEI L. KOSAKOVSKY POND, DAVID POSADA, MICHAEL B. GRAVENOR, CHRISTOPHER H. WOELK, AND SIMON D.W. FROST
"AUTOMATED PHYLOGENETIC DETECTION OF RECOMBINATION USING A GENETIC ALGORITHM" MBE 23(10):1891-1901
PERFORMS REMARKABLY WELL
POSADA AND CRANDALL (2001) TESTED 14 METHODS ON SIMULATED DATA
[Fig. 1: power (left) and rate of false positives (right) corresponding to 14 recombination detection algorithms against increasing levels of recombination and nucleotide diversity. Legend R(n): 0, 2.83, 11.32, 45.26, 181.05.]
EXAMPLE: SBP
SingleBreakpointRecomb.bf applied to a test alignment of HIV
reference sequences (subtypes B and C) and 2 BC recombinants
HTTP://WWW.HIV.LANL.GOV/CONTENT/SEQUENCE/HIV/CRFS/CRFS.HTML
PARTIAL OUTPUT
BREAKPOINT AT POSITION 612. DAIC = 30.79  DAICC = 29.35  DBIC = -179.59
BREAKPOINT AT POSITION 614. DAIC = 30.50  DAICC = 29.07  DBIC = -179.88
BREAKPOINT AT POSITION 618. DAIC = 30.35  DAICC = 28.92  DBIC = -180.03
BREAKPOINT AT POSITION 620. DAIC = 30.67  DAICC = 29.24  DBIC = -179.71
BREAKPOINT AT POSITION 624. DAIC = 30.34  DAICC = 28.90  DBIC = -180.04
BREAKPOINT AT POSITION 626. DAIC = 31.68  DAICC = 30.25  DBIC = -178.70
BREAKPOINT AT POSITION 629. DAIC = 30.01  DAICC = 28.58  DBIC = -180.37
AIC
BEST SUPPORTED BREAKPOINT IS LOCATED AT POSITION 260
AIC = 8019.03 : AN IMPROVEMENT OF 58.6488 AIC POINTS
AIC-C
BEST SUPPORTED BREAKPOINT IS LOCATED AT POSITION 260
AIC = 8020.99 : AN IMPROVEMENT OF 57.2153 AIC POINTS
BIC
THERE SEEMS TO BE NO RECOMBINATION IN THIS ALIGNMENT
[Figure: trees inferred for alignment positions 1-260 and 261-1323. Taxa: B_FR_83_HXB2_LAI_IIIB_BRU_K03455, B_US_98_1058_11_AY331295, B_NL_00_671_00T36_AY423387, B_TH_90_BK132_AY173951, C_ET_86_ETH2220_U46016, C_BR_92_BR025_D_U52953, C_ZA_04_SK164B1_AY772699, C_IN_95_95IN21068_AF067155, 08_BC_CN_97_97CNGX_6F_AY008715, 07_BC_CN_97_CN54_AX149771.]
GARD: GENETIC ALGORITHMS
FOR RECOMBINATION DETECTION
For a fixed number of breakpoints B
Try placing B breakpoints somewhere in the sequence
Reconstruct trees for each fragment between breakpoints
Compute goodness of fit
Select a model with the best fit (using a GA to move breakpoints around)
Change B and try again
If B>0, verify phylogenetic discordance, compute model averaged breakpoint
support.
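One way to see why a search heuristic such as a GA is used: the number of possible breakpoint placements grows combinatorially with B. The sketch below only counts placements; it is not part of GARD, and the figure of 300 variable sites is a hypothetical value.

```python
from math import comb

def placements(variable_sites, breakpoints):
    """Number of ways to choose breakpoint positions among the variable sites."""
    return comb(variable_sites, breakpoints)

variable_sites = 300  # hypothetical count of variable sites in an alignment
for b in range(1, 5):
    print(b, placements(variable_sites, b))
```

Each placement would need a tree reconstructed for every fragment, so exhaustive evaluation quickly becomes impractical beyond one or two breakpoints.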
SERGEI L. KOSAKOVSKY POND, DAVID POSADA, MICHAEL B. GRAVENOR, CHRISTOPHER H. WOELK, AND SIMON D.W. FROST
"AUTOMATED PHYLOGENETIC DETECTION OF RECOMBINATION USING A GENETIC ALGORITHM" MBE 23(10):1891-1901
GARD EXAMPLE
SELECTION/RECOMBINATION
Recombination can influence or even mislead selection
detection methods.
Using an incorrect tree to analyze a segment of a
recombinant alignment can bias dS and dN estimation.
The basic intuition is that an incorrect tree will generally
break up identity by descent and hence make it appear as if
more substitutions took place than did in reality.
Figure 4.2: The effect of recombination on inferring diversifying selection. Reconstructed evolutionary history of codon 516 of the Cache Valley Fever virus glycoprotein alignment is shown according to GARD inferred segment phylogeny (left) or a single phylogeny inferred from the entire
alignment (right). Ignoring the confounding effect of recombination causes the number of nonsynonymous substitutions to be overestimated. A fixed effects likelihood (FEL, Kosakovsky Pond and
Frost (2005)) analysis infers codon 516 to be under diversifying selection when recombination is
ignored (p = 0.02), but not when it is corrected for using a partitioning approach (p = 0.28).
[Figure: BREAKPOINT LOCATIONS. Model averaged support (0 to 1) plotted against breakpoint location (0 to 1400).]
[Figure: TREE LENGTHS. Values between 0.29 and 0.38 plotted against breakpoint location (0 to 1400).]
[Figure: two trees (scale bars 0.1) of the subtype B and C reference sequences and the two BC recombinants, 07_BC_CN_97_CN54_AX149771 and 08_BC_CN_97_97CNGX_6F_AY008715.]
ACCOUNTING FOR
RECOMBINATION
First screen the alignment to find putative non-recombinant
fragments
Apply a model-based test (SLAC, FEL, MEME or REL) using
multiple phylogenies (one per fragment), but inferring other
parameters (e.g. kappa and base frequencies) from the entire
alignment
This is the approach taken by PARRIS (corrected REL), and
corrected SLAC, FEL and MEME
[Figure panels: UNCORRECTED ERROR RATES and CORRECTED ERROR RATES.]
FIG. 3. False-positive error rates for the FEL test for selected (both positively and negatively) sites [...] the error rates for the uncorrected (single partition) FEL and panel B, for the corrected (2 partitions) FEL. Solid lines indicate [...] the P value. Tabulated error rates are presented for the first 400 codons (evolved under one tree), the last 100 codons [...] the joint error rate for all 500 codons, averaged over 100 replicates.
[Excerpt visible beside the figure: "break points were placed on variable sites only, and the number of break points allocated to a replicate was randomly drawn from the distribution of the number of inferred break points for that scenario. P values of observing smaller median distances to correct break points by chance were computed based on 1,000 replicates. In all cases, the median distance from inferred break points to correct ones was significantly less than that expected by [...]"]
Table 4. Effect of correcting for recombination when using fixed effects
likelihood to detect positively selected sites.
Positively selected codons:
Virus and gene | Uncorrected FEL | Corrected FEL
Cache Valley G | 212, 516, 546, 551 | None
Canine Distemper H | 158, 179, 264, 444 | 179, 264, 444, 548
Crimean Congo hemm. fever NP | 195 | 9, 195
Hantaan G2 | None | None
Human Parainfluenza (1) HN | 37, 91, 358, 556 | 91, 358
Influenza A (human H2N2) HA | 87, 166, 252, 358 | 87, 147, 252, 358
Influenza B NA | 42, 106, 345, 436 | 42, 106, 345, 436
Mumps F | 57, 480 | 57, 480
Mumps HN | 399 | None
Newcastle disease F | 1, 4, 5, 7, 16, 18, 108, 516 | 1, 5, 7, 16, 108, 493, 505
Newcastle disease HN | 2, 54, 58, 228, 262, 284, 306, 471 | 2, 58, 228, 262, 284, 306, 471
Newcastle disease N | 425, 430, 466 | 425, 430, 462, 466
Newcastle disease P | 12, 56, 65, 174, 179, 188, 189, 204, 208, 213, 217, 218, 239, 306, 332 | 56, 65, 146, 153, 174, 179, 189, 193, 204, 208, 213, 218, 261, 306, 332
Puumala NP | 79 | None
Test p < 0.1 was used to classify sites as selected. Codon sites found under selection by
both methods are shown in bold.
II. DETECTING CO-EVOLUTION
BETWEEN SITES USING
BAYESIAN GRAPHICAL MODELS
The fundamental assumption of many computational models – independent evolution of sites – is often violated
Compensatory (fitness restoring) mutations, e.g. in HIV to restore
replicative fitness following the acquisition of drug resistance
Complex phenotypes, i.e. those which depend upon many alleles (epistasis)
Evolution of motifs (e.g. N-linked glycosylation sites)
DETECTING CO-EVOLUTION
Apart from building (very computationally intensive and not yet
very tractable) models of co-evolving sites, one can seek sites in an
alignment which could be interacting in a post hoc fashion.
Sites which accumulate substitutions along the same branches can
be hypothesized to interact, i.e. substitutions at one site increase the
probability of a substitution at another site.
Approach: assemble a large collection of homologous sequences and
study associations in substitution patterns among sites.
DETECTING INTERACTIONS
FROM SEQUENCES.
A very straightforward approach:
Generate an alignment of homologous amino-acid sequences;
Look for statistical associations between residues at every pair of positions in the
alignment.
‘N’ occurs at site 2 with frequency 74% if there is a ‘V’ at site 1, but only with frequency
21% if there is an ‘I’ at site 1.
PAIRWISE ASSOCIATION TESTS
Counts tabulated from the ALIGNMENT (rows: residue at one site; columns: residue at the other):
        N     S
V     105    24
I      38    91
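A pairwise association test on such a 2x2 table of counts can be sketched as follows. This is a plain-Python two-sided Fisher's exact test written for illustration (not the code used in the studies cited here), applied to the four counts shown on the slide (105, 24, 38, 91). The result does not depend on which site is assigned to rows or columns.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

p_value = fisher_exact_two_sided(105, 24, 38, 91)
print(f"p = {p_value:.3g}")
```

The p-value is tiny, so the test reports a strong association. A test of this kind treats the sequences as independent observations.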
DETECTING INTERACTIONS
FROM SEQUENCES.
EXAMPLES OF EARLY STUDIES OF CORRELATED RESIDUES IN
PROTEIN SEQUENCES.
[Figure: HIV-1 V3 LOOP (KORBER 1993), a diagram of V3 loop residues T1 to H33.]
[Figure: MYOGLOBINS (NEHER 1994), plotted against position index k.]
THE EFFECTS OF
PHYLOGENY.
But applying statistical tests directly to sequence
variation is plagued with major issues!
A genotype is not a random sample of alleles from a
population.
Certain genotypes may be over-represented because
they are jointly inherited from the same common
ancestor, i.e., identical by descent.
THE EFFECTS OF PHYLOGENY.
Suppose there are two continuous characters (X and
Y) that are evolving at random along this tree.
If we measure X and Y in the individuals in the
present, they will appear to be significantly correlated
(co-evolving).
[Figure: scatter plot against X. FIG. 7: The same data set, with the points distinguished to show the members of the 2 monophyletic taxa. It can immediately be seen that the apparently significant relationship of fig. 6 is illusory.]
[Figure: FIG. 5: A "worst case" phylogeny for 40 species, in which there prove to be 2 groups each of 20 close relatives.]
[Excerpt visible in the image: "[...] of species from which we are sampling. This does not work. Imagine two species that have diverged some time ago, and thus have diverged in both brain and body weight. Clearly the correlation between those characters cannot be significant, since there are only two points. Now if each species gives rise to a group of 100 daughter species, essentially identical to it, we now have two clusters of 100 species each. Sampling species from this pool of 200 species, we are actually [...]"]
IMAGES FROM J FELSENSTEIN (1985). AM NAT 125: 5-6.
THE EFFECTS OF PHYLOGENY.
We’re trying to make a statement about how X and Y evolve, but
most of the action happened in the divergence of group 1 from
group 2.
Our correlation is essentially based on two independent data
points!
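The "two independent data points" effect can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not Felsenstein's analysis: one deep divergence is assumed to have shifted group 2 by +10 in X and -10 in Y, after which X and Y vary independently within each group. The group size of 100 and the random seed are arbitrary choices.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n_per_group = 100

# One deep divergence (assumed): group 2 moved by +10 in X and -10 in Y.
# Within each group, X and Y then change independently (unit-variance noise).
group1 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_per_group)]
group2 = [(10 + rng.gauss(0, 1), -10 + rng.gauss(0, 1)) for _ in range(n_per_group)]

pooled = group1 + group2
r_pooled = pearson(*zip(*pooled))
r_group1 = pearson(*zip(*group1))
r_group2 = pearson(*zip(*group2))
print(f"pooled r = {r_pooled:.2f}; within groups r = {r_group1:.2f}, {r_group2:.2f}")
```

The pooled correlation is strongly negative even though X and Y are uncorrelated within either group; the signal comes entirely from the single divergence between the two groups.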
[Figure: the two groups separated by a shift labelled +X and -Y.]
IMAGES FROM J FELSENSTEIN (1985). AM NAT 125: 5-6.
GENETIC INTERACTIONS IN HIV-1 V3
Simulate the evolution of V3 along the
fixed tree under the binary-character
model with a known set of interactions.
Compare false- and true-positive rates of a
pairwise association test (Fisher’s exact
test) and a binary analog of the evolutionary-network model.
[Figure: curves for the Evol-Net and Fisher methods; x axis: false positive rate.]
With no interactions, Fisher’s exact test
finds that 40 out of 511 pairs have
significant associations.
The evolutionary-network method reduces
false positives to about 2 pairs.
Image from AFY Poon et al. (2007), PLoS Comput Biol 3(1): e11.
BAYESIAN NETWORK INFERENCE.
The structure of a Bayesian network is the set of
edges connecting nodes to represent a conditional
dependence between the corresponding variables.
P(A,B,C,D,E,F) = P(F|E,D) P(D|E) P(E|B,C) P(B|A) P(C|A) P(A)
[Figure: directed graph with nodes A to F and edges A→B, A→C, B→E, C→E, E→D, E→F, D→F.]
BAYESIAN NETWORK INFERENCE.
Advantages of using Bayesian networks:
A natural graphical representation of complex systems.
Reduces the number of parameters, making it possible to learn
structure from relatively small data sets.
Models interactions among all variables simultaneously —
higher-order interactions.
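The parameter saving can be made concrete for the six-node example, P(A,B,C,D,E,F) = P(F|E,D) P(D|E) P(E|B,C) P(B|A) P(C|A) P(A). The sketch below assumes every variable is binary, which is an assumption made here for the count and not something stated on the slide.

```python
# Parents of each node, read off the factorisation
# P(A,B,C,D,E,F) = P(F|E,D) P(D|E) P(E|B,C) P(B|A) P(C|A) P(A).
parents = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["E"],
    "E": ["B", "C"],
    "F": ["E", "D"],
}

def network_parameters(parents):
    """Free parameters when every variable is binary: one probability
    per combination of parent states."""
    return sum(2 ** len(p) for p in parents.values())

def full_joint_parameters(n_variables):
    """Free parameters of an unrestricted joint distribution over binary variables."""
    return 2 ** n_variables - 1

print(network_parameters(parents), full_joint_parameters(len(parents)))
```

Under that assumption the network needs 15 probabilities where an unrestricted joint distribution over six binary variables needs 63.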
EVOLUTIONARY NETWORKS.
Every branch in the tree is converted into a string of 1’s and 0’s for
the presence or absence of a non-synonymous substitution.
Doing this for every site in the alignment yields a binary matrix.
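[Figure: tree with each branch labelled 0 or 1; the branch separating AGT (Ser) from AAT (Asn) is labelled 1. Each site yields one binary string, e.g. 011, 100, 010, 010, 000.]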
Coincident substitution events on the same branches are evidence of interactions
between sites.
The distribution of substitution events throughout the tree becomes the target for
our analysis.
Replace correlations of residue compositions with correlations of substitution
patterns.
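The conversion from branches to binary strings can be sketched as below. The per-branch codon pairs are hypothetical reconstructions made up for illustration, and the codon table is deliberately partial (it covers only the codons used here, including those shown on the slide: AGT, AAT, CCT, TCT).

```python
# Partial codon table: only the codons used in this toy example.
CODON_TO_AA = {"AGT": "Ser", "AGC": "Ser", "AAT": "Asn", "CCT": "Pro", "TCT": "Ser"}

def is_nonsynonymous(parent_codon, child_codon):
    """1 if the amino acid differs between the two ends of a branch, else 0."""
    return int(CODON_TO_AA[parent_codon] != CODON_TO_AA[child_codon])

# Hypothetical reconstructions: for each site, (parent codon, child codon) per branch.
site_branches = {
    "site1": [("AGT", "AGT"), ("AGT", "AAT"), ("AGT", "AGC"), ("AAT", "AAT")],
    "site2": [("CCT", "CCT"), ("CCT", "TCT"), ("CCT", "CCT"), ("TCT", "TCT")],
}

# One row per site, one column per branch.
matrix = {site: [is_nonsynonymous(p, c) for p, c in branches]
          for site, branches in site_branches.items()}

# Branches on which both sites carry a non-synonymous substitution.
coincident = [i for i, (a, b) in enumerate(zip(matrix["site1"], matrix["site2"]))
              if a and b]
print(matrix, coincident)
```

The synonymous change AGT to AGC (Ser to Ser) is scored 0, so only amino-acid replacements contribute to the binary matrix.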
[Figure: per-branch 0/1 indicators for two sites, one involving AGT (Ser) and AAT (Asn), the other CCT (Pro) and TCT (Ser).]
INTERACTING SITES (Pr = 0.9)
[Figure: phylogeny (scale bar 0.1) with labelled branches, e.g. Node5, 2_M9244858, 11_M849175.]
VARIABLE, BUT NOT INTERACTING (Pr = 0.04)
[Figure: phylogeny (scale bar 0.1) with labelled branches, e.g. Node6, 3_X7762758, Node4.]
[Diagram: workflow panels labelled ALIGNMENT, TREE, MODEL (e^Qt), PAIRWISE ASSOCIATION TESTS (codon counts AAA/AGA by AAT/AGT: 105, 24; 91, 38), EVOLUTIONARY MAP (binary strings such as 000, 010, 100, 011) and BAYESIAN NETWORKS.]
Problems with using pairwise association tests:
significant pairs are never tested in the context of other sites
missing out on higher-order interactions
no clear procedure for assembling the “big picture” from a list of
significant pairs;
difficult to interpret!
Use Bayesian networks.
Image from PB Gilbert et al. (2005) AIDS Res Hum Retrovir 21: 1020.
Obtained 1,154 full-length HIV-1 env
sequences from the Los Alamos
National Laboratory HIV Sequence
Database (excluding recombinants,
mostly subtypes B and C).
Reconstructed phylogeny from
nucleotide sequences excluding all
variable domains (V1/V2, V3, V4, and
V5).
Minimize influence of alignment
uncertainty and convergent
evolution.
AFY Poon et al. (2007), PLoS Comput Biol 3(1): e11.
Network edges correspond to well-characterized interactions:
N-linked glycosylation motif at sites N5 and T7.
‘11-25’ rule of co-receptor usage.
[Figure: inferred network of V3 sites (T1 to H33) with numeric edge labels; annotated groups: N-linked glycosylation, 11-25 rule.]
Image from AFY Poon et al. (2007), PLoS Comput Biol 3(1): e11.
All of the analyses presented during the previous two lectures have
been powered by the HyPhy package
What is HyPhy?
III. HYPHY: [HY]POTHESIS
TESTING USING
[PHY]LOGENIES: A
SCRIPTABLE SEQUENCE
ANALYSIS PLATFORM
HTTP://WWW.HYPHY.ORG/
HI-FI NOT HIGH-FEE
THE OTHER HYPHY (WE STARTED IN 1997)
HYPHY (PRONOUNCED HIGH-FEE; IPA: [ˈHAɪFIː]) IS A STYLE OF MUSIC AND
DANCE ASSOCIATED WITH SAN FRANCISCO BAY AREA HIP HOP CULTURE. IT
BEGAN TO EMERGE IN EARLY 2000 AS A RESPONSE FROM BAY AREA RAPPERS
AGAINST COMMERCIAL HIP HOP FOR NOT ACKNOWLEDGING THE BAY FOR
SETTING TRENDS IN THE HIP HOP INDUSTRY. ALTHOUGH THE "HYPHY
MOVEMENT" HAS JUST RECENTLY SEEN LIGHT IN MAINSTREAM AMERICA, IT HAS
BEEN A LONG STANDING AND EVOLVING CULTURE IN THE BAY AREA. THE TERM
IS A COMBINATION OF THE WORDS "HYPE" AND "FLY".
IT IS DISTINGUISHED BY GRITTY, POUNDING RHYTHMS, AND IN THIS SENSE CAN
BE ASSOCIATED WITH THE BAY AS CRUNK MUSIC IS TO THE SOUTH. AN
INDIVIDUAL IS SAID TO "GET HYPHY" WHEN THEY ACT OR DANCE IN AN
OVERSTATED AND RIDICULOUS MANNER. MANY IN THE BAY AREA WOULD
DESCRIBE THIS AS ACTING "RETARDED", "RIDING THE YELLOW BUS" OR "GOING
DUMB".
HTTP://EN.WIKIPEDIA.ORG/WIKI/HYPHY
[Diagram of HyPhy components: HyPhy GUI (data view and exploration, tree viewer/editor, charting, substitution model editor, model parameters/hypothesis setup); scripts in the HyPhy batch language; text based I/O for pipelines and terminal execution; computational backend (MPI).]
THE HYPHY PACKAGE
12 years in development together with
Spencer Muse, Wayne Delport and Art Poon
Open source, runs on all major platforms natively, supports multiprocessor
(OpenMP), distributed (MPI) and gpGPU (OpenCL, next release) systems
Disclaimer: the developers all use Macs, so the Mac OS version is generally the most
polished/stable
~7000 users and 500 citations
Has ~100 prepackaged analyses
Fully featured graphical interface for Mac, Windows and X11 (based on GTK)
WHAT DOES HYPHY DO WELL?
Inference about the evolutionary process, e.g.
Selection
Recombination
Model selection
Evolutionary rate tests (relative rate/ratio)
HyPhy can infer phylogenies using a variety of algorithms, but this is not its primary
function. There exist much faster/more comprehensive packages for inference (e.g.
PAUP*, Garli, PhyML).
However, HyPhy can accommodate the widest range of evolutionary models into
inference
HyPhy does not really do sequence alignment, even though we use it for some
customized low divergence sequence alignments and NGS processing.
HYPHY DESIGN PHILOSOPHY: TO
EACH THEIR OWN
Prepackaged point and click analyses
Graphical user interface (Mac OS, Windows, X11)
An easy mechanism to design own analyses graphically
Visualization and result processing
Flexible scripting language (HBL) for writing complex custom analyses
Unparalleled flexibility
Pipeline integration, web services (e.g. datamonkey.org, GALAXY
http://g2.trac.bx.psu.edu, MEGA 5)
NOT JUST PHYLOGENETIC
LIKELIHOOD
Because HyPhy is the primary platform for our research, it has grown
to include a number of new features.
Hidden Markov Models: spatial correlation in rates, phyloHMM
Bayesian Graphical Models: epistatic interactions, phenotypes
Stochastic Context Free Grammars: probabilistic models for structured data
(RNA secondary structure, tree shape)
Genetic algorithms: complex feature selection
454 data analysis: specialized error correction and read mapping routines for
rapidly evolving pathogens (HIV, HCV)
HYPHY DOCUMENTATION
Is not nearly as good as it should be!
Several book chapters
Package description and tutorial: http://www.hyphy.org/docs/HyPhyDocs.pdf
Selection analyses: http://www.hyphy.org/pubs/hyphybook2007.pdf
Recombination, epistasis and directional evolution: http://www.hyphy.org/pubs/methods2011.pdf
Wiki (just started, work in progress): http://www.hyphy.org/wiki/Main_Page
Message boards (the best place for specific questions): http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl
WHY FIT MODELS?
Models describe our mechanistic understanding of the evolutionary process,
e.g. the dichotomy between synonymous and non-synonymous substitutions
Using formal models, we can estimate biologically relevant parameters such as
branch lengths and divergence times, substitution rates, measures of selection.
Each model can be assigned a goodness of fit (usually derived from its Log L
score and complexity)
Models can be compared to decide which parameters are important, or to test
what values biological quantities may take (hypothesis testing)
Sequence data can be ‘mined’ for pattern discovery using collections of
substitution models
A VERY BASIC EXAMPLE
Given a nucleotide alignment and a tree, fit a simple substitution
model and interpret its output.
Data file: p51.nex (located in data/ within the HyPhy distribution
folder).
Three alternative ways to do it:
Standard analysis
Graphical user interface
Directly in the batch language
STANDARD ANALYSES
______________READ THE FOLLOWING DATA______________
8 species:
{B_FR_83_HXB2_ACC_K03455,B_US_83_RF_ACC_M17451,B_US_86_JRFL_ACC_U63632,B_US_90_WEAU160_ACC_U21135,
D_CD_83_ELI_ACC_K03454,D_CD_83_NDK_ACC_M27323,D_CD_84_84ZR085_ACC_U88822,D_UG_94_94UG114_ACC_U88824};
Total Sites:1320;
Distinct Sites:118
A tree was found in the data file:
((((D_CD_83_ELI_ACC_K03454,D_CD_83_NDK_ACC_M27323),D_UG_94_94UG114_ACC_U88824),D_CD_84_84ZR085_ACC_U88822),
B_US_83_RF_ACC_M17451,((B_FR_83_HXB2_ACC_K03455,B_US_86_JRFL_ACC_U63632),B_US_90_WEAU160_ACC_U21135))
Would you like to use it:(Y/N)?y
______________RESULTS______________
Time taken = 0.23 seconds
AIC Score = 6682.51
Log Likelihood = -3327.25252976199;
Shared Parameters:
R=0.111274
Tree givenTree=(B_US_83_RF_ACC_M17451:0.0262012,
((B_FR_83_HXB2_ACC_K03455:0.0116675,B_US_86_JRFL_ACC_U63632:0.0178118)Node4:0.0022271,
B_US_90_WEAU160_ACC_U21135:0.0209889)Node3:0.00507481,
(((D_CD_83_ELI_ACC_K03454:0.0188158,D_CD_83_NDK_ACC_M27323:0.0100127)Node10:0.0105018,
D_UG_94_94UG114_ACC_U88824:0.053071)Node9:0.00391151,D_CD_84_84ZR085_ACC_U88822:0.0283077)Node8:0.0234111);
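The reported AIC score can be cross-checked against the log likelihood. The sketch below assumes the usual definition AIC = 2k - 2 log L and assumes that the estimated parameters are the branch lengths plus the shared parameter R; base frequencies are taken as observed and not counted.

```python
log_l = -3327.25252976199   # "Log Likelihood" from the output above
aic_reported = 6682.51      # "AIC Score" from the output above
n_taxa = 8

# AIC = 2k - 2 log L, so the implied number of estimated parameters is:
k_implied = (aic_reported + 2 * log_l) / 2

# An unrooted binary tree on n taxa has 2n - 3 branches.
n_branches = 2 * n_taxa - 3
k_expected = n_branches + 1   # branch lengths plus the shared parameter R

print(round(k_implied, 2), k_expected)
```

The implied count agrees with 13 branch lengths plus R, i.e. 14 parameters, which matches the 13 branch lengths listed in the fitted tree.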
IN THE GUI
[Screenshot: HyPhy GUI panels PARAMETERS, TREE and MODEL; the tree shows the eight sequences against a scale running from 0.01 to 0.06869.]
IN HBL
/* Read the alignment and use all sites as nucleotide data. */
DataSet nucleotideSequences = ReadDataFile ("data/p51.nex");
DataSetFilter filteredData = CreateFilter (nucleotideSequences,1);
/* Tabulate observed nucleotide frequencies. */
HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);
/* HKY85: R multiplies the rate for A<->G and C<->T changes. */
global R = 1;
HKY85RateMatrix =
    {{*,trvs,R*trvs,trvs}
     {trvs,*,trvs,R*trvs}
     {R*trvs,trvs,*,trvs}
     {trvs,R*trvs,trvs,*}};
Model HKY85 = (HKY85RateMatrix, observedFreqs);
Tree givenTree = DATAFILE_TREE;
/* Build the likelihood function, maximize it and print the fit. */
LikelihoodFunction theLnLik = (filteredData, givenTree);
Optimize (paramValues, theLnLik);
fprintf (stdout, theLnLik);
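What the rate matrix definition expands to can be sketched numerically. The Python below is an illustration, not HyPhy code. It assumes each off-diagonal rate is multiplied by the frequency of the target nucleotide and that each '*' entry is minus the sum of the other entries in its row. The frequencies and rate values are hypothetical.

```python
NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85_rate_matrix(trvs, ratio, freqs):
    """Build a 4x4 HKY85-style rate matrix.

    Off-diagonal entry (i, j) is the rate parameter times the frequency of
    the target nucleotide j; each diagonal entry is minus the sum of the
    other entries in its row, so every row sums to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, src in enumerate(NUCLEOTIDES):
        for j, dst in enumerate(NUCLEOTIDES):
            if i != j:
                rate = ratio * trvs if (src, dst) in TRANSITIONS else trvs
                q[i][j] = rate * freqs[j]
        q[i][i] = -sum(q[i])
    return q

# Hypothetical values for illustration only.
freqs = [0.4, 0.2, 0.2, 0.2]
q = hky85_rate_matrix(trvs=0.1, ratio=4.0, freqs=freqs)
for row in q:
    print(["%.3f" % v for v in row])
```

The positions scaled by `ratio` are the same ones that carry `R*trvs` in the HBL matrix (A<->G and C<->T).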
HBL OBJECTS
Data sets (alignments)
Data filters (partitioned alignments)
Models (matrices of parameters, constraints)
Trees (topologies+models+parameter values)
Likelihood functions (datafilters+trees+models+structure)
Stochastic Context Free Grammars
SQL databases (via SQLite)
Strings, numbers, matrices, associative arrays, expressions (formulas),
regexps, 100+ built in functions ...
TESTING HYPOTHESES
Model (+constraints on parameters) = hypothesis
To test hypotheses, we fit several models of varying complexity to
the data and compare their goodness-of-fit. The model that fits the
data significantly better than all others is chosen as the best
explanation of how the data have arisen.
The simplest case has two models
Null (simple)
Alternative (more complex)
RELATIVE RATE TESTS
For many models, we can’t decouple
substitution rates and evolutionary times;
they are confounded in the ‘expected
substitutions/site’ measure.
However, sometimes it is possible to
factor out the time to directly compare
evolutionary rates.
One of the first tests to do that was the
relative rate test: using an outgroup to
‘polarize’ substitutions.
It is then possible to directly compare
branch lengths and make statements
about evolutionary rates.
[Figure: three-taxon tree with OUTGROUP, A and B. Labelled nodes: MRCA of all 3 (can't estimate location with reversible models) and MRCA of A and B. L(A) = T * rate (A), L(B) = T * rate (B).]
If L(A) != L(B), this means
that rate (A) != rate (B)
TESTING FOR EQUALITY OF EVOLUTIONARY RATES
S. V. MUSE AND B. S. WEIR (1992).
Genetics 132: 269-276
FIRST, THE OUTPUT
0).UNCONSTRAINED MODEL:LOG LIKELIHOOD = -944.001000454328;
TREE THREETAXATREE=(A:0.0850301,B:0.0733855,C:0.165368);
1). WITH THE OUTGROUP AT TAXON #1, THE P-VALUE IS:0.00215511
LOG LIKELIHOOD = -948.707253306001;
LOG LIKELIHOOD = -947.479039420142;
LOG LIKELIHOOD = -944.093617010496;
LIKELIHOOD RATIO TESTING
The alternative model (unconstrained rates) yielded log L = -944.00, while the simpler null (two
rates equal, outgroup at A) model returned log L = -948.71
Lower likelihood = worse fit
However, because the alternative model has one more parameter than the null, it should always
beat (or at least match) the score of the null, even if the latter were the correct model.
Is the improvement in fit (log likelihood ratio, log LR = 4.71, so 2*log LR = 9.41) large enough to be significant?
Perform a Likelihood Ratio Test (LRT), comparing 2*log LR with the tail of the
chi-squared distribution with as many degrees of freedom as there are additional parameters in
the alternative model
In this case, the p-value (the probability that 2*log LR >= observed value if the null model is correct) is
about 0.002, hence there is strong evidence that the two ingroup lineages evolve at different
rates in our data set.
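The test can be reproduced from the two log likelihoods in the output. This is a minimal Python sketch, assuming the first constrained log likelihood listed (-948.707...) belongs to the run with the outgroup at taxon #1. For one degree of freedom the chi-squared upper tail equals erfc(sqrt(x/2)), so no statistics library is needed.

```python
import math

def chi2_sf_1df(x):
    """Upper tail of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

log_l_alt = -944.001000454328    # unconstrained model
log_l_null = -948.707253306001   # two rates constrained equal

lrt = 2 * (log_l_alt - log_l_null)
p_value = chi2_sf_1df(lrt)
print(f"2*log LR = {lrt:.2f}, p = {p_value:.5f}")
```

The computed p-value agrees with the 0.00215511 printed by the HyPhy script for the outgroup at taxon #1.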
THE HBL FILE.
READ THE DATA
/* 1. Read in the data and store the result in a DataSet variable.*/
DataSet nucleotideSequences = ReadDataFile ("data/3.seq");
/* 2. Filter the data, specifying that all of the data is to be used
and that it is to be treated as nucleotides. */
DataSetFilter filteredData = CreateFilter (nucleotideSequences,1);
/* 3. Collect observed nucleotide frequencies from the filtered data.
observedFreqs will store the vector of frequencies. */
HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);
DEFINE THE MODEL / TREE
/* 4. Define the F81 substitution matrix. '*' is defined to be -(sum of off-diag
row elements); the elements of the matrix reflect substitution rates from one
nucleotide to another. The nucleotides are ordered alphabetically -- a,c,g,t,
thus, for example the entry at row 3, column 4 (mu) supplies the rate from G to T
*/
F81RateMatrix = {{*,mu,mu,mu}
                 {mu,*,mu,mu}
                 {mu,mu,*,mu}
                 {mu,mu,mu,*}};
/* 5. Define the F81 model, by combining the substitution matrix with the vector
of observed (equilibrium) frequencies; by default the rates are multiplied by the
pi_j terms, so these need not be written into the rate matrix. */
Model F81 = (F81RateMatrix, observedFreqs);
/* 6. Now we can define the simple three taxa tree.
Since there is only 1 three sequence tree, we turn on
ALLOW_SEQUENCE_MISMATCH to tell hyphy to map the first
sequence in the data to leaf 'a', the 2nd - to leaf 'b'
and the third - leaf 'c'. */
ALLOW_SEQUENCE_MISMATCH = 1;
Tree threeTaxaTree = (a,b,c);
CONSTRUCT AND OPTIMIZE LF
/* 7. Since all the likelihood function ingredients (data, tree,
equilibrium frequencies)
have been defined we are ready to construct the likelihood function. */
LikelihoodFunction theLnLik = (filteredData, threeTaxaTree);
/* 8. Maximize the likelihood function, storing parameter values in the
matrix paramValues.
We also store the resulting ln-lik. */
Optimize (paramValues, theLnLik);
unconstrainedLnLik = paramValues[1][0];
/* 9. Print the tree with optimal branch lengths to the console. */
fprintf (stdout, "\n0).UNCONSTRAINED MODEL:", theLnLik);
PERFORM THE RR TEST
/* 10. We now constrain the rate of evolution to be equal along the branches leading
to c and b and repeat the optimization. */
threeTaxaTree.b.mu := threeTaxaTree.c.mu;
Optimize (paramValues, theLnLik);
/* 11. Now we compute the ln-lik ratio statistic and the P-Value, using the Chi^2 dist'n
with 1 degree of freedom. */
lnlikDelta = 2*(unconstrainedLnLik-paramValues[1][0]);
pValue = 1-CChi2 (lnlikDelta, 1);
fprintf (stdout, "\n\n1). With the outgroup at taxon #1, the P-value is:", pValue, "\n",
theLnLik);
REPEAT WITH THE OTHER 2 TAXA AS OUTGROUP
MODEL EXAMPLES.
HyPhy permits three basic model parameter types; mixing them in a
single analysis permits unique flexibility.
LOCAL: Attached to an individual branch in a tree, e.g.
branch length
branch dN/dS ratio
branch rate
GLOBAL: Shared by many branches in a single tree (or multiple trees),e.g.
base frequencies
transition/transversion ratios
hyperparameters (shape of the gamma distribution, mixing coefficients)
CATEGORY: Variables which are integrated out, e.g.
the distribution of substitution rates across sites
a model-mixture distribution
MODELS WITH
LOCAL / GLOBAL PARAMETERS
The most common example: lineage
specific selection
Consider HIV sequences from two
epidemiologically linked patients: the
source and the recipient.
Biologically, the virus experiences at least
THREE different selective environments,
hence there is reason to believe that
selection (measured by dN/dS) will vary
across lineages.
TRANSMISSION
FROST SDW ET AL
JOURNAL OF VIROLOGY, MAY 2005, P. 6523–6527
THE SETUP IN HYPHY
Define the most general model first, i.e. the model where each
branch has its own dN/dS ratio, or to be more precise, a pair of rates
(synonymous and non-synonymous rates)
Fit this model as the alternative
Constrain the parameters (dN/dS ratios) to be the same within
groups of branches to define nulls
Fit the null model(s) and perform LRT
CONSTRAINTS IN HYPHY
X = 1; // ASSIGN VALUE
Y := X; // CONSTRAIN Y TO BE ALWAYS EQUAL TO X
X :< 5; // LIMIT THE RANGE OF X
global OMEGA_SOURCE=1; // A GLOBAL PARAMETER
TREEID.BRANCHID.PARAMETERNAME; // THE NAME OF A TYPICAL MODEL PARAMETER IN HYPHY
ReplicateConstraint("this1.?.nonSynRate:=OMEGA_SOURCE*this2.?.synRate",HIVTRANSMISSION_TREE,HIVTRANSMISSION_TREE);
// A CONVENIENCE FUNCTION TO APPLY THE SAME CONSTRAINT TO ALL BRANCHES IN A TREE
MIXED DATA
Example: Test for equality of the transition/transversion ratio in an
exon-intron-exon data set between exons and the intron.
READ / FILTER THE DATA
LoadFunctionLibrary ("chooseGeneticCode");
DataSet sequences = ReadDataFile ("intronexon.nex");
DataSetFilter intron = CreateFilter (sequences,1,"0-88,275-551");
filterString = "89-139,146-151,179-184,199-274";
// GeneticCodeExclusions = stop codons which are not valid states (TAA, TAG, TGA)
DataSetFilter exon = CreateFilter (sequences,3,filterString,"",GeneticCodeExclusions);
HarvestFrequencies (exonBaseFreqs, exon, 3, 1, 1); // 4x3 matrix
HarvestFrequencies (intronBaseFreqs, intron, 1, 1, 1); // 4x1 matrix
DEFINE AND CONSTRAIN THE MODELS
// nucleotide model; explicit definition
global kappaInv = 1.0;
HKY85RateMatrix = {{*,t*kappaInv,t,t*kappaInv}
                   {t*kappaInv,*,t*kappaInv,t}
                   {t,t*kappaInv,*,t*kappaInv}
                   {t*kappaInv,t,t*kappaInv,*}};
Model HKY85 = (HKY85RateMatrix, intronBaseFreqs);
// codon model; read from a ‘standard’ file
ExecuteAFile (HYPHY_BASE_DIRECTORY + "TemplateBatchFiles" +
              DIRECTORY_SEPARATOR + "2RatesAnalyses" +
              DIRECTORY_SEPARATOR + "MG94xREV.mdl");
PopulateModelMatrix ("MGRateMatrix",exonBaseFreqs,0);
codonFreqs = BuildMGCodonFrequencies (exonBaseFreqs);
Model MG94 = (MGRateMatrix, codonFreqs, 0);
AT:=AC; CG:=1; CT:=AC; GT:=AC;
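To make the role of kappaInv explicit, here is a minimal sketch in Python (not HBL) of the HKY85 rate matrix defined above: transitions (A<->G, C<->T) get rate t, transversions get t*kappaInv. It assumes the usual convention that each off-diagonal rate is multiplied by the frequency of the target base and that the diagonal makes each row sum to zero. The function name and the numeric values are illustrative only.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """A<->G and C<->T are transitions; everything else is a transversion."""
    return (a in PURINES) == (b in PURINES)

def hky85_rate_matrix(t, kappa_inv, freqs):
    """Return a 4x4 list of lists in A, C, G, T order.

    Off-diagonal entry (i, j) is t for a transition or t * kappa_inv for a
    transversion, multiplied by the frequency of the target base; each
    diagonal entry is set so that its row sums to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCS):
        for j, b in enumerate(NUCS):
            if i != j:
                rate = t if is_transition(a, b) else t * kappa_inv
                q[i][j] = rate * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the sum is taken
    return q

# Illustrative values: equal base frequencies, transversions 4x slower.
q = hky85_rate_matrix(t=1.0, kappa_inv=0.25, freqs=[0.25] * 4)
```

With equal frequencies, the ratio of the A->G entry to the A->C entry is 1/kappaInv, which is the quantity the test below compares between exons and the intron.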
TREES / LIKELIHOOD FUNCTION
treeString = "(((HKL5:0,HKL8:0.0108295)Node2:0.0143961,PHA:0.0761833)Node1:0.076599,PTR7:0.00325449,HSA1:0.0140364)";
UseModel (MG94);
Tree exonTree = treeString;
UseModel (HKY85);
Tree intronTree = treeString;
fprintf (stdout, "[1. FITTING SEPARATE TV/TS RATIOS TO EXON/INTRON]\n");
LikelihoodFunction theLnLik = (exon, exonTree, intron,intronTree);
Optimize (paramValues, theLnLik); // paramValues is used by the LRT step
[1. FITTING SEPARATE TV/TS RATIOS TO EXON/INTRON]
Log Likelihood = -1043.89674669368;
Shared Parameters:
AC=0.961309
R=3.16948
kappaInv=0.30696
AT=AC=0.961309
CG=1=1
CT=AC=0.961309
GT=AC=0.961309
Tree exonTree=(((HKL5:0,HKL8:0)Node2:0.0522098,PHA:0.0465591)Node1:0.223343,PTR7:0,HSA1:0.0390297);
Tree intronTree=(((HKL5:0,HKL8:0.0150867)Node2:0.00512057,PHA:0.0829941)Node1:0.0280879,PTR7:0.00298343,HSA1:0.00587939);
CONSTRAIN AND FIT THE NULL MODEL
fprintf (stdout, "\n[2. FITTING SINGLE TV/TS RATIO TO EXON/INTRON]\n");
kappaInv := AC;
Optimize (paramValues2, theLnLik);
[2. FITTING SINGLE TV/TS RATIO TO EXON/INTRON]
Log Likelihood = -1046.64481338653;
Shared Parameters:
AC=0.486803
R=2.91845
AT=AC=0.486803
CG=1=1
CT=AC=0.486803
GT=AC=0.486803
kappaInv=AC=0.486803
Tree exonTree=(((HKL5:0,HKL8:0)Node2:0.0532969,PHA:0.0475588)Node1:0.228066,PTR7:0,HSA1:0.0390151);
Tree intronTree=(((HKL5:0,HKL8:0.0149572)Node2:0.00517383,PHA:0.0818904)Node1:0.0277913,PTR7:0.00292384,HSA1:0.00590144);
PERFORM THE LRT
fprintf (stdout, "\n[3. CONDUCTING THE LRT FOR TV/TS EQUALITY]\n");
LR = 2*(paramValues[1][0]-paramValues2[1][0]);
DF = (paramValues[1][1]-paramValues2[1][1]);
pV = 1-CChi2(LR,DF);
fprintf (stdout, "LR Statistic = ", Format (LR,8,2),
         "\nConstraints = ", Format (DF,8,0),
         "\np-value = ", Format (pV,8,4), "\n");
[3. CONDUCTING THE LRT FOR TV/TS EQUALITY]
LR Statistic =     5.50
Constraints =        1
p-value =   0.0191
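The same test can be checked by hand from the two log likelihoods printed in steps 1 and 2. The sketch below is plain Python rather than HBL; the function name is made up for illustration. It assumes a single constraint (one degree of freedom), for which the chi-square tail probability has the closed form erfc(sqrt(LR/2)), and it assumes the alternative fits at least as well as the null.

```python
import math

def lrt_one_df(lnl_alternative, lnl_null):
    """Return (LR, p) for nested models that differ by one constraint.

    Assumes lnl_alternative >= lnl_null, as expected for nested models.
    """
    lr = 2.0 * (lnl_alternative - lnl_null)
    # Chi-square survival function with 1 d.f.: P(X > lr) = erfc(sqrt(lr / 2))
    p = math.erfc(math.sqrt(lr / 2.0))
    return lr, p

# Log likelihoods copied from the output of steps 1 and 2.
lr, p = lrt_one_df(-1043.89674669368, -1046.64481338653)
print("LR = %.2f, p = %.4f" % (lr, p))
```

For more than one constraint the closed form no longer applies and a general chi-square distribution function is needed, which is what CChi2 provides in HBL.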
SIMULATION
Simulation is an invaluable tool of statistical modeling
HyPhy permits the user to simulate data from any likelihood
function object (parametrically) with a single command
Non-parametric bootstrap (sample with replacement) is equally easy
// parametric
DataSet simulatedDataSet = SimulateDataSet (likelihoodFunction);
//non-parametric, treat data as nucleotides
DataSetFilter simulatedDataFilter = Bootstrap (filteredData,1);
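For readers unfamiliar with the non-parametric bootstrap, the idea is to draw alignment columns with replacement until the replicate has as many columns as the original. A minimal sketch in Python follows; it illustrates the resampling idea only and is not a description of how HyPhy implements Bootstrap. The toy alignment and the function name are made up.

```python
import random

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned string (equal lengths).
    Returns a new dict with the same names and the same alignment length.
    """
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

toy = {"seq1": "ACGTAC", "seq2": "ACGTTC", "seq3": "ATGTAC"}  # made-up data
replicate = bootstrap_columns(toy, random.Random(1))
```

The same column indices are applied to every sequence, so each resampled column is an intact column of the original alignment.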
AUTOMATING ANALYSES
HyPhy provides a mechanism to write ‘wrapper’ files to call other
HBL files with predefined parameters.
This feature makes it possible
to carry out the same analysis over a large collection of files
to perform a multistep analysis (pipeline, e.g. as in Datamonkey.org)
inputRedirect = {};
inputRedirect["01"]="Universal";
inputRedirect["02"]="/Users/sergei/Desktop/MyFiles/somealignment.nex";
inputRedirect["03"]="MG94CUSTOM";
inputRedirect["04"]="Local";
inputRedirect["05"]="012232";
inputRedirect["06"]="y";
LoadFunctionLibrary ("AnalyzeCodonData.bf", inputRedirect);
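Running the same analysis over many files amounts to writing one such wrapper per alignment. A minimal sketch in Python that generates the wrapper text is shown below. The helper name and the alignment file names are hypothetical; the remaining answers and the batch file name are copied from the wrapper above. Paths containing quotes or backslashes would need escaping, which this sketch does not attempt.

```python
def wrapper_text(answers, batch_file):
    """Build HBL wrapper text that feeds `answers`, in order, to `batch_file`."""
    lines = ["inputRedirect = {};"]
    for index, answer in enumerate(answers, start=1):
        # Zero-padded keys ("01", "02", ...) follow the pattern used above.
        lines.append('inputRedirect["%02d"]="%s";' % (index, answer))
    lines.append('LoadFunctionLibrary ("%s", inputRedirect);' % batch_file)
    return "\n".join(lines)

# Hypothetical file names; the other answers are copied from the wrapper above.
for path in ["alignment1.nex", "alignment2.nex"]:
    answers = ["Universal", path, "MG94CUSTOM", "Local", "012232", "y"]
    print(wrapper_text(answers, "AnalyzeCodonData.bf"))
    print()
```

Each generated text can be saved to its own batch file and executed by HyPhy.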
PYTHON AND R BINDING
HyPhy can be compiled as a Python or R module and called directly
from those languages.
# import the HyPhy library
# and standard OS utilities
import os, HyPhy
# first, create a HyPhy interface instance (class _THyPhy)
# the first argument defines the root directory for HyPhy
# and the second - how many threads the computational core
# should spawn
hyphyInstance = HyPhy._THyPhy (os.getcwd(),2)
# the basic interface command is 'ExecuteBF' which
# executes HyPhy batch language commands in HyPhy
# and returns a string representation of the return value
# (if any) from HYPHY
# The returned object is of type _THyPhyString with
# sData and sLength fields
# HyPhy will take care of disposing of the memory needed
# to store the result
hyphyResult = hyphyInstance.ExecuteBF ("return 2+2;");
print "Testing a trivial HyPhy command. 2+2 = ", hyphyResult.sData
hyphyResult = hyphyInstance.ExecuteBF ("ExecuteAFile(\"../HBL/HKY85.bf\")");
print "Log-L = ", retrieveValueByKey ("LogL", HyPhy.THYPHY_TYPE_NUMBER, hyphyInstance).nValue;
print "kappa = ", retrieveValueByKey ("kappa", HyPhy.THYPHY_TYPE_NUMBER, hyphyInstance).nValue;
print "tree string = ", retrieveValueByKey ("Tree", HyPhy.THYPHY_TYPE_STRING, hyphyInstance).sData;
bl = retrieveValueByKey ("Branch lengths", HyPhy.THYPHY_TYPE_MATRIX, hyphyInstance);
print "retrieved ", bl.mCols-1, "branch lengths"
for i in range(0,bl.mCols-1):
    print "Branch ", i+1, " has length ", bl.MatrixCell(0,i)
Log-L = -1165.73404336
kappa = 2.90308047463
tree string = (((317:0.0369387,6767:0.0674453)Node2:0.0116625,((135:0.019967,(529:0.0330648,105r:0.0129175)Node8:0.0257932)Node6:0.0057794,(719:0.0290622,136:0.00806843)Node11:0.0470372)Node5:0.0153195)Node1:0.0137487,6760:0.068227,((113:0.0213909,9939:0.0737467)Node16:0.00489231,(256:0.011979,(822:0.0288374,159:0.00462985)Node21:0.00351714)Node19:0.0211093)Node15:0.0229936)
retrieved 23 branch lengths
Branch 1 has length 0.0369386905354
....
