Homology, Orthology, Paralogy, Xenology Bioinformatics: In
Bioinformatics: In-depth, Spring 2012
Pairwise Evolutionary Relations: Homology, Orthology, Paralogy, Xenology
Manuel Gil, people.inf.ethz.ch/mgil/
Many slides based on a lecture by Christophe Dessimoz

Pairwise Relations
Fitch 1970, Systematic Zoology, Vol. 19, No. 2 (Jun., 1970), pp. 99-113

• Pairwise Evolutionary Relations
• DLIGHT: LGT Detection Using Pairwise Distances in a Statistical Framework
• Conclusion

Evolutionary Relations
Two genes (or characters) are homologs if they have a common ancestor. Subtypes:
• Orthologs: speciation
• Paralogs: duplication
• Xenologs: lateral transfer
• Family of orthologs: no paralogs
[Figure: tree with leaves a, b1, b2, c1, c2 and labels S1, S2, marking gene duplication, speciation, and lateral gene transfer events; pairwise evolutionary relations shown: orthologs, paralogs, xenologs]

Applications
• Homology
  • Phylogenetic tree building (homology at the level of both gene and character matters)
  • Evolution of gene families
• Orthologs (as opposed to paralogs, xenologs)
  • Species tree building
  • Gene function
• All: study of genome evolution; mappings between genomes
  • Rearrangements, duplications, LGTs, ...

Limits of Orthology for function inference
Trends Genet (2009) vol. 25 (5) pp. 210-216

Other Usages of “Orthology” *
• Genes with the same function
• Homologs in the same genomic context
* not recommended

Homology
• Most commonly inferred by sequence similarity
  • “All-against-all”
  • BLAST
• At low similarity (20-30% identity, “twilight zone”), protein structure tends to be better conserved, but
  • structure often unknown
  • only for conserved regions

Pairwise Alignment Statistics
http://www.math.ku.dk/~richard/courses/binf_project/Stinus-BLAST.pdf
• Scores of two random (i.e., unrelated) sequences are distributed according to a Gumbel extreme value distribution (2 parameters, fat tail)

Orthology / Paralogy

Orthology Inference: Two Classes
• Pairwise Methods
• Gene/Species Tree Reconciliation
[Figure: gene tree G (genes G1 to G4) and species tree S over Homo sapiens, Pan troglodytes, Mus musculus, and Rattus norvegicus, combined into a reconciled tree R with a duplication node and inferred losses. Pairwise methods are labeled “implicit”, tree reconciliation “explicit”. Dufayard et al., Bioinformatics, 2005]

Two Phases of Pairwise Approach
• Orthology Inference
• Clustering from Pairs

Basic Idea
• Orthologs are closer than paralogs
• Closer genes usually have a higher pairwise alignment score → species-specific top-scoring hit
• The corresponding ortholog may be missing → “bidirectional best hit” (BBH)

Refinements of the Basic Idea: “stable pairs”
• Use distance instead of score
• Take into account the variance of distance estimates
• Relax the top/smallest requirement to include more than one ortholog
• Detect differential gene losses
[Figure: tree with leaves a, b1, b2, c1, c2 and labels S1, S2, marking gene duplication and speciation events]

Detect Gene Losses
[Figure, built up over several slides: a duplication followed by a speciation, with differential gene losses]
• (x1, z3) and (y2, z4) are stable pairs
• d(x1, z3) < d(x1, z4)
• d(y2, z4) < d(y2, z3)
• d(x1, z4) = d(y2, z3)
• All relations take the variance of the distance estimates into account
Dessimoz, Boeckmann, et al., Nucleic Acids Res, 2006

Are Two Distances Significantly Different?
Question: is $d_{AB} = d_{BC}$?
$\hat{\Delta} = \hat{d}_{AB} - \hat{d}_{BC}$
$\sigma^2_{\hat{\Delta}} = \sigma^2_{AB} + \sigma^2_{BC} - 2\,\mathrm{cov}(\hat{d}_{AB}, \hat{d}_{BC})$
The two distances are not considered significantly different if $|\hat{\Delta}| < k \cdot \sigma_{\hat{\Delta}}$
[Figure: tree with labels F, A, B, C, D]

Two Phases of Pairwise Approach
• Orthology Inference
• Clustering from Pairs

Grouping of Orthologs
• List of n(n-1)/2 pairs scales poorly
• ...
and does not present data in an insightful way
→ Cluster pairs into groups
Non-trivial: orthology is non-transitive

Grouping of Orthologs
• If interested in gene x: → all genes orthologous to x
• COG database: → “triangles” of orthologs, merge triangles with a common side
• OMA Groups: → all pairs in a group are orthologs
• Hierarchical: → orthologs and “in-paralogs” with respect to a taxonomic range
Tatusov et al., Science, 1997

Orthology Graph
[Figure: orthology graph shown next to the corresponding gene tree and species tree; for a node n_i the labels S(n_i) and L(n_i) are marked]

OMA Groups
[Figure: orthology graph on genes w1, x1, y2, z1, z2]
• Complete cliques in the orthology graph

Hierarchical Groups
[Figure: orthology graph, species tree, and gene tree; for a node n_i with labels S(n_i) and L(n_i), the induced forest of gene trees and the induced orthology subgraph are shown]
• Connected components in the induced subgraph

Putting it Together: OMA algorithm
[Figure: pipeline starting from all protein sequences from full genomes and passing through AP, CP, SP, VP, GP; BP = SP \ VP = Paralogs; VP = Orthologs]

Pairs                  Evolutionary Relation
All Pairs (AP)         Any
Candidate Pairs (CP)   Homologs
Stable Pairs (SP)      Orthologs, Pseudo-Orthologs
Broken Pairs (BP)      Paralogs
Verified Pairs (VP)    Orthologs
Group Pairs (GP)       Close Orthologs

Roth et al., BMC Bioinformatics, 2008
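A minimal sketch of why grouping is non-trivial, contrasting the two grouping criteria named on the slides: connected components versus complete cliques of the orthology graph. The gene names and the pair list are hypothetical and chosen only to show non-transitivity; this is not the OMA implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(pairs):
    """Return an adjacency map for an undirected orthology graph."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """Group genes that are linked by any chain of orthology edges."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(graph[gene] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(graph, genes):
    """True if every pair of genes in the group is directly orthologous."""
    return all(b in graph[a] for a, b in combinations(sorted(genes), 2))


# Hypothetical case: x1 is orthologous to y1 and to y2, but y1 and y2 arose
# by a duplication after the speciation, so they are paralogs (no edge).
pairs = [("x1", "y1"), ("x1", "y2")]
graph = build_graph(pairs)
components = connected_components(graph)

print(components)                       # a single component holding all three genes
print(is_clique(graph, components[0]))  # False: y1 and y2 are not orthologs
print(is_clique(graph, {"x1", "y1"}))   # True: every pair in the group is orthologous
```

The component criterion puts y1 and y2 in one group even though they are paralogs, whereas the clique criterion only accepts groups in which all pairs are orthologs. Finding the largest cliques in a real orthology graph is computationally much harder than this membership check.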
Gene/Species Tree Reconciliation
[Figure: gene tree, species tree, and reconciled tree, with events marked as speciation, duplication, or gene loss]
1. Mapping
2. Duplication node assignment
Parsimony assumption on the number of duplications

Possible Problems
• Tree rooting
  • Midpoint rooting assumes an ultrametric tree
  • Minimum duplication rooting is based on parsimony
  • Selection of an outgroup is tricky
• Tree inference errors
  • Bootstrap
  • Use multifurcating trees
• Parsimony assumption may be inappropriate (high rates of duplications & losses)

Xenology

Lateral Gene Transfer (LGT)

Types of LGTs
• novel gene acquisition
• orthologous gene replacement

Detection of LGT Events
Parametric methods (based on genome signatures)
• GC-content
• Codon bias
• k-nucleotide frequencies
Phylogenetic methods
• Explicit: tree reconciliation
• Implicit: pairwise distance methods

Genome signatures
[Figure from Deschavanne et al., Fig. 1: The fractal nature of chaos game representation (CGR) images. The frequencies of words up to eight letters long used by the archaebacterium Archaeoglobus fulgidus are represented from left to right and from top to bottom. For single-letter words, frequencies of letters from only one strand are represented. The gray scale is fitted to the frequency values in order to use its full range of variation for each CGR image.]

Genome signatures
[Figure: representation of 7-nucleotide frequencies]

Parametric Methods: Limitations
• “Amelioration”: adjustment of a laterally transferred gene to the nucleotide composition of its new host
  → restricted to relatively recent transfers
• Donor species are difficult to identify
• Only works for genes transferred from species with significantly different parameters

Tree reconciliation with LGT
• Find a reconciled tree with nodes marked as duplication, speciation, or LGT events.
• Often based on subtree prune and regraft (SPR) distance (hard to compute!)
• Still subject to the difficulties of accurate tree building
Yufeng Wu, Bioinformatics, v. 25, pp. 190-196, 2009.

Algorithm: Model & assumptions
• Input: groups of non-paralogous sequences → at most one sequence per species
• Interspecies distance: average distance over groups
• gene tree = rate_f × species tree

Algorithm
for all orthologous families f do
  for all pairs d, r with a seq. in f do
    if $2 \ln \frac{l(f, d, r, \delta_{ML})}{l(f, d, r, \delta = \infty)} > \chi^2_1(\alpha)$ then
      the triplet (f, d, r) is an LGT
Here the numerator is the likelihood with LGT, l(LGT), and the denominator the likelihood without LGT, l(no LGT).
[Figure: tree with lineages labeled r and d]

Summary
• Two genes of common origin are called homologs. They are generally identified through sequence alignment and scoring.
• Homologs are commonly divided into
  • Orthologs (diverged through speciation)
  • Paralogs (diverged through duplication)
  • Xenologs (diverged through lateral transfer)
• The classification of pairs of homologs into these categories is a difficult problem. Common methods are typically based on tree reconciliation or pairwise analysis.
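As an illustration of the pairwise analysis mentioned in the summary, the “bidirectional best hit” (BBH) idea from the Basic Idea slide can be sketched as follows. The score table, the gene names, and the species prefixes "A:" and "B:" are hypothetical; real pipelines would take scores from an all-against-all alignment and, as the refinements slide notes, would rather work with distances and their variances.

```python
def best_hits(scores, query_species, target_species):
    """Map each gene of query_species to its top-scoring hit in target_species.

    `scores` maps (gene_a, gene_b) -> alignment score, each pair stored once.
    Gene names are assumed to start with a species prefix such as "A:".
    On ties, the hit encountered first is kept.
    """
    best = {}
    for (g1, g2), score in scores.items():
        for query, target in ((g1, g2), (g2, g1)):
            if query.startswith(query_species) and target.startswith(target_species):
                if query not in best or score > best[query][1]:
                    best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores, species_a, species_b):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_to_b = best_hits(scores, species_a, species_b)
    b_to_a = best_hits(scores, species_b, species_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Hypothetical alignment scores between two genes of species A and two of B.
scores = {
    ("A:a1", "B:b1"): 900,
    ("A:a1", "B:b2"): 400,
    ("A:a2", "B:b1"): 350,
    ("A:a2", "B:b2"): 300,
}

print(bidirectional_best_hits(scores, "A:", "B:"))
```

In this toy table a2's best hit is b1, but b1's best hit is a1, so only (a1, b1) is reported. This is the situation the slide describes: when the corresponding ortholog is missing, the one-directional top hit would pair a gene with a paralog.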
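The decision rule on the Algorithm slide is a likelihood-ratio test, and its final step can be sketched as below. The log-likelihood values are hypothetical placeholders; computing l(f, d, r, δ) itself is not shown. The sketch assumes the χ² reference distribution with one degree of freedom used on the slide, and reads d and r as the two lineages involved in the putative transfer.

```python
from statistics import NormalDist


def chi2_critical_1df(alpha):
    """Upper-tail critical value of the chi-square distribution with 1 d.f.

    A chi-square variable with one degree of freedom is the square of a
    standard normal variable, so its quantile follows from the normal one.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def is_lgt_candidate(loglik_lgt, loglik_no_lgt, alpha=0.05):
    """Return (decision, statistic) for the test 2*(lnL_LGT - lnL_noLGT) > chi2_1(alpha)."""
    statistic = 2.0 * (loglik_lgt - loglik_no_lgt)
    return statistic > chi2_critical_1df(alpha), statistic


# Hypothetical log-likelihoods for two triplets (f, d, r).
strong = is_lgt_candidate(loglik_lgt=-1234.5, loglik_no_lgt=-1240.1)
weak = is_lgt_candidate(loglik_lgt=-1234.5, loglik_no_lgt=-1235.0)

print(round(chi2_critical_1df(0.05), 3))  # critical value at alpha = 0.05
print(strong)                             # statistic well above the threshold
print(weak)                               # statistic below the threshold
```

Because the test is repeated for every family and every pair (d, r), a real analysis would also need to account for multiple testing, which this sketch leaves out.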