Bioinformatics: In-depth, Spring 2012
Pairwise Evolutionary Relations
Homology, Orthology, Paralogy, Xenology
Many slides based on lecture by Christophe Dessimoz
Manuel Gil
people.inf.ethz.ch/mgil/
Pairwise Relations
Fitch 1970
Systematic Zoology, Vol. 19, No. 2 (Jun., 1970), pp. 99-113
Pairwise Evolutionary Relations
DLIGHT: LGT Detection Using Pairwise Distances in a Statistical Framework
Conclusion
Evolutionary Relations
Two genes (or characters) are homologs if they have a common ancestor.
Subtypes:
• Orthologs: speciation
• Paralogs: duplication
• Xenologs: lateral transfer
[Figure: gene tree with a gene duplication, a speciation into species S1 and S2, and a lateral gene transfer; genes a, b1, b2, c1, c2]
• Pairwise evolutionary relation
  • orthologs
  • paralogs
  • xenologs
• Family of orthologs: no paralogs
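As an illustration of the three subtypes, the following Python sketch labels every pair of genes by the event at their last common ancestor in a gene tree whose internal nodes already carry an event label. The toy tree, the tuple encoding, and the function names are hypothetical and not taken from the lecture. Calling a pair xenologs only when the transfer sits exactly at the last common ancestor is a simplification.

```python
# Hypothetical toy gene tree: internal nodes are (event, left, right),
# leaves are gene names. A duplication is followed by two speciations.
TREE = ("duplication",
        ("speciation", "b1", "c1"),
        ("speciation", "b2", "c2"))

RELATION = {"speciation": "orthologs",
            "duplication": "paralogs",
            "transfer": "xenologs"}

def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def pair_relations(node, out=None):
    """Label each leaf pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    # Pairs split between the two subtrees have this node as their LCA.
    for x in leaves(left):
        for y in leaves(right):
            out[frozenset((x, y))] = RELATION[event]
    pair_relations(left, out)
    pair_relations(right, out)
    return out

relations = pair_relations(TREE)
for pair, rel in sorted((sorted(p), r) for p, r in relations.items()):
    print(pair, rel)
```

In this toy tree b1 and c1 come out as orthologs, while b1 and c2 come out as paralogs even though they sit in different species, because their last common ancestor is the duplication.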
Applications
• Homology
  • Phylogenetic tree building (homology at level of gene and character matters)
  • Evolution of gene families
• Orthologs (as opposed to paralogs, xenologs)
  • Species tree building
  • Gene function
• All: study of genome evolution; mappings between genomes
  • Rearrangements, duplications, LGTs, ...
Limits of Orthology for Function Inference
Trends Genet (2009) vol. 25 (5) pp. 210-216
Other Usages of “Orthology” *
• Genes with same function
• Homologs in the same
genomic context
* not recommended
Homology
• Most commonly inferred by sequence similarity
  • “All-against-all”
  • BLAST
• At low similarity (20-30% identity, “twilight zone”), protein structure tends to be better conserved, but
  • structure often unknown
  • only for conserved regions
Pairwise Alignment
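The alignment figure on this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of local pairwise alignment scoring (Smith-Waterman with a linear gap penalty). The choice of local alignment, the scoring values, and the function name are assumptions made for illustration; real protein comparisons would use a substitution matrix and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # start a new local alignment
                          prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Only two rows of the dynamic programming matrix are kept, which is enough when only the score (not the alignment itself) is needed.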
Statistics
http://www.math.ku.dk/~richard/courses/binf_project/Stinus-BLAST.pdf
Alignment scores of two random (i.e., unrelated) sequences are distributed according to a Gumbel extreme value distribution (2 parameters, fat tail).
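In the Karlin-Altschul formulation used by BLAST, the Gumbel tail is usually written as P(S ≥ x) = 1 − exp(−K·m·n·e^(−λx)), where λ and K are the two parameters and m, n are the sequence lengths. The Python sketch below evaluates this formula. The function names are hypothetical, and the numeric values of λ and K in the example are illustrative placeholders, not values from the lecture.

```python
import math

def evalue(score, m, n, lam, K):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)

def pvalue(score, m, n, lam, K):
    """P(S >= score) under the Gumbel model: 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for tiny E.
    return -math.expm1(-evalue(score, m, n, lam, K))

# Illustrative placeholder parameters (assumed, not from the slides).
LAM, K_PARAM = 0.267, 0.041
for s in (30, 60, 90):
    print(s, evalue(s, 300, 300, LAM, K_PARAM), pvalue(s, 300, 300, LAM, K_PARAM))
```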
Orthology / Paralogy
Orthology Inference: Two Classes
• Pairwise Methods (“implicit”)
• Gene/Species Tree Reconciliation (“explicit”)
[Figure: species tree S over Homo sapiens, Pan troglodytes, Mus musculus, and Rattus norvegicus; gene tree G with genes G1 (Homo sapiens), G2 (Mus musculus), G3 (Rattus norvegicus), G4 (Pan troglodytes); reconciled tree R with a duplication node and one loss in each of the four species. Dufayard et al., Bioinformatics, 2005]
Two Phases of Pairwise Approach
1. Orthology Inference
2. Clustering from Pairs
Basic Idea
• Orthologs are closer than paralogs
• Closer genes usually have a higher pairwise alignment score
→ species-specific top-scoring hit
• The corresponding ortholog may be missing
→ “bidirectional best hit” (BBH)
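The bidirectional best hit idea can be sketched in a few lines of Python. The score table, gene names, and function names below are made up for illustration; in practice the scores would come from an all-against-all comparison of two genomes.

```python
def best_hits(scores):
    """scores: {(gene_a, gene_b): alignment score} between genomes A and B.

    Returns the top-scoring partner of every A gene and of every B gene.
    """
    best_ab, best_ba = {}, {}
    for (a, b), s in scores.items():
        if a not in best_ab or s > scores[(a, best_ab[a])]:
            best_ab[a] = b
        if b not in best_ba or s > scores[(best_ba[b], b)]:
            best_ba[b] = a
    return best_ab, best_ba

def bidirectional_best_hits(scores):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab, best_ba = best_hits(scores)
    return sorted((a, b) for a, b in best_ab.items() if best_ba[b] == a)

# Hypothetical scores: a2 has no true counterpart in genome B.
SCORES = {("a1", "b1"): 90, ("a1", "b2"): 60,
          ("a2", "b1"): 70, ("a2", "b2"): 40}
print(bidirectional_best_hits(SCORES))
```

Here the top hit of a2 is b1, but b1 prefers a1, so (a2, b1) is rejected. This is the case the reciprocity requirement is meant to catch.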
Refinements of the Basic Idea
• Use distance instead of score
• Take into account variance of distance estimates
• Relax top/smallest requirement to include more than one ortholog
→ “stable pairs”
• Detect differential gene losses
[Figure: gene duplication followed by speciation into S1 and S2; genes a, b1, b2, c1, c2]
Detect Gene Losses
[Figure: gene tree with a duplication followed by speciation, shown with and without losses]
• (x1, z3) & (y2, z4) are stable pairs
• d(x1, z3) < d(x1, z4)
• d(y2, z4) < d(y2, z3)
• d(x1, z4) = d(y2, z3)
• All relations considering variance of distance estimates
Dessimoz, Boeckmann, et al., Nucl Acid Res, 2006
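The distance conditions listed on this slide can be checked mechanically. The Python sketch below does so with a single fixed tolerance standing in for the variance-aware comparison; that tolerance, the data layout, and the function names are assumptions for illustration. On my reading of the slide, when all conditions hold, the genes z3 and z4 of a third genome indicate that x1 and y2 descend from different duplicates, so the pair (x1, y2) reflects differential gene loss rather than orthology.

```python
def significantly_less(d1, d2, tol):
    """d1 < d2 by more than the tolerance."""
    return d2 - d1 > tol

def approximately_equal(d1, d2, tol):
    """d1 and d2 differ by no more than the tolerance."""
    return abs(d1 - d2) <= tol

def supports_differential_loss(dist, x1, y2, z3, z4, tol):
    """Check the three distance relations from the slide.

    dist maps frozenset({gene, gene}) to an evolutionary distance.
    The stable-pair requirement on (x1, z3) and (y2, z4) is assumed
    to have been verified beforehand.
    """
    def d(p, q):
        return dist[frozenset((p, q))]
    return (significantly_less(d(x1, z3), d(x1, z4), tol)
            and significantly_less(d(y2, z4), d(y2, z3), tol)
            and approximately_equal(d(x1, z4), d(y2, z3), tol))

# Hypothetical distances (arbitrary units).
DIST = {frozenset(("x1", "z3")): 10.0, frozenset(("x1", "z4")): 50.0,
        frozenset(("y2", "z4")): 12.0, frozenset(("y2", "z3")): 49.0}
print(supports_differential_loss(DIST, "x1", "y2", "z3", "z4", tol=5.0))
```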
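Are Two Distances Significantly Different?
$d_{AB} = d_{BC}$ ?
$\hat{\Delta} = \hat{d}_{AB} - \hat{d}_{BC}$
$\hat{\sigma}^2_{\Delta} = \hat{\sigma}^2_{AB} + \hat{\sigma}^2_{BC} - 2\,\mathrm{cov}(\hat{d}_{AB}, \hat{d}_{BC})$
$|\hat{\Delta}| < k \cdot \hat{\sigma}_{\Delta}$
[Figure: diagram with labels F, A, B, C, D]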
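A minimal Python sketch of this test, assuming the two distance estimates, their variances, and their covariance are already available. The function name and the default k = 1.96 are assumptions; the slide leaves k unspecified.

```python
import math

def distances_differ(d_ab, d_bc, var_ab, var_bc, cov, k=1.96):
    """Return True if d_ab and d_bc differ by at least k standard deviations.

    The variance of the difference is var_ab + var_bc - 2 * cov.
    """
    delta = d_ab - d_bc
    var_delta = var_ab + var_bc - 2.0 * cov
    sigma_delta = math.sqrt(max(var_delta, 0.0))  # guard against rounding below 0
    return abs(delta) >= k * sigma_delta

# Same estimates, without and with a positive covariance.
print(distances_differ(1.0, 1.1, 0.01, 0.01, cov=0.0))
print(distances_differ(1.0, 1.1, 0.01, 0.01, cov=0.009))
```

The two calls show why the covariance term matters: when both distances share part of their path (here via sequence B), their estimates are positively correlated, the variance of the difference shrinks, and a smaller difference becomes significant.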
Grouping of Orthologs
• List of n(n-1)/2 pairs scales poorly
• ... and does not present data in an insightful way
→ Cluster pairs into groups
Non-trivial: orthology is not transitive
Grouping of Orthologs
• If interested in gene x:
→ all genes orthologous to x
• COG database:
→ “triangles” of orthologs, merge
triangles with a common side
• OMA Groups:
→ all pairs in group are orthologs
• Hierarchical:
→ orthologs and “in-paralogs” with
respect to taxonomic range
Tatusov et al. Science 1997
Hierarchical Groups
[Figure: orthology graph, gene tree, and species tree. Gene-tree nodes are marked as speciation or duplication; orthology-graph edges are marked as “orthologs of induced subgraph G_S(ni)” or “other orthologs”. For a gene-tree node ni, S(ni) and L(ni) are indicated.]
OMA Groups
Complete Cliques in Orthology Graph
[Figure: orthology graph over genes w1, x1, y2, z1, z2]
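To make the clique idea concrete, the Python sketch below enumerates maximal cliques of a small orthology graph with the basic Bron-Kerbosch recursion. The gene names are borrowed from the slide, but the edges are invented, and exhaustive enumeration is used here only for illustration; it is not claimed to be the procedure OMA uses.

```python
def adjacency(edges):
    """Build {node: set(neighbours)} from an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already explored vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Hypothetical orthology edges; z1 and z2 are in the same genome.
EDGES = [("w1", "x1"), ("w1", "z1"), ("x1", "z1"),
         ("x1", "z2"), ("y2", "z2")]
GROUPS = maximal_cliques(adjacency(EDGES))
for group in sorted(sorted(g) for g in GROUPS):
    print(group)
```

In this toy graph x1 is orthologous to both z1 and z2, yet z1 and z2 are not orthologous to each other, so no single clique can contain all three. This is the non-transitivity that makes grouping non-trivial.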
Hierarchical Groups
[Figure: orthology graph, species tree, and gene tree; for the species of the desired taxonomic level, the induced forest of gene trees and the induced orthology subgraph are shown]
Connected Components in Induced Subgraph
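A minimal Python sketch of this step: restrict the orthology graph to the genes of the species in the chosen taxonomic range, then report connected components. The gene names, species assignments, edges, and function name are hypothetical.

```python
def hierarchical_groups(edges, species_of, clade):
    """Connected components of the orthology graph induced by a clade.

    edges: iterable of (gene, gene) orthology pairs
    species_of: {gene: species}
    clade: set of species defining the taxonomic range
    """
    genes = {g for g, s in species_of.items() if s in clade}
    adj = {g: set() for g in genes}
    for a, b in edges:
        if a in genes and b in genes:      # keep only induced edges
            adj[a].add(b)
            adj[b].add(a)
    groups, seen = [], set()
    for start in sorted(genes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                        # depth-first traversal
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        groups.append(component)
    return groups

# Hypothetical data: a duplication in the mammal lineage after the split from fly.
SPECIES_OF = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "f1": "fly"}
ORTHO_EDGES = [("h1", "m1"), ("h2", "m2"),
               ("h1", "f1"), ("h2", "f1"), ("m1", "f1"), ("m2", "f1")]
print(hierarchical_groups(ORTHO_EDGES, SPECIES_OF, {"human", "mouse"}))
print(hierarchical_groups(ORTHO_EDGES, SPECIES_OF, {"human", "mouse", "fly"}))
```

At the mammal level the two copies fall into separate groups; at the wider level that includes fly, everything merges into one group, which then contains the mammalian copies as in-paralogs with respect to that range.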
Putting it Together: OMA algorithm
Input: all protein sequences from full genomes
AP → CP → SP → VP → GP
BP = SP \ VP

Pairs                   Evolutionary Relation
All Pairs (AP)          Any
Candidate Pairs (CP)    Homologs
Stable Pairs (SP)       Orthologs, Pseudo-Orthologs
Broken Pairs (BP)       Paralogs
Verified Pairs (VP)     Orthologs
Group Pairs (GP)        Close Orthologs

Roth et al., BMC Bioinformatics, 2008
Gene/Species Tree Reconciliation
[Figure: gene tree, species tree, and the reconciled tree in full and in simple representation; nodes marked as speciation, duplication, or gene loss]
1. Mapping
2. Duplication node assignment
Parsimony assumption on the number of duplications
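One common way to realise these two steps is the last-common-ancestor (LCA) mapping: each gene-tree node is mapped to the smallest species-tree clade containing all species below it, and a node is called a duplication if it maps to the same clade as one of its children. The Python sketch below follows that scheme. The nested-tuple encoding and function names are assumptions, and the example trees are my reading of the four-species figure cited earlier (Dufayard et al.), not a verified transcription.

```python
def clades(tree):
    """List the leaf set of every node of a nested-tuple species tree.

    The first entry is always the clade of the root of `tree`.
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = tree
    cl, cr = clades(left), clades(right)
    return [cl[0] | cr[0]] + cl + cr

def lca_clade(species, all_clades):
    """Smallest species-tree clade that contains all given species."""
    return min((c for c in all_clades if species <= c), key=len)

def reconcile(node, all_clades, events):
    """Map a gene-tree node to a clade; record an event per internal node.

    Gene-tree leaves are given as the species the gene belongs to.
    """
    if isinstance(node, str):
        return frozenset([node])
    left, right = node
    m_left = reconcile(left, all_clades, events)
    m_right = reconcile(right, all_clades, events)
    m_node = lca_clade(m_left | m_right, all_clades)
    # Duplication: the node maps to the same clade as one of its children.
    event = "duplication" if m_node in (m_left, m_right) else "speciation"
    events.append((sorted(m_node), event))
    return m_node

SPECIES_TREE = (("Homo sapiens", "Pan troglodytes"),
                ("Mus musculus", "Rattus norvegicus"))
# Assumed gene tree: (G1 human, G2 mouse) and (G4 chimp, G3 rat).
GENE_TREE = (("Homo sapiens", "Mus musculus"),
             ("Pan troglodytes", "Rattus norvegicus"))

EVENTS = []
reconcile(GENE_TREE, clades(SPECIES_TREE), EVENTS)
for clade, event in EVENTS:
    print(event, clade)
```

For these trees the root of the gene tree is inferred as the single duplication, with the two nodes below it as speciations. Counting the implied losses is left out of the sketch.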
Possible Problems
• Tree rooting
  • Midpoint rooting assumes an ultrametric tree
  • Minimum duplication based on parsimony
  • Selection of outgroup tricky
• Tree inference errors
  • Bootstrap
  • Use multifurcating trees
• Parsimony assumption may be inappropriate (high rates of duplications & losses)
Xenology
Lateral Gene Transfer (LGT)
Types of LGTs
• novel gene acquisition
• orthologous gene replacement
Detection of LGT Events
Parametric methods (based on genome signatures)
• GC-content
• Codon bias
• k-nucleotide frequencies
Phylogenetic methods
• Explicit: tree reconciliation
• Implicit: pairwise distance methods
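As a rough illustration of the parametric signals, the Python sketch below computes GC content and k-nucleotide frequencies and compares a gene against a reference signature with an L1 distance. The function names, the choice of L1 distance, and the toy sequences are assumptions; any threshold for calling a gene atypical would have to be calibrated on real data.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-nucleotide words."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def signature_distance(f, g):
    """L1 distance between two k-mer frequency dictionaries."""
    return sum(abs(f.get(w, 0.0) - g.get(w, 0.0)) for w in set(f) | set(g))

# Toy sequences standing in for a host genome and a candidate gene.
HOST = "ATATATTAATTATAATATTA" * 5
GENE = "GCGCGGCCGCGGGCCGCGCC"
print(gc_content(HOST), gc_content(GENE))
print(signature_distance(kmer_freqs(HOST, 2), kmer_freqs(GENE, 2)))
```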
Genome signatures
Deschavanne et al.
FIG. 1. The fractal nature of chaos game representation (CGR) images. The frequencies of words up to eight letters long used by the archaebacteria Archaeoglobus fulgidus are represented from left to right and from top to bottom. For single-letter words, frequencies of letters from only one strand are represented. The gray scale is fitted to the frequency values in order to use its full range of variation for each CGR image.
Genome signatures
Representation of 7-nucleotide frequencies
Parametric Methods: Limitations
• “Amelioration”: adjustment of a laterally transferred gene to the nucleotide composition of its new host.
→ restricted to relatively recent transfers
• Donor species are difficult to identify
• Only works for genes transferred from species with significantly different parameters
Tree reconciliation with LGT
• Find a reconciled tree with nodes marked
as duplication, speciation or LGT events.
• Often based on subtree prune and regraft
(SPR) distance (hard to compute!)
• Still subject to difficulties of accurate tree
building
Yufeng Wu, Bioinformatics, v.25, pp. 190-196, 2009.
Algorithm
Model & assumptions
• Input: groups of non-paralogous sequences
→ at most one sequence per species
• Interspecies distance: average distance over groups
• gene tree = rate_f × species tree
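These modelling assumptions can be sketched in Python as follows: interspecies distances are averaged over families, and each family gets a single rate factor that scales those averages to its own distances. The least-squares estimator for the rate, the data layout, and the function names are assumptions made for illustration; the slides do not state how rate_f is estimated.

```python
def interspecies_distances(families):
    """Average the distance of each species pair over all families.

    families: list of {(species, species): distance}, with at most one
    sequence per species in each family.
    """
    sums, counts = {}, {}
    for fam in families:
        for pair, d in fam.items():
            sums[pair] = sums.get(pair, 0.0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

def family_rate(fam, species_dist):
    """Scale factor r minimising sum((d_f - r * D)^2) over species pairs.

    Assumed estimator, chosen only for this sketch.
    """
    num = sum(d * species_dist[pair] for pair, d in fam.items())
    den = sum(species_dist[pair] ** 2 for pair in fam)
    return num / den

# Two hypothetical families over species A, B, C.
FAM_SLOW = {("A", "B"): 1.0, ("A", "C"): 2.0}
FAM_FAST = {("A", "B"): 3.0, ("A", "C"): 6.0}
SPECIES_DIST = interspecies_distances([FAM_SLOW, FAM_FAST])
print(SPECIES_DIST)
print(family_rate(FAM_SLOW, SPECIES_DIST), family_rate(FAM_FAST, SPECIES_DIST))
```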
Algorithm
for all orthologous families f do
    for all pairs d, r with a seq. in f do
        if $2 \ln \frac{l(f,d,r,\delta_{ML})}{l(f,d,r,\delta=\infty)} > \chi^2_{(\alpha,1)}$ then
            the triplet (f,d,r) is an LGT event
The test has the general form $\ln \frac{l(\mathrm{LGT})}{l(\mathrm{no\ LGT})} > f(\alpha)$.
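The decision rule in the loop is a likelihood-ratio test with one degree of freedom. A minimal Python sketch, assuming the two maximised log-likelihoods have already been computed elsewhere; the function name, argument names, and default α are hypothetical. For one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)), so no external library is needed.

```python
import math

def lgt_likelihood_ratio_test(loglik_lgt, loglik_no_lgt, alpha=0.05):
    """Compare an LGT model against the no-LGT model.

    loglik_lgt: log-likelihood at the maximum-likelihood delta
    loglik_no_lgt: log-likelihood with delta fixed at infinity
    Returns (statistic, p_value, is_significant).
    """
    stat = max(2.0 * (loglik_lgt - loglik_no_lgt), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    # p_value < alpha is equivalent to stat exceeding the critical value.
    return stat, p_value, p_value < alpha

print(lgt_likelihood_ratio_test(-100.0, -105.0))
print(lgt_likelihood_ratio_test(-100.0, -101.0))
```

Because every triplet (f, d, r) is tested, a real analysis would also have to account for multiple testing when choosing α.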
Summary
• Two genes of common origin are called homologs. Generally
identified through sequence alignment and scoring.
• Homologs are commonly divided into
• Orthologs (diverged through speciation)
• Paralogs (diverged through duplication)
• Xenologs (diverged through lateral transfer)
• The classification of pairs of homologs into these categories
is a difficult problem. Common methods are typically based
on tree reconciliation or pairwise analysis.