Bas E. Dutilh, Vera van Noort, René TJM van der Heijden

A comprehensive comparison of phylogenomics
and orthology methods applied to the Fungi
CMBI / NCMLS, Radboud University Nijmegen
The Netherlands www.cmbi.ru.nl/~dutilh
Bas E. Dutilh, Vera van Noort, René T. J. M. van der Heijden, Martijn A. Huynen, Berend Snel
[Figure 1 plot. x-axis: number of phylogenetic trees (0 to 60); y-axis: average score; n = 100 ± S.D.]
Several phylogenomic approaches have
surfaced that fundamentally differ in the
level at which they integrate the phylogenetic information: gene content,
superalignment, superdistance and
supertree (Fig. 2).
We systematically test these phylogenomic methods on the Fungi, the eukaryotic clade with the largest number of
sequenced genomes. Many of the taxonomic relationships in the fungal kingdom have been resolved while several
details remain unclear (Fig. 3).
[Figure 3 tree. Leaves: Sce, Spa, Smi, Sku, Sba, Sca, Cgl, Ago, Kla, Kwa, Skl, Cal, Dha, Yli, Fgr, Tre, Mgr, Ncr, Afu, Ani, Sno, Spo, Pch, Cne, Uma, Ecu. Labeled nodes include Saccharomyces s.s., Saccharomyces s.l., Hemiascomycota, Archiascomycota, Euascomycota, Ascomycota, Hypocreales, Sordariomycetes, Eurotiomycetes, Hymenomycetes and Basidiomycota. Each node is annotated with the percentage of trees that recover the node in the gene content (gc), superalignment (sa), superdistance (sd) and supertree (st) approaches.]
Figure 3. Consensus phylogeny of the Fungi. The labeled
nodes are supported by several literature references. We use
these nodes to score the reconstructed phylogenomic trees.
Unresolved issues are indicated by multifurcating nodes (bold
lines). The colored numbers at each node indicate the percentage of the trees in each of the phylogenomic approaches
that recovered this node correctly.
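
The node-based scoring described in this caption could be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes rooted trees written as nested tuples of species labels, and the names `clades_of`, `score` and `recovery_percentage` are hypothetical.

```python
def clades_of(tree):
    """Return the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):  # a leaf is a species label such as "Sce"
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def score(tree, target_nodes):
    """Count how many target nodes of the consensus phylogeny the tree recovers."""
    recovered = clades_of(tree)
    return sum(1 for target in target_nodes if target in recovered)


def recovery_percentage(trees, target):
    """Percentage of trees in one approach that recover a single target node."""
    hits = sum(1 for tree in trees if target in clades_of(tree))
    return 100.0 * hits / len(trees)


# Two toy trees (assumed topologies, for illustration only).
tree_a = ((("Sce", "Spa"), "Smi"), "Sku")
tree_b = ((("Sce", "Smi"), "Spa"), "Sku")
targets = [frozenset({"Sce", "Spa"}), frozenset({"Sce", "Spa", "Smi"})]

score_a = score(tree_a, targets)
score_b = score(tree_b, targets)
pct = recovery_percentage([tree_a, tree_b], frozenset({"Sce", "Spa"}))
```

Representing each node by its set of descendant leaves makes the comparison independent of how the branches are ordered in the drawing.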
[Figure 2 schematic. Panel labels: orthology (InparanOGs; duOGs / triOGs, distance / ML), cluster orthology, sequence similarity (duOGs / triOGs), unambiguous, pan-OGs (1:1), presence/absence, superalignment, distance matrices, distance, parsimony, likelihood, superdistance, gene family trees, supertree.]

Example presence/absence profiles shown in the figure:
  Sce  100001101100001101
  Spa  111101000100001101
  Smi  111001000111001000
  Sku  111101000111001000
  Sba  111101000001010011
  Sca  000010011001010011
  Cgl  001010011111010001
  Skl  001110010111101000
  Kwa  000110110111001000
  Kla  000110111000110110
  Ago  001010011010000010

Example gene family distance matrix shown in the figure:
  10
  Skl_Skluy_2050.2   0.00 0.30 0.32 0.27 0.32 0.35 0.33 0.28 0.33 0.33
  Ago_Q752K0         0.30 0.00 0.34 0.26 0.29 0.36 0.29 0.24 0.25 0.24
  Kla_KLLA0B08041g   0.32 0.34 0.00 0.32 0.27 0.31 0.39 0.26 0.30 0.29
  Kwa_Kwal_9254      0.27 0.26 0.32 0.00 0.28 0.35 0.40 0.21 0.34 0.34
  Sca_Scast_683.20   0.32 0.29 0.27 0.28 0.00 0.31 0.32 0.18 0.21 0.21
  Cgl_CAGL0G04983g   0.35 0.36 0.31 0.35 0.31 0.00 0.25 0.16 0.18 0.18
  Sba_Contig545.15   0.33 0.29 0.39 0.40 0.32 0.25 0.00 0.12 0.09 0.08
  Sku_Skud_1678.1    0.28 0.24 0.26 0.21 0.18 0.16 0.12 0.00 0.06 0.06
  Sce_YLR363W-A      0.33 0.25 0.30 0.34 0.21 0.18 0.09 0.06 0.00 0.01
  Spa_ORFP_15496     0.33 0.24 0.29 0.34 0.21 0.18 0.08 0.06 0.01 0.00

Average scores of the resulting trees: gene content tree, 10.38; superalignment tree, 18.21; superdistance tree, 17.33; supertree, 17.50.
Figure 2. Making phylogenomic trees. Phylogenomics follows the steps of phylogenetics: (1) definition of orthologs, (2) multiple
alignment, (3) distance, likelihood or parsimony and (4) tree reconstruction. The parallel phylogenetic information (gray boxes)
can be integrated to a phylogenomic scale (white boxes) in each step; this defines the phylogenomic approach.
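
To make the integration step concrete, the sketch below combines per-gene distance matrices into one superdistance matrix by averaging each species pair over the gene families in which both species occur. The poster does not state how the matrices were combined, so plain averaging is an assumption, and the function name `superdistance` is hypothetical. The first matrix uses three values from the example matrix in Fig. 2; the second matrix is invented for illustration.

```python
from collections import defaultdict


def superdistance(gene_matrices):
    """Average pairwise distances over all gene families that contain the pair.

    Each matrix maps frozenset({species_a, species_b}) to a distance.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for matrix in gene_matrices:
        for pair, distance in matrix.items():
            totals[pair] += distance
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Values taken from the example matrix in Fig. 2 (Sce, Spa, Sku).
gene_1 = {
    frozenset(("Sce", "Spa")): 0.01,
    frozenset(("Sce", "Sku")): 0.06,
    frozenset(("Spa", "Sku")): 0.06,
}
# Hypothetical second gene family in which the Spa-Sku pair is missing.
gene_2 = {
    frozenset(("Sce", "Spa")): 0.03,
    frozenset(("Sce", "Sku")): 0.10,
}

combined = superdistance([gene_1, gene_2])
```

A single tree is then built from the combined matrix, whereas the supertree approach would build one tree per gene family first and merge the trees afterwards.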
The phylogenomic methods that are most
successful at recovering the nodes of the
consensus phylogeny are found among
the superalignment and the supertree approaches (Fig. 4b).
[Figure 4 panels (a) and (b). Legend: dimorphic fungi, filamentous fungi, yeasts. Scale bar: 0.10.]
Apart from the gene content trees, all
phylogenomic trees recover many target
nodes. They are also quite unanimous
about the solutions for the unresolved
nodes of the fungal taxonomy (Table 1).
Figure 4. Phylogenomic trees. (a) Gene content tree. BioNJ distance tree based on pairwise orthologs. (b) One of the two highest-scoring phylogenomic trees. This topology was recovered by four superalignment trees and one supertree using cluster orthology (unambiguous orthologs or pan-orthologs).

Table 1. Number of trees in each phylogenomic approach that support the possible branchings in the unresolved nodes.
Rows (number of trees in parentheses): gene content (13), superalignment (14), superdistance (15), supertree (12).
Column labels: Archiascomycota primitive in Ascomycota; Euascomycota primitive in Ascomycota; Hemiascomycota primitive in Ascomycota.
[Cell values as extracted, in blocks of four rows in the row order above; the assignment of blocks to columns is not recoverable from the transcription.]
  Block 1: 10 / 14 / 14 / 12
  Block 2: 0 0 1 0 0 8 / 0 0 0 0 14 0 / 0 0 0 0 15 0 / 0 0 0 0 12 0
  Block 3: 0 0 / 0 0 / 0 0 / 0 0
  Block 4: 0 4 5 4 / 0 14 0 0 / 1 13 0 2 / 0 12 0 0
[Figure 4 leaf labels: S. cerevisiae, S. paradoxus, S. bayanus, S. mikatae, S. kudriavzevii, S. castellii, C. glabrata, A. gossypii, K. lactis, K. waltii, S. kluyveri, D. hansenii, C. albicans, Y. lipolytica, S. pombe, F. graminearum, T. reesei, M. grisea, N. crassa, S. nodorum, A. fumigatus, A. nidulans, P. chrysosporium, C. neoformans, U. maydis, E. cuniculi.]
[Table 1 column labels for the unresolved node Ago, Kla, Kwa, Skl: Kla, Kwa, Skl; Ago, Kwa, Skl; Ago, Kla, Skl; Ago, Kla, Kwa.]
The orthology definition and the phylogenomic approach have more influence on the final topology than the precise methodological details, such as
using a distance or likelihood approach
for the reconstruction of the tree.
[Table 1 column labels, continued: Kwa, Skl; Kla, Skl; Kluyveromyces; Ago, Skl; Ago, Kwa; Ago, Kla.]
Several relevant trends in the resulting topologies can be explained by the methods.
The gene content trees, for example, cluster the Euascomycota with the Basidiomycota.
This can be explained by their shared filamentous lifestyle, which leads to an apparent
convergence in gene content (Fig. 4a).
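
The effect of convergent gene content can be illustrated with a distance computed from presence/absence profiles. The poster does not name the distance measure behind the gene content trees, so the Jaccard distance used here is an assumption, and the function name `gene_content_distance` is hypothetical. The two profiles are the first two strings of the presence/absence panel of Fig. 2; pairing them with Sce and Spa follows the order of the extracted figure and is illustrative only.

```python
def gene_content_distance(profile_a, profile_b):
    """Jaccard distance between two presence/absence strings of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same orthologous groups")
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a == "1" and b == "1")
    union = sum(1 for a, b in zip(profile_a, profile_b) if a == "1" or b == "1")
    if union == 0:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - shared / union


sce = "100001101100001101"  # from the presence/absence panel of Fig. 2
spa = "111101000100001101"

d_sce_spa = gene_content_distance(sce, spa)
```

Because such a distance depends only on which genes are present, two lineages that independently gain or retain the same genes end up close together regardless of their ancestry.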
[Table 1 column labels, continued: S. nodorum primitive in Euascomycota; Eurotiomycetes primitive in Euascomycota; Sordariomycetes primitive in Euascomycota.]
In parallel to these four phylogenomic
approaches, we also compare the performance of different orthology prediction methods with higher and lower resolution: pairwise orthology (Inparanoid),
cluster orthology (a COG-like algorithm)
and tree-based orthology.
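
As an illustration of the lowest-resolution case, pairwise orthology between two genomes can be approximated by reciprocal best hits. This is a simplified stand-in and not the Inparanoid algorithm itself, which additionally groups in-paralogs. The function names, gene identifiers and similarity scores below are all hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs that are each other's best hit in both directions."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genes of genome A and genome B.
a_vs_b = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b2": 180.0, "b1": 60.0},
    "a3": {"b2": 120.0},
}
b_vs_a = {
    "b1": {"a1": 245.0},
    "b2": {"a2": 175.0, "a3": 110.0},
}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` finds `b2` as its best hit, but `b2` prefers `a2`, so `a3` is left without a pairwise ortholog. Cluster and tree-based methods differ in how they treat such cases across more than two genomes.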
Figure 1. More high-quality data leads to a better phylogeny. Supertrees (Consense)
were composed from an increasing number of phylogenetic distance trees of cluster
pan-orthologs.
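
The experiment behind this figure could be organized as a resampling loop. The sketch below assumes that "n = 100" refers to 100 random draws per data point, which the poster does not state explicitly. The supertree construction and the scoring are passed in as callables (`build_supertree` and `score_tree`, both hypothetical names), because the actual steps (Consense and the node scoring of Fig. 3) are external to this example.

```python
import random
import statistics


def score_curve(gene_trees, sizes, build_supertree, score_tree,
                replicates=100, seed=0):
    """Mean and standard deviation of the supertree score per sample size."""
    rng = random.Random(seed)  # fixed seed so that the curve is reproducible
    curve = {}
    for size in sizes:
        scores = []
        for _ in range(replicates):
            sample = rng.sample(gene_trees, size)  # draw without replacement
            scores.append(score_tree(build_supertree(sample)))
        curve[size] = (statistics.mean(scores), statistics.stdev(scores))
    return curve


# Dummy stand-ins, for illustration only: the "supertree" is the sample
# itself and its "score" is the number of trees it was built from.
dummy_trees = list(range(60))
curve = score_curve(dummy_trees, [10, 20, 30], lambda s: s, len)
```

With real gene trees, the mean is expected to rise and the standard deviation to shrink as the sample size grows, which is the trend the figure reports.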
Phylogenomics integrates all the phylogenetic information present in complete
genome sequences, and is quickly becoming the standard for inferring reliable
species trees. While phylogenetic trees
(based on single genes) may be unreliable due to e.g. horizontal gene transfer
or parallel gene loss, the large scale of
phylogenomic trees can average out
biases (Fig. 1).
[Table 1 cell values, continued, in extraction order: 0, 0, 2, 4, 11, 0, 0, 2, 0, 12, 1, 4.]
