Supplementary Online Material
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
Felipe A. Simão†, Robert M. Waterhouse†*, Panagiotis Ioannidis, Evgenia V. Kriventseva, and Evgeny M. Zdobnov *
Department of Genetic Medicine and Development, University of Geneva Medical School
and Swiss Institute of Bioinformatics, rue Michel-Servet 1, 1211 Geneva, Switzerland.
† Equal contribution. * To whom correspondence should be addressed:
[email protected], [email protected]
Contents:
1. BUSCO: Benchmarking Universal Single-Copy Orthologs
1.1. BUSCO selection
1.2. Hidden Markov models, ancestral sequences and block profiles
1.3. Candidate BUSCO matches from genome assemblies
1.4. Gene prediction: assessing genome assemblies and transcriptomes
1.5. BUSCO match assignment
1.6. Classification: Complete, Duplicated, Fragmented, Missing
1.7. Training Augustus gene finding parameters
2. BUSCO completeness versus N50 contiguity
3. BUSCO versus CEGMA assessment of genome assembly completeness
4. BUSCO assessments of genomes, transcriptomes, and gene sets
5. BUSCO and CEGMA analysis run-times
6. References
QUEST FOR QUALITY: “BUSCO CALIDAD”, “BUSCO QUALIDADE”
http://busco.ezlab.org
Simão, Waterhouse, et al. 2015, Supplementary Online Material: Page 1 of 13
1. BUSCO: Benchmarking Universal Single-Copy Orthologs
1.1. BUSCO selection
Benchmarking Universal Single-Copy Orthologs (BUSCO) sets are collections of orthologous groups
with near-universally-distributed single-copy genes in each species, selected from OrthoDB root-level
orthology delineations across arthropods, vertebrates, metazoans, fungi, and eukaryotes (Kriventseva, et al.,
2014; Waterhouse, et al., 2013). BUSCO groups were selected from each major radiation of the species
phylogeny by requiring genes to be present as single-copy orthologs in at least 90% of the species. In the
remaining species they may be lost or duplicated, but to ensure broad phyletic distribution they cannot all be
missing from one sub-clade. The species that define each major radiation were selected to include the majority of OrthoDB
species, excluding only those with unusually high numbers of missing or duplicated orthologs, while
retaining representation from all major sub-clades. Their widespread presence means that any BUSCO can
be expected to be found as a single-copy ortholog in any newly-sequenced genome from the
appropriate phylogenetic clade (Waterhouse, et al., 2011). A total of 38 arthropods (3’078 BUSCO groups),
41 vertebrates (4’425 BUSCO groups), 93 metazoans (1’008 BUSCO groups), 125 fungi (1’438 BUSCO
groups), and 99 eukaryotes (431 BUSCO groups), were selected from OrthoDB to make up the initial
BUSCO sets which were then filtered based on uniqueness and conservation as described below to produce
the final BUSCO sets for each clade, representing 2’675 genes for arthropods, 3’023 for vertebrates, 843 for
metazoans, 1’438 for fungi, and 429 for eukaryotes. For bacteria, 40 universal marker genes were selected
from Mende, et al. (2013).
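The selection rule described above can be illustrated with a minimal sketch. It assumes each orthologous group is given as a list of species names, one entry per member gene; the function name, argument names, and the sub-clade mapping are hypothetical and are not part of the BUSCO or OrthoDB software.

```python
from collections import Counter


def is_busco_candidate(gene_species, clade_species, subclades, min_fraction=0.90):
    """Return True if an orthologous group passes the selection rule.

    gene_species  : species name of every gene in the group (repeats = duplicates)
    clade_species : all species that define the radiation
    subclades     : hypothetical mapping of sub-clade name -> list of its species
    """
    copies = Counter(gene_species)
    # Species in which the group is present as exactly one gene.
    single_copy = [s for s in clade_species if copies.get(s, 0) == 1]
    if len(single_copy) < min_fraction * len(clade_species):
        return False
    # Broad phyletic distribution: the group may not be absent from a whole sub-clade.
    for members in subclades.values():
        if not any(copies.get(s, 0) > 0 for s in members):
            return False
    return True
```

With the 90% threshold, a group that is lost or duplicated in a few species still qualifies, provided those losses do not remove it from an entire sub-clade.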
1.2. Hidden Markov models, ancestral sequences and block profiles
Hidden Markov models: For each BUSCO group, multiple sequence alignments (MSAs) were built with
ClustalOmega (Sievers and Higgins, 2014) using the orthologous protein sequences of each BUSCO. The
MSAs were then used to build amino acid-level hidden Markov model (HMM) profiles using HMMER 3
(Eddy, 2011). Subsequently, all BUSCO input sequences were searched (hmmsearch) against the complete
library of HMM profiles to identify and remove any BUSCO groups whose members could not be reliably
distinguished from each other by their profiles, and hence ensure reliable profile-delineated orthology. In
total, 376, 852, and 156 groups were removed in this way from the arthropod, vertebrate, and metazoan sets,
respectively, while none were removed from the fungi or eukaryote sets. The remaining, reliably-distinguishable BUSCO sets were then analysed to delineate the two parameters ‘expected-score’ and
‘expected-length’ that define the BUSCO-specific cut-offs used to classify a match as orthologous or not and
as complete or not. The ‘expected-score’ cut-off is defined as 90% of the minimum bitscore from an HMM
search of all of a BUSCO group’s members against its own HMM profile (i.e. the lowest-scoring match among
the sequences used to build the profile). To be classified as a true ortholog, any BUSCO-matching gene from
the species being assessed (from its genome, transcriptome, or gene set) must score at or above the ‘expected-score’ cut-off. For a match to be classified as ‘complete’, it must satisfy the ‘expected-length’ cut-off, which
is defined using each BUSCO group’s protein length distribution (Figure S1). Any BUSCO-matching gene
from the species being assessed whose protein length falls within two standard deviations (2σ) of the
BUSCO group’s mean length is classified as ‘complete’.
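As a rough illustration of how the two cut-offs follow from a BUSCO group's own members, the sketch below derives them from a list of member bitscores and protein lengths. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def expected_score(member_bitscores, fraction=0.90):
    """'expected-score': 90% of the lowest bitscore among the group's own members."""
    return fraction * min(member_bitscores)


def expected_length_window(member_lengths, n_sigma=2.0):
    """'expected-length': interval of mean +/- 2 sigma of member protein lengths.

    Assumption: population standard deviation (pstdev); the source does not specify.
    """
    mu = mean(member_lengths)
    sigma = pstdev(member_lengths)
    return mu - n_sigma * sigma, mu + n_sigma * sigma
```

For example, a group whose weakest member scores 420 bits against the group profile would have an ‘expected-score’ of 378 bits under this definition.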
Consensus sequences: For each BUSCO group, an amino acid consensus sequence was generated from its
respective HMM profile using HMMER’s default hmmemit settings for a majority-rule consensus sequence.
These consensus sequences are used during BUSCO assessments of genome assemblies to search the
genome of the species being assessed to identify the best-matching genomic regions that may encode the
corresponding BUSCO-matching gene.
Figure S1. Distribution of the percent differences between BUSCO group member proteins and the
group’s mean protein length (negative = shorter than the mean, positive = longer than the mean, values
of one and two standard deviations are shown with lines). Insets: spread of BUSCO group member
protein lengths compared to BUSCO group mean lengths for arthropods (left) and vertebrates (right).
Block profiles: For each BUSCO group, a ‘block profile’ was built to guide automated gene predictions
with Augustus (Keller, et al., 2011). Block profiles are position-specific frequency matrices that model
conserved regions of multiple sequence alignments. The BUSCO group block profiles were created from
their corresponding protein multiple sequence alignments using the msa2prfl script from the Augustus
package. Several highly-divergent BUSCO groups failed to produce reliable block profiles, even after
processing their alignments with the Augustus preparealign script, and were therefore removed from the
assessment sets: 27, 149, 51, 0 and 2 BUSCO groups were removed from the arthropod, vertebrate,
metazoan, fungi and eukaryote sets, respectively.
1.3. Candidate BUSCO matches from genome assemblies
Regions in a genome likely to encode BUSCO-matching genes are identified by tBLASTn searches
(Camacho, et al., 2009) with the reconstructed consensus sequences of each BUSCO. Neighbouring high-scoring segment pairs (HSPs) from the tBLASTn searches are merged if located within 50 Kbp of each other,
thus defining the span of the genomic regions to be evaluated. These genomic regions are then ranked
according to the total length of the consensus sequence aligned, and up to three regions are selected for the
subsequent gene prediction steps. The second- and third-ranked regions must have consensus sequence
alignment lengths of at least 70% of the aligned length of the top-ranking region. Selecting more than just the
best candidate BUSCO match allows for the identification of duplicated BUSCOs in the
assessed genome; because BUSCOs are normally rarely duplicated, numerous duplicates could indicate erroneously assembled haplotypes. Lastly,
the selected genomic regions are extended with flanking regions of 5 Kbp (small genomes) or 20 Kbp (large genomes)
(default parameters; users can specify their own flank-extension lengths).
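The region-selection steps above can be sketched as follows. This is a simplified illustration, not the pipeline's code: it assumes all HSPs lie on one scaffold and one strand, represents each HSP as a hypothetical (start, end, aligned_length) tuple, approximates the total aligned consensus length by summing HSP aligned lengths, and does not clip the extended regions at the scaffold end.

```python
def merge_hsps(hsps, max_gap=50_000):
    """Merge neighbouring HSPs lying within max_gap bp of each other.

    hsps: list of (start, end, aligned_length) tuples on a single scaffold.
    Returns merged regions as (start, end, total_aligned_length).
    """
    regions = []
    for start, end, aligned in sorted(hsps):
        if regions and start - regions[-1][1] <= max_gap:
            prev_start, prev_end, prev_aligned = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), prev_aligned + aligned)
        else:
            regions.append((start, end, aligned))
    return regions


def select_regions(regions, max_regions=3, min_fraction=0.70, flank=5_000):
    """Rank regions by aligned length, keep up to three, and add flanks.

    Lower-ranked regions are kept only if their aligned length is at least
    min_fraction of the top-ranking region's aligned length.
    """
    ranked = sorted(regions, key=lambda r: r[2], reverse=True)
    if not ranked:
        return []
    best = ranked[0][2]
    kept = [r for r in ranked[:max_regions] if r[2] >= min_fraction * best]
    return [(max(0, s - flank), e + flank, a) for s, e, a in kept]
```

Passing flank=20_000 would correspond to the default extension quoted above for large genomes.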
1.4. Gene prediction: assessing genome assemblies and transcriptomes
The candidate BUSCO-matching regions identified in the previous step are extracted from the genome
being assessed for processing by the Augustus automated gene prediction procedure. Gene prediction is
performed on each candidate region using the corresponding BUSCO group’s block profile, and default gene
finding parameters (unless otherwise specified by the user). Successful Augustus gene prediction for each
BUSCO group produces an initial BUSCO gene set whose protein sequences are then evaluated using the
BUSCO-specific cut-offs to determine true orthology and completeness. High-confidence predicted BUSCO
genes can then be selected from this initial gene set for the training of Augustus to rerun the automated gene
prediction procedure with these specific genome-trained parameters (see below). For assessing
transcriptomes, if the transcripts have not already been pre-processed to extract protein-coding genes then the
longest open reading frame (ORF) is selected for assessment.
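A minimal sketch of longest-ORF selection for a single transcript is given below. The text states only that the longest ORF is selected; the specifics here (an ORF runs from ATG to the first in-frame stop codon, both strands and all three frames are scanned, standard stop codons) are assumptions made for illustration, and the function name is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return the longest ATG...stop ORF (nucleotides) over six reading frames.

    Returns an empty string if no complete ORF is found.
    """
    seq = transcript.upper()
    reverse_complement = seq.translate(_COMPLEMENT)[::-1]
    best = ""
    for strand in (seq, reverse_complement):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon of a new ORF in this frame
                elif start is not None and codon in _STOPS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best
```

The selected ORF would then be translated to protein before being scored against the BUSCO HMM profiles.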
1.5. BUSCO match assignment
This step uses the properties of each BUSCO group’s HMM profile to determine whether a significantly
matching protein sequence is likely orthologous or just homologous. Significant matches are first determined
by searching the full set of protein sequences to be assessed against the complete library of BUSCO group
HMM profiles using HMMER’s hmmsearch. As described above, filtering of the initial BUSCO sets ensured
that each library contains only reliably-distinguishable profiles. The set of protein sequences to be assessed
may be from the Augustus-predicted BUSCO gene set, a transcriptome-based gene set, or the annotated
‘Official Gene Set’ (OGS). For each hmmsearch sequence-profile alignment, two measures are computed
and evaluated: the alignment bitscore and the total length of sequence aligned to the HMM profile. For a
BUSCO-matching gene to be considered orthologous, the alignment bitscore must be greater than or equal to
the ‘expected-score’ of the corresponding BUSCO group (see above for ‘expected-score’ definition). Genes
that pass the ‘expected-score’ cut-off are then evaluated for protein length completeness as described below.
1.6. Classification: Complete, Duplicated, Fragmented, Missing
The final stage of the assessment classifies each arthropod, vertebrate, metazoan, fungal, or eukaryote
BUSCO as complete, duplicated, fragmented, or missing from the gene set being assessed. Classification of
BUSCO-matching genes that meet the ‘expected-score’ cut-off employs the protein length distribution of
each BUSCO to determine whether the ortholog is ‘Complete’ or ‘Fragmented’. Orthologs are considered to
be ‘Complete’ if the length of their aligned sequence is within two standard deviations (2σ) of the BUSCO
group’s mean length (i.e. 95% expectation), otherwise they are classified as ‘Fragmented’ recoveries (Figure
S1). A BUSCO is classified as ‘Duplicated’ when multiple BUSCO-matching genes meet both the
‘expected-score’ and the ‘expected-length’ cut-offs, i.e. multiple copies of full-length orthologs are found in
the gene set being assessed. Lastly, any BUSCO without a BUSCO-matching gene that meets the ‘expected-score’ cut-off is classified as ‘Missing’.
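The classification rules of this section can be summarised in a short sketch. It assumes that each candidate gene matching a BUSCO is represented by a hypothetical (bitscore, aligned_length) pair and that the group's cut-off values are already known; the function name is illustrative only.

```python
def classify_busco(matches, expected_score, mean_len, sd_len, n_sigma=2.0):
    """Classify one BUSCO as Complete, Duplicated, Fragmented, or Missing.

    matches: list of (bitscore, aligned_length) for genes matching this BUSCO.
    """
    # Orthology test: bitscore must reach the 'expected-score' cut-off.
    orthologs = [m for m in matches if m[0] >= expected_score]
    if not orthologs:
        return "Missing"
    # Completeness test: length within n_sigma standard deviations of the mean.
    complete = [m for m in orthologs if abs(m[1] - mean_len) <= n_sigma * sd_len]
    if len(complete) > 1:
        return "Duplicated"
    if len(complete) == 1:
        return "Complete"
    return "Fragmented"
```

Note that in the C[D] summary notation used later in this document, duplicated BUSCOs are reported as a subset of the complete ones, whereas this sketch returns a single label per BUSCO.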
1.7. Training Augustus gene finding parameters
Augustus predictions using generic parameters may produce sub-optimal gene predictions. Training the
prediction parameters using the most reliable gene structures obtained from the initial set of predictions can
substantially improve the results. To train Augustus, BUSCO-matching genes classified as ‘Complete’ and
single-copy are selected to form a high-quality training dataset. The selected gene structures are extracted,
and used to build GenBank files (gff2smallgb) suitable for training Augustus (etraining). This procedure
results in the creation of genome-specific gene finding parameters; for the vast majority of genomes
evaluated, when compared to ‘generic’ gene finding parameters, these genome-specific parameters result in
substantial increases in the sensitivity and specificity of Augustus predictions, both at gene and exon levels.
A second round of Augustus gene prediction is then performed using these genome-specific parameters on
all BUSCO-matching candidate regions where initial predictions failed or did not yield a ‘Complete’
ortholog. Orthology assessment, protein length evaluations, and final classifications are then performed as
outlined above to produce the final BUSCO assessment results.
Augustus allows for the possibility of further sensitivity and specificity gains by applying multiple rounds
of metaparameter optimisation performed using OptimizeAugustus. However, this extra optimisation step
comes at the cost of generally more than double the run-time for a typical genome assembly assessment,
without large improvements in assessment sensitivity. Thus, for default genome assembly assessments, this
extra optimisation step is not performed unless specified by the user (--long mode). This option is made
available to users because although the improvements from this extra optimisation step are minimal for the
purposes of assembly assessments, they can prove valuable when using BUSCO sets to train gene predictors
for subsequent use as part of multi-evidence-based whole genome annotation pipelines.
2. BUSCO completeness versus N50 contiguity
BUSCO assessment of genome assembly completeness is designed to provide a more detailed
quantification of assembly quality than traditional measures such as scaffold N50 metrics of assembly
contiguity. Comparing BUSCO completeness with N50 contiguity for a selection of genomes ranging from
fragmented draft assemblies to chromosome-level genome assemblies reveals the low correlation (r=0.149)
between these measures (Figure S2). Thus, even fragmented assemblies with relatively low N50 values can
encode fairly complete gene sets, and some assemblies that appear to be of good quality based on contiguity
measures are not necessarily more complete in terms of expected gene content. Additionally, when assessing
gene sets, it is clear that species with very high gene counts are not necessarily the most complete, nor are
those with rather low gene counts necessarily incomplete (Waterhouse, 2015). For a typical eukaryotic draft
assembly, BUSCO assessments suggest that assemblies with N50 values on the order of 50 Kbp are capable
of yielding fairly complete gene sets.
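For reference, the N50 statistic discussed here is conventionally the length of the scaffold at which the cumulative length, taken over scaffolds sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch under that standard definition (the function name is illustrative):

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold where the running sum covers half the assembly.
        if 2 * running >= total:
            return length
    return 0  # empty input
```

Because N50 depends only on the distribution of scaffold lengths, it carries no direct information about gene content, which is consistent with the low correlation reported above.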
Figure S2. BUSCO completeness versus N50 contiguity. Nine outliers with N50 values above 10’000 Kbp
are not shown, each of which achieves more than 90% BUSCO completeness.
3. BUSCO versus CEGMA assessment of genome assembly completeness
The Core Eukaryotic Genes Mapping Approach (CEGMA) is a widely-used method to assess genome
assembly completeness in terms of gene content (Parra, et al., 2007; Parra, et al., 2009), but does not provide
a means for directly assessing gene sets. CEGMA employs a set of 248 conserved Core Eukaryotic Genes
(CEGs) expected to be present in any newly sequenced eukaryotic genome. The CEGs are derived from
eukaryotic KOGs (Tatusov, et al., 2003) and are composed of orthologous protein sequences from six
eukaryotic species (human, fruit fly, roundworm, thale cress, fission yeast and baker’s yeast), for which a
corresponding HMM profile is built from their multiple sequence alignments.
In order to perform a like-for-like comparison of the CEGMA and BUSCO genome assembly and gene
set assessments, a subset of 250 of the 429 eukaryote BUSCOs was selected with the lowest variations of
their ‘expected-score’ and ‘expected-length’ parameters. As the CEGMA pipeline does not perform gene set
assessments, an analysis pipeline was built to use the CEGMA HMM profiles instead of the BUSCO HMM
profiles. In addition, the pipeline employed the cut-offs that CEGMA uses to determine the presence/absence
(from the provided ‘cutoff_file’ with the cut-offs for CEGMA HMMs) and complete/partial (complete,
>70% CEG length) status of potentially orthologous matches.
Thus, BUSCO assessments of genome assemblies and gene sets were performed with normal default
options except for substituting the full eukaryote BUSCO set with a subset of only 250 in order to match the
number of CEGMA CEGs. The CEGMA assessments of genome assemblies were performed with normal
default options, and CEGMA assessments of gene sets were enabled by building a pipeline to use CEGMA
HMM profiles and cut-offs. The results for the assessments of 40 species are shown in Figure 2 of the main
text. They reveal generally consistent BUSCO assessments across highly divergent lineages from fungi to
human, with somewhat less consistent results from the CEGMA assessments (BUSCO linear regression
more closely follows the diagonal than that of CEGMA).
Linear regressions of each set, adjusted $R^2$:
BUSCO $R^2 = 0.718$
CEGMA $R^2 = 0.413$
$R^2 = \mathrm{SSR} / \mathrm{SST}$, where $\mathrm{SSR} = \sum_i (\hat{y}_i - \bar{y})^2$ and $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$, and
$y_i$ is the $i$th observed value,
$\hat{y}_i$ is the $i$th expected value from the best-fit line,
$\bar{y}$ is the mean of $y$.
To evaluate against the diagonal ($x = y$) instead of the best-fit line, the expected value ($\hat{y}_i$) simply becomes the $x$
value ($x_i$), and there is no intercept term (i.e. the line passes through $x = y = 0$), so: $R^2_{(x=y)} = 1 - \mathrm{SSE} / \mathrm{SST}$, where $\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2$.
BUSCO: $R^2_{(x=y)} = 1 - (1281.6 / 3440.5) = 0.63$
CEGMA: $R^2_{(x=y)} = 1 - (5944.3 / 1936.3) = -2.07$
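The two values against the diagonal can be checked with a few lines of code. The sketch below implements the definition given above for paired values x and y, plus a helper that reproduces the quoted results directly from the reported sums of squares; the function names are illustrative, and the underlying per-species data are not reproduced here.

```python
def r2_against_diagonal(x, y):
    """R^2 of y evaluated against the line y = x (expected value is x_i)."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


def r2_from_sums(sse, sst):
    """Same quantity computed from already-summed squares."""
    return 1 - sse / sst


print(round(r2_from_sums(1281.6, 3440.5), 2))  # BUSCO: 0.63
print(round(r2_from_sums(5944.3, 1936.3), 2))  # CEGMA: -2.07
```

A negative value, as obtained for CEGMA, arises whenever SSE exceeds SST, i.e. when the diagonal describes the data worse than a horizontal line at the mean of y.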
4. BUSCO assessments of genomes, transcriptomes, and gene sets
The BUSCO assessment pipeline was applied to 70 available genome assemblies and their corresponding
official gene sets, as well as to 93 additional gene sets, and 96 transcriptomes. The detailed results are shown
in Table S1 in C[D],F,M,n BUSCO notation. The evaluated genome assemblies include both high quality
reference genomes (e.g. Homo sapiens), as well as de novo assemblies of non-model organisms, sampling a
wide range of different fold-coverage levels, N50 sizes, sequencing technologies, and assembly strategies.
These genomes represent the four major BUSCO lineages with 41 arthropods from 13 different orders, 3
vertebrates from 3 different orders, 11 basal metazoans, and 15 fungal species from 12 different orders. The
gene sets chosen for these assessments comprise: 41 arthropods, 26 vertebrates, 11 basal metazoans and 15
fungal species. 96 transcriptomes were also evaluated; sequences were typically derived from mRNA
extracted from different tissue types. The transcriptomes analysed cover a total of 11 fungal species (14
transcriptomes), 39 arthropods (44 transcriptomes), 18 vertebrates (28 transcriptomes) and 10 basal
metazoans (13 transcriptomes). Duplications [D] were not assessed (n.a.) for unfiltered gene sets or
transcriptomes that contained multiple transcripts of the same gene as this would lead to overestimates of
BUSCO duplications.
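For downstream processing of results reported in this notation, a small parser can be useful. The sketch below is an illustration written against the summary strings as they appear in Table S1 (for example "C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023"); the function name and the returned dictionary keys are hypothetical, and "n.a." duplications are mapped to None.

```python
import re

_BUSCO_SUMMARY = re.compile(
    r"C:(?P<C>[\d.]+)%\s*\[D:(?P<D>[\d.]+%|n\.a\.)\],\s*"
    r"F:(?P<F>[\d.]+)%,\s*M:(?P<M>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco_summary(summary):
    """Parse a 'C:..% [D:..%], F:..%, M:..%, n:..' string into a dict."""
    match = _BUSCO_SUMMARY.fullmatch(summary.strip())
    if match is None:
        raise ValueError(f"not a BUSCO summary string: {summary!r}")
    duplicated = match["D"]
    return {
        "complete": float(match["C"]),
        # Duplications are not assessed (n.a.) for some gene sets and transcriptomes.
        "duplicated": None if duplicated == "n.a." else float(duplicated.rstrip("%")),
        "fragmented": float(match["F"]),
        "missing": float(match["M"]),
        "n": int(match["n"]),
    }
```

Because the reported percentages are rounded, the complete, fragmented, and missing values of a parsed entry need not sum to exactly 100.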
Table S1. Current assessment completeness metrics in BUSCO notation (C:complete [D:duplicated],
F:fragmented, M:missed, n:genes) sampling different types of data and a variety of eukaryotic species.
Lineage | Species | Sample type | Identifier | N50 (Kbp) | BUSCOs assessment
Vertebrates | Homo sapiens | Genome | GCA_000001405.15 | 67,794 | C:89% [D:1.5%], F:6.0%, M:4.5%, n:3023
Vertebrates | Homo sapiens | Gene set | GRCh37.75 | - | C:99% [D:1.7%], F:0.0%, M:0.0%, n:3023
Vertebrates | Mus musculus | Genome | GCA_000001635.4 | 52,589 | C:78% [D:3.0%], F:19%, M:2.5%, n:3023
Vertebrates | Mus musculus | Gene set | GRCm38.75 | - | C:99% [D:2.5%], F:99%, M:0.1%, n:3023
Vertebrates | Ornithorhynchus anatinus | Genome | GCF_000002275.2 | 991 | C:55% [D:0.8%], F:25%, M:18%, n:3023
Vertebrates | Ornithorhynchus anatinus | Gene set | OANA5.75 | - | C:72% [D:1.1%], F:19%, M:8.2%, n:3023
Vertebrates | Callithrix jacchus | Gene set | C_jacchus3.2.1.75 | - | C:97% [D:2.9%], F:1.7%, M:0.8%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532219616 Bladder | - | C:76% [D:17%], F:5.5%, M:18%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532292355 Hippocampus | - | C:79% [D:18%], F:4.5%, M:15%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532349506 Cortex | - | C:34% [D:7.6%], F:34%, M:64%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532452938 S. muscle | - | C:69% [D:13%], F:6.0%, M:24%, n:3023
Vertebrates | Callithrix jacchus | Transcriptome | GI:532524775 Cerebellum | - | C:76% [D:19%], F:5.1%, M:18%, n:3023
Vertebrates | Pan troglodytes | Gene set | CHIMP2.14.75 | - | C:96% [D:0.5%], F:1.2%, M:1.9%, n:3023
Vertebrates | Pan troglodytes | Transcriptome | GI:410228237 adipose SC | - | C:75% [D:15%], F:3.8%, M:20%, n:3023
Vertebrates | Pan troglodytes | Transcriptome | GI:410308999 Fibroblast | - | C:75% [D:16%], F:3.7%, M:21%, n:3023
Lineage
Species
Anolis carolinensis
Latimeria chalumnae
Rana clamitans
Pseudacris regilla
Salmo salar
Oreochromis niloticus
Ameiurus nebulosus
Ursus maritimus
Tripterygion delaisi
Atractaspis aterrima
Latimeria menadoensis
Hynobius chinensis
Carduelis chloris
Maylandia zebra
Chinchilla lanigera
Ailuropoda melanoleuca
Bos taurus
Danio rerio
Felis catus
Ficedula albicollis
Gallus gallus
Gorilla gorilla
Loxodonta africana
Macaca mulatta
Monodelphis domestica
Mustela putorius
Oreochromis niloticus
Oryctolagus cuniculus
Oryzias latipes
Pongo abelii
Sus scrofa
Taeniopygia guttata
Takifugu rubripes
Xenopus tropicalis
Xiphophorus maculatus
Acromyrmex echinatior
Acyrthosiphon pisum
Aedes aegypti
Anopheles gambiae
Arthropods
Apis mellifera
Atta cephalotes
Bombyx mori
Camponotus floridanus
Danaus plexippus
Daphnia pulex
Dendroctonus ponderosae
Drosophila ananassae
Sample type
Transcriptome
Gene set
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Identifier
GI:410268357 Endothelium
AnoCar2.0.75
GI:614142443 Skeletal
GI:464801713 Whole
GI:387756559 Muscle
GI:451274083 Unknown
GI:451272305 Unknown
GI:666988260 Mixed
GI:555682626 Spleen
GI:472819489 Unknown
GI:510063642 Fat
GI:572723144 Brain
GI:673456880 Venom
GI:673404158 Venom
GI:559559797 Testis
GI:570932341 Unknown
GI:617996660 Blood
GI:614241491 Kidney
GI:618625375 Trachea
ailMel1.75
UMD3.175
Zv9.75
Felis_catus_6.2.75
FicAlb_1.4.75
Galga4.75
gorGor3.1.75
loxAfr3.75
MMUL_1.75
BROADO5.75
MusPutFur1.0.75
Orenil1.0.75
OryCun2.0.75
MEDAKA1.75
PPYG2.75
Sscrofa10.2.75
taeGut3.2.4.75
FUGU4.75
JGI_4.2.75
Xipmac4.4.2.75
Aech_2.0
Aech_OGS_v3.8
GCA_000142985.2
GCA_000142985.2.22
AaegL3
AaegL3.2
AgamP4
AgamP4.2
Amel_v4.5
Amel_OGS_v3.2
Acep 1.0
Acep OGS v1.2
GCA_000151625.1
GLEAN set
Cflor_v3.3
Cflor_OGS_v3.3
DanPle_1.0.22
DanPle_1.0.22
GCA_000187875.1
GCA_000187875.1.22
GCA_000355655.2
GCA_000355655.2.22
Dana_r1.3
N50 (Kbp)
1,110
86
1,547
49,364
997
5,154
4,008
451
52
642
628
4,599
BUSCOs assessment
C:75% [D:15%], F:3.5%, M:21%, n:3023
C:89% [D:2.6%], F:6.8%, M:3.4%, n:3023
C:58% [D:14%], F:8.7%, M:32%, n:3023
C:27% [D:15%], F:18%, M:53%, n:3023
C:37% [D:6.9%], F:11%, M:50%, n:3023
C:21% [D:0.3%], F:13%, M:65%, n:3023
C:20% [D:0.4%], F:16%, M:63%, n:3023
C:19% [D:7.8%], F:6.6%, M:74%, n:3023
C:39% [D:0.4%], F:16%, M:44%, n:3023
C:7.3% [D:0.2%], F:10%, M:82%, n:3023
C:50% [D:29%], F:5.5%, M:44%, n:3023
C:35% [D:13%], F:17%, M:47%, n:3023
C:0.7% [D:0.0%], F:1.0%, M:98%, n:3023
C:4.4% [D:0.5%], F:6.8%, M:88%, n:3023
C:71% [D:15%], F:6.5%, M:22%, n:3023
C:59% [D:7.3%], F:13%, M:26%, n:3023
C:31% [D:0.2%], F:12%, M:55%, n:3023
C:64% [D:15%], F:8.7%, M:26%, n:3023
C:80% [D:44%], F:5.7%, M:14%, n:3023
C:97% [D:1.3%], F:1.8%, M:0.3%, n:3023
C:97% [D:1.3%], F:1.6%, M:0.5%, n:3023
C:95% [D:8.3%], F:3.2%, M:1.7%, n:3023
C:96% [D:1.2%], F:2.8%, M:0.5%, n:3023
C:88% [D:2.0%], F:4.1%, M:7.8%, n:3023
C:90% [D:2.4%], F:3.5%, M:6.0%, n:3023
C:96% [D:2.6%], F:1.7%, M:2.1%, n:3023
C:96% [D:1.5%], F:2.3%, M:1.0%, n:3023
C:94% [D:2.0%], F:4.5%, M:0.9%, n:3023
C:95% [D:4.0%], F:2.3%, M:1.6%, n:3023
C:97% [D:1.4%], F:1.7%, M:1.0%, n:3023
C:96% [D:5.1%], F:1.4%, M:2.5%, n:3023
C:93% [D:2.7%], F:3.0%, M:3.2%, n:3023
C:83% [D:3.2%], F:5.4%, M:11%, n:3023
C:95% [D:1.1%], F:3.3%, M:1.1%, n:3023
C:83% [D:7.4%], F:6.8%, M:10%, n:3023
C:81% [D:3.2%], F:7.5%, M:11%, n:3023
C:89% [D:5.2%], F:3.5%, M:7.3%, n:3023
C:93% [D:3.4%], F:3.5%, M:2.5%, n:3023
C:93% [D:3.6%], F:4.7%, M:1.3%, n:3023
C:91% [D:2.6%], F:8.0%, M:0.6%, n:2675
C:96% [D:8.8%], F:2.8%, M:0.5%, n:2675
C:72% [D:6.1%], F:15%, M:12%, n:2675
C:89% [D:14%], F:4.1%, M:5.9%, n:2675
C:86% [D:13%], F:10%, M:3.2%, n:2675
C:93% [D:17%], F:3.6%, M:3.0%, n:2675
C:93% [D:4.7%], F:4.1%, M:2.5%, n:2675
C:97% [D:10%], F:1.4%, M:0.8%, n:2675
C:93% [D:2.9%], F:5.1%, M:0.9%, n:2675
C:97% [D:9%], F:2.1%, M:0.1%, n:2675
C:89% [D:2.6%], F:8.7%, M:1.3%, n:2675
C:91% [D:7.7%], F:7.5%, M:0.5%, n:2675
C:73% [D:2.2%], F:17%, M:8.3%, n:2675
C:75% [D:7.0%], F:14%, M:10%, n:2675
C:92% [D:3.1%], F:6.6%, M:0.5%, n:2675
C:95% [D:8.7%], F:3.9%, M:0.4%, n:2675
C:83% [D:8.6%], F:11%, M:4.3%, n:2675
C:86% [D:9.0%], F:9.5%, M:3.7%, n:2675
C:83% [D:3.9%], F:11%, M:5.1%, n:2675
C:84% [D:10%], F:11%, M:4.0%, n:2675
C:77% [D:6.1%], F:15%, M:7.2%, n:2675
C:82% [D:11%], F:10%, M:6.6%, n:2675
C:96% [D:3.7%], F:1.9%, M:1.9%, n:2675
Lineage | Species | Sample type | Identifier | N50 (Kbp) | BUSCOs assessment
Arthropods | Drosophila ananassae | Gene set | Dana_r1.3 | - | C:98% [D:9.6%], F:0.8%, M:0.1%, n:2675
Arthropods | Drosophila erecta | Genome | Dere_r1.3 | 18,748 | C:98% [D:4.7%], F:1.4%, M:0.4%, n:2675
Arthropods | Drosophila erecta | Gene set | Dere_r1.3 | - | C:99% [D:9.3%], F:0.2%, M:0.1%, n:2675
Arthropods | Drosophila grimshawi | Genome | Dgri_r1.3 | 8,399 | C:97% [D:6.2%], F:2.2%, M:0.4%, n:2675
Arthropods | Drosophila grimshawi | Gene set | Dgri_r1.3 | - | C:99% [D:11%], F:0.4%, M:0.0%, n:2675
Arthropods | Drosophila melanogaster | Genome | Dmel_r5.55 | 23,011 | C:98% [D:6.4%], F:0.6%, M:0.3%, n:2675
Arthropods | Drosophila melanogaster | Gene set | Dmel_r5.55 | - | C:99% [D:9.1%], F:0.2%, M:0.0%, n:2675
Arthropods | Drosophila mojavensis | Genome | Dmoj_r1.3 | 24,764 | C:97% [D:4.4%], F:2.2%, M:0.4%, n:2675
Arthropods | Drosophila mojavensis | Gene set | Dmoj_r1.3 | - | C:99% [D:9.6%], F:0.8%, M:0.1%, n:2675
Arthropods | Drosophila persimilis | Genome | Dper_r1.3 | 1,869 | C:93% [D:5.6%], F:5.8%, M:0.8%, n:2675
Arthropods | Drosophila persimilis | Gene set | Dper_r1.3 | - | C:93% [D:9.3%], F:5.6%, M:0.7%, n:2675
Arthropods | Drosophila pseudoobscura | Genome | Dpse_r3.1 | 12,541 | C:96% [D:6.3%], F:2.2%, M:1.1%, n:2675
Arthropods | Drosophila pseudoobscura | Gene set | Dpse_r3.1 | - | C:98% [D:11%], F:0.6%, M:0.6%, n:2675
Arthropods | Drosophila sechellia | Genome | Dsec_r1.3 | 2,123 | C:96% [D:5.1%], F:2.8%, M:0.7%, n:2675
Arthropods | Drosophila sechellia | Gene set | Dsec_r1.3 | - | C:96% [D:8.9%], F:3.0%, M:0.3%, n:2675
Arthropods | Drosophila simulans | Genome | Dsim_r1.4 | 857 | C:85% [D:4.6%], F:9.0%, M:5.0%, n:2675
Arthropods | Drosophila simulans | Gene set | Dsim_r1.4 | - | C:84% [D:7.6%], F:6.9%, M:8.0%, n:2675
Arthropods | Drosophila virilis | Genome | Dvir_r1.2 | 10,161 | C:96% [D:5.2%], F:2.4%, M:0.6%, n:2675
Arthropods | Drosophila virilis | Gene set | Dvir_r1.2 | - | C:99% [D:9.6%], F:0.7%, M:0.1%, n:2675
Arthropods | Drosophila willistoni | Genome | Dwil_r1.3 | 4,511 | C:97% [D:5.5%], F:1.7%, M:0.4%, n:2675
Arthropods | Drosophila willistoni | Gene set | Dwil_r1.3 | - | C:99% [D:10%], F:0.6%, M:0.2%, n:2675
Arthropods | Drosophila yakuba | Genome | Dyak_r1.3 | 21,770 | C:97% [D:6.5%], F:1.5%, M:0.7%, n:2675
Arthropods | Drosophila yakuba | Gene set | Dyak_r1.3 | - | C:98% [D:10%], F:0.8%, M:0.2%, n:2675
Arthropods | Harpegnathos saltator | Genome | Hsal_v3.3 | 601 | C:89% [D:3.2%], F:9.6%, M:1.1%, n:2675
Arthropods | Harpegnathos saltator | Gene set | Hsal_OGS_v3.3 | - | C:95% [D:9.0%], F:3.8%, M:0.7%, n:2675
Arthropods | Heliconius melpomene | Genome | Hmel_v1.22 | 194 | C:77% [D:2.0%], F:11%, M:10%, n:2675
Arthropods | Heliconius melpomene | Gene set | Hmel_v1.22 | - | C:74% [D:6.7%], F:14%, M:11%, n:2675
Arthropods | Ixodes scapularis | Genome | IscaW1 | 76 | C:58% [D:1.7%], F:21%, M:19%, n:2675
Arthropods | Ixodes scapularis | Gene set | IscaW1.3 | - | C:69% [D:6.6%], F:23%, M:7.1%, n:2675
Arthropods | Linepithema humile | Genome | Lhum_v1.0 | 1,402 | C:92% [D:3.3%], F:7.0%, M:0.6%, n:2675
Arthropods | Linepithema humile | Gene set | Lhum_OGS_v1.2 | - | C:95% [D:8.8%], F:4.0%, M:0.1%, n:2675
Arthropods | Lutzomyia longipalpis | Genome | Llonj1.1 | 85 | C:73% [D:6.3%], F:10%, M:16%, n:2675
Arthropods | Lutzomyia longipalpis | Gene set | Llonj1.1 | - | C:66% [D:9.7%], F:13%, M:20%, n:2675
Arthropods | Manduca sexta | Genome | GCA_000262585.1 | 664 | C:81% [D:4.4%], F:12%, M:6.1%, n:2675
Arthropods | Manduca sexta | Gene set | OGS2_20140407 | - | C:80% [D:10%], F:10%, M:8.2%, n:2675
Arthropods | Megaselia scalaris | Genome | Mscal_v1.22 | 1 | C:16% [D:0.6%], F:21%, M:61%, n:2675
Arthropods | Megaselia scalaris | Gene set | Mscal_v1.22 | - | C:21% [D:1.4%], F:20%, M:58%, n:2675
Arthropods | Metaseiulus occidentalis | Genome | Mocc_1.0 | 896 | C:76% [D:4.9%], F:12%, M:10%, n:2675
Arthropods | Metaseiulus occidentalis | Gene set | Mocc_1.0 | - | C:82% [D:14%], F:10%, M:6.5%, n:2675
Arthropods | Musca domestica | Genome | v2.0.2 | 226 | C:91% [D:4.3%], F:5.3%, M:2.7%, n:2675
Arthropods | Musca domestica | Gene set | v2.0.2 | - | C:97% [D:29%], F:2.3%, M:0.5%, n:2675
Arthropods | Nasonia vitripennis | Genome | Nvit_v1.0 | 698 | C:91% [D:6.0%], F:5.1%, M:3.2%, n:2675
Arthropods | Nasonia vitripennis | Gene set | Nvit_OGS_v1.2 | - | C:94% [D:10%], F:4.0%, M:1.1%, n:2675
Arthropods | Pediculus humanus | Genome | PhumU2 | 497 | C:92% [D:3.9%], F:6.1%, M:1.6%, n:2675
Arthropods | Pediculus humanus | Gene set | PhumU2.1 | - | C:93% [D:9.1%], F:4.9%, M:1.3%, n:2675
Arthropods | Phlebotomus papatasi | Genome | Ppapi1.1 | 0.87 | C:33% [D:3.2%], F:33%, M:33%, n:2675
Arthropods | Phlebotomus papatasi | Gene set | Ppapi1.1 | - | C:54% [D:6.1%], F:20%, M:25%, n:2675
Arthropods | Pogonomyrmex barbatus | Genome | Pbar_v1.0 | 819 | C:90% [D:2.9%], F:8.5%, M:0.7%, n:2675
Arthropods | Pogonomyrmex barbatus | Gene set | Pbar_OGS_v1.2 | - | C:93% [D:8.2%], F:6.5%, M:0.3%, n:2675
Arthropods | Solenopsis invicta | Genome | Sinv_v1.0 | 558 | C:74% [D:2.4%], F:19%, M:6.3%, n:2675
Arthropods | Solenopsis invicta | Gene set | Sinv_OGS_v2.2.3 | - | C:80% [D:6.5%], F:14%, M:5.4%, n:2675
Arthropods | Rhodnius prolixus | Genome | RproC1 | 847 | C:85% [D:2.5%], F:12%, M:2.5%, n:2675
Arthropods | Rhodnius prolixus | Gene set | RprocC1.2 | - | C:74% [D:8.3%], F:9.1%, M:16%, n:2675
Arthropods | Strigamia maritima | Genome | Smar1.22 | 139 | C:84% [D:5.9%], F:12%, M:3.2%, n:2675
Arthropods | Strigamia maritima | Gene set | GCA_000239435.1.22 | - | C:87% [D:12%], F:8.3%, M:4.6%, n:2675
Arthropods | Tetranychus urticae | Genome | GCA_000239435.1 | 2,993 | C:61% [D:4.5%], F:12%, M:25%, n:2675
Arthropods | Tetranychus urticae | Gene set | GCA_000239435.1.22 | - | C:69% [D:11%], F:9.6%, M:20%, n:2675
Arthropods | Tribolium castaneum | Genome | Tcas3.22 | 19,135 | C:95% [D:5.8%], F:3.9%, M:0.8%, n:2675
Arthropods | Tribolium castaneum | Gene set | Tcas_OGS_v2 | - | C:95% [D:10%], F:3.0%, M:1.3%, n:2675
Arthropods | Acanthoscurria geniculata | Transcriptome | GI:598795695 whole | - | C:65% [D:n.a.], F:13%, M:20%, n:2675
Arthropods | Anopheles sinensis | Transcriptome | GI:656597267 unknown | - | C:36% [D:n.a.], F:22%, M:41%, n:2675
Arthropods | Anthonomus grandis | Transcriptome | GI:562777735 whole | - | C:18% [D:n.a.], F:16%, M:65%, n:2675
Lineage
Species
Bactrocera dorsalis
Belgica antarctica
Calanus finmarchicus
Ceratitis capitata
Cherax quadricarinatus
Corydalinae sp.
Delia antiqua
Dendroctonus frontalis
Drosophila ercepeae
Drosophila malerkotliana m.
Drosophila malerkotliana p.
Drosophila merina
Drosophila miranda
Drosophila pseudoananassae n.
Drosophila pseudoananassae p.
Drosophila serrata
Echinogammarus veneris
Enallagma hageni
Folsomia candida
Hyalella azteca
Ips typographus
Ixodes scapularis
Ixodes ricinus
Latrodectus hesperus
Melita plumosa
Mengenilla moldrzyki
Musca domestica
Nannochorista philpotti
Nilaparvata lugens
Orchesella cincta
Polistes canadensis
Pontastacus leptodactylus
Priacma serrata
Spodoptera exigua
Stegodyphus mimosarum
Teleopsis dalmanni
Teleopsis whitei
Themira biloba
Brugia malayi
Caenorhabditis briggsae
Caenorhabditis elegans
Other metazoans
Caenorhabditis japonica
Helobdella robusta
Loa loa
Lottia gigantea
Nematostella vectensis
Schistosoma mansoni
Strongylocentrotus purpuratus
Trichoplax adhaerens
Sample type
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Transcriptome
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Gene set
Genome
Identifier
GI:618068638 unknown
GI:418280542 whole
GI:592958556 unknown
GI:647215886 unknown
GI:577749858 whole
GI:512174511 hypodermis
GI:661070030 whole
GI:604701913 whole
GI:452943093 whole
GI:570540147 unknown
GI:570549742 unknown
GI:570523813 unknown
GI:570504412 unknown
GI:645592147 unknown
GI:570451470 unknown
GI:570485056 whole
GI:480512000 unknown
GI:595402945 unknown
GI:459275420 total
GI:570625125 unknown
GI:510074665 unknown
GI:510092454 unknown
GI:459277393 antenna
GI:604952323 Synganglion
GI:556088131 salivary
GI:618730332 unknown
GI:510208131 whole
GI:660742704 whole
GI:604923024 unknown
GI:661012745 whole
GI:672467144 salivary
GI:570587022 unknown
GI:452055806 multiple
GI:556694752 hypodermis
GI:557011125 hepatopancreas
GI:661240973 Unknown
GI:548816146 unknown
GI:598904898 whole
GI:615270444 whole
GI:619803922 whole
GI:654236640 wildtype
GCA_000002995.3
B_malayi_3.0.22
CB4
CB4.22
GCA_000002985.3
WBcel235.22
GCA_000147155.1
C_japonica-7.0.1.22
GCA_000326865.1
GCA_000326865.1.22
GCA_00018385.2
Loa_loa_v3.22
GCA_00032785.1
GCA_00032785.1.22
GCA_000209225.1
GCA_000209225.1.22
GCA_000237925.2
ASM2379v2.22
GCA_000002235.2
GCA_000002235.2.22
GCA_000150275.1
N50 (Kbp)
37
17,512
17,494
94
3,060
174
1,870
472
34,464
167
5,978
BUSCOs assessment
C:87% [D:n.a.], F:5.9%, M:6.4%, n:2675
C:79% [D:n.a.], F:10%, M:9.8%, n:2675
C:84% [D:n.a.], F:7.3%, M:8.5%, n:2675
C:78% [D:n.a.], F:11%, M:10%, n:2675
C:87% [D:n.a.], F:7.3%, M:5.6%, n:2675
C:7.8% [D:n.a.], F:7.6%, M:84%, n:2675
C:14% [D:n.a.], F:20%, M:64%, n:2675
C:55% [D:n.a.], F:15%, M:28%, n:2675
C:56% [D:n.a.], F:22%, M:21%, n:2675
C:18% [D:n.a.], F:16%, M:65%, n:2675
C:19% [D:n.a.], F:16%, M:64%, n:2675
C:29% [D:n.a.], F:24%, M:45%, n:2675
C:25% [D:n.a.], F:20%, M:53%, n:2675
C:91% [D:n.a.], F:4.2%, M:4.0%, n:2675
C:6.2% [D:n.a.], F:21%, M:72%, n:2675
C:8.5% [D:n.a.], F:21%, M:70%, n:2675
C:40% [D:n.a.], F:22%, M:36%, n:2675
C:20% [D:n.a.], F:8.0%, M:71%, n:2675
C:6.9% [D:n.a.], F:7.6%, M:85%, n:2675
C:47% [D:n.a.], F:14%, M:38%, n:2675
C:5.9% [D:n.a.], F:3.8%, M:90%, n:2675
C:6.6% [D:n.a.], F:5.4%, M:87%, n:2675
C:19% [D:n.a.], F:20%, M:59%, n:2675
C:27% [D:n.a.], F:26%, M:46%, n:2675
C:77% [D:n.a.], F:8.4%, M:13%, n:2675
C:82% [D:n.a.], F:8.4%, M:9.3%, n:2675
C:6.4% [D:n.a.], F:6.3%, M:87%, n:2675
C:9.5% [D:n.a.], F:13%, M:76%, n:2675
C:64% [D:n.a.], F:19%, M:15%, n:2675
C:31% [D:n.a.], F:31%, M:37%, n:2675
C:74% [D:n.a.], F:12%, M:12%, n:2675
C:44% [D:n.a.], F:11%, M:44%, n:2675
C:51% [D:n.a.], F:22%, M:26%, n:2675
C:73% [D:n.a.], F:11%, M:14%, n:2675
C:44% [D:n.a.], F:12%, M:43%, n:2675
C:11% [D:n.a.], F:16%, M:72%, n:2675
C:29% [D:n.a.], F:14%, M:55%, n:2675
C:14% [D:n.a.], F:16%, M:68%, n:2675
C:92% [D:n.a.], F:6.0%, M:1.6%, n:2675
C:90% [D:n.a.], F:4.6%, M:5.3%, n:2675
C:71% [D:n.a.], F:16%, M:11%, n:2675
C:60% [D:1.5%], F:13%, M:25%, n:843
C:77% [D:9.7%], F:5.1%, M:17%, n:843
C:76% [D:2.9%], F:7.5%, M:16%, n:843
C:85% [D:11%], F:3.5%, M:11%, n:843
C:85% [D:6.9%], F:2.8%, M:11%, n:843
C:90% [D:11%], F:1.7%, M:7.5%, n:843
C:63% [D:4.8%], F:13%, M:22%, n:843
C:67% [D:9.4%], F:11%, M:20%, n:843
C:74% [D:3.4%], F:10%, M:14%, n:843
C:85% [D:12%], F:9.9%, M:4.2%, n:843
C:80% [D:6.6%], F:2.4%, M:17%, n:843
C:81% [D:8.5%], F:4.5%, M:14%, n:843
C:89% [D:2.3%], F:4.3%, M:5.8%, n:843
C:90% [D:13%], F:7.8%, M:2.1%, n:843
C:78% [D:3.5%], F:10%, M:10%, n:843
C:83% [D:15%], F:14%, M:2.8%, n:843
C:56% [D:4.3%], F:8.3%, M:34%, n:843
C:65% [D:7.8%], F:8.3%, M:26%, n:843
C:87% [D:6.5%], F:7.8%, M:4.9%, n:843
C:83% [D:19%], F:15%, M:0.7%, n:843
C:81% [D:1.1%], F:7.8%, M:10%, n:843
Species
Ancylostoma ceylanicum
Aplysia californica
Apostichopus japonicus
Asterias amurensis
Bithynia siamensis goniomphalos
Evechinus chloroticus
Henricia sp. AR-2014
Patiria miniata
Patiria pectinifera
Procotyla fluviatilis

Sample type    Identifier                   BUSCOs assessment
Gene set       ASM1507v1.22                 C:85% [D:11%], F:12%, M:2.3%, n:843
Transcriptome  GI:595744344 Unknown         C:16% [D:n.a.], F:38%, M:44%, n:843
Transcriptome  GI:613602134 chemokine       C:88% [D:n.a.], F:8.1%, M:2.8%, n:843
Transcriptome  GI:614063388 Gills           C:88% [D:n.a.], F:8.4%, M:3.5%, n:843
Transcriptome  GI:606015213 Heart           C:77% [D:n.a.], F:12%, M:9.3%, n:843
Transcriptome  GI:594457164 Salivary        C:41% [D:n.a.], F:23%, M:34%, n:843
Transcriptome  GI:638469663 Unknown         C:68% [D:n.a.], F:24%, M:6.9%, n:843
Transcriptome  GI:638532954 Unknown         C:59% [D:n.a.], F:28%, M:11%, n:843
Transcriptome  GI:480970007 Unknown         C:57% [D:n.a.], F:24%, M:17%, n:843
Transcriptome  GI:559461775 Unknown         C:92% [D:n.a.], F:5.3%, M:2.6%, n:843
Transcriptome  GI:638872012 Unknown         C:90% [D:n.a.], F:7.9%, M:1.1%, n:843
Transcriptome  GI:638728087 Ovary           C:88% [D:n.a.], F:10%, M:1.1%, n:843
Transcriptome  GI:638651248 Unknown         C:80% [D:n.a.], F:18%, M:1.6%, n:843
Transcriptome  GI:528026207 Unknown         C:54% [D:n.a.], F:18%, M:26%, n:843

Lineage: Fungi

Species                        Sample type  Identifier           N50 (Kbp)  BUSCOs assessment
Ashbya gossypii                Genome       GCA_000091025.4      1,476      C:95% [D:4.5%], F:1.8%, M:2.9%, n:1438
Ashbya gossypii                Gene set                                     C:95% [D:7.3%], F:3.8%, M:0.9%, n:1438
Aspergillus nidulans           Genome       GCA_000011425.1      3,704      C:98% [D:1.8%], F:0.9%, M:0.2%, n:1438
Aspergillus nidulans           Gene set                                     C:95% [D:11%], F:2.8%, M:1.8%, n:1438
Cryptococcus neoformans        Genome       GCA_000091045.1      1,438      C:92% [D:5.4%], F:2.5%, M:4.8%, n:1438
Cryptococcus neoformans        Gene set                                     C:90% [D:7.1%], F:5.9%, M:3.1%, n:1438
Gibberella zeae                Genome       GCA_000240135.2      5,350      C:98% [D:1.3%], F:1.3%, M:0.2%, n:1384
Gibberella zeae                Gene set                                     C:97% [D:11%], F:2.0%, M:0.2%, n:1384
Komagataella pastoris          Genome       GCA_000027005.1      2,394      C:93% [D:5.0%], F:4.5%, M:2.0%, n:1438
Komagataella pastoris          Gene set                                     C:93% [D:8.5%], F:3.8%, M:2.7%, n:1438
Neurospora crassa              Genome       GCA_000182925.1      6,000      C:98% [D:6.5%], F:0.6%, M:0.6%, n:1438
Neurospora crassa              Gene set                                     C:97% [D:10%], F:1.5%, M:0.6%, n:1438
Phaeosphaeria nodorum          Genome       GCA_000146915.1      1,045      C:96% [D:6.0%], F:3.1%, M:0.2%, n:1438
Phaeosphaeria nodorum          Gene set                                     C:91% [D:9.7%], F:8.4%, M:0.4%, n:1438
Puccinia graminis              Genome       GCA_000149925.1      964        C:63% [D:5.6%], F:20%, M:15%, n:1438
Puccinia graminis              Gene set                                     C:85% [D:11%], F:8.0%, M:6.3%, n:1438
Saccharomyces cerevisiae       Genome       GCA_000146045.2      924        C:96% [D:5.2%], F:0.4%, M:2.7%, n:1438
Saccharomyces cerevisiae       Gene set                                     C:98% [D:8.6%], F:1.1%, M:0%, n:1438
Schizosaccharomyces pombe      Genome       GCA_000002945.2      4,539      C:89% [D:3.8%], F:2.7%, M:7.7%, n:1438
Schizosaccharomyces pombe      Gene set                                     C:90% [D:9.5%], F:5.7%, M:3.3%, n:1438
Sclerotinia sclerotiorum       Genome       GCA_000146945.1      1,625      C:70% [D:3.5%], F:3.8%, M:25%, n:1438
Sclerotinia sclerotiorum       Gene set                                     C:67% [D:8%], F:7.4%, M:25%, n:1438
Tuber melanosporum             Genome       GCA_000151645.1      638        C:95% [D:5.0%], F:4.1%, M:0.6%, n:1438
Tuber melanosporum             Gene set                                     C:91% [D:9.0%], F:6.2%, M:2.3%, n:1438
Ustilago maydis                Genome       GCA_000328475.1      127        C:92% [D:5.9%], F:3.1%, M:4.4%, n:1438
Ustilago maydis                Gene set                                     C:88% [D:7.5%], F:6.6%, M:5.0%, n:1438
Verticillium dahliae           Genome       GCA_000150675.1      1,273      C:95% [D:4.4%], F:3.5%, M:0.9%, n:1438
Verticillium dahliae           Gene set                                     C:94% [D:9.4%], F:4.5%, M:0.9%, n:1438
Yarrowia lipolytica            Genome       GCA_000002525.1      3,633      C:97% [D:5.4%], F:2.1%, M:0.6%, n:1438
Yarrowia lipolytica            Gene set                                     C:96% [D:8.8%], F:2.9%, M:0.6%, n:1438

Species
Agaricus subrufescens
Armillaria ostoyae
Hypsizygus marmoreus
Ophiocordyceps sinensis
Phakopsora pachyrhizi
Puccinia striiformis f.sp. tritici
Pyrenochaeta lycopersici
Spraguea lophii
Termitomyces clypeatus
Trametes sanguinea
Uromyces appendiculatus

Sample type    Identifier                   BUSCOs assessment
Transcriptome  GI:645683639 Unknown         C:7.7% [D:n.a.], F:28%, M:63%, n:1438
Transcriptome  GI:480500433 RNA1            C:45% [D:n.a.], F:42%, M:11%, n:1438
Transcriptome  GI:612225315 Unknown         C:59% [D:n.a.], F:34%, M:6.4%, n:1138
Transcriptome  GI:630075070 Unknown         C:38% [D:n.a.], F:36%, M:24%, n:1438
Transcriptome  GI:452772923 Thai1           C:9.3% [D:n.a.], F:12%, M:78%, n:1438
Transcriptome  GI:509494464 PST             C:32% [D:n.a.], F:35%, M:32%, n:1438
Transcriptome  GI:509507311 Haustorium      C:22% [D:n.a.], F:33%, M:43%, n:1438
Transcriptome  GI:509515198 Spore           C:17% [D:n.a.], F:32%, M:49%, n:1438
Transcriptome  GI:589143963 unknown         C:94% [D:n.a.], F:4.8%, M:0.1%, n:1438
Transcriptome  GI:520759716 Spore           C:6.4% [D:n.a.], F:11%, M:82%, n:1438
Transcriptome  GI:595370870 treated         C:95% [D:n.a.], F:4.3%, M:0.0%, n:1438
Transcriptome  GI:595351039 untreated       C:91% [D:n.a.], F:7.5%, M:1.1%, n:1438
Transcriptome  GI:511189810 BAFC2126        C:18% [D:n.a.], F:30%, M:50%, n:1438
Transcriptome  GI:452898896 SWBR1           C:34% [D:n.a.], F:25%, M:39%, n:1438
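
The BUSCOs assessment entries listed above share one compact notation: percentages for C (complete), D (duplicated, written in brackets or given as n.a.), F (fragmented) and M (missing), followed by n, the number of BUSCOs in the set used. The following is a minimal sketch of how such an entry could be parsed for downstream comparison. The names `BuscoSummary` and `parse_busco` are hypothetical and are not part of the BUSCO software.

```python
import re
from typing import NamedTuple, Optional


class BuscoSummary(NamedTuple):
    """One parsed assessment entry; percentages are kept as printed."""
    complete: float
    duplicated: Optional[float]  # None when the entry reads "D:n.a."
    fragmented: float
    missing: float
    n: int


# Matches entries such as "C:95% [D:4.5%], F:1.8%, M:2.9%, n:1438".
_PATTERN = re.compile(
    r"C:(?P<c>[\d.]+)%\s*\[D:(?P<d>[\d.]+%|n\.a\.)\],\s*"
    r"F:(?P<f>[\d.]+)%,\s*M:(?P<m>[\d.]+)%,\s*n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> BuscoSummary:
    """Parse one assessment string; raise ValueError if it does not match."""
    match = _PATTERN.fullmatch(summary.strip())
    if match is None:
        raise ValueError(f"not a BUSCO assessment string: {summary!r}")
    dup = match["d"]
    return BuscoSummary(
        complete=float(match["c"]),
        duplicated=None if dup == "n.a." else float(dup.rstrip("%")),
        fragmented=float(match["f"]),
        missing=float(match["m"]),
        n=int(match["n"]),
    )


if __name__ == "__main__":
    print(parse_busco("C:95% [D:4.5%], F:1.8%, M:2.9%, n:1438"))
    print(parse_busco("C:7.7% [D:n.a.], F:28%, M:63%, n:1438"))
```

Because the table reports rounded percentages, C, F and M need not sum to exactly 100, so the sketch does not check the total.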
5. BUSCO and CEGMA analysis run-times
The total run-times of default-parameter BUSCO and CEGMA assessments of genome assemblies and
gene sets were evaluated on representative species from different metazoan lineages (Table S2).
All analyses were performed using 4 CPUs with up to 8 GB of RAM. BUSCO assessments were
performed using the eukaryote and metazoan sets, as well as the largest specific set for each species.
Table S2. BUSCO and CEGMA assessment run-times on four representative species.
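Species                  Dataset            Analysis                 Run-time
Drosophila melanogaster  Genome, 180 Mbp    2’675 arthropod BUSCOs   7.6h
Drosophila melanogaster  Genome, 180 Mbp    843 metazoan BUSCOs      3.2h
Drosophila melanogaster  Genome, 180 Mbp    429 eukaryote BUSCOs     1.4h
Drosophila melanogaster  Genome, 180 Mbp    250 eukaryote BUSCOs     0.81h
Drosophila melanogaster  Genome, 180 Mbp    248 CEGMA genes          2.5h
Drosophila melanogaster  Gene set, 13’918   2’675 arthropod BUSCOs   1.4h
Drosophila melanogaster  Gene set, 13’918   843 metazoan BUSCOs      0.5h
Drosophila melanogaster  Gene set, 13’918   429 eukaryote BUSCOs     0.36h
Drosophila melanogaster  Gene set, 13’918   250 eukaryote BUSCOs     0.15h
Drosophila melanogaster  Gene set, 13’918   248 CEGMA genes          N/A
Heliconius melpomene     Genome, 269 Mbp    2’675 arthropod BUSCOs   8.1h
Heliconius melpomene     Genome, 269 Mbp    843 metazoan BUSCOs      3.6h
Heliconius melpomene     Genome, 269 Mbp    429 eukaryote BUSCOs     0.91h
Heliconius melpomene     Genome, 269 Mbp    250 eukaryote BUSCOs     0.58h
Heliconius melpomene     Genome, 269 Mbp    248 CEGMA genes          5.7h
Heliconius melpomene     Gene set, 12’669   2’675 arthropod BUSCOs   0.35h
Heliconius melpomene     Gene set, 12’669   843 metazoan BUSCOs      0.18h
Heliconius melpomene     Gene set, 12’669   429 eukaryote BUSCOs     0.12h
Heliconius melpomene     Gene set, 12’669   250 eukaryote BUSCOs     0.1h
Heliconius melpomene     Gene set, 12’669   248 CEGMA genes          N/A
Homo sapiens             Genome, 3’381 Mbp  3’023 vertebrate BUSCOs  29h
Homo sapiens             Genome, 3’381 Mbp  843 metazoan BUSCOs      13h
Homo sapiens             Genome, 3’381 Mbp  429 eukaryote BUSCOs     6.5h
Homo sapiens             Genome, 3’381 Mbp  250 eukaryote BUSCOs     2.8h
Homo sapiens             Genome, 3’381 Mbp  248 CEGMA genes          25.3h
Homo sapiens             Gene set, 20’364   3’023 vertebrate BUSCOs  2.6h
Homo sapiens             Gene set, 20’364   843 metazoan BUSCOs      1.2h
Homo sapiens             Gene set, 20’364   429 eukaryote BUSCOs     0.5h
Homo sapiens             Gene set, 20’364   250 eukaryote BUSCOs     0.21h
Homo sapiens             Gene set, 20’364   248 CEGMA genes          N/A
Caenorhabditis elegans   Genome, 100 Mbp    843 metazoan BUSCOs      5.3h
Caenorhabditis elegans   Genome, 100 Mbp    429 eukaryote BUSCOs     1.36h
Caenorhabditis elegans   Genome, 100 Mbp    250 eukaryote BUSCOs     0.88h
Caenorhabditis elegans   Genome, 100 Mbp    248 CEGMA genes          1.7h
Caenorhabditis elegans   Gene set, 20’447   843 metazoan BUSCOs      0.5h
Caenorhabditis elegans   Gene set, 20’447   429 eukaryote BUSCOs     0.3h
Caenorhabditis elegans   Gene set, 20’447   250 eukaryote BUSCOs     0.1h
Caenorhabditis elegans   Gene set, 20’447   248 CEGMA genes          N/A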
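
One way to read Table S2 is to compare, for each genome assembly, the CEGMA run-time with the run-time of the similarly sized 250-BUSCO eukaryote set. The sketch below does this arithmetic with values copied from the table. The dictionary layout and the function name `cegma_to_busco_ratio` are illustrative assumptions, and the choice of the 250 eukaryote BUSCOs as the comparison point is ours rather than something the text prescribes.

```python
# Genome-assessment run-times in hours, copied from Table S2.
GENOME_RUNTIMES_H = {
    "Drosophila melanogaster": {"250 eukaryote BUSCOs": 0.81, "248 CEGMA genes": 2.5},
    "Heliconius melpomene": {"250 eukaryote BUSCOs": 0.58, "248 CEGMA genes": 5.7},
    "Homo sapiens": {"250 eukaryote BUSCOs": 2.8, "248 CEGMA genes": 25.3},
    "Caenorhabditis elegans": {"250 eukaryote BUSCOs": 0.88, "248 CEGMA genes": 1.7},
}


def cegma_to_busco_ratio(runtimes):
    """Return, per species, CEGMA run-time divided by BUSCO run-time."""
    return {
        species: times["248 CEGMA genes"] / times["250 eukaryote BUSCOs"]
        for species, times in runtimes.items()
    }


if __name__ == "__main__":
    for species, ratio in cegma_to_busco_ratio(GENOME_RUNTIMES_H).items():
        print(f"{species}: CEGMA took {ratio:.1f}x the BUSCO run-time")
```

Gene-set rows are left out because Table S2 lists CEGMA as N/A for gene sets.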
6. References
Camacho, C., et al. (2009) BLAST+: architecture and applications, BMC Bioinformatics, 10, 421.
Eddy, S.R. (2011) Accelerated Profile HMM Searches, PLoS Comput Biol, 7, e1002195.
Keller, O., et al. (2011) A novel hybrid gene prediction method employing protein multiple sequence alignments,
Bioinformatics, 27, 757-763.
Kriventseva, E.V., et al. (2014) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free
software, Nucleic Acids Res.
Mende, D.R., et al. (2013) Accurate and universal delineation of prokaryotic species, Nat Methods, 10, 881-884.
Parra, G., Bradnam, K. and Korf, I. (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic
genomes, Bioinformatics, 23, 1061-1067.
Parra, G., et al. (2009) Assessing the gene space in draft genomes, Nucleic Acids Res, 37, 289-297.
Sievers, F. and Higgins, D.G. (2014) Clustal Omega, accurate alignment of very large numbers of sequences,
Methods Mol Biol, 1079, 105-116.
Tatusov, R., et al. (2003) The COG database: an updated version includes eukaryotes, BMC Bioinformatics, 4, 41.
Waterhouse, R.M. (2015) A maturing understanding of the composition of the insect gene repertoire, Current
Opinion in Insect Science, 1.
Waterhouse, R.M., et al. (2013) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs, Nucleic
Acids Research, 41, D358-D365.
Waterhouse, R.M., Zdobnov, E.M. and Kriventseva, E.V. (2011) Correlating Traits of Gene Retention, Sequence
Divergence, Duplicability and Essentiality in Vertebrates, Arthropods, and Fungi, Genome Biology and
Evolution, 3, 75-86.