OrthoFinder: solving fundamental biases in whole

Transcription

OrthoFinder: solving fundamental biases in whole
Emms and Kelly Genome Biology (2015) 16:157
DOI 10.1186/s13059-015-0721-2
SOFTWARE
Open Access
OrthoFinder: solving fundamental biases in
whole genome comparisons dramatically
improves orthogroup inference accuracy
David M. Emms and Steven Kelly*
Abstract
Identifying homology relationships between sequences is fundamental to biological research. Here we provide a
novel orthogroup inference algorithm called OrthoFinder that solves a previously undetected gene length bias in
orthogroup inference, resulting in significant improvements in accuracy. Using real benchmark datasets we demonstrate
that OrthoFinder is more accurate than other orthogroup inference methods by between 8 % and 33 %. Furthermore,
we demonstrate the utility of OrthoFinder by providing a complete classification of transcription factor gene families in
plants revealing 6.9 million previously unobserved relationships.
Background and rationale
Identifying homology relationships between sequences is
fundamental to all aspects of biological research. In
addition to the pivotal role these inferences play in
furthering our understanding of the evolution and diversity of life, they also provide a coherent framework for
the extrapolation of biological knowledge between organisms. In this context, orthology inference underpins
genome and transcriptome annotation and provides the
foundation on which synthetic and systems biology is
built. Given the importance of this process to biological
research there has been a rich heritage of methodology
development in this area with the production of several
effective orthology databases and algorithms.
The most widely used methods for orthology inference
can be classified into two distinct groups. One group of
methods approaches the problem by inferring pairwise
relationships between genes in two species, and then
extending orthology to multiple species by identifying
sets of genes spanning these species in which each genepair is an orthologue. Popular methods that adopt this
approach include MultiParanoid [1] and OMA [2]. A
confounding factor to these approaches is that gene duplications cause orthology relationships that are not one-toone [3] and so orthology is not a transitive relationship
* Correspondence: [email protected]
Department of Plant Sciences, University of Oxford, South Parks Road, Oxford
OX1 3RB, UK
(for example, if gene A is an orthologue of gene B, and
gene B is an orthologue of gene C, it is not necessarily true
that gene A is an orthologue of gene C) [4]. This lack of
transitivity means that to capture all pairwise orthology
relationships individual genes must be allowed to be
members of more than one set [2], or the gene sets must
be restricted to subsets of species that share the same last
common ancestor [1]. Methods that adopt these pairwise
approaches have high levels of precision in recovering
orthologues, however, they suffer from low rates of recall
in discovering the complete orthogroup due to these
complications arising from gene duplications.
The second group of methods do not adopt this pairwise strategy but rather attempt to identify complete
orthogroups; an orthogroup is the set of genes that are
descended from a single gene in the last common ancestor of all the species being considered [2, 5–9]. Here an
orthogroup by definition contains both orthologues and
paralogues, and in this context is frequently used as a
unit of comparison for comparative genomics [10–12].
In this work we follow this latter approach as it is a
logical extension of orthology to multiple species. The
most widely used orthogroup inference method is
OrthoMCL [13] (usage assessed by citations n = 870
Scopus citations at the time of writing this article).
OrthoMCL uses BLAST [14] to compute sequence similarity scores between sequences in multiple species and
then uses the MCL clustering algorithm [15] to identify
© 2015 Emms and Kelly. This is an Open Access article distributed under the terms of the Creative Commons Attribution
License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://
creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Emms and Kelly Genome Biology (2015) 16:157
highly-connected clusters (groups of highly similar sequences) within this dataset.
In addition to the approaches discussed above, several
methods have also been developed that incorporate gene
synteny/co-linearity information to assist in the inference
of orthogroups [16, 17]. For groups of organisms such as
the Kinetoplastids, where gene synteny/co-linearity is well
conserved [18] it can provide valuable additional information. However, synteny is not conserved over large evolutionary distances and thus can provide little assistance to
the identification of related genes between distantly related groups such as plants and metazoa. Moreover,
synteny is unavailable for de novo assembled transcriptomes and for fragmented, low-coverage genome assemblies. Thus there is a need to have accurate methods of
orthogroup inference that do not require gene synteny
information.
Here we present OrthoFinder, a novel method that infers orthogroups of protein coding genes. It is fast, easy
to use and scalable to thousands of genomes. In tests
using real benchmark datasets OrthoFinder outperforms
all other commonly used orthogroup inference methods
by between 8 % and 33 %. We further demonstrate the
utility of OrthoFinder through the inference and analysis
of plant transcription factor orthogroups. Here we use
phylogenetic methods to validate the orthogroups and
show that using OrthoFinder to infer orthogroups identifies millions of previously unobserved relationships.
Further information about the algorithm can be found at
[19] and a standalone implementation of the algorithm
is available under the GPLv3 licence at [20].
Problem definition, method evaluation and comparison
to other approaches
Gene length bias in BLAST scores affects the accuracy of
orthogroup detection
The inference of orthogroups across multiple species
requires a fast method to measure pairwise sequence
similarity between all sequences in the species being
considered. BLAST [14] is the most widely used method
to identify and measure similarity between sequences
and thus it underpins the majority of orthologue identification methods [9, 13, 21–23]. Analysis of the pairwise
BLAST scores that are produced when the full set of
protein sequences from one species is BLAST searched
against those from another species revealed that there is
a clear length dependency in the scores that are obtained
(Fig. 1a and b). Short sequences cannot produce large
bit scores or low e-values (Fig. 1a and b, respectively),
whereas long sequences produce many hits with scores
better than those for the best hits of short sequences
(Fig. 1a and b). Thus, methods that construct orthogroups
by evaluation of BLAST scores in the absence of gene
length information should result in a large number of
Page 2 of 14
Fig. 1 Analysis of gene length dependency of BLASTp scores. a
BLAST log10(bit score) for all hits between Homo sapiens
(Homo_sapiens.GRCh37.60.pep.all, 21,841 sequences) and Mus
musculus (Mus_musculus.NCBIM37.60.pep.all, 23,111 sequences). b
–log10(e-value) for all hits between and Homo sapiens and Mus
musculus. To avoid infinite values, BLAST scores of zero have been
replaced with the lowest obtainable value 10−180. The heat map in
both cases goes from blue (lowest density of hits) to red (highest).
c The F-score (red), recall (blue) and precision (green) of orthogroup
inference using OrthoMCL plotted as a function of sequence length.
The sequences were sorted according to length and divided into four
bins with the same number of sequences in each. The F-score, recall
and precision were calculated for each bin and the scores plotted
against the geometric mean of the length of the sequences in each
bin. The error bars show the lower and upper limits of sequence
lengths for the shortest and longest sequences in each bin and the
dot shows the geometric mean of these lengths. d Histogram of all
protein-coding gene lengths in Homo sapiens is provided for reference
missing genes (low recall) from orthogroups that contain
short genes and a large number of incorrectly clustered
genes (low precision) in orthogroups that contain long
genes.
To determine if this was the case we assessed the
performance of OrthoMCL using the OrthoBench dataset [5]. OrthoBench is the only publicly available benchmark dataset of manually curated orthogroups. The
dataset consists of 70 orthogroups of protein coding
genes covering 12 species within the Metazoa where
each orthogroup contains all the genes derived from a
single gene in the last common ancestor of the 12 species considered. For further details concerning the
construction, species range and complexity of each
orthogroup see [5]. The recall and precision of
OrthoMCL was assessed as a function of gene length in
this dataset. This revealed that there were strong dependencies between the performance characteristics of
OrthoMCL and the length of the gene that was being
Emms and Kelly Genome Biology (2015) 16:157
Page 3 of 14
clustered (Fig. 1c, Additional file 1: Table S1). Specifically,
short sequences suffer from low recall rate (that is, many
short sequences fail to be assigned to an orthogroup) and
long sequences suffer from low precision (that is, many
long sequences are assigned to the incorrect orthogroup)
as predicted from the analysis of BLAST scores above. To
put these results in perspective the distribution of protein
lengths in Homo sapiens is provided in Fig. 1d.
A novel score transform eliminates gene length bias in
orthogroup detection
Given that orthogroup inference shows a clear gene
length dependency, we sought to develop a transform of
the BLAST scores that would reduce the impact of gene
length on clustering accuracy. To do this we developed a
novel method that determines the gene length dependency of a given pairwise species comparison from an
analysis of the bit scores from an all-versus-all BLAST
search between the two species. Bit scores were used in
place of e-values as the e-value calculation enforces a
limit of 1×10−180 and thus all scores below this floor are
given the same value (that is, 0) (Fig. 1b) and thus length
bias in e-values is non-uniform and irreversible. As bit
scores do not have a threshold value, and they have been
previously shown to be capable of facilitating accurate
inference of phylogenetic trees [24], they were selected
as the raw data for the development of a novel score
transform.
In brief, for each species-pair in turn, the all-vs-all
BLAST hits (Fig. 2a) were divided into equal sized bins
of increasing sequence length according to the product
of the query and hit sequence lengths. The top 5 % of
hits in each bin (ranked according to BLAST bit score)
were used to represent ‘good’ hits for sequences of that
length bin between the given species pair (Fig. 2b). A
linear model in log-log space was used to fit a line to
these scores using least squares fitting (Fig. 2b). All of
the BLAST bit scores that were obtained from each
species-pair all-vs-all BLAST search are then transformed using this model so that the best hits between
sequences in this species pair have equivalent scores that
are independent of sequence length (Fig. 2c and d).
Following the transform the poor quality hits for longer
sequences were no longer better than the best quality
hits for short sequences (Fig. 2c). This normalisation
procedure is applied to each pairwise species comparison independently as the behaviour of the BLAST scores
is different for each pairwise species comparison
(Additional file 2: Figure S1). Importantly, this pairwise
length normalisation between species also normalises for
phylogenetic distance between species (See ‘Gene length
and phylogenetic distance normalisation’ & Additional
file 2: Figure S1). Specifically, the normalisation ensures
that the best scoring hits between distantly related
Fig. 2 The gene length and phylogenetic distance normalisation
procedure for a single species pair. a BLAST bit scores for all hits
between Homo sapiens and Mus musculus. b BLAST bit scores for
the top 5 % of BLAST hits with least-squares fit of the equation
log10Bqh = a log10Lqh + b., where Bqh is the bit score for the hit
between sequence q and sequence h and Lqh is the product of the
gene lengths (measured in amino acids). c Gene length and
phylogenetic distance normalised BLAST bit scores. Note that
there are a large number of poor scoring hits for long sequences
due to these hits exceeding the BLAST search e-value cutoff. d
The same top 5 % of BLAST hits as shown in (b) after normalisation
for reference
species achieve the same scores (on average) to the best
scoring hits between closely related species (Additional
file 2: Figure S1). These length and phylogenetic distance
normalised scores were then used as the measure of sequence similarity on which all subsequent analysis and
clustering were performed.
Application of this novel score transform prior to clustering of the OrthoBench dataset resulted in a dramatic
reduction in the length dependency of the clustering
results (Fig. 3). Unlike OrthoMCL (Fig. 3a), neither
precision, recall nor F-score displayed any dependency
on gene length (Fig. 3b). Moreover, precision was substantially increased over the entire range of sequence
lengths (Fig. 3b).
An improved method for orthogroup delimitation
improves overall accuracy
Given that we had reduced gene length bias and that
precision was high but recall was low, we assessed
whether a method that could identify a higher proportion of cognate gene-pairs prior to clustering could
produce an overall increase in clustering accuracy. Many
orthology assignment methods make use of reciprocal
Emms and Kelly Genome Biology (2015) 16:157
1
OrthoMCL
OrthoMCL (normalised scores)
OrthoFinder
0.5
Fraction
Page 4 of 14
A
C
0
B
2
3
log 10 (length)
2
3
log 10 (length)
4
F-score
Recall
F-score
0.6
0.4
Precision
E
D
4
F
Fraction
Fraction
0.8
2
3
log 10 (length)
Precision
Recall
Fraction
1.0
4
0.2
0.0
TreeFam
OrthoFinder
T
eggNOG
OrthoMCL
OrthoDB
OMA
Fig. 3 Comparison of OrthoFinder to other orthogroup inference methods. a The length dependency of OrthoMCL. b The length dependency of
OrthoMCL using our normalised similarity scores. c The length dependency of the complete OrthoFinder algorithm. For A-C scores were calculated as
in Fig. 1c. d Comparison of the results of OrthoFinder F-score with all other methods tested in OrthoBench. e As in (d) but for recall. f As in (d) but for
precision. The error bars show the lower and upper limits of sequence lengths corresponding to the shortest and longest sequences in each bin and
the dot shows the geometric mean of these lengths
best BLAST hit (RBH) as it is widely regarded as a
high precision method for the identification of orthologues gene-pairs [25–27]. Therefore we also sought
to use reciprocal best BLAST hits using our new
length-normalised score to assist in construction of
the orthogroup graph. Henceforth, we refer to a
reciprocal best hit that is obtained using the lengthnormalised score as an RBNH (reciprocal best normalised hit).
In brief, for each gene that had successfully identified one or more RBNHs, the scores for these RBNHs
were used to delimit an inclusion threshold (see
methods). As all scores are normalised for gene
length and phylogenetic distance, hits to other genes
(in any species) that had scores above this inclusion
threshold were included as putative cognate genepairs and added to the orthogroup graph that was
subjected to MCL clustering (for further details see
methods). This novel data selection criterion resulted
in a dramatic improvement in overall clustering accuracy while maintaining gene length independence
(Fig. 3c). The overall results for OrthoFinder, were
0.85 precision, 0.81 recall and 0.83 F-score.
OrthoFinder outperforms all other methods from the
OrthoBench analysis
Given that OrthoFinder exhibited high accuracy on the
benchmark dataset we sought to determine the relative
performance to other commonly used methods for
orthogroup inference. OrthoFinder outperformed all
other methods that have been applied to OrthoBench [5]
as measured by F-score (Fig. 3d), performing 8 % better
than TreeFam (the next best method) 25 % better than
OrthoMCL (the most widely used method), and 33 %
better than OMA (the lowest scoring method in this
test). Importantly, the precision and recall of OrthoFinder were balanced, demonstrating that the method is not
biased towards over- or under-clustering of sequences. It
should be noted that OMA exhibits a low recall in this
test as its goal is to identify orthologues instead of
complete orthogroups and thus paralogues will be absent from the orthologue groups identified by this
method. OMA is included here for completeness as it
was included in the original OrthoBench analysis [5].
In addition to accuracy, a number of other criteria
were used to compare the performance of the different
inference methods in the OrthoBench paper. These
Emms and Kelly Genome Biology (2015) 16:157
criteria included the percentage of orthogroups predicted without any errors, the number of erroneously
assigned genes (that is, false positives, and thus also captured by the precision) and missing genes (that is, false
negatives, and thus also captured by recall) in the assignment of genes to orthogroups and the proportion of
orthogroups affected by these false positive and false
negatives. The results for OrthoFinder according to
these criteria are reported in Additional file 3: Figure S2
and are consistent with the increased accuracy of OrthoFinder compared to other methods. Additionally, the 70
orthogroups that make up the OrthoBench dataset comprise 40 that represent particular biological or technical
challenges and 30 randomly chosen orthogroups. Additional file 4: Figure S3 shows the F-scores for these two
categories separately to illustrate the difference in performance of the method for ‘randomly selected’ and
‘difficult’ orthogroups. OrthoFinder outperformed all
other methods in both categories and achieved an Fscore of 81 % and 90 % on the difficult and randomly
selected orthogroups, respectively.
OrthoFinder is suitable for the analysis of incomplete
datasets
As many research groups are producing partial genome
assemblies and transcriptome resources it is to be expected that sequence datasets will be missing genes due
to incomplete assembly, low expression or errors in gene
prediction. To demonstrate the suitability of OrthoFinder for analysing these incomplete datasets we assessed
the performance of OrthoFinder with between 5 % and
60 % of genes deleted at random from the OrthoBench
input sequences. This revealed that the accuracy of
OrthoFinder is robust to missing data and that it
achieved an F-score of over of 0.75 even when 60 % of
the genes were missing from the input dataset
(Additional file 5: Figure S4). Thus OrthoFinder is suitable for orthogroup inference from partial and incomplete datasets.
OrthoFinder is fast and scalable
The number of species for which genome or transcriptome sequence resources are available is increasing rapidly and there is a corresponding need to be able to infer
orthogroups using these datasets as they emerge. To
keep pace with these increasing demands the algorithm
utilises sparse matrices as the central data structure and
performs many steps using matrix operations. For
example, starting from pre-computed raw BLAST scores
the identification of orthogroups for the OrthoBench
dataset (12 species, 235,033 sequences) takes 14 min 20
s using OrthoFinder on a single core of an Intel Core i74770 3.4GHz CPU. For comparison, OrthoMCL takes 20
Page 5 of 14
h 12 min to perform the same operation using the same
CPU and MySQL for its relational database management
system. As the number of genomes that must be analysed increases, the scalability of the methods used
becomes increasingly important. To demonstrate the
scalability performance of OrthoFinder, the full set of
sequenced plant genomes from Phytozome version 9.0
(n = 41 [28]) were clustered and the results are shown in
Fig. 4. Plant genomes were selected for this test as they
are large with an average of 30,731 protein coding genes
per species in Phytozome version 9.0 and thus they
represent a stringent assessment of the scalability of
OrthoFinder. The memory (RAM) requirements increase linearly with the number of species clustered
(Fig. 4a). This is despite the fact that the number of
BLAST hits increases quadratically with the number of
species (Fig. 4c). This linear scaling is achieved by processing the BLAST hits for each species sequentially and
independently within OrthoFinder. Though the memory
requirements increase linearly, the time requirements
starting from pre-computed raw BLAST scores increases
quadratically with the number of species (Fig. 4b). This
is to be expected as the number of BLAST hits that
must be processed also increases quadratically. For example, identifying the orthogroups for all 41 plant species from Phytozome requires approximately 4 GB of
RAM and took approximately 3 h on a single CPU core.
Fitting the data to a line and extrapolating we estimate
that approximately 450 plant sized genomes can be clustered on a linux computer with 64GB of RAM (Fig. 4a).
Thus OrthoFinder is capable of large analyses on conventional computing resources. It should be noted here
that the BLAST searches incur the largest computational
cost in any orthogroup inference analysis and that this
cost is the same for all inference methods that use BLAST.
In summary OrthoFinder is fast and scalable to hundreds
of species on conventional computing resources.
Inference of high accuracy plant transcription factor
orthogroups
Given that OrthoFinder has increased accuracy over
other methods and that gene length bias has been eliminated from orthogroup inference, we sought to provide
an additional demonstration of the utility of OrthoFinder
for the inference of orthogroups. To do this we selected
plant transcription factors as they are short genes and
will thus suffer from low rates of recall in assignment to
orthogroups in the absence of gene length bias correction. Moreover transcription factor genes are preferentially retained following whole genome duplication
events [29, 30] and thus transcription factor orthogroups
are larger than average and contain multiple independent duplication events in multiple independent lineages
that can cause some inference methods to fail. Finally,
Emms and Kelly Genome Biology (2015) 16:157
Fig. 4 Memory and time requirements of OrthoFinder. Sub-samples
of between two and 41 plant genomes from Phytozome version 9.0
given pre-calculated BLAST results. The average number of genes
per species was 30,731. a Maximum RAM requirements. b Time
requirements. c The number of BLAST hits that must be processed
for a given number of species (provided to place the time and
RAM requirements into context)
previous efforts to define transcription factor orthogroups
have utilised OrthoMCL [31]. Thus current transcription
factor orthogroups will have low recall resulting in fragmented orthogroups spanning few species.
Using established rules for the identification and classification of transcription factors [31] we identified and
typed all of the transcription factors present in the 41
genomes present in Phytozome v9. The complete predicted proteomes from these 41 genomes were then
subject to clustering using OrthoFinder and OrthoMCL
and the distribution of transcription factors in the resultant orthogroups were analysed. OrthoMCL was used
here as it is the method by which all transcription factor
families are currently classified [31]. Consistent with the
increased recall rate for OrthoFinder, analysis of the
resulting orthogroups revealed that 8.5 % more transcription factors were placed in orthogroups using
Page 6 of 14
OrthoFinder than OrthoMCL (Fig. 5a, 97.6 % and 89.1
%, respectively, n = 52,744). Also consistent with the
increased recall rate is that these orthogroups were less
fragmented than those that were produced by OrthoMCL
(Fig. 5b, 897 and 3,024 orthogroups, respectively). Importantly, the orthogroups inferred using OrthoFinder were
missing fewer RBHs (Fig. 5c, 2.15 % and 5.77 %, respectively) and clustered more of the same type of transcription
factor together (Fig. 5d and e). A major difference between
those orthogroups inferred using OrthoFinder and
OrthoMCL is that those produced by OrthoFinder encompass a larger number of species than those recovered
by OrthoMCL (Fig. 5f), thus orthogroups produced by
OrthoFinder encompass greater phylogenetic distances.
As OrthoFinder clustered the transcription factors together into far fewer orthogroups than OrthoMCL (897
versus 3024) we sought to demonstrate that it was correct in doing so. To do this we used gene-tree/speciestree reconciliation to determine if the orthogroups were
true orthogroups if they incorrectly clustered sequences
that are separated by a gene duplication event that
occurred before the last common ancestor of the species
in the analysis. Overall, 858 of the 897 OrthoFinder
orthogroups (96 %) consisted entirely of genes that were
correctly clustered together and only 39 contained some
genes that were separated by a duplication prior to the
last common ancestor (Additional file 6: Table S2 and
Additional file 7: Table S3). Of the 897 OrthoFinder
orthogroups, 210 were identical to ones from OrthoMCL
and 471 OrthoFinder orthogroups were strict supersets of
2,271 OrthoMCL orthogroups (Additional file 6: Table S2
and Additional file 7: Table S3). Of these, 90 % (425) were
true orthogroups that each encompassed on average four
OrthoMCL orthogroups (1,709 in total).
An illustrated example showing an OrthoFinder
orthogroup and its constituent OrthoMCL orthogroups is
provided in Fig. 6. Here the OrthoFinder orthogroup (labelled bHLH 8 in Additional file 6: Table S2) contains all
known type IVc bHLH transcription factors [32]. Type
IVc bHLH transcription factors have previously been
shown to be conserved from green algae to land plants
and thus span the complete taxonomic range contained
in this analysis [32]. The OrthoFinder orthogroup correctly united eight paraphyletic OrthoMCL orthogroups
and included 36 transcription factors (highlighted in grey)
that were not clustered into any orthogroups by
OrthoMCL (Fig. 6). The phylogenetic tree shows that
there are no genes present in this OrthoFinder orthogroup
that were the product of a gene duplication event prior to
the divergence of the last common ancestor of all species
in the analysis. This is only one example and the complete
set of phylogenetic trees for each OrthoFinder transcription factor orthogroup are provided in Additional file 6:
Table S2 along with the OrthoMCL subsets that comprise
Emms and Kelly Genome Biology (2015) 16:157
Page 7 of 14
Fig. 5 Inference of orthogroups of plant transcription factors. In all cases dark grey bars indicate the results for OrthoFinder and light grey bars
indicate the results for OrthoMCL. a Comparison of the fraction of transcription factors that are assigned to orthogroups by OrthoFinder and by
OrthoMCL. b Comparison of the number of transcription factor orthogroups identified using each method. c The percentage of RBNH/RBH (for
OrthoFinder/OrthoMCL) hits that are not contained in orthogroups identified using each method. d The number of transcription factors of the
same type that each transcription factor is connected to in the orthogroups produced by OrthoFinder. e as in (d) but for OrthoMCL. f Comparison of
species coverage for transcription factor orthogroups identified by each method. g The number of orthogroups for each transcription factor type
identified by OrthoFinder
these groups where appropriate. Also contained in this
table are the results of the gene-tree/species-tree reconciliation for each tree inferred from an OrthoFinder
orthogroup.
Taken together, using OrthoFinder to cluster transcription factor genes resulted in the identification 687 (897
less the 210 that were the same) novel orthogroups of
transcription factors across 41 different species comprising 7.7 million pairwise relationships (of which 6.9 million are not detected by OrthoMCL). Thus using
OrthoFinder to cluster transcription factors has provided
significant new insight into the relationship of transcription factor genes across plants. The number of
orthogroups for each transcription factor type is provided in Fig. 5g and the full classification including all
constituent accession numbers is provided in Additional
file 6: Table S2.
Algorithm implementation and evaluation criteria
OrthoFinder is an algorithm that infers orthogroups
across multiple species. The method does not classify
the pairwise relationships that exist between genes
within these orthogroups. The method does not require
synteny information and is thus equally suitable for
clustering protein sequences predicted from genome or
transcriptome datasets. OrthoFinder is run with a single command and requires as input a directory containing one protein sequence FASTA file per species to be
clustered. OrthoFinder does not require preprocessing
of FASTA files (such as filtering of sequences) and does
not require knowledge or use of any relational database
management system such as MySQL. It outputs
orthogroups in two file formats: the Quest for Orthologs
community standard OrthoXML [33] and in plain text
format with one orthogroup per line.
There are two common problem definitions used by the
majority of homology inference algorithms. One is to predict pairs of orthologues (pairs of genes from two different
species descendent from a single gene in the last common
ancestor of the two species) and pairs of recent, withinspecies paralogues (genes-pairs arising from a duplication
event since the last speciation event for that species). The
other approach, and the one used here for OrthoFinder, is
to predict orthogroups. An orthogroup is the set of genes
derived from a single gene in the last common ancestor of
all the species under consideration. This is the approach
used by OrthoMCL [13] and eggNOG [34]. OrthoFinder
follows this second approach to produce orthogroups of
protein coding genes as this is a logical extension of
Emms and Kelly Genome Biology (2015) 16:157
th
_G
ra
V.
SV
V.
vin
A. ifera
c
_
IV oeru GS
v
T
ini
le VIV
tus
a
fe 01
g
su
m_ utta _mg ra_ 036 _Aq T01
S.
2
lyc
t
G
v
op PGS us_ 1a0 SV 020 uca 036
m
ers
_ 2
C
1
icu 000 gv1 425 IVT0 01 017 03
a
3
m_
M.
5
10
_0 00
So DMP 012 m
tru
27
07 1
84
ly
n
4
76
06
G. catula c09g 000 3m
60
.1
ma
08 299
_
0
M
1
x_G
98
ed
98
G.
7
t
ly
0
r
ma
1
P. v
x_G ma10 g04 .2.1
ulg
g01 343
aris
lym
0
_Ph
a20
0
.
g22 10.1 1
E. g vul.00
01
7
ran
T. c
dis_ G272 0.1
aca
G. ra
6
o_T
imon
hec Eucgr. 00.1
c1E
dii_G
H
G02 0017
orai.
C. sa
4
086
0
tivus
6t1 .1
C. pa
_Cuc 11G172
paya_
50
sa.25
evm.m
5290 0.1
odel.s
T. halo
.1
uperc
phila_T
o
ntig_4
hhalv1
4.78
002592
7m
A. lyrat
a_4934
A. thaliana
62
_AT4G144
10
.1
C. rubella_C
arubv1000533
5m
B. rapa_Bra041033
B. rapa_Bra023751
pa
ife
0.33
tta
M.
ero
S.
tub
80 - 100
60 - 79
40 - 59
20 - 39
0 - 19
pa
vin
S.
gu
_G
V.
ys
M.
C.
ma
3
88
05
_4
id
nd br
lle -hy
oe .0
46
m v1
35
76
S. 4.103 1
03
9
5 0m 17
00
46 73 000 232 P4 .2.1
2
3
0
M
0
0
a
rn a01 DP 000 03D 997
8m
_m pp _M P0 00 g07 1513
0
ca a_ ca MD SC 02
es rsic esti ca_ _PG olyc 4.1_ 681
v
a
3
i
69
00
F. . pe dom est sum m_S ssav m00
02
.
P . om ero sicu ca
0.1
42 us10
M . d b er
100
ta_ 98
L
M S. tu cop ulen is_2 um_ 1G03 800.1
1
c
1
ly
un
im
S. M.es omm atiss otri.0 04G03 0.1
P
it
0
0
c
_
R. L. us arpa Potri. G1760
hoc arpa_ ai.009
.11
c
r
82t1 ntig_81
tric
P. tricho ii_Go G0300
rco
e
p
P. imond cc1E
u
el.s
e
d
a
h
r
o
T
.
m
1
G cao_ evm.m
1651
a
_
T. c apaya iclev100
702m
C. p ntina_C e1.1g026
me
rang
1
0.
C. cleinensis_o
86
a15g06
3.1
C. s
_Glym ma15g0680
G. max
_Gly
1
G. max lyma13g32470.
6600.1
19
G. max_G
6G
_Phvul.00
P. vulgaris
007339001
T01
VIV
_GS
V. vinifera
38001
10073
VIVT0
V. vinifera_GS
fi
or
O. lucimarinus_29509
M. pusilla RCC299_109344
M. pusilla CCMP154
5_47661
V. carteri_Vo
car20000686m
C. reinhard
tii_g5331.t
P. pate
1
C. su
ns_Pp1
s13_30 bellipsoidea
C1
4V
69
P
6.2
_42874
P.
. pa
P. patens_ tens_Pp1
s107_
S. m patens Pp1s22
145V
S. m oellen _Pp1s2 9_67V6
6.1
.1
oell dorfi_ 46_45
V6.3
1
P. v endo
irga rfi_ 65556
P. tum_ 165311
v
P
P. irgatu avirv0
virg m_
002
Pa
79
at
B. um_P virv00 73m
d
050
avir
O ist
S. . sa achy v000 745m
on
6
tiv
ita
S. lica a_L _Br 4016
ad
P.
m
b
O
_
vir
Z. icolo Si01 C_O i3g0
1
ga
82
ma r_
s
0
S
2g 540.
3
t Z
1
02
R um . m ys_ b04 2m
48
M .com _Pa ays GRM g00
0.3
. e m vi _G
1
sc u rv0 RM ZM2 200.
n
1
0
ul
en is_ 06 ZM G31
48
ta 29 206 5G
8
_c 60
8
as 3. 5m 568 2_T
04
sa m0
37
_T
va 00
01
51
4.
1_
3
01
58
83
m
m
79
78
01 86
1_ 21 9
2
4.
9
va 00 11 8
4
sa s1 02
as _Lu s10 411 6
0
7
_c
1
.
ta um _Lu s10 218
00 .1
0
u
en im m
57
10
ul iss u _L
16 2900
sc tat sim um Lus 8G
07
_
0
rid
. e si tis im
M L. u sita tiss um otri.0 10G -hyb
.0
u ta sim P
ri.0
9
L. usi tis pa_ Pot 1-v1
74
a
.
5
L. usit car rpa_ 436 7m
82
0
333
o
00
999
L. rich oca na11 1121 P00
269
t
h
r
068 0000
D
0
P. . tric a_m ppa a_M P000
314
DP
P sc
D
tic
821
a_
_M
0
s
M
e
_
ve sic
tica
000
58
F. . per . dom estica omes
DP
948
d
P
M
007
a_M
0
c
dom M.
esti
DP0
M.
0
dom
a_M
278
5877
M.
stic
265
0001
ome
000
DP0
M. d _MDP0
ca_M
stica domesti
ome
M.
M. d
872m
0005
iclev1
28060m
na_C
e1.1g0
menti
C. cle nsis_orang
14200m
ubv100
C. sine
lla_Car
C. rube
479835
A. lyrata_
98832
A. lyrata_8
3G23210.1
A. thaliana_AT
177m
T. halophila_Thhalv10021
B. rapa_Bra014923
al
ia
na
_A
ya A. l T1G
y
_
ev rat 51
bic
Z.
R
MZ
o
ma
V.
E. m. a_4 07
P.
T
M2 lor_
ys
v
vir
_G
Sb inif gra U.c 742 0.2
ga
er nd
RM G06
07
on 2
tum
a
9
g
Z
6
_P
00 _G is_E tig 4
_
S
av M2G 30_
2
Z.
irv
T0 900 VIV ucg 33
ma
00 093
2
.1
ys_
T0 r.F 751
7
10 03 .1
GR 0492 44_
29
T0
14 42
MZ
m
1
58 0.
M
90 1
P. v
S. 2G01
01
it
irga
7
P. v
tum alica_ 586_
irga
_Pa
Si0
T0
tum
P. v
1
1
_
v
4
P
ir
irga
339
avir
v00
tum
v
0
m
0
0
0
_Pa
1
P. v
virv 01848 646m
irga
000
5
tum
_Pa
310 m
virv0
5
O. sa
004 2m
tiva_
269
LOC
5m
B. dis
_
tachyo Os08g0
43
n_Bra
P. virg
di3g1 90.2
atum_P
4590
avirv00
P. virgat
.1
03
um_Pav
95
irv0002 12m
9776m
S. italica_
Si030877
m
S. bicolor_S
b02g03558
0.1
Z. mays_GRMZ
M2G058451_T0
1
O. sativa_LOC_Os07g35870.1
B. distachyon_Bradi1g25470.1
A. coerulea_Aquca_005_00425.1
0.1
C. sativus_Cucsa.07294
a017518m
P. persica_pp
00.1
.002G0165 77t1
rai
Go
ii_
77
G. raimond
c1EG00 1.1
o_Thec
87
T. caca
gr.K00
.1
dis_Euc
670.2
E. gran
7g052 004540340001
c0
ly
o
P40 10306 81m
m_S
u
M
ic
D
0
rs
3
pe
C000 GSVIVT 100163646m
S
S. lyco
G
P
v
_
6
sum_ . vinifera a_Cicle .1g02 2672m
bero
1
3
n
V
S. tu
enti orange 1.1g0
lem
C. c ensis_ range 2280.2
o
3
in
_
s
g
C.
.1
nsis ma12
e
40 0.1
in
81
C. s x_Gly .1
g3 250
ma 1470 a13 G11 00.1
m 05 90
G.
2
g
ly
4
0
G
8
0
a0
01
x_
ul.
lym . ma Phv tr2g 6_T .1
d
0
_
G
x_G
ris _Me 6190 228 m
ma
lga
G.
vu atula 2G0 g02 608 m
.
3 m .1 1
9
5
c
P
un MZM Sb0 000 313 621 20 40.
r
t
M. _GR lor_ irv0 i02 000 235 381
o av _S 0 2g g
s
c
y
i
0
i
5
ma S. b m_P alica virv rad s0
Z.
t
u
at S. i _Pa n_B C_O
irg
o O
m
v
y
L
tu h
P.
ga ac a_
vir ist tiv
P. B. d . sa
O
Z.
G. max_Glyma06g41620.1
P. vulgaris_Phvul.011G141300
.1
C. sativus_Cucsa
.267170.1
C. sativus_C
ucsa.267130.
P. persica
1
_ppa0090
77m
F. vesc
a_
M. dom mrna03303.1
-v1.0-hy
es
brid
M. do tica_MDP00
mest
007539
ica_M
46
G. ra M. dom
DP00
0026
T. c imondii estica_M
4
8
aca
03
_
o_T Gorai.0 DP0000
1432
hec
c1E 02G08
66
G.
raim G. rai G0311 1900.1
mon
18t1
ond
M. G. ra
R. escu imond ii_Gor dii_Go
rai.0
co len
ai.0
ii_
07G
C. mmu ta_c Gorai. 13G1
315
as
n
p
6
0
200
P. apay is_28 sava 07G3 4100.1
.1
t
4
148
C.
L. P. richo a_ev 166.m .1_0
0
0.1
151
C. cl L. us tric ca m.m
0
0
79m
10
sin em us itat hoc rpa
od
5
i
_
e
e
s
i
4
en nt tat si arp Po
l.s
C. sis ina_ issim mum a_P tri.0 uper
sa _or Ci um _L otri 11G con
tig
us .0
1
tiv an cle
us ge v1 _Lu 100 01G 3240 _97.
78
0.
_C 1. 00 s1
43 41
uc 1g0 093 001 076 660 1
0
1
sa 26 54
.1
.0 59 m 167
86 9
m
30
0.
1
G. max_Glyma12g16560.1
0.1
M. truncatula_Medtr4g08137
edtr2g082640.1
M. truncatula_M
300.1
lyma12g34
G. max_G
36260.1
Glyma13g 1000.1
09
G. max_
ul.005G 9
aris_Phv
54
P. vulg
ta_495 0.1
A. lyra
8
G546
_AT5 4564m
liana
001
8m 4
A. tha
halv1
675
h
6
2
ila_T
029
100
loph
rubv a_Bra0 011
T. ha
_C a
p
29
bella
B. ra _Bra0 22738
C. ru
0m
r a0
apa
B. r pa_B 01290 2
a
B. r gv1a 3347 .1
s_m 000 0.2
atu
04 30
P4
gutt 3DM g064 254 .1 2m
M.
00
00 0.2 09
07
0
1
3
C0
m
4
lyc
GS _So DMP 0065 a01 628
_P
1
9
g
m
3
v1
00 c10 mg 001 045
sum rsicu
1
ly
C0
e
ero
3
s_
tub lycop PGS _So tatu rubv ra0
S.
_
a _B 7
ut
um
S.
m
C
g
su
pa 51 0 2m
rsic M.
a_
ero ope
ell . ra 920 37 75
9 1
c
B
tub
ub
_
.r
. ly
S.
ta _86 01
S
C
a
0
lyr ata v1
A. lyr hal
A. Th
_
ila
ph
lo
ha
T.
A.
Page 8 of 14
OrthoMCL group 1
OrthoMCL group 2
OrthoMCL group 3
OrthoMCL group 4
OrthoMCL group 5
OrthoMCL group 6
OrthoMCL group 7
OrthoMCL group 8
Not in any OrthoMCL group
Fig. 6 A bootstrapped maximum likelihood phylogenetic tree of the OrthoFinder orthogroup containing the type IVc bHLH transcription factors
(bHLH 8). The OrthoMCL orthogroups that are subsets of the OrthoFinder orthogroup are indicated by different coloured fonts. Thirty-six of the
OrthoFinder clustered genes (coloured grey) failed to be clustered in any OrthoMCL orthogroup. The tree was inferred using RAxML using the
PROTGAMMAAUTO model (the JTT was model was selected as having the highest likelihood) with 100 bootstrap replicates. Scale bar indicates
the number of substitutions per site. Percentage bootstrap support values are indicated by coloured circles shown at internal nodes
orthology to multiple species as it groups all genes
descended from a single gene in the last common ancestor
of all species being considered.
Methodological overview of the OrthoFinder algorithm
An overview of the algorithm in shown in Fig. 7, it proceeds in five phases corresponding to sections b-f in the
figure:
1. BLAST all-versus-all search (Fig. 7b). Protein
BLAST (blastp) with an e-value threshold of 10−3 is
used so as to avoid discarding putative good hits for
very short sequences. A relaxed threshold is used at
this stage of the method as subsequent steps filter
out false positive hits using stringent, orthogroupspecific criteria for inclusion (described below).
2. Gene length and phylogenetic distance
normalisation of the BLAST bit scores (Fig. 7c).
This step models the all-vs-all BLAST hits for each
pairwise comparison between species to identify and
remove the gene similarity dependency on gene
length and phylogenetic distance. This is done so
Emms and Kelly Genome Biology (2015) 16:157
Page 9 of 14
Fig. 7 Overview of the steps in the OrthoFinder algorithm for two example orthogroups of genes from three species. a The unknown orthogroups
that the algorithm must recover, shown as a gene tree. b BLAST search of all genes against all genes. c Gene length and phylogenetic distance
normalisation of BLAST bit scores to give the scores to be used for orthogroup inference. d Selection of putative cognate gene-pairs from normalised
BLAST scores. e Construction of orthogroup graph, genes are nodes in the graph and pairs of genes are connected by an edge with edge weights
given by the normalised bit score. f Clustering of genes into discrete orthogroups using MCL
that the best hits between all species achieve the
same scores regardless of sequence length or phylogenetic distance.
3. Delimitation of orthogroup sequence similarity
thresholds using RBNHs (Fig. 7d). This step uses
information from RBNHs (Reciprocal Best lengthNormalised hit) to define the lower limit of
sequence similarity for putative cognate genes of
each query gene. To be included in the orthogroup
graph a gene-pair must be an RBNH or produce a
hit that is better scoring than the lowest scoring
RBNH (irrespective of species) for either gene.
4. Constructing an orthogroup graph for input into
MCL (Fig. 7e). Putative cognate gene-pairs are
identified as above and are connected in the
orthogroup graph with weights given by the
normalised BLAST bit scores.
5. Clustering of genes into orthogroups using MCL
(Fig. 7f ).
The steps 2 to 4 are the novel parts of our algorithm
and are described in detail below.
Gene length and phylogenetic distance normalisation
The aim of this normalisation procedure is to remove
gene length bias from BLAST bit scores and to normalise for phylogenetic distance between species. MCL converts sets of similarity scores into clusters by breaking
apart clusters of genes that have low similarity scores
(and therefore are unlikely to be orthogroups) and preserving clusters of sequences that have high similarity
scores. If the similarity scores between long sequences
are inherently larger than the similarity scores between
short sequences then the clustering will preferentially
Emms and Kelly Genome Biology (2015) 16:157
break apart clusters of short sequences while preserving
clusters of long sequences. This effect can be clearly seen
in the results of a typical OrthoMCL cluster. Here, long sequences are placed in overly large clusters leading to low
precision, and short sequences remain un-clustered leading
to low recall (Fig. 3a). The species-wise normalisation implemented by OrthoFinder similarly ensures that orthologues from more distant species (that have inherently lower
similarity scores due to phylogenetic distance) are not preferentially discarded and is similar to a step that is performed in OrthoMCL wherein all scores are divided by the
average score between a given species pair [13].
Previous methods have exploited BLAST e-values (rather than bit scores) as a measure of similarity between
sequences. However, as can be seen in Fig. 1b the use of
e-values for assessment of similarity between sequences
is flawed. Here, the minimum e-value that can be
obtained for a given query sequence decreases with
increasing sequence length until, at a certain length, the
lower bound for e-values is reached and BLAST returns
an e-value of 0. This creates two problems: (1) long sequences will frequently have low quality hits with better
e-values than the best possible hits of short sequences;
and (2) the floor value for the e-value calculation means
that length bias is non-uniform and thus irreversible.
Specifically, beyond the floor-value e-values cannot be
used to distinguish between the qualities of hits as they
are all assigned the same e-value. As can be seen in the
heat map shown in Fig. 1b, many hits obtain this floorvalue for a given pairwise species comparison and thus
their similarities are indistinguishable. This length-bias
must be removed to prevent biasing downstream clustering applications.
In this method we construct a similarity measure
between sequences based on the bit-score normalised to
take into account the query and hit sequences lengths and
the phylogenetic distance between species. Unlike evalues, the bit-scores do not suffer from the presence of a
threshold limit and thus different amounts of sequence
similarity can be distinguished regardless of the lengths of
the sequences involved. Let Lq be the length of the query
sequence and Lh be the length of the hit sequence. In an
analogous manner to the e-value calculation made by
BLAST and other sequence comparison methods, we use
the variable Lqh = LqLh to quantify the lengths of a pair of
sequences that are being compared.
The length normalisation procedure is shown in Fig. 2.
For each species pair, we:
1. Sort all BLAST hits according to Lqh = LqLh.
2. Put the hits into equal sized bins of 1,000 hits (put
the ‘shortest’ 1,000 hits according to Lqh into one
bin, the next 1,000 hits into the next bin and so on
for all the hits). If there are fewer than 5,000 hits
Page 10 of 14
then we put the hits into bins of 200. Using fixed
sized bins means that it is not necessary for the
algorithm to specify the location of the bins in
advance.
3. Sort the hits in each bin according to BLAST bit
score and select the top 5 % of hits from each bin.
Find the parameters a and b that best describe the
fit between sequence similarity scores and gene
length for the selected hits using the equation
log10Bqh = a log10Lqh + b where Bqh is the BLAST
bit score between sequences q and h.
4. Normalise all obtained BLAST bit scores (not just
the top 5 %) between the given species pair
according to, B'qh = Bqh/10bLaqh, so that B'qh, (the
normalised score) is the BLAST bit score for a
hit divided by the BLAST bit score that would be
expected for the best hits between sequences of
that length for the species pair under
consideration.
The top 5 % of hits are used rather than RBHs as
selection of RBHs will be affected by the gene lengthbias that we wish to correct. Moreover, gene duplication
events can frequently cause RBHs to fail (Additional file
8: Figure S5) and thus reduce the number of data points
that are available for fitting. The normalisation procedure ensures that the best hits between a given species
pair achieve (on average) the same scores irrespective of
their gene length.
OrthoFinder also normalises for phylogenetic distance, this is done so that the similarity scores
between orthologues will be independent of phylogenetic distance (that is, the true orthologues in distantly
related species will obtain similar scores to the true
orthologues in closely related species). If this step is
not done then true orthologues in distantly related
species will always obtain lower scores than true
orthologues in closely related species. Thus during
graph clustering (which is unaware of phylogenetic
relationship between species) distantly related true
orthologues (and cognates) will become disconnected
from each other more easily than closely related true
orthologues (and cognates) in the orthogroup graph.
Previous efforts to prevent this phylogenetic bias include dividing the observed similarity score for any
given gene-pair by the mean similarity score observed
for all reciprocal best hits between genes in that species pair [13]. However, in the absence of gene length
information this means that short genes will always
be penalised more than long genes.
Though there is precedent for the use of Lqh = LqLh to
quantify the lengths of a pair of sequences that are being
compared [14], different functions for gene length normalisation were also assessed. All other functions,
Emms and Kelly Genome Biology (2015) 16:157
including for example the use of the variable L~ qh ¼ Lq
þLh in place of Lqh, gave a lower overall clustering
accuracy.
Identification of putative cognate gene-pairs for inclusion
in the orthogroup graph
Once scores are normalised OrthoFinder exploits
RBNHs to identify putative cognate gene-pairs. RBHs
are a high precision method to identify putative orthologues [25–27] and OrthoFinder uses the reciprocal
requirement exploiting its length and phylogenetic distance normalised BLAST scores. For each gene the
scores for its RBNHs are used to delimit the extent of
sequence similarity of that gene’s orthogroup. Specifically, for each query sequence, q, any hit, h, with a normalised score, B'qh, greater than or equal to the score for
the lowest scoring RBNH of q is selected as a putative
cognate gene-pair of q and therefore is connected to q
in the orthogroup graph that is subsequently subjected
to MCL clustering.
The rationale for this approach is that the level of
normalised similarity of a query gene and its RBNHs can
be used to estimate the extent of similarity of other
genes within the same orthogroup. All genes more similar to a query gene than any of the query gene’s RBNHs
(irrespective of species) are likely members of the same
orthogroup. Therefore, the normalised similarity score
for the most dissimilar RBNH of a gene is used as a cutoff for inclusion of additional cognate gene-pairs from
all species. That is q is connected to h in the orthogroup
graph if B'qh > B'qR where R is an RBNH of q. This provides a simple and robust method for recovering cognate
gene-pairs that may otherwise be difficult to detect due
to duplication events that can cause the RBNH method
to fail. Further details, explanation and worked examples
are provided in Additional file 8: Figure S5.
In summary, the novel method presented here generates, for each query gene, an independent prediction of
all the genes in its orthogroup. This orthogroup graph is
then clustered using MCL with its default inflation parameter of 1.5. The effect of varying the MCL inflation
parameter on the OrthoFinder result is shown in
Additional file 9: Figure S6. The F-score of OrthoFinder
is relatively stable to variation in MCL inflation parameter, however it is possible to trade precision against recall by varying this parameter (Additional file 9: Figure
S6). For comparison the analogous analysis is also presented for OrhtoMCL (Additional file 10: Figure S7).
Implementation
OrthoFinder is written in python. It requires python
together with the numpy and scipy libraries [35] to be
installed. OrthoFinder requires the standalone BLAST+
Page 11 of 14
and MCL algorithms that are freely available. These
standalone applications must be installed separately to
OrthoFinder and are not included in the OrthoFinder
package. The implementation makes use of sparse matrices to store hits between sequences. This provides a
memory efficient method of storing the data and allows
key parts of the algorithm to be expressed using scipy’s
highly optimised C++ implementations of sparse matrix
operations. OrthoFinder can either run the BLAST
searches for you or can be run on pre-computed BLAST
searches. If you chose to run BLAST searches independently then instructions are provided in the documentation
for how to process your sequence names in the precomputed BLAST output. Similarly OrthoFinder will also
automatically run MCL for you. However if you wish to
run MCL separately using different parameter settings
then the MCL input files are stored for this purpose in a
working directory.
Evaluation
OrthoBench [5] is the only manually curated dataset of
orthogroups for the assessment of orthogroup prediction
algorithms. It was used in this work for assessing OrthoFinder as it has been independently evaluated, it underpins the testing of multiple different methods and it is a
well-defined and stringent test of the problem that
OrthoFinder was designed to solve. Criteria such as
functional similarity within orthogroups, expressed for
example using enzyme classification numbers [36], were
not used in this work since not all proteins with the
same function are members of the same orthogroup and
members of the same orthogroup do not necessarily all
have the same function. As we are using real benchmark
datasets for which only a subset of sequences have been
assigned to ‘true’ gene families the extent of true negative orthologue assignments is unknown (as is the case
for all methods tested on this dataset). Thus we cannot
use the Matthews correlation coefficient to assess the
performance of the orthogroup inference methods. In
the absence of this information the simplest and most
transparent evaluation of the accuracy of any prediction
method is to measure its precision and recall.
precision ¼
recall ¼
TP
T P þ FP
TP
T P þ FN
Where TP is the number of true positive orthogroup
assignments (that is, correct assignments), FP is the
number of false positive orthogroup assignments (that
is, incorrect assignments) and FN is the number of false
negative orthologue assignments (that is, missing assignments). The F-score is the harmonic mean of these two
Emms and Kelly Genome Biology (2015) 16:157
measures, where the harmonic mean weights towards
the worst performing measure. We also provide other
evaluation measures from the original OrthoBench
analysis in Additional file 3: Figure S2.
Inference of transcription factor orthogroups
To infer transcription factor orthogroups we first identified the set of transcription factors present in all genomes present in Phytozome V9. This identification was
performed using the same rules for the presence and
absence of PFAM domains as has been previously
described [31]. The full set of protein coding genes from
these genomes (including all the transcription factors)
was then clustered using OrthoFinder and OrthoMCL
and the distribution of the transcription factors within
these orthogroups was analysed. OrthoMCL was selected for comparison here as it is the method by which
all orthogroups of transcription factors are currently
defined [31]. An orthogroup of transcription factors was
defined as an orthogroup whose constituent genes comprised ≥50 % transcription factors of the same domain
classification.
To determine if OrthoFinder was correct in combining multiple separate OrthoMCL orthogroups each
orthogroup was subject to gene-tree—species-tree reconciliation. Using, gene-tree species-tree reconciliation
it is possible to determine if OrthoFinder had incorrectly placed together genes that had diverged prior to
the last common ancestor of the species being analysed.
To do this, gene trees were inferred for each orthogroup
by aligning the sequences using mafft-linsi [37] and inferring a maximum likelihood tree from this alignment using
FastTree [38]. DLCpar [39] was used to reconcile these
gene trees with the known species tree [28]. Using this
method, each gene tree was assessed to determine if it
contained bipartitions that occurred prior to the divergence of the last common ancestor of all the species being
analysed. If such a bipartition was identified then the
orthogroup was considered not to be a true orthogroup as
it contained one or more genes that evolved by duplication prior to the last common ancestor of all species under
consideration.
Discussion
In this work we have presented OrthoFinder, a novel
method for inference of orthogroups. Our method is focused on a clear definition of an orthogroup, that is, that
an orthogroup contains all genes descended from a single
gene in the last common ancestor of the species whose
genes are being analysed. This definition avoids conflating
shared ancestry with other criteria that are not equivalent,
such as functional conservation. Our method is designed
to address the problem of orthogroup inference rather than
categorise the disparate relationships that occur between
Page 12 of 14
individual genes within an orthogroup. These relationships
are best resolved by first inferring orthogroups using
OrthoFinder and then using multiple sequence alignment
and phylogenetic methods on these orthogroups.
The two key novel features of our method are: (1) a
method to automatically remove gene length bias and
phylogenetic distance from sequence similarity scores;
and (2) a novel method to define the sequence similarity
limits of an orthogroup. In the tests performed on the
only publicly available orthogroup benchmark dataset
(OrthoBench) OrthoFinder outperformed all of the commonly used orthogroup assignment methods by between
8 % and 33 %. Moreover we have shown OrthoFinder to
be scalable and robust to missing genes typical of incomplete genomes and de novo transcriptome assemblies. The
software is freely available and can take pre-computed
BLAST scores as input making it easy to test on any newly
developed benchmarks for which pre-computed BLAST
scores are available.
We further demonstrate the utility of OrthoFinder by
providing a novel classification of all transcription factors
in the available, fully-sequenced plant genomes present in
Phytozome V9. This analysis clusters 97.6 % of the 52,744
putative transcription factors into orthogroups. This novel
analysis identifies millions of relationships that have not
previously been reported providing new insight into the
relationship and evolution of transcription factor gene
families in plants.
Inferring orthologues underpins much of modern
biological research and is among the first steps in the
annotation and analysis of genome and transcriptome
sequencing projects. As sequencing technologies are
now within the budgets of most research groups these
data resources are increasing in number rapidly. Thus
there is a requirement for an orthogroup inference
method that is accurate, robust, scalable, and that can
be run easily by independent research groups on conventional computing resources. Many orthogroup inference methods are not available for general use but
are provided as static databases (for example, EggNog
and TreeFam). Thus the most widely used methods
are those that enable researchers to analyse their own
data resources. With this in mind OrthoFinder has
been developed with the aim of being easy to use.
The method is executed as a single command, has
minimal dependencies and requires as input just the
individual protein sequence FASTA files for each species that is being clustered. The algorithm carries out
all calculations (including BLAST searches and MCL
clustering) and outputs the orthogroups in both a
plain tab delimited text file and in the OrthoXML
community format. The algorithm itself is small, fast
and memory efficient, making it suitable for use on
linux desktop computers. Further information about
Emms and Kelly Genome Biology (2015) 16:157
the algorithm can be found at [19] and a standalone
implementation of the algorithm is available under
the GPLv3 licence at [20].
Additional files
Additional file 1: Table S1. Table of all false positive and false negative
genes produced by clustering OrthoBench using OrthoMCL. (XLSX 57 kb)
Additional file 2: Figure S1. An overview of how the OrthoFinder
score transform also normalises for phylogenetic distance between
BLAST scores. For illustration the All-Vs-All BLASTp scores are shown
for the longest protein isoform from each protein coding gene in
Homo sapiens vs all other species in the test. Note the difference in
properties of fitted line, here the slope of the lines for the more
closely related species are greater than for the more distant species.
Following transform all fitted lines are transformed to the same value
with a slope of 0. Thus the best scoring hits between distantly
related species pairs and closely related species pairs achieve the
same score hence normalising for interspecies phylogenetic distance.
To provide further illustration all fitted lines between Homo sapiens and all
other species before and after normalisation are shown on the right.
(PDF 1440 kb)
Additional file 3: Figure S2. Results on the OrthoBench dataset using
additional assessment criteria presented in the original OrthoBench paper
(the calculation for one of the plots could not be reproduced using the
information provided in the OrthoBench paper and so has not been
included). (PDF 29 kb)
Additional file 4: Figure S3. F-scores on the OrthoBench dataset for
the 30 randomly chosen gene families and the 40 biologically or
technically challenging gene families that make up the dataset.
(PDF 357 kb)
Additional file 5: Figure S4. Accuracy of OrthoFinder as a function of
fraction of missing sequences. With poor gene coverage many RBNBs will
be missing and so cannot inform the identification of orthogroups. To
simulate this, genes were removed at random from the OrthoBench
dataset input into the OrthoFinder and the precision, recall and F-score
on the remaining genes were measured. (PDF 356 kb)
Additional file 6: Table S2. Transcription factor orthogroups inferred
using OrthoFinder. Orthogroups are named according to the PFAM
domain classification and from largest to smallest where #1 is the largest
orthogroup of that transcription factor type. Accession numbers provided
are correct for Phytozome version 9. Maximum likelihood phylogenetic
trees are provided in column AS. The relationship of the OrthoFinder
orthogroup to OrthoMCL orthogroups is listed in column B. ‘Identical’
means that the OrthoFinder orthogroup is identical to an OrthoMCL
orthogroup. ‘Superset’ means that the OrthoFinder orthogroup is a
superset of more than one OrthoMCL orthogroups. ‘Subset’ means that
the OrthoFinder orthogroup is a subset of a larger OrthoMCL orthogroup.
‘Neither subset nor superset’ means that the OrthoFinder orthogroup
contains sequences that were not clustered into any orthogroups by
OrthoMCL. The results of the gene tree species tree reconciliation are
provided in column C, this column specifics whether the phylogenetic
tree inferred from the OrthoFinder orthogroup contained genes that
were separated by a duplication prior to the last common ancestor.
(XLSX 2110 kb)
Additional file 7: Table S3. Results of the gene tree species tree
reconciliation analysis. This table contains a summary of the full dataset
presented in Additional file 1: Table S1. LCA is the last common ancestor
of all the species being analysed. OG is orthogroup. (XLSX 10 kb)
Additional file 8: Figure S5. A worked example showing the criteria for
identification of putative cognate gene-pairs used by OrthoFinder. (PDF 72 kb)
Additional file 9: Figure S6. The effect of the MCL inflation parameter
on the F-score, precision and recall of OrthoFinder on the OrthoBench
dataset. In OrthoFinder we use the default parameter of 1.5 which gives
the results reported in this paper (84.5 %, 80.8 % and 82.6 % for precision,
recall and F-score, respectively). Increasing the inflation parameter can be
Page 13 of 14
used to achieve higher precision at the cost of lower recall. Conversely, a
smaller value of inflation can be used to achieve higher recall at the cost
of lower precision. In this dataset the best result obtained by OrthoFinder
in terms of F-score was 83.9 % using a value of 1.7 for the inflation
parameter. The scores for precision and recall were 88.4 % and 79.8 %,
respectively. (PDF 357 kb)
Additional file 10: Figure S7. The effect of the MCL inflation parameter
on the F-score, precision and recall of OrthoMCL on the OrthoBench
dataset. The standalone OrthoMCL v2.0.9 was run on the OrthoBench
dataset to produce the MCL input graph and MCL was rerun on this graph
with a range of inflation parameters. (PDF 357 kb)
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SK conceived the project. DE developed the algorithm. SK and DE analysed
the data and wrote the manuscript. All authors read and approved the final
manuscript.
Acknowledgements
The authors would like to thank David Roos and Steve Fischer for their
advice and comments on the manuscript. The authors would also like to
thank the anonymous Reviewers whose suggested additional analyses
strengthened the manuscript. This work was supported by the Bill and
Melinda Gates Foundation and UKAID as part of the C4 rice project.
Received: 23 December 2014 Accepted: 8 July 2015
References
1. Alexeyenko A, Tamas I, Liu G, Sonnhammer ELL. Automatic clustering of
orthologs and inparalogs shared by multiple proteomes. Bioinformatics.
2006;22:E9–15.
2. Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C. OMA 2011: orthology
inference among 1000 complete genomes. Nucleic Acids Res. 2011;39:D289–94.
3. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein
families. Science. 1997;278:631–7.
4. Fitch WM. Homology - a personal view on some of the problems. Trends
Genet. 2000;16:227–31.
5. Trachana K, Larsson TA, Powell S, Chen WH, Doerks T, Muller J, et al.
Orthology prediction methods: a quality assessment using curated protein
families. Bioessays. 2011;33:769–80.
6. Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV. OrthoDB: a
hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids
Res. 2013;41:D358–65.
7. Chen F, Mackey AJ, Stoeckert CJ, Roos DS. OrthoMCL-DB: querying a
comprehensive multi-species collection of ortholog groups. Nucleic Acids
Res. 2006;34:D363–8.
8. Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J, et al.
eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic
Acids Res. 2014;42:D231–9.
9. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al.
The COG database: an updated version includes eukaryotes. BMC
Bioinformatics. 2003;4:41.
10. Simola DF, Wissler L, Donahue G, Waterhouse RM, Helmkampf M, Roux J,
et al. Social insect genomes exhibit dramatic evolution in gene composition
and regulation while preserving regulatory features linked to sociality.
Genome Res. 2013;23:1235–47.
11. Waterhouse RM, Zdobnov EM, Kriventseva EV. Correlating traits of gene
retention, sequence divergence, duplicability and essentiality in vertebrates,
arthropods, and fungi. Genome Biol Evol. 2011;3:75–86.
12. Wapinski I, Pfeffer A, Friedman N, Regev A. Natural history and evolutionary
principles of gene duplication in fungi. Nature. 2007;449:54–U36.
13. Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of ortholog groups for
eukaryotic genomes. Genome Res. 2003;13:2178–89.
14. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment
search tool. J Mol Biol. 1990;215:403–10.
15. van Dongen S. A cluster algorithm for graphs. Amsterdam: CWI (Centre for
Mathematics and Computer Science); 2000.
Emms and Kelly Genome Biology (2015) 16:157
16. Soderlund C, Bomhoff M, Nelson WM. SyMAP v3.4: a turnkey synteny
system with application to plant genomes. Nucleic Acids Res. 2011;39, e68.
17. Jun J, Mandoiu II, Nelson CE. Identification of mammalian orthologs using
local synteny. BMC Genomics. 2009;10:630.
18. Daniels JP, Gull K, Wickstead B. Cell biology of the trypanosome genome.
Microbiol Mol Biol Rev. 2010;74:552–69.
19. www.stevekellylab.com/software/orthofinder.
20. https://github.com/davidemms/OrthoFinder.
21. Kriventseva EV, Rahman N, Espinosa O, Zdobnov EM. OrthoDB: the hierarchical
catalog of eukaryotic orthologs. Nucleic Acids Res. 2008;36:D271–5.
22. O’Brien KP, Remm M, Sonnhammer ELL. Inparanoid: a comprehensive
database of eukaryotic orthologs. Nucleic Acids Res. 2005;33:D476–80.
23. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, et al. TreeFam: a
curated database of phylogenetic trees of animal gene families. Nucleic
Acids Res. 2006;34:D572–80.
24. Kelly S, Maini PK. DendroBLAST: approximate phylogenetic trees in the
absence of multiple sequence alignments. Plos One. 2013;8, e58537.
25. Wall DP, Fraser HB, Hirsh AE. Detecting putative orthologs. Bioinformatics.
2003;19:1710–1.
26. Wolf YI, Koonin EV. A tight link between orthologs and bidirectional best
hits in bacterial and archaeal genomes. Genome Biol Evol. 2012;4:1286–94.
27. Dalquen DA, Dessimoz C. Bidirectional best hits miss many orthologs in
duplication-rich clades such as plants and animals. Genome Biol Evol.
2013;5:1800–6.
28. Goodstein DM, Shu SQ, Howson R, Neupane R, Hayes RD, Fazo J, et al.
Phytozome: a comparative platform for green plant genomics. Nucleic
Acids Res. 2012;40:D1178–86.
29. Freeling M. Bias in plant gene content following different sorts of
duplication: tandem, whole-genome, segmental, or by transposition. Annu
Rev Plant Biol. 2009;60:433–53.
30. Blanc G, Wolfe KH. Functional divergence of duplicated genes formed by
polyploidy during Arabidopsis evolution. Plant Cell. 2004;16:1679–91.
31. Jin J, Zhang H, Kong L, Gao G, Luo J. PlantTFDB 3.0: a portal for the
functional and evolutionary study of plant transcription factors. Nucleic
Acids Res. 2014;42:D1182–7.
32. Pires N, Dolan L. Origin and diversification of basic-helix-loop-helix proteins
in plants. Mol Biol Evol. 2010;27:862–74.
33. Dessimoz C, Gabaldon T, Roos DS, Sonnhammer ELL, Herrero J, Consortium
QO. Toward community standards in the quest for orthologs.
Bioinformatics. 2012;28:900–4.
34. Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, et al.
eggNOG: automated construction and annotation of orthologous groups of
genes. Nucleic Acids Res. 2008;36:D250–4.
35. Jones E, Oliphant T, Peterson P. SciPy: Open source scientific tools for
Python. 2001. Available at: http://www.scipy.org/.
36. International Union of Biochemistry and Molecular Biology, Nomenclature
Committee, Webb EC. Enzyme nomenclature 1992: recommendations of
the Nomenclature Committee of the International Union of Biochemistry
and Molecular Biology on the nomenclature and classification of enzymes.
San Diego: Published for the International Union of Biochemistry and
Molecular Biology by Academic Press; 1992.
37. Katoh K, Standley DM. MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. Mol Biol Evol.
2013;30:772–80.
38. Price MN, Dehal PS, Arkin AP. FastTree 2-approximately maximum-likelihood
trees for large alignments. Plos One. 2010;5, e9490.
39. Wu YC, Rasmussen MD, Bansal MS, Kellis M. Most parsimonious
reconciliation in the presence of gene duplication, loss, and deep
coalescence using labeled coalescent trees. Genome Res. 2014;24:475–86.
Page 14 of 14
Submit your next manuscript to BioMed Central
and take full advantage of:
• Convenient online submission
• Thorough peer review
• No space constraints or color figure charges
• Immediate publication on acceptance
• Inclusion in PubMed, CAS, Scopus and Google Scholar
• Research which is freely available for redistribution
Submit your manuscript at
www.biomedcentral.com/submit