Developing a Web Application for the DisEnT
Disease Enrichment Tool
Ernest Walzel
Master of Science
School of Informatics
University of Edinburgh
2014
Abstract
Disease enrichment analysis is a statistical method used to determine whether a specific
disease trait is overrepresented or underrepresented in a set of genes. Accuracy of this
method relies heavily on coverage of available gene-disease association data.
Problematically, sources of gene-disease associations differ in what data they store
and how they store it. Data from three such sources – GeneRIF, OMIM and Ensembl
Variation – were aggregated and unified, creating a dataset of gene-disease associations
with unprecedented coverage.
We developed DisEnT – a web-based application that enables disease enrichment
analysis on this novel dataset. DisEnT was designed to be a reliable and accessible
scientific tool that can scale with increasing workload and is sustainable for future
expansion.
DisEnT has a modular architecture consisting of a Ruby on Rails application, an
R programming language module and a MySQL database. It was built in line with a
published set of best practices for scientific computing.
The system was evaluated for correctness, usability, scalability and sustainability,
satisfying each of the criteria.
DisEnT is available for access at https://synprot.inf.ed.ac.uk/disent.
Acknowledgements
I would like to thank Dr Ian Simpson, the supervisor of this project, for his exceptional
support and trust. Thank you for all your help, your feedback, your insights and for
giving me the freedom to do what I thought was best for the project. This has been the
most enjoyable experience and I hope to have a chance to collaborate with you in the
future.
I would also like to thank Xin He for his help and his work on the data underlying
this project. It is Xin’s data that makes DisEnT possible.
Finally, my thanks go to my girlfriend Thea Koutsoukis for her around-the-clock,
around-the-globe help and support in keeping this thesis legible and in keeping me
sensible. You are the best.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Ernest Walzel)
To my brothers, Oto and Edo.
Contents
1 Introduction
2 Background
2.1 Studying Sets of Genes
2.1.1 Gene Set Enrichment Analysis
2.2 Enrichment Analysis beyond the Gene Ontology
2.2.1 The Human Disease Ontology
2.2.2 Mining for Disease Annotations
2.3 Current Solutions for Disease Enrichment Analysis
2.4 Introducing DisEnT
2.4.1 Design Goals
3 Developing DisEnT
3.1 System Overview
3.1.1 Web Application
3.1.2 Database
3.1.3 Computation
3.1.4 Job Queue
3.2 Implementation
3.2.1 User Input
3.2.2 Validation
3.2.3 Gene Identification
3.2.4 Mapping to Homologs
3.2.5 Enrichment Analysis
3.2.6 Presenting Results
3.3 Programmatic Interface
3.3.1 Submitting a Query
3.3.2 Retrieving Results
3.4 Development Methodology
4 Evaluation
4.1 Correctness
4.2 Usability
4.2.1 Practical Tasks
4.2.2 User Experience
4.2.3 Evaluation
4.3 Scalability
4.4 Sustainability
4.4.1 Write programs for people, not computers
4.4.2 Let the computer do the work
4.4.3 Make incremental changes
4.4.4 Don’t repeat yourself (or others)
4.4.5 Plan for mistakes
4.4.6 Optimize software only after it works correctly
4.4.7 Document design and purpose, not mechanics
4.4.8 Collaborate
5 Conclusion
A Technical Specifications
A.1 Software Versions
A.2 Supported Species
B User Study
B.1 Participant Answers
B.1.1 Practical Part
B.1.2 Evaluation
B.2 Survey Form
Bibliography
Chapter 1
Introduction
Untangling the genetic basis of disease has always been of paramount interest to researchers worldwide. Billions of dollars and more than a decade of work were invested
in the sequencing of the human genome by the Human Genome Project. One of the
main driving forces behind the project was the promise of bringing new insights into
pathologies of human diseases by annotating the genome sequence with genes and
other areas of interest (Collins, 1998). However, recognizing the genes embedded in
our DNA was only the start of a much longer and much larger effort in understanding
how genes link to diseases.
For a particular subset of diseases, it is possible to pinpoint a mutation in a single
gene as a cause of a disease. These diseases are referred to as monogenic or Mendelian
diseases (Botstein and Risch, 2003), and they include conditions such as Huntington’s
disease and cystic fibrosis. However, while these diseases are undoubtedly important
to study and understand, their incidence is usually low, commonly predicted at around
5% (Chakravarti, 2011).
It is the unfortunate case that the most frequently occurring and often very damaging diseases such as cancer or Alzheimer’s disease have a much more complicated
pathology. These diseases usually occur as a result of a complex interplay between
many genetic and environmental factors. In order to understand such diseases, one
needs to look beyond a single gene.
Genetic etiologies of complex diseases often describe whole sets of genes that can
have a role in development of a disease (Hopkins, 2007). It is also true that a given
gene can be an important factor for multiple diseases. In fact, even a typical monogenic
disease such as sickle cell anaemia – a phenotype caused by a particular single mutation
– often occurs accompanied by a multitude of other diseases connected to the mutation
(Barabási et al., 2011). Thus, as well as looking beyond a single gene, one must look
beyond a single disease.
The work described in this thesis is a resource for studying relationships between
diseases and genes within the context of a disease hierarchy. For a given set of genes
(e.g. genes highlighted by a biological experiment), it can answer questions such as ‘Are
these genes characteristic of a particular type of disease?’ or ‘Which other diseases
should be considered whilst studying this disease?’
The resource developed – a web application called DisEnT – provides answers
to those questions using a statistical method known as enrichment analysis. Enrichment analysis is used to determine whether a specific trait (in this case a disease) is
overrepresented or underrepresented in a set of genes investigated (Tirrell et al., 2010).
The aforementioned disease hierarchy used in DisEnT is called the Human Disease
Ontology (HDO) (Schriml and Arze, 2012). HDO classifies diseases into categories
and thus adds structure to the disease terminology. It is this hierarchical structure
that allows exploring not only the links between diseases and genes, but also how the
diseases themselves are linked.
This thesis describes DisEnT in the context of its application and its implementation. Chapter 2 describes the principles and ideas behind disease enrichment analysis,
its importance and its limitations. This chapter also outlines how DisEnT enables
disease enrichment analysis and how it overcomes some of the technical hurdles associated with the process. DisEnT builds on ideas of other similar systems using the
Human Disease Ontology. This chapter outlines similarities and differences between
these systems and DisEnT and it defines evaluation criteria for DisEnT.
The main focus of Chapter 3 is to describe the implementation of DisEnT. DisEnT
was developed whilst following best practices for scientific computing as suggested
by Wilson et al. (2014). It was built with scalability and robustness in mind, allowing
for future expansion and addition of new features. It comes with extensive test
coverage and a documented Application Programming Interface (API), both of which
are described.
Chapter 4 evaluates DisEnT against its design criteria: correctness, usability, scalability and sustainability. Observations made in this chapter suggest that results produced by DisEnT can be used to infer valuable knowledge supported by evidence. Furthermore, the application is perceived positively by its users based on a user study, it
can cope with increased workload and it was built following best practices for building
scientific software.
Finally, Chapter 5 discusses suitability and limitations of the current version of
DisEnT and suggests other features that could be useful for users of this application.
Chapter 2
Background
2.1
Studying Sets of Genes
Many modern biological experiments use highly parallel assay methods to profile the
biomolecular signal of a sample. Examples include the use of mass-spectrometry to
identify protein constituents or a range of experiments measuring gene expression.
Results from these experiments are typically analysed with advanced statistical methods that produce a list of ‘significant’ genes (Shah et al., 2012). Finding a biological
meaning behind these lists can be a daunting task: information about genes is usually
dispersed across a number of specialised databases that often only contain a subset
of the required information. Moreover, these databases often differ in what data they
store and how they store it, making it difficult to use them effectively for a systematic
analysis.
In order to find a ‘common language’ for the differing databases, a structured dictionary of gene functions called the Gene Ontology (GO) was developed (Ashburner
et al., 2000). The Gene Ontology consists of a hierarchical structure of terms that describe biological functions and processes ordered by their ‘specificity’: terms at the
root of the ontology are more generic, e.g. response to stimulus, and terms deeper in
the hierarchy are more specific, e.g. chronic inflammatory response. Each one of the
terms contains an unambiguous identifier number, such as GO:0002544. Figure 2.1
shows a subset of the GO hierarchy and some of its terms in a diagram.
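A hierarchy of this kind can be represented as a directed acyclic graph in which each term points to its parent terms. The Python sketch below is only an illustration: the term names are taken from the text and from Figure 2.1, but the edge list is a simplified assumption rather than an extract of the real ontology, and `ancestors` is a hypothetical helper name.

```python
# Simplified, assumed fragment of the GO hierarchy: term -> direct parents.
# The real ontology contains many more terms and edges.
IS_A = {
    "chronic inflammatory response": {"inflammatory response"},
    "inflammatory response": {"response to wounding"},
    "response to wounding": {"response to stimulus"},
    "response to stimulus": set(),
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following parent edges."""
    found = set()
    stack = list(is_a.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, ()))
    return found
```

Under these assumed edges, the ancestors of chronic inflammatory response are the three more generic terms, while the root term response to stimulus has none.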
Thanks to its hierarchical structure, GO lends itself to a number of methods of
statistical analysis. A typical question a researcher may ask is whether any of the GO
terms are over-represented or under-represented in the list of genes deemed significant.
One of the methods that can provide an answer to such a question is gene set enrichment
analysis (Zeeberg et al., 2003).
Figure 2.1: Diagram showing a subset of the Gene Ontology (GO) hierarchy. The directed edges (shown as black arrows) show the direction of the is_a relationship (for example,
inflammatory response is a response to wounding). The coloured rectangles at the bottom of the term rectangles mark terms that are also used in specific subsets of GO
called GO slims. The diagram was generated by the QuickGO browser (Binns et al.,
2009).
2.1.1
Gene Set Enrichment Analysis
Gene set enrichment analysis (GSEA) is a computational method commonly used to
determine whether a group of genes shows statistical significance in a given context.
The context itself is interchangeable: it may be a biological function (e.g. a GO term),
chromosomal location or indeed a disease – the general principles are the same (Subramanian, 2005). Thanks to its robust methodology and easily interpretable results,
GSEA has gained a lot of popularity in the scientific community (Tirrell et al., 2010).
Enrichment analysis can be applied in two types of scenarios. Firstly, it can be
used to confirm (or reject) a hypothesis, e.g. whether a group of genes is associated
with a property it is thought to be associated with. Secondly, by analysing a previously
unexplored set of genes, it can potentially discover previously unknown attributes associated with the gene set. Shah et al. (2012) refer to these two settings as hypothesis-driven and hypothesis-generating.
2.1.1.1
Calculating the Enrichment
The basic idea behind GSEA is a comparison of an ‘interesting’ list of genes, or an
input list, to a chosen background list, sometimes referred to as the reference set. The input
list is usually produced as an output of an experiment (as mentioned at the beginning of
this chapter), and it is the primary subject of interest for the analysis. The background
list is used as a reference providing a benchmark for calculations.
The choice of the background list is not always obvious and it often depends on
the origin of the input list. For example, if the input list is obtained from a microarray
chip experiment, only genes whose expression was measured by the chip should be
included in the background list. In other cases, the background list can be constructed
from the genome of the investigated organism, selecting genes that are annotated.
To perform GSEA, both lists are annotated with a given set of terms; in a typical scenario, GO terms are used. The enrichment analysis then compares the number
of genes n in the input list annotated with a particular GO term (e.g. inflammatory
response) with the number of all genes m annotated with that term in the
background list. Based on those counts, and on the sizes of the input and background lists (N and M respectively), the method produces a statistical output. In a simple scenario, the result can be
a probability value (p-value) that indicates how likely it is to observe at least n genes in
the input list by chance (Shah et al., 2012), and it can be used to devise an enrichment
score for a given term in the setting.
Figure 2.2: Calculation of enrichment of GO terms using the hypergeometric distribution.
Figure by Shah et al. (2012).
Depending on the size of the lists, the p-value can be calculated using a number of
statistical methods including a hypergeometric test or Fisher’s exact test. For example,
Drăghici et al. (2003) propose a formula assuming the hypergeometric distribution as
shown in Figure 2.2.
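As a rough illustration of a hypergeometric calculation, the Python sketch below computes the probability of observing at least n annotated genes in the input list. It reuses the notation from the text (n, N, m, M) and assumes that the input list is a subset of the background list; the exact formula shown in Figure 2.2 may differ in detail, and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(n, N, m, M):
    """P(X >= n) when N genes are drawn without replacement from a
    background of M genes, m of which are annotated with the term."""
    if not (0 <= n <= N <= M and 0 <= m <= M):
        raise ValueError("expected 0 <= n <= N <= M and 0 <= m <= M")
    total = comb(M, N)  # number of possible input lists of size N
    # Sum the probabilities of every outcome at least as extreme as n.
    tail = sum(comb(m, k) * comb(M - m, N - k) for k in range(n, min(N, m) + 1))
    return tail / total
```

For example, with a background of 4 genes of which 2 are annotated, drawing both annotated genes in an input list of size 2 has probability 1/6, so `enrichment_p_value(2, 2, 2, 4)` returns roughly 0.167.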
Fisher’s exact test is another popular method for calculating enrichment p-values and, as Shah et al. (2012) note, it is also more suitable for smaller gene lists.
Using the same notation as in Figure 2.2, the formula for Fisher’s exact test would be
$$p = \frac{\binom{N}{n}\binom{M}{m}}{\binom{N+M}{n+m}} \qquad (2.1)$$
2.1.1.2
Incorporating Hierarchy
Although useful, the approach of enrichment analysis based purely on gene counts
associated with GO terms has a limitation that can hinder its interpretability. As Alexa
et al. (2006) demonstrated, because of GO’s hierarchical structure, there is a strong
correlation between terms in close hierarchical proximity. As a result, if a specific
term (e.g. chronic inflammatory response) is marked as strongly enriched, there is a
high probability that its more generic ancestor terms (e.g. response to stimulus) will also
be marked as enriched. Figure 2.3 illustrates this bias pattern in an example.
Figure 2.3: A subgraph of the GO hierarchy with its enriched terms highlighted. The legend
at the bottom of the figure describes the associated p-values.
Figure by Alexa et al. (2006).
Figure 2.4: Scoring of GO terms after applying the elim algorithm (left) and the weight
algorithm (right). Enrichment of more specific terms is higher and the bias caused by
the hierarchical proximity between terms is reduced. Figure by Alexa et al. (2006).
To address this problem, Alexa et al. (2006) proposed two scoring algorithms that
make use of the hierarchical structure of the ontology. One of their proposed algorithms named elim works by iterating over the enriched terms starting from the bottom
of the hierarchy progressing towards the top. Starting with the most specific terms, it
calculates their enrichment scores and then eliminates genes incorporated in that enrichment from all of the ancestral terms. This leads to much lower enrichment scores
for terms that are not enriched by previously ‘unseen’ genes.
The elimination process in elim can be considered as assigning weights of either
0 or 1 to genes in the terms. Alexa et al. present a generalisation of this approach in
their second algorithm named weight that assigns weights from the interval [0, 1] to the
genes instead of eliminating them completely.
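The elimination idea can be sketched in a few lines of Python. This is a simplified illustration and not the implementation published by Alexa et al.: the significance threshold of 0.01, the hypergeometric scoring function and all names (`tail_p`, `elim`, `term_genes`, `parents`) are assumptions made for the example. Each term is assumed to list all genes annotated to it or to any of its descendants.

```python
from math import comb


def tail_p(n, N, m, M):
    """Hypergeometric P(X >= n): N draws from M genes, m of them annotated."""
    hits = sum(comb(m, k) * comb(M - m, N - k) for k in range(n, min(N, m) + 1))
    return hits / comb(M, N)


def elim(term_genes, parents, input_genes, background_size, threshold=0.01):
    """Score terms bottom-up; genes of significant terms are removed
    from every ancestor before that ancestor is scored."""
    depth_cache = {}

    def depth(term):
        # Longest distance from a root, so children are deeper than parents.
        if term not in depth_cache:
            ps = parents.get(term, ())
            depth_cache[term] = 1 + max(map(depth, ps)) if ps else 0
        return depth_cache[term]

    def ancestors(term):
        found, stack = set(), list(parents.get(term, ()))
        while stack:
            p = stack.pop()
            if p not in found:
                found.add(p)
                stack.extend(parents.get(p, ()))
        return found

    removed = {}  # term -> genes eliminated by significant descendants
    p_values = {}
    for term in sorted(term_genes, key=depth, reverse=True):
        genes = term_genes[term] - removed.get(term, set())
        n = len(genes & input_genes)
        p = tail_p(n, len(input_genes), len(genes), background_size)
        p_values[term] = p
        if p < threshold:
            for anc in ancestors(term):
                removed.setdefault(anc, set()).update(genes)
    return p_values
```

In a toy hierarchy where a specific term accounts for all of the input genes, its parent would look enriched under plain counting but receives a p-value of 1 once the shared genes are eliminated.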
As shown in Figure 2.4, both algorithms lead to more interpretable results.
Both are commonly used for enrichment analysis with GO terms
and, as Section 3.2.5 explains, DisEnT also makes use of the elim algorithm. However,
DisEnT does so with an ontology other than GO.
2.2
Enrichment Analysis beyond the Gene Ontology
While the Gene Ontology is a commonly-used reference for enrichment analysis, the
method, as described by Subramanian (2005), can be easily generalised and extended
onto other ontologies. As well as asking which biological functions are shared by a set
of genes, one can ask which disease is over-represented or under-represented in a gene
set.
Performing enrichment analysis using an ontology of diseases may lead to interesting discoveries in disease pathologies. For example, by combining a disease ontology
with several protein-labelling datasets, Mort et al. (2010) discovered that a particular
class of blood coagulation diseases is associated with a previously unknown structural
change in a group of proteins.
As well as new insights into pathology of diseases, enrichment analysis can help to
improve healthcare by predicting the needs of patients suffering from a particular disease. For instance, several comorbidities (co-occurring disorders) have been identified
in cohorts of patients suffering from rheumatoid arthritis using a technique similar to
enrichment analysis (Petri, 2010). Using this knowledge, we can help target and develop more effective drug treatments and improve the quality of life of these patients
(Michaud and Wolfe, 2007).
Finally, Machado et al. (2013) have recently investigated the possibility of using
enrichment analysis as a first step in disease prognosis. While their findings still need
to be verified by clinical data, their preliminary results are promising in characterising
a disease called hypertrophic cardiomyopathy based on genetic data from a group of
patients.
2.2.1
The Human Disease Ontology
The main purpose of using the Gene Ontology in enrichment analysis is to provide a
common dictionary of biological functions and processes. In the realm of diseases,
this role is fulfilled by the Human Disease Ontology (Schriml and Arze, 2012).
Like the Gene Ontology, the Human Disease Ontology (HDO) is a hierarchy of diseases, ranging from generic terms such as disease to more specific
terms like Huntington’s disease. Figure 2.5 shows a diagram of a subset of the ontology. As with GO, the hierarchy provides a unifying structure for
classifying diseases as well as a platform for statistical analysis.
2.2.2
Mining for Disease Annotations
A common dictionary of diseases is essential for disease-based enrichment analysis.
However, in order to be able to perform the analysis, one needs to annotate genes with
terms from the chosen dictionary. While there are many well-established resources
available that map genes to GO terms (The UniProt Consortium, 2014; Maglott et al.,
Figure 2.5: Subset of the Human Disease Ontology hierarchy. A ‘specific’ disease
(Huntington’s disease) is shown in a red rectangle, while its ‘parent’ disease terms are
shown in green rectangles. The orange circles represent the number of nodes hidden from
the visualisation.
Figure obtained from the Disease Ontology database (Schriml and Arze, 2012).
2005), there is a lack of disease-annotation resources mapping to the Human Disease
Ontology (Osborne et al., 2009; LePendu et al., 2011; Peng et al., 2013). One of the
ways to obtain the HDO annotations is to retrieve gene-disease associations from other
sources and map them onto the HDO using natural language processing (NLP) tools.
The main resources for human gene-disease associations are the Online Mendelian
Inheritance in Man (OMIM) database (Hamosh et al., 2005), the Gene Reference into
Function (GeneRIF) database (NCBI, 2014) and the Ensembl Variation database (Chen
et al., 2010). While useful in their own right, each of these databases differs in the
kind of information they store and how they store it. This means that researchers
interested in using gene-disease annotation data must often consider a combination of
these sources in order to establish a comprehensive overview of the data (Peng et al.,
2013). However, retrieval, parsing and integration of data from these sources is not
straightforward.
The OMIM database is a manually-curated database of diseases and their genetic
causes. The information recorded here comes from published literature and is reviewed
and entered by a team of expert curators. This means that data in this database is
usually of high quality, but the manual curation also introduces a considerable delay
into the process of updating the database. Moreover, the annotations are added with
a high level of detailed description, but the resource is almost entirely free-text based,
making it difficult to retrieve information from it programmatically (Mailman et al.,
2007; Osborne et al., 2009). As a result, OMIM is a valuable source of information
that is challenging to access using automated methods.
GeneRIF offers a different approach to gene annotation. It provides a simple mechanism for its users to add new annotations to genes based on scientific literature. The
annotations are brief – up to 255 characters in length – and they typically describe the
gene’s function or a role in a disease (see Figure 2.6 for an example). Osborne et al.
(2007) describe this wiki-like process as successful and in a later paper (Osborne et al.,
2009) they show that information stored here can be easily used for data-mining purposes. However, even though the data here is presented in a clearer and more accessible
manner, it still has to be mapped onto the Human Disease Ontology.
Finally, Ensembl Variation provides a completely automated system of annotating genes with phenotypic information. It mines multiple databases to aggregate trait
data (e.g. diseases) caused by a specific variation (e.g. mutation) in a gene. This automated data aggregation makes it possible to update the Ensembl Variation database
very rapidly. However, the lack of manual curation can make the data prone to quality
issues (Ghisalberti et al., 2010). One of the biggest advantages of Ensembl Variation for the purposes of enrichment analysis is that the data is available in an easily accessible relational database (Chen et al., 2010). As in the case of GeneRIF, however,
this data still needs to be mapped onto an ontology.
2.3
Current Solutions for Disease Enrichment Analysis
Several approaches have been developed to map disease descriptions onto the Human
Disease Ontology (Aronson, 2001; Osborne et al., 2009; LePendu et al., 2011; Peng
et al., 2013). Although the mapping process is not a central topic of this project, the
mapped disease annotation data is the key resource of the DisEnT project.
One of the earliest attempts to systematically annotate human genes with disease
terms was published by Osborne et al. (2009). Authors of this study used the MetaMap
Transfer (MMTx) tool (Aronson, 2001), a software package that analyses biomedical
texts using NLP techniques and maps their contents onto terms from a pre-defined
dictionary.
The goal of this study was to use MMTx to ‘translate’ GeneRIF entries into HDO
terms. The ontology’s hierarchical structure was used to improve the accuracy of the mapping (e.g. by avoiding mapping to both a class and a subclass of a disease). Establishing
the mapping between a GeneRIF entry and an HDO term thus formed a link between the HDO term
and the gene. An example of this process is shown in Figure 2.6.
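The rule of not mapping a text to both a class and its subclass can be illustrated with a small Python sketch. It does not reproduce MMTx or the pipeline of Osborne et al.; it only shows one way to keep the most specific matches once candidate terms have been found. The function name and the three-term hierarchy are placeholders invented for the example, not actual HDO content.

```python
def most_specific(matched_terms, parents):
    """Drop every matched term that is an ancestor of another matched term."""

    def ancestors(term):
        found, stack = set(), list(parents.get(term, ()))
        while stack:
            p = stack.pop()
            if p not in found:
                found.add(p)
                stack.extend(parents.get(p, ()))
        return found

    redundant = set()
    for term in matched_terms:
        redundant |= ancestors(term)
    return set(matched_terms) - redundant


# Placeholder hierarchy (assumed for illustration): term -> direct parents.
PARENTS = {
    "malignant neoplasm of breast": {"cancer"},
    "cancer": {"disease"},
    "disease": set(),
}
```

If a text matched both cancer and malignant neoplasm of breast, only the more specific term would be kept under this rule.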
This study demonstrated that mapping GeneRIF entries onto HDO terms is viable using NLP techniques. Although the proposed methodology was not as effective
for other types of sources, such as the longer free-text entries in OMIM, this paper
provided a foundation for later similar endeavours, including the Disease and Gene
Annotations database (DGA).
DGA (Peng et al., 2013) is a largely automated database system that combines
GeneRIF entries with molecular interactions networks, offering an ‘integrated systems
approach’ to gene-disease associations. While not directly providing enrichment analysis functionality, DGA provides a good example of how combining multiple sources
of information can lead to a resource offering much more contextual information. As
will become clear later, the idea of combining multiple sources of information is also
central to DisEnT.
Figure 2.6: Inference from a GeneRIF suggests that the gene TGF-beta1 is associated
with the Malignant neoplasm of breast DO term. Figure by Osborne et al. (2009).
One of the earliest studies that enabled enrichment analysis with the Human Disease
Ontology was published by LePendu et al. (2011). This study largely built on previous
work by Tirrell et al. (2010) and their methodology entitled RANSUM (Rich Annotation Summarizer). RANSUM described a workflow that enabled its users to perform
enrichment analysis on any given ontology. Gene annotations with the ontology terms
had to be either provided by the user or derived automatically using a
mapping tool; however, the accuracy of the automatic mapping is not clear from that study.
LePendu et al. developed RANSUM further, into a workflow leveraging the already-existing database of Gene Ontology annotations. More specifically, the key data sources
for the mapping were PubMed¹ identifiers – links to articles providing evidence for
some of the GO annotations.
Using the PubMed articles provided, a mapping tool similar to MMTx named Open
Biomedical Annotator (Jonquet et al., 2009) was used to infer HDO terms from the
articles, ultimately linking HDO terms to genes. These newly-established annotation
links allowed the authors to carry out enrichment analysis using the acquired data as the
background set. The workflow describing the process from mapping to the enrichment
analysis is shown in a diagram in Figure 2.7.
¹ http://www.ncbi.nlm.nih.gov/pubmed
Figure 2.7: Workflow for mapping disease association data as developed by LePendu
et al. (2011). The ‘NCBO Annotator service’ component in the figure refers to the Open
Biomedical Annotator. Figure by LePendu et al. (2011).
2.4
Introducing DisEnT
DisEnT – the Disease Enrichment Tool – further develops the solutions described in
the previous section. The first objective of DisEnT was to improve coverage of gene-disease association data mapped onto the Human Disease Ontology. DisEnT’s data was
collated from the three different sources described in Section 2.2.2: OMIM, GeneRIF
and Ensembl Variation. Data from these sources was retrieved and mapped onto HDO
by He and Simpson (2014) using methodologies similar to those described by Osborne
et al. and LePendu et al. These workflows utilised both of the aforementioned mapping tools: MMTx and the Open Biomedical Annotator (OBA). Thanks to the wide range of sources and mapping tools,
we believe that the DisEnT database provides unprecedented coverage of gene-disease
annotations.
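To make the idea of collating several sources concrete, the Python sketch below merges gene-disease pairs from named sources and records which sources support each pair. It is a hypothetical illustration only: it does not describe the workflow of He and Simpson (2014), it assumes every source has already been mapped onto HDO terms, and the function name and any identifiers passed to it are placeholders.

```python
from collections import defaultdict


def merge_annotations(sources):
    """Combine annotations from several sources.

    `sources` maps a source name to an iterable of (gene, disease_term) pairs.
    The result maps each distinct pair to the set of sources that report it.
    """
    merged = defaultdict(set)
    for name, pairs in sources.items():
        for gene, term in pairs:
            merged[(gene, term)].add(name)
    return dict(merged)
```

The number of keys in the merged dictionary is the size of the union of all sources, which is the sense in which combining sources increases coverage; pairs reported by several sources keep a record of each of them.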
The main aim of this project is to enable statistically robust gene set enrichment
analysis through an accessible web application. The studies described in Section 2.3
all propose useful methodologies for collecting the data, but apart from the Disease
and Gene Annotations database (DGA), none of the studies offers a reliable resource
for using this data for enrichment analysis. (While DGA does provide an interface
to retrieve gene-disease associations, it does not offer any statistical analysis tools.)
In order to achieve this objective, we have developed a web-based application
enabling gene set disease enrichment analysis using the DisEnT dataset.
In addition to enabling enrichment analysis, the data in DisEnT can be used for
other purposes including network analysis and visualisation as introduced in Section 2.3. For example, one could examine a network of genes shared between diseases
or diseases shared between genes. Although this functionality was not implemented as
part of this project, we demonstrate feasibility of this use case with DisEnT’s data in
Section 4.1.
2.4.1
Design Goals
The DisEnT application was designed to be intuitive and reliable, so that it can be used
by a wide range of users in the scientific community. DisEnT communicates with its
users by standard web protocols and it is supported by a mature statistical software infrastructure. The technical details of DisEnT are described in Chapter 3,
while the following paragraphs introduce the reader to our overall design goals.
To ensure that it can be used effectively, DisEnT was developed with a focus on four main criteria: correctness, usability, scalability and sustainability.
Correctness is one of the most important aspects of any scientific software (Smith,
2006; Kelly and Sanders, 2008). Just like any other scientific tool, scientific software needs to be inspected and validated to ensure its output is correct (Wilson et al.,
2014). This notion of quality assurance is particularly critical for software that is being actively developed, as it often changes its behaviour. DisEnT was built using a
‘test-before-code’ methodology, resulting in an extensive coverage of software tests
(see Section 4.4.5). In addition to the automated tests, DisEnT was evaluated manually
in a realistic use case scenario (see Section 4).
The usability criterion requires that researchers be able to use DisEnT effectively and easily. To achieve this objective, DisEnT was developed using standard
Web technologies and libraries, ensuring compatibility with as many internet browsing
platforms as possible. Usability was also the main goal for designing the user interface. As Section 3.2.1 describes, DisEnT’s front-end was designed to be as simple and
intuitive as possible. Rather than a myriad of options, users are presented with simple
input fields and usable defaults. Advanced users can, however, tweak the application’s
behaviour by changing the defaults or by using DisEnT’s programmatic interface.
DisEnT can parse three types of widely-accepted gene identifiers and recognise
genes of 23 species. It aims not to impose unnecessary constraints on the user input,
allowing the user to concentrate on the analysis task rather than on converting or formatting their input. The application aims to infer as much as possible about the input
without requesting additional information from the user, as long as the inference can
be done unambiguously.
The system is also required to scale in order to handle concurrent workloads. DisEnT offers an Application Programming Interface (API), opening it for programmatic access,
which often puts a heavy load on similar computing systems. For this reason, DisEnT implements a job queue where each of the requests is stored and handled. The job
queue solution prevents DisEnT from overloading its hosting system while remaining
responsive to all of its users.
Finally, DisEnT was designed to be sustainable. It is an application developed
following good practices and recommendations for scientific software development as
outlined by Wilson et al. (2014). While the DisEnT application currently does not
provide additional features such as network visualisation, the test coverage and code
base were implemented with robustness in mind, making DisEnT suitable for future
extension and improvement.
Chapter 3
Developing DisEnT
This chapter describes the DisEnT application from both the system design point of view and the user experience perspective. First, DisEnT’s high-level design decisions
are explained and justified. Then, the system is described in more detail by following
a typical use-case example.
3.1 System Overview
DisEnT consists of four main components: a web application, a database, a computation
module and a job queueing system. Figure 3.1 shows a high-level overview of the
system architecture.
The core component of DisEnT is the web application. It communicates with the
user, accepts their input and formulates output. This module is the only component
visible to the user and was therefore designed with a strong focus on user experience.
The web application component also implements most of the system’s domain logic, but
it delegates tasks such as statistical computation or job queuing to the more specialised
components described below.
The other three components provide the infrastructure necessary for DisEnT’s effective operation. The job queue component controls scheduling of computationally
intensive tasks in the background, the database module is responsible for persisting
user data and results, while the computational module carries out most of DisEnT’s
calculations.
This modular approach allows DisEnT to be more scalable and robust. For example, if the computational module had to be changed or moved to a more
powerful host (e.g. in order to have more computational resources), this change would
only require a slight configuration change in the web application module.
[Figure 3.1 diagram: Web Access; Web Application (Ruby on Rails); Job Queue (Resque); Database (MySQL); Computation (R); ‘Access perimeter’ boundary]
Figure 3.1: A high-level diagram of DisEnT’s architecture showing DisEnT’s main components. The dotted line represents the ‘access perimeter’ of the system. Users of
DisEnT directly interact only with the Web Application component.
Another advantage of this set-up is that, thanks to the clear division of responsibilities, each of the modules can concentrate on a specialised set of
tasks, which results in a ‘slimmer’ and ultimately more maintainable code base.
The following sections describe each of these components from a technical perspective, offering some insight into why these particular technologies and platforms
were chosen to develop DisEnT. (Note: versions of the software packages used in
DisEnT are listed in Table A.1 of Appendix A.)
3.1.1 Web Application
The web application component forms most of DisEnT’s code base. It was developed in the Ruby programming language (Matsumoto and Ishituka, 2002), using a
development framework called Ruby on Rails¹. The Ruby on Rails (‘Rails’ for short)
framework is a popular set of libraries designed for rapid development of web-based
applications.
To aid rapid development, Rails adopts the principle of convention over configuration (Miller, 2009). This means that Rails follows several conventions that are
assumed throughout the application, so developers do not have to configure them
explicitly. For example, the framework expects certain files to be stored in specific
directories and enforces standard naming conventions. This philosophy may appear restrictive at first, but its purpose is to reduce the amount of code needed to write a
functioning application. Writing less code leads to better readability without having to
write extensive documentation explaining the code.
¹ http://rubyonrails.org/
Rails also follows a popular design pattern known as model-view-controller, or
MVC (Krasner and Pope, 1988). MVC helps to maintain a clean code base by dividing
the application’s components into three separate groups:
• Models are responsible for maintaining most of the domain logic. They do
most of the ‘heavy lifting’ and are usually the most complex components.
• Views are designed purely for the purpose of presentation and they do not perform any computations containing domain logic. Views are often implemented
as templates with placeholders for dynamic values.
• Controllers serve as an interface between user input, the application and user
output. They are typically responsible for passing messages between models and
views. Similarly to views, controllers should generally contain as little domain
logic as possible.
Figure 3.2 shows how the MVC pattern is implemented in Ruby on Rails. Typically, a request from a user (e.g. a form submission) is redirected to an appropriate
controller that triggers an action involving a model. The model carries out the necessary operation (e.g. database update) and returns results to the controller. Once the
controller obtains the results, these are passed to a view component which renders a
web page populated with the requested data.
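The request cycle described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (DisEnT itself is a Rails application), and every name in it (QueryModel, render_view, QueryController) is hypothetical rather than taken from DisEnT’s code base or from the Rails API.

```python
# Minimal MVC sketch with hypothetical names; not DisEnT or Rails code.


class QueryModel:
    """Model: holds the domain logic and talks to the (here in-memory) database."""

    def __init__(self):
        self._rows = {}      # stands in for a database table
        self._next_id = 1

    def create(self, genes):
        # Domain logic lives in the model: de-duplicate and store the genes.
        record = {"id": self._next_id, "genes": sorted(set(genes))}
        self._rows[record["id"]] = record
        self._next_id += 1
        return record


def render_view(record):
    """View: a template with placeholders and no domain logic."""
    return "Query #{id}: {n} unique genes".format(
        id=record["id"], n=len(record["genes"]))


class QueryController:
    """Controller: passes messages between the request, the model and the view."""

    def __init__(self, model):
        self.model = model

    def create(self, params):
        record = self.model.create(params["genes"])  # controller triggers a model action
        return render_view(record)                   # results are handed to the view


controller = QueryController(QueryModel())
page = controller.create({"genes": ["TP53", "BRCA1", "TP53"]})
```

The controller contains no domain logic of its own, which mirrors the division of responsibilities listed above.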
Other advantages of using Rails include a preinstalled development server for testing changes and a basic unit testing suite already included in the framework. Thanks
to these features, Ruby on Rails provides a stable platform for fast development of
reliable web applications.
A Rails application deployed into a ‘production’ environment is typically hosted
on the popular Apache² server platform using a specialised module called Phusion
Passenger³. Phusion Passenger is an Apache server module specialised in hosting web
applications written in interpreted languages such as Ruby. While the Rails code is
written in Ruby, Phusion Passenger is written in a compiled language (C++), allowing
it to execute faster and make better use of the server host’s memory.
² http://httpd.apache.org/
[Figure 3.2 diagram: Request, Controller, Model, View and Database, with steps numbered 1 to 4]
Figure 3.2: Schematic drawing of request handling by a system using the MVC pattern.
Figure inspired by Ruby et al. (2013).
Another useful Rails feature for production use is the asset pipeline. The asset
pipeline is a Rails framework used to serve the web application’s assets (e.g. images,
script files) more efficiently. In a usual scenario, when a browser accesses a website
it needs to retrieve each one of these files separately, resulting in several separate calls
to the server. Rails addresses this problem by pre-packaging and compressing most
of its assets so that they are delivered in a single file and a single call. This feature is
particularly useful for clients with unreliable Internet connections.
3.1.1.1 Alternative Solutions
While Ruby on Rails is a popular and well-established framework for creating web
applications, other similar solutions were also investigated during DisEnT’s design
stage.
One of the alternatives considered was RStudio’s Shiny platform (RStudio,
2014). Shiny enables users of the computational language R (Ihaka and Gentleman,
1996) to easily create web applications using native R code. This makes it
very easy for Shiny applications to have direct access to R’s powerful computational
libraries, and the platform comes prepackaged with powerful visualisation and data
interaction tools.
³ https://www.phusionpassenger.com/
Although Shiny is a great tool for presenting data, it is predominantly a computational solution, not a platform for building fully-fledged web applications. For example, the web site and its computational counterpart both have to be hosted on the same
physical host. In Shiny’s basic version (the Open Source Edition), only one process
is allowed to run on the host, which hinders scalability of this set-up. While the Professional Edition of Shiny allows users to run multiple computational processes, these
still have to be bound to a single physical host.
Other limitations of Shiny include a lack of support for building interfaces for
programmatic access and close coupling with Shiny’s own web hosting technology –
Shiny applications cannot be hosted on other hosting platforms such as Apache.
In its current set-up, DisEnT has a connection to R (described in Section 3.1.3)
that is customizable and not limited by the number of processes or by a particular
hosting technology. Although the downside of this approach is that Shiny’s data interaction features have to be implemented using other technologies, the current architecture arguably provides a more robust and scalable foundation for DisEnT’s future
adjustments.
3.1.2 Database
DisEnT’s data is stored in a standard relational database. A combination of two
database platforms is used: SQLite⁴ for development purposes and MySQL⁵
in the production environment. SQLite stores its data in specially-formatted files and
is a suitable choice for development because it does not require a running server process in
order to provide access to the data. MySQL, on the other hand, provides more features and better scalability, but at the price of having to configure and run a server
process before its databases can be used.
As shown in Figure 3.1, both the Rails web application and the computation component communicate with the database over their own separate channels. This approach is useful in the event of the database becoming a bottleneck in the system. In
such a scenario, the MySQL database can be easily mirrored into multiple synchronised
instances which can be queried instead of the main instance (Schwartz et al., 2012).
⁴ http://www.sqlite.org/
⁵ http://www.mysql.com/
While there are other database systems available that could fulfil the role of persisting data, DisEnT uses MySQL and SQLite due to their suitable feature sets,
widespread use and good community support. Both of these technologies are also popular in the field of bioinformatics; the MySQL database in particular had already been
set up as part of pre-existing infrastructure for DisEnT.
3.1.3 Computation
The computation module comprises an instance of the R interpreter capable of accepting commands over a standard TCP/IP (Transmission Control Protocol/Internet Protocol) channel.
The R programming language is a well-established platform for statistical computation in the field of bioinformatics and computational biology. Utilisation of R allows
DisEnT to make use of some of the best peer-reviewed computation packages for enrichment analysis.
Communication over the TCP/IP channel is not available in R by default. To establish this link between the computation module and the web application module,
DisEnT makes use of an R package called RServe (Hornik and Leisch, 2003), which
opens up a TCP/IP interface (socket) bound to an R session accepting commands over
this protocol.
Although the computation of disease enrichment analysis could be performed on
any platform capable of computing statistics and connecting to a database, R has been
the standard tool of choice for this purpose. There is a wealth of resources available
for bioinformatics computation, such as Bioconductor (Gentleman et al., 2004), which
provides a central repository for many packages used in DisEnT. Choice of R thus
enables DisEnT to make use of a mature statistical computation infrastructure that
would otherwise have to be re-implemented using other tools.
3.1.4 Job Queue
DisEnT also implements a job queue component that enables the system to remain responsive under heavy load by executing long-running and computationally expensive
tasks asynchronously (i.e. in the background). This process is managed by a Ruby
library called Resque⁶ using a high-performance database system called Redis⁷.
The system is implemented in the following way: The job queue stores all pending
jobs and their parameters in the Redis database. Resque runs as a number of processes
known as workers. Resque workers periodically query (poll) the Redis database, looking for jobs that need to be executed. Once they find a ‘pending’ job, they load its
instructions and execute it, effectively removing it from the queue.
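The polling mechanism can be sketched as follows. This is a simplified, in-memory illustration in Python and not Resque itself: the JobQueue class merely stands in for the Redis-backed queue, and all names are assumptions made for the example.

```python
import collections
import threading
import time


class JobQueue:
    """In-memory stand-in for the Redis-backed list of pending jobs."""

    def __init__(self):
        self._pending = collections.deque()
        self._lock = threading.Lock()

    def enqueue(self, func, *args):
        with self._lock:
            self._pending.append((func, args))

    def reserve(self):
        """Remove and return the oldest pending job, or None if the queue is empty."""
        with self._lock:
            return self._pending.popleft() if self._pending else None


def worker(queue, results, stop, poll_interval=0.01):
    """Poll the queue periodically and execute any job that is found."""
    while not stop.is_set():
        job = queue.reserve()
        if job is None:
            time.sleep(poll_interval)   # nothing pending: wait, then poll again
            continue
        func, args = job
        results.append(func(*args))     # list.append is thread-safe in CPython


def run_demo():
    queue, results, stop = JobQueue(), [], threading.Event()
    workers = [threading.Thread(target=worker, args=(queue, results, stop))
               for _ in range(2)]
    for w in workers:
        w.start()
    for n in range(5):
        queue.enqueue(pow, n, 2)        # hypothetical stand-in for a long-running job
    while len(results) < 5:             # a web request would return instead of waiting
        time.sleep(0.01)
    stop.set()
    for w in workers:
        w.join()
    return sorted(results)
```

Because a job is removed from the queue at the moment a worker reserves it, no two workers in this sketch execute the same job.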
Resque is a feature-rich solution for managing asynchronous tasks, but it comes
at the price of a dependency on a constantly-running Redis process. More lightweight
alternatives for managing job queues for Rails applications include Rails runner and
the Delayed Job⁸ (DJ) package.
Rails runner is a Ruby script included in the default Rails set-up. The script is
capable of executing custom Rails code from the command-line interface, making this
tool predominantly useful for running singular tasks such as database updates. While
this tool can be used as a simple platform for asynchronous job management, this use
case is generally not recommended. The Rails runner loads all of the Rails components
into memory every time it starts, which is a wasteful approach, especially if multiple
short tasks need to be executed at the same time.
Delayed Job offers similar functionality to the Rails runner, but it also implements
a job queue system. The job queue implemented by Delayed Job is stored in the same
database that Rails uses to store its application’s data. This has two consequences: first
of all, there are no external dependencies for this system as Delayed Job makes use
of the already-existing infrastructure. However, the disadvantage of this approach is
that when the number of jobs is high, Delayed Job can potentially block access to
the database for other parts of the application. Moreover, like Rails runner, Delayed
Job also loads all of the Rails code into memory upon execution, which can lead to
unnecessary delays and exhaustion of the host system’s resources (Fernandez et al.,
2011).
Resque, on the other hand, provides a solution that does not depend on the Rails
database infrastructure but still allows execution of code in the Rails environment.
While Resque workers also have to load the Rails environment into memory, this
operation is only done once and the memory use is limited by the number of running
workers, resulting in faster and more effective execution.
⁶ https://github.com/resque/resque
⁷ http://redis.io/
⁸ https://github.com/collectiveidea/delayed_job
[Figure 3.3 flowchart: User submits data; Valid? (Yes/No); Gene Identification; Mapping to Homologs; Enrichment Analysis; Data is presented; region labelled ‘Asynchronous Execution’]
Figure 3.3: Overview of a typical DisEnT workflow as explained in Section 3.2. The
dashed lines delimit part of the process that is executed in the background.
3.2 Implementation
In order to describe DisEnT’s implementation, this section follows a typical scenario
of a DisEnT use case. This scenario is first presented in full and then step-by-step in
more detail, presenting the capabilities and limitations of DisEnT.
1. A user enters their data (e.g. gene lists) into a simple web form and submits it
for processing.
2. The web application validates the input data. If the data is valid, a new Resque
job is created and added to the job queue. If the data is deemed not valid, the
user is presented with a meaningful error explaining which parts of their query
need to be adjusted.
3. Once the job has been identified by a Resque worker, genes from both of the
gene lists are first identified, and any non-human genes are ‘converted’ to human
genes in a process called homolog mapping.
4. After the gene identification and mapping, the gene data is passed to the computation module, which carries out the enrichment analysis.
5. The results are presented to the user.
Figure 3.3 outlines these steps in a diagram.
Figure 3.4: The DisEnT web interface showing the input form for a new query.
3.2.1 User Input
Following the usability criterion, DisEnT’s user interface was designed to be simple
and user-friendly. Users can enter their data into DisEnT using a simple web form that
was designed to be minimalistic and usable with its pre-set defaults. A preview of this
form is shown in Figure 3.4.
The form contains the following fields:
• The Input list field expects a list of ‘interesting genes’ that will form the central
target of the analysis (see Section 2.1.1 for further explanation). This list can be
submitted either manually via a free-form textbox or as a file.
• In terms of data entry, the Background list field is almost identical to the Input
list field. In addition to entering the list manually or via a file upload, this field
allows users to select a pre-compiled background list for their analysis.
• The Species field allows the use of gene symbols in gene lists by explicitly stating the species of the entered genes. For lists containing only global gene identifiers, this
field does not need to be specified. Section 3.2.3 explains the reason behind this
field in more detail.
• Finally, the Sources field allows users to choose which sources of gene-disease
annotations are used in their query (Section 2.2.2 compares these sources).
By default, all of the sources are used, but users may choose different combinations.
3.2.1.1 User-friendly Input Handling
The form offers a set of useful pre-set values, allowing the user to perform a query
by simply providing a list of input genes and clicking the Submit button. By default,
DisEnT will use all annotated human genes stored in its database tables as the background list – an equivalent
of using all genes in the human genome. Any genes identified as non-human will
also be automatically mapped to their human homologs (Section 3.2.4 describes this
process).
The application was designed to minimize restrictions on the user input. DisEnT can parse input and background gene lists separated by commas and/or by any type
of whitespace (including tabs and line breaks, and any combination thereof). Gene
names that consist of multiple words are accepted if they are enclosed in double or
single quotes. This enables the user to paste the list contents from manuscripts, spreadsheet software, comma-separated-value (CSV) files and other common sources. As
well as entering the lists manually, users can opt to upload them as text-based or CSV
files.
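The parsing rules above can be sketched with a short tokenizer. This is an illustrative Python sketch, not DisEnT’s actual parser; the function name parse_gene_list and the regular expression are assumptions made for the example.

```python
import re

# A token is a double-quoted name, a single-quoted name, or a run of
# characters containing no whitespace, commas or quotes.
_TOKEN = re.compile(r'''"([^"]*)"|'([^']*)'|([^\s,"']+)''')


def parse_gene_list(text):
    """Split free-form text into gene names.

    Separators are commas and any whitespace; multi-word names must be
    enclosed in matching double or single quotes.
    """
    # Whatever is left after removing all tokens must be separators only;
    # a stray quote character means that a quote was never closed.
    leftover = _TOKEN.sub("", text)
    if leftover.strip(", \t\r\n\f\v"):
        raise ValueError("gene list could not be parsed (unmatched quote?)")
    genes = []
    for double_quoted, single_quoted, bare in _TOKEN.findall(text):
        name = (double_quoted or single_quoted or bare).strip()
        if name:                      # ignore empty quoted strings
            genes.append(name)
    return genes


genes = parse_gene_list('TP53, BRCA1\t7157\n"tumor protein p53" \'Trp53\'')
```

One limitation of this sketch is that unquoted names containing an apostrophe are rejected as unmatched quotes.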
As Section 3.2.3 describes, DisEnT can recognize three major formats of gene
identifiers. The identifiers can be mixed within any of the gene lists, allowing the
user to concentrate on the analysis without having to convert between the formats.
To further improve the user experience, DisEnT provides explanations of some of
the form fields in the form of ‘popovers’ – tooltips that can be triggered by the user to
reveal their content. An example of a popover is shown in Figure 3.5.
3.2.2 Validation
Before the user data is submitted for processing, DisEnT checks its validity. The validation step is in place to improve the quality of the results as well as the user experience. If the data is invalid, DisEnT will report this fact to the user immediately, rather
than attempt to use it and potentially produce low-quality results. In its current version,
DisEnT performs 12 validation checks, including the following:
• Have both the input list and the background list been provided or selected?
Figure 3.5: A popover assisting the user in submitting their gene list.
Figure 3.6: An error message informing user about an input validation failure.
• Can the gene lists be parsed? I.e. can the provided lists be broken down into
gene names? This includes checks for unmatched quotes in the gene names.
• Can each one of the submitted genes be unambiguously identified? This
check ensures that if any of the provided gene identifiers are ambiguous (e.g. many
organisms have a gene named TP53), then there is additional information available to identify them unambiguously.
Other validation checks ensure integrity of the provided data, e.g. whether at least
one annotation source has been selected for the query or whether the gene species
provided by the user is supported in DisEnT.
If any of the validation checks fail, the user is presented with a meaningful error
message explaining why the data has been deemed invalid and what can be done to
rectify the failure. Figure 3.6 shows an example of such an error message.
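A few of the checks described above can be sketched as a function that collects error messages instead of stopping at the first failure. This is a hypothetical Python illustration covering only some of the 12 checks; the function name, the message wording and the three-species subset are assumptions, while the source names are those of the three data sources aggregated in DisEnT.

```python
# Illustrative subset only; DisEnT supports 23 species (Table A.2).
SUPPORTED_SPECIES = {"human", "mouse", "rat"}
KNOWN_SOURCES = {"GeneRIF", "OMIM", "Ensembl Variation"}


def validate_query(input_genes, background_genes, sources, species=None):
    """Return a list of human-readable error messages (empty if valid)."""
    errors = []
    if not input_genes:
        errors.append("Please provide an input gene list.")
    if not background_genes:
        errors.append("Please provide or select a background gene list.")
    if not sources:
        errors.append("Please select at least one annotation source.")
    unknown = sorted(set(sources) - KNOWN_SOURCES)
    if unknown:
        errors.append("Unknown annotation source(s): " + ", ".join(unknown))
    if species is not None and species not in SUPPORTED_SPECIES:
        errors.append("Species '%s' is not supported." % species)
    return errors


ok = validate_query(["TP53"], ["TP53", "BRCA1"], ["OMIM"], "human")
bad = validate_query([], ["TP53"], [], "zebra")
```

Returning all messages at once lets the user fix every problem in a single resubmission.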
Figure 3.7: An intermediate message informing the user of the current state of their
query.
If all of the validation checks pass, a new search query is recorded in the system
and a new job is added to the Resque job queue. After that, the user is redirected to
a web page showing the current status of their query. This page periodically polls the
database to refresh its information about the state of the query and to display the results
as soon as they become available. Figure 3.7 shows this screen and its message.
3.2.3 Gene Identification
After passing the validation stage, each of the submitted genes goes through an identification process where it is looked up against various database tables in order to
unambiguously find its global identifier and its species. This step enables DisEnT
to compare and use the provided genes in the context of other databases.
More specifically, DisEnT will attempt to ‘translate’ the provided gene identifiers
into Entrez Gene (Maglott et al., 2005) identifiers (Entrez IDs). Entrez Gene is one of
the most comprehensive gene databases and it is operated by the National Center
for Biotechnology Information (NCBI). Entrez IDs have the form of an integer number
that is unique across all the genes and species. If DisEnT detects this format, the
identifier is looked up to confirm existence of such gene and to identify its species. If,
however, the input format is not an Entrez ID, the identification is more complicated.
3.2.3.1 Identifying Gene Symbols
The simple numerical format makes Entrez IDs a popular choice for gene identification
in many gene-specific databases outside the Entrez Gene database. However, most
scientific literature refers to genes by gene symbols, making it necessary for DisEnT
to be able to recognize them. The Entrez Gene database does provide a database for
mapping between their Entrez IDs and gene symbols, but this process comes with a
number of challenges.
Firstly, while Entrez IDs are globally unique, gene symbols are often repeated
across organisms. For example, the human gene tumor protein p53⁹ is known under the gene symbol TP53 and Entrez ID 7157, but cattle and pig also carry genes
under the same symbol but different Entrez IDs (281542 and 397276 respectively).
Because these naming clashes are relatively common, DisEnT requires the species of the
gene symbols to be provided with the input in order to be able to identify them unambiguously.
Another challenge associated with the use of gene symbols is that they can change.
As new genes are discovered and old ones are re-defined, the gene symbols change to
reflect this new knowledge (Povey et al., 2001). For DisEnT’s identification process,
this means that the system has to be able to maintain an up-to-date list of current gene
symbols as well as a list of their synonyms. Data mapping current gene symbols to
their synonyms is also provided by the Entrez Gene database, but maintaining both
these lists exacerbates the problem of re-used gene symbols.
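The two challenges discussed so far can be illustrated with a toy lookup. This Python sketch is hypothetical: the two dictionaries stand in for DisEnT’s database tables, the Entrez IDs for human and cattle TP53 are those listed in Figure 3.8, and the synonym entry P53 is included only as an illustrative alias.

```python
# (species, symbol) -> Entrez ID; toy stand-ins for the real lookup tables.
CURRENT_SYMBOLS = {
    ("human", "TP53"): 7157,
    ("cattle", "TP53"): 281542,
}
SYNONYMS = {
    ("human", "P53"): 7157,   # illustrative synonym entry
}


def lookup_symbol(symbol, species=None):
    """Return the Entrez ID for a gene symbol or one of its synonyms.

    Raises LookupError when the symbol is unknown or, without a species,
    matches genes in more than one organism.
    """
    hits = set()
    for table in (CURRENT_SYMBOLS, SYNONYMS):
        for (table_species, table_symbol), entrez_id in table.items():
            if table_symbol == symbol and species in (None, table_species):
                hits.add(entrez_id)
    if len(hits) != 1:
        raise LookupError("%r matched %d genes" % (symbol, len(hits)))
    return hits.pop()


human_tp53 = lookup_symbol("TP53", species="human")
```

Calling lookup_symbol("TP53") without a species raises LookupError in this sketch, because the symbol matches both the human and the cattle entry.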
Finally, a technical consideration for translating gene symbols is the size of the dataset.
As of 8th August 2014, the NCBI Entrez Gene database contained 16,013,303 gene
entries, making textual lookup of gene symbols and their synonyms a computationally expensive task. For this reason, DisEnT limits its support to 23 organisms, reducing
this number to 727,770 entries.
The 23 species were chosen based on species supported by the widely-used Ensembl (Flicek et al., 2014) database. This choice allowed us to reduce computational
cost while ensuring that DisEnT has support for some of the most popular model organisms. The list of supported species can be found in Table A.2 of Appendix A.
An alternative approach to automatic identification of gene symbols would be to
consider the rest of the list and ‘assume’ the species of the provided gene symbols
based on genes that can be identified without ambiguity. For example, if 80 percent
of all submitted genes were safely identified as human, DisEnT could narrow down its
search to human gene symbols only.
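For illustration only, this alternative (which DisEnT does not implement) could be sketched as follows. The function name, its arguments and the integer percentage threshold are assumptions made for this Python example.

```python
from collections import Counter


def infer_species(identified_species, total_submitted, threshold_percent=80):
    """Guess the species of ambiguous gene symbols from the rest of the list.

    identified_species: species of the genes that were identified without
    ambiguity; total_submitted: number of all submitted genes.
    Returns the dominant species, or None if no species reaches the threshold.
    """
    if not identified_species or total_submitted <= 0:
        return None
    species, count = Counter(identified_species).most_common(1)[0]
    # Integer arithmetic avoids floating-point comparisons at the boundary.
    if count * 100 >= threshold_percent * total_submitted:
        return species
    return None


guess = infer_species(["human"] * 8 + ["mouse"], total_submitted=10)
```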
While this would be an interesting avenue to explore from the user experience perspective, it could introduce a certain level of unpredictability to DisEnT’s behaviour.
Specifically, DisEnT cannot offer any a priori assurance to the user that any proportion
of their genes will be uniquely identifiable (without additional species information)
before going through the computationally expensive identification stage. Thus, in order to improve the likelihood of returning meaningful results after a successful query
submission, DisEnT will ask for species specification if it detects any gene symbols in
the input.
⁹ http://www.ncbi.nlm.nih.gov/gene/7157
3.2.3.2 Identifying Ensembl IDs
Another popular type of identifier recognized by DisEnT is the Ensembl (Flicek et al.,
2014) gene identifier. Ensembl identifiers follow a naming convention such as ENSG
followed by an 11-digit number (e.g. ENSG00000141510 for the aforementioned TP53).
While the naming convention can differ between species, these identifiers are stable
and globally unique. As with gene symbols, conversion between Ensembl and
Entrez identifiers can be done using data from the Entrez Gene database.
To summarize, DisEnT can identify Entrez and Ensembl identifiers as well as gene
symbols and their synonyms. Genes that have been successfully identified will be used
in the next stage of the process – mapping to their human homologs.
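The distinction between the three identifier formats can be sketched as a simple classifier. This Python example is an assumption-laden illustration rather than DisEnT’s implementation: in particular, the Ensembl pattern only covers identifiers of the form ENS, an optional species code, G and 11 digits, which does not hold for every species.

```python
import re

ENTREZ_PATTERN = re.compile(r"[0-9]+")               # plain integer
ENSEMBL_PATTERN = re.compile(r"ENS[A-Z]*G[0-9]{11}")  # e.g. ENSG00000141510


def classify_identifier(identifier):
    """Return 'entrez', 'ensembl' or 'symbol' for a single identifier."""
    token = identifier.strip()
    if ENTREZ_PATTERN.fullmatch(token):
        return "entrez"
    if ENSEMBL_PATTERN.fullmatch(token):
        return "ensembl"
    return "symbol"      # anything else is treated as a gene symbol


def needs_species(identifiers):
    """Gene symbols are not globally unique, so they require a species."""
    return any(classify_identifier(i) == "symbol" for i in identifiers)


kinds = [classify_identifier(i) for i in ["7157", "ENSG00000141510", "TP53"]]
```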
3.2.4 Mapping to Homologs
Because DisEnT uses the Human Disease Ontology to perform the enrichment analysis, it requires human genes for its computations. If the user enters any non-human
genes in any of their lists, they have to be ‘translated’ into their human ‘equivalents’ –
i.e. mapped to their human homologs. A homolog is a gene that shares some ancestral
DNA with another gene. This often implies that they are functionally similar, though
the similarity levels may vary (Storm and Sonnhammer, 2002).
Gene homologs can be identified using the NCBI HomoloGene¹⁰ database table.
The homolog mapping process, depicted in Figure 3.8, consists of the following steps.
First, a given non-human gene of interest is identified in the HomoloGene table. Each
of the entries in the HomoloGene table belongs to a homology group – a grouping
containing homologs across many species. Once the homology group is identified, all
that remains is to find a gene within the homology group that is a human gene – the
human homolog.
¹⁰ http://www.ncbi.nlm.nih.gov/homologene
[Figure 3.8 diagram: the gene of interest (Entrez ID 22059) is mapped to its matching homolog (Entrez ID 7157) in three steps: 1. Find the non-human gene entry; 2. Identify homology group; 3. Identify the human entry.]
Entrez ID   Gene Symbol   Homology Group   Species
22059       Trp53         460              Mouse
281542      TP53          460              Cattle
7157        TP53          460              Human
24842       Tp53          460              Rat
Figure 3.8: The homolog mapping process. The (non-human) gene of interest is looked
up and its homology group identified. The human homolog is a human gene in the
same homology group. If there is more than one human homolog present in the homology
group, no homologs are returned.
Note that the HomoloGene table in the diagram has been simplified for the sake of
clarity, but the mechanism of the process remains the same.
Because a given gene can have more than one human homolog (see Koonin (2005)
for an explanation why), the process described above may return more than one result.
Choosing which homolog to return is not trivial and therefore DisEnT discards these multi-hit results and does not produce a mapping in such a case. This
is a trade-off DisEnT makes, sacrificing user experience in order to preserve validity
of its results.
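The mapping procedure, including the handling of multi-hit results, can be sketched as follows. This Python example is a hypothetical illustration: the rows reproduce the simplified table of Figure 3.8, and the function name and tuple layout are assumptions rather than DisEnT’s actual schema.

```python
# Simplified HomoloGene rows as in Figure 3.8:
# (Entrez ID, gene symbol, homology group, species)
HOMOLOGENE = [
    (22059, "Trp53", 460, "Mouse"),
    (281542, "TP53", 460, "Cattle"),
    (7157, "TP53", 460, "Human"),
    (24842, "Tp53", 460, "Rat"),
]


def map_to_human_homolog(entrez_id, table=HOMOLOGENE):
    """Return the Entrez ID of the unique human homolog, or None."""
    # Steps 1 and 2: find the gene entry and its homology group.
    groups = {group for gene, _, group, _ in table if gene == entrez_id}
    if len(groups) != 1:
        return None                   # gene not found (or not unique)
    group = groups.pop()
    # Step 3: find the human entries in the same homology group.
    human = [gene for gene, _, g, species in table
             if g == group and species == "Human"]
    # Multi-hit results are discarded rather than guessed.
    return human[0] if len(human) == 1 else None


homolog = map_to_human_homolog(22059)
```

Returning None for groups with several human genes reproduces the trade-off described above: some genes are left unmapped in exchange for not having to choose between candidate homologs.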
Having all of the genes translated into human homologs is the last step of data
preparation. This data can be finally submitted to the computation module, which will
perform the disease enrichment analysis.
3.2.5 Enrichment Analysis
As previously stated, the disease enrichment analysis step is carried out entirely in the
R language environment, with the computation module communicating with the rest
of DisEnT via a TCP/IP interface. While implementation of the enrichment analysis
procedure was not in the scope of this project (it was instead implemented by Simpson (2014)), it may be instructive to explain some of the design choices behind this
component.
As mentioned in Section 2.1.1.2, DisEnT makes use of the elim algorithm developed by Alexa et al. (2006). This algorithm is implemented and distributed as an R
package called topGO by the authors of the algorithm (Alexa and Rahnenfuhrer, 2010)
using the previously mentioned Bioconductor repository.
As its name suggests, topGO was created to enable enrichment analysis with Gene
Ontology terms. This package was modified by He and Simpson (2014) to enable its
use for disease enrichment analysis. As of today, this modified version of the topGO
package has not been published, but its working title is topOnto. The topOnto package
is being developed as an ontology-neutral statistical package, enabling users to leverage
the functionality of the topGO package using any ontology.
The interface between Rails and the R component is enabled by the RServe package
described in Section 3.1.3. In order to perform the analysis, Rails sends the gene
lists and the choice of annotation sources (explained in Section 3.2.1) to R over the
RServe channel. R loads all dependencies needed for the computation (e.g. topOnto)
and calculates the enrichment values. Once the results have been processed and the
elim algorithm applied, they are passed to Rails. The Rails component then saves the
results to the database and progresses into the final stage of the process – presentation
of the data.
Figure 3.9: The results page.
3.2.6 Presenting Results
The design goals for the data presentation interface were identical to those of the input
form. The data presentation page was designed to be simple and minimalistic by
default, while allowing users to explore the results in more detail if needed.
Figure 3.9 shows an example of the results page. The central component of this
page is the table of results. This table lists the enriched disease terms including their
HDO identifiers and a number of statistical values. The Background Count and List
Count column values represent the number of genes in the background list and in the
input list, respectively, that match the disease term. The Expected Count, Enrichment and the Fisher
‘elim’ p-value are presented here as calculated by the process described in the previous
section. The results table allows users to order the results by any of the columns
without refreshing the page. All the presented data is also available for download in
the CSV format.
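The relationship between the count columns and the derived statistics can be illustrated with a short sketch. The Python code below is an illustration under stated assumptions, not DisEnT's implementation: it assumes that the Expected Count is the input list size scaled by the proportion of background genes annotated to the term, that Enrichment is the ratio of the observed List Count to the Expected Count, and it uses a plain one-sided Fisher's exact test rather than the elim variant computed by topOnto. All function names and the example numbers are hypothetical.

```python
from math import comb


def expected_count(background_size, background_count, list_size):
    """Genes expected to match the term if the input list were a random draw."""
    return list_size * background_count / background_size


def enrichment(list_count, expected):
    """Ratio of observed to expected matches (assumed definition)."""
    return list_count / expected


def fisher_upper_tail(background_size, background_count, list_size, list_count):
    """One-sided Fisher's exact p-value: P(X >= list_count), X hypergeometric.

    This is the classic test without the elim decorrelation step.
    """
    total = comb(background_size, list_size)
    upper = min(background_count, list_size)
    tail = sum(
        comb(background_count, k)
        * comb(background_size - background_count, list_size - k)
        for k in range(list_count, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 background genes, 200 annotated to the term,
# 100 genes in the input list, 5 of which match the term.
expected = expected_count(20000, 200, 100)
print(expected, enrichment(5, expected), fisher_upper_tail(20000, 200, 100, 5))
```

Exact integer binomial coefficients are used so that the sketch stays dependency-free; a statistics library would be the more usual choice for large gene lists.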
Figure 3.9 also shows a number of other features available on this page. For
example, a warning message is shown to the user pointing out that some of the genes
submitted in the query did not have (unambiguous) matches in DisEnT’s gene database.
Note that in order to keep the interface uncluttered, the message
does not report any more details about the unidentified genes, making it the user’s
choice to investigate or ignore the warning.
Figure 3.10: An expanded Query details section of the results page.
If the user chooses to investigate which genes were not mapped to gene identifiers,
they can do so by opening the Query details part of the page. This page section offers
a detailed description of the outcomes of the identification and mapping of the entered
genes (processes described in Sections 3.2.3 and 3.2.4). As shown in Figure 3.10, the
Query details section reports numbers of identified and unidentified genes as well as
their species.
The different colouring of the reported counts marks active hyperlinks. For example, if the user wishes to find out which of their genes were not found in the DisEnT
database, they can simply click on the count (in this case 6) and they will be presented
with the list of genes in question. This interface is shown in Figure 3.11.
This section described a typical DisEnT use case scenario when interacting with
the system using a web browser. Though we expect that this will be the usual channel
for using DisEnT, we also offer application programming interface (API) access for
more automated interaction with the system.
Figure 3.11: A list of genes reported after clicking on a gene count value.
3.3 Programmatic Interface
The API access was implemented to enable DisEnT’s users to interact with the system
automatically using custom scripts and programmes. This interface can help share
DisEnT’s data and open its capabilities to novel use cases.
The initial API implementation uses the JavaScript Object Notation (JSON) format
to share data with its clients. An example of a JSON-formatted object (more specifically, a search result) is shown below:
{
  "disease_id": "DOID:9974",
  "fisher_elim_p_value": "0.123769808173478",
  "genes_input_names": ["ENSG00000004487", "ENSG00000002745"]
}
In order to conform to commonly accepted Internet standards, the DisEnT API also
makes use of the Hypertext Transfer Protocol (HTTP) status codes. HTTP status codes
are a standard way of sending a ‘connotation’ along with the data (i.e. payload) of an
HTTP response. Each HTTP code is represented by a number (e.g. 404) and its meaning
(e.g. Not Found). Conforming to these standards enables software clients such as web
browsers to communicate with remote services more effectively (Fielding et al., 1999).
In its current version, the DisEnT API supports two actions: submission of a
new query and retrieval of its results. Both of these actions are briefly described
in this section and their detailed documentation is available via the Apiary service
at http://docs.disent.apiary.io/. The Apiary service was chosen to host the
DisEnT API documentation because it can automatically generate code snippets in
several programming languages that can be immediately used to query DisEnT.
Apiary also offers a so-called mock interface that does not perform any actual calculations but exhibits the same behaviour as the DisEnT API. The mock interface is used
to describe the API in the examples below.
3.3.1 Submitting a Query
A query can be submitted in an HTTP POST request (a request type commonly used to
submit form data) sent to a designated DisEnT URL. The POST request is accompanied
by a JSON representation of the query data.
POST http://disent.apiary-mock.com/search_queries.json
{
  "search_query": {
    "input_list": { "genes_string": "ENSG00000001617, ENSG00000002745" },
    "background_list": { "predefined_name": "Human genome" },
    "mapping_sources": {
      "omim": true,
      "ensembl": true,
      "generif": true
    }
  }
}
The request above describes a query that contains two genes in its input list and
selects a predefined background list. Because this is a valid request, the API will
respond with the HTTP code 201 known as Created and a JSON object stating the ID
of the newly-created query:
{
  "notice": "The search query has been submitted",
  "id": 14
}
In case the request is not valid, the DisEnT API will return the encountered validation errors (identical to those discussed in Section 3.2.2) along with the HTTP code
400 (Bad Request):
{
"errors": [
"Input list contains gene symbols. Please specify their species",
"Background list can’t be blank"
]
}
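The submission step can be sketched from a client's point of view. The following Python example is a minimal illustration using only the standard library; the helper names are hypothetical, the URL is the Apiary mock address from the example above, and the Content-Type header is an assumption that is not confirmed by the API description. The sketch builds the request body in the shape shown earlier and maps the 201 and 400 responses to a query ID or an error.

```python
import json
import urllib.error
import urllib.request

# Mock endpoint taken from the example above; the helper names are hypothetical.
SUBMIT_URL = "http://disent.apiary-mock.com/search_queries.json"


def build_query(genes, background="Human genome",
                sources=("omim", "ensembl", "generif")):
    """Build the JSON body in the shape shown in the request example."""
    return {
        "search_query": {
            "input_list": {"genes_string": ", ".join(genes)},
            "background_list": {"predefined_name": background},
            "mapping_sources": {name: True for name in sources},
        }
    }


def interpret_submission(status, body):
    """Return the new query ID for 201, raise on validation or other errors."""
    if status == 201:
        return body["id"]
    if status == 400:
        raise ValueError("; ".join(body.get("errors", [])))
    raise RuntimeError(f"unexpected HTTP status {status}")


def submit_query(genes, url=SUBMIT_URL):
    """POST the query; the Content-Type header is an assumption."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_query(genes)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return interpret_submission(response.status, json.load(response))
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx; the error object still carries the body.
        return interpret_submission(error.code, json.load(error))
```

Keeping the response handling in a separate function from the network call means that the status-code logic can be exercised without contacting the server.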
3.3.2 Retrieving Results
A query submitted via the API is assigned to the job queue, just like a query submitted
via the web-page interface. To retrieve the results of a query, the API client can simply
issue an HTTP GET request with the query’s ID included in the URL (Uniform Resource
Locator):
GET http://disent.apiary-mock.com/search_queries/14/results.json
which requests results of the search query under ID 14. If the search query is
still being processed and its results are not available, DisEnT will respond with a 202
(Accepted) status, along with a short message:
{
  "notice": "The query results are not yet available. Please try again later."
}
Once the results are computed, the same call will return a 200 (OK) status, listing the
results:
[
  {
    "disease_id": "DOID:9974",
    "fisher_elim_p_value": "0.123769808173478",
    "genes_input_names": ["ENSG00000004487", "ENSG00000002745"]
  },
  {
    "disease_id": "DOID:0060041",
    "fisher_elim_p_value": "0.358765288293535E-1",
    "genes_input_names": ["ENSG00000004487", "ENSG00000002745"]
  }
]
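Because results may not be ready when first requested, a client typically has to poll. The sketch below shows one possible way to do this in Python with the standard library; the helper names, the polling delay and the attempt limit are assumptions made for illustration, and the URL follows the Apiary mock address used above. The fetch function is passed in as a parameter so that the polling logic can be tried out without a network connection.

```python
import json
import time
import urllib.request

# URL pattern of the Apiary mock shown above; helper names are hypothetical.
RESULTS_URL = "http://disent.apiary-mock.com/search_queries/{query_id}/results.json"


def fetch_results(query_id):
    """Issue the GET request and return (status code, decoded JSON body)."""
    with urllib.request.urlopen(RESULTS_URL.format(query_id=query_id)) as response:
        return response.status, json.load(response)


def wait_for_results(query_id, fetch=fetch_results, delay=5.0,
                     max_attempts=12, sleep=time.sleep):
    """Poll until 200 (OK) arrives; 202 (Accepted) means 'try again later'."""
    for _ in range(max_attempts):
        status, body = fetch(query_id)
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"unexpected HTTP status {status}")
        sleep(delay)
    raise TimeoutError(f"query {query_id} not finished after {max_attempts} attempts")


def by_p_value(results):
    """Sort results by p-value; the API examples encode p-values as strings."""
    return sorted(results, key=lambda row: float(row["fisher_elim_p_value"]))
```

Note that the p-values in the example responses are JSON strings, so they need to be converted to numbers before any numerical comparison.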
Although the current implementation of the API only supports a subset of DisEnT’s web-page features, these features can be easily extended thanks to Rails’ Model-View-Controller architecture (explained in Section 3.1.1). Similarly, DisEnT can be
easily extended to support other data representation formats, such as XML (Extensible
Markup Language). It may also be worth noting that a combination of both approaches
can be used, e.g. the API for submitting queries along with a web browser for investigating their results.
3.4 Development Methodology
In order to further address the sustainability criterion outlined in Section 2.4.1 of the
previous chapter, this section will briefly comment on the development methods used
to design and implement DisEnT.
DisEnT was developed using the Agile software development approach. Agile is a
term coined in 2001 by Beck et al. (2001) and one of the main tenets of this methodology is to develop software built for change rather than to follow a pre-set plan.
Building software ‘ready’ for change means building software that is robust, modular and orthogonal, i.e. building software without introducing unnecessary complexity
and dependencies between its components. This chapter has described how DisEnT
adheres to these principles in several aspects.
For example, delegating specialised tasks to specialised modules decreases dependencies between the components. In the event of a failure of the DisEnT website, the
job queue module along with the computation module should remain unaffected and
continue processing the enqueued jobs.
Similarly, the choice of technologies adhering to the Agile principles, such as Ruby
on Rails, also improves the robustness and ultimately the sustainability of DisEnT. Thanks
to the Model-View-Controller pattern, DisEnT’s web interface is easily extensible to
communicate using JSON or XML, or to provide its data in formats such as CSV.
The Agile methodology also affected the ‘project management’ aspects of DisEnT.
Each of the project’s features was first outlined as a brief and well-defined use case
(sometimes referred to as user story) that was then broken down into short and achievable tasks. Over the course of this project, these tasks were prioritized and delivered
on a weekly basis, followed by frequent ‘releases’ of new versions of DisEnT.
While this approach arguably introduced some operational overhead into the development process, it helped the project to remain on track and deliver the most important
features based on frequent feedback from the project’s supervisor.
Finally, DisEnT was developed using the test-driven development (TDD) method.
TDD is a popular software development methodology in which software tests are written before the code they exercise. Adopting TDD made it possible to set clear expectations
of DisEnT’s desired behaviour before extending its code base. Moreover, because TDD
code is written to specifically address a set of discrete and well-defined expectations,
the code itself becomes more modular, robust and thus more maintainable.
While use of this methodology does not guarantee completely bug-free software,
TDD has been reported to radically decrease the number of software flaws. For example, Maximilien and Williams (2003) report reduction of software defects by 50
percent after adoption of TDD.
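The test-first workflow described above can be illustrated with a minimal sketch. The example is written in Python purely for illustration (DisEnT itself is a Rails application), and the function split_gene_string and its expected behaviour are hypothetical, chosen to resemble the comma-separated gene input used elsewhere in this chapter. The test is written first and states the expectation; the implementation is then the simplest code that satisfies it.

```python
# Step 1: the test is written first and pins down the desired behaviour.
def test_split_gene_string():
    raw = "ENSG00000001617, ENSG00000002745,, "
    assert split_gene_string(raw) == ["ENSG00000001617", "ENSG00000002745"]
    assert split_gene_string("") == []


# Step 2: the simplest implementation that makes the test pass.
def split_gene_string(raw):
    """Split a comma-separated gene string, dropping blanks and whitespace."""
    return [token.strip() for token in raw.split(",") if token.strip()]


test_split_gene_string()  # raises AssertionError if an expectation is not met
```

Running the test before the implementation exists would fail, which is the intended starting point of each TDD cycle.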
Chapter 4
Evaluation
The aim of this project was development of a reliable and accessible scientific tool for
disease enrichment analysis using a novel dataset. In order to evaluate whether this
goal was achieved, Chapter 2 introduced four main requirements for DisEnT: correctness, usability, scalability and sustainability.
As Chapter 3 described, these four criteria were used as guidelines when designing
DisEnT’s architecture, creating its interfaces and implementing its functionality. While
we suggest that the decisions made during the development of the project were suitable
for the set requirements, this chapter will evaluate whether DisEnT meets its criteria
using a number of other methods.
The correctness criterion is evaluated by using results produced by DisEnT to
find relationships between diseases and genes and comparing them to findings in the
literature.
To validate DisEnT’s usability, a user study was conducted, where a group of
participants were asked to use DisEnT to perform disease enrichment analysis.
Scalability of DisEnT is evaluated using a simulated load impact test to verify
whether the system can cope with increased concurrent workload.
Finally, the system’s sustainability is evaluated against a set of peer-reviewed recommendations for scientific software.
4.1 Correctness
Although DisEnT’s source code is covered by a high number of automated tests, not
even extensive software testing can guarantee complete correctness of a software system. Thus, in order to evaluate whether the results produced by DisEnT are correct, we
tested the system in a realistic scenario. We used DisEnT to find enriched diseases in
two distinct sets of genes. The results were merged together and analysed as a network
in order to find linkages between the diseases identified in each set. Our findings were
compared to the published literature.
Figure 4.1: A cartoon depicting a synapse with its presynaptic (top) and postsynaptic
parts (bottom). Figure by Gécz (2010).
The lists used for the evaluations consisted of genes from the presynaptic and the
postsynaptic part of the synapse. The synapse is a structural feature between two neurons that enables their connection. The presynaptic part of the neuron can be considered a transmitter in this connection and the postsynaptic part as a receiver. Figure 4.1
shows an example of a synapse in a cartoon.
The gene lists were sourced from two published studies. The source of the presynaptic gene list was Boyken et al. (2013) and it contained 419 human genes. The postsynaptic gene list was published by Bayés et al. (2012) and it contained 1442 human
genes.
Each one of the gene lists was submitted to DisEnT in a separate query, setting
human as the target species and choosing all of the available annotation sources. Each
of the queries returned 50 results and the top 10 enriched diseases for the queries
are shown in Tables 4.1 and 4.2. As one would expect, most of these terms describe
neurological disorders.
Table 4.1: Top 10 disease terms enriched in the post-synaptic list, ranked by their p-value.
Disease Term (HDO ID)                            Enrichment Score   p-value
schizophrenia (DOID:5419)                        2.96               8.0847E-20
frontotemporal dementia (DOID:9255)              7.67               3.1864E-17
Leigh disease (DOID:3652)                        11.94              1.5542E-13
pyruvate decarboxylase deficiency (DOID:3649)    15.78              1.2052E-06
Ohtahara syndrome (DOID:0050709)                 9.23               3.5613E-05
chief cell adenoma (DOID:7607)                   30.00              5.4008E-05
parathyroid carcinoma (DOID:1540)                11.76              2.7085E-04
mitochondrial encephalomyopathy (DOID:890)       8.33               2.7818E-04
leber hereditary optic neuropathy (DOID:705)     10.52              4.7406E-04
schizophreniform disorder (DOID:11328)           2.60               4.8057E-04
The full result set for each of the lists was downloaded in the CSV format for
further processing. In order to find connections between the diseases enriched in each
gene list, we combined the results from the two lists and used them to construct a
graph. Nodes of the graph were the enriched diseases and their edges were formed
by genes shared between diseases. Separate edges were drawn for presynaptic and
postsynaptic genes in common. This network was visualised using Cytoscape version
3.1.1. Figure 4.2 shows the initial layout of this network.
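The construction of the network can be sketched as follows. This Python example is an illustrative assumption about the procedure rather than the script actually used: it takes, for each gene list, a mapping from enriched disease terms to the input genes annotated to them (as could be read from the downloaded CSV files) and emits one weighted edge per pair of diseases that share genes. The function name and the toy data, including the placeholder gene symbols, are hypothetical.

```python
from itertools import combinations


def shared_gene_edges(disease_genes, origin):
    """Return (disease_a, disease_b, origin, weight) for pairs sharing genes.

    disease_genes maps a disease term to the set of input-list genes
    annotated to it; weight is the number of genes the pair has in common.
    """
    edges = []
    for first, second in combinations(sorted(disease_genes), 2):
        common = disease_genes[first] & disease_genes[second]
        if common:
            edges.append((first, second, origin, len(common)))
    return edges


# Hypothetical toy data; the gene symbols are placeholders, not DisEnT output.
presynaptic = {
    "schizophrenia": {"GENE_A", "GENE_B", "GENE_C"},
    "Alzheimer's disease": {"GENE_B", "GENE_C"},
    "glioblastoma": {"GENE_D"},
}
postsynaptic = {
    "schizophrenia": {"GENE_E", "GENE_F"},
    "frontotemporal dementia": {"GENE_F"},
}

# Separate edge sets are kept for the two gene lists, as described in the text.
edges = (shared_gene_edges(presynaptic, "presynaptic")
         + shared_gene_edges(postsynaptic, "postsynaptic"))
nodes = sorted(set(presynaptic) | set(postsynaptic))
```

An edge list of this form can be written to a delimited text file, which network tools such as Cytoscape can import.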
Edges in Figure 4.2 are weighted by the number of genes in common between each
of the diseases. ‘Postsynaptic’ edges are shown in red, while edges representing the
presynaptic genes are shown in green. Size of the nodes and their labels is determined
by the nodes’ betweenness centrality. A node’s betweenness centrality is calculated by
counting the proportion of shortest paths in the network that include that node.
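To make the centrality measure concrete, the sketch below computes betweenness centrality for a small undirected, unweighted graph using Brandes' algorithm. It is an independent illustration, not the Cytoscape implementation used for the figures: edge weights are ignored, the values are left unnormalised, and the function names and the example graph are hypothetical.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adjacency maps each node to an iterable of its neighbours.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        # Accumulate path dependencies in reverse breadth-first order.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = paths[pred] / paths[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Every undirected pair was visited from both of its endpoints.
    return {node: value / 2.0 for node, value in centrality.items()}


def top_nodes(centrality, count=10):
    """Names of the highest-ranked nodes, most central first."""
    return sorted(centrality, key=centrality.get, reverse=True)[:count]


# Hypothetical chain of three diseases: the middle one lies on the only
# shortest path between the other two.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = betweenness_centrality(chain)
```

Ranking the nodes with top_nodes corresponds to keeping only the most central diseases when a large network needs to be simplified.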
Figure 4.2 shows that there are many common gene connections between the postsynaptic diseases (connections shown in red) and that there are a number of presynaptic
disease terms (connections shown in green) that are separate from the postsynaptic
disease cluster. While most of the terms refer to neurological disorders, there are a
number of metabolic disorders present in the network such as the pyruvate decarboxylase
deficiency. Although this connection may appear surprising, past studies have
shown various links between metabolic and neurological disorders. For example, a
link has been found between a neurological disorder called Leigh’s disease (ranked
3rd in Table 4.1) and pyruvate carboxylase deficiency (Toshima et al., 1982; Ohtake
et al., 1982).
Figure 4.2: Network of diseases linked by their presynaptic and postsynaptic genes in
common. Nodes represent the diseases and edges represent the shared genes. Green
edges represent presynaptic genes and red edges represent postsynaptic genes.
Thickness of the edges is determined by the number of genes in common. Size of
the disease nodes is determined by their betweenness centrality value (explained in
the text).
Table 4.2: Top 10 disease terms enriched in the pre-synaptic list, ranked by their p-value.
Disease Term (HDO ID)                            Enrichment Score   p-value
schizophrenia (DOID:5419)                        2.91               4.5421E-65
frontotemporal dementia (DOID:9255)              5.43               2.6852E-33
Alzheimer’s disease (DOID:10652)                 2.07               1.3499E-21
Down syndrome (DOID:14250)                       2.47               6.7531E-08
glioblastoma (DOID:3068)                         1.78               1.9764E-07
neurotic disorder (DOID:4964)                    1.66               2.0477E-07
autistic disorder (DOID:12849)                   1.90               3.9672E-07
temporal lobe epilepsy (DOID:3328)               3.23               6.0301E-07
amyotrophic lateral sclerosis (DOID:332)         2.04               1.4568E-06
Charcot-Marie-Tooth disease (DOID:10595)         3.73               2.5146E-06
In order to simplify the network, we reduced it to top 10 nodes based on their
betweenness centrality value (Figure 4.3). We also built separate networks from the
presynaptic and postsynaptic connections and reduced them to top 10 nodes ordered
by betweenness centrality. The reduced presynaptic network is shown in Figure 4.4
and the postsynaptic network in Figure 4.5.
Figures 4.3, 4.4 and 4.5 all show a significant enrichment for the schizophrenia
term. This significance, supported by this term’s high ranking in Tables 4.1 and 4.2, is
also evidenced by a number of studies of schizophrenia in the context of presynaptic
(Sternberg, 1982) and postsynaptic (Pandey et al., 1977) processes. It should be noted,
however, that the high number of gene connections between schizophrenia and other
diseases may also be a consequence of the large amount of published literature studying
schizophrenia in general.
Another interesting phenomenon in Figures 4.4 and 4.5 is the number of terms
related to cancer (e.g. glioblastoma, breast cancer, paraganglioma) present in the two
networks.
Figure 4.3: The network of postsynaptic and presynaptic gene associations reduced to
top 10 nodes ordered by betweenness centrality. Nodes shown: metabolic acidosis,
pyruvate decarboxylase deficiency, mitochondrial encephalomyopathy, schizophrenia,
phaeochromocytoma, neurotic disorder, paraganglioma, Charcot-Marie-Tooth disease
type 2, Alzheimer's disease, frontotemporal dementia.
Figure 4.4: The network of the presynaptic gene associations reduced to top 10 nodes
ordered by betweenness centrality. Nodes shown: metabolic acidosis, pyruvate
decarboxylase deficiency, Alzheimer's disease, Charcot-Marie-Tooth disease type 2,
mitochondrial encephalomyopathy, oligodendroglioma, schizophrenia, frontotemporal
dementia, paraganglioma, phaeochromocytoma, neurotic disorder.
Figure 4.5: The network of the postsynaptic gene associations reduced
to top 10 nodes ordered by betweenness centrality. Nodes shown: melanoma, non-small
cell lung carcinoma, Alzheimer's disease, frontotemporal dementia, neurotic disorder,
glioblastoma, breast cancer, astrocytoma, schizophrenia, brain disease.
One of the reasons behind the high number of connections between the neurological and the cancer-related terms could be that certain gene products, such as those
controlling the cell-cycle, are re-used in different parts of a biological system. In order
to test this hypothesis, we performed a functional enrichment analysis on the genes
associated with the cancer-related terms. We found that processes associated with the cell-cycle are highly enriched in these genes. The enrichment analysis was performed using
the ToppGene suite (Chen et al., 2009) and its results are summarized in Table 4.3.
While the involvement of cell-cycle genes may explain some of the connections
between the cancer terms and neurological diseases, it should be noted that these links
may still be interesting. For example, several studies have found links between the incidence of cancer and neurological diseases such as Alzheimer’s disease (Roe et al.,
2005) and schizophrenia (Barak et al., 2005; Grinshpoon et al., 2005).
The scenario described above attempted to use data produced by DisEnT to make
meaningful inferences about links between diseases and genes. Our findings suggest
that using DisEnT’s data in this manner can lead to correct and interesting results, such
as links between neurological disorders and metabolic disorders or cancer.
However, as with any biological research, care needs to be taken when choosing
investigation methods as well as when evaluating the results. Future repetitions
of the presented evaluation scenario may benefit from more granular reduction of the
constructed networks in order to find more subtle connections, and more sophisticated
statistical analysis of the network accounting for the fact that certain diseases are studied more than others. Enabling this form of analysis within the DisEnT application
may also be valuable to its users.
Table 4.3: Top 5 results of enrichment analysis for genes associated with cancer-related
terms. Enrichments of molecular functions, biological processes and cellular components are listed.
Rank   GO Term (ID)                                     p-value
Molecular Function
1      enzyme binding (GO:0019899)                      5.883E-49
2      kinase binding (GO:0019900)                      2.305E-29
3      protein complex binding (GO:0032403)             1.183E-27
4      cytoskeletal protein binding (GO:0008092)        1.809E-27
5      protein kinase binding (GO:0019901)              2.508E-26
Biological Process
1      cell projection organization (GO:0030030)        1.880E-46
2      neuron projection development (GO:0031175)       6.365E-45
3      neuron development (GO:0048666)                  7.478E-44
4      neurogenesis (GO:0022008)                        4.818E-42
5      neuron projection morphogenesis (GO:0048812)     5.556E-42
Cellular Component
1      neuron projection (GO:0043005)                   1.282E-37
2      neuron part (GO:0097458)                         9.476E-35
3      cytoskeletal part (GO:0044430)                   2.627E-28
4      vesicle (GO:0031982)                             4.244E-27
5      cell junction (GO:0030054)                       5.093E-27
4.2 Usability
To further test the usability of DisEnT, we conducted a small user study whose participants were asked to use DisEnT to perform a simple disease enrichment analysis. The
participants were instructed to carry out a set of practical tasks using DisEnT’s web
interface and then comment on their user experience.
The study was conducted using the Google Forms service¹. The Forms service
was used to present instructions to the participants as well as to collect their answers.
The full content of the survey as well as its results can be found in Appendix B and at
http://goo.gl/sgLhd2.
4.2.1 Practical Tasks
The study instructed its participants to access DisEnT’s pre-production website, enter
a given set of genes as an input list, specify the genes’ species and submit the query.
The specified query generated 50 results that were all displayed to the participants. In
order to answer the questions listed below, the participants were instructed to use features
that DisEnT’s user interface provides but no further guidance was given. In the event
of not knowing the answer to any of the questions, the participants were allowed to not
submit an answer.
Once the participants were presented with their search results, they were asked to
answer the following questions:
1. What is the name of the disease with the highest enrichment score?
2. How many genes from the input list are associated with that disease?
3. Please list all genes from your list associated with Alzheimer’s disease.
¹ https://docs.google.com/forms/
To answer question 1, the participants were expected to sort the query results by
the ‘Enrichment’ column in descending order. Out of the 10 participants in the study, 9
answered question 1 correctly. The participant who answered this question incorrectly
reported the first result in the default ordering (by p-value).
The incorrect answer could have been caused by confusion between the enrichment score column and DisEnT’s default ranking. This error could potentially have been avoided by stating the question more explicitly or by targeting it at a different
column.
Answering question 2 required the users to identify the List count column value
for the result identified in question 1. All of the participants answered this question
correctly. (The participant who reported an incorrect disease in question 1 provided a
count matching their answer.)
Finally, question 3 asked the users to find the List count value for Alzheimer’s disease and report the associated genes by clicking on the count. Nine participants answered
this question correctly. One participant reported genes associated with the disease from
questions 1 and 2, suggesting that they misunderstood question 3.
4.2.2 User Experience
Following the practical part, the second section of the study asked the participants to
evaluate difficulty of the preceding tasks and comment on usability of DisEnT’s user
interface. The questions in this section were as follows:
1. How difficult was it to answer the questions above?
Rating on a scale of 1 to 5, where 1 means ‘Easy’ and 5 means ‘Difficult’
2. Can you rate the ease of use of DisEnT?
Rating on a scale of 1 to 5, where 1 means ‘Easy to use’ and 5 means ‘Difficult
to use’
3. Can you describe the DisEnT’s user interface in 1-3 words?
4. What is your level of experience with the enrichment analysis method?
Rating on a scale of 1 to 5, where 1 means ‘No experience’ and 5 means ‘Expert’
5. Any further comments
In question 1, 5 of the participants marked the difficulty level as 1 and 5 participants
as 2. This suggests that the participants considered finding answers to the questions
in the practical section to be relatively easy. As Section 4.2.3 reports, most of the
participants had very little or no previous experience with the enrichment analysis
method, so reports of a certain level of difficulty were expected from the experiment.
Answers to question 2 consisted of 7 ratings of 1 and 3 ratings of 2, which suggests
that DisEnT is generally considered to be easy to use by non-experts.
Question 3 of this section allowed users to label DisEnT’s user interface using their
own words. Answers to this question were predominantly positive, using words such
as ‘intuitive’, ‘minimalist’, ‘clean’ and ‘friendly’. In contrast, however, one of the
participants described the interface as ‘a bit confusing’.
Answers for question 4 reported that most of the study participants had little or no
prior experience with enrichment analysis. When asked to rate their level of experience
with this method, 7 participants chose the lowest rating, ‘No experience’.
Optional comments provided by the participants contained additional positive feedback and a number of suggestions. Specifically, two of the responses suggested that more
of the search result headings should be annotated and one respondent found the implementation of the table sorting functionality unclear. While this particular type of ‘free-form’ feedback is difficult to quantify, we believe that the suggestions and shortcomings pointed out in this study may form valuable information for future improvements
of DisEnT.
4.2.3 Evaluation
Overall, the feedback provided by this study suggests that DisEnT’s user interface is
perceived positively. Almost all study participants completed the given tasks correctly
and evaluated the difficulty of the given tasks as low. Likewise, most of the participants
expressed a positive opinion about DisEnT’s web interface. We believe that these
results suggest that the usability criterion was met by DisEnT. However, it is worth
discussing some of the limitations of this user study.
The difficulty of the practical tasks was set relatively low in order to be able to
conduct the study with non-expert participants. Setting more complicated tasks to a
more experienced audience could provide an insight into DisEnT’s usability in a more
realistic scenario.
Moreover, the participants chosen for this study were university-level students of
informatics and bioinformatics. There is some expectation that participants of this type
are technically able and accustomed to a range of web technologies. This notion, along
with the observed experience levels, may not be an adequate reflection of DisEnT’s
target user base. A future user study could include a more diverse range of participants.
The reason behind choosing non-expert participants was twofold: firstly, it allowed
us to test DisEnT in a scenario where the users were not familiar with the enrichment
analysis process and were only guided by brief instructions. This setting required the users
to carry out the task exclusively using the features of DisEnT’s interface. The second
reason for choosing non-expert participants was practical: non-expert users are easier
to find.
To summarize, while the conducted study was relatively short and not without its
limitations, we believe that the collected results provide some evidence that the application developed in this project meets its usability criterion.
4.3 Scalability
Scalability of DisEnT was tested by a technique known as load testing (Weyuker and
Vokolos, 2000). As its name suggests, the aim of load testing is to observe performance
of a system when exposed to heavy workload.
To carry out a load test of DisEnT, we used an online service called Load Impact².
The Load Impact service enables its users to configure ‘user scenarios’ – custom scripts
imitating usual user interaction with a service. These user scenarios are then ‘replayed’
on the target system, simulating usage ‘traffic’ on the system. Load Impact replays the
scenarios at a gradually increasing rate, putting more workload on the target system.
We simultaneously tested DisEnT using two user scenarios. One of the scenarios
simulated a user loading a results page showing 50 analysis results. The second scenario simulated a user submitting a query containing 100 gene symbols using the API.
Each of the scenarios had a randomly chosen back-off period of 1 to 10 seconds before
they were repeated.
The reason for choosing an API call for query submission over a web-form interface was a technical one. Because allowing automated submission of web forms can
be considered a security risk (known as cross-site request forgery), Rails implements
a safety feature in its web forms that prevents this risk. Unfortunately, this also makes
it difficult to automate the form submission task in a Load Impact user scenario.
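The idea behind this safety feature can be illustrated with a simplified token check. The sketch below is not Rails's implementation; it only shows the general pattern of a per-session secret that is embedded in a rendered form and checked on submission. The method names (`issue_token`, `valid_submission?`) and the use of a plain hash as the session are hypothetical.

```ruby
require 'securerandom'

# When the form is rendered, the server stores a random token in the
# session and embeds the same value in a hidden form field.
# (Hypothetical helper; the session is modelled as a plain hash.)
def issue_token(session)
  session[:csrf_token] = SecureRandom.hex(32)
end

# On submission, the request is accepted only if the submitted token
# matches the one stored in the session.
def valid_submission?(session, submitted_token)
  expected = session[:csrf_token]
  return false unless expected.is_a?(String) && submitted_token.is_a?(String)
  return false unless expected.bytesize == submitted_token.bytesize

  # Compare every byte so the check does not stop at the first mismatch.
  difference = 0
  expected.bytes.zip(submitted_token.bytes) { |a, b| difference |= a ^ b }
  difference.zero?
end

session = {}
form_token = issue_token(session)
puts valid_submission?(session, form_token)  # token copied from the rendered form
puts valid_submission?(session, 'replayed')  # value a script did not obtain from the form
```

A scripted client that replays a recorded form submission would not hold a token matching the current session, which is consistent with the difficulty described above and with the choice of the API for the load test.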
[2] http://loadimpact.com/
The Load Impact test worked by gradually increasing the number of user connections, each of which replayed one of the user scenarios on DisEnT. The two scenarios were spread evenly among
the connections. The test started with a single connection and ended with 50 user connections. This increase was achieved over the course of 5 minutes. The parameters of
50 connections and 5 minutes were both limited by the free version of the Load Impact
service and thus could not be adjusted.
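The shape of this test can be sketched in a few lines of Ruby. This is only an illustration: it assumes that the number of connections grows linearly and that back-off periods are whole seconds, neither of which is stated by Load Impact, and the method names are hypothetical.

```ruby
# Number of simulated connections active `seconds` into the test.
# Assumes a linear increase from `first` to `last` over `ramp_seconds`
# (the linear shape is an assumption made for this sketch).
def connections_at(seconds, ramp_seconds: 300, first: 1, last: 50)
  elapsed = seconds.clamp(0, ramp_seconds)
  first + ((last - first) * elapsed / ramp_seconds.to_f).round
end

# Pause before a scenario is replayed: a random whole number of
# seconds between 1 and 10 (whole seconds are an assumption).
def back_off_seconds(rng = Random.new)
  rng.rand(1..10)
end

[0, 60, 300].each do |t|
  puts "t = #{t} s: #{connections_at(t)} connection(s)"
end
puts "next back-off: #{back_off_seconds} s"
```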
The results of the load test are presented in the form of a chart in Figure 4.6. The parameters observed during the test were the number of active connections (Clients
active), the number of requests per second (Requests/second) and the service response
time (User load time). The full details of the test and its results can be found at the
Load Impact website [3].
As Figure 4.6 shows, the number of connections had a slight detrimental effect
on DisEnT’s response time. However, all of the requests were processed in under
1 second, which is considered an acceptable response time for an optimal web user
experience (Nielsen, 1999).
Over the course of its 5-minute run, the test issued 3831 unique HTTP requests
at a rate of up to 30 requests per second. 195.68 megabytes of data were transferred
between the DisEnT server and the simulated Load Impact users.
The results page scenario was triggered 628 times with an average response time of
839.88 milliseconds (ms). The 100-gene query submission scenario was triggered 695
times with an average response time of 472.14 ms. All of the requests were processed
and responded to successfully.
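The figures above allow a rough check of the average load. The following Ruby lines are a back-of-the-envelope calculation that uses only the numbers reported in this section; the averages are derived values, not measurements reported by Load Impact, and the variable names are ours.

```ruby
total_requests  = 3831     # HTTP requests issued during the test
test_seconds    = 5 * 60   # duration of the run in seconds
transferred_mb  = 195.68   # data transferred, in megabytes
results_runs    = 628      # results page scenario
submission_runs = 695      # 100-gene query submission scenario

# Average request rate over the whole run (the peak was up to 30 per second).
average_request_rate = total_requests.to_f / test_seconds

# Average amount of data per request, converted to kilobytes (1 MB = 1000 kB here).
average_kb_per_request = transferred_mb * 1000 / total_requests

# Number of times either scenario was replayed.
total_scenario_runs = results_runs + submission_runs

puts format('Average request rate: %.2f requests/s', average_request_rate)
puts format('Average transfer per request: %.1f kB', average_kb_per_request)
puts "Scenario runs in total: #{total_scenario_runs}"
```

The average rate is well below the reported peak of 30 requests per second, as expected for a test that starts from a single connection and ramps up gradually.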
We believe that the difference in response time between the two scenarios can be
explained by the use of two different protocols. While the results page scenario
is less computationally expensive (it only involves reading data), it actually consists
of several ‘sub-requests’ for assets contained on the page (e.g. stylesheets, scripts, images). On the other hand, the query submission scenario is computationally more expensive, but the process only involves issuing a single request to which DisEnT replies
with a single response message.
We believe that the load testing results show that DisEnT is usable under heavy
workload and thus meets the scalability criterion. While limitations of the free version of the Load Impact service did not allow us to determine any limits of DisEnT’s
scalability, we do not expect that a scientific tool of this type would be used under
conditions exceeding the conditions of the described load test.
[3] http://loadimpact.com/load-test/synprot.inf.ed.ac.uk-356f529ecaff26571c779e5c52624bda
Figure 4.6: Results of load testing showing DisEnT’s response time (blue), active client
connections (green) and requests per second (red) over time. Figure provided by Load
Impact.
Future load tests could benefit from more intensive testing with an increased request rate and more observed variables. For example, the time needed to return results
from a newly-submitted query could be observed. DisEnT in its current version
operates only one worker processing its job queue, but this number can easily be increased
to optimize the query turnaround time.
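The effect of adding workers on turnaround time can be illustrated with a small simulation. This is a sketch under simplifying assumptions: jobs are independent, each job goes to the worker that becomes free first, and the job durations are made-up values. It does not use Resque and does not measure DisEnT itself.

```ruby
# Time at which the last job finishes when `durations` (seconds, in
# queue order) are processed by `worker_count` identical workers.
def queue_turnaround(durations, worker_count)
  raise ArgumentError, 'need at least one worker' if worker_count < 1

  free_at = Array.new(worker_count, 0.0)  # when each worker next becomes idle
  durations.each do |duration|
    worker = free_at.each_index.min_by { |i| free_at[i] }  # earliest-free worker
    free_at[worker] += duration
  end
  free_at.max
end

jobs = [4, 2, 3, 1, 5, 2]  # hypothetical analysis durations in seconds

(1..3).each do |workers|
  puts "#{workers} worker(s): last job finishes after #{queue_turnaround(jobs, workers)} s"
end
```

In this toy example the gain from each additional worker shrinks, because the longest jobs start to dominate the total time.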
4.4 Sustainability
Sustainability was the fourth and final criterion considered when designing DisEnT.
We believe that a sustainable piece of software is designed to be reusable, maintainable and long-lasting. While these attributes are often only identifiable over a longer
period of time, this section attempts to evaluate the sustainability of DisEnT by matching
it against a set of recommendations for writing scientific software recently published
by Wilson et al. (2014).
In their study entitled Best practices for scientific computing, Wilson et al. suggest eight recommendations based on the authors’ collective experience in developing
scientific software, on various open-source software guidelines, as well as on published
scientific computing studies. This section briefly introduces these recommendations
and describes how DisEnT addresses each one of them.
4.4.1 Write programs for people, not computers
This recommendation suggests that software source code should be easy to read and
comprehend by people reading it. This can be achieved by adhering to a consistent
style of formatting and writing, as well as choosing meaningful names for variables,
methods and other components included in the code.
This requirement is inherently addressed in the Ruby on Rails architecture. By
adhering to the convention over configuration pattern described in Section 3.1.1, Rails
encourages developers to be consistent and descriptive in naming of components in
their code. DisEnT was developed in line with this convention in Rails as much as
possible.
The Ruby code in DisEnT was written according to a community-driven Ruby code
style guide [4], ensuring its formatting is coherent and the source code is legible and
self-explanatory. Custom methods have been named descriptively and more complex
methods have been annotated with code comments.
[4] https://github.com/bbatsov/ruby-style-guide
4.4.2 Let the computer do the work
The message of this recommendation is to automate repetitive tasks in scenarios such
as database maintenance or software deployments.
Ruby on Rails also addresses this requirement by offering Rake tasks – scripts
written in Ruby for common tasks such as database migrations. Rake tasks are
functionally identical to the Rails runners described in Section 3.1.4.
Examples of standard Rails Rake tasks include the db:setup task for automatic
creation of database tables based on their definition or the db:seed task for populating
the database with pre-defined data. These tasks are tremendously useful when creating
a new development environment.
Rake tasks are not limited to database operations and can in fact execute any
Ruby (and Rails) code, so developers can create their own customized tasks. DisEnT contains a number of custom tasks, such as the disent:restart task. This task
performs a number of operations needed for updating DisEnT in the production environment. These operations include applying any pending changes to the database structure,
re-starting Resque workers and finally re-loading the Rails server. These steps need to
be completed every time a new version of DisEnT is deployed; this Rake task reduces
the multi-step process to a single step.
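A task of this kind could look roughly like the sketch below. The actual disent:restart task is not reproduced in this thesis, so the three step tasks and their bodies are hypothetical placeholders that only record what they would do. In the real application they would trigger the database migration (for example the standard Rails db:migrate task), the Resque worker restart and the server reload.

```ruby
require 'rake'
extend Rake::DSL  # makes `namespace`, `task` and `desc` available outside a Rakefile

steps_run = []  # records the order of execution (for illustration only)

namespace :disent do
  # Each step is a placeholder for the real operation.
  task(:migrate)         { steps_run << 'apply pending database changes' }
  task(:restart_workers) { steps_run << 're-start Resque workers' }
  task(:reload_server)   { steps_run << 're-load the Rails server' }

  desc 'Run all deployment steps in order'
  task restart: %i[migrate restart_workers reload_server]
end

# Equivalent to running `rake disent:restart` on the command line.
Rake::Task['disent:restart'].invoke
puts steps_run
```

Declaring the steps as prerequisites keeps each operation usable on its own while the combined task fixes the order in which they run.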
4.4.3 Make incremental changes
This recommendation consists of two points. First, the authors advise scientific software developers to make small incremental changes in order to be able to accommodate
frequent feedback. Second, they recommend using version control for all manually-produced code.
As Section 3.4 describes, DisEnT was developed in small increments by delivering
each feature in a separate, well-defined task. The management aspect of this project
also included weekly meetings where DisEnT’s latest version was discussed and any
outstanding tasks could be re-prioritised.
DisEnT was developed using the Git [5] version control system from the very beginning of the project. In addition to the Ruby on Rails code, all R code used for
computation is also stored in Git.
[5] http://git-scm.com/
4.4.4 Don’t repeat yourself (or others)
This point recommends making code reusable and modular, so that each piece of
knowledge in the system is represented only once. This approach makes it easier to
maintain the code base.
The Don’t repeat yourself (DRY) principle is one of the major design goals of the
Ruby on Rails framework. For example, attributes of each model defined in Rails (Section 3.1.1 explains the model-view-controller pattern) are inferred from the database
schema so that they do not have to be re-defined.
Code written in Rails also has access to all of the features of Ruby, making it possible
for developers to create and include custom classes and modules for sharing behavior
and knowledge across the code base.
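Sharing behavior through a module can be sketched as follows. The module and class names are hypothetical and are not taken from DisEnT’s code base; the point is only that the formatting rule is written once and reused by both classes.

```ruby
# Hypothetical mixin: a single definition of how a named record is displayed.
module DisplayName
  def display_name
    "#{self.class.name}: #{name}"
  end
end

# Two unrelated classes reuse the same behavior by including the module.
Gene    = Struct.new(:name) { include DisplayName }
Disease = Struct.new(:name) { include DisplayName }

puts Gene.new('MAPT').display_name
puts Disease.new('Amyotrophic lateral sclerosis').display_name
```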
4.4.5 Plan for mistakes
This recommendation stresses the importance of automated software testing and suggests preventing already-discovered bugs from re-occurring by writing tests that capture
them.
As described in Section 3.4, DisEnT was developed using the test-driven development (TDD) methodology, where tests for a piece of code are written before the
code itself. Using the popular RSpec [6] testing framework, DisEnT was developed with
exceptionally high test coverage. SimpleCov [7], a Ruby plugin for measuring test
coverage, reported 96.23% of all DisEnT’s Ruby code to be covered by a test case.
4.4.6 Optimize software only after it works correctly
This recommendation suggests using profilers to identify performance bottlenecks,
as well as using high-level programming languages to implement functionality before
optimizing the performance.
We carried out performance testing in Section 4.3 in order to identify any performance bottlenecks in the DisEnT system. As mentioned in that section, we have not
yet identified any performance bottlenecks in the production version of the system, but
this may partially be due to limitations enforced by the performance profiler used.
[6] https://github.com/rspec/rspec
[7] https://github.com/colszowka/simplecov
Implementing functionality before optimizing performance was the method for implementation of the gene identification and homolog mapping processes (Sections 3.2.3
and 3.2.4 respectively).
4.4.7 Document design and purpose, not mechanics
The message of this point is to write documentation that explains design decisions
rather than implementation details. This point also recommends embedding documentation in the code.
Because most of DisEnT’s code consists of simple and short methods, its documentation embedded in code is not extensive. However, any major design decisions were
recorded in a set of wiki pages created for the project. The wiki pages contain the challenges encountered and their solutions. Moreover, DisEnT’s API has been extensively
documented via the Apiary service at http://docs.disent.apiary.io/.
Should there be a need, future versions of DisEnT’s code base could be annotated
with specially-formatted comments that can be automatically translated into web-based
documentation. An example of a package offering such functionality is RDoc [8].
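As an illustration, a method annotated in the RDoc comment style might look as follows. The method is a hypothetical helper and is not taken from DisEnT’s code base; only the comment layout (a description followed by an indented usage example) follows RDoc conventions.

```ruby
# Splits a user-supplied list of gene symbols into individual symbols.
#
# Symbols may be separated by commas or whitespace. Empty entries and
# duplicates are dropped; the original order is preserved.
#
#   split_gene_symbols("BSG, C3  CST3, BSG")
#   # => ["BSG", "C3", "CST3"]
def split_gene_symbols(input)
  input.split(/[\s,]+/).reject(&:empty?).uniq
end

p split_gene_symbols("BSG, C3  CST3, BSG")
```

Running the rdoc command over source files annotated in this way generates browsable HTML pages from the comments.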
4.4.8 Collaborate
This point encourages scientific software developers to use collaborative tools and
techniques such as code reviews or pair programming. The authors also advise using specialised tools for issue tracking.
Due to the nature of this project (i.e. an individual Master’s project), the implementation phase did not offer many opportunities for programming collaboration, but DisEnT was
developed with frequent feedback.
The progress of this project was always trackable in a constantly-updated spreadsheet available online. Each of the tasks was assigned a unique identifier (e.g. DisEnT023) which could be referred to in all parts of the system, including the source code
and the wiki-based notes mentioned in the preceding section.
All in all, we suggest that DisEnT was developed in a way that will enable its re-use
and expansion in the future. While some of its aspects, such as the extent of its documentation, may be improved as the project becomes more complex, we believe that we
diligently followed the recommendations set above in order to produce a sustainable
piece of software.
[8] http://docs.seattlerb.org/rdoc
Chapter 5
Conclusion
Relationships between diseases and genes are tremendously complex, but the effort
invested in understanding them can reveal new insights into some of the world’s most
important health challenges. This thesis has introduced a method for studying gene-disease relationships by enrichment analysis and a tool enabling its use.
We have described DisEnT – a disease enrichment tool that allows access to gene-disease data collected from an unprecedented number of sources. The aim of this
project was to design, implement and evaluate DisEnT as a reliable and accessible
scientific tool. In order to achieve this aim, we set four main goals for DisEnT: correctness, usability, scalability and sustainability.
DisEnT has been developed to address all of the four criteria. Its code was implemented with tests safeguarding its correctness and its user interface was designed with
a focus on usability. DisEnT’s architecture is modular to allow for expansion of its
features and its underlying platforms were chosen to make it suitable for future re-use.
DisEnT was evaluated against each one of its four goals. In order to evaluate the
system’s correctness, we showed that results produced by DisEnT can be used to find
evidence-supported links, including links that are not immediately obvious. DisEnT’s
usability was evaluated in a user study, producing positive outcomes. We subjected
DisEnT to a load test to evaluate its scalability and found that it can cope with
increased workload. Finally, we provide evidence that we followed recommended best
practices in our methodology. While each of the evaluation methods we used has its
limits, we believe that their outcome is a reasonable indication of DisEnT’s quality.
While we suggest that the current version of DisEnT is a reliable scientific tool, its
main limitation is currently the limited range of features it provides. Although the currently-available
enrichment analysis functionality can provide valuable information to its
users, there is much more that can be done with the data available.
Future versions of DisEnT could introduce visualisation functionality into the
application. For example, automated construction of gene-disease networks (similar to
those described in Section 4.1) could enable users to gain even more insight from their
results without having to reach for external solutions.
Another valuable feature of DisEnT would be support for ontologies other than
the Human Disease Ontology. This feature could help users gain more specific insights
by using ontologies targeted specifically to their problem of interest, e.g. neurological
diseases. As a matter of fact, there is an already-ongoing effort to include the Human
Phenotype Ontology (Robinson et al., 2008) into DisEnT’s annotation data. Another
similar effort exists to enable research of diseases of the synapse – synaptopathies
(Grant, 2012) – in DisEnT.
DisEnT could also allow its users to trace supporting evidence for the enrichment
results it provides. Such a feature would require an enhancement of the current dataset,
but it would vastly improve the transparency of DisEnT’s internal processes.
Overall, this project has successfully produced a software tool capable of performing disease enrichment analysis on a given set of genes. We believe that the quality
and accessibility criteria set for DisEnT have been met and that the tool is correct, usable, scalable and sustainable. There are a number of features that could enhance this
tool further and we believe that this project has laid a solid foundation to enable this
expansion, allowing DisEnT to become a useful and reliable resource for the scientific
community.
Appendix A
Technical Specifications
A.1 Software Versions
This section describes versions of software packages used in DisEnT.
A.2 Supported Species
This section lists species supported in DisEnT as described in Section 3.2.3.1.
Table A.1: Versions of software packages used in DisEnT divided by their components
defined in Section 3.1.
Name                 Version
Ruby                 2.1.2
Ruby on Rails        4.1.0
Apache               2.2.22
Phusion Passenger    4.0.48
Rserve-client        0.3.1
Resque               1.25.2
SQLite               3.8.2
MySQL                5.5.38
R                    3.1.1
RServe               1.8.0
Redis                2.8.13
Components, in table order: Web Application, Database, Computation, Job Queue.
Table A.2: Species supported in DisEnT. The ‘common names’ are given as they
appear in DisEnT.
Common Name      Scientific Name
Anole lizard     Anolis carolinensis
C.intestinalis   Ciona intestinalis
Cat              Felis catus
Chicken          Gallus gallus
Chimpanzee       Pan troglodytes
Cow              Bos taurus
Dog              Canis lupus familiaris
Fruitfly         Drosophila melanogaster
Gorilla          Gorilla gorilla
Horse            Equus caballus
Human            Homo sapiens
Macaque          Macaca mulatta
Marmoset         Callithrix jacchus
Mouse            Mus musculus
Opossum          Monodelphis domestica
Orangutan        Pongo abelii
Pig              Sus scrofa
Rabbit           Oryctolagus cuniculus
Rat              Rattus norvegicus
S. cerevisiae    Saccharomyces cerevisiae
Turkey           Meleagris gallopavo
Zebra Finch      Taeniopygia guttata
Zebrafish        Danio rerio
Appendix B
User Study
B.1 Participant Answers
The following section lists all answers submitted to the user study described in Section 4.2 of Chapter 4.
B.1.1 Practical Part
The questions were as follows:
1. What is the name of the disease with the highest enrichment score?
2. How many genes from the input list are associated with that disease?
3. Please list all genes from your list associated with Alzheimer’s disease.
B.1.2 Evaluation
1. How difficult was it to answer the questions above?
Rating on a scale of 1 to 5, where 1 means ‘Easy’ and 5 means ‘Difficult’
2. Can you rate the ease of use of DisEnT?
Rating on a scale of 1 to 5, where 1 means ‘Easy to use’ and 5 means ‘Difficult
to use’
3. Can you describe the DisEnT’s user interface in 1-3 words?
4. What is your level of experience with the enrichment analysis method?
Rating on a scale of 1 to 5, where 1 means ‘No experience’ and 5 means ‘Expert’
5. Any further comments
Table B.1: Answers submitted to the practical part of the user study.
No.  Question 1                     Question 2  Question 3
1    Tropical spastic paraparesis   2           BSG, C3, CST3, GRIA1, GRIA2, GRIA3, GRM2, GSK3B, HECW1, HNRNPA1, MAPT, NEFH, NEFL, PARK7, PARP1, PIN1, RAB5A
2    Tropical spastic paraparesis   2           As above
3    Tropical spastic paraparesis   2           As above
4    Tropical spastic paraparesis   2           As above
5    Tropical spastic paraparesis   2           As above
6    Amyotrophic lateral sclerosis  39          As above
7    Tropical spastic paraparesis   2           As above
8    Tropical spastic paraparesis   2           NEFH, NEFL
9    Tropical spastic paraparesis   2           BSG, C3, CST3, GRIA1, GRIA2, GRIA3, GRM2, GSK3B, HECW1, HNRNPA1, MAPT, NEFH, NEFL, PARK7, PARP1, PIN1, RAB5A
10   Tropical spastic paraparesis   2           As above
Table B.2: Answers submitted to the evaluation part of the user study.
No.  Q1  Q2  Q3                             Q4  Question 5
1    2   1   beautiful                      1   I love the font and the question markers that offer some additional info.
2    2   2   it is ok                       1
3    1   1   minimalist, modern, intuitive  1
4    1   1   white                          1   Clean interface with great attention to detail, very intuitive. I like the progress reports, very cool! Hyperlinks to Home and Search seem to be redundant.
5    1   1   Easy to use                    2
6    2   2   Minimalist                     4   Nice looking, simple, works intuitively. When clicking buttons (?) provided information I was looking for.
7    1   1   Clean, Intuitive, Friendly     3   A ‘helper’ function to describe what the headings mean would be nice.
8    2   2   Pretty straightforward         1   Very nice.
9    1   1   minimalist, clean              1   The arrow indicating which column is used to sort the input looks like it belongs to the column on the left. I almost made that mistake with the first question.
10   2   1   Simple, effective              1   It might be useful to explain what each column means
B.2 Survey Form
On the following page.
Disease Enrichment Analysis with DisEnT
Hello, and thank you for taking part in this study.
You will be asked to perform a disease enrichment analysis using a new tool called DisEnT. Disease enrichment analysis is a computational method used to determine whether a disease is overrepresented in a set of genes.
This experiment should not take more than 5 minutes.
* Required
Practical Part
1. Go to the following URL: https://synprot.inf.ed.ac.uk/disent/
2. Enter the following genes as your input list (you can copy and paste):
ANKS1B, ARHGEF2, ATL1, BSG, C3, CST3, DCTN1, DOCK1, DPP6, ENO3, EPHA4, FGF2, GABRA1, GRIA1, GRIA2, GRIA3, GRIA4, GRM2, GRM5, GSK3B, HECW1, HNRNPA1, HSPA4L, HTT, KARS, KIAA0513, KIF3A, MAPT, NEFH, NEFL, PARK7, PARP1, PFN1, PIN1, PKN1, PPIA, PRPH, PTPRF, RAB5A
3. Specify the species as Human
4. Submit your search and wait for the results.
5. Based on the results, please answer the questions below
You should be able to answer each of the questions below just by using DisEnT's user interface. If you are not sure about the right answer, feel free to skip it.
1. What is the name of the disease with the highest enrichment score?
2. How many genes from the input list are associated with that disease?
3. Please list all genes from your list associated with Alzheimer's disease
Evaluation
Please answer these questions based on your experience from the practical part of the study.
4. How difficult was it to answer the questions above? *
Mark only one oval.
1 (Easy)   2   3   4   5 (Difficult)
5. Can you rate the ease of use of DisEnT? *
Mark only one oval.
1 (Easy to use)   2   3   4   5 (Difficult to use)
6. Can you describe the DisEnT's user interface in 1-3 words? *
7. What is your level of experience with the enrichment analysis method? *
Mark only one oval.
1 (No experience)   2   3   4   5 (Expert)
8. Any further comments
Bibliography
Alexa, A. and Rahnenführer, J. (2010). topGO: enrichment analysis for gene ontology.
R package version 2.8.
Alexa, A., Rahnenführer, J., and Lengauer, T. (2006). Improved scoring of functional
groups from gene expression data by decorrelating GO graph structure. Bioinformatics (Oxford, England), 22(13):1600–7.
Aronson, A. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium, pages 17–21.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis,
A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver,
L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin,
G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium. Nature genetics, 25(1):25–9.
Barabási, A.-L., Gulbahce, N., and Loscalzo, J. (2011). Network medicine: a network-based approach to human disease. Nature reviews. Genetics, 12(1):56–68.
Barak, Y., Achiron, A., Mandel, M., Mirecki, I., and Aizenberg, D. (2005). Reduced
cancer incidence among patients with schizophrenia. Cancer, 104(12):2817–21.
Bayés, A., Collins, M. O., Croning, M. D. R., van de Lagemaat, L. N., Choudhary, J. S.,
and Grant, S. G. N. (2012). Comparative study of human and mouse postsynaptic
proteomes finds high compositional conservation and abundance differences for key
synaptic proteins. PloS one, 7(10):e46683.
Beck, K., Beedle, M., and van Bennekum, A. (2001). The agile manifesto.
Binns, D., Dimmer, E., Huntley, R., Barrell, D., O’Donovan, C., and Apweiler, R.
(2009). QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics
(Oxford, England), 25(22):3045–6.
Botstein, D. and Risch, N. (2003). Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease.
Nature genetics.
Boyken, J., Grønborg, M., Riedel, D., Urlaub, H., Jahn, R., and Chua, J. J. E. (2013).
Molecular profiling of synaptic vesicle docking sites reveals novel proteins but few
differences between glutamatergic and GABAergic synapses. Neuron, 78(2):285–97.
Chakravarti, A. (2011). Genomic contributions to Mendelian disease. Genome research, 21(5):643–4.
Chen, J., Bardes, E. E., Aronow, B. J., and Jegga, A. G. (2009). ToppGene Suite
for gene list enrichment analysis and candidate gene prioritization. Nucleic acids
research, 37(Web Server issue):W305–11.
Chen, Y., Cunningham, F., and Rios, D. (2010). Ensembl variation resources. BMC
. . . , 11(1):293.
Collins, F. S. (1998). New Goals for the U.S. Human Genome Project: 1998-2003.
Science, 282(5389):682–689.
Drăghici, S., Khatri, P., Martins, R. P., Ostermeier, G., and Krawetz, S. A. (2003).
Global functional profiling of gene expression. Genomics, 81(2):98–104.
Fernandez, O., Jordan, D., Larkowski, J., Noria, X., and Pope, T. (2011). The Rails 3
Way. Addison-Wesley, Boston, 2nd edition.
Fielding, R., Gettys, J., Mogul, J., and Frystyk, H. (1999). Hypertext transfer protocol – HTTP/1.1.
Flicek, P., Amode, M. R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva, D.,
Clapham, P., Coates, G., Fitzgerald, S., Gil, L., Girón, C. G., Gordon, L., Hourlier,
T., Hunt, S., Johnson, N., Juettemann, T., Kähäri, A. K., Keenan, S., Kulesha, E.,
Martin, F. J., Maurel, T., McLaren, W. M., Murphy, D. N., Nag, R., Overduin,
B., Pignatelli, M., Pritchard, B., Pritchard, E., Riat, H. S., Ruffier, M., Sheppard,
D., Taylor, K., Thormann, A., Trevanion, S. J., Vullo, A., Wilder, S. P., Wilson,
M., Zadissa, A., Aken, B. L., Birney, E., Cunningham, F., Harrow, J., Herrero, J.,
Hubbard, T. J. P., Kinsella, R., Muffato, M., Parker, A., Spudich, G., Yates, A.,
Zerbino, D. R., and Searle, S. M. J. (2014). Ensembl 2014. Nucleic acids research,
42(Database issue):D749–55.
Gécz, J. (2010). Glutamate receptors and learning and memory. Nature genetics,
42(11):925–6.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S.,
Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus,
S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith,
C., Smyth, G., Tierney, L., Yang, J. Y. H., and Zhang, J. (2004). Bioconductor:
open software development for computational biology and bioinformatics. Genome
biology, 5(10):R80.
Ghisalberti, G., Masseroli, M., and Tettamanti, L. (2010). Quality controls in integrative approaches to detect errors and inconsistencies in biological databases. Journal
of integrative bioinformatics, 7(3):1–13.
Grant, S. G. N. (2012). Synaptopathies: diseases of the synaptome. Current opinion
in neurobiology, 22(3):522–9.
Grinshpoon, A., Barchana, M., Ponizovsky, A., Lipshitz, I., Nahon, D., Tal, O., Weizman, A., and Levav, I. (2005). Cancer in schizophrenia: is the risk higher or lower?
Schizophrenia research, 73(2-3):333–41.
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A.
(2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human
genes and genetic disorders. Nucleic acids research, 33(Database issue):D514–7.
He, X. and Simpson, T. I. (2014). Personal communication.
Hopkins, A. L. (2007). Network pharmacology. Nature biotechnology, 25(10):1110–1.
Hornik, K. and Leisch, F. (2003). A Fast Way to Provide R Functionality to Applications. In Proceedings of DSC.
Ihaka, R. and Gentleman, R. (1996). R: A Language for Data Analysis and Graphics.
Journal of Computational and Graphical Statistics, 5(3):299–314.
Jonquet, C., Shah, N., and Musen, M. (2009). The open biomedical annotator. Summit
on translational . . . , 2009:56–60.
Kelly, D. and Sanders, R. (2008). Assessing the quality of scientific software. First
International Workshop on Software . . . .
Koonin, E. V. (2005). Orthologs, paralogs, and evolutionary genomics. Annual review
of genetics, 39:309–38.
Krasner, G. and Pope, S. (1988). A description of the model-view-controller user
interface paradigm in the smalltalk-80 system. Journal of object . . . .
LePendu, P., Musen, M. A., and Shah, N. H. (2011). Enabling enrichment analysis with
the Human Disease Ontology. Journal of biomedical informatics, 44 Suppl 1:S31–8.
Machado, C. M., Freitas, A. T., and Couto, F. M. (2013). Enrichment analysis applied
to disease prognosis. Journal of biomedical semantics, 4(1):21.
Maglott, D., Ostell, J., Pruitt, K., and Tatusova, T. (2005). Entrez Gene: gene-centered
information at NCBI. Nucleic acids research, 33(Database issue):D54–8.
Mailman, M., Feolo, M., Jin, Y., and Kimura, M. (2007). The NCBI dbGaP database
of genotypes and phenotypes. Nature . . . , 39(10):1181–6.
Matsumoto, Y. and Ishituka, K. (2002). Ruby programming language.
Maximilien, E. and Williams, L. (2003). Assessing test-driven development at IBM.
Software Engineering, 2003. . . . , 6.
Michaud, K. and Wolfe, F. (2007). Comorbidities in rheumatoid arthritis. Best practice
& research Clinical rheumatology, 21(5):885–906.
Miller, J. (2009). Design For Convention Over Configuration. Microsoft.
Mort, M., Evani, U., and Krishnan, V. (2010). In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Human . . . ,
31(3):335–46.
NCBI (2014). GeneRIF: Gene Reference into Function.
Nielsen, J. (1999). User interface directions for the Web. Communications of the ACM,
42(1):65–72.
Ohtake, M., Takada, G., and Miyabayashi, S. (1982). Pyruvate decarboxylase deficiency in a patient with Leigh’s encephalomyelopathy. The Tohoku journal of . . . ,
137(4):379–386.
Osborne, J., Lin, S., and Kibbe, W. (2007). Other riffs on cooperation are already
showing how well a wiki could work. Nature, 446(7138):856.
Osborne, J. D., Flatow, J., Holko, M., Lin, S. M., Kibbe, W. A., Zhu, L. J., Danila,
M. I., Feng, G., and Chisholm, R. L. (2009). Annotating the human genome with
Disease Ontology. BMC genomics, 10 Suppl 1:S6.
Pandey, N., Garver, D., and Tamminga, C. (1977). Postsynaptic Supersensitivity in
Schizophrenia. Am J Psychiatry, 134(5):518–522.
Peng, K., Xu, W., Zheng, J., Huang, K., Wang, H., Tong, J., Lin, Z., Liu, J., Cheng,
W., Fu, D., Du, P., Kibbe, W. A., Lin, S. M., and Xia, T. (2013). The Disease and
Gene Annotations (DGA): an annotation resource for human disease. Nucleic acids
research, 41(Database issue):D553–60.
Petri, H. (2010). Data-driven identification of co-morbidities associated with rheumatoid arthritis in a large US health plan claims database. BMC musculoskeletal . . . ,
11(1):247.
Povey, S., Lovering, R., Bruford, E., Wright, M., Lush, M., and Wain, H. (2001). The
HUGO Gene Nomenclature Committee (HGNC). Human genetics, 109(6):678–80.
Robinson, P. N., Köhler, S., Bauer, S., Seelow, D., Horn, D., and Mundlos, S. (2008).
The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. American journal of human genetics, 83(5):610–5.
Roe, C. M., Behrens, M. I., Xiong, C., Miller, J. P., and Morris, J. C. (2005). Alzheimer
disease and cancer. Neurology, 64(5):895–8.
RStudio (2014). Shiny.
Ruby, S., Thomas, D., and Hansson, D. H. (2013). Agile Web Development with Rails
4.
Schriml, L. and Arze, C. (2012). Disease Ontology: a backbone for disease semantic
integration. Nucleic acids . . . .
Schwartz, B., Zaitsev, P., and Tkachenko, V. (2012). High performance MySQL: Optimization, backups, and replication.
Shah, N. H., Cole, T., and Musen, M. A. (2012). Chapter 9: Analyses using disease
ontologies. PLoS computational biology, 8(12):e1002827.
Simpson, T. I. (2014). Personal communication.
Smith, S. (2006). Systematic development of requirements documentation for general purpose scientific computing software. Requirements Engineering, 14th IEEE
International . . . , pages 209–218.
Sternberg, D. E. (1982). Impaired Presynaptic Regulation of Norepinephrine in
Schizophrenia. Archives of General Psychiatry, 39(3):285.
Storm, C. and Sonnhammer, E. (2002). Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics, 18(1):92–99.
Subramanian, A. (2005). Gene set enrichment analysis: a knowledge-based approach
for interpreting genome-wide expression profiles. Proceedings of the . . . .
The UniProt Consortium (2014). Activities at the Universal Protein Resource
(UniProt). Nucleic acids research, 42(Database issue):D191–8.
Tirrell, R., Evani, U., Berman, A. E., Mooney, S. D., Musen, M. A., and Shah, N. H.
(2010). An ontology-neutral framework for enrichment analysis. AMIA ... Annual
Symposium proceedings / AMIA Symposium. AMIA Symposium, 2010:797–801.
Toshima, K., Kuroda, Y., Hashimoto, T., Ito, M., Watanabe, T., Miyao, M., and Ii,
K. (1982). Enzymologic studies and therapy of Leigh’s disease associated with
pyruvate decarboxylase deficiency. Pediatric research, 16(6):430–5.
Weyuker, E. and Vokolos, F. (2000). Experience with performance testing of software systems: issues, an approach, and case study. IEEE Transactions on Software
Engineering, 26(12):1147–1156.
Wilson, G., Aruliah, D. A., Brown, C. T., Chue Hong, N. P., Davis, M., Guy, R. T.,
Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley, M. D., Waugh, B., White,
E. P., and Wilson, P. (2014). Best practices for scientific computing. PLoS biology,
12(1):e1001745.
Zeeberg, B., Feng, W., Wang, G., and Wang, M. (2003). GoMiner: a resource for
biological interpretation of genomic and proteomic data. Genome . . . , 4(4):R28.