Developing a Web Application for the DisEnT Disease Enrichment
Developing a Web Application for the DisEnT Disease Enrichment Tool

Ernest Walzel

Master of Science
School of Informatics
University of Edinburgh
2014

Abstract

Disease enrichment analysis is a statistical method used to determine whether a specific disease trait is overrepresented or underrepresented in a set of genes. Accuracy of this method relies heavily on coverage of available gene-disease association data. Problematically, sources of gene-disease associations differ in what data they store and how they store it. Data from three such sources – GeneRIF, OMIM and Ensembl Variation – were aggregated and unified, creating a dataset of gene-disease associations with unprecedented coverage. We developed DisEnT – a web-based application that enables disease enrichment analysis on this novel dataset. DisEnT was designed to be a reliable and accessible scientific tool that can scale with increasing workload and is sustainable for future expansion. DisEnT has a modular architecture consisting of a Ruby on Rails application, an R programming language module and a MySQL database. It was built in line with a published set of best practices for scientific computing. The system was evaluated for correctness, usability, scalability and sustainability, satisfying each of the criteria. DisEnT is available for access at https://synprot.inf.ed.ac.uk/disent.

Acknowledgements

I would like to thank Dr Ian Simpson, the supervisor of this project, for his exceptional support and trust. Thank you for all your help, your feedback, your insights and for giving me the freedom to do what I thought was best for the project. This has been the most enjoyable experience and I hope to have a chance to collaborate with you in the future.

I would also like to thank Xin He for his help and his work on the data underlying this project. It is Xin’s data that makes DisEnT possible.
Finally, my thanks go to my girlfriend Thea Koutsoukis for her around-the-clock, around-the-globe help and support in keeping this thesis legible and in keeping me sensible. You are the best.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Ernest Walzel)

To my brothers, Oto and Edo.

Contents

1 Introduction
2 Background
  2.1 Studying Sets of Genes
    2.1.1 Gene Set Enrichment Analysis
  2.2 Enrichment Analysis beyond the Gene Ontology
    2.2.1 The Human Disease Ontology
    2.2.2 Mining for Disease Annotations
  2.3 Current Solutions for Disease Enrichment Analysis
  2.4 Introducing DisEnT
    2.4.1 Design Goals
3 Developing DisEnT
  3.1 System Overview
    3.1.1 Web Application
    3.1.2 Database
    3.1.3 Computation
    3.1.4 Job Queue
  3.2 Implementation
    3.2.1 User Input
    3.2.2 Validation
    3.2.3 Gene Identification
    3.2.4 Mapping to Homologs
    3.2.5 Enrichment Analysis
    3.2.6 Presenting Results
  3.3 Programmatic Interface
    3.3.1 Submitting a Query
    3.3.2 Retrieving Results
  3.4 Development Methodology
4 Evaluation
  4.1 Correctness
  4.2 Usability
    4.2.1 Practical Tasks
    4.2.2 User Experience
    4.2.3 Evaluation
  4.3 Scalability
  4.4 Sustainability
    4.4.1 Write programs for people, not computers
    4.4.2 Let the computer do the work
    4.4.3 Make incremental changes
    4.4.4 Don’t repeat yourself (or others)
    4.4.5 Plan for mistakes
    4.4.6 Optimize software only after it works correctly
    4.4.7 Document design and purpose, not mechanics
    4.4.8 Collaborate
5 Conclusion
A Technical Specifications
  A.1 Software Versions
  A.2 Supported Species
B User Study
  B.1 Participant Answers
    B.1.1 Practical Part
    B.1.2 Evaluation
  B.2 Survey Form
Bibliography

Chapter 1
Introduction

Untangling the genetic basis of disease has always been of paramount interest to researchers worldwide. Billions of dollars and more than a decade of work were invested in the sequencing of the human genome by the Human Genome Project. One of the main driving forces behind the project was the promise of bringing new insights into pathologies of human diseases by annotating the genome sequence with genes and other areas of interest (Collins, 1998). However, recognizing the genes embedded in our DNA was only the start of a much longer and much larger effort in understanding how genes link to diseases.

For a particular subset of diseases, it is possible to pinpoint a mutation in a single gene as the cause of a disease. These diseases are referred to as monogenic or Mendelian diseases (Botstein and Risch, 2003), and they include conditions such as Huntington’s disease and cystic fibrosis. However, while these diseases are undoubtedly important to study and understand, their incidence is usually low, commonly predicted at around 5% (Chakravarti, 2011).

Unfortunately, the most frequently occurring and often very damaging diseases, such as cancer or Alzheimer’s disease, have a much more complicated pathology. These diseases usually occur as a result of a complex interplay between many genetic and environmental factors. In order to understand such diseases, one needs to look beyond a single gene. Genetic etiologies of complex diseases often describe whole sets of genes that can have a role in the development of a disease (Hopkins, 2007).

It is also true that a given gene can be an important factor for multiple diseases. In fact, even a typical monogenic disease such as sickle cell anaemia – a phenotype caused by a particular single mutation – often occurs accompanied by a multitude of other diseases connected to the mutation (Barabási et al., 2011). Thus, as well as looking beyond a single gene, one must look beyond a single disease.

The work described in this thesis is a resource for studying relationships between diseases and genes within the context of a disease hierarchy. For a given set of genes (e.g. genes highlighted by a biological experiment), it can answer questions such as "Are these genes characteristic for a particular type of disease?" or "Which other diseases should be considered whilst studying this disease?"

The resource developed – a web application called DisEnT – provides answers to those questions using a statistical method known as enrichment analysis. Enrichment analysis is used to determine whether a specific trait (in this case a disease) is overrepresented or underrepresented in a set of genes investigated (Tirrell et al., 2010). The aforementioned disease hierarchy used in DisEnT is called the Human Disease Ontology (HDO) (Schriml and Arze, 2012). HDO classifies diseases into categories and thus adds structure to the disease terminology. It is this hierarchical structure that allows exploring not only the links between diseases and genes, but also how the diseases themselves are linked.

This thesis describes DisEnT in the context of its application and its implementation. Chapter 2 describes the principles and ideas behind disease enrichment analysis, its importance and its limitations. This chapter also outlines how DisEnT enables disease enrichment analysis and how it overcomes some of the technical hurdles associated with the process. DisEnT builds on ideas of other similar systems using the Human Disease Ontology. This chapter outlines similarities and differences between these systems and DisEnT and it defines evaluation criteria for DisEnT.

The main focus of Chapter 3 is to describe the implementation of DisEnT. DisEnT was developed whilst following best practices for scientific computing as suggested by Wilson et al. (2014).
It was built with scalability and robustness in mind, allowing for future expansion and addition of new features. It comes with extensive test coverage and a documented Application Programming Interface (API), both of which are described.

Chapter 4 evaluates DisEnT against its design criteria: correctness, usability, scalability and sustainability. Observations made in this chapter suggest that results produced by DisEnT can be used to infer valuable knowledge supported by evidence. Furthermore, a user study indicates that the application is perceived positively by its users, it can cope with increased workload and it was built following best practices for building scientific software.

Finally, Chapter 5 discusses suitability and limitations of the current version of DisEnT and suggests other features that could be useful for users of this application.

Chapter 2
Background

2.1 Studying Sets of Genes

Many modern biological experiments use highly parallel assay methods to profile the biomolecular signal of a sample. Examples include the use of mass-spectrometry to identify protein constituents or a range of experiments measuring gene expression. Results from these experiments are typically analysed with advanced statistical methods that produce a list of ‘significant’ genes (Shah et al., 2012).

Finding a biological meaning behind these lists can be a daunting task: information about genes is usually dispersed across a number of specialised databases that often only contain a subset of the required information. Moreover, these databases often differ in what data they store and how they store it, making it difficult to use them effectively for a systematic analysis. In order to find a ‘common language’ for the differing databases, a structured dictionary of gene functions called the Gene Ontology (GO) was developed (Ashburner et al., 2000).
The Gene Ontology consists of a hierarchical structure of terms that describe biological functions and processes ordered by their ‘specificity’: terms at the root of the ontology are more generic, e.g. response to stimulus, and terms deeper in the hierarchy are more specific, e.g. chronic inflammatory response. Each term has an unambiguous identifier, such as GO:0002544. Figure 2.1 shows a subset of the GO hierarchy and some of its terms in a diagram.

Figure 2.1: Diagram showing a subset of the Gene Ontology (GO) hierarchy. The directed edges (shown as black arrows) show the direction of the "is a" relationship (for example, inflammatory response is a response to wounding). The coloured rectangles at the bottom of the term rectangles mark terms that are also used in specific sub-sets of GO called GO slims. The diagram was generated by the QuickGO browser (Binns et al., 2009).

Thanks to its hierarchical structure, GO lends itself to a number of methods of statistical analysis. A typical question a researcher may ask is whether any of the GO terms are over-represented or under-represented in the list of genes deemed significant. One of the methods that can provide an answer to such a question is gene set enrichment analysis (Zeeberg et al., 2003).

2.1.1 Gene Set Enrichment Analysis

Gene set enrichment analysis (GSEA) is a computational method commonly used to determine whether a group of genes shows statistical significance in a given context. The context itself is interchangeable: it may be a biological function (e.g. a GO term), chromosomal location or indeed a disease – the general principles are the same (Subramanian, 2005). Thanks to its robust methodology and easily interpretable results, GSEA has gained a lot of popularity in the scientific community (Tirrell et al., 2010).

Enrichment analysis can be applied in two types of scenarios.
Firstly, it can be used to confirm (or reject) a hypothesis, e.g. whether a group of genes is associated with a property it is thought to be associated with. Secondly, by analysing a previously unexplored set of genes, it can potentially discover previously unknown attributes associated with the gene set. Shah et al. (2012) refer to these two settings as hypothesis-driven and hypothesis-generating.

2.1.1.1 Calculating the Enrichment

The basic idea behind GSEA is a comparison of an ‘interesting’ list of genes, or an input list, to a chosen background list, sometimes referred to as a reference set. The input list is usually produced as an output of an experiment (as mentioned at the beginning of this chapter), and it is the primary subject of interest for the analysis. The background list is used as a reference providing a benchmark for calculations.

The choice of the background list is not always obvious and it often depends on the origin of the input list. For example, if the input list is obtained from a microarray chip experiment, only genes whose expression was measured by the chip should be included in the background list. In other cases, the background list can be constructed from the genome of the investigated organism, selecting genes that are annotated.

To perform GSEA, both lists are annotated with a given set of terms – in a typical scenario, GO terms are used. The enrichment analysis then compares the number of genes n in the input list annotated with a particular GO term (e.g. inflammatory response) to the number of all genes m annotated with that term in the background list. Based on those counts, and on the size of both lists (N and M respectively), the method produces a statistical output. In a simple scenario, the result can be a probability value (p-value) that marks how likely it is to observe at least n genes in the input list by chance (Shah et al., 2012) and it can be used to devise an enrichment score for a given term in the setting.

Figure 2.2: Calculation of enrichment of GO terms using the hypergeometric distribution. Figure by Shah et al. (2012).

Depending on the size of the lists, the p-value can be calculated using a number of statistical methods including a hypergeometric test or Fisher’s exact test. For example, Drăghici et al. (2003) propose a formula assuming the hypergeometric distribution as shown in Figure 2.2.

Fisher’s exact test is another popular method for calculating enrichment p-values and, as Shah et al. (2012) note, it is also more suitable for smaller gene lists. Using the same notation as in Figure 2.2, the formula for Fisher’s exact test would be

p = \frac{\binom{N}{n}\binom{M}{m}}{\binom{N+M}{n+m}}    (2.1)

2.1.1.2 Incorporating Hierarchy

Although useful, the approach of enrichment analysis based purely on gene counts associated with GO terms has a limitation that can hinder its interpretability. As Alexa et al. (2006) demonstrated, because of GO’s hierarchical structure, there is a strong correlation between terms in close hierarchical proximity. As a result, if a specific term (e.g. chronic inflammatory response) is marked as strongly enriched, there is a high probability that its more generic ancestor terms (e.g. response to stimulus) will also be marked as enriched. Figure 2.3 illustrates this bias pattern in an example.

Figure 2.3: A subgraph of the GO hierarchy with its enriched terms highlighted. The legend at the bottom of the figure describes the associated p-values. Figure by Alexa et al. (2006).

Figure 2.4: Scoring of GO terms after applying the elim algorithm (left) and the weight algorithm (right). Enrichment of more specific terms is higher and the bias caused by the hierarchical proximity between terms is reduced. Figure by Alexa et al. (2006).

To address this problem, Alexa et al. (2006) proposed two scoring algorithms that make use of the hierarchical structure of the ontology.
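Before turning to those algorithms, the count-based calculation from Section 2.1.1.1 can be made concrete. The following is a minimal Python sketch, not DisEnT's implementation: the function names are hypothetical, `point_probability` evaluates the expression in Equation 2.1 as written, and `upper_tail` assumes that the input list of N genes is drawn from a background of M genes, m of which carry the term, which is one common way of setting up the hypergeometric test.

```python
from math import comb


def point_probability(n, N, m, M):
    """Equation 2.1: C(N, n) * C(M, m) / C(N + M, n + m)."""
    return comb(N, n) * comb(M, m) / comb(N + M, n + m)


def upper_tail(n, N, m, M):
    """Probability of at least n annotated genes among N genes drawn
    without replacement from M genes, m of which are annotated.
    Assumes the input list is a subset of the background list."""
    hits = sum(comb(m, k) * comb(M - m, N - k)
               for k in range(n, min(N, m) + 1))
    return hits / comb(M, N)


# Toy counts (made up for illustration): 5 of 5 input genes carry the term,
# which annotates 6 of the 20 background genes.
p_value = upper_tail(5, 5, 6, 20)
```

With these toy counts the tail contains a single table, so `p_value` equals 6 / 15504; real gene lists would normally be handled by an established statistics package rather than hand-written sums.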
One of their proposed algorithms, named elim, works by iterating over the enriched terms starting from the bottom of the hierarchy and progressing towards the top. Starting with the most specific terms, it calculates their enrichment scores and then eliminates genes incorporated in that enrichment from all of the ancestral terms. This leads to much lower enrichment scores for terms that are not enriched by previously ‘unseen’ genes.

The elimination process in elim can be considered as assigning weights of either 0 or 1 to genes in the terms. Alexa et al. present a generalisation of this approach in their second algorithm, named weight, which assigns weights from the interval [0, 1] to the genes instead of eliminating them completely. As shown in Figure 2.4, both these algorithms lead to more interpretable results.

Both of these algorithms are commonly used for enrichment analysis using GO terms and, as Section 3.2.5 explains, DisEnT also makes use of the elim algorithm. However, DisEnT does so while using an ontology different from GO.

2.2 Enrichment Analysis beyond the Gene Ontology

While the Gene Ontology is a commonly used reference for enrichment analysis, the method, as described by Subramanian (2005), can be easily generalised and extended onto other ontologies. As well as asking which biological functions are shared by a set of genes, one can ask which disease is over-represented or under-represented in a gene set.

Performing enrichment analysis using an ontology of diseases may lead to interesting discoveries in disease pathologies. For example, by combining a disease ontology with several protein-labelling datasets, Mort et al. (2010) discovered that a particular class of blood coagulation diseases is associated with a previously unknown structural change in a group of proteins.
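Because the method generalises to other ontologies, the elim procedure from Section 2.1.1.2 can be illustrated on a small disease hierarchy. The sketch below is a simplified Python illustration under stated assumptions, not the algorithm as published or as used in DisEnT: the function names, the toy terms and genes, and the `cutoff` threshold that decides when a term's genes are eliminated from its ancestors are all assumptions, and each term's annotation set is assumed to already include the genes of its descendants.

```python
from math import comb


def upper_tail(n, N, m, M):
    """P(at least n annotated genes among N drawn from M, m annotated)."""
    hits = sum(comb(m, k) * comb(M - m, N - k)
               for k in range(n, min(N, m) + 1))
    return hits / comb(M, N)


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def elim_scores(parents, annotations, input_genes, background, cutoff=0.01):
    """Score terms bottom-up; genes of significant terms are removed
    from all of their ancestors before those ancestors are scored."""
    anc = {t: ancestors(t, parents) for t in annotations}
    # A term has strictly more ancestors than any of its own ancestors,
    # so this order visits every term before the terms above it.
    order = sorted(annotations, key=lambda t: len(anc[t]), reverse=True)
    removed = {t: set() for t in annotations}
    study = input_genes & background
    scores = {}
    for term in order:
        genes = (annotations[term] & background) - removed[term]
        p = upper_tail(len(genes & study), len(study),
                       len(genes), len(background))
        scores[term] = p
        if p < cutoff:  # assumed significance threshold
            for a in anc[term]:
                if a in removed:
                    removed[a] |= genes
    return scores


# Hypothetical three-level hierarchy with made-up gene labels.
parents = {"breast cancer": ["cancer"], "cancer": ["disease"], "disease": []}
background = {f"g{i}" for i in range(20)}
annotations = {
    "breast cancer": {f"g{i}" for i in range(5)},
    "cancer": {f"g{i}" for i in range(6)},
    "disease": {f"g{i}" for i in range(10)},
}
input_genes = {f"g{i}" for i in range(5)}
scores = elim_scores(parents, annotations, input_genes, background)
```

In this toy example only the most specific term keeps a small p-value. Without the elimination step, the parent term "cancer" would also fall below the cutoff purely because it inherits the same five genes, which is the bias the elim algorithm is meant to reduce.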
As well as new insights into pathology of diseases, enrichment analysis can help to improve healthcare by predicting the needs of patients suffering from a particular disease. For instance, several comorbidities (co-occurring disorders) have been identified in cohorts of patients suffering from rheumatoid arthritis using a technique similar to enrichment analysis (Petri, 2010). Using this knowledge, we can help target and develop more effective drug treatments and improve the quality of life of these patients (Michaud and Wolfe, 2007).

Finally, Machado et al. (2013) have recently investigated the possibility of using enrichment analysis as a first step in disease prognosis. While their findings still need to be verified by clinical data, their preliminary results are promising in characterising a disease called hypertrophic cardiomyopathy based on genetic data from a group of patients.

2.2.1 The Human Disease Ontology

The main purpose of using the Gene Ontology in enrichment analysis is to provide a common dictionary of biological functions and processes. In the realm of diseases, this role is fulfilled by the Human Disease Ontology (Schriml and Arze, 2012). Similarly to the Gene Ontology, the Human Disease Ontology (HDO) is a hierarchy of diseases stemming from generic disease terms such as disease to more specific terms like Huntington’s disease. Figure 2.5 shows a diagram of a subset of the ontology. As with GO, the hierarchical structure provides a unifying structure for classifying diseases as well as a platform for statistical analysis.

Figure 2.5: Sub-set of the Human Disease Ontology hierarchy. A ‘specific’ disease (Huntington’s disease) is shown in a red rectangle, while its ‘parent’ disease terms are shown in green rectangles. The orange circles represent the number of nodes hidden from the visualisation. Figure obtained from the Disease Ontology database (Schriml and Arze, 2012).

2.2.2 Mining for Disease Annotations

A common dictionary of diseases is essential for disease-based enrichment analysis. However, in order to be able to perform the analysis, one needs to annotate genes with terms from the chosen dictionary. While there are many well-established resources available that map genes to GO terms (The UniProt Consortium, 2014; Maglott et al., 2005), there is a lack of disease-annotation resources mapping to the Human Disease Ontology (Osborne et al., 2009; LePendu et al., 2011; Peng et al., 2013).

One of the ways to obtain the HDO annotations is to retrieve gene-disease associations from other sources and map them onto the HDO using natural language processing (NLP) tools. The main resources for human gene-disease associations are the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al., 2005), the Gene Reference into Function (GeneRIF) database (NCBI, 2014) and the Ensembl Variation database (Chen et al., 2010). While useful in their own right, these databases differ in the kind of information they store and how they store it. This means that researchers interested in using gene-disease annotation data must often consider a combination of these sources in order to establish a comprehensive overview of the data (Peng et al., 2013). However, retrieval, parsing and integration of data from these sources is not straightforward.

The OMIM database is a manually curated database of diseases and their genetic causes. The information recorded here comes from published literature and is reviewed and entered by a team of expert curators. This means that data in this database is usually of high quality, but the manual curation also introduces a considerable delay into the process of updating the database.
Moreover, the annotations are added with a high level of detailed description, but the resource is almost entirely free-text based, making it difficult to retrieve information from it programmatically (Mailman et al., 2007; Osborne et al., 2009). As a result, OMIM is a valuable source of information that is challenging to access using automated methods.

GeneRIF offers a different approach to gene annotation. It provides a simple mechanism for its users to add new annotations to genes based on scientific literature. The annotations are brief – up to 255 characters in length – and they typically describe the gene’s function or a role in a disease (see Figure 2.6 for an example). Osborne et al. (2007) describe this wiki-like process as successful and in a later paper (Osborne et al., 2009) they show that information stored here can be easily used for data-mining purposes. However, even though the data here is presented in a clearer and more accessible manner, it still has to be mapped onto the Human Disease Ontology.

Finally, Ensembl Variation provides a completely automated system of annotating genes with phenotypic information. It mines multiple databases to aggregate trait data (e.g. diseases) caused by a specific variation (e.g. mutation) in a gene. This automated data aggregation makes it possible to update the Ensembl Variation database very rapidly. However, the lack of manual curation can make the data prone to quality issues (Ghisalberti et al., 2010). One of the biggest advantages of Ensembl Variation for the purposes of enrichment analysis is that the data is available in an easily accessible relational database (Chen et al., 2010). As in the case of GeneRIF, however, this data still needs to be mapped onto an ontology.
2.3 Current Solutions for Disease Enrichment Analysis

Several approaches have been developed to map disease descriptions onto the Human Disease Ontology (Aronson, 2001; Osborne et al., 2009; LePendu et al., 2011; Peng et al., 2013). Although the mapping process is not a central topic of this project, the mapped disease annotation data is the key resource of the DisEnT project.

One of the earliest attempts to systematically annotate human genes with disease terms was published by Osborne et al. (2009). The authors of this study used the MetaMap Transfer (MMTx) tool (Aronson, 2001), a software package that analyses biomedical texts using NLP techniques and maps their contents onto terms from a pre-defined dictionary. The goal of this study was to use MMTx to ‘translate’ GeneRIF entries onto HDO terms. The ontology’s hierarchical structure was used to improve accuracy of the mapping (e.g. by avoiding mapping to both a class and a subclass of a disease). Establishing the mapping between a GeneRIF and an HDO term thus formed a link between the HDO term and the gene. An example of this process is shown in Figure 2.6.

This study demonstrated that mapping GeneRIF entries onto HDO terms is viable using NLP techniques. Although the proposed methodology was not as effective for other types of sources, such as the longer free-text entries in OMIM, this paper provided a foundation for later similar endeavours, including the Disease and Gene Annotations database (DGA).

DGA (Peng et al., 2013) is a largely automated database system that combines GeneRIF entries with molecular interaction networks, offering an ‘integrated systems approach’ to gene-disease associations. While not directly providing enrichment analysis functionality, DGA provides a good example of how combining multiple sources of information can lead to a resource offering much more contextual information. As will become clear later, the idea of combining multiple sources of information is also central to DisEnT.
One of the earliest studies that enabled enrichment analysis with the Human Disease Ontology was published by LePendu et al. (2011). This study largely built on previous work of Tirrell et al. (2010) and their methodology entitled RANSUM (Rich Annotation Summarizer). RANSUM described a workflow that enabled its users to perform enrichment analysis on any given ontology. Gene annotations with the ontology terms would either have to be provided by the user or could be derived automatically using a mapping tool; however, the accuracy of this process in this study is not clear.

Figure 2.6: Inference from a GeneRIF suggests that a gene TGF-beta1 is associated with the Malignant neoplasm of breast DO term. Figure by Osborne et al. (2009).

LePendu et al. developed RANSUM further, into a workflow leveraging the already existing database of Gene Ontology annotations. More specifically, the key data sources for the mapping were PubMed (http://www.ncbi.nlm.nih.gov/pubmed) identifiers – links to articles providing evidence for some of the GO annotations. Using the PubMed articles provided, a mapping tool similar to MMTx named Open Biomedical Annotator (Jonquet et al., 2009) was used to infer HDO terms from the articles, ultimately linking HDO terms to genes. These newly established annotation links allowed the authors to carry out enrichment analysis using the acquired data as the background set. The workflow describing the process from mapping to the enrichment analysis is shown in a diagram in Figure 2.7.

Figure 2.7: Workflow for mapping disease association data as developed by LePendu et al. (2011). The ‘NCBO Annotator service’ component in the figure refers to the Open Biomedical Annotator. Figure by LePendu et al. (2011).

2.4 Introducing DisEnT

DisEnT – the Disease Enrichment Tool – further develops the solutions described in the previous section.
The first objective of DisEnT was to improve coverage of gene-disease association data mapped onto the Human Disease Ontology. DisEnT’s data was collated from the three different sources described in Section 2.2.2: OMIM, GeneRIF and Ensembl Variation. Data from these sources was retrieved and mapped onto HDO by He and Simpson (2014) using methodologies similar to those described by Osborne et al. and LePendu et al. These workflows utilised both of the aforementioned mapping tools: MMTx and the Open Biomedical Annotator (OBA). Thanks to the wide range of sources and mapping tools, we believe that the DisEnT database provides unprecedented coverage of gene-disease annotations.

The main aim of this project is to enable statistically robust gene set enrichment analysis through an accessible web application. The studies described in Section 2.3 all propose useful methodologies for collecting the data, but apart from the Disease and Gene Annotations database (DGA), none of the studies offers a reliable resource for using this data for enrichment analysis. (While DGA does provide an interface to retrieve gene-disease associations, it does not offer any statistical analysis tools.) In order to achieve this new objective, we have developed a web-based application enabling gene set disease enrichment analysis using the DisEnT dataset.

In addition to enabling enrichment analysis, the data in DisEnT can be used for other purposes including network analysis and visualisation as introduced in Section 2.3. For example, one could examine a network of genes shared between diseases or diseases shared between genes. Although this functionality was not implemented as part of this project, we demonstrate feasibility of this use case with DisEnT’s data in Section 4.1.

2.4.1 Design Goals

The DisEnT application was designed to be intuitive and reliable, so that it can be used by a wide range of users in the scientific community.
DisEnT communicates with its users by standard web protocols and it is supported by a mature statistical software infrastructure. The technical details of DisEnT are described in Chapter 3, while the following paragraphs introduce the reader to our overall design goals.

To ensure that it can be used effectively, the DisEnT system was developed whilst concentrating on four main criteria: correctness, usability, scalability and sustainability.

Correctness is one of the most important aspects of any scientific software (Smith, 2006; Kelly and Sanders, 2008). Just like any other scientific tool, scientific software needs to be inspected and validated to ensure its output is correct (Wilson et al., 2014). This notion of quality assurance is particularly critical for software that is being actively developed, as it often changes its behaviour. DisEnT was built using a ‘test-before-code’ methodology, resulting in an extensive coverage of software tests (see Section 4.4.5). In addition to the automated tests, DisEnT was evaluated manually in a realistic use case scenario (see Section 4).

The usability criterion requires that researchers be able to use DisEnT effectively and easily. To achieve this objective, DisEnT was developed using standard Web technologies and libraries, ensuring compatibility with as many internet browsing platforms as possible. Usability was also the main goal for designing the user interface. As Section 3.2.1 describes, DisEnT’s front-end was designed to be as simple and intuitive as possible. Rather than a myriad of options, users are presented with simple input fields and usable defaults. Advanced users can, however, tweak the application’s behaviour by changing the defaults or by using DisEnT’s programmatic interface. DisEnT can parse three types of widely accepted gene identifiers and recognise genes of 23 species.
It aims not to impose unnecessary constraints on the user input, allowing the user to concentrate on the analysis task rather than on converting or formatting their input. The application aims to infer as much as possible about the input without requesting additional information from the user, as long as the inference can be done unambiguously.

The system is required to be scalable to handle concurrent workload. DisEnT offers an Application Programming Interface (API), opening it for programmatic access, which often puts a heavy load on similar computing systems. For this reason, DisEnT implements a job queue where each of the requests is stored and handled. The job queue solution prevents DisEnT from overloading its hosting system while remaining responsive to all of its users.

Finally, DisEnT was designed to be sustainable. It is an application developed following good practices and recommendations for scientific software development as outlined by Wilson et al. (2014). While the DisEnT application currently does not provide additional features such as network visualisation, the test coverage and code base were implemented with robustness in mind, making DisEnT suitable for future extension and improvement.

Chapter 3
Developing DisEnT

This chapter describes the DisEnT application from both the system design point of view and the user experience perspective. First, DisEnT’s high-level design decisions are explained and justified. Then, the system is described in more detail by following a typical use-case example.

3.1 System Overview

DisEnT consists of four main components: web application, database, computation module and a job queueing system. Figure 3.1 shows a high-level overview of the system architecture. The core component of DisEnT is the web application. It communicates with the user, accepts their input and formulates output.
This module is the only component visible to the user and was therefore designed with a strong focus on user experience. The web application component also implements most of the system’s domain logic but it delegates tasks such as statistical computation or job queuing to the more specialised components.

The other three components provide the infrastructure necessary for DisEnT’s effective operation. The job queue component controls scheduling of computationally intensive tasks in the background, the database module is responsible for persisting user data and results, while the computational module carries out most of DisEnT’s calculations.

This modular approach allows DisEnT to be more scalable and robust. For example, if the computational module was required to be changed or moved to a more powerful host (e.g. in order to have more computational resources), this change would only require a slight configuration change in the web application module. Another advantage this set-up provides is that thanks to this clear division of responsibilities, each one of the modules can remain concentrated on a specialised set of tasks, which results in a ‘slimmer’ and ultimately more maintainable code base.

Figure 3.1: A high-level diagram of DisEnT’s architecture showing DisEnT’s main components. The dotted line represents the ‘access perimeter’ of the system. Users of DisEnT directly interact only with the Web Application component. (Diagram components: Web Access; Web Application, Ruby on Rails; Job Queue, Resque; Database, MySQL; Computation, R.)

The following sections describe each of these components from a technical perspective, offering some insight into why these particular technologies and platforms were chosen to develop DisEnT. (Note: versions of the software packages used in DisEnT are listed in Table A.1 of Appendix A.)

3.1.1 Web Application

The web application component forms most of DisEnT’s code base.
It was developed in the Ruby programming language (Matsumoto and Ishituka, 2002), using a development framework called Ruby on Rails [1]. The Ruby on Rails (‘Rails’ for short) framework is a popular set of libraries designed for rapid development of web-based applications.

[1] http://rubyonrails.org/

To aid the rapid development, Rails adopts the principle of convention over configuration (Miller, 2009). This means that Rails follows several conventions that are assumed throughout the application and developers do not have to configure them explicitly. For example, the framework expects certain files to be stored in specific directories or enforces standard naming conventions. This philosophy may appear restricting at first, but its purpose is to reduce the amount of code needed to write a functioning application. Writing less code leads to better readability without having to write extensive documentation explaining the code.

Rails also follows a popular design pattern known as model-view-controller, or MVC (Krasner and Pope, 1988). MVC helps to maintain a clean code base by dividing the application’s components into three separate groups:

• Models are responsible for maintaining most of the domain logic. They are responsible for most of the ‘heavy-lifting’ and are usually the most complex.

• Views are designed purely for the purpose of presentation and they do not perform any computations containing domain logic. Views are often implemented as templates with placeholders for dynamic values.

• Controllers serve as an interface between user input, the application and user output. They are typically responsible for passing messages between models and views. Similarly to views, controllers should generally contain as little domain logic as possible.

Figure 3.2 shows how the MVC pattern is implemented in Ruby on Rails. Typically, a request from a user (e.g.
a form submission) is routed to an appropriate controller that triggers an action involving a model. The model carries out the necessary operation (e.g. database update) and returns results to the controller. Once the controller obtains the results, these are passed to a view component which renders a web page populated with the requested data.

Figure 3.2: Schematic drawing of request handling by a system using the MVC pattern. Figure inspired by Ruby et al. (2013). (Diagram components: Request, Controller, View, Model, Database; arrows numbered 1 to 4.)

Other advantages of using Rails include a preinstalled development server for testing changes and a basic unit testing suite already included in the framework. Thanks to these features, Ruby on Rails provides a stable platform for fast development of reliable web applications.

A Rails application deployed into a ‘production’ environment is typically hosted on the popular Apache [2] server platform using a specialised module called Phusion Passenger [3]. Phusion Passenger is an Apache server module specialised in hosting web applications written in interpreted languages such as Ruby. While the Rails code is written in Ruby, Phusion Passenger is written in a compiled language (C++), allowing it to execute faster and make better use of the server host’s memory.

[2] http://httpd.apache.org/

Another useful Rails feature for production use is the asset pipeline. The asset pipeline is a Rails framework used to serve the web application’s assets (e.g. images, script files) more efficiently. In a usual scenario, when a browser accesses a website it needs to retrieve each one of these files separately, resulting in several separate calls to the server. Rails addresses this problem by pre-packaging and compressing most of its assets so that they are delivered in a single file and a single call. This feature is particularly useful for clients with unreliable Internet connections.
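The MVC request flow described in this section can be illustrated with a minimal sketch in plain Ruby. It does not use Rails and is not taken from DisEnT's code base: the class names (QueryModel, QueryView, QueriesController) and the in-memory hash standing in for the database are assumptions made purely for illustration.

```ruby
# Minimal MVC sketch in plain Ruby (no Rails). All names are hypothetical.

# Model: holds the domain logic and talks to the "database"
# (here just an in-memory hash).
class QueryModel
  def initialize
    @store = {}
    @next_id = 1
  end

  def create(genes)
    record = { id: @next_id, genes: genes, status: 'pending' }
    @store[@next_id] = record
    @next_id += 1
    record
  end
end

# View: a template with placeholders for dynamic values; no domain logic.
class QueryView
  def render(record)
    "Query no. #{record[:id]} (#{record[:status]}): #{record[:genes].join(', ')}"
  end
end

# Controller: passes messages between the model and the view.
class QueriesController
  def initialize(model, view)
    @model = model
    @view = view
  end

  def create(params)
    record = @model.create(params.fetch(:genes))
    @view.render(record)
  end
end

controller = QueriesController.new(QueryModel.new, QueryView.new)
page = controller.create({ genes: %w[TP53 Trp53] })
puts page
```

In a real Rails application the framework supplies the routing, the persistence layer and the templating, so the controller would typically be even thinner than in this sketch.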
[3] https://www.phusionpassenger.com/

3.1.1.1 Alternative Solutions

While Ruby on Rails is a popular and well-established framework for creating web applications, other similar solutions were also investigated during DisEnT’s design stage. One of the alternatives considered was RStudio’s Shiny platform (RStudio, 2014). Shiny enables users of the computational language R (Ihaka and Gentleman, 1996) to easily create web applications using native R code. This solution makes it very easy for Shiny applications to have direct access to R’s powerful computational libraries and the platform comes prepackaged with powerful visualisation and data interaction tools.

Although Shiny is a great tool for presenting data, it is predominantly a computational solution, not a platform for building fully-fledged web applications. For example, the web site and its computational counterpart both have to be hosted on the same physical host. In Shiny’s basic version (the Open Source Edition), only one process is allowed to run on the host, which hinders scalability of this set-up. While the Professional Edition of Shiny allows users to run multiple computational processes, these still have to be bound to a single physical host. Other limitations of Shiny include a lack of support for building interfaces for programmatic access and close coupling with Shiny’s own web hosting technology – Shiny applications cannot be hosted on other hosting platforms such as Apache.

In its current set-up, DisEnT has a connection to R (described in Section 3.1.3) that is customizable and not limited by the number of processes or by a particular hosting technology. Although the downside of this approach is that Shiny’s data interaction features have to be implemented using other technologies, the current architecture arguably provides a more robust and scalable foundation for DisEnT’s future adjustments.
3.1.2 Database

DisEnT’s data is stored in a standard relational database. A combination of two database platforms is used: SQLite [4] database for development purposes and MySQL [5] in the production environment. SQLite stores its data in specially-formatted files and is a suitable choice for development because it does not require a running process in order to provide access to the data. On the other hand, MySQL provides more features and better scalability but comes at a price of having to configure and run a server process before its databases can be used.

[4] http://www.sqlite.org/
[5] http://www.mysql.com/

As shown in Figure 3.1, both the Rails web application and the computation component communicate with the database over their own separate channels. This approach is useful in the event of the database becoming a bottleneck in the system. In such a scenario, the MySQL database can be easily mirrored into multiple synchronised instances which can be queried instead of the main instance (Schwartz et al., 2012).

While there are other database systems available that could fulfill the role of persisting data, DisEnT uses MySQL and SQLite due to their suitable feature sets, widespread use and good community support. Both these technologies are also popular in the field of bioinformatics; the MySQL database in particular had already been set up as part of pre-existing infrastructure for DisEnT.

3.1.3 Computation

The computation module comprises an instance of the R interpreter capable of accepting commands over the standard TCP/IP (Transmission Control Protocol/Internet Protocol) channel. The R programming language is a well-established platform for statistical computation in the field of bioinformatics and computational biology. Utilisation of R allows DisEnT to make use of some of the best peer-reviewed computation packages for enrichment analysis. Communication over the TCP/IP channel is not available in R by default.
To establish this link between the computation module and the web application module, DisEnT makes use of an R package called RServe (Hornik and Leisch, 2003), which opens up a TCP/IP interface (socket) bound to an R session accepting commands over this protocol.

Although the computation of disease enrichment analysis could be performed on any platform capable of computing statistics and connecting to a database, R has been the standard tool of choice for this purpose. There is a wealth of resources available for bioinformatics computation, such as Bioconductor (Gentleman et al., 2004), which provides a central repository for many packages used in DisEnT. Choice of R thus enables DisEnT to make use of a mature statistical computation infrastructure that would otherwise have to be re-implemented using other tools.

3.1.4 Job Queue

DisEnT also implements a job queue component that enables the system to remain responsive under heavy load, by executing long-running and computationally expensive tasks asynchronously (i.e. in the background). This process is managed by a Ruby library called Resque [6] using a high-performance database system called Redis [7].

The system is implemented in the following way: The job queue stores all pending jobs and their parameters in the Redis database. Resque runs as a number of processes known as workers. Resque workers periodically query (poll) the Redis database, looking for jobs that need to be executed. Once they find a ‘pending’ job, they load its instructions and execute it, effectively removing it from the queue.

Resque is a feature-rich solution for managing asynchronous tasks, but it comes at a price of a dependency on a constantly-running Redis process. More lightweight alternatives for managing job queues for Rails applications include Rails runner and the Delayed Job [8] (DJ) package. Rails runner is a Ruby script included in the default Rails set-up.
The script is capable of executing custom Rails code from the command-line interface, making this tool predominantly useful for running singular tasks such as database updates. While this tool can be used as a simple platform for asynchronous job management, this use case is generally not recommended. The Rails runner loads all of the Rails components into memory every time it starts, which is a wasteful approach, especially if multiple short tasks need to be executed at the same time.

Delayed Job offers similar functionality to the Rails runner, but it also implements a job queue system. The job queue implemented by Delayed Job is stored in the same database that Rails uses to store its application’s data. This has two consequences: first of all, there are no external dependencies for this system as Delayed Job makes use of the already-existing infrastructure. However, the disadvantage of this approach is that when the number of jobs is higher, Delayed Job can potentially block access to the database for other parts of the application. Moreover, like Rails runner, Delayed Job also loads all of the Rails code into memory upon execution, which can lead to unnecessary delays and exhaustion of the host system’s resources (Fernandez et al., 2011).

[6] https://github.com/resque/resque
[7] http://redis.io/
[8] https://github.com/collectiveidea/delayed_job

Figure 3.3: Overview of a typical DisEnT workflow as explained in Section 3.2. The dashed lines delimit part of the process that is executed in the background. (Diagram steps: User submits data; Valid? (Yes/No); Gene Identification; Mapping to Homologs; Enrichment Analysis; Data is presented. The background part is labelled Asynchronous Execution.)

Resque, on the other hand, provides a solution that does not depend on the Rails database infrastructure but still allows execution of code in the Rails environment. While Resque workers also have to load the Rails environment into the memory, this
operation is only done once and the memory use is limited to the number of running workers, resulting in faster and more effective execution.

3.2 Implementation

In order to describe DisEnT’s implementation, this section follows a typical scenario of a DisEnT use case. This scenario is first presented in full and then step-by-step in more detail, presenting the capabilities and limitations of DisEnT.

1. A user enters their data (e.g. gene lists) into a simple web form and submits it for processing.

2. The web application validates the input data. If the data is valid, a new Resque job is created and added to the job queue. If the data is deemed not valid, the user is presented with a meaningful error explaining which parts of their query need to be adjusted.

3. Once the job has been identified by a Resque worker, genes from both of the gene lists are first identified, and any non-human genes are ‘converted’ to human genes in a process called homolog mapping.

4. After the gene identification and mapping, the gene data is passed to the computation module which will carry out the enrichment analysis.

5. The results are presented to the user.

Figure 3.3 outlines these steps in a diagram.

3.2.1 User Input

Following the usability criterion, DisEnT’s user interface was designed to be simple and user-friendly. Users can enter their data into DisEnT using a simple web form that was designed to be minimalistic and usable with its pre-set defaults. A preview of this form is shown in Figure 3.4.

Figure 3.4: The DisEnT web interface showing the input form for a new query.

The form contains the following fields:

• The Input list field expects a list of ‘interesting genes’ that will form the central target of the analysis (see Section 2.1.1 for further explanation). This list can be submitted either manually via a free-form textbox or as a file.

• In terms of data entry, the Background list field is almost identical to the Input list field.
In addition to entering the list manually or via a file upload, this field allows users to select a pre-compiled background list for their analysis.

• The Species field allows the use of gene symbols in gene lists by explicitly stating the species of the entered genes. For lists containing global gene identifiers, this field does not need to be specified. Section 3.2.3 explains the reason behind this field in more detail.

• Finally, the Sources field allows users to choose which sources of gene-disease annotations will be used in their query (Section 2.2.2 compares these sources). By default, all of the sources are used, but users may choose different combinations.

3.2.1.1 User-friendly Input Handling

The form offers a set of useful pre-set values, allowing the user to perform a query by simply providing a list of input genes and clicking the Submit button. By default, DisEnT will use all annotated human genes stored in its database tables as the background list – an equivalent of using all genes in the human genome. Any genes identified as non-human will also be automatically mapped to their human homologs (Section 3.2.4 describes this process).

The application was designed to minimize restrictions on the user input. DisEnT can parse input and background gene lists separated by commas and/or by any type of whitespace (including tabs and line breaks, and any combination thereof). Gene names that consist of multiple words are accepted if they are enclosed in double or single quotes. This enables the user to paste the list contents from manuscripts, spreadsheet software, comma-separated-value (CSV) files and other common sources. As well as manual entry, users can opt to upload the gene lists in text-based or CSV files.

As Section 3.2.3 describes, DisEnT can recognize three major formats of gene identifiers.
The identifiers can be mixed within any of the gene lists, allowing the user to concentrate on the analysis without having to convert between the formats.

To further improve the user experience, DisEnT provides explanation of some of the form fields in the form of ‘popovers’ – tooltips that can be triggered by the user to reveal their content. An example of a popover is shown in Figure 3.5.

Figure 3.5: A popover assisting the user in submitting their gene list.

3.2.2 Validation

Before the user data is submitted for processing, DisEnT checks its validity. The validation step is in place to improve the quality of the results as well as the user experience. If the data is invalid, DisEnT will report this fact to the user immediately, rather than attempt to use it and potentially produce low-quality results. In its current version, DisEnT performs 12 validation checks, including the following:

• Have both the input list and the background list been provided or selected?

• Can the gene lists be parsed, i.e. can the provided lists be broken down into gene names? This includes checks for unmatched quotes in the gene names.

• Can each one of the submitted genes be unambiguously identified? This check ensures that if any of the provided gene identifiers are ambiguous (e.g. many organisms have a gene named TP53), then there is additional information available to identify them unambiguously.

Other validation checks ensure integrity of the provided data, e.g. whether at least one annotation source has been selected for the query or whether the gene species provided by the user is supported in DisEnT.

If any of the validation checks fail, the user is presented with a meaningful error message explaining why the data has been deemed invalid and what can be done to rectify the failure. Figure 3.6 shows an example of such an error message.

Figure 3.6: An error message informing the user about an input validation failure.
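The parsing rules of Section 3.2.1.1 and the unmatched-quote check of Section 3.2.2 can be sketched in a few lines of plain Ruby. This is an illustrative sketch only, not DisEnT's actual parser: the method name parse_gene_list and the error class GeneListParseError are hypothetical, and the real validation step performs further checks that are not shown here.

```ruby
# Hypothetical parser sketch; names and behaviour are assumptions.
class GeneListParseError < StandardError; end

# Splits a raw gene list on commas and/or any whitespace, keeping quoted
# multi-word gene names together. Raises on unmatched quotes.
def parse_gene_list(raw)
  tokens = []
  rest = raw.sub(/\A[\s,]+/, '') # drop leading separators
  until rest.empty?
    if (m = rest.match(/\A"([^"]*)"/) || rest.match(/\A'([^']*)'/))
      tokens << m[1].strip       # quoted name, quotes removed
    elsif rest.start_with?('"', "'")
      raise GeneListParseError, "unmatched quote near: #{rest[0, 20]}"
    else
      m = rest.match(/\A[^\s,"']+/)
      tokens << m[0]             # bare identifier
    end
    rest = m.post_match.sub(/\A[\s,]+/, '') # skip separators after the token
  end
  tokens.reject(&:empty?)
end

genes = parse_gene_list("TP53, 7157\tENSG00000141510\n'tumor protein p53'")
puts genes.inspect
```

A list with an opening quote that is never closed, such as `TP53, "tumor protein`, raises GeneListParseError in this sketch, which a caller could turn into the kind of error message shown in Figure 3.6.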
If all of the validation checks pass, a new search query is recorded in the system and a new job is added to the Resque job queue. After that, the user is redirected to a web page showing the current status of their query. This page periodically polls the database to refresh its information about the state of the query and to display the results as soon as they become available. Figure 3.7 shows this screen and its message.

Figure 3.7: An intermediate message informing the user of the current state of their query.

3.2.3 Gene Identification

After passing the validation stage, each of the submitted genes goes through an identification process where it is looked up against various database tables in order to unambiguously find its global identifier and its species. This step enables DisEnT to compare and use the provided genes in the context of other databases.

More specifically, DisEnT will attempt to ‘translate’ the provided gene identifiers into Entrez Gene (Maglott et al., 2005) identifiers (Entrez IDs). Entrez Gene is one of the most comprehensive gene databases and it is operated by the National Center for Biotechnology Information (NCBI). Entrez IDs have the form of an integer number that is unique across all the genes and species. If DisEnT detects this format, the identifier is looked up to confirm the existence of such a gene and to identify its species. If, however, the input format is not an Entrez ID, the identification is more complicated.

3.2.3.1 Identifying Gene Symbols

The simple numerical format makes Entrez IDs a popular choice for gene identification in many gene-specific databases outside the Entrez Gene database. However, most scientific literature refers to genes by gene symbols, making it necessary for DisEnT to be able to recognize them. The Entrez Gene database does provide a database for
mapping between their Entrez IDs and gene symbols, but this process comes with a number of challenges.

Firstly, while Entrez IDs are globally unique, gene symbols are often repeated across organisms. For example, the human gene tumor protein p53 [9] is known under the gene symbol TP53 and Entrez ID 7157, but pig and cattle also carry genes under the same symbol but a different Entrez ID (397276 and 281542 respectively). Because these naming clashes are relatively common, DisEnT requires the species of the gene symbols to be provided with the input in order to be able to identify them unambiguously.

Another challenge associated with the use of gene symbols is that they can change. As new genes are discovered and old ones are re-defined, the gene symbols change to reflect this new knowledge (Povey et al., 2001). For DisEnT’s identification process, this means that the system has to be able to maintain an up-to-date list of current gene symbols as well as a list of their synonyms. Data mapping current gene symbols to their synonyms is also provided by the Entrez Gene database, but maintaining both these lists exacerbates the problem of re-used gene symbols.

Finally, a technical consideration for translating gene symbols is the size of the dataset. As of 8th August 2014, the NCBI Entrez Gene database contained 16,013,303 gene entries, making textual lookup of gene symbols and their synonyms a computationally expensive task. For this reason, DisEnT limits its support to 23 organisms, reducing this number to 727,770 entries. The 23 species were chosen based on species supported by the widely-used Ensembl (Flicek et al., 2014) database. This choice allowed us to reduce computational cost while ensuring that DisEnT has support for some of the most popular model organisms. The list of supported species can be found in Table A.2 of Appendix A.
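To make the role of the species information concrete, the following plain Ruby sketch resolves an identifier to an Entrez ID. It is a simplified illustration under stated assumptions, not DisEnT's implementation: the tables SYMBOL_TABLE and SYNONYMS and the method identify are hypothetical, the tables hold only the TP53 example (the cattle ID is the one listed in the HomoloGene excerpt of Figure 3.8), and the synonym entry is included for illustration only. Unlike DisEnT, the sketch does not confirm that a numeric Entrez ID actually exists.

```ruby
# Toy lookup table: [gene symbol, species] => Entrez ID.
# Hypothetical structure; a real table would be a database query.
SYMBOL_TABLE = {
  %w[TP53 human]  => 7157,
  %w[TP53 cattle] => 281542
}.freeze

# Illustrative synonym map: alternative symbol => current symbol.
SYNONYMS = { 'P53' => 'TP53' }.freeze

# Returns an Entrez ID, or nil if the symbol is unknown for that species.
def identify(identifier, species = nil)
  # Entrez IDs are plain integers and globally unique: no species needed.
  return Integer(identifier, 10) if identifier.match?(/\A\d+\z/)

  # Gene symbols clash across organisms, so the species is mandatory.
  raise ArgumentError, "species required for gene symbol #{identifier}" if species.nil?

  symbol = SYNONYMS.fetch(identifier, identifier)
  SYMBOL_TABLE[[symbol, species]]
end

puts identify('TP53', 'human')
puts identify('TP53', 'cattle')
puts identify('7157')
```

The same symbol resolves to different Entrez IDs depending on the species argument, which is why a symbol submitted without a species is rejected rather than guessed.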
[9] http://www.ncbi.nlm.nih.gov/gene/7157

An alternative approach to automatic identification of gene symbols would be to consider the rest of the list and ‘assume’ the species of the provided gene symbols based on genes that can be identified without ambiguity. For example, if 80 percent of all submitted genes were safely identified as human, DisEnT could narrow down its search to human gene symbols only.

While this would be an interesting avenue to explore from the user experience perspective, it could introduce a certain level of unpredictability to DisEnT’s behaviour. Specifically, DisEnT cannot offer any a priori assurance to the user that any proportion of their genes will be uniquely identifiable (without additional species information) before going through the computationally expensive identification stage. Thus, in order to improve the likelihood of returning meaningful results after a successful query submission, DisEnT will ask for species specification if it detects any gene symbols in the input.

3.2.3.2 Identifying Ensembl IDs

Another popular type of identifier recognized by DisEnT is the Ensembl (Flicek et al., 2014) gene identifier. Ensembl identifiers follow a naming convention such as ENSG followed by an 11-digit number (e.g. ENSG00000141510 for the aforementioned TP53). While the naming convention can differ between species, these identifiers are stable and globally unique. Similar to the gene symbols, conversion between Ensembl and Entrez identifiers can be done using data from the Entrez Gene database.

To summarize, DisEnT can identify Entrez and Ensembl identifiers as well as gene symbols and their synonyms. Genes that have been successfully identified will be used in the next stage of the process – mapping to their human homologs.

3.2.4 Mapping to Homologs

Because DisEnT uses the Human Disease Ontology to perform the enrichment analysis, it requires human genes for its computations.
If the user enters any non-human genes in any of their lists, they have to be ‘translated’ into their human ‘equivalents’ – i.e. mapped to their human homologs. A homolog is a gene that shares some ancestral DNA with another gene. This often implies that they are functionally similar, though the similarity levels may vary (Storm and Sonnhammer, 2002). Gene homologs can be identified using the NCBI HomoloGene [10] database table.

[10] http://www.ncbi.nlm.nih.gov/homologene

The homolog mapping process, depicted in Figure 3.8, consists of the following steps. First, a given non-human gene of interest is identified in the HomoloGene table. Each of the entries in the HomoloGene table belongs to a homology group – a grouping containing homologs across many species. Once the homology group is identified, all that remains is to find a gene within the homology group that is a human gene – the human homolog.

Figure 3.8: The homolog mapping process. The (non-human) gene of interest is looked up and its homology group identified. The human homolog is a human gene in the same homology group. If there is more than 1 human homolog present in the homology group, no homologs are returned. Note that the HomoloGene table in the diagram has been simplified for the sake of clarity, but the mechanism of the process remains the same.

    Entrez ID   Gene Symbol   Homology Group   Species
    22059       Trp53         460              Mouse
    281542      TP53          460              Cattle
    7157        TP53          460              Human
    24842       Tp53          460              Rat

    Steps shown in the figure: 1. Find the non-human gene entry (gene of interest, Entrez ID 22059). 2. Identify homology group. 3. Identify the human entry (matching homolog, Entrez ID 7157).

Because a given gene can have more than one human homolog (see Koonin (2005) for an explanation why), the process described above may return more than one result.
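The three steps of Figure 3.8, together with the rule stated in its caption that no homolog is returned when the group contains more than one human gene, can be sketched in plain Ruby as follows. The rows are the simplified HomoloGene entries from the figure; the constant HOMOLOGENE, the method human_homolog and the array-of-hashes layout are assumptions made for illustration and do not reflect DisEnT's actual database schema.

```ruby
# Simplified HomoloGene rows copied from Figure 3.8 (hypothetical layout).
HOMOLOGENE = [
  { entrez_id: 22059,  symbol: 'Trp53', group: 460, species: 'mouse'  },
  { entrez_id: 281542, symbol: 'TP53',  group: 460, species: 'cattle' },
  { entrez_id: 7157,   symbol: 'TP53',  group: 460, species: 'human'  },
  { entrez_id: 24842,  symbol: 'Tp53',  group: 460, species: 'rat'    }
].freeze

# Returns the Entrez ID of the unique human homolog, or nil if the gene is
# unknown or the homology group holds zero or several human genes.
def human_homolog(entrez_id, table = HOMOLOGENE)
  # Step 1: find the entry for the gene of interest.
  entry = table.find { |row| row[:entrez_id] == entrez_id }
  return nil if entry.nil?

  # Steps 2 and 3: collect the human entries of the same homology group.
  humans = table.select do |row|
    row[:group] == entry[:group] && row[:species] == 'human'
  end

  # Multi-hit results are discarded rather than guessed.
  humans.size == 1 ? humans.first[:entrez_id] : nil
end

puts human_homolog(22059)
```

With the rows from the figure, the mouse gene 22059 maps to the human gene 7157; a group containing two human entries would yield nil in this sketch.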
The process of choosing which homolog to return is not trivial and therefore DisEnT discards these multi-hit results and does not produce a mapping in such a case. This is a trade-off DisEnT makes, sacrificing user experience in order to preserve validity of its results.

Having all of the genes translated into human homologs is the last step of data preparation. This data can be finally submitted to the computation module, which will perform the disease enrichment analysis.

3.2.5 Enrichment Analysis

As previously stated, the disease enrichment analysis step is carried out entirely in the R language environment, with the computation module communicating with the rest of DisEnT via a TCP/IP interface. While implementation of the enrichment analysis procedure was not in the scope of this project (it was instead implemented by Simpson (2014)), it may be instructive to explain some of the design choices behind this component.

As mentioned in Section 2.1.1.2, DisEnT makes use of the elim algorithm developed by Alexa et al. (2006). This algorithm is implemented and distributed as an R package called topGO by the authors of the algorithm (Alexa and Rahnenfuhrer, 2010) using the previously mentioned Bioconductor repository. As its name suggests, topGO was created to enable enrichment analysis with Gene Ontology terms. This package was modified by He and Simpson (2014) to enable its use for disease enrichment analysis. As of today, this modified version of the topGO package has not been published, but its working title is topOnto. The topOnto package is being developed as an ontology-neutral statistical package, enabling users to leverage the functionality of the topGO package using any ontology.

The interface between Rails and the R component is enabled by the RServe package described in Section 3.1.3. In order to perform the analysis, Rails sends the gene lists and the choice of annotation sources (explained in Section 3.2.1) to R over the RServe channel.
R loads all dependencies needed for the computation (e.g. topOnto) and calculates the enrichment values. Once the results have been processed and the elim algorithm applied, they are passed to Rails. The Rails component then saves the results to the database and progresses into the final stage of the process – presentation of the data.

3.2.6 Presenting Results

The design goals for the data presentation interface were identical to those of the input form. The data presentation page was designed to be simple and minimalistic by default, while allowing users to explore the results in more detail if needed. Figure 3.9 shows an example of the results page.

Figure 3.9: The results page.

The central component of this page is the table of results. This table lists the enriched disease terms including their HDO identifiers and a number of statistical values. The Background Count and List Count column values represent the number of genes in the background list and the input list that match the disease term. The Expected Count, Enrichment and the Fisher ‘elim’ p-value are presented here as calculated by the process described in the previous section. The results table allows users to order the results by any of the columns without refreshing the page. All the presented data is also available for download in the CSV format.

Figure 3.9 also shows a number of other features available in this page. For example, a warning message is shown to the user pointing out the fact that there were some genes submitted in the query that did not have (unambiguous) matches in DisEnT’s gene database. Note that in order to keep the interface uncluttered, the message does not report any more details about the unidentified genes, making it the user’s choice to investigate or ignore the warning.

Figure 3.10: An expanded Query details section of the results page.
If the user chooses to investigate which genes were not mapped to gene identifiers, they can do so by opening the Query details part of the page. This page section offers a detailed description of the outcomes of the identification and mapping of the entered genes (processes described in Sections 3.2.3 and 3.2.4). As shown in Figure 3.10, the Query details section reports numbers of identified and unidentified genes as well as their species. The different colouring of the reported counts marks active hyperlinks. For example, if the user wishes to find out which of their genes were not found in the DisEnT database, they can simply click on the count (in this case 6) and they will be presented with the list of genes in question. This interface is shown in Figure 3.11.

[Figure 3.10: An expanded Query details section of the results page.]

[Figure 3.11: A list of genes reported after clicking on a gene count value.]

This section described a typical DisEnT use case scenario when interacting with the system using a web browser. Though we expect that this will be the usual channel for using DisEnT, we also offer application programming interface (API) access for more automated interaction with the system.

3.3 Programmatic Interface

The API access was implemented to enable DisEnT’s users to interact with the system automatically using custom scripts and programmes. This interface can help share DisEnT’s data and open its capabilities to novel use cases. The initial API implementation uses the JavaScript Object Notation (JSON) format to share data with its clients. An example of a JSON-formatted object (more specifically, a search result) is shown below:

    {
      "disease_id": "DOID:9974",
      "fisher_elim_p_value": "0.123769808173478",
      "genes_input_names": ["ENSG00000004487", "ENSG00000002745"]
    }

In order to conform to commonly accepted Internet standards, the DisEnT API also makes use of the Hypertext Transfer Protocol (HTTP) status codes.
HTTP status codes are a standard way of sending a ‘connotation’ along with the data (i.e. payload) of an HTTP response. Each HTTP code is represented by a number (e.g. 404) and its meaning (e.g. Not Found). Conforming to these standards enables software clients such as web browsers to communicate with remote services more effectively (Fielding et al., 1999).

In its current version, the DisEnT API supports two API actions: submission of a new query and retrieval of its results. Both of these actions will be briefly described in this section and their detailed documentation is available via the Apiary service at http://docs.disent.apiary.io/. The Apiary service was chosen to host the DisEnT API documentation because it can automatically generate code snippets in several programming languages that can be immediately used to query DisEnT. Apiary also offers a so-called mock interface that does not make any actual calculations but exhibits the same behaviour as the DisEnT API. The mock interface is used to describe the API in the examples below.

3.3.1 Submitting a Query

A query can be submitted in an HTTP POST request (a request type commonly used to submit form data) sent to a designated DisEnT URL. The POST request is accompanied by a JSON representation of the query data.

    POST http://disent.apiary-mock.com/search_queries.json

    {
      "search_query": {
        "input_list": {
          "genes_string": "ENSG00000001617, ENSG00000002745"
        },
        "background_list": {
          "predefined_name": "Human genome"
        },
        "mapping_sources": {
          "omim": true,
          "ensembl": true,
          "generif": true
        }
      }
    }

The request above describes a query that contains two genes in its input list and selects a predefined background list. Because this is a valid request, the API will respond with the HTTP code 201, known as Created, and a JSON object stating the ID of the newly-created query:

    {
      "notice": "The search query has been submitted",
      "id": 14
    }

In case the request is not valid, the DisEnT API will return the encountered validation errors (identical to those discussed in Section 3.2.2) along with the HTTP code 400 (Bad Request):

    {
      "errors": [
        "Input list contains gene symbols. Please specify their species",
        "Background list can’t be blank"
      ]
    }

3.3.2 Retrieving Results

A query submitted via the API is assigned to the job queue, just like a query submitted via the web-page interface. To retrieve the results of a query, the API client can simply issue an HTTP GET call with the query’s ID included in the URL (Uniform Resource Locator):

    GET http://disent.apiary-mock.com/search_queries/14/results.json

which requests results of the search query under ID 14. If the search query is still being processed and its results are not available, DisEnT will respond with a 202 (Accepted) status, along with a short message:

    {
      "notice": "The query results are not yet available. Please try again later."
    }

Once the results are computed, the same call will return a 200 (OK) status, listing the results:

    [
      {
        "disease_id": "DOID:9974",
        "fisher_elim_p_value": "0.123769808173478",
        "genes_input_names": ["ENSG00000004487", "ENSG00000002745"]
      },
      {
        "disease_id": "DOID:0060041",
        "fisher_elim_p_value": "0.358765288293535E-1",
        "genes_input_names": ["ENSG00000004487", "ENSG00000002745"]
      }
    ]

Although the current implementation of the API only supports a subset of DisEnT’s web-page features, thanks to Rails’ Model-View-Controller architecture (explained in Section 3.1.1), the features can be easily extended. Similarly, DisEnT can be easily extended to support other data representation formats, such as XML (Extensible Markup Language). It may also be worth noting that a combination of both approaches can be used, e.g. the API for submitting queries along with web browsing for investigating their results.
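The submit-then-poll exchange described in Sections 3.3.1 and 3.3.2 can be sketched as a small client. This is a minimal illustration rather than an official DisEnT client: it is written in Python using only the standard library, it targets the Apiary mock URL quoted above, and the function names (build_query, interpret, http_json, run_query), the state labels and the retry settings are all hypothetical choices made for this example.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://disent.apiary-mock.com"  # mock endpoint quoted in the text


def build_query(genes, background="Human genome",
                sources=("omim", "ensembl", "generif")):
    """Assemble the JSON body for POST /search_queries.json."""
    return {
        "search_query": {
            "input_list": {"genes_string": ", ".join(genes)},
            "background_list": {"predefined_name": background},
            "mapping_sources": {name: True for name in sources},
        }
    }


def interpret(status, body):
    """Translate an HTTP status code and decoded JSON body into (state, payload).

    The state labels are hypothetical names used only in this sketch.
    """
    if status == 201:   # Created: the body carries the new query ID
        return "submitted", body["id"]
    if status == 202:   # Accepted: results are not ready yet
        return "pending", body.get("notice")
    if status == 200:   # OK: the body is the list of results
        return "done", body
    if status == 400:   # Bad Request: validation errors
        return "invalid", body.get("errors", [])
    return "unexpected", body


def http_json(method, url, payload=None):
    """Send a JSON request and return (status code, decoded JSON body)."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx; this assumes the error body is JSON too.
        return error.code, json.loads(error.read().decode("utf-8"))


def run_query(genes, transport=http_json, attempts=10, delay=5.0):
    """Submit a query, then poll for its results until they are ready."""
    state, payload = interpret(
        *transport("POST", f"{BASE_URL}/search_queries.json", build_query(genes)))
    if state != "submitted":
        return state, payload
    results_url = f"{BASE_URL}/search_queries/{payload}/results.json"
    for _ in range(attempts):
        state, result = interpret(*transport("GET", results_url))
        if state != "pending":
            return state, result
        time.sleep(delay)  # back off before asking again
    return "pending", None
```

Passing the transport function as a parameter keeps the status-code handling separate from the network call, so the polling logic can be exercised against a stub before being pointed at a live server.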
3.4 Development Methodology

In order to further address the sustainability criterion outlined in Section 2.4.1 of the previous chapter, this section will briefly comment on the development methods used to design and implement DisEnT.

DisEnT was developed using the Agile software development approach. Agile is a term coined by Beck et al. (2001). One of the main tenets of this methodology is to develop software built for change rather than to follow a pre-set plan. Building software ‘ready’ for change means building software that is robust, modular and orthogonal, i.e. building software without introducing unnecessary complexity and dependencies between its components.

This chapter has described how DisEnT adheres to these principles in several aspects. For example, delegating specialised tasks to specialised modules decreases dependencies between the components. In the event of a failure of the DisEnT website, the job queue module along with the computation module should remain unaffected, processing the enqueued jobs. Similarly, the choice of technologies adhering to the Agile principles, such as Ruby on Rails, also improves robustness and ultimately sustainability of DisEnT. Thanks to the Model-View-Controller pattern, DisEnT’s web interface is easily extendible to communicate using JSON or XML, or to provide its data in formats such as CSV.

The Agile methodology also affected the ‘project management’ aspects of DisEnT. Each of the project’s features was first outlined as a brief and well-defined use case (sometimes referred to as a user story) that was then broken down into short and achievable tasks. Over the course of this project, these tasks were prioritized and delivered on a weekly basis, followed by frequent ‘releases’ of new versions of DisEnT.
While this approach arguably introduced some operational overhead into the development process, it helped the project to remain on track and deliver the most important features based on frequent feedback from the project’s supervisor.

Finally, DisEnT was developed using the test-driven development (TDD) method. TDD is a popular software development methodology in which software tests are written before the code they test. Adopting TDD made it possible to set clear expectations of DisEnT’s desired behaviour before extending its code base. Moreover, because TDD code is written to specifically address a set of discrete and well-defined expectations, the code itself becomes more modular, robust and thus more maintainable. While use of this methodology does not guarantee completely bug-free software, TDD has been reported to radically decrease the number of software flaws. For example, Maximilien and Williams (2003) report a reduction of software defects by 50 percent after adoption of TDD.

Chapter 4

Evaluation

The aim of this project was the development of a reliable and accessible scientific tool for disease enrichment analysis using a novel dataset. In order to evaluate whether this goal was achieved, Chapter 2 introduced four main requirements for DisEnT: correctness, usability, scalability and sustainability. As Chapter 3 described, these four criteria were used as guidelines when designing DisEnT’s architecture, creating its interfaces and implementing its functionality. While we suggest that the decisions made during the development of the project were suitable for the set requirements, this chapter will evaluate whether DisEnT meets its criteria using a number of other methods. The correctness criterion is evaluated by using results produced by DisEnT to find relationships between diseases and genes and comparing them to findings in the literature.
To validate DisEnT’s usability, a user study was conducted, where a group of participants were asked to use DisEnT to perform disease enrichment analysis. Scalability of DisEnT is evaluated using a simulated load impact test to verify whether the system can cope with increased concurrent workload. Finally, the system’s sustainability is evaluated against a set of peer-reviewed recommendations for scientific software.

4.1 Correctness

Although DisEnT’s source code is covered by a high number of automated tests, not even extensive software testing can guarantee complete correctness of a software system. Thus, in order to evaluate whether the results produced by DisEnT are correct, we tested the system in a realistic scenario. We used DisEnT to find enriched diseases in two distinct sets of genes. The results were merged together and analysed as a network in order to find linkages between the diseases identified in each set. Our findings were compared to the published literature.

The lists used for the evaluation consisted of genes from the presynaptic and the postsynaptic part of the synapse. The synapse is a structural feature between two neurons that enables their connection. The presynaptic part of the neuron can be considered a transmitter in this connection and the postsynaptic part a receiver. Figure 4.1 shows a cartoon of a synapse.

[Figure 4.1: A cartoon depicting a synapse with its presynaptic (top) and postsynaptic parts (bottom). Figure by Gécz (2010).]

The gene lists were sourced from two published studies. The source of the presynaptic gene list was Boyken et al. (2013) and it contained 419 human genes. The postsynaptic gene list was published by Bayés et al. (2012) and it contained 1442 human genes.

Each one of the gene lists was submitted to DisEnT in a separate query, setting human as the target species and choosing all of the available annotation sources. Each of the queries returned 50 results and the top 10 enriched diseases for the queries are shown in Tables 4.1 and 4.2. As one would expect, most of these terms describe neurological disorders. The full result set for each of the lists was downloaded in the CSV format for further processing.

Table 4.1: Top 10 disease terms enriched in the post-synaptic list, ranked by their p-value.

    Disease Term (HDO ID)                           Enrichment Score   p-value
    schizophrenia (DOID:5419)                        2.96              8.0847E-20
    frontotemporal dementia (DOID:9255)              7.67              3.1864E-17
    Leigh disease (DOID:3652)                       11.94              1.5542E-13
    pyruvate decarboxylase deficiency (DOID:3649)   15.78              1.2052E-06
    Ohtahara syndrome (DOID:0050709)                 9.23              3.5613E-05
    chief cell adenoma (DOID:7607)                  30.00              5.4008E-05
    parathyroid carcinoma (DOID:1540)               11.76              2.7085E-04
    mitochondrial encephalomyopathy (DOID:890)       8.33              2.7818E-04
    leber hereditary optic neuropathy (DOID:705)    10.52              4.7406E-04
    schizophreniform disorder (DOID:11328)           2.60              4.8057E-04

In order to find connections between the diseases enriched in each gene list, we combined the results from the two lists and used them to construct a graph. Nodes of the graph were the enriched diseases and their edges were formed by genes shared between diseases. Separate edges were drawn for presynaptic and postsynaptic genes in common. This network was visualised using Cytoscape version 3.1.1.

Figure 4.2 shows the initial layout of this network. Edges in Figure 4.2 are weighted by the number of genes in common between each pair of diseases. ‘Postsynaptic’ edges are shown in red, while edges representing the presynaptic genes are shown in green. The size of the nodes and their labels is determined by the nodes’ betweenness centrality. A node’s betweenness centrality is the proportion of shortest paths in the network that pass through that node.
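The network construction and the betweenness centrality measure described above can be illustrated with a short sketch. It is not the procedure used in the evaluation, which relied on Cytoscape; it is a plain Python illustration that assumes an unweighted, undirected graph, uses Brandes' algorithm, and reports unnormalised scores, so the absolute values may differ from those Cytoscape displays. The function names and the toy disease and gene labels are hypothetical.

```python
from collections import deque
from itertools import combinations


def shared_gene_edges(disease_genes):
    """Link two diseases when they share genes; the weight is the shared count."""
    edges = {}
    for a, b in combinations(sorted(disease_genes), 2):
        shared = set(disease_genes[a]) & set(disease_genes[b])
        if shared:
            edges[(a, b)] = len(shared)
    return edges


def betweenness(nodes, edges):
    """Unnormalised betweenness centrality on an unweighted, undirected graph.

    For every pair of other nodes, a node receives the fraction of the
    shortest paths between that pair which pass through it (Brandes' algorithm).
    """
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        predecessors = {n: [] for n in nodes}
        path_count = dict.fromkeys(nodes, 0)
        path_count[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)

        # Walk back from the farthest nodes, accumulating path dependencies.
        dependency = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]

    # Each unordered pair was visited from both ends, so halve the totals.
    return {n: total / 2 for n, total in score.items()}


if __name__ == "__main__":
    # Toy input with made-up gene labels: disease B shares one gene with A and one with C.
    toy = {"A": {"g1"}, "B": {"g1", "g2"}, "C": {"g2"}}
    toy_edges = shared_gene_edges(toy)
    print(toy_edges)
    print(betweenness(list(toy), toy_edges))
```

In the toy input, B lies on the only shortest path between A and C, so it is the only node with a non-zero score. Ranking nodes by such a score is what allows a large disease network to be reduced to its most central terms.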
[Figure 4.2: Network of diseases linked by their presynaptic and postsynaptic genes in common. Nodes represent the diseases and edges represent the shared genes. Green edges represent presynaptic genes and red edges represent postsynaptic genes. Thickness of the edges is determined by the number of genes in common. Size of the disease nodes is determined by their betweenness centrality value (explained in the text).]

Table 4.2: Top 10 disease terms enriched in the pre-synaptic list, ranked by their p-value.

    Disease Term (HDO ID)                           Enrichment Score   p-value
    schizophrenia (DOID:5419)                        2.91              4.5421E-65
    frontotemporal dementia (DOID:9255)              5.43              2.6852E-33
    Alzheimer’s disease (DOID:10652)                 2.07              1.3499E-21
    Down syndrome (DOID:14250)                       2.47              6.7531E-08
    glioblastoma (DOID:3068)                         1.78              1.9764E-07
    neurotic disorder (DOID:4964)                    1.66              2.0477E-07
    autistic disorder (DOID:12849)                   1.90              3.9672E-07
    temporal lobe epilepsy (DOID:3328)               3.23              6.0301E-07
    amyotrophic lateral sclerosis (DOID:332)         2.04              1.4568E-06
    Charcot-Marie-Tooth disease (DOID:10595)         3.73              2.5146E-06

Figure 4.2 shows that there are many common gene connections between the postsynaptic diseases (connections shown in red) and that there are a number of presynaptic disease terms (connections shown in green) that are separate from the postsynaptic disease cluster. While most of the terms refer to neurological disorders, there are a number of metabolic disorders present in the network, such as pyruvate decarboxylase deficiency. Although this connection may appear surprising, past studies have shown various links between metabolic and neurological disorders. For example, a link has been found between a neurological disorder called Leigh’s disease (ranked 3rd in Table 4.1) and pyruvate carboxylase deficiency (Toshima et al., 1982; Ohtake et al., 1982). In order to simplify the network, we reduced it to the top 10 nodes based on their betweenness centrality value (Figure 4.3).
We also built separate networks from the presynaptic and postsynaptic connections and reduced them to the top 10 nodes ordered by betweenness centrality. The reduced presynaptic network is shown in Figure 4.4 and the postsynaptic network in Figure 4.5. Figures 4.3, 4.4 and 4.5 all show a significant enrichment for the schizophrenia term. This significance, supported by this term’s high ranking in Tables 4.1 and 4.2, is also evidenced by a number of studies of schizophrenia in the context of presynaptic (Sternberg, 1982) and postsynaptic (Pandey et al., 1977) processes. It should be noted, however, that the high number of gene connections between schizophrenia and other diseases may also be a consequence of the high amount of published literature studying schizophrenia in general.

[Figure 4.3: The network of postsynaptic and presynaptic gene associations reduced to top 10 nodes ordered by betweenness centrality.]

[Figure 4.4: The network of the presynaptic gene associations reduced to top 10 nodes ordered by betweenness centrality.]

[Figure 4.5: The network of the postsynaptic gene associations reduced to top 10 nodes ordered by betweenness centrality.]

Another interesting phenomenon in Figures 4.4 and 4.5 is the number of terms related to cancer (e.g. glioblastoma, breast cancer, paraganglioma) present in the two networks. One of the reasons behind the high number of connections between the neurological and the cancer-related terms could be that certain gene products, such as those controlling the cell cycle, are re-used in different parts of a biological system. In order to verify this hypothesis, we performed a functional enrichment analysis on the genes associated with the cancerous terms. We found that processes associated with the cell cycle are highly enriched in these genes. The enrichment analysis was performed using the ToppGene suite (Chen et al., 2009) and its results are summarized in Table 4.3.

Table 4.3: Top 5 results of enrichment analysis for genes associated with cancer-related terms. Enrichments of molecular functions, biological processes and cellular components are listed.

    Rank   GO Term (ID)                                      p-value
    Molecular Function
    1      enzyme binding (GO:0019899)                       5.883E-49
    2      kinase binding (GO:0019900)                       2.305E-29
    3      protein complex binding (GO:0032403)              1.183E-27
    4      cytoskeletal protein binding (GO:0008092)         1.809E-27
    5      protein kinase binding (GO:0019901)               2.508E-26
    Biological Process
    1      cell projection organization (GO:0030030)         1.880E-46
    2      neuron projection development (GO:0031175)        6.365E-45
    3      neuron development (GO:0048666)                   7.478E-44
    4      neurogenesis (GO:0022008)                         4.818E-42
    5      neuron projection morphogenesis (GO:0048812)      5.556E-42
    Cellular Component
    1      neuron projection (GO:0043005)                    1.282E-37
    2      neuron part (GO:0097458)                          9.476E-35
    3      cytoskeletal part (GO:0044430)                    2.627E-28
    4      vesicle (GO:0031982)                              4.244E-27
    5      cell junction (GO:0030054)                        5.093E-27

While the involvement of cell-cycle genes may explain some of the connections between the cancer terms and neurological diseases, it should be noted that these links may still be interesting. For example, several studies have found links between incidence of cancer and neurological diseases such as Alzheimer’s disease (Roe et al., 2005) and schizophrenia (Barak et al., 2005; Grinshpoon et al., 2005).

The scenario described above attempted to use data produced by DisEnT to make meaningful inferences about links between diseases and genes. Our findings suggest that using DisEnT’s data in this manner can lead to correct and interesting results, such as links between neurological disorders and metabolic disorders or cancer. However, as with any biological research, care needs to be taken when choosing investigation methods as well as when evaluating the results. Similar future attempts at the presented evaluation scenario may benefit from more granular reduction of the constructed networks in order to find more subtle connections, and from more sophisticated statistical analysis of the network that accounts for the fact that certain diseases are studied more than others. Enabling this form of analysis within the DisEnT application may also be valuable to its users.

4.2 Usability

To further test the usability of DisEnT, we conducted a small user study whose participants were asked to use DisEnT to perform a simple disease enrichment analysis. The participants were instructed to carry out a set of practical tasks using DisEnT’s web interface and then comment on their user experience. The study was conducted using the Google Forms service [1]. The Forms service was used to present instructions to the participants as well as to collect their answers. The full content of the survey as well as its results can be found in Appendix B and at http://goo.gl/sgLhd2.
[1] https://docs.google.com/forms/

4.2.1 Practical Tasks

The study instructed its participants to access DisEnT’s pre-production website, enter a given set of genes as an input list, specify the genes’ species and submit the query. The specified query generated 50 results that were all displayed to the participants. Once the participants were presented with their search results, they were asked to answer the following questions:

1. What is the name of the disease with the highest enrichment score?
2. How many genes from the input list are associated with that disease?
3. Please list all genes from your list associated with Alzheimer’s disease.

In order to answer the questions above, the participants were instructed to use features that DisEnT’s user interface provides, but no further guidance was given. If they did not know the answer to any of the questions, the participants were allowed to not submit an answer.

To answer question 1, the participants were expected to sort the query results by the ‘Enrichment’ column in descending order. Out of 10 participants in the study, 9 answered question 1 correctly. The participant who answered this question incorrectly reported the first result in the default ordering (by p-value). The incorrect answer could have been caused by confusion between the enrichment score column and DisEnT’s default ranking. This error could potentially have been avoided by stating the question more explicitly or by targeting it at a different column.

Answering question 2 required the users to identify the List count column value for the result identified in question 1. All of the participants answered this question correctly. (The participant who reported an incorrect disease in question 1 provided a count matching their answer.)

Finally, question 3 asked the users to find the List count value for Alzheimer’s disease and report the associated genes by clicking on the count.
Nine participants answered this question correctly. One participant reported genes associated with the disease from questions 1 and 2, suggesting that they misunderstood question 3.

4.2.2 User Experience

Following the practical part, the second section of the study asked the participants to evaluate the difficulty of the preceding tasks and comment on the usability of DisEnT’s user interface. The questions in this section were as follows:

1. How difficult was it to answer the questions above? (Rating on a scale of 1 to 5, where 1 means ‘Easy’ and 5 means ‘Difficult’.)
2. Can you rate the ease of use of DisEnT? (Rating on a scale of 1 to 5, where 1 means ‘Easy to use’ and 5 means ‘Difficult to use’.)
3. Can you describe DisEnT’s user interface in 1-3 words?
4. What is your level of experience with the enrichment analysis method? (Rating on a scale of 1 to 5, where 1 means ‘No experience’ and 5 means ‘Expert’.)
5. Any further comments

In question 1, five of the participants marked the difficulty level as 1 and five participants as 2. This suggests that the participants considered finding answers to the questions in the practical section to be relatively easy. As Section 4.2.3 reports, most of the participants had very little or no previous experience with the enrichment analysis method, so reports of a certain level of difficulty were expected from the experiment.

Answers to question 2 consisted of 7 ratings of 1 and 3 ratings of 2, which suggests that DisEnT is generally considered to be easy to use by non-experts.

Question 3 of this section allowed users to label DisEnT’s user interface using their own words. Answers to this question were predominantly positive, using words such as ‘intuitive’, ‘minimalist’, ‘clean’ and ‘friendly’. In contrast, however, one of the participants described the interface as ‘a bit confusing’.

Answers to question 4 reported that most of the study participants had little or no prior experience with enrichment analysis.
When asked to rate their level of experience with this method, 7 participants chose the lowest rating, ‘No experience’.

Optional comments provided by the participants contained additional positive feedback and a number of suggestions. Specifically, two of the responses suggested that more of the search result headings should be annotated and one respondent found the implementation of the table sorting functionality unclear. While this particular type of ‘freeform’ feedback is difficult to quantify, we believe that the suggestions and shortcomings pointed out in this study may form valuable information for future improvements of DisEnT.

4.2.3 Evaluation

Overall, the feedback provided by this study suggests that DisEnT’s user interface is perceived positively. Almost all study participants completed the given tasks correctly and evaluated the difficulty of the given tasks as low. Likewise, most of the participants expressed a positive opinion about DisEnT’s web interface. We believe that these results suggest that the usability criterion was met by DisEnT.

However, it is worth discussing some of the limitations of this user study. The difficulty of the practical tasks was set relatively low in order to be able to conduct the study with non-expert participants. Setting more complicated tasks for a more experienced audience could provide an insight into DisEnT’s usability in a more realistic scenario.

Moreover, the participants chosen for this study were university-level students of informatics and bioinformatics. There is some expectation that participants of this type are technically able and accustomed to a range of web technologies. This notion, along with the observed experience levels, may not be an adequate reflection of DisEnT’s target user base. A future user study could include a more diverse range of participants.
The reason behind choosing non-expert participants was twofold. Firstly, it allowed us to test DisEnT in a scenario where the users were not familiar with the enrichment analysis process and were only guided by brief instructions. This setting required the users to carry out the task exclusively using the features of DisEnT’s interface. The second reason for choosing non-expert participants was practical: non-expert users are easier to find.

To summarize, while the conducted study was relatively short and not without its limitations, we believe that the collected results provide some evidence that the application developed in this project meets its usability criterion.

4.3 Scalability

Scalability of DisEnT was tested by a technique known as load testing (Weyuker and Vokolos, 2000). As its name suggests, the aim of load testing is to observe the performance of a system when exposed to heavy workload. To load test DisEnT, we used an online service called Load Impact [2]. The Load Impact service enables its users to configure ‘user scenarios’, i.e. custom scripts imitating usual user interaction with a service. These user scenarios are then ‘replayed’ on the target system, simulating usage ‘traffic’ on the system. Load Impact replays the scenarios at a gradually increasing rate, putting more workload on the target system.

We simultaneously tested DisEnT using two user scenarios. One of the scenarios simulated a user loading a results page showing 50 analysis results. The second scenario simulated a user submitting a query containing 100 gene symbols using the API. Each of the scenarios had a randomly chosen back-off period of 1 to 10 seconds before it was repeated.

The reason for choosing an API call for query submission over a web-form interface was a technical one. Because allowing automated submission of web forms can be considered a security risk (known as cross-site request forgery), Rails implements a safety feature in its web forms that prevents this risk.
Unfortunately, this also makes it difficult to automate the form submission task in a Load Impact user scenario.

[2] http://loadimpact.com/

The Load Impact test worked by gradually increasing the number of user connections, each of which replayed one of the user scenarios on DisEnT. The two scenarios were spread evenly among the connections. The test started with a single connection and ended with 50 user connections. This increase was achieved over the course of 5 minutes. The parameters of 50 connections and 5 minutes were both limited by the free version of the Load Impact service and thus could not be adjusted.

The results of the load test are presented in the form of a chart in Figure 4.6. The parameters observed during the test were: the number of active connections (i.e. Clients active), the number of requests per second (Requests/second) and the service response time (User load time). The full details of the test and its results can be found at the Load Impact website [3].

As Figure 4.6 shows, the number of connections had a slight detrimental effect on DisEnT’s response time. However, all of the requests were processed in under 1 second, which is considered an acceptable response time for an optimal web user experience (Nielsen, 1999). Over the course of its 5-minute run, the test issued 3831 unique HTTP requests at a rate of up to 30 requests per second. In total, 195.68 megabytes of data were transferred between the DisEnT server and the simulated Load Impact users. The result page scenario was triggered 628 times with an average response time of 839.88 milliseconds (ms). The 100-gene query submission scenario was triggered 695 times with an average response time of 472.14 ms. All of the requests were processed and responded to successfully.

We believe that the difference in response time between the two scenarios can be explained by the use of two different protocols.
While the results page scenario is less computationally expensive (it only involves reading data), it actually consists of several ‘sub-requests’ for assets contained on the page (e.g. stylesheets, scripts, images). On the other hand, the query submission scenario is computationally more expensive, but the process only involves issuing a single request to which DisEnT replies with a single response message.

We believe that the load testing results show that DisEnT is usable under heavy workload and thus meets the scalability criterion. While limitations of the free version of the Load Impact service did not allow us to determine any limits of DisEnT’s scalability, we do not expect that a scientific tool of this type would be used under conditions exceeding the conditions of the described load test.

[3] http://loadimpact.com/load-test/synprot.inf.ed.ac.uk-356f529ecaff26571c779e5c52624bda

Figure 4.6: Results of load testing showing DisEnT’s response time (blue), active client connections (green) and requests per second (red) over time. Figure provided by Load Impact.

Future load tests could benefit from more intensive testing with an increased request rate and more observed variables. For example, the time needed to return results from a newly submitted query could be observed. DisEnT in its current version only operates one worker processing its job queue, but this number can be easily increased to optimize the query turnaround time.

4.4 Sustainability

Sustainability was the fourth and final criterion considered when designing DisEnT. We believe that a sustainable piece of software is designed to be reusable, maintainable and long-lasting. While these attributes are often only identifiable over a longer period of time, this section attempts to evaluate the sustainability of DisEnT by matching it against a set of recommendations for writing scientific software recently published by Wilson et al. (2014).
In their study entitled Best practices for scientific computing, Wilson et al. suggest eight recommendations based on the authors’ collective experience in developing scientific software, on various open-source software guidelines, as well as on published scientific computing studies. This section briefly introduces these recommendations and describes how DisEnT addresses each one of them.

4.4.1 Write programs for people, not computers

This recommendation suggests that software source code should be easy for people to read and comprehend. This can be achieved by adhering to a consistent style of formatting and writing, as well as choosing meaningful names for variables, methods and other components included in the code.

This requirement is inherently addressed in the Ruby on Rails architecture. By adhering to the convention over configuration pattern described in Section 3.1.1, Rails encourages developers to be consistent and descriptive in the naming of components in their code. DisEnT was developed in line with this Rails convention as much as possible.

The Ruby code in DisEnT was written according to a community-driven Ruby code style guide [4], ensuring its formatting is coherent and the source code is legible and self-explanatory. Custom methods have been named descriptively and more complex methods have been annotated with code comments.

[4] https://github.com/bbatsov/ruby-style-guide

4.4.2 Let the computer do the work

The message of this recommendation is to automate repetitive tasks in scenarios such as database maintenance or software deployments. Ruby on Rails also addresses this requirement by offering Rake tasks – scripts written in Ruby for common tasks such as database migrations. Rake tasks are functionally identical to the Rails runners described in Section 3.1.4.
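A custom Rake task that chains several maintenance steps might look like the following minimal sketch. The example namespace, the task names and the STEPS_RUN log are hypothetical and chosen for illustration only; the task bodies merely record that they ran, where a real task would call the relevant Rails or shell commands.

```ruby
# In a Rakefile the DSL is already loaded; this require is only needed
# because the sketch runs as a plain Ruby script.
require "rake"

# Hypothetical log of executed steps, so the run order can be inspected.
STEPS_RUN = []

namespace :example do
  task :migrate do
    STEPS_RUN << :migrate           # placeholder for applying schema changes
  end

  task :restart_workers do
    STEPS_RUN << :restart_workers   # placeholder for restarting job workers
  end

  task :reload_server do
    STEPS_RUN << :reload_server     # placeholder for reloading the web server
  end

  desc "Run every deployment step with a single command"
  task deploy: [:migrate, :restart_workers, :reload_server]
end

# Invoking the umbrella task runs its prerequisites in the order listed.
Rake::Task["example:deploy"].invoke
puts STEPS_RUN.inspect
```

Declaring the individual steps as prerequisites keeps each one usable on its own while still offering a single entry point for the whole sequence.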
Examples of standard Rails Rake tasks include the db:setup task for automatic creation of database tables based on their definition or the db:seed task for populating the database with pre-defined data. These tasks are tremendously useful when creating a new development environment. Rake tasks are not limited to database operations and can in fact execute any Ruby (and Rails) code, so developers can create their own customized tasks.

DisEnT contains a number of custom tasks, such as the disent:restart task. This task performs a number of operations needed for updating DisEnT in the production environment. These operations include applying any pending changes to the database structure, restarting the Resque workers and finally reloading the Rails server. These steps need to be completed every time a new version of DisEnT is deployed; this Rake task reduces the multi-step process to a single step.

4.4.3 Make incremental changes

This recommendation consists of two points. First, the authors advise scientific software developers to make small incremental changes in order to be able to accommodate frequent feedback. Second, they recommend using version control for all manually produced code.

As Section 3.4 describes, DisEnT was developed in small increments by delivering each feature in a separate, well-defined task. The management aspect of this project also included weekly meetings where DisEnT’s latest version was discussed and any outstanding tasks could be re-prioritised.

DisEnT was developed using the Git [5] version control system from the very beginning of the project. In addition to the Ruby on Rails code, all R code used for computation is also stored in Git.

[5] http://git-scm.com/

4.4.4 Don’t repeat yourself (or others)

This point recommends making code reusable and modular, so that each piece of knowledge in the system is represented only once. This approach makes it easier to maintain the code base.
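As a minimal illustration of the principle in Ruby, shared behaviour can be defined once in a module and mixed into the classes that need it. The module, class and method names below are hypothetical and do not come from DisEnT’s code base; the upper-casing rule is likewise only an example.

```ruby
# Hypothetical module: the single place that knows how a raw gene symbol
# is cleaned up. The rule itself (strip + upcase) is illustrative only.
module SymbolNormalizer
  def normalize_symbol(raw)
    raw.to_s.strip.upcase
  end
end

# Hypothetical class that turns free-text user input into a symbol list.
class QueryParser
  include SymbolNormalizer

  def parse(text)
    text.split(/[\s,]+/)
        .reject(&:empty?)
        .map { |symbol| normalize_symbol(symbol) }
        .uniq
  end
end

# Hypothetical class that cleans annotation rows before they are stored.
class AnnotationImporter
  include SymbolNormalizer

  def import(rows)
    rows.map { |row| row.merge(gene: normalize_symbol(row[:gene])) }
  end
end

p QueryParser.new.parse(" gria1, GRIA2\nnefl ")
p AnnotationImporter.new.import([{ gene: " mapt", disease: "example" }])
```

Because both classes rely on the same module, a change to the normalisation rule has to be made in one place only.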
The Don’t repeat yourself (DRY) principle is one of the major design goals of the Ruby on Rails framework. For example, attributes of each model defined in Rails (Section 3.1.1 explains the model-view-controller pattern) are inferred from the database schema so that they do not have to be re-defined. Code written in Rails also has access to all of the features of Ruby, making it possible for developers to create and include custom classes and modules for sharing behavior and knowledge across the code base.

4.4.5 Plan for mistakes

This recommendation stresses the importance of automated software testing and suggests preventing already-discovered bugs from re-occurring by writing tests that capture them.

As described in Section 3.4, DisEnT was developed using the test-driven development (TDD) methodology, where tests for a piece of code are written before the code itself. Using the popular RSpec [6] testing framework, DisEnT was developed with exceptionally high test coverage. SimpleCov [7], a Ruby plugin for measuring test coverage, reported 96.23% of all DisEnT’s Ruby code to be covered by a test case.

[6] https://github.com/rspec/rspec
[7] https://github.com/colszowka/simplecov

4.4.6 Optimize software only after it works correctly

This recommendation suggests using profilers to identify performance bottlenecks and using high-level programming languages to implement functionality before optimizing its performance.

We carried out performance testing in Section 4.3 in order to identify any performance bottlenecks in DisEnT. As mentioned in that section, we have not yet identified any performance bottlenecks in the production version of the system, but this may partially be due to limitations enforced by the performance profiler used.

Implementing functionality before optimizing performance was the approach taken for the gene identification and homolog mapping processes (Sections 3.2.3 and 3.2.4 respectively).
4.4.7 Document design and purpose, not mechanics

The message of this point is to write documentation that explains design decisions rather than implementation details. This point also recommends embedding documentation in the code.

Because most of DisEnT’s code consists of simple and short methods, its documentation embedded in code is not extensive. However, any major design decisions were recorded in a set of wiki pages created for the project. The wiki pages describe the challenges encountered and their solutions. Moreover, DisEnT’s API has been extensively documented via the Apiary service at http://docs.disent.apiary.io/.

Should there be a need, future versions of DisEnT’s code base could be annotated with specially formatted comments that can be automatically translated into web-based documentation. An example of a package offering such functionality is RDoc [8].

[8] http://docs.seattlerb.org/rdoc

4.4.8 Collaborate

This point encourages scientific software developers to use collaborative tools and techniques such as code reviews or pair programming. The authors also advise using specialised tools for issue tracking.

Due to the nature of this project (i.e. an individual Master’s project), the implementation phase did not offer many opportunities for programming collaboration, but DisEnT was developed with frequent feedback. The progress of this project was always trackable in a constantly updated spreadsheet available online. Each of the tasks was assigned a unique identifier (e.g. DisEnT023) which could be referred to in all parts of the system, including the source code and the wiki-based notes mentioned in the preceding section.

All in all, we suggest that DisEnT was developed in a way that will enable its re-use and expansion in the future. While some of its aspects, such as the extent of its documentation, may be improved as the project becomes more complex, we believe that we diligently followed the recommendations set out above in order to produce a sustainable piece of software.

Chapter 5 Conclusion

Relationships between diseases and genes are tremendously complex, but the effort invested in understanding them can reveal new insights into some of the world’s most important health challenges. This thesis has introduced a method for studying gene-disease relationships by enrichment analysis and a tool enabling its use.

We have described DisEnT – a disease enrichment tool that allows access to gene-disease data collected from an unprecedented number of sources. The aim of this project was to design, implement and evaluate DisEnT as a reliable and accessible scientific tool. In order to achieve this aim, we set four main goals for DisEnT: correctness, usability, scalability and sustainability.

DisEnT has been developed to address all four criteria. Its code was implemented with tests safeguarding its correctness and its user interface was designed with a focus on usability. DisEnT’s architecture is modular to allow for expansion of its features and its underlying platforms were chosen to make it suitable for future re-use.

DisEnT was evaluated against each one of its four goals. In order to evaluate the system’s correctness, we showed that results produced by DisEnT can be used to find evidence-supported links, including links that are not immediately obvious. DisEnT’s usability was evaluated in a user study, producing positive outcomes. We subjected DisEnT to a load test to evaluate its scalability and found that it can cope with increased workload. Finally, we provided evidence that we followed recommended best practices in our methodology. While each of the evaluation methods we used has its limits, we believe that their outcome is a reasonable indication of DisEnT’s quality.
While we suggest that the current version of DisEnT is a reliable scientific tool, its main limitation is currently the limited set of features it provides. Although the currently available enrichment analysis functionality can provide valuable information to its users, there is much more that can be done with the data available.

Future versions of DisEnT could introduce visualisation functionality into the application. For example, automated construction of gene-disease networks (similar to those described in Section 4.1) could enable users to gain even more insight from their results without having to reach for external solutions.

Another valuable feature of DisEnT would be support for ontologies other than the Human Disease Ontology. This feature could help users gain more specific insights by using ontologies targeted specifically to their problem of interest, e.g. neurological diseases. As a matter of fact, there is an ongoing effort to include the Human Phenotype Ontology (Robinson et al., 2008) in DisEnT’s annotation data. A similar effort exists to enable research into diseases of the synapse – synaptopathies (Grant, 2012) – in DisEnT.

DisEnT could also allow its users to trace supporting evidence for the enrichment results it provides. Such a feature would require an enhancement of the current dataset, but it would vastly improve the transparency of DisEnT’s internal processes.

Overall, this project has successfully produced a software tool capable of performing disease enrichment analysis on a given set of genes. We believe that the quality and accessibility criteria set for DisEnT have been met and that the tool is correct, usable, scalable and sustainable. There are a number of features that could enhance this tool further and we believe that this project has laid a solid foundation to enable this expansion, allowing DisEnT to become a useful and reliable resource for the scientific community.
Appendix A Technical Specifications

A.1 Software Versions

This section describes versions of software packages used in DisEnT.

Table A.1: Versions of software packages used in DisEnT divided by their components defined in Section 3.1.

Component        Name               Version
Web Application  Ruby               2.1.2
Web Application  Ruby on Rails      4.1.0
Web Application  Apache             2.2.22
Web Application  Phusion Passenger  4.0.48
Web Application  Rserve-client      0.3.1
Web Application  Resque             1.25.2
Database         SQLite             3.8.2
Database         MySQL              5.5.38
Computation      R                  3.1.1
Computation      RServe             1.8.0
Job Queue        Redis              2.8.13

A.2 Supported Species

This section lists species supported in DisEnT as described in Section 3.2.3.1.

Table A.2: Species supported in DisEnT. The ‘common names’ are listed as they are listed in DisEnT

Common Name     Scientific Name
Anole lizard    Anolis carolinensis
C.intestinalis  Ciona intestinalis
Cat             Felis catus
Chicken         Gallus gallus
Chimpanzee      Pan troglodytes
Cow             Bos taurus
Dog             Canis lupus familiaris
Fruitfly        Drosophila melanogaster
Gorilla         Gorilla gorilla
Horse           Equus caballus
Human           Homo sapiens
Macaque         Macaca mulatta
Marmoset        Callithrix jacchus
Mouse           Mus musculus
Opossum         Monodelphis domestica
Orangutan       Pongo abelii
Pig             Sus scrofa
Rabbit          Oryctolagus cuniculus
Rat             Rattus norvegicus
S. cerevisiae   Saccharomyces cerevisiae
Turkey          Meleagris gallopavo
Zebra Finch     Taeniopygia guttata
Zebrafish       Danio rerio

Appendix B User Study

B.1 Participant Answers

The following section lists all answers submitted to the user study described in Section 4.2 of Chapter 4.

B.1.1 Practical Part

The questions were as follows:

1. What is the name of the disease with the highest enrichment score?
2. How many genes from the input list are associated with that disease?
3. Please list all genes from your list associated with Alzheimer’s disease.

B.1.2 Evaluation

1. How difficult was it to answer the questions above? Rating on a scale of 1 to 5, where 1 means ‘Easy’ and 5 means ‘Difficult’
2. Can you rate the ease of use of DisEnT? Rating on a scale of 1 to 5, where 1 means ‘Easy to use’ and 5 means ‘Difficult to use’
3. Can you describe the DisEnT’s user interface in 1-3 words?
4. What is your level of experience with the enrichment analysis method? Rating on a scale of 1 to 5, where 1 means ‘No experience’ and 5 means ‘Expert’
5. Any further comments

Table B.1: Answers submitted to the practical part of the user study.

No.  Question 1                     Question 2  Question 3
1    Tropical spastic paraparesis   2           BSG, C3, CST3, GRIA1, GRIA2, GRIA3, GRM2, GSK3B, HECW1, HNRNPA1, MAPT, NEFH, NEFL, PARK7, PARP1, PIN1, RAB5A
2    Tropical spastic paraparesis   2           As above
3    Tropical spastic paraparesis   2           As above
4    Tropical spastic paraparesis   2           As above
5    Tropical spastic paraparesis   2           As above
6    Amyotrophic lateral sclerosis  39          As above
7    Tropical spastic paraparesis   2           As above
8    Tropical spastic paraparesis   2           NEFH, NEFL
9    Tropical spastic paraparesis   2           BSG, C3, CST3, GRIA1, GRIA2, GRIA3, GRM2, GSK3B, HECW1, HNRNPA1, MAPT, NEFH, NEFL, PARK7, PARP1, PIN1, RAB5A
10   Tropical spastic paraparesis   2           As above

Table B.2: Answers submitted to the evaluation part of the user study.

No.  Q1  Q2  Q3                              Q4  Question 5
1    2   1   beautiful                       1   I love the font and the question markers that offer some additional info.
2    2   2   it is ok                        1
3    1   1   minimalist, modern, intuitive   1
4    1   1   white                           1   Clean interface with great attention to detail, very intuitive. I like the progress reports, very cool! Hyperlinks to Home and Search seem to be redundant.
5    1   1   Easy to use                     2
6    2   2   Minimalist                      4   Nice looking, simple, works intuitively. When clicking buttons (?) provided information I was looking for.
7    1   1   Clean, Intuitive, Friendly      3   A ‘helper’ function to describe what the headings mean would be nice.
8    2   2   Pretty straightforward          1   Very nice.
9    1   1   minimalist, clean               1   The arrow indicating which column is used to sort the input looks like it belongs to the column on the left. I almost made that mistake with the first question.
10   2   1   Simple, effective               1   It might be useful to explain what each column means

B.2 Survey Form

On the following page.

Disease Enrichment Analysis with DisEnT

Hello, and thank you for taking part in this study. You will be asked to perform a disease enrichment analysis using a new tool called DisEnT. Disease enrichment analysis is a computational method used to determine whether a disease is overrepresented in a set of genes. This experiment should not take more than 5 minutes.

* Required

Practical Part

1. Go to the following URL: https://synprot.inf.ed.ac.uk/disent/
2. Enter the following genes as your input list (you can copy and paste): ANKS1B, ARHGEF2, ATL1, BSG, C3, CST3, DCTN1, DOCK1, DPP6, ENO3, EPHA4, FGF2, GABRA1, GRIA1, GRIA2, GRIA3, GRIA4, GRM2, GRM5, GSK3B, HECW1, HNRNPA1, HSPA4L, HTT, KARS, KIAA0513, KIF3A, MAPT, NEFH, NEFL, PARK7, PARP1, PFN1, PIN1, PKN1, PPIA, PRPH, PTPRF, RAB5A
3. Specify the species as Human
4. Submit your search and wait for the results.
5. Based on the results, please answer the questions below

You should be able to answer each of the questions below just by using DisEnT's user interface. If you are not sure about the right answer, feel free to skip it.

1. What is the name of the disease with the highest enrichment score?
2. How many genes from the input list are associated with that disease?
3. Please list all genes from your list associated with Alzheimer's disease

Evaluation

Please answer these questions based on your experience from the practical part of the study.

4. How difficult was it to answer the questions above? * Mark only one oval. 1 (Easy) 2 3 4 5 (Difficult)
5. Can you rate the ease of use of DisEnT? * Mark only one oval. 1 (Easy to use) 2 3 4 5 (Difficult to use)
6. Can you describe the DisEnT's user interface in 1-3 words? *
7. What is your level of experience with the enrichment analysis method? * Mark only one oval. 1 (No experience) 2 3 4 5 (Expert)
8. Any further comments

Bibliography

Alexa, A. and Rahnenführer, J. (2010). topGO: enrichment analysis for gene ontology. R package version 2.8.
Alexa, A., Rahnenführer, J., and Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics (Oxford, England), 22(13):1600–7.
Aronson, A. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium, pages 17–21.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25(1):25–9.
Barabási, A.-L., Gulbahce, N., and Loscalzo, J. (2011). Network medicine: a network-based approach to human disease. Nature reviews. Genetics, 12(1):56–68.
Barak, Y., Achiron, A., Mandel, M., Mirecki, I., and Aizenberg, D. (2005). Reduced cancer incidence among patients with schizophrenia. Cancer, 104(12):2817–21.
Bayés, A., Collins, M. O., Croning, M. D. R., van de Lagemaat, L. N., Choudhary, J. S., and Grant, S. G. N. (2012). Comparative study of human and mouse postsynaptic proteomes finds high compositional conservation and abundance differences for key synaptic proteins. PloS one, 7(10):e46683.
Beck, K., Beedle, M., and van Bennekum, A. (2001). The agile manifesto.
Binns, D., Dimmer, E., Huntley, R., Barrell, D., O’Donovan, C., and Apweiler, R. (2009). QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics (Oxford, England), 25(22):3045–6.
Botstein, D. and Risch, N. (2003). Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature genetics.
Boyken, J., Grønborg, M., Riedel, D., Urlaub, H., Jahn, R., and Chua, J. J. E. (2013). Molecular profiling of synaptic vesicle docking sites reveals novel proteins but few differences between glutamatergic and GABAergic synapses. Neuron, 78(2):285–97.
Chakravarti, A. (2011). Genomic contributions to Mendelian disease. Genome research, 21(5):643–4.
Chen, J., Bardes, E. E., Aronow, B. J., and Jegga, A. G. (2009). ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic acids research, 37(Web Server issue):W305–11.
Chen, Y., Cunningham, F., and Rios, D. (2010). Ensembl variation resources. BMC . . . , 11(1):293.
Collins, F. S. (1998). New Goals for the U.S. Human Genome Project: 1998-2003. Science, 282(5389):682–689.
Drăghici, S., Khatri, P., Martins, R. P., Ostermeier, G., and Krawetz, S. A. (2003). Global functional profiling of gene expression. Genomics, 81(2):98–104.
Fernandez, O., Jordan, D., Larkowski, J., Noria, X., and Pope, T. (2011). The Rails 3 Way. Addison-Wesley, Boston, 2nd edition.
Fielding, R., Gettys, J., Mogul, J., and Frystyk, H. (1999). Hypertext transfer protocol – HTTP/1.1.
Flicek, P., Amode, M. R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fitzgerald, S., Gil, L., Girón, C. G., Gordon, L., Hourlier, T., Hunt, S., Johnson, N., Juettemann, T., Kähäri, A. K., Keenan, S., Kulesha, E., Martin, F. J., Maurel, T., McLaren, W. M., Murphy, D. N., Nag, R., Overduin, B., Pignatelli, M., Pritchard, B., Pritchard, E., Riat, H. S., Ruffier, M., Sheppard, D., Taylor, K., Thormann, A., Trevanion, S. J., Vullo, A., Wilder, S. P., Wilson, M., Zadissa, A., Aken, B. L., Birney, E., Cunningham, F., Harrow, J., Herrero, J., Hubbard, T. J.
P., Kinsella, R., Muffato, M., Parker, A., Spudich, G., Yates, A., Zerbino, D. R., and Searle, S. M. J. (2014). Ensembl 2014. Nucleic acids research, 42(Database issue):D749–55.
Gécz, J. (2010). Glutamate receptors and learning and memory. Nature genetics, 42(11):925–6.
Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. H., and Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10):R80.
Ghisalberti, G., Masseroli, M., and Tettamanti, L. (2010). Quality controls in integrative approaches to detect errors and inconsistencies in biological databases. Journal of integrative bioinformatics, 7(3):1–13.
Grant, S. G. N. (2012). Synaptopathies: diseases of the synaptome. Current opinion in neurobiology, 22(3):522–9.
Grinshpoon, A., Barchana, M., Ponizovsky, A., Lipshitz, I., Nahon, D., Tal, O., Weizman, A., and Levav, I. (2005). Cancer in schizophrenia: is the risk higher or lower? Schizophrenia research, 73(2-3):333–41.
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic acids research, 33(Database issue):D514–7.
He, X. and Simpson, T. I. (2014). Personal communication.
Hopkins, A. L. (2007). Network pharmacology. Nature biotechnology, 25(10):1110–1.
Hornik, K. and Leisch, F. (2003). A Fast Way to Provide R Functionality to Applications. In Proceedings of DSC.
Ihaka, R. and Gentleman, R. (1996). R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics, 5(3):299–314.
Jonquet, C., Shah, N., and Musen, M. (2009). The open biomedical annotator.
Summit on translational . . . , 2009:56–60.
Kelly, D. and Sanders, R. (2008). Assessing the quality of scientific software. First International Workshop on Software . . . .
Koonin, E. V. (2005). Orthologs, paralogs, and evolutionary genomics. Annual review of genetics, 39:309–38.
Krasner, G. and Pope, S. (1988). A description of the model-view-controller user interface paradigm in the smalltalk-80 system. Journal of object . . . .
LePendu, P., Musen, M. A., and Shah, N. H. (2011). Enabling enrichment analysis with the Human Disease Ontology. Journal of biomedical informatics, 44 Suppl 1:S31–8.
Machado, C. M., Freitas, A. T., and Couto, F. M. (2013). Enrichment analysis applied to disease prognosis. Journal of biomedical semantics, 4(1):21.
Maglott, D., Ostell, J., Pruitt, K., and Tatusova, T. (2005). Entrez Gene: gene-centered information at NCBI. Nucleic acids research, 33(Database issue):D54–8.
Mailman, M., Feolo, M., Jin, Y., and Kimura, M. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nature . . . , 39(10):1181–6.
Matsumoto, Y. and Ishituka, K. (2002). Ruby programming language.
Maximilien, E. and Williams, L. (2003). Assessing test-driven development at IBM. Software Engineering, 2003. . . . , 6.
Michaud, K. and Wolfe, F. (2007). Comorbidities in rheumatoid arthritis. Best practice & research Clinical rheumatology, 21(5):885–906.
Miller, J. (2009). Design For Convention Over Configuration. Microsoft.
Mort, M., Evani, U., and Krishnan, V. (2010). In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Human . . . , 31(3):335–46.
NCBI (2014). GeneRIF: Gene Reference into Function.
Nielsen, J. (1999). User interface directions for the Web. Communications of the ACM, 42(1):65–72.
Ohtake, M., Takada, G., and Miyabayashi, S. (1982). Pyruvate decarboxylase deficiency in a patient with Leigh’s encephalomyelopathy. The Tohoku journal of . . . , 137(4):379–386.
Osborne, J., Lin, S., and Kibbe, W.
(2007). Other riffs on cooperation are already showing how well a wiki could work. Nature, 446(7138):856.
Osborne, J. D., Flatow, J., Holko, M., Lin, S. M., Kibbe, W. A., Zhu, L. J., Danila, M. I., Feng, G., and Chisholm, R. L. (2009). Annotating the human genome with Disease Ontology. BMC genomics, 10 Suppl 1:S6.
Pandey, N., Garver, D., and Tamminga, C. (1977). Postsynaptic Supersensitivity in Schizophrenia. Am J Psychiatry, 134(5):518–522.
Peng, K., Xu, W., Zheng, J., Huang, K., Wang, H., Tong, J., Lin, Z., Liu, J., Cheng, W., Fu, D., Du, P., Kibbe, W. A., Lin, S. M., and Xia, T. (2013). The Disease and Gene Annotations (DGA): an annotation resource for human disease. Nucleic acids research, 41(Database issue):D553–60.
Petri, H. (2010). Data-driven identification of co-morbidities associated with rheumatoid arthritis in a large US health plan claims database. BMC musculoskeletal . . . , 11(1):247.
Povey, S., Lovering, R., Bruford, E., Wright, M., Lush, M., and Wain, H. (2001). The HUGO Gene Nomenclature Committee (HGNC). Human genetics, 109(6):678–80.
Robinson, P. N., Köhler, S., Bauer, S., Seelow, D., Horn, D., and Mundlos, S. (2008). The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. American journal of human genetics, 83(5):610–5.
Roe, C. M., Behrens, M. I., Xiong, C., Miller, J. P., and Morris, J. C. (2005). Alzheimer disease and cancer. Neurology, 64(5):895–8.
RStudio (2014). Shiny.
Ruby, S., Thomas, D., and Hansson, D. H. (2013). Agile Web Development with Rails 4.
Schriml, L. and Arze, C. (2012). Disease Ontology: a backbone for disease semantic integration. Nucleic acids . . . .
Schwartz, B., Zaitsev, P., and Tkachenko, V. (2012). High performance MySQL: Optimization, backups, and replication.
Shah, N. H., Cole, T., and Musen, M. A. (2012). Chapter 9: Analyses using disease ontologies. PLoS computational biology, 8(12):e1002827.
Simpson, T. I. (2014). Personal communication.
Smith, S. (2006).
Systematic development of requirements documentation for general purpose scientific computing software. Requirements Engineering, 14th IEEE International . . . , pages 209–218.
Sternberg, D. E. (1982). Impaired Presynaptic Regulation of Norepinephrine in Schizophrenia. Archives of General Psychiatry, 39(3):285.
Storm, C. and Sonnhammer, E. (2002). Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics, 18(1):92–99.
Subramanian, A. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the . . . .
The UniProt Consortium (2014). Activities at the Universal Protein Resource (UniProt). Nucleic acids research, 42(Database issue):D191–8.
Tirrell, R., Evani, U., Berman, A. E., Mooney, S. D., Musen, M. A., and Shah, N. H. (2010). An ontology-neutral framework for enrichment analysis. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium, 2010:797–801.
Toshima, K., Kuroda, Y., Hashimoto, T., Ito, M., Watanabe, T., Miyao, M., and Ii, K. (1982). Enzymologic studies and therapy of Leigh’s disease associated with pyruvate decarboxylase deficiency. Pediatric research, 16(6):430–5.
Weyuker, E. and Vokolos, F. (2000). Experience with performance testing of software systems: issues, an approach, and case study. IEEE Transactions on Software Engineering, 26(12):1147–1156.
Wilson, G., Aruliah, D. A., Brown, C. T., Chue Hong, N. P., Davis, M., Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley, M. D., Waugh, B., White, E. P., and Wilson, P. (2014). Best practices for scientific computing. PLoS biology, 12(1):e1001745.
Zeeberg, B., Feng, W., Wang, G., and Wang, M. (2003). GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome . . . , 4(4):R28.