- White Rose Etheses Online
Transcription
- White Rose Etheses Online
The Identification and Characterisation of Microbes in Complex Environments Alexander L.B. Leach Doctor of Philosophy University of York Biology September, 2013 Abstract The practice of genetically identifying microbes has become increasingly commonplace in recent decades. Since Carl Woese discovered the utility of small subunit ribosomal RNA, for identifying an organism and Frederick Sanger introduced his method for de novo sequencing, the throughput of producing taxonomically relevant sequence information has risen exponentially. Small subunit rRNA has been invaluable in preliminarily identifying microbial organisms. With just a fragment of this single gene sequence, evolutionary distances between organisms can be inferred and microbes identified. A novel software pipeline - SSuMMo - was designed and developed to help identify organisms present in complex microbial communities, using datasets produced by the latest high-throughput sequencing technologies. SSuMMo was stringently tested for accuracy, speed and efficacy on a variety of datasets to assess its utility when analysing real sequence datasets, generated from both 16S rRNA primer-targeted and whole genome shotgun sequencing experiments. Sequence length is often compromised with recent high-throughput sequencing technologies, so simulations were performed to ascertain the best candidate regions for primer design on the 16S rDNA gene. The software is further demonstrated on public sequence datasets generated from sequencing the human oral and gut microbiomes. Our analyses show that SSuMMo is a viable software package for identifying species present in complex communities, particularly with primer-targeted high-throughput sequence datasets. iii Contents Abstract iii List of Tables ix List of Figures x List of Code Listings xii List of Accompanying Material xiii Acknowledgements xv Declaration 1 xvii Introduction 1 1.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 The First Signs of Microbial Life . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3.1 Darwin’s struggle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3.2 A (re-)revolution of biological philosophy . . . . . . . . . . . . . . 7 1.3.3 The Hereditary mechanism . . . . . . . . . . . . . . . . . . . . . . . 9 1.3.4 Distance of difference . . . . . . . . . . . . . . . . . . . . . . . . . . 11 v 1.4 2 System and Methods 12 19 2.1 Hidden Markov Models in Biological Systems . . . . . . . . . . . . . . . . . . . 20 2.2 Building the SSuMMo database of HMMs . . . . . . . . . . . . . . . . . . . . . 22 2.3 Associating names with taxonomic rank . . . . . . . . . . . . . . . . . . . . . 23 2.4 Assigning novel sequences to taxa . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5 Accuracy Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5.1 HMM Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5.2 Sequence length versus Assignment Accuracy . . . . . . . . . . . . 25 2.5.3 SSU rRNA hypervariable region accuracy . . . . . . . . . . . . . . 25 2.6 Optimizing SSuMMo for speed . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.7 Comparative metagenomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.8 Assigning Training Sequences to Taxa . . . . . . . . . . . . . . . . . . . . . . . 27 2.9 Training the Database of HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.10 Matching Names Between ARB and NCBI Taxonomy Databases . . . . . . . . . 28 2.11 Testing Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.12 Importing and Exporting Trees to IToL . . . . . . . . . . . . . . . . . . . . . . 31 2.13 Calculating Biodiversity Indices . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.14 Finding Taxa and their Lineage . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Plotting Tabular Data on to Trees . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.16 Inferring Sequence Conservation . . . . . . . . . . . . . . . . . . . . . . . . . 34 Comparing Processing Times Against BLAST . . . . . . . . . . . . . . . . . . . 35 2.15 2.17 3 Genetic Sequencing since the 1970s . . . . . . . . . . . . . . . . . . . . . . . . Identifying Microbes with Small Subunit ribosomal RNA 37 3.1 Taxon Identification with SSuMMo . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 vi 4 Assignment Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Software comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2.3 Sequence Windows and Primer Design . . . . . . . . . . . . . . . . 48 3.2.4 Biological Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.5 Repository Annotation Effects . . . . . . . . . . . . . . . . . . . . . 51 3.2.6 SSuMMo for database curation . . . . . . . . . . . . . . . . . . . . 52 Microbes Inhabiting the Human Microbiome 43 55 4.1 Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.2.1 Sample Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.3 5 3.2.1 4.3.1 A core healthy microbiome . . . . . . . . . . . . . . . . . . . . . . . 62 4.3.2 Healthy and IBD-infected gut microbiotas . . . . . . . . . . . . . . 64 4.3.3 Gut microbiome differences relating to adiposity . . . . . . . . . . 67 4.3.4 Annotating WGS sequences . . . . . . . . . . . . . . . . . . . . . . 4.3.5 Gut microbiome diversity relating to geographic location . . . . . 76 Discussion 74 81 5.1 Sequence clustering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.2 Comparing microbiological community diversities . . . . . . . . . . . . . . . . 84 5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Appendix I 87 A1.1 Code Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 A1.2 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Nomenclature 103 vii Bibliography 105 viii List of Tables 3.1 SSuMMo annotation accuracies. . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 SSuMMo species mismatches. . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3 SSuMMo vs. BLAST runtimes. . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.1 Geographical human gut dataset statistics. . . . . . . . . . . . . . . . . . . 58 4.2 Healthy vs. IBD gut dataset statistics. . . . . . . . . . . . . . . . . . . . . . . 59 4.3 Analysis of Variance of biodiversity index calculations. . . . . . . . . . . . 66 4.4 SSuMMo assignment statistics of Human Microbiome sequence data. . . 69 A1.1 Biodiversity indices for geological datasets at different ranks. . . . . . . . . 101 ix List of Figures 1.1 Shannon Entropy of DNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1 A simplified schematic of an Hidden Markov Model. . . . . . . . . . . . . 21 3.1 High level overview of A) SSuMMo annotation pipeline; and B) select post-analysis programs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2 Accuracy of SSuMMo assignments in SSU rRNA hypervariable regions. . 41 3.3 SSuMMo accuracy for antisense strands of hypervariable regions. . . . . . 42 3.4 Accuracy of SSuMMo compared with sequence length. . . . . . . . . . . . 44 3.5 Accuracy of Archaeal 16S rRNA sequences run through SSuMMo. . . . . 47 3.6 Counts of rRNA operon genes in Human Oral Microbiome Database. . . 50 4.1 Distribution of taxa up to genus specificity, present in the guts of 154 lean, overweight and obese individuals, pooled by sequencing method and BMI category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.2 Gut microbiota of 99 healthy vs. 25 IBD-suffering individuals. 4.3 Body Mass Index vs. Firmicutes / Bacteroidetes ratio. . . . . . . . . . . . . 70 4.4 Biodiversity analyses of Lean, Overweight and Obese individuals’ gut 4.5 . . . . . . 65 microflora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Genera identified in Japanese guts, from WGS experiment. . . . . . . . . . 75 x 4.6 Biodiversity indices calculated for geographical datasets. . . . . . . . . . . 76 4.7 Rarefaction curves of 66 healthy individuals’ gut microbiota. . . . . . . . . xi 77 List of Code Listings A1.1 A Python script to plot informational entropy of a DNA sequence . . . . A1.2 A shared module containing common plotting functions 87 . . . . . . . . . 88 A1.3 A Python script that calculates informational entropy from genomic DNA sequence files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 A1.4 A Python script to extract sequences of interest from the Clinical Production Pilot Study (PPS) of the NIH Human Microbiome Project . . . . . . 100 xii List of Accompanying Material SSuMMo Documentation API documentation, tutorial and installation instructions produced for the SSuMMo software package, described in chapter 3. This software was generated using Sphinx [Brandl, Georg, 2009], by extracting documentation directly from source code, as well as using additional reStructured Text (rst) files, written specifically for the purpose of documenting the source code packages. xiii Acknowledgements I wish to thank the following for helping me on my way. . . First and foremost, my supervisors James & Kelly, for their guidance, support and confidence in me, since I began researching in York. Members of my Training Advisory Committee: Gavin Thomas & Jon Pitchford, for their insight, encouragement and enthusiasm throughout my research. The Burgess family for their philanthropism and faith in science. Their generosity provided the funds necessary to conduct the research contained herein, as well as that of many other York biologists, both before and after my own. The Trustees of the Holbeck Trust, for their shared contribution to my research funding, as well as to others throughout the University. My eldest brother Ben, for his generosity and the support he offers many talented individuals. My family, friends and departmental colleagues, who are too numerous to mention individually, have helped me immeasurably in the struggle to remain diligent, determined and vivacious. Thank you, all! xv Declaration I declare that this thesis is a presentation of original work and I am the sole author. This work has not previously been presented for an award at this, or any other, University. All sources are acknowledged as References. Work in chapter 3 has largely already been published in Bioinformatics [2012] and presented at the Nordic Archaeal [2011a] conference. Work in chapter 4 has previously been presented at the Modelling & Microbiology [2011b] and SGM Autumn [2011c] conferences. Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2012). SSuMMo: rapid analysis, comparison and visualization of microbial communities. Bioinformatics, 28, 679–686. Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2011a). A novel method for phylotyping complex populations. Nordic Archaeal conference, Helskinki. Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2011b). Microbial community auditing with ss-RNA. Searching for a core microbial community in the guts of healthy humans. Modelling & Microbiology conference, Edinburgh. Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2011c). Searching for a core microbial community in the human gut microbiome. SGM Autumn conference, York. xvii For Ben, James & Kelly • 1 Introduction Microbial life pervades all reaches of the Earth. As our understanding grows, so too has its apparent ubiquity and number. From the bottom of oceans to clouds in the sky [Sattler et al., 2001; Vetriani et al., 1999], microscopic life persists where we can just visit. As more and more natural habitats are explored, so too do we acknowledge the unknown forms of life that inhabit them. As way of example, in a single gram of soil, it is estimated that there are up to twenty billion individual prokaryotes living therein [Whitman et al., 1998]. Of those, less than one percent of species are purported to be cultivable [Amann et al., 1995; Schloss & Handelsman, 2006]. The importance of microorganisms on Earth cannot be overstated. In their conquering of the globe billions of years ago, it was they who formed the atmosphere that we now require to live [Kasting & Siefert, 2002]. It was they who first learned how to harvest energy from the sun, how to sense, swim [Blair, 1995] and even to communicate with one another [Williams et al., 2007]. In a sense, to learn about microorganisms is to learn about ourselves. In manipulating microorganisms, we can create fuel and medicine; food and drink; life and death. Since Koch’s postulates were founded in the 19th century [Falkow, 2004; Koch, 1890], isolating microbes in pure culture has historically been one of the first steps taken in attempting to understand a microorganism. In doing so, physiological and phenotypic observations are made, providing knowledge of the organism in question. Since this is 1 recognised as impossible for a vast majority of organisms in natural environments, new culture-independent methods have had to be developed. Many of these methods consist of refinements, improvements and miniaturisation of DNA sequencing technologies, used to determine the genetic information contained within and passed down between generations of living organisms (see section 1.4). 1.1 Aims This thesis begins with exploring how methods of microbiological inquiry arose and have developed in human history, from identification of the first signs of microscopic life, to the latest technologies used to inspect them. Computational tools were developed to assist with analysing and visualising datasets resulting from such high-throughput sequencing experiments and are presented herein. User manuals and library documentation, produced as part of the software development process, are attached separately. The overarching goals of the project are to create helpful and informative computational tools, to assist with identifying and characterising microbes in complex environments. As sequencing experiments become increasingly large and frequently created, it is the aim of this project to create tools that may prove invaluable, in future analyses of high-throughput sequencing data. 1.2 Motivation Genetic sequencing has impacted and affected virtually all branches of contemporary biology [Shendure & Aiden, 2012]. The scale at which the technology has developed over the last decade has been unparalleled, in terms of speed, capacity and resolution [Mardis, 2011; Metzker, 2009; Shendure & Ji, 2008]. Conversely, the cost of sequencing has 2 seen a rapid decline, shifting the main financial burden of sequencing experiments away from generation of the sequence data itself, to practically every other stage of the process: from collection of samples to storage of the resulting data [Shendure & Aiden, 2012]. The technical challenges of sequencing experiments have seen a similar shift, resulting from the dramatic increase in dataset size. One of the remaining technical difficulties regards manipulating resulting sequence data to provide meaningful insight from the sheer quantity of genetic information produced [Nielsen et al., 2010]. Not only is technical knowledge and skill required in using one of the many computational tools available, but a huge amount of computational power and time is necessary to process the sequence data [MacLean et al., 2009; Pop & Salzberg, 2008]. One of the many consequences of the sequencing revolution is the increased range and scope of natural environments that can be investigated. While sequencing originally had very limited coverage (see section 1.4), it is becoming increasingly common for experiments to produce gigabases of DNA at a time (e.g. [Hess et al., 2011; Qin et al., 2010]), with this upward trend unlikely to stop any time in the near future. Although 100% genomic coverage is unlikely to be obtained from such densely populated environmental samples, the amount of raw data generated from single experiments has still managed to overwhelm public data warehouses, to the extent that the National Centre for Biotechnology Information (NCBI) announced in 2011 that, because of budget constraints, they would at some point have to stop supporting the Trace and Short Read Archives (the ‘SRA’ - since renamed the “Sequence Read Archive”) [Galperin & Fernández-Suárez, 2012]. Due to public demand, the NIH has since changed their stance and has decided to continue funding the SRA, keeping in line with other consortia who comprise the INSDC [Nakamura et al., 2013]. Although the NIH’s budget has only rarely seen decreases in its annual budget since the 1970’s [Loscalzo, 2006], the recent technical innovations in DNA sequencing have 3 been improving faster than computer technologies have been able to keep up [Rothberg et al., 2011]. Solutions to this problem include continually increasing the allocated budget for computational infrastructure used to both analyse and store this mass of sequence data. Another aim is to improve upon and develop new software for the job of both data processing and storage [Fritz et al., 2011; Richter & Sexton, 2009]. 1.3 The First Signs of Microbial Life Biology has one of the longest and most illustriously documented histories in scientific literature. Microbiology was a relatively recent introduction to the discipline, but can be traced through the pages of history equally well. But what is a micro-organism? How can they be identified and how, can they be told apart? These questions will be answered here in the context of some important historical discoveries, before applying some classical methodologies to contemporary datasets. Nowadays, microbiological methods are used in a plethora of theoretical and applied science, ranging from improving human health [Mitsuoka, 1990], to its detriment [Wheelis, 1998]; from biofuel production [Holder et al., 2011] to atmospheric cleansing [Falkowski et al., 2008]; from manufacture of food and drink [Leroy & De Vuyst, 2004] to the processing of waste [Tsai et al., 2007]. The use cases of microbes are now so widespread that it is a wonder how the human race lived without recognising their existence for so long. So when did the human race first become aware of microbial life? “Microbe” and “microorganism” are fairly common terms nowadays, so a good place to start might be the Oxford English Dictionary [2013], which contains entries and etymological records for both: microbe, n. An extremely small living organism, a microorganism; esp. a bacterium causing disease or fermentation. 4 microorganism, n. An organism so small as to be visible only under a microscope; esp. bacterium, fungus, or alga. For linguists and scientists alike, the common Greek ‘micro’ prefix is indicative of something too small to see with the naked eye, exactly what the above dictionary definitions imply. This would also explain their relatively recent introduction to the English language. The first known uses of each word date only back to 1880 [Holden, 2013], although microscopy had been practiced in England since the 17th century, when Robert Hooke published Micrographia [1665], his notorious, illustrated book of observations made under the microscope. From this publication, Robert Hooke is recognised as the first to give a detailed description of a microorganism; likely a fungus of the common Mucor genus [Gest, 2004; Orlowski, 1991]. But it wasn’t until the next decade that the Dutch shopkeeper Antonie van Leeuwenhoek first described unicellular microorganisms. In letters written in Dutch to the Royal Society of London, he described what later became known to be protists, as ‘animalcules’ or ‘little eels’, ‘very prettily moving’ in pepper-infused water [Gest, 2004; Mazzarello, 1999; Porter, 1976; Smit & Heniger, 1975]. The fact that they were motile was indication enough that they were alive, but little more insight could be learned about microorganisms until two centuries later. This is understandable when considering the accepted philosophies of the period, as well as the technical achievement of constructing a microscope in the 17th century. Both Hooke and van Leeuwenhoek had to make their microscope components themselves and van Leeuwenhoek chose to keep his methods a close-guarded secret [Gest, 2004; Porter, 1976]. Other than morphological and physiological observations made under the microscope, it wasn’t until the 20th century that micro-organisms could be distinguished by more specific means. The 19th century did herald a series of novel techniques for isolating, culturing and distinguishing certain bacteria based on physical appearance [Barnett, 2003; 5 Drews, 2000], but it still required more theoretical, philosophical and technical advance before microorganisms could be distinguished by any quantitative means. Even macroorganisms - those lifeforms visible with the naked eye - which had been categorised based on physiological properties since Aristotle (c. 384-322BC) [Gaarder, 1991] - could be given no quantitative measure of relatedness until the 20th century. 1.3.1 Darwin’s struggle Of course it was Darwin’s On the Origin on Species [Darwin, 1859] that provided some of the first evidence for a theory of evolution, but it took time for this to become accepted. Philosophers of the day were said mostly to be of the ‘essentialist’ school of thought, which fundamentally contradicts the idea of evolution [Mayr, 1982]. Essentialism was introduced by the well-renowned philosopher Plato (c. 428-437BC), a faithful student of Socrates, whose ‘theory of ideas’ attempted to explain how individuals could be of the same species, yet each individual of a species be different. Plato supposed that for every type of thing that exists, be it living or otherwise, each has an eternal eide, or ‘essence’, of which we perceive only imperfect manifestations. The essences would exist only in the ‘world of ideas’, a place both eternal and immutable [Gaarder, 1991], while the observable forms exist in the natural, sensory world. New species would therefore be an impossibility, as a species’ ‘essence’ could not change or be created in the eternal world of ideas. This theory, dubbed the “dead hand of Plato”, might explain what took mankind so long to accept the theory of evolution [Dawkins, 2008, 2009; Mayr, 1959]. Ideas can evolve and so too, can species. After 2,000 years of Platoan, essentialist thought and this began to be accepted. Darwin’s famous voyage on the Beagle provided ample evidence supporting evolution, with natural selection as the mechanism in life’s struggle to survive. But the conclusions his evidence led towards were hard for many to accept, not only the ‘essentialists’, but creationists too [Dawkins, 2009]. Perhaps the most 6 astonishing conclusion, was that species on Earth are related, in a family tree that spans at least the entirety of macroscopic life [Glansdorff et al., 2008; Woese, 1998]. At the turn of the 20th century, this was still far from accepted, however. The mechanisms by which to understand heredity were still a long way off, and a biological mechanism for evolution equally so. Only once these were discovered and understood, could a method to measure the relatedness of species be found. It took another half-century for the necessary breakthroughs to arrive, but the insight gained from Darwin’s work allowed a new dawn of biological thought. 1.3.2 A (re-)revolution of biological philosophy According to Mayr [1959], a shift in thought away from essentialism led to ‘populationism’, where types are not real, but are instead only averaged abstractions of individuals’ characteristics [Dawkins, 2008; Sober, 1980]. The theories are directly controvertible, as Plato’s earlier philosophies assume the observable, sensory world we live in consists of abstractions from eternal forms, whereas “for the populationist, the type (average) is an abstraction and only the variation is real” [Mayr, 1959]. Evolutionary theory undermines the assumption in essentialism that species are static in nature, instead enforcing uniqueness of individuals, concordant with Mayr’s populationism [Bradshaw, 2001]. This was an age-old argument dating again back to Aristotle, who was the first to challenge Plato’s theory of ideas, claiming: “every change in nature [. . . ] is a transformation of substance from the ‘potential’ to the ‘actual’” [Gaarder, 1991]. So why then, did Plato’s earlier philosophies dominate Aristotle’s up until the 19th century? The reason may have been the so-called ‘neo-platonism’, said to have been re-introduced into Western philosophy by Plotinus (c. 205-270), who brought Plato’s theory of ideas from Alexandria to Rome, merging Plato’s theories into common theological beliefs regarding an eternal soul [Gaarder, 1991]. Over 500 years after Aristotle, Western philosophy could be said 7 to have taken a step backward: a disputed philosophical reasoning was merged with theological belief, simultaneously strengthening both modes of thought and enforcing a preconception against evolution. A key consideration in both Aristotelian and Darwinian theory, but missing from Platonic, is time. Darwin understood that evolution in the visible world could only be valid if physical changes occurred over “geological time-scales” [Gould, 1983]. Although Aristotle wasn’t privy to the same information as Darwin when it came to geological timescales, change of state is fundamentally a function of time. Furthermore, it remains that what is ‘actual’ is only a subset of nature’s ‘potential’; natural environments dictate what life has ‘potential’ to succeed, but we can only observe what actually has. Another re-popularised concept in Aristotelian philosophy during the biological renaissance of last century, was the argument for a Primum Mobile - a “prime mover” causing all motion in the universe. One of the key ideas here was that “every motion must ultimately be traceable to an unmoved mover” [Bradshaw, 2001]. This statement necessitates time in its definition: the unit of motion being speed, of which both time and distance form a direct relation. These units (time, rate, distance) have also been adopted by evolutionary biologists (e.g. [Kimura, 1981; Tamura et al., 2011]), but before this adoption, physicists had unwittingly demonstrated Aristotle’s “unmoved mover” by estimating an age for the universe, tracing time all the way back to the Big Bang, by theorising, measuring and finally confirming a rate for the universe’s expansion [Silk, 1999]. Max Delbück was keen to apply the Primum Mobile to biological processes, and managed to do so, once it was understood that DNA acted as an unmodified template for protein synthesis. In 1935, Delbück initially struggled to apply this physical concept to biological processes [Delbrück, 1935; Stent, 1968], but revisited the idea in later years [Delbrück, 1971], claiming that it was in fact Aristotle who first conceived the DNA 8 principle: “the ‘unmoved mover’ perfectly describes DNA. It acts, creates form and development, and is not changed in the process” [Kay, 2000, p. 38]. 1.3.3 The Hereditary mechanism Heredity had already long been observed by the time Darwin published his works [Gould, 2002], yet no-one had until then provided evidence as compelling or voluminous as in On the Origin of Species. Through rigorous experimental and statistical analyses, the century that followed flourished with studies on Eukaryotic progenial and ecological phenomena. Microbiology was still fairly limited to physiological observations made under the microscope, but biochemical methodology had by then progressed to allow qualitative distinction between categories of bacteria, through Gram-staining techniques [Brock, 1999; Gram, 1884]. It wasn’t until the 1950s that progress in physical sciences provided determination of the fundamental structures of reproduction and heredity, but through deductive reasoning and application of known, physical law, a minimal mechanism for hereditary transfer was theorised as early as 1944, by the renowned physicist Erwin Schrödinger [Stent, 1968]. In his Dublin lecture series, later published as a short book entitled What is Life? [1944], Schrödinger admitted at the offset that physical and chemical knowledge of the day could not account for all events occurring inside a living organism, but conversely, he disputed that the phenomena of life could not be accounted for by those sciences. Such orderliness as is found in nature, he noted, could still obey the laws of thermodynamics1 , by drawing on surrounding “negative entropy”. Until then, no reasonable explanation had been given as to how life seemed to contradict the fundamental laws of thermodynamics, by its avoiding decay to equilibrium. The key metaphor Schrödinger chose, when postulating chromosomal structures as 1 The 2nd Law of Thermodynamics states that a closed system will tend towards maximum entropy. 9 ‘aperiodic crystals’1 , was that of a “Morse-like code script” [Kay, 2000; Stent, 1968, p. 61-62]. In subsequent decades, the code-script metaphor was revisited and redefined in the context of information transfer, a concept not cemented in genetics until after Henry Quastler’s efforts to apply Shannon and Weaver’s communication theory [Shannon, 1949; Shannon & Weaver, 1949] to biological phenomena [Dancoff & Quastler, 1953; Kay, 2000, p. 118]. Interestingly, both Schrödinger and Shannon had separately arrived at almost identical mathematical formulae (equations 1.1 and 1.2, respectively) to describe their respective systems: Schrödinger’s describing the amount of order extracted from an environment into a living system; Shannon’s describing the information content in a message. The relationship between the two was perhaps most simply described by Norbert Wiener: “Just as the amount of information in a system is a measure of its degree of organization, so the entropy of a system is a measure of its degree of disorganization” [Wiener, 1948]. −(entropy) = k ⋅ log 1 D (1.1) where D denotes “a quantitative measure of the atomistic disorder of the body in question”. Equation 1.1: Schrödinger negative entropy n H = −K ⋅ ∑ p i ⋅ log p i i=1 (1.2) where K “merely amounts to a choice of a unit of measure”; p i denotes the probability of a symbol within a message; p i ⋅ log p i a defined sample. Equation 1.2: Shannon informational entropy The significance of these formulae has impacted not only the fields for which they were originally intended (genetics and communication theory, respectively) but also many 1 As opposed to periodic (repetitive) crystal structures found in inanimate objects, aperiodicity reflects an elaborate non-uniformity in structure. 10 H 1.0 1.0 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 GC ratio GC ratio (a) Simulated Shannon Entropy of DNA. (b) Shannon Entropy of 2,087 genomes. Figure 1.1: Shannon Entropy of DNA. The Shannon relative entropy was computed for DNA, in a simulation based on the full range of GC ratios (a), and calculated for a number of complete genome sequences downloaded from NCBI (b). Plasmids and incomplete genomes were excluded. others, including: cryptology [Ahmadian et al., 2010], machine-learning [Elias et al., 2004] and ecological diversity studies [Magurran, 2009]. To illustrate Shannon’s formula within a genetic context, figures have been plotted to show the informational entropy contained within currently available genomes (Figure 1.1). Source code used for plotting these figures is also provided (section A1.1). 1.3.4 Distance of difference The 1950s held some of the most significant discoveries in the history of biology. At the start of the decade, the first genetic metric of species difference had (albeit unknowingly) been experimentally demonstrated. Retrospectively named ‘Chargaff ’s Rule’, a striking discovery was made with respect to nucleic acids: molar ratios of purine:pyrimidine, adenine:thymine and guanine:cytosine, all approximated unity [Chargaff, 1950; Kay, 2000, p. 57]. Whilst smashing the ‘tetranucleotide hypothesis’ 1 , a global shift in research followed, 1 The presumption that all nucleotides were present in equimolar proportions, precluding nucleic acids as carriers of hereditary information. 11 targeting nucleic acids (as opposed to the earlier misconception of proteins) as the key hereditary material [Cobb, 2013; Kay, 2000, p. 55-57]. Even to present day, organismal GC ratios provide a standard metric for distinguishing organisms based on overall genetic content (e.g. Albertsen et al. [2013]). The discovery of DNA’s helical structure in 1953 [Watson & Crick, 1953] was another landmark event in biology; finally a physical structure for the hereditary material was known! But similar to public expectation following the first human genome sequence, it took a lot longer than anticipated for the promises of the result to be fulfilled. It wasn’t until 1961 that Marshall Nirenberg and Heinrich Matthaei published the results from their famous “poly-U” experiments [Matthaei & Nirenberg, 1961; Nirenberg & Matthaei, 1961], providing the first ‘translation’ of a nucleic acid codon to an amino acid residue [Kay, 2000, p. 251-252]. Following the race to ‘crack the code’ in the 1950s and 60s, the next marked improvement to a genetic metric of species difference wasn’t demonstrated until 1977 [Woese & Fox, 1977]. Even though decades later, Carl Woëse’s choice of using SSU rRNA as a phylogenetic marker gene was described as a “prescient” prediction by Pace [2009]. 1.4 Genetic Sequencing since the 1970s DNA sequencing is said to have started in the 1970s, when in 1972 recombinant DNA technology first emerged [Jackson et al., 1972], and then three years later Sanger published his novel, notorious chain-termination sequencing method [1975]. Sanger’s was the first reliably reproducible and relatively safe and easy method of determining the order of nucleotide residues in DNA sequences, compared with Maxam and Gilbert’s chemical equivalent [Maxam & Gilbert, 1977]. Now, high-throughput (or next-generation) genomic sequencing has arrived, bringing various innovative and competing technologies, which 12 are essential for projects like the 1000 Genomes project [Siva, 2008], whose aim is to produce a diverse set of 1,000 anonymous human genomes within 3 years; the $1,000 genome ideal; and of course for the procurement of invaluable knowledge and insight. In 2007, two notorious geneticists were the first to have their genomes sequenced and publicised: J. Craig Venter and James Watson [Wadman, 2008]. Having once been supervised by Watson, Venter later admitted having had a ‘love-hate’ relationship with his former mentor [Wolinsky, 2007]. He made his genome available without publication, just 9 days before James Watson was due to receive his at a ceremony organised specifically for the occasion. Venter produced his genome at an estimated cost of roughly USD 70 million [Metzker, 2009], whereas Watson’s was quoted by a Vice-President of 454 Life Sciences as costing “well under USD 1 million” [Wolinsky, 2007]. If that seems expensive, what about the first complete human genome sequence? Taking 13 years to complete, it was released in 2003 as a collaborative multinational effort from over 20 different organisations, it summed to a total of USD 2.7 billion [National Human Genome Research Institute, 2003]. Although more human genome sequences have since been produced, as of 2009, this first human genome is still considered to be the only finished-grade1 human genome [Metzker, 2009]. Still, the technology has not reached a point where we can be satisfied. In 2006 Archon announced a huge prize in the field of genomic sequencing: if a team can generate 100 high-quality human genomes in under 10 days for less than USD 10,000 per genome, then that team would be awarded USD 10 million [Kedes & Liu, 2010]. The prize was short-lived however. Before it could be awarded, the competition was cancelled, as the organisers considered that iterative improvements to existing sequencing technologies were advancing rapidly towards the competition’s goal, without producing any significant technological breakthroughs, which the competition was designed to incentivise 1 A finished grade, or “finished genome”, represents a high-quality genome with more of the genome covered than in a “draft genome”, with fewer sequence errors and gaps. 13 [Diamandis, 2013]. As sequencing technologies continue to emerge and compete with one another, there is currently no “best” or standardised method when it comes to next-generation sequencing (NGS). Instead, “the potential of NGS is akin to the early days of PCR, with one’s imagination being the primary limitation to its use” [Metzker, 2009]. Without delving into the biochemical fundamentals of these technologies (which are covered in a range of excellent review articles e.g. [Fuller et al., 2009; Metzker, 2009; Rothberg & Leamon, 2008]), some of the targeted applications of NGS already include:• Seq-based methods e.g. ChIP-Seq, which is used to study interactions between protein and DNA [Park, 2009]; • Genome-wide Association Studies, to find genetic traits associated with undesirable phenotypic traits such as disease [Hirschhorn & Daly, 2005]; • Resequencing of specific genomic target regions to search for genetic variants and; • Taxonomic and functional metagenomic profiling [Segata et al., 2012]. Taxonomic metagenomic profiling is the application for which computational tools have been developed as part of this thesis. When de novo sequencing methods are applied to organisms within a complex microbial community, it is a huge challenge to associate DNA fragments with the species from which they derive. Various computational methods have been and are continuing to be developed for the purpose of identifying species however [McHardy & Rigoutsos, 2007]. As well as trying to identify species present in a habitat and reconstructing entire genomes from a complex community, it is also informative to determine statistics relating to the diversity of species present and the abundance of each species found. With the increasing number of publicly available whole genome sequences, it is probable that 14 there is a representative whole genome for each known family found within a mixedcommunity sample. As of September 2013, NCBI offer over 7,000 prokaryotic genomes, whereas the ‘List of Prokaryotic names with Standing in Nomenclature’ (LPSN) includes just 337 families, 2,393 genera, and 12,391 different prokaryotic species [http://www.bacterio. net/-number.html#total (accessed 2010)]. Of course, the number of currently available whole genome sequences varies widely between different taxa. For example, Ensembl offers 33 whole genome sequences for the different strains and substrains of the model bacterial species Escherichia coli and none for many other genera, while many other species are underrepresented [Dini-Andreote et al., 2012]. The coverage of available fully sequenced species presents obvious gaps and misrepresentations of species abundance and diversity naturally present on the planet, something that will need to be considered when working with metagenomic experiments. Taxonomic metagenomic profiling works with the Roche / 454 Life Sciences sequence assembler on the basis that primers can be designed to target a specific conserved region or gene in a multitude of organisms. The conserved gene of choice, that has become what some called just a few years ago the “gold standard of phylogenetic taxonomy, and the most accurate” [McHardy & Rigoutsos, 2007], is 16S small subunit ribosomal RNA. The historical reasons for 16S rRNA becoming the target gene of choice - according to the same authors [McHardy & Rigoutsos, 2007] - include: • Its presence in almost all bacteria, often existing as a multi-gene family, or operon; • The function of the 16S rRNA gene over time has not changed, suggesting that random sequence changes are a more accurate measure of time (evolution); and • The 16S rRNA gene (1,500 bp) is large enough for bioinformatics purposes. These reasons have led to make 16S rRNA the gene of choice for phylogenetically classifying a prokaryotic species and it has been a universally robust method of identifying 15 a species or associating it with a taxa. However, I disagree with the assumption that just one gene out of thousands in a genome can alone give a fair and comprehensive representation of the evolutionary distances, in terms of time, between different species. It has been reported [Wu & Eisen, 2008] that aligning, trimming concatenating multiple conserved genes within and organisms results in much higher resolution trees in terms of evolutionary distance, than when using just a single gene. This increased resolution is a result of the fact that more sequence mutations will be present with more residues, making the evolutionary distances in any distance-calculating algorithm become more profound and better separable. That said; the aims of my project do not include defining or redefining taxonomic nomenclature or relationships between taxa. Instead I aim to provide tools with which to represent species diversity and abundance within a sample using existing and standardised nomenclature and topologies. This is a purpose for which SSU rRNA sequences are ideal. The number of publicly available SSU genes far surpasses the number of sequences determined for any other single gene, since Stackebrandt and Goebel first suggested [Pruesse et al., 2007; Stackebrandt & Goebel, 1994] viability as a phylogenetic marker in 1994 . The ARB database houses over a million aligned SSU sequences of various qualities (quality in this sense summed through a combination of sequence length and number of predicted sequencing errors / gaps) and nearly half a million high quality sequences with minimum length of 900 residues [Pruesse et al., 2007]. All the high quality sequences are provided with their individual topologies down to genus level, and this makes for a perfect training set for supervised classification of sequences into their most likely operational taxonomic unit (OTU - meaning any node or leaf on the tree of life, from kingdom down to subspecies). What databases seem to lack however, are programs designed specifically to find the closest fully sequenced relatives to species found in a sample from a metagenomic 16 experiment. It’s at least a new thought concept, and no robust method has been decided on as a platform of choice. 17 2 System and Methods Many methods have been developed to compare biological nucleic and amino acid sequences in order to quantify their differences. There are alignment-based and alignment-free methods, the former requiring sequence alignment a priori, in order to quantify differences. Alignment-based methods commonly assume that sequences are somewhat homologous and share contiguous similarities [Castresana, 2000; Feng & Doolittle, 1987], allowing for some genetic mutations but failing to accommodate more unrelated sequences [Blaisdell, 1989; Vinga & Almeida, 2003]. Multiple sequences that share enough similarity for alignment-based methods can often benefit from better accuracy [Höhl & Ragan, 2007] and can be utilised for a number of goals, including reconstructing phylogenetic trees, predicting structure, predicting function and more [Kemena & Notredame, 2009]. When comparing sequences that share little or no similarities, alignment-free methods have been employed, generally relying on counting word-frequencies [Pham & Zuegg, 2004; Vinga & Almeida, 2003; Wu et al., 1997], although other methods based on complexity do exist [Almeida & Vinga, 2006; Almeida et al., 2001; Li et al., 2001]. Over a hundred different sequence alignment implementations have been produced in the last few decades alone [Kemena & Notredame, 2009], yet as de novo sequencing technologies develop, new alignment methods will be needed to scale with increased dataset sizes [Li & Homer, 2010]. There are a number of reviews that cover in detail the 19 different alignment-based similarity metrics [Li & Homer, 2010; Notredame, 2007]. Here, one category is focused upon and employed for our own methods: those based on Hidden Markov Models. 2.1 Hidden Markov Models in Biological Systems Hidden Markov models have successfully been applied to many aspects of biological sequence analysis, including alignment [Krogh et al., 1994], database searching [Eddy, 1996; Finn et al., 2010], reconstructing phylogenetic trees [Siepel & Haussler, 2004] and predicting higher order structures [Asai et al., 1993; Bystroff et al., 2000; Söding, 2005]. When working with many orthologous sequences that have the same function, constructing profile HMMs can be useful for performing all these types of analyses. The process of creating a profile HMM is alignment-based, but a multiple sequence alignment (MSA) need not first be created in order to create one [Eddy, 1996]. However, creating a high-quality seed alignment a priori is known to create better profile HMMs [Bateman et al., 1999]. So what is a Hidden Markov Model and how does it work? A profile hidden Markov model is a probabilistic representation of a set of sequences that can be used to calculate confidence scores for sequences being described by that model. Multiple sequences are modelled as Markov chains, insofar that each residue’s confidence score is independent from adjacent residues’ identity [Eddy, 1996]. Profiles can represent any number of homologous sequences without increasing in size, as the only required information are probabilities for every metric described by the model, whose number relates to the number of columns in an MSA, not the number of sequences in it. Traditional MSA profiles are based on position-specific scoring matrices (PSSMs), where residue probabilities are calculated merely from their occurrence in available sequences [Edgar & Sjölander, 2004]. HMMs go further by also considering 20 P(t0|e1) B P(t0|B) P(t1|e2) P(t1|e1) e1 P(ex|e1) e2 P(t2|e3) P(t2|e2) P(ex|e2) e3 P(ex|e3) P(tn-1|en) P(t3|e3) en P(tn|en) E P(ex|en) Figure 2.1: A simplified schematic of an Hidden Markov Model. Shown is a basic architecture of an Hidden Markov Model, with begin and end states shown in circles and emission states represented as diamonds. State emissions and transmissions are represented by arrows, with each having an associated probability. transition probabilities. Figure 2.1 shows a simplified representation of how an HMM might be designed to model a set of sequences. It is overly simplified for the purpose of representing biological sequences, for reasons discussed below, but contains the basic ingredients for an HMM: states, transitions and symbol emissions. Each state has an associated set of transition probabilities P(t⋃︀e i ) that describe the possible paths that can be followed from it. In this simplified model, each model state is an emission state that can only follow one of two paths: one that returns to itself, the other transitioning to the next emission state. Emission states are so called because each time the path through the model reaches an emission state, a symbol x is emitted from a defined alphabet with K different symbols. Each emission state has its own set of probabilities associated with each symbol in the alphabet. Both the sum of all transition probabilities and the sum of all symbol emission probabilities from a particular state must separately equal one. That is: ∑ P(t⋃︀e i ) = 1 and ∑ P(e x ⋃︀e i ) = 1. n P(S, π⋃︀HMM, θ) = ∏ P(e x ⋃︀e i ) ⋅ P(t⋃︀e i )) i=1 (2.1) Equation 2.1: Probability of a sequence S being emitted by a profile HMM with parameters θ , by taking state path π . 21 The product of all emission and transition probabilities equals the probability that a sequence S took a particular path π through the model (Equation 2.1) [Eddy, 1996; Krogh et al., 1994]. The hidden aspect of an HMM rises from not knowing the state and transition path through the model, even when a sequence has been aligned to it. This is because a single sequence could potentially be created by many different paths through the same model. The probability that a sequence is actually described by that model is therefore the sum of all possible paths through the model that can produce that sequence [Eddy, 1996, 2004; Krogh et al., 1994]. As mentioned above, the model in Figure 2.1 is over-simplified for the purpose of biological sequence analysis. Other states need to be considered in the model, to encompass different evolutionary phenomena. The Plan 7 architecture of HMMs also includes symbol insertion and deletion states [Eddy, 1998], which model sequence mutations of the same name. These extra states increase the number of both transition and emission probabilities in a model, as insert states have their own symbol emission probabilities and both have their own allowed transition paths. When building a profile HMM, all model probabilities are calculated from the training sequences. By creating a multiple sequence alignment, position specific probabilities can be calculated purely from their observed frequency. However, this would potentially overfit the data, so to accommodate unseen sequences and avoid overfitting the data, mixture Dirichlet priors are usually applied to observed symbol distributions [Eddy, 1998; Krogh et al., 1994]. 2.2 Building the SSuMMo database of HMMs Taxonomy information was parsed from the sequence headers of ARB ‘tax’ sequence datasets, to create a traversable Python object representing sequenced representatives of 22 the tree of life. Due to the size of the uncompressed sequence file (60 GB), an index of sequence locations was created and saved, while simultaneously associating sequence IDs with their relevant species in the Python object model (see section 2.8). The ARB Silva [Pruesse et al., 2007] reference alignment of SSU rRNA sequences was made compatible with HMMER, and sequences with gaps or errors were removed. The sequence alignment file was also split by domain, with each produced file processed to remove alignment columns which are gapped in 100% of the domain’s sequences. HMMs were trained by all sequences selected from the alignments that are members of each taxonomic group, and were saved in a directory structure created according to ARB’s taxonomy (see section 2.9). The model building program (dictify.py) was designed to use a dynamic number of hmmbuild subprocesses that can be used to dramatically accelerate this building stage. 2.3 Associating names with taxonomic rank A Python program (link_EMBL_taxonomy.py) was developed to load the latest NCBI taxonomy database and link the taxonomic IDs and ranks to as many ARB taxon names as possible, keeping the associations in a MySQL database. The script automatically downloads and extracts the latest NCBI taxonomy database and loads selected rows (where NameClass = ‘scientific name’) and columns (tax_ID, name, UniqueName) from the included ‘names’ table into a local MySQL database. All rows were loaded from the nodes table, but only columns: tax_ID, parent_tax_ID and rank. New tables for Prokaryotes and Eukaryotes were populated with the ARB taxonomic structure, taking taxon name, parent name, associated NCBI taxonomic ID and rank, wherever the ARB’s OTU name/parent name combination uniquely matched. Non-unique name/parent name combinations were inserted into a separate table ‘NonUniques’ and all IDs recorded. If no match was found for a node, it was given a taxonomic ID of 0 and rank ‘unknown’ (see 23 section 2.10). 2.4 Assigning novel sequences to taxa Each query sequence gets scored against profile HMMs in the SSuMMo database one node at a time, choosing the best scoring child of each node as the most probable taxon that the sequence has derived. Starting from the top, each sequence is compared and scored against six profile HMMs: HMMs trained from forward and reverse-transcribed Bacteria, Archaea and Eukaryota SSU rRNA sequence alignments. Each query sequence is assigned to the model that returns the highest bit score, according to HMMer v3.0’s hmmsearch program. SSuMMo.py continues to recursively traverse the taxon hierarchy, scoring sequences against all HMMs that are direct children of the previous round’s assigned taxon. If at any node there are multiple taxa resulting in the same bit-score, SSuMMo will recursively score against all subsequent children from all these equal top-scorers until a unique winner is found. When a clear winner cannot be found, the program will assign the sequence to the last taxon with a unique top-score. 2.5 Accuracy Testing 2.5.1 HMM Testing Several scoring and model training mechanisms built into hmmbuild were tested to see what effect they had on overall accuracy. HMMs were built using the hmmbuild’s default model-building options, but HMMs were also built and tested with the --wgiven option. Several different search modes provided with hmmsearch were also tested, including --max, --nobias and --nonull2 options. --wgiven calculates the probability of observing residues in each position directly from the training alignment, whereas by 24 default residue probabilities are calculated with a Dirichlet-prior weighting mechanism. --max and --nobias options affect model sensitivity and acceleration heuristics, and --nonull2 affects the scoring procedure by turning off score corrections based on biased residue compositions. 2.5.2 Sequence length versus Assignment Accuracy The 144 NCBI Archaea sequences were used to test how sequence length affects accuracy of taxon assignment. The full and partial length sequences were shortened at the 3- end of each sequence by five residues at a time, ensuring that all sequences had identical length, i.e. shorter sequences were removed from the dataset until their sequence lengths were at least the length being analysed. Sequence lengths spanning from 34 to 1,509 bases were scored, and NCBI annotations compared with SSuMMo taxon predictions to calculate percentage accuracy according to length (section 2.10, Figure 3.4). 2.5.3 SSU rRNA hypervariable region accuracy SSU rRNA hypervariable regions were detected and extracted using Vxtractor [Hartmann et al., 2010]. Sequence datasets were synthesized as if primers had been designed to target regions adjacent to each hypervariable region, by extracting sequences of a user-defined length either from the 5- end or up to the 3 -end of each hypervariable region. Five residues were removed at a time from the opposite end of each sequence window, and the percentage genus accuracy was noted at lengths between 500 and 35 residues ( Figure 3.2 and 3.3). 25 2.6 Optimizing SSuMMo for speed A test set of 144 full-length Archaeal rRNA sequences, downloaded from the NCBI ftp servers (ftp://ftp.ncbi.nih.gov/genomes/TARGET/) was used for benchmarking. SSuMMo v0.0.1 worked on a one-to-one basis, parsing one sequence at a time and scoring that sequence against a single profile HMM using hmmsearch. SSuMMo v0.0.2 worked on a many-to-one basis, perceived as such because all sequences are scored against a single model at a time, again using HMMer v3.0’s hmmsearch. SSuMMo v0.0.3 was built with a many-to-many sequence-model comparison in mind, by using HMMer v3.0’s hmmscan. In order to use hmmscan, the SSuMMo database had to be modified to include ‘pressed’ collections of HMMs. In order to facilitate this database update, dictify.py was extended to optionally use hmmpress on all HMMs at a given node. Upon updating the database, SSuMMo v0.0.3 was updated to use hmmscan, scoring all sequences at a node to that node’s pressed collection of HMMs in a single program call. The aforementioned set of 144 sequences were used to test all versions of SSuMMo and times taken for analysis compared (data not shown). SSuMMo v0.0.2 was found to be the quickest implementation and was selected for further development to utilize multiple processors. 2.7 Comparative metagenomics A Python program (comparative_results.py) was written to combine SSuMMo results files and show community differences in terms of diversity, ubiquity and abundance. Phyloxml formatted trees can be exported and programmatically uploaded to ITOL [Letunic & Bork, 2006], with delimited data files showing population structure and community differences, which can be co-represented on cladograms as multi-value bar 26 graphs. (e.g. http://itol.embl.de/external.cgi?tree=226561982157513085564600). Multiple sequence files can be grouped and the ubiquity of species across each group exported as tabular form or ITOL representation as heatmaps. The user also has the option of programmatically downloading the tree again in any of the formats ITOL allows to be exported (pdf, jpeg, etc.; see Appendix I). 2.8 Assigning Training Sequences to Taxa The ‘tax’ datasets provided by the ARB Silva database contain unaligned reference sequences for ribosomal RNA, which are annotated to species recognised in their taxonomy database. The full taxonomic lineage is contained in each sequence header, and this was used to create a multi-dimensional Python object, as an hierarchical mapping to the tree of life. The sequence accessions (unique identifiers) were parsed from the sequence headers and stored in the taxon instance at the bottom of the lineage. In this initial pass-through of the sequence file, dictify.py also remembers the byte location of each sequence, and stores these in a separate dictionary mapping of accessions to byte locations, as this was found to significantly improve performance when later retrieving sequences from files too big to store in memory. All the above can be done with a single program call:$ dictify.py --indexTaxa SSURef_<version>_tax_silva.fasta This creates a .pkl file which holds the taxonomic hierarchy and training data accessions, as well as a .pklindex file, which stores the byte locations of each sequence in <ARB_tax_file>. 27 2.9 Training the Database of HMMs ARB release 104 of aligned SSU rRNA sequences was first rewritten with dictify.py, using the ‘--rewrite’ option. ARB sequence alignments contain both ‘.’ and ‘-’ characters, which is incompatible with hmmer. A ‘.’ in the middle of a sequence signifies missing or unknown residues, whereas a ‘-’ signifies a known insertion or deletion. Sequences are also padded with leading and trailing ‘.’ characters. dictify.py was thus used to remove sequences with ‘.’ characters in the middle and to convert all leading and trailing ‘.’ characters to an equal number of ‘-’ characters. Bacteria, Archaea and Eukaryote sequences were then separated and in the next step, alignment columns which were gaps in every sequence within the relevant alignment file were removed. This is performed in two calls to dictify.py, by using the subcommands: ‘--splitTaxa’ and then ‘--gapbgone’. The HMMs are built using dictify.py’s ‘--buildhmms’ subcommand. This first loads the taxonomic index built previously, and uses it as a template to create a directory hierarchy representing the tree of life. In each directory, hmmbuild is started and sequences assigned to that taxa are piped to the process. Each profile-HMM is saved in the relevant directory. The number of simultaneously running hmmbuild processes can be specified on the command line, but the default is to use all processor cores less one. 2.10 Matching Names Between ARB and NCBI Taxonomy Databases Each taxonomy database holds a different representation of the tree of life, and so sequence annotations can differ between identical sequences. SSuMMo uses the NCBI and ARB taxonomy databases together to maximize the information available to the user. The MySQL backend of SSuMMo holds five tables when fully populated: two from NCBI 28 (names and nodes tables), and three which are populated using both ARB and NCBI taxonomy information (Eukaryotes, Prokaryotes and NonUniques). These tables were populated with link_EMBL_taxonomy.py, which has two main modes of operation (‘--NCBI’ and ‘--compare’): the first downloads the latest NCBI taxonomy database and populates the first two tables, and the second associates ARB sequence annotations with NCBI taxonomy database entries. The MySQL database is populated with two subsequent program calls:$ link_EMBL_taxonomy.py --NCBI $ link_EMBL_taxonomy.py --compare Names, lineages and recognized phyla commonly differed between databases, so advanced methods to recursively walk up the NCBI taxonomy using the MySQL database were required to map taxa where parental lineages differed. The program works by walking down the ARB taxonomy from the ‘root’ of the tree, and for each name and parent name combination, link_EMBL_taxonomy.py will search the NCBI database for matching nodes, based on their names. First, it checks if the ARB taxon name alone can be mapped to a unique entry in the NCBI database. If the taxon is found and is unique, then its NCBI taxonomic ID and rank are returned, and entered into either the Eukaryotes or Prokaryotes table, along with the ARB name and parent name. If there are multiple NCBI entries matching that name, then the NCBI database is searched again for entries whose parent name also match. If this produces a unique match, then its ID and rank are returned, but if none are found, then the taxon name is searched with a wildcard at the end of the taxon name (see below). But if this still produces multiple possible children, all of their parents and grand-parents (according to NCBI) are checked to see if the ARB name / parent name combination can be matched with an NCBI name / grand-parent name. If still there is not a unique match, then the taxon name is shortened by a word, if possible, and the function calls itself again to repeat the process. 29 The program link_EMBL_taxonomy.py is written in Python and makes use of the MySQLdb library to make raw SQL calls against NCBI’s taxonomy database. 2.11 Testing Accuracy Four datasets of annotated reference sequences were downloaded from NCBI (ftp://ftp.ncbi. nih.gov/genomes/TARGET/) and the Human Oral Microbiome Database (HOMD) (http: //www.homd.org/Download) for accuracy testing. SSUMMO_tally.py was developed to parse sequence annotations from sequence headers and match them to entries in the combined ARB and NCBI MySQL taxonomy database. This uses recursive and wildcard matching techniques to map taxa between databases. The ARB taxonomy is known from SSuMMo sequence annotations, but the species name in the original sequence header (query name) is matched separately to entries in the MySQL database, to try and locate a corresponding NCBI taxonomic ID. If a unique match is found, then its taxonomic lineage is identified by recursively searching up through the parents from that identified taxon. This way, we can identify the lineage from sequence annotation and compare it with the ARB lineage, as inferred by SSuMMo, at each rank. Any query name which cannot be matched to an entry in the NCBI taxonomy database leads to all higher level ranks being unidentified. This negatively affects the percentage of “compared” sequences (3.1), which decrease with higher level rank from genus specificity. To compensate this effect, percentage accuracies were inferred only from those ranks which could be directly matched to a corresponding NCBI taxonomic identifier. Where no species level match is found between original annotation and NCBI taxonomy database, the number of words matching between original species annotations and assigned taxonomy names is counted, so long as the first word is confirmed to be a genus. The first word is only here considered a genus if it ends in one of 35 two character-long 30 endings identified within genera acknowledged by the NCBI database. If this is satisfied, a single word match is considered a correct genus assignment, and two matching words considered a correct species assignment. To compare any annotated sequences to SSuMMo allocations, the command is:$ SSUMMO_tally.py [-format (fasta|sff|...)] --tally <SEQUENCE_FILE_NAME> 2.12 Importing and Exporting Trees to IToL comparative_results.py and versions of SSuMMo.py can programmatically upload phyloxml formatted trees and associated metadata to IToL, as well as download them in any format IToL supports. A Python API for IToL, produced by Albert Wang, is available from the IToL website (http://itol.embl.de/help/iTOL_python.zip), and was used to facilitate this functionality. From our experiences however, manually uploading trees allowed more advanced IToL features to be used, enabling better manipulation of the trees, as well as greater reliability. To enable automated upload and download from IToL, a user will need to first create an account at IToL and enable “batch access”. This is documented in the IToL website’s help pages and in the SSuMMo User Manual. 2.13 Calculating Biodiversity Indices Ecologists have used biodiversity metrics to describe and compare macroscopic, natural habitats for over 50 years. In the simplest of cases, biodiversity is just species richness; that is, a count of the number of unique species in a given area [Magurran, 2009]. However, further metrics were devised to incorporate other population-level features, including evenness (Equation 2.2) and richness (Equation 2.3) between groupings. 31 The Shannon index “assumes that individuals are randomly sampled from an infinitely large community and that all species are represented” [Magurran, 2009; Pielou, 1975], and is calculated with the following equation:S H ′ = − ∑ p i ln p i i=0 (2.2) where p i is the relative number of individuals belonging to the i th species in the sample and S is number of species. A derivation of this equation shows that for the case where all species are present in equal numbers, H ′ will reach a maximum: H max = ln S. Although the Shannon index (Equation 2.2) takes into account species evenness within a population, a separate evenness measure can be calculated by dividing the Shannon index by its value at maximum evenness, H max [Magurran, 2009]. This amounts to a normalised Shannon evenness and is calculated with J ′ = H ⇑H max . ′ Another commonly used biological diversity metric, Simpson’s index D, captures the variance between species abundances in a population [Magurran, 2009]. The form used in the context of the current work (Equation 2.3) rises with the diversity and evenness in a community. S ∑ n i ⋅ (n i − 1) D = 1 − i=1 N ⋅ (N − 1) (2.3) N = total number of sequences sampled ; S = total number of observed taxa ; n i = number of sequences in the i th taxon. Equation 2.3: Simpson richness index. In microscopic environments, where the definition of species can be somewhat ambiguous, alternative features like the number of KEGG metabolic pathways or OTUs have been used to describe genetic or functional biodiversity. SSuMMo can calculate 32 biodiversity information at levels of specificity defined by taxonomic rank, rather than arbitrary, percentage sequence dissimilarity. rankAbundance.py was developed to calculate the percentage of sequences assigned to each taxon at user specified rank, and save tabular data ranked in order of taxon abundance. This information can be loaded into other programs for further anlysis (e.g. Excel or EstimateS). Calculated biodiversity metrics are also printed to screen. For example:$ rankAbundance.py -in results.pkl -out rankdata.txt rarefactionCurve.py was developed with a multitude of configurable options to calculate and plot biodiversity information after resampling the data. For example, Simpson and Shannon indices can be plotted against the size of a randomly selected pool of sequences, according to their genus allocations. The pool size could be increased by 1000 sequences each iteration, and 10 replicates performed at each pool size, with the command:$ rarefactionCurve.py -collapse-at-rank genus -replicates 10 -increment 1000 -in results.pkl results2.pkl 2.14 Finding Taxa and their Lineage findTaxa.py can be used to find taxonomic lineages matching any species name. This uses regular expression matching (from Python’s re module) to find all taxa in the taxonomic index that match the given pattern. For each taxon in the matching lineage(s), the MySQL database is also searched and rank information is printed below a text tree representation. For instance, if a user wishes to find all taxa (and their lineages) that end with the word ‘sp.’, the following command can be used: $ findTaxa.py ‘sp.$’ 33 2.15 Plotting Tabular Data on to Trees Given the above functionality it is possible to create phyloxml trees and plot arbitrary numeric data at each taxon. plot_data.py was developed to create a phyloxml file, and corresponding IToL files, to represent any tabular data data as bar graphs on an IToL tree. For example, the number of rRNA genes present in the genomes of over 1,100 species was copied from the rRNDB website [Klappenbach et al., 2001], and pasted into Microsoft Excel. The genus, species, and strain columns were merged into one, column headers were kept, and the table was saved as a plain-text, tab-delimited file called ‘rRNAcounts.txt’. The tree and IToL-compatible files were then generated with the command: $ plot_data.py rRNAcounts.txt -out rRNAPlot 2.16 Inferring Sequence Conservation ACGTcounts.py was developed to create a position-specific scoring matrix (PSSM) from any set of sequences. The 144 archaea 16S rRNA sequences were first aligned to the domain-level archaea HMM using hmmalign, and the subsequent alignment was loaded into ACGTcounts.py. The resulting PSSM was saved as a tab-delimited text file and loaded into a spreadsheet. The sample variance across A, C, G and T residues was calculated at each nucleotide position and normalised (Equation 2.4), giving the residue conservation at each alignment position. 1 n+1 n C n+i = ∑ 4 ⋅ var(PA , PC , PG , PT ) i n=1 The tab-delimited PSSM can be created with the following command:$ hmmalign /path/to/arbDBdir/Archaea.hmm NCBIArchaea.fna | ACGTcounts.py -format stockholm -out ArchaeaPSSM.txt 34 (2.4) 2.17 Comparing Processing Times Against BLAST The ARB Silva database of reference sequences used to create the SSuMMo database of HMMs was also used to create a BLAST database with which to compare processing times. Databases were trained using 512,037 sequences present in the SSU reference database v.104, with the only difference in training data being that the SSuMMo database could use aligned sequences. Both BLAST and SSuMMo times were recorded by using the Unix time program, which is provided by most Unix shells and is invoked simply by typing ‘time’ before the preceding program call. SSuMMo, BLASTN and MEGABLAST were tested in this manner using each program’s default settings on the same datasets. To enable a fairer comparison, BLASTN and MEGABLAST settings were changed to enable use of the same number of processor cores as SSuMMo (all available CPU cores less one), and timed when completion would occur in a feasible amount of time. 35 Identifying Microbes with Small 3 Subunit ribosomal RNA A number of current research foci look to create a better understanding of the complexity of microbial communities and interactions within diverse environments [Korneel et al., 2007; Raes & Bork, 2008]. The analysis of complex microbial communities with high-throughput sequencing (HTS) technologies can generate millions of small subunit ribosomal RNA (SSU rRNA) reads [Roesch et al., 2007; Sogin et al., 2006; Turnbaugh et al., 2009]. SSU rRNA sequences are commonly used to assess community complexity and have been used in such disparate sample regimes as soils [Liu et al., 2008], the human gastrointestinal tract [Ley et al., 2006] and potential biofuel sources [DeAngelis et al., 2011]. As an alternative to primer-targeted studies, whole-genome shotgun (WGS) metagenomics has become increasingly popular over the past decade, as it provides additional insight into community function and is purported to reduce sampling bias [Manichanh et al., 2008]. Both whole-genome and primer-targeted sequencing methods use the same sequencing platforms, technologies producing ever-enlarging datasets [Shendure & Ji, 2008] and suffering similar sequence artefacts, including shorter sequence lengths and greater uncertainty in the prediction of nucleotide bases when compared with older methods [Ledergerber & Dessimoz, 2011]. 37 Regardless of method, it is always desirable to identify those species that most significantly contribute to their environment. Powerful tools to visualise and identify differences or commonalities between datasets at a number of hierarchical levels are needed to help understand and model ecosystems and their dynamics in systems biology approaches [Liu et al., 2008; Raes & Bork, 2008]. 3.1 Taxon Identification with SSuMMo We have developed the Small Subunit Markov Modeler (SSuMMo) in response to the growing computational demands of such large datasets. SSuMMo is based upon a database of profile hidden Markov models (HMMs), trained with the ARB Silva reference database of SSU rRNA sequences [Pruesse et al., 2007]. The hierarchy of HMMs [Eddy, 1998] is arranged by EMBL taxonomy and acts as a decision tree to catalogue conserved gene fragments into known species names, one taxonomic rank at a time. This design minimises the number of pairwise comparisons and bypasses the need to create operational taxonomic units (OTUs), species proxies based on percentage sequence similarity. SSuMMo only groups sequences into acknowledged species names, defined after pure-culture, phenotypic characterisations [Dewhirst et al., 2010; Schloss & Handelsman, 2005]. SSuMMo has been built and optimised for Unix multicore workstations running Python v2.6+ and is interfaced through a set of command line programs, which can read sequences in over 20 different file formats, as supported by BioPython. SSU rRNA sequences contained within any sequence dataset (genome, HTS gene fragment, etc.) are identified in the first pass of domain-level classifications and retained for further taxonomic classification. Taxonomic assignments can be visualised in real-time, and results automatically saved into a Python object file (Figure 3.1A), which is optimised for fast conversion into a number of formats, including phyloxml, html, svg, jpeg, etc. 38 A) Sequences SSUMMO.py Score against sibling HMMs* Assign sequences to highest scoring taxa Continue to next child which has been assigned sequences If no more child nodes Results (.pkl file) B) Multiple SSuMMo results SSuMMo results dict_to_phyloxml.py phyloxml comparative_results.py IToL-compatible tree (.xml) phyloxml metadata (.txt) IToL-compatible tree metadata Multiple Multiple SSuMMo results SSuMMo results rankAbundance.py Rank abundance table (.txt) rarefactionCurve.py Biodiversity Plots rarefaction Plots biodiversity indices curve indices † ‡ † Figure 3.1: High level overview of A) SSuMMo annotation pipeline; and B) select post-analysis programs. Input & output files are represented with rounded boxes, programs in straight-edged boxes. A) SSuMMo can accept any sequence file type supported by BioPython (e.g. sff, fastq, etc.); fasta formatted files expected by default. Sequence files are read from files by a single process in SSUMMO.py, which pipes sequences through threads that feed reformatted sequences into the hmmsearch sequence scoring program. As the population’s taxonomic structure is created, a plain text tree showing quantitative information is printed to screen. Verbose mode also prints all raw hmmsearch results. The main output is a “pickled file”, saved with Python’s cPickle module next to the original sequence file. This currently stores the observed taxonomy and assigned accession numbers in the form of a multi-dimensional dictionary. B) For each .pkl file, post-analysis methods can produce various figures and / or tabular data. *- Sequences are scored against multiple HMMs simultaneously, provided there are spare processor cores. † - Simpson ( D ) and Shannon (H ′ , H max , J ) indices are available to choose from. ‡ - Rarefaction curves are plotted to screen using Python’s matplotlib plotting library. Images can be saved in raster or vector-based formats. Scripts are provided to calculate abundance and biodiversity information, and fast-track visualisation of results using EMBL’s IToL web application [Letunic & Bork, 2006], which can paint quantitative and comparative information onto inferred population structures (Figure 3.1B). SSuMMo can also save annotated sequences separately for further down39 stream analyses, or plot any numeric, tabular data onto the ARB taxonomy (section 2.15 and Figure 3.6). Taxonomic accuracy of SSuMMo was tested by comparing annotated sequences obtained from the NCBI FTP repository [NCBI, 2010] and the Human Oral Microbiome Database [Dewhirst et al., 2010] against SSuMMo assignments (Figure 3.4 - 3.3, Table 3.1). Initial tests showed genus prediction accuracy to be >90% (Table 3.1), prompting development of tools to assist with visualisation and comparison of multiple datasets. Functionality is demonstrated with SSU rRNA sequence datasets sampled from lean, overweight and obese individuals in chapter 4. Further detailed analyses exploring the relative accuracy of assignment in each of nine ‘hypervariable’ regions in 16S rRNA (V1-9), excised from full and near full-length archaeal test sequences showed targeted sequences as short as 70 nucleotides could identify >70% of genera correctly (Figure 3.2 and 3.3). Simulations were designed to identify ubiquitously conserved sequence regions suitable for broad-spectrum primers. As HTS methods produce relatively short reads compared with the length of the SSU rRNA gene, we looked to identify those regions in Archaea that coincide with the highest percentage of correct genus predictions (Figure 3.5). We note that no single region in SSU rRNA is conserved to an extent as to enable a single primer to cover the entire Archaea domain Simulated studies could be used to predict those taxa that would be identified with a designed 16S rRNA primer by using the SSuMMo HMM database. To assist with modelling changes in population structure and diversity within and between datasets, programs were developed to perform rarefaction analyses, calculate biodiversity indices and export stochastic matrices representing taxon probability distributions. Each program can prune resultant taxonomies at any specified rank prior to performing analyses, an alternative to varying cluster sizes by sequence similarity. Results can be exported in tabular form or visualised using Python’s matplotlib plotting library 40 90 80 Percentage correct genus assignments 70 60 50 V1 V2 V3 40 V4 V5 30 V6 V7 V8 20 V9 10 0 0 50 100 150 200 250 300 350 400 450 Sequence slice length Figure 3.2: Accuracy of SSuMMo assignments in SSU rRNA hypervariable regions. The percentage accuracy of assigning genus information to 144 Archaeal sequences at varying lengths was recorded and tallied for all 9 hypervariable sequence regions of SSU rRNA, as detected by a modified version of Vxtractor (available on request). The modifications worked to excise sequences of fixed length leading up to or from the boundaries any specified hypervariable region. In this simulation, sequences of 500 residues in length were excised from the 5’ end of each hypervariable region, and simulate_lengths.py was written to reduce the size of the sequences by 5 residues at a time, before calling SSuMMo and recording the number of genera correctly predicted for each sequence length. Results were saved to a whitespace-delimited text file (and printed to screen / standard output) for plotting. (see section 2.13). The provided scripts can apply resampling methods to SSuMMo results, enabling visual comparisons of estimated sampling depth, taxonomic diversity, species evenness and sampling bias within and between datasets. This is performed by ‘rarefying’, or randomly sampling an equal number of sequences, from result datasets and calculating Shannon and Simpson indices from the observed population distributions. These statistical methods and metrics can be combined and compared within and between sequence datasets to distinguish high-level features of diversity and commu41 500 90 80 Percentage correct genus assignments 70 60 50 rV1 rV2 rV3 40 rV4 rV5 30 rV6 rV7 rV8 20 rV9 10 0 0 50 100 150 200 250 300 350 400 450 500 Sequence slice length Figure 3.3: SSuMMo accuracy for antisense strands of hypervariable regions. Sequences were cut at the 3’ end of each hypervariable region and re-tested for accuaracy. Percentage genus accuracies are plotted for sequence lengths between 30 and 500 residues in length in steps of 5 residues. nity structure. The ability to combine and visualise species distributions across multiple datasets is a unique feature of SSuMMo, and provides a far speedier alternative to predicting phylogenies, which is prone to human error and can be difficult to reproduce [Peplies et al., 2008]. SSuMMo was shown to provide a robust framework for characterisation and comparison of population structures, enabling fast access to an array of data-dependent metrics. For annotation and inspection, the object-based model provides extensible tools to help compare and edit taxa and sequence annotations between databases. 42 Dataset (Rank) Phylum Class Order Family Genus Species # Sequences Mean length ± SD NCBI Archaeaa Compared Matched (%) (%) 98.6 100 98.6 100 98.6 100 97.2 100 100.0 97.2 91.7 65.2 144 1441.1 ± 36.7 NCBI Bacteriaa Compared Matched (%) (%) 49.8 92.9 50.1 92.8 66.1 90.7 85.2 92.5 91.5 89.5 94.6 56.8 3,186 1468.3 ± 47.0 HOMD Extendedb Compared Matched (%) (%) 43.7 95.1 58.0 92.2 72.3 87.4 74.1 94.5 78.4 89.1 77.5 44.2 34,879 481.7 ± 106.7 HOMD RefSeqb Compared Matched (%) (%) 38.3 97.5 47.0 95.9 66.3 93.2 71.3 96.0 80.9 85.7 43.1 50.1 1,646 1176.3 ± 447.7 Table 3.1: SSuMMo annotation accuracies. Species information extracted from fasta sequence headers were compared against SSuMMo taxonomy assignments as a measure of accuracy. ‘Compared’ shows the percentage of sequence annotations that could be found in the NCBI taxonomy database and propagated back up the tree of life at each rank.‘Matched’ shows the percentage of comparable sequences whose rank assignments agreed between SSuMMo and original annotation. a - ftp://ftp.ncbi.nih.gov/genomes/TARGET/16S_rRNA/. b - http://www.homd.org/Download - 16S rRNA RefSeq and extended RefSeq databases. 3.2 Results and Discussion 3.2.1 Assignment Accuracy Initial accuracy tests were performed with 144 full and near-full length Archaeal 16S rRNA sequences (all >1257 bp) obtained from the NCBI FTP server [NCBI, 2010]. Up to 99% (142) were assigned to the correct genus and 100% of sequences are correctly assigned to higher ranks, according to their original NCBI annotation (Table 3.1). No difference in accuracy was noted between the different model training methods, when using the Archaea test dataset. However, we found that hmmbuild’s default settings made HMMs giving the best accuracy when using the NCBI Bacteria dataset of full length 16S rRNA sequences. The impact that sequence length had on SSuMMo’s assignment accuracy was investigated with the same test dataset, by trimming residues from the 3- end of aligned sequences, before analysing with SSuMMo, and tallying the scores (Figure 3.4). Interestingly, genus 43 100 90 80 Percentage agree 70 60 50 Final genus accuracy 40 if Genus accuracy comparing first word Genus accuracy against HMMs built with --wgiven option. 30 20 10 0 0 200 400 600 800 Sequence length 1000 1200 1400 Figure 3.4: Accuracy of SSuMMo compared with sequence length. We tested the accuracy of genus assignment with sequence slices ranging from full length to just 34 residues, by shortening the 5 residues at a time from the 3’ end. For each sequence length, all sequences were run through SSuMMo and the percentage of allocations agreeing with NCBI annotation recorded. The first comparison method of SSUMMO_tally.py incorrectly assumed that the first word in the annotation name was always genus, so compared the first word in the annotation to the first word of the SSuMMo allocated taxon. This is plotted against a later version which took into account species with suffix names differing from their genus. The accuracy was also tested against an HMM database built if passing the --wgiven option to hmmbuild. This showed lower accuracy than the default hmmbuild method, which uses Henikoff position-based weights [Eddy, 1998]. assignment accuracy increased to the maximum of 99% (142) only after trimming the last 85 residues from the 3’ end of the test sequences. At lengths between 1119 and 1364 residues, SSuMMo assigned sequences with a genus accuracy of 98%, below which accuracy declined in a non-linear fashion (Figure 3.4). SSuMMo genus assignment accuracy was <95, 90, 80 and 70% for sequence lengths of 1059, 959, 554 and 387 ± 2 residues, respectively. Further tests were performed on SSU rRNA hypervariable regions, as detected by V-Xtractor [Hartmann et al., 2010] (Figure 3.2 and Figure 3.3), by extracting sequences extending 500 residues to or from locations either side of each hypervariable region. 44 SSuMMo was iteratively run on sequences after shortening by five residues at a time, and percentage accuracies recorded. Our results show that the V4 region most accurately assigned genera throughout the domain, with accuracies remaining ≥ 75% for sequence lengths of just 67 ± 2 residues (Figure 3.2 and 3.3; raw data not shown). The V9 region consistently performed worst, which is likely explained by a lack of training data, as many of the Archaea sequences in the ARB database do not cover this region, which spans alignment columns 1310-1340, according to alignments against RNAMMER HMMs [Lagesen et al., 2007]. Some of the lowest accuracies for assignments within the Archaea domain occurred with regions at the 3’ end of the full-length sequences (Figure 3.5, 3.2 and 3.3). This can be explained by the increased likelihood of errors appearing at the tail of sequence reads [Flicek & Birney, 2009] and by the fact that many training sequences were not full length. Out of 511,814 training sequences housed in ARB v104 database, 9,667 sequences are < 1,200 residues in length, the majority of which are members of the Archaea domain (9,621), representing > 45% of the 20,994 Archaea sequences in the ARB v104 database. At sequence lengths of 400 nucleotides, a common read length generated by pyrosequencing technologies [Droege & Hill, 2008], SSuMMo was shown to accurately predict the genus of >70% of archaeal sequences targeted at either end of regions V1-6 (Figure 3.2 and 3.3). Methodologies producing even shorter reads would benefit from well-designed primers, as accuracies as high as 80.5% are achieved with sequences 250 bp in length, if starting from the 5’ end of the V4 region (Figure 3.2). SSuMMo accuracy was tested for consistency in the Bacteria domain using sequences obtained from NCBI and the Human Oral Microbiome Database (HOMD) [Dewhirst et al., 2010]. SSuMMo correctly assigned >86% of reference sequences to genera described in sequence annotations (Table 3.1), and >92% of bacterial family predictions matched their annotation across all datasets, up to 96% accuracy for the HOMD RefSeq database. 45 Reason for species mismatch Only annotated to Candidate Division Only annotated to family Only annotated to genus Annotated to “Oral taxon <123>” Only assigned to genus Assigned to an uncultured species Naming convention differences Assigned to wrong species Percentage 4 1 3 3 1 42 26 19 Correct* 4 1 3 2 1 24 22 0 Table 3.2: SSuMMo species mismatches. A random sample of 100 species mismatches were selected from the lowest scoring dataset (HOMD extended) and examined to understand why the wrong predictions occured. The majority of misassignments could be accounted for by original annotation not actually reaching a species level annotation, but SSuMMo had actually predicted a species. SSuMMo predicted 42 sequences with species level annotations to be from uncultured microbes. *- The number of sequences for which SSuMMo correctly predicted the taxonomic lineage to either the same level as original annotation, or up to genus specificity. The lowest accuracies were recorded for species level assignments. A random sample of 100 mis-assignments indicated ∼40% were being assigned to uncultured species, with about half of those being assigned to the correct genus (Table 3.2). The mis-assignments could be due to a number of factors, including subtle differences in naming conventions between databases (we estimate ∼25% of mis-assigned sequences), differences in the number of training sequences, and the taxonomy structure which underlies our method (ARB has multiple unclassified branches at many different nodes). Matching taxa names between databases posed a problem as database entries are often misspelled (e.g. in ARB: ‘Brumimimicrobium’ instead of ‘Brumimicrobium’ etc.), mismatched (e.g. exchangeable, non-alphanumeric characters), or non-unique (e.g. ‘Acidobacterium’ is both a phylum and a class). These issues do not affect SSuMMo’s ability to assign sequences to most probable taxa, but negatively affect the inferred number of comparable sequences in the accuracy tests (Table 3.1). 46 90 1 80 0.9 0.8 70 0.7 0.6 50 0.5 40 0.4 30 0.3 20 0.2 % Genus accuracy 10 Residue conservation V1 81 117 0 0 V2 192 200 V3 286 434 V4 526 400 622 793 600 800 V5 864 900 V6 1024 1000 1100 1174 0.1 V8 V7 1235 1200 0 1400 Residue co-ordinate Figure 3.5: Accuracy of Archaeal 16S rRNA sequences run through SSuMMo. The percentage of 144 sequences to which SSuMMo correctly assigned genus is plotted against the starting co-ordinate of sequence windows 250 nucleotides in length. Also plotted are C 10 values, the residue conservation over 10 base windows (Equation 2.4), and predicted positions of each hypervariable region for the query sequences. 3.2.2 Software comparisons SSuMMo processing times were compared with those of BLASTN and MEGABLAST (v2.2.21) using an array of datasets (see Appendix I). SSuMMo took 4 hours, 7 mins to process 291,993 V2-targeted sequence reads and 6h 32min to process 3,186 near full-length sequences (Table 3.3). When compared against the default BLAST configurations (1 CPU core), SSuMMo is fastest, but after changing BLAST settings to use 11 of 12 CPU-cores, as SSuMMo did by default, MEGABLAST was fastest with datasets up to several thousand sequences, but slower than SSuMMo with the largest tested dataset (Table 3.3). SSuMMo’s accuracy (Table 3.1) appears to outperform tools used to annotate WGS metagenomic datasets according to values quoted in the literature [Brady & Salzberg, 47 Residue conservation Percentage correct 60 Dataset NCBI Archaea* NCBI Bacteria* V2 From Lean** Dataset stats No. seqs Mean length ± S.D. 144 1441 ± 36.7 3,186 1468 ± 47.1 291,993 230 ± 10.7 SSuMMo v0.4b Processing times Blastna Megablasta Megablastb 3m52s 6hrs32m14s 4hr7m39sc 3m50s 4d6h12m - 2m38s 19h17m2s - 64s 2h55m27s >24hrs Table 3.3: SSuMMo vs. BLAST runtimes. SSuMMo processing times were compared against NCBI BLAST programs: BLASTN and MEGABLAST. The 291,993 V2-targeted sequences were started with MEGABLAST, but were not run through to completion as it became apparent that SSuMMo was far quicker at processing these larger sequence datasets. a - BLAST default settings, using a single process thread. b - Using all CPU cores less one (11 on our test system). *- NCBI “target” datasets were downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/TARGET/16S_rRNA/. **- The pooled set of V2-targeted sequences, including only those extracted from “lean” individuals was produced by Turnbaugh et al. [2009]. 2009]. This should be as expected, given that SSU rRNA is currently the most highly sequenced gene, by far. However, the RDP classifier, which is also designed specifically to annotate SSU rRNA sequences, reports comparable accuracies [Wang et al., 2007]. 3.2.3 Sequence Windows and Primer Design Prokaryotes contain nine hypervariable regions in their 16S rRNA gene, which are interspersed with relatively conserved regions that are more suitable for designing broadspectrum PCR primers. SSuMMo was tested to see if the extra variation in hypervariable regions affected genus predictions, by excising a 250 base ‘window’ within each archaeal sequence and shifting it 5 nucleotides at a time (Figure 3.5). In this scenario, the highest accuracy recorded within this set of 144 sequences was 89% and the lowest was 48%. The nucleotide conservation in Archaea sequence alignments was calculated and averaged over 16 base windows along the whole SSU rRNA gene (Equation 2.4; I = 16). This returns a value between 0 (no conservation) and 1 (perfectly conserved region) for any group of aligned sequences. The start position of the most accurately assigned 250 base window was identified in the middle of the V3 hypervariable region, where residue conservation is 48 particularly low (Figure 3.5), making this an unsuitable location for targeted primer design. A more effective primer selection might focus upon RNAMMER alignment positions 535-551, between regions V3 and V4 as it is highly conserved (C16 = 0.992) (5’-CAGC[c][AC]GCCGCGGUAA-3’). There are three 250 base long sequence windows, starting from local alignment positions 562, 567 and 572 and extending downstream, which show accuracies of 79%; the highest accuracy for any region starting from a ubiquitously conserved region of sufficient length for primer design. However, if targeting the reverse strand from this location, typical sequence lengths would extend beyond the V3 region into positions that are relatively worse at resolving taxa accurately. 3.2.4 Biological Diversity As with other SSU rRNA identifying software, SSuMMo does not account for multiple rRNA operon copy numbers per genome, which vary between 1-15 copies per organism, according to information available at the time of writing (Figure 3.6) [Klappenbach et al., 2000]. There is also variation in chromosomal copy number between organisms, which can vary with proliferation state [Pecoraro et al., 2011]. These factors mean that quantifying 16S rRNA genes in environmental samples does not indicate the number of individual cells in a sample, but only the number of rRNA gene copies sequenced. Together, these could contribute a 2 to 3 order of magnitude error in organism estimates. However, using rank abundance scores and information gained on population distributions, several biodiversity indices can still be calculated (section 2.13). Although these biodiversity metrics don’t by themselves consider gene and genome copy number, these metrics can still be used to give an approximation of relative organism abundance within a sample. When calculating biodiversity indices, a defining unit is needed to discriminate one taxon from another. In SSuMMo, these units are defined by taxonomic rank, rather than percentage sequence similarity, which is commonly used when defining OTUs (e.g. 49 Euryarchaeota N Koanoa rar rc cha ha eo eota ta P sych P sych roba roba cter Acin etob cter P acte sychroba cryo s p. P Acin Acin r towneri cter halolentiR wf−1 etob arcti acte etob cus s K r tjern acte D S M 1 r 4969 273− 5 Acin bergiae towneri 4 etob DS M D S (2 N01 ) acte M r tand 1496 6 1 4962 (7B oii Acin Ac inetobacte Acin etob etob DS M 1 02) acter 4670 r acte Acin lwoffi schind leri r s p. etob i DR Acine ac ter junii B G 8 (A TCCLU H 5832 1 ter calco tobacter B G 5 1792 5) acetic grimontii( ATCC Acine us BG1 DSM 17908 ) Acine tobacter 14968 tobac bayly (ATCC 23055 ter bayly i C5 Acine ) (DSM tobac i 14963 ter baylyB D413 ( LUH ) i B D4 3 Acinet Acine ( LUH 905) obact tobac Acinet er baylyi ter bayly 4 836) obact er baylyi A7 (DSM i ADP 14959 1 93A2 Acinet ) (LUH obact 5822) Acinet Acinetobacteer baumannii obacte r bauma SDF r nnii AY Acinet bauma nnii E AT obacte Acinet r bauma C C 1 7978 obacte Acinet r bauma nnii nnii AC IC obacte AB 307−0 U 294 P s eudomr bauma nnii AB 0057 onas stu Pseudo tzeri Pseudo monas putida A1501 monas putida W619 Pseudo KT244 monas 0 Pseudom putida GB−1 onas Pseudom onas mendocputida F 1 Pseudom P s eudomon onas entomop ina ymp as ae ruginos hila a U C B P L48 P s eudomon as ae ruginosaP−P A14 P seudo P AO1 Methyloba monas a eruginosa cterium e xtorquens P A7 C ellvibrio AM1 ja ponicus U eda107 Azotobact er vinelandii Y ers inia ps DJ eudotubercu los is Y P III Y ers inia ps eudotubercu Y ers inia ps los is P B 1+ eudotubercu los is IP 3 2953 Y ers inia ps eudotuberculo s is IP 3 1758 Y ers inia pes tis P es toides F Y ers inia pes tis N epal516 Y ers inia pes tis K IM Y ers inia pes tis C O92 Y ers inia pes tis A ntiqua Y ers inia pestis A ngola Y ersinia pes tis 9 1001 Y ersinia e nterocolitica 8 081 S odalis glossini dius m orsitans Se rratia proteam aculans 568 P seudom onas syringa e D C 3000 Pseudomonas syringae B728 a Pseudomonas syringae 1448A Photorhabdus luminescens TT01 Pectobacterium wasabiae WPP163 Pectobacterium atrosepticum SCRI1043 Klebsiella pneumoniae NTUH−K2044 Klebsiella pneumoniae MGH 78578 Klebsiella pneumoniae 342 Erwinia tasmaniensis Et199 49946 E rwinia a mylovora E a273 ATC C Entero bacte r sp. 638 Dickeya dadan tii Ech703 z 3032 C ronobacter turicens is AT C C B AA−894 C ronobac ter s aka zak ii IC C 168 C itrobacter rodentium IP 1 2.8 1) istens B G 12 ( S E Acinetobacter radiores B PEN S odalis penns ylvanicus L T 2 S almonella typhimurium P C 1 c arotovorum P ectobacterium AT C C 5 1908 S hewanella woodyi s p. W 3−18−1 S hewanella s p. M R −7 S hewanella s p. M R −4 S hewanella A NA−3 s p. S hewanella H AW−E B 3 s ediminis S hewanella putrefacie ns C N−32 S hewanella A T C C 7 00345 pealeana idensis M R −1 S hewanella a one P V −4 S hewanell a loihica S hewanell is H AW−E B 4 a h alifaxens NCIMB 400 rina S hewanell lla frigidima ans OS217 Shewane lla denitrific OS195 Shewane lla baltica Shewane baltica OS185 ella OS155 Shewan ella baltica SB2B Shewan ensis PB7 ella amazone nterica S 67 lla Shewan C −B S almone e nterica S lla C 9 150 S almonee nterica A T C a 3 4H lla ythrae S almone a ps ychrer dans 1 4642 T8 C olwelli us degra olei V aquaeecotype rophag accha bacter S Marino odii Deep amii 37 s macle as ingrah 5 mona Altero Psychromon lanktis TAC12L 2T R as halop loihie nsis a T 6c omon arina tlantic oalter Idiom ola a emec ula1 Pseud aT G laciec tidios iosa M23 a fas Xylell Xylella fastid iosa M12 4 fastid a Xylell iosa GB51 sa 9a5c a fastid 99A Xylell lla fa stidio Xyle oryzae PXO31101 8 onas e M AFF C 1033 1 hom Xant oryza ae K AC B LS 256 ae om onas oryz B 100 Xanth homonas onas oryz estris 3 596 hom Xant c amp D S M 5−10 Xant onas 3 3913 s8 hom C estri 8 004 Xant s A T C c amp estris 3 06 estri is onas pod C 73 c amp thom onas c amp P Xan thom onas −3 s a xono G PE s thom ona a Xan thom ean ilia R 551 Xan 279 Xan s a lbilin toph ilia K m70 ona s mal aP ona s maltoph thom tocid H01 65 hom Xan trop homona la mul suis S K W2 0 urel para G S teno trop Rd P aste hilus enz ae ae P ittG S teno mop influ P ittEE enz Hae hilus s influ enz ae NP hilu s influ 86−028 P mop 00H Hae ae mop 350 700 −1 Hae mophilu enz reyi 7S Hae hilus influs duc hilus NJ8 s D S −1 hilu rop itan mop mop aph com s D11 Z Hae Hae cter tem itan s 130 tiba yce com ene ae L20 rega tem L03 oni Agg actinomyce s ucc inog um ae J 76 er act a ctinom lus uropne umoni AP 6 acil ae r atib reg acte inob lus ple uropne umoni 233 nus PT Act acil ple Agg atib pne s som 129 0 6 reg uro inob O nus YJ1 Agg hilu Act bac illus ple ino illus Histop s somnificus O6−24P6 hilu Act bac top rio vul cus M C MC 3 ino His Vib nifi cus Act 063 48 221 vul nifi 16 rio IMD C C 140 Vib rio vul −11 Vib ticus R ns A T B AA O3 95 C 1 e oly riege em nat lera 696 yi ATCcho N1 SS9 aha par Vibrio ha rve rio lerae dum 98 rio Vib cho fun 91− CI rio Vib rio pro H3 eus Vib NV Vib m eriu eus illus cer14579 8 4 act cer Bac ATCC LFI123S 11 8 tob illus E Pho Bac eus nicida eri −3 41 12 s cer fish 96 illu salmo rio s WY sis U1 S 4 Bac rio ivib nsi ren C HU 8 U1 vib Ali are tula is S O S LV S Alii tul ella are ns ns is is 2−00 ella ns cis rancis tul tulare are F 00 19 8 7 F ran tul TN S C ella F cis cis ella ella TA F is F S C 1401 7 F ranF ran rancis is F are ns is F 25 L−2 ns C X C −1 F ns tul are ella tulare A TC na P fO a tul cis ella ragia noge ens Pf5 ell 3 cis F ran cis ilomi a cru ore sc scens642 F rana ph pir flu ore 49 F11 4 3 F ran 30 os as flu CC ns ell cis iomicr on onas 7 AT sce SM 2396 2 om F ran Th ud om ns P1 fluorens D C TC SK Y L1 1 P se Pseud sce as ige K nsisM W hia s ore on ex sis me flu om er sal ejuen rku cie s) elp P ari ad ud n Acine tobac as on om ud C hro ia cter oba Actin Proteobacteria uc Br Br Firmicutes 1 0 10 AK−0 B s HD ns 4 ru ra v5 PO vo ivo LS B M io la S ns 20 er en hi gh ct alk op cus i da G ou ba m hr hi ox s 0 or F rio illu yc op ar an 4 ib ac ps iditr fum ric DP lde nb ki N1 −0 za M r lfu 1 ov tib P− C ell lfa talea ac te su ris Hi iya 56 P− Bd su lfo phus obac de lga ris M P HE ce s 2C De su ro ph rio vu lga ris is o 22 an 2C De nt tro vib rio vu lga lar S 16 en ns Sy yn lfo vib rio vu llu um DK log na −5 S su lfo vib rio ra ce los s ha ge 09 De su lfo vib int llu nthu r de alo w1 80 De su lfo ia ce xa te d eh . F 23 79 De su on m s ac ter sp De ws giu cu ob ac ter . K DSM M 23 La oran oc oc m yx ob ac r sp us DS m 5 S yx ro myx ob acte lic s Be −1 M ae ro m yx ob ino icu sis GS An ae ro yx carb opion ien ns P C A idj S Z ce An ae rom ter pr f4 An ae ac ter b em leyi du ce nsns R An lob ac r lov tallire du ce ba 9 Pe lob cte r me urre du −2 ee 5144 Pe ba cte er ulf iire 55 is Sh G eoeoba ct r s uran S B1 ch ATCC 5 ny us G eoba cte r s p. G eoba cteptor acinopatic 2669 G eoba tiru ter he lori 88 IA1489 90 G tra obac ter py lori A7 44 0 i ac ter py Ni lic He licob ac cter pylorlori A7 2 49 21 2A He licob ba ter py ri A8 W0 He lico obac ter p ylo i A 8 He lic obac er pylor i B3 02 act ter pylor i G1 7 He lic 1 7 He lic ob ac ter pylor i G2 P AG 63 He lic ob ac r pylor ri H 11 8 He lic ob ba cteter lo i J 99 TC 11 63 He lico b ac r py lor i NC TC 9 He lico bacte r py lor i NC He lico ba cte r py lor i P 12 LO 34 He lico ba cte r py lor i P C He lico ba cte r py lor S 14 51 He lico ba cte r py lori S 18 He lico ba cte r py lori S 37 70 S M 12 He lico ba cte r py lori S hi4 s D 40 He lico ba cte r py lori ific an 17 He lico ba cte r py nitr 37 −1D S M He lico ba cte as de N B C ne s 8 ge 401 7299 He lico on sp. He rim um ucc ino ri RM DSM 26 S ulfu rovlla s tzle gilis 417 S ulfuline er bu rofi i UA s 138 Wo obact er nit er col cisu 525.920 74 Arc obactobact ter con vus 2−4 273 12 47071 Arc pyl bac r cur s 8 T CC C T C 124 472 Cam p ylo obacte r fetu s A s N C T C Cam pyl obacte ter fetulve ticu s N T C 12128 38 45 C am pyl obac ter he lveticu s NC C 128 Cam pyl obac r he lveticu s N CTC T C am pyl oba cte he lveticu s NC 12846 C am pyl bacter r he veticu NCTC 12847 48 C am p ylo cte r hel icus NCTC C 128 49 Cam pyloba cte r helvet icus NCT C 128 1 C am pyloba cte r helvet icus NC T B AA −38 C am pyloba cte r helvet icus T C C Cam pyloba cte helvetinis A 97 Cam pylobabac ter hom 2 69. Cam pylo bac ter jejuni 4 483 8 ni ter C am pylo jeju bac ni 734 176 ter C am pylo bac ter jeju ni 81− C am pylo 0) bac ter jeju 81116 11168 4 343 C am pylo CC bac jejuni NCTC221 C am pylo ni (AT ter C am pylobac ter jeju ni RM1G H90 11 9 80 C 2 525 Cam pylobac ter jeju ni T 18 TC Cam pylobacacter jeju ni UA5 197 sA Cam ylob acter jeju can AT C C itrifi C amp ylob den aea C 91 2 519 6 C C amp acillus s e uropopha AT C T hiobsomona s eutr mis Nitro somona multifor m E bN1 R CB Nitro sos pira romaticu atica a rom Nitroarcus a cla de) Azo acte rium (O M43 B H72 2 181 s s p. s p. B 510 F errib TCC arcu Azo ylophilus s p. H P 688 T Meth ylophilus s p. M tus K C 1 2472 Meth ylovorus s fla gella3−4 ATC Meth ylobacillu s p. S IP ceum Meth ylovorus 5 m viola FA 1090 1194 ae Meth obac teriu NC CP rrhoe C hromeria gono rrhoeae 5344 2 Neiss eria gono itidis Neiss eria m ening is 8013 8 ngitid is FAM1 Neiss eria meni ngitid MC58 is Neiss eria meni ngitid ) Z 2491 Neiss eria meni gitidis a lpha1 4 DSM 19021 ) is Neiss eria menin 1483 710 25416 ngitid (ATCC Neiss eria meni gitidis alpha DN1 (BAA− PC 25 TAM− Neiss eria menin orans 717 = Starr ter acetiv Neiss xybac cepacia Ballard a Steno ha H16 olderi Burkh vidus eutrop a J MP 134 TPS Y Cupria idus e utroph A−1 M 1 5236 ter e breus eille 1DMW C upriav 1 = DS LW−P orobac as s p. Mars Diaph arius Q A T C C B AA−62 niimon 118 = 1 Hermi leobacte r necess ce ns T U LP As P olynuc ra x fe rriredu ydans enicox 8 R hodofeiimona s a rs idans A xylosox Hermin obacter Achrom avium 197N RB50 lla Bordete bronchiseptica lla Bordete parapertussis 12822 I la Bordetel pertussis Tohama la Bordetel petrii DSM 12804 C 1 1170 la C Bordetel llum rubrum A T 9860 1 R hodospirix ave nae A TC C a Acidovor citrulli A AC0 0−1 ax Acidovor s p. J S 42 −2 x Acidovora s tes tos teroni C NB C omamona S P H−1 m P M1 Delftia a cidovorans petroleiphilu 2 Methylibium naphthalenivorans C J P olaromonas s p. J S 666 E F 01−2 P olaromonas cter e iseniae V erminephroba a mbifaria A MMD B urkholderia 40−6 a mbifaria MC B urkholderia A U 1 054 B urkholderia c enocepacia H I2424 B urkholderia c enocepacia J 2315 B urkholderia c enocepacia MC 0−3 B urkholderia c enocepacia 1 GR B urkholderia glumae B C C 2 334 4 B urkholderia mallei A T Bu rkholderia mallei NCT C 10229 Burk holderia m allei N CT C 10247 B urkholderia m allei SAVP 1 ) Burkholderia multivorans ATCC 17616 (DOE JGI Burkholderia multivorans ATCC 17616 (Tohoku) Burkholderia phymatum STM815 Burkholderia phytofirmans PsJN Burkholderia pseudomallei 1106a Burkholderia pseudomallei 1710b s bo Pse act lla ch rax o s pe P hil phila Le rby 7 ila lob ha he anivo (n phila mo ph Co 0 eu Ha as ila Alc on mo pn eumo ph 1970M 18 L1 S −1 om eu lla pn mo rin pn ne lla eu ATCC DS hila E Ma ne lla Le gio ne pn ni os um lop M LH 65 B5 49 i gio lla ea gio Le ne us oc vin ira ha he nii A4 Le 66 gio rlic ro Le cocc atiumdosp eh ve nicida 79 3 so om ho la onas mo CC 5399 0 27 AT ico tro hr Ni loc Ha lor mn rom sal hila CC 23 756 t h Al lili Ae nas op AT CC 51 Ba ns AT CC s 3A 4 ka tu Al S mo dr ro as hy oxida ns s AT ula S170 is K 84 Ae on ro oxida ldu ps VC vit er 04 m ct 23 25 m fer ro ca ca s ro s fer s cus su biu ba S M 13 48 0 7 do izo illu Ae acillu s W M illu ac oc no Rh ra dio m W S C 14 89 02 iob ac iob hyloc r ru C rum 80 41 th cte ium sa rum idi iob th et ba ob ino sa m AT sa rum 38 2 Ac idith Acidi M elo m ino 65 ch Ac R hiz gu minosa ru um sa arum AT N 42 Di ino le gu ino leg um inos li CI li CF le ium m et et ob ium legum bium leg u um um R hizhiz ob um hizo bium legizobi bi R bi R zo bium Rh izo zo Rh R hi izo hi R Rh mo Ag ro Br Br ba ad ad ct yr yr En er hi hi sif iu zo zo Oc m bi bi er Ensif hr tu um um m er ob ed m m ac jap ef tru Ag ac japon on icae eli Br m anroba ien icu icu W loti SM 10 uc ct s el thro er MAF m 11m 61 41 2 Br la F3 0s A7 9 1 pi iu Br uc ell uc ell Br suis ATm sp 01 pc 6 00 a m ell a ov uc AT C . H1 4 Br eli a m is ella C C C 49 3− 1 te icr AT uc Rh uc Bruc ell ns is ot C suis 23 18 8 3 Ni Rh od ell ell a m AT i C C 13 44 tro R od op a ca a C 25 30 5 ba Rh hodo op se m eli C M cte od ps se ud Br Bruc nis eli tens C 49 84 0 r wi op eu ud om uc ell AT tens is 23 15 no seud do omononas ella a ab C is 23 45 ab or C 16 08 7 gr mo Bra Ni adskomon na as papalus ortu tus 2336 M tro Bra dyrhi yi as s pa lus tri s 9− S1 5 Br ba Nb pa lus tri s Ha 94 9 Me dy zo cte ad s rhi −2 thy zo bium yrh r ha 55 lustri tris BisBisB5A 1 lob biu 2 ac jap izobiu mb AT s Bis B1 Me ter Me m ur C thy Me ium thy japononicu m ge C 25A5 8 3 s p. ns lob lob m thy ra icu 39 Me Me is ac thy Me thy ac ter lob diotol ter m USDAO R X 14 1 lob thy lob ium ac ter era ium USDA 12S 27 act lob act 3 8 Ba eri act eri noduium ns J sp. 16 11 rto um eri um lan po C M 89 0 ne pu lla chlor um extor s O li 28 3 Ba Barto trib o meextor qu R S B J 00 31 rto oc en ne nella oru tha quen s PA20 60 1 lla m nic B B Xa artonarton hensequint CIP um s D M41 nth ell an 10 Az ob ella a gralae a To 5476C M4 orh Ho ac izo S tar ter ba cill ha ust ulo biu ke a uto ifo mi on s e ya rm i as − P arv B Me m c no tro is 4a 1 iba eije thy au ve ph K up R hiz Me cu rincki loc linoda lla icu C 58 ob sorhi lum lav a indella ns DS s P 3 ium zo s ilve OR M 50 y2 tum biu amenica A str S 57 6 m TC is R efa lot tivo B 1 Bra R hiz hiz ob cie ns i MA ran C 90 L2 dyr ob ium C 58 F F 30s DS 39 hiz ium me Bra obium Bra ( C 30 −1 dyr dyr Ens fre dii lilo ere 99 Bra hizobi (no spe hiz ifer sp. U S ti AK on) dyr um cie obium NG DA 63 1 hiz R23 19 obi (no s) Afip um spe ATCC sp. BTA 4 1 ia (no cie R ick R icke carbox spe s) AN10319 i 1 etts ttsi cie U2 ia rick a typidovor s) 32H89 R hi ans e Ric R icke ickett ttsii S W ilm OM 1 ket sia hei ing 5 tsia ttsia pro rick la ton pro Ric w etts S mit ket h Rick tsia waz aze ii Io Rick etts pea ekii Makii Rp2 wa etts ia ma coc 2 drid kii ia R ickeRicketts felis ssiliae Rus E tic ttsia ia conURRWX MTU5 R c Cal orii R ickeicke ttsia ana den Ma 2 ttsia bel sis lish Mc 7 R icke bel lii R K iel ML lii Orie R icke ttsia OS U 369 Orie ntia ttsia aka ri 85− −C ntia 389 tsut tsutsug africaeH artford Neo sug E hrlic amu amush E S F rick etts Wolbac shi i Iked −5 E hrlic hia rum ia inan Neo sennetshia pipi Boryong a hia rum tium rick enti inan etts u Miy s W tium elge ia risti ayamaCp E hrlic W elge vonden cii Illin ( P reto ois E hrlic hia von hia ruminanden ( ria) C cha Ana ffee tium IR AD) plas nsis G Ana ma E hrlic hia Arka arde l pha plas can nsa s Ana ma mar gocytoph is plas gina ilumJ ake Ana ma mar le S t. M H Z plas arie ma gina le s Ana c entrale F lorida Ana plas ma Isra el plas s Anap ma s p. wR i lasm Anaplasm p. T a s p. G ranul R iba cterAnaplasm( no desa sp. J HBS b ethes a pipie igna tion) G Gluco nacet luconaceto G lucon dens ntis Gluco obact wP ip bac ter obacte r is C G DNIH nacet er oxyd hans obact diazotroph 1 ans en ii er diazo 6 21H icus PAI ATC troph C 2376 icus PAI 5 (RioGene) 9 Aceto bacte Acidiphiliu 5 (JGI−PGF) r pas Magn m crypt etos pirillu teuria nus um m magn I F O 3283JF−5 Aceto R hodocista eticum AMB−01 bacter xylinu cente num −1 Aceto Acetobacterm NCIB 11664S W bacter europ xylinum CL27 aeus DSM S ilicibac R os eobac ter 6160 ter denitripomeroyi R D S S −3 ficans R hodobahodobacter OC s R hodoba cter s phaerophaeroides h 114 K D131 cter s phaero ides A T C C 17 ides A 029 R hodoba TCC 1 Pelagib cter s phaero 7025 aca bermud ides 2 ensis Paracoc cus denitrifi HTCC 2601 .4 .1 Maritim ibacter cans PD1222 alkaliph Dinorose ilus HTCC26 obacter 54 shibae Ruegeria (no species) DFL 1 2 Phenylob TM1040 acterium zucineum HLK1 C aulobacte C aulobacter s p. K r Ca ulobacter s egnis A T C C 2 31 1756 c C aulobacterres centus N A1000 c res centus C B 15 Hyphomon Maricaulis maris MC S 10 as neptunium Hirschia baltica A T C C 1 5444 AT Zymomonas C C 4 9814 mobilis Z M4 Zymomonas mobilis N C IMB S phingopyxis a las kens is R B1 1163 2256 Novos phingobium S phingomonas wittichii R W1 a romaticivorans E rythrobacter litoralis D S M 1 2444 P arvularcula bermudens H T C C 2594 is HT R alstonia s olanacearum C C 2503 G MI1000 R alstonia pickettii 12J C upriavidus ta iwanens is L MG 1 9424 C upriavidus m etallidurans C H34 B urkholderia xenovorans LB 400 B urkholderia vietnam iensis G 4 Bu rkholderia thailandensis E 264 Burkholderia sp. 383 Burkholderia pseudomallei K96243 Burkholderia pseudomallei 668 Pse −1 0 8 W 46 BC 9 A− is 86 M ns 80 0 9 13 lis ge ) iti ng (2 90 35 89 C C m e A3 TC C 13 14 er ch r av ng colo KC R C C m AT33 3 10 us NB es bi tu yc es coeli ise us s AT en M 12 48 7 M 54 om yc es gr ise su rm pt m yc es gr do ofe DS DS M 20 3 9 re to St rep ptom yc yc es no la ct rnaecium S 60 2010 D 20 St re ptom yc es ve ium ca fae us S M M 3 2 St tre ptom 11 S tre ptom cter rgia rium ntari D DS 27 33 B 38 S tre iba be cte de ns ena 08 t CC P TW is S ev en ba se ca vig lei Tw is AT N CP Br ut hy us nitrifi fla cc Be ac co de as ipp lei ns sis 7 Br to sia on a wh ipp ne en e3 Ky ne lom m a wh iga an 07 Re11 1 A6 S ph Jo llu ery m mich hig B sis T C us ns Ce ph ery ter mic TC ten s lic ra Tro ph ac ter li C lai sc enhe no ivo Tro vib ac xy r ari re op thren Cla vib ia cte au lor an Cla ifson ba ter ch en 2 4 Le thro obac ter ph . FB Ar thr obac ter sp 168 5 n 42 10 Ar thr obac ter ilis n) Ar thr obac bt ilis BS J H6 IB 36 9 su bt 20 tio s C Ar thr 33 na 1 65 Ar cillus su ubtilitilis N S MY CC sig s lus Ba cil s ub tilis W 23 2 20 C 26 AT de m (no DC CT s Ba illu s ub 4 B ac cillus s s ubtili phila us N ninaruurium2 97 Ba c illu s s zo ute mo m 15 12 75 Ba llu a rhi s l sal rae 10 C C 30 01 09 B acic uri coccu rium m lep IFM s AT E C T 70 9 d) fel Ko cro cte riu ica ian C 88 C C 12 iele sa to) AT 13 Mi iba cte farcin fa sc cia ns s D1 2 (B ita um C TC 1 R en cobad ia us fas ian 03 (K My car cocc us fa sc R HA uc osae N 14 13 03 2 No do co cc us jos tii aurim eri S −3 TC C 13 hth ns Y A C C R ho do cocc s m m 5 R ho do coccu riu m dip icie icu m AT 38 R ho do ba cte riu m eff tam icu R 44 cte ho glu M ne ba riu m tam icum 41 1 R C ory ne ba cte riu m glu tam K dtii DS7109 C ory ne ba cte riu m glu keium ste M C ory ne ba cte riu m jei ppen m DS C ory ne ba cte riu m kro ticu C ory ne ba cte riu urealy 104 7 3P2 C ory ne ba cte rium um K−10 9 C ory ne cte m avi um AF2122 ur 117 2 C oryryneba eriu m avi is Paste 17 K act bov C yo −G is eriu Co cob ) 9 My cobact e rium bov is Tok PY R 221 nation My cobactcteriumm bov um llulare sig My coba eriu m gilv ace (no de My cobact eriu m intr rae 49 607 My cobact eriu m lep rae TN C 192 TC C My cobact eriu m lep ei ATCatis A 2 155 My cob act eriu m phl gm MC My cobact eriu m s me gmatis My cobact eriu sme 1 77 My cobact rium sp. JLS S 151 251 C DC My cobacte rium sp. KM S CC My cobacte rium sp. MC ulosis F 11 a AT My cobacte erium tuberc ulosis H 37RR v is My cobact erium tuberc ulos H37 My cobact erium tuberc ulos is 99 My cobact ium tuberc s Agy −1 1 ran cter My oba ium ulce lenii PYR 550 901 baa um DSM Myc oba cter ium ns Z−2 Myc oba cter ium van atic 4 rma 3 Myc obacter ium arab ii KC roge nofo3 907 4 Myc tohalob degens s hyd A T C C sis MB 5 ca rmu gen 672 5 Ace monifex 672 othe moa ceti con Am DS M 526 5 xyd teng DS M C arborella ther cter r bes cii hilum DS M Moo nae roba upto r thermop ticus B 47 8 903 eoly osir upto O SM C alda ellul osir acter prot diansis us D C aldic ellul lytic C 3 3223 obsi ATC C aldic thermob ptor sacc haro icus anol C opro ellulosiru ptor 851 9 deth SY C aldic ellulosiru r pseuX 514 s p. acte C aldic idium erob acter s p. H10 C lostr oana C 2 7405IS Dg cum erob T herm oana ellulolyti lum A T C entans ocel T hermivibrio c oferm 7 r phytorans P Acet ivibrio therm acte Acet rosporobarboxidov C 33 656 1 2 IAM 14 863 ATC Anae tridium c KIST6 ctale um um C los uria re limos therm ophil Ice 1 R oseb terium m dum E ubac obacteriu modestical WM1 n rium arolyticum inge S ymbi bacte sacch i Goett D S M 2 0731 Helio ridiummonas wolfe ntans Clost ferme M 2 0548 opho tii DS 9328 Syntr minoc occus CC 2 cus prevo Acida a AT Y5 1 rococ Anae ldia magn hafniense DSM 771 F inego tobacterium acetoxidans MI−1 SI Desulfiotomaculum reducens nicum Desulf otomaculum propio Desulf aculum thermo e C D196 7 308 P elotom difficil R 20291 = DS M e C los tridium difficil W−Y L−7 xum J 19 C los tridium parado DS M 5 Y MF klandii sQ C los tridium um stic redigen s C lostridi hilus metalli dii OhlLA 82 4 Alkalip ilus oremlan tylicum ATCC 8052 Alkaliph acetobu ium ckii NCIMB Clostrid ium beijerin m 657 Clostrid um botulinu m ATCC 19397 Clostridi um botulinu m ATCC 3502 Clostridi um botulinu A laska E 43 Clostridi 1 7B botulinum C lostridium botulinum E klund C lostridium botulinum H all C lostridium botulinum K yoto C los tridium botulinum L angeland C los tridium botulinum L och Maree kra C los tridium botulinum O 7 43B ( AT C C 3 5296) C los tridium c ellulovorans 5 55 C los tridium M kluyveri D S C los tridium N B R C 1 2016 C los tridium kluyveri DS M 1 3528 C los tridium ljungdahlii NT C los tridium novyi 13 C los tridium perfringens A T C C 1 3124 C los tridium perfringens C P N50 C los tridium perfringens S M101 C lostridium pe rfringens C los tridium tetani E 88 D S M 4 46 Alicyclobacillus a cidocaldarius Exiguobacterium sibiricu m 255−15 P aenibacillu s sp. JD R −2 Pa enibacillus sp. Y4 12M C 10 Listeria innocua Clip1126 2 Listeria monocytogenes 4b F236 5 Listeria monocytogenes EGD−e Listeria welshimeri SLCC533 4 Macrococcus caseolyticus JSCS540 2 Staphylococcus aureus RF122 Staphylococcus carnosus TM30 0 Staphylococcus epidermidis ATCC 12228 Staphylococcus epidermidis RP62 A S taphylococcus hae molyticus JCS C1435 Staphylococcus sap rophyticus A TC C 15305 Anoxybacillus fl avithermus W K 1 B acillus a myloliquefaciens D SM 7 B acillus a myloliquefac iens B acillus a nthracis A 0248 F ZB 42 B acillus a nthracis A mes B acillus a nthracis A B acillus a nthracis mes a nces tor A mes 0 581 C B acillus a nthracis DC 6 84 S terne B acillus a trophaeus 1 942 B acillus c ellulos B acillus c ereus ilyticus D S M 2 522 B acillus c ereus 0 3B B 102 B acillus c ereus AH187 A H820 B acillus c ereus A T C C B acillus c ereus B 4264 1 0987 B acillus c ereus B G SC B acillus c ereus E 33L 6 A1 B acillus c ereus G 9842 Ba cillus ce reus Q 1 Bacillus clausii KSM−K1 Bacillus 6 halodura Bacillus ns lichenifo C−125 rmis ATCC Bacillus lichenifo 14580 (DSM Bacillus 13) (Gotting megate rmis ATCC 14580 Bacillus rium QM en) (Novozy pseudo mes B acillus firmus OF4B1551 ) B acillus pumilus S AF R −032 B acillus s elenitireducen s M LS B acillus thuringiens 10 is B acillus thuringiens 4 Q2−81 (B GS C 4 B acillus thuringiens is 97−27 Q7) Bacillu thuringiens is AT C C 3 Bacillu s thuringiensi is B MB 1715646 Bacillu s thuringiensi s HD1 (BGSC 4D1) B acillus thuringiensi s HD224 (BGSC B acillu s weihe nsteps al Hakam 4H2) hane B acillu s weihe ns tepha ns is K B Bacill s weihe nstep nens AB us Bacill weihenstep hane is W S B C 4 nsis WS us 1 0204 Bacill weihenstep hane nsis WSBC B C 1 0206 G eobaus weihenstep hane nsis WSBC 10297 G eoba cillus kaust hane nsis WSBC 10311 Ocea cillus thermophil us 10315 P aeni nobacillus oden H TA42 P aeni bacillus ih eyen itrificans 6 E nterobacillus polymyxa sis H TE83 N G8 0−2 1 Leuc cocc us polymyx E 681 Oen onos toc faec alis a S C 2 C arnoococc us mes ente V 583 oen roide S taph bacterium iP S taph yloc occu sp. S U−1 s A T C C 8 293 S taph yloc occu s aure 4065 0 S taph yloc occu s a ureuus C OL S taph yloc occu s aure s J us H1 S taph yloc occu s a ureu J H9 S taph yloc occu s aure sJ S taph yloc us MRK D61 59 sa S taph yloc occ us ureus S A25 a Stap yloc occ us ureus MS S A47 2 6 Stap hylocococc us aure us MW 2 aure Mu3 S taphhylococ cus aure us S tap yloc cus aure us Mu5 0 N31 S tap hylo occus coc aure us NCT 5 Lac hylo C us coc cus Lac tobacil cus aureus Newma8325 Lac tobacil lus acida ure US n Lac tobacil lus bre oph us U S A30 0 Lac tobacil lus cas vis ilus N A30 0 T C H15 Lac tobacil lus cas ei A T C C C F M F P R 37516 Lac tobacil lus cas ei AT C C 3 67 7 Lac tobacil lus cor ei ZhaB L23 334 Lac tob acil lus cris ynif ng Lac tobacil lus pat ormis Lac tob aci lus delbru us ST1 KCTC del 316 eck La tobac llus b 7 La ctobac illus ferm rue ckii ii ATC Lac ctobac illus gallinaentum A T C 11 Lac tobaci illus ga llin rum IFO CC BA 842 Lac tobaci llus gas aru I 16 3956 A− 365 Lac tobaci llus helvetseri A m I 2 Lac tobaci llus joh icu T C C 6 333 Lac tobaci llus joh nsonii s DP 23 La tobaci llus pan nsonii FI9 C 45 NC 785 71 La ctoba llus pla is C La ctoba cill pla ntarum 17 C 533 La ctoba cill us reu ntar JD La ctoba cill us reu ter um W M1 CFS La ctoba cill us reu ter i D S La ctoba cill us reu ter i F 27 M 20 1 La ctoba cill us rha ter i F 27 5 ( DO01 6 La ctoba cill us rha mn i J C 5 (K E P ed ctoba cill us sa mn os us M 11 itas J G I) ato La ioc cill us ke os G 12 ) La cto oc cu us s sa liva i 23 us L G c 70 ali riu K Lac cto cocc s va 5 Lac t oc cocc us pe nto riu s C Str toc occu us laclac tis sa s U E C T S tre eptococcu s lac tis D L1 ce us C C 11 57 13 S tre pto oc s lac tis IL A TC 8 1 Str pto co cus tis M 14 03 C 25 S e pto co ccu ag SK1 G 13 74 S tre pto co ccu s a alacti 1 63 5 S tre pto co cc us s a gala ae Str tre pto co cc ag galac cti 260 Str ep co cc us go a lac tia ae 78 3 Str ep toc cc us rdo tia e A9 47 V R S ep toc occu us pnmutan nii e N 09 1 E M3 S tre ptotoc occu s pn eu s C S tre pt co occu s pn eu mo UA H1 16 Str tre pt oc cc s pn eu mo nia 15 9 S tre ep oc oc cu us eu mo niae e C oc pn to nia mo G Str pt co D3 s Str e pt oc cc cus pn eumo nia e G5 9 S P 14 Str ep oc oc us py eumo nia e Hu 4 Str ep toco oc cu s py ogen nia e ng ary St ep toco cc cu s p yo og es e R 19 St re to cc us p yo ge enes M1 TI 6 A− GR St rept ptoc cocc us pyog ge ne M S re ne s M GA G AS 4 oc oc us py 6 S tre ptoc oc c us py og enes s M G S (S F3 S tre ptoc oc cu p og enes MG GA AS 10 S tre ptoc oc cu s py yo enes MG AS S 10 2 70 70 ) S tre ptoc oc cu s p og ge ne MG AS 20 10 394 yo en S tre ptoc oc cu s 31 96 75 0 St tre ptoc oc cu s pyog gene es s M AS50 5 M GA 05 St re ptoc oc cu s py en St rept ptoc oc cu s s sa ngogen es s M GA S 61 re oc oc cu s ui uin es M GA S 82 80 pt oc oc cu s th s uis s 05 is S an S9 32 oc cu s th er 98 ZY S S I− fre 42 cu s th er m HA H3 K3 1 do 9 s th m ophil H3 3 6 er er m op m op hi us 3 op hi lu A0 hi lu s CN 54 lu s LM s LM RZ 1 D− G 18 9 06 6 3 11 Spirochaetes Crenarchaeota F8 Tenericut Chlamydiae s 16S idete ITS Chlorobi ria bacte ae tog rmo teria bac 23S ro Bacte The ch o Cyan o us Acid exi erm rofl s−Th Chlo ococcu etes in yc De tom ia rob nc e Pla ifica comicia i u u r Aq Verr actegloms b o e so ty tet Fu Dic rgis ne Sy Ar es Chlamydophila pneumoniae CWL029 Chlamydophila pneumoniae J138 Chlamydophila pneumoniae TW−18 3 Candidatus Phytoplasma (no species) PYL Acholeplasma asteris AYWB Acholeplasma asteris OY−M Achole plas ma australiense (no desig nation) Ac holeplas ma m ali A T Acholeplas ma axanthu m MIT 921 5 Acholeplas ma granularum MIT 9 215 Acholeplas ma la idlawii P G −8A Acholeplas ma la idlawii P G 8 A T C C 2 3206 S piroplas ma ( no s pecies ) B C −3 S piroplas ma ( no s pecies ) M Q1 ( AT CC S piroplas ma a pis B 31 (A T C C 33834) 3 3825) Mes oplas ma florum L 1 Mycoplas ma capricolum 4 2LC Mycoplas ma c apricolum C alifornia kid Mycoplas ma Mycoplas ma c apricolum E 570iii Mycoplas ma c apricolum F 38 c apricolum Mycoplas ma (n o s pecies G 5 Mycoplas ma hyopneum ) P G 50 Mycoplas ma mycoides onia J Mycoplas ma mycoides P G 3 Mycoplas ma mycoides U M30847 Mycoplas ma ag alactiaeY −goat Mycopla PG s Mycopla ma arginini G−23 2 sma gallisept 0 Mycopla sma genitaliu icum R Mycopla m G−37 Mycopla sma haemofelis Langfor Mycopla sma hyopneumonia d1 Mycop sma hyopneumonia e 168 las ma e 232 Mycop hyopne lasma Mycop m obile umonia e 7 448 las Mycop ma orale 1 63K C Mycop lasma penetr H1929 9 (A T C C 2 3714) Mycop las ma pneumans HF −2 Mycop las ma pulmo oniae M129 nis U AB Mycop lasma suis C T IP Illinoi Ureap lasma synov s lasma iae 53 B rachy parvu B rachy s pira hyodym ATCC 70097 0 B rachy spira hyody sente riae B 78 Brach s pira murd s enter iae W (T ) yspira ochii Lepto A1 pilosi DS Lepto nema illini coli 95100 M 1256 3 0 Lept spira biflex 3055 ospir a Patoc Lept 1 (Ame ospir a biflexa Lept s ) ospir a biflexa P atoc 1 Lept a bifl P atoc (Paris Lept os pira borgexa U rawa 1 ( nguc ) hi ) Lept os pira borg petersen Lept ospira inter peters ii J B 197 Lept os pira inter roga ns enii L550 5 6601 Lept ospira inter roga Lep ospira inter roga ns A kiya tosp Lep roga ns F iocru mi C ira Lep tospira interroga ns H ebdo z L 1−13 mad 0 Lep tospira interroga ns R G is B orretospira interroga ns R Z11A B orre lia afzeinte rrog ns S aline B orre lia afze lii P ko ans V m erdu B orre lia ans lii V S n B orre lia bur erina 461 Borr lia bur gdorfer( no des Borr elia bur gdorfer i 200 igna tion Borr elia bur gdorfer i 200 04 ) B orreelia bur gdorfer i 243 47 B orre lia bur gdorfer i 243 30 i 297 5 2 B orre lia bur gdo B orre lia bur gdo rferi B B orre lia bur gdo rferi G 31 B orre lia bur gdo rferi HK1 Bor lia bur gdo rferi Bor relia bur gdo rferi IP 21 Bor relia dut gdo rferi R −IP Bor relia gar ton rferi R −IP 3 90 ii Ly ZS B orrerelia inii 7 Bor lia garinii 20047 B orr relia herms Pbi B orr elia herms ii D B orr elia recurr ii S AH S piro elia turica ent erotyp Spi cha turica tae is A 1 e C Spi rochae eta tae (n o des aur 9.1 Spi r och ign ta ant E Spi rochaeaet a aurant ia J1 +136 atio n) Spi rochae ta caldar ia M S pir rochae t a coccoi ia DS 1 Tre och ta smara des M Tre pone ae ta sp. Bud gd DSM 733 4 Tre pone ma the d inae DS173 74 Tre pone ma a zot rmop y M 112 Tre pone ma de nticonutr hila 93 Tre pone ma pa = SEB ola iciu D S M llid Tre po ma pa 61 um A m R 422 Tre po ne ma pa llidum C T C C ZA S 92 Ac po ne ma ph llidum Nichic ag 35 40−9 D 8(T 5 SM R ub idimi ne ma pri aged S ho o )= 13 Ato rob cro pri mitia en S 14 ls JCM 86 2 AT Cry po ac biu mitia ZA is R 153 92 Eg pto bium ter xym fer Z S −1 eiter CC B AA Bif gerth bacte pa lan roo AS −2 B ifidido ell riu rvu op xid −8 DS 88 M Bifi ob bacte a len m lum hilus ans Bifi do act riu ta curtu DS DS D S 12 42 B ifid do bacte erium m adDSM m M 20 M 99M 10 7 AT ba B ole 22 D SM 46 9 41 33 1 CC ob cte riu B ifidob ac riu m ani ma sce 43 B AA 15 nti 64 Bifi ifidob ac ter m ani lis −8 s 1 Bifi do ac ter ium anim malis AD AT 87 Bifi do bacte ter ium a nim ali B 01 C C 15 Bi do bacte riu ium a nim ali s BI− B −1 1 70 Bi fidob bacte riu m bifidu ali s DS 0 2 3 4 fid bifi s B ob ac riu m M Bifi ifidob ac ter m brev du m P V 9 10 14 Bi do ac ter ium br e m S1 R L2 0 01 Bi fid ba ter ium de eve AT 7 0 Ca fidobobac cte ium lon nt CIP CC 15 ium riu 64 te Kin 69 lon gu ac ter Ac eo nulis ter iu m lo gu m B 69 8 Ar tin co po ium m lon ng m AT d1 S ca om ccus ra lon gu um B BMC C 15 Sa alinisnob yc rad acidi gu m DJ No lin po ac es na iot ph m JD O1 N6 8 69 7 P ca isp ra te es ole ila N M 0A Ac ropio rd ora are rium lun ran DS C C2 30 1 ioi tro nic ha dii s M Am tin 70 S yc os nib de pic ola em MG SRS3 4492 5 Th ac ol yn acte s sp a 02 8 CN oly 1 Th er char at ne riu . CN 16 tic Th er m op op ma m JS B S−2 um Ac er m ob ol sis m a 61 4 −440 05 D Fr id m ob ifid ys m iru cn es SM Fr an ot om isp a fu po ed m 20 Fr an kia herm on ora sc ra iterra DS K PA 59 an kia sp M os bi a er 5 kia sp . Eu us po sp YX yth ne 43 17 12 ra i U 82 02 sp . OR I1c cellu ra chora ea 32 7 . sy S0 lo ro ATCC N lyt m m 20 RR bi 60 i cu og 19 on L s 11 e na 99 23 t of 38 6 B AT 3 Da CC tis ca 43 gl om 19 6 er at a Chlamydophila pneumoniae AR3 9 Chlamydophila felis FeC−5 6 Chlamydophila caviae GPI C C hlamydophila abortus S 263 C hlamydia trach omatis L2tet1 C hlamydia tracho matis L2b UC H−1proctitis L 2434B u C hlamydia trachomatis D UW −3C x C hlamydia trachomatis D (s )2923 C hlamydia trachomatis B T Z1A828OT C hlamydia trachomatis B J ali20OT C hlamydia trachomatis AHAR −13 C hlamydia trachomatis 70 C hlamydia trachomatis 6 276 C hlamydia trachomatis Nigg C hlamydia muridarum U WE 25 ia a moebophila P arachlamyd biformata H T C C 2501 R obiginitalea etii K T 0803 G ramella fors m J IP 0286 ium ps ychrophilu A T C C 1 7061 F lavobacter oniae U W101 3 519−10 ium johns 271 F lavobacter iaceae bacterium a DS M 7 F lavobacter phaga ochrace uelleri C AR I C apnocyto S ulcia m M 1 3855 C andidatus ru ber DS Sa linibacterR−10 ATCC 43812 18053 ns DSM ermus marinus fermenta Rhodoth 33406 cter Dyadoba hutchinsonii ATCC 2366 ga us DSM Cytopha cter heparin DSM 2588 Pedoba 8482 haga pinensis s ATCC 1−548 2 Chitinop ides vulgatu on V P aomicr Bactero Y C H46 es t hetaiot des fra gilis C 9 343 Ba cteroid B acteroi agilis N C T W83 alis des fr 8 503 B acteroi omona s gingiv AT C C P orphyr distas onis P C C 9501 s 2 e rippka PCC950 teroide 2 P arabac C yanos pira lata PCC 700 spira capsu Cyano hococcus sp. PCC 6506 9350 Synec toria sp. P C C Oscilla s pecie s ) P C C 7905 ( no laria flos−a quae P C C 9420 Nodu inii enon 608 nizom ps is elenk es) PCC9 216 Apha aeno Anab psis (no speci es) PCC9 215 speci PCC9 9349 aeno Anab psis (no species) a e P CC aeno 933 2 Anab psis (no flos −aqu e PC C aeno 9302 Anab Anabaena flos−aqua PC C 2 71 M ae na flos−a quae arii DS 5110 Anab a a estu A T C C 3 2 65 Ana baenhloris M ssium es DS M 2 66 ecoc P rosth eton thalaovibrioidides D S 27 3 herp D S M 45 phae ctero C hloro bium eoba luteolum D S M 2 aD3 pha C hloro bium limic ola tii C LS bium C hloro bium ochromadum T S 1 C hloro C hlorom c hlor m tepi es B 265 obiu oba culu acteroid 7 DS M C hlor eob hlor 832 C rmis Q2 iofo NC IB ris pha vibr um a sp. R U−1 chlo K parv heco obium 0 ila R P rost C hlor obium hermotog T petroph a R K U−1B 8 C hlor a MS phil otog htho itima TMO T herm a nap a mar ngae 9 otog otog a letti sis BI42 T herm T hermrmotog sien TCF52B −B 1 The o melane anus m R t17 359 4 siph o afric osu S M 63 rmo siph m nod a D K B S The 166 tan rmo The acteriu poli ros eus T AA 89 sp. K B S 48 a neo us idob F erv motog erriglobglobus sp. T AA 3 ) T bus 4 T her 8 T erri glo cies T errino species) TAA 6 ( spe ) KBS 62 bus glo s (no species ) KBSBS 112 T erri obu s (no species K BD B1 rigl obu (no pecies)sp. C B AV1 5 Ter s rigl Ter globus (no id es es sp. s 19 −fl T erri lobus ococco oid nogenes J −10 −1 S u rrig hal lococc the Te tiac p. R 41 a De se Deh ide au ran s s 139 B2 7 cco s roflexu M H − 8 DS oco exu hal rofl C hloholzii ophilus HB 6 De C hlo ilus 994 1 rm ten cas s the rmophDSM ns R u us s ura 00 1 rm s the flex iod M 113 S H The mu s silvanu oro a 76 Chl T her mu cus radalis DS ltic 37 05 M her coc rm 53 13 la ba DS iot Me Deino the pirellu ilus D S M 13 S 1 ph geo do AM cus R ho lim no ensisa IF Y 04 AA68 58 5 sili rin . coc ces bra ma sp DS M s VF P 1 ino a De my cu s cto mycesire llul culumhilu oli AO 35 P lan cto P noba rop ex ae Y O3 AA −842 8 ge ex py uif s p. C B E llin 90 −1 P lan Aq ium C s dro B 35 Hy Aquif nib ila ATfla vu P 11 6 rae M 58 ge dro ciniphac ter ter s D S C 25 6724 hy 12 uri mu iob itutuscc ali TC M 6− 1 S ulfansia hthon Op bu m A m DS m H− 26 0 C hia atu u ilu M 12 6 5 rm ke tric cle gid ph DS M 21 −7 0 Ak pto nu tur mo e DS B AA 79 2 Le um mus ther ns 16 bie nis se M DS 0 eri s act tyoglo mu lom ar ao en DS nii 50 9 lca 33 sob Dic glo co s ph lsbad yi Fu lsb x vo C C 95 7 tyo m na Dic cteriu mo x car wa era AT 33 75 1 1 no ple tum lof ei ATCC MB NRC− R ba ino Natro losim dra Ha errannsii e NC p. rum HC 8 s na 9 dit bo ua Am Ha ua loq ium ali ne 04 9 x me x gib rrh ter s i clo 43 79 Ha era era s mo ac ium rtu C C 33 B3 lof lof cu lob ter AT CC ali A3−1 otg Ha Ha lococ Ha ac ris mortui AT B je ) X . GR lob ma Ha iae s Ha la ris morn cu ecies sp 2148 7 rcu ma lifo licoc sp ium M 77 G o1 o loa la ca ka ter CC MB i aro Ha rcu u la lal a (n ac um NC aze F us 2A loa arc Ha nem lob ari ium m eri s C 42 T Ha lo tri Ha salin lob ina rk an M 62 a P −1 Ha or Na m ha arc ba iv DS hil JF 1 os riu te ium an ina et nii op ei i JR Z ac ter th arc ac rto erm at igr um 8 lob ac Me anos rcina bu a th hung isn an 6A B ar re ei S 2 lob th sa es et m lab on ii s S 7 Ha Me hano oid sa irillumus bo iel di s C cc no la nn et lle M noco etha nosp cu culumgu va ipalu di s C6 5 C alu ha M etha ha no rpus nore ccus ar ip aludi dis 3 et M et co ha co us m m ar ip alu nkai− M ar ip M no et no cc ha s M etha co ccus us m mar s Na et M da tu M hano co cc us olicu no co cc di et M etha no co us ae C an M etha no cc M etha co M no ha et M Ha M et ha no ca Th ld er oc Th m opTher oc cu er m las m op s jan op m Pi las a las na Ac crop m acid ma sch idu hi a ac op vo Me P yr lip lus ido hi lca ii DS ot Me th Th oc P yr ro to ph lum nium M oc ococ fund rri ilu DS ot herm th anos er m 26 GS 61 he an du cu oc rm obac ob ph oc Py s cu um s m 12 M S1 ob te re ae cu ro furio s ho bo DS 2− 17 ac r th vib ra s co su rik on M 1B 28 te ac sta ko cc ei 97 a eo Arch Ar r th erm ter dt daka us s DS os hii T4 90 2 erm oa s m ch ae 69 an re a gl M ob o glo aeog oa utot mith ae n byss 36 OT 3 Th sis ut us er bu lob ot rophii AT DS K i G 38 Th mo Me fulgid s p us roph icu C M OD E 5 C 30 er pr Py P yro mo oteu than us rofunvene icu s De 35 91 1 V rob ba pr du fic m lta 0 61 ac cu oteu s tenopyru C −1 s us de lta H ulu lum DS s ax s k 6 (D M SN m Ca Py ars ca ne utr (no an S M 56P6 H ldi rob en lid dle 43 31 de op P yro ifo vir ac ati 04 nt hil sig ri ga ba Th ma ulum cum is us na AV19 ) De cu erm qu DS J C V tio su S taplum ofi ilin aerop M M 1124 S n) De lfuroc s ulf oc Ign hy islan lum ge hil 1351 54 ta uro cu ico lot dic pe nsis um 4 8 co s m cc he rm um nd IC− IM en Hy cc us us obilis ho us DS s H16 2 pe ka rth ma m (n spita rinM 41 rk5 7 erm o us Aeropchatk de lis K us 84 un sig bu cu IN F 1 S ulf yru en ltu Su tylicu m sis natio 41 red lfo Me olobu S ulf s pe 1 n) me Ac tallos s a olo S ulf lobus D SM rnix 22 1n tha idi no lob ph cidoc bus olobu tok 54 K 1 ge us ae ra ald solfa s od 56 nic aii a rchNitrosC enarcsa cch se duarius tar sp. B 7 C an aro la DS icu 12 did ae op atu Na on um ha eu vo D S M s P 2 s K noarc s p. ilus m ran M 63 53 9 ora ma sym s R rch ha eu C −1 riti bio 34 5− 48 mu ae m um e C ren s S sum 15 cry quita arc C M1A pto ns ha eo filu K in4 ta m OP −M Me Me than th an 5S Figure 3.6: Counts of rRNA operon genes in Human Oral Microbiome Database. A tree showing the number of genes (5S, 16S, 23S) and Intergenic Transcribed Spacers (ITS) in SSU rRNA operons, according to the rRNDB [Klappenbach et al., 2001]. This figure shows that there is no clear relationship between the number of rRNA operon copy numbers and taxa. [Schloss & Handelsman, 2005]). 3.2.5 Repository Annotation Effects SSuMMo relies upon public repository data to generate its model libraries and taxonomy information, and is therefore sensitive to inaccurate or outdated sequence annotations present in public repositories [Siezen & van Hijum, 2010]. Inaccuracies and inconsistencies between databases reduce inferred assignment accuracies, but these difficulties are faced by all software which rely on pre-existing data to classify new sequences. Through working with SSuMMo and the annotated test datasets, various inconsistencies were observed between sequence annotations and species names found within the ARB database. Often, annotated sequence names could not be found in the ARB database, with further investigations showing the most likely causes to be human error, asynchronous name-changes or taxa deliberately introduced into one database and not the other. The percentage of uncultured species described in the ARB and NCBI databases is sizeable, with 11,126 and 15,200 taxa names starting ‘uncultured’, respectively. Many taxa have numerous versions of uncultured species too. For example, the family Methanobacteriaceae contains four variations on ‘uncultured’ in the ARB database, including ‘uncultured’, ‘uncultured archaeon’, ‘uncultured Methanobacteriales archaeon’ and ‘uncultured Methanobacteriaceae archaeon’. The NCBI taxonomy contains all of these names just once, but none of them appear as children to Methanobacteriaceae. Prior to isolating a culture, formal species names cannot be accurately assigned due to an inability to fully characterise an organism’s phenotype [Dewhirst et al., 2010]. This suggests that these uncultured species have been predefined based on (dis)similarity of SSU rRNA sequences alone. As more extensive information is determined about species whose sequences are defined as uncultured, eventually leading to the definition of new species, it will be a challenge to maintain and update public databases while assigning 51 ‘uncultured’ sequences to their appropriate names. Many of these uncultured species are direct children of a family name, e.g. the family Halobacteriaceae is parent to the species ‘uncultured archaeon’, skipping the genus level assignment and therefore bypassing the rank that SSU rRNA can confidently be assigned. These curatorial discrepancies cause difficulties when trying to assess the accuracy of SSuMMo (or any similar methods) using name-based matching between taxa. 3.2.6 SSuMMo for database curation SSuMMo shows extremely high accuracies at ranks higher than genus. We suggest that current sequence and taxonomy databases may benefit from features of SSuMMo that assist with fast identification of outdated and erroneous entries. This would benefit individuals and database administrators to achieve consistency when describing sequence taxonomies and phylogenetic mappings. Consistency checks could be incorporated both pre- and postsubmission of SSU rRNA sequences into public repositories. The read sizes produced by next-generation sequencing methods enabled datasets containing hundreds of thousands of SSU rRNA sequence reads to be allocated to taxa in several hours (Table 4.4B). Running SSuMMo on a raw dataset could assign sequences to probable taxa quickly and effectively, and would also give extra assurance to annotations made with any other method. Sequences already annotated in public repositories would also benefit from the assurance of a correct SSuMMo allocation. Not only are scripts provided to download and update the latest NCBI taxonomy database and load a minimised version into MySQL, but annotations can be compared with real taxa with their corresponding rank and NCBI taxonomic ID. As the EMBL SSU rRNA database continues to be updated and enlarged, the reference collection of SSU rRNA sequences will continue to grow, and so will the ARB Silva database of aligned SSU rRNA sequences. ARB v106 currently has 1.9 million 16S rRNA sequences and the reference database over 500,000 high-quality, aligned sequences 52 allocated to 134,956 nodes across all three domains of life. As these databases continue to grow exponentially, SSuMMo’s database will not, yet it will still be updated to incorporate the latest sequence data released with EMBL, and subsequently ARB. Instead of growing (and performance decreasing) with the release of new reference sequences, SSuMMo will only continue to grow with newly defined taxa, which will only become more informative and accurate in their assignments. 53 4 Microbes Inhabiting the Human Microbiome There has been a growing interest over recent years in understanding the microbes that live both within and on the surface of the human body (e.g. [Ehrlich, 2011; Peterson et al., 2009]). The full implications for human health are yet to be realised, but the wealth of knowledge that has been bestowed upon mankind since these studies began has been simply breathtaking. To explain, as DNA sequencing gets ever more accessible, we are beginning to enter an era of “personalised medicine” [Feero et al., 2010]. The sequencing technologies are already there, but burdens still lie with cost, time and also the technical difficulties arising from both operating a sequencing machine and analysing the resulting data [Fernald et al., 2011; Hamburg & Collins, 2010]. A popular example demonstrating insight gained from human microbiome investigations is that of Hehemann’s study of the Japanese gut microbiota [Hehemann et al., 2010]. It was shown in the study that genes originating from seaweed-degrading marine bacteria had horizontally transferred into the host microbiome, causing a net benefit to the host microflora, but the bacteria from which the genes originate are not themselves inhabitants of the gut. The porphyranase-coding gene, where prevalent amongst the guts of Japanese individuals, was shown to be absent from the guts of Americans, providing a clear demonstration of the human microbiota genetically adapting according to the 55 influence of diet [Hehemann et al., 2010; Sonnenburg, 2010]. Internationally, the funding effort directed towards sequencing the human microbiome has produced an unprecedented amount of sequence data [Huse et al., 2012]. Along with this surge in funding and research into characterisation of the human gut microbiome, a huge amount of sequence data has been made freely available by research teams around the world, such as the NIH’s Human Microbiome Project [Peterson et al., 2009], the EU’s MetaHIT [Ehrlich, 2011] as well as many other independent studies (e.g. [Claesson et al., 2011; De Filippo et al., 2010; Qin et al., 2010]). This abundance of freely available data provides a great opportunity for testing novel software analysis methods on sequence data generated using a variety of sequencing platforms. SSuMMo [Leach et al., 2012] was used to analyse and visualise the species distributions and diversities of human microbiome sequence data from individuals of varying nationality, body mass index and sequencing method. High-level analyses of pooled results show similar trends to those obtained by thorough analyses performed by Turnbaugh et al. [2009], demonstrating SSuMMo’s ability to identify trends in dynamic, complex populations. 4.1 Aims Using the variety of datasets obtained over the course of the experimentation, several hypotheses can be tested: • There is a core set of bacterial species shared amongst the microbiome of healthy individuals [Turnbaugh et al., 2007]. • Imbalances in the Human Microbiome can be associated with undesirable traits such as Inflammatory Bowel Disease [Peterson et al., 2009]. 56 • A person’s Body Mass Index is affected by the microbes inhabiting his or her gut [Turnbaugh et al., 2006]. • Primer-targeted analyses will show a bias towards pre-sequenced species, whose genes were used to design the primers in use [Chakravorty et al., 2007]. • Individuals from the same geographic location share a more similar microbial gut population than individuals from other parts of the world [De Filippo et al., 2010]. 4.2 Methods Human microbiome sequence datasets produced from various studies around the world were downloaded (Table 4.1), in order to test whether the above hypotheses could be confirmed using our novel software solutions. First, some basic statistics were calculated for each dataset, including the number of sequences and the distribution of sequence lengths (Tables 4.1, 4.2 and 4.4b). Species assignments were made for all sequences that SSuMMo found to contain SSU rRNA genes, using methods described above (section 2.4). The number of sequences assigned to each taxon was tallied in order to calculate biodiversity metrics and plot discovered taxon abundances on to cladograms. Biodiversity indices were calculated and cladograms generated at a number of different taxonomic ranks between phylum and genus. Cladograms were annotated to show features such as taxon abundance distributions and ubiquity of a taxon shared amongst multiple individual. For the case where host health status information was made available (from the study by Qin et al. [2010]), sequence datasets were pooled according to whether or not an individual had inflammatory bowel disease (IBD). Microbial population distributions of individuals with IBD and those without were plotted on to a cladogram containing all genera found within the collection of all gut microbiome samples (Figure 4.2). 57 Home Country No. seqs No. people No. allocated Florence, Italy * Burkina Faso * 243,231 226,864 15 14 242,976 223,402 1,450,758 24 1,450,645 353,805 13 1,110 2,274,658 66 1,918,133 USA † Japan Totals ‡ Av. Length ± S.D. 335.69 ± 28.87 360.137 ± 46.27 559.108 ± 69.55 1,357.419 ± 1,140.27 Total Residues (Mb) 86.174 80.65 816.293 462.99 Table 4.1: Geographical human gut dataset statistics. Human microbiome sequence data for healthy human individuals were downloaded from various web servers and analysed with SSuMMo’s seqDB.py, providing initial statistics on dataset size. The number allocated shows how many of the original dataset sequences could be assigned to a clade using SSuMMo, whereas all other statistics were tallied from the raw sequence data. References:*De Filippo et al. [2010] † Peterson et al. [2009] ‡ Kurokawa et al. [2007] Where host body mass indices were disclosed, SSuMMo sequence annotations were used to try and correlate the ratio of Bacterial phyla against the host’s BMI. BMI information was released either categorically (Lean, Overweight or Obese) or as quantitative values, in the datasets released by Turnbaugh et al. [2009] and Qin et al. [2010], respectively. In both cases, sequence annotations were used to calculate the ratio between Firmicutes and Bacteroidetes phyla and plotted against the released BMI values (Figure 4.3). Rarefaction plots were also generated for the Turnbaugh et al. [2009] dataset, where for every 20th of the total number of sequences, the number of genera were counted, plotted and used to calculate biodiversity indices, including the Shannon H ′ and H max values. These were plotted individually for every BMI category and sequencing method used in the original study. Three such sequencing methods were used to generate sequences in the original study: 454 pyrosequencing reads of 16S rRNA hypervariable regions V2 and V6, as well as Sanger dideoxy full- and near full-length gene sequences. For these rarefaction plots, random sequence resampling was repeated fifty times at each subset size. 58 Gut Health No. seqs No. people No. allocated Healthy 5,409,737 99 12,032 IBD 1,179,607 25 2,158 Av. Length ± S.D. 1,565.6 ± 2,382.9 1,570.8 ± 2,371.1 Total Residues (Mb) 8,469.7 1,852.9 Table 4.2: Healthy vs. IBD gut dataset statistics. Sequences obtained from whole genome shotgun sequencing experiments were processed with SSuMMo to get an overall picture of presence and absence information between individuals suffering from Inflammatory Bowel Disease (IBD) and those without. Unfortunately, only 0.22% and 0.18% of sequences were found to contain small subunit rRNA. Four different experimental datasets were used to compare the microbial diversity in guts of individuals around the world. A rarefaction plot was generated after randomly re-sampling sequence annotations every thousand sequences and tallying the number of unique genera at each subset size. For each sample subset size, five repeat resamplings were run and the resulting taxon distributions used to calculate a number of biological diversity indices (Figure 4.7). All rarefaction curves produced are displayed as box and whisker plots, where for each subset size, median values are shown as well as 25th and 75th percentiles and “fliers”, or outliers. Outliers are defined as being beyond 1.5 times the interquartile range, which is the difference between the 25th and 75th percentiles. 4.2.1 Sample Datasets The Human Microbiome Project’s Data Analysis and Coordination Center (HMP-DACC) [Peterson et al., 2009] made available a pilot reference dataset, consisting of over 13 Gigabases of primer-targeted 16S rRNA sequence data. This data was sequenced using samples taken from 24 individuals across multiple body sites and generated by HMP sequencing centers at four different locations in the United States of America. The data generated in this Clinical Pilot Production Study of the Human Microbiome Project was 59 later deposited in the Short Read Archive (SRA) under ID SRP002012, but downloaded for this study from http://hmpdacc.org/resources/pps_data_download.php, in Fasta and Qual format. Corresponding metadata (“overview”) files describing each of 17 sequencing experiments were downloaded as well, so as to extract and organise sequence data of interest. A Python script was written (A1.4) to find sequence descriptions of interest from the compressed metadata files and to extract the corresponding sequences from the respective archives. Sequence reads generated from Stool samples were extracted with this script and separated into file names matching unique identifiers assigned to each individual. The dataset taken forward for analysis included sequences sampled from all 24 healthy individuals’ fecal specimens and is described in Table 4.1. Genus assignments were made for each microbiome sample (section 2.4) and biodiversity indices calculated for each individual (Figures 4.6 and 4.7). Primer-targeted sequence reads, sampled from 154 lean, overweight and obese twins and their mothers [Turnbaugh et al., 2009] were initially used to test SSuMMo’s applicability to analysing such datasets. The experimental results were used to compare observed trends in the data to the original publication, in an attempt to relate population distributions to BMI category and sequencing method. The data obtained had been produced using three different sequence targeting methodologies. Two hypervariable regions of 16S rRNA, V2 and V6, were targeted using region-specific primers and sequenced on the 454 GS FLX™ and GS FLX™ Titanium platforms [Turnbaugh et al., 2009]. The remaining sequence data was generated by Sanger sequencing of full- and near full-length 16S rRNA genes. All sequences were downloaded and separated according to sequencing methodology (V2, V6 and “full-length”). These were further separated by host BMI status (Lean, Overweight or Obese) according to supplemental data made available with the original paper [Turnbaugh et al., 2009]. Following organisation of the sequence data, nine fasta-formatted sequence files had been produced, comprising all of the sequence 60 information generated from each of the study’s 154 individuals. SSuMMo was used to annotate sequences in each of these nine sequence files. Sequence annotation sets from the Turnbaugh et al. [2009] dataset were later split up further, to separate microbiome information of each individual, providing further replicates and confidence to statistical analyses. Sequencing runs were almost exclusively run in duplicate, with samples taken at two different time points. These repeat sequencing runs were grouped together so long as the host’s BMI status had not changed. For five of the individuals, their BMI status had changed from Overweight to Obese, or vice-versa between samples. In this instance, sequence files were kept separate for the purposes of this experiment. Another sequence dataset, generated using whole-genome shotgun (WGS) approaches was used to compare results with those of the SSU rRNA primer-targeted experiments. The dataset produced by Kurokawa et al. [2007] contains sequences sampled from 13 Japanese individuals. The original experiment was designed to discover and explore common gene functions shared amongst the microbiomes of multiple individuals [Kurokawa et al., 2007]. The dataset was chosen as it also included sequences sampled from human Stool samples, added a new geographic location to those already obtained and provided insight into how WGS sequencing experiment results differ from those of primer-targeted experiments. Qin et al. [2010] also used WGS sequencing to produce sequence data from 124 European individuals. The released data also included information on the health status of the individual and their body mass indices. Again, sequences were analysed with SSuMMo and those containing SSU rRNA sequences were assigned down to genus specificity using methods described above (section 2.4). Further, the microbiome population distributions of individuals with inflammatory bowel diseases (IBD) were compared against those from healthy individuals (Figure 4.2). Given BMI ratios, it was also possible to plot a graph showing the abundance ratio of the two most common bacterial phyla against each host’s body mass index (Figure 4.3). 61 The study by De Filippo et al. [2010] was designed to test how diet affects the microbial population distribution of the human microbiome. Microbiomes were sampled from the stool of 15 individuals from Florence, Italy and 14 from Burkina Faso. Again, genus annotations were made from the primer-targeted SSU rRNA sequences published with the report [De Filippo et al., 2010] and biodiversity indices were calculated for each individual’s microbiome. Biodiversity indices were compared against the same statistics as calculated for other datasets described above, allowing a comparison between species distributions between four different geographic locations, when compared against sequence datasets collected from people in Japan [Kurokawa et al., 2007] and the USA [Peterson et al., 2009]. 4.3 Results 4.3.1 A core healthy microbiome SSuMMo results from Turnbaugh et al.’s data [2009] were used to find ubiquitously conserved taxa across all individuals. Conserved taxa are visualised as ‘color strips’ using the IToL web application [Letunic & Bork, 2006], so as to quickly and easily identify conserved taxa (Figure 4.1). Across all sampling methods and BMI categories only eight known genera were found in all result sets: Akkermansia, Bifidobacterium, Streptococcus, Clostridium, Pseudobutyrivibrio, Papillibacter, Subdoligranulum. There were also uncultured members of Candidate Division RF3 found in all result sets (Figure 4.1), but little is known of these bacteria as they have not yet been cultured in a laboratory for further Figure 4.1: Distribution of taxa up to genus specificity, present in the guts of 154 lean, overweight and obese individuals, pooled by sequencing method and BMI category. Graphs are shown grouped in order of increasing number of sequences generated per PCR-method, and represent the relative abundances of each taxon that were identified in that sequence pool. FL - Full Length sequences, V6 & V2 - sequences generated from V6 & V2 region specific primers. L, Ov, Ob - Lean, Overweight and Obese BMI categories. 62 Caldiserica Holdemania Turicibacter RsaHF231 uncultured bacterium uncultured Mollicutes bacterium uncultured bacterium 40 CK−1C4−19 uncultured Firmicutes bacterium uncultured rumen bacterium 4 CK−1C4−19 uncultured bacterium Clostridium innocuum Candidate division BRC1 uncultured bacterium Clostridium ramosum Candidate division BRC1 uncultured organism Clostridium sp. TM−40 Candidate division OP10 uncultured actinobacterium Clostridium spiroforme DSM 1552 Candidate division OP10 uncultured bacterium Erysipelotrichaceae bacterium 5 2 54FAA JL−ETNP−Z39 uncultured Gemmatimonadetes bacterium Eubacteriaceae bacterium DJF VR85 JL−ETNP−Z39 uncultured bacterium Eubacterium biforme DSM 3989 RFP12 gut group uncultured bacterium Eubacterium cylindroides Victivallis Eubacterium dolichum NPL−UPA2 uncultured bacterium Streptococcus pleomorphus NPL−UPA2 uncultured organism human gut metagenome 4 Marinitoga uncultured Firmicutes bacterium 5 uncultured bacterium 48 uncultured bacterium 39 Coraliomargarita C178B uncultured bacterium Akkermansia C178B uncultured compost bacterium WCHB1−60 uncultured bacterium SHBZ1548 uncultured Geobacillus sp. WCHB1−60 uncultured soil bacterium SHBZ1548 uncultured compost bacterium Elusimicrobia uncultured bacterium 4−15 uncultured Bacillaceae bacterium Lineage I (Endomicrobia) uncultured bacterium 4−15 uncultured bacterium Lineage IIc uncultured bacterium 4−15 uncultured compost bacterium Chlamydophila 16d63.751 uncultured bacterium Waddlia Ll142−1A4 uncultured bacterium Neochlamydia P5D1−392 uncultured bacterium Parachlamydia Rs−D42 uncultured bacterium Chrysiogenes arsenatis Leuconostoc Desulfurispirillum Weissella bacterium AHT 11 Photorhabdus luminescens bacterium AHT 19 MOB164 uncultured bacterium AT425−EubC11 terrestrial group uncultured bacterium Carnobacterium uncultured bacterium 41 Granulicatella Kazan−1B−37 uncultured bacterium Enterococcus Candidate division WS3 uncultured actinobacterium Enterococcus sp. enrichment culture clone S DBTGW Vagococcus uncultured candidate division WS3 bacterium Lactobacillus Candidate division WS3 uncultured delta proteobacterium Paralactobacillus Candidate division WS3 uncultured organism Pediococcus KD3−62 uncultured bacterium Lactococcus Deinococcus Streptococcus Truepera Lactovum uncultured Firmicutes bacterium Meiothermus Abiotrophia Vulcanithermus Aerococcus 4−29 uncultured bacterium Eremococcus OPB95 uncultured bacterium Globicatella uncultured bacterium 42 Ignavigranum uncultured prokaryote Dolosicoccus paucivorans uncultured bacterium 9 uncultured bacterium 21 OPB56 uncultured bacterium Thermoactinomycetaceae bacterium CNJ795 PL04 BSV26 uncultured Chlorobi bacterium Bacillales uncultured compost bacterium BSV26 uncultured bacterium Thermicanus SJA−28 uncultured Chlorobi bacterium Sporolactobacillus SJA−28 uncultured bacterium Alicyclobacillus Kazan−3B−09 uncultured bacterium Alicyclobacillaceae uncultured Alicyclobacillus sp. Brachyspira Brochothrix LH041 uncultured spirochete Listeria uncultured bacterium 45 Jeotgalicoccus Staphylococcus Spirochaeta Laceyella uncultured Spirochaetes bacterium Shimazuella uncultured spirochete Exiguobacterium Aminobacterium Bacillus decisifrondis Candidatus Tammella uncultured bacterium 17 Cloacibacillus Brevibacillus Pyramidobacter Cohnella Thermovirga Thermobacillus Synergistetes bacterium oral taxon 363 Paenibacillaceae uncultured bacterium uncultured Synergistetes bacterium Paenibacillus mucilaginosus uncultured bacterium 46 uncultured bacterium 18 BD1−5 uncultured bacterium Candidate Division BD1-5 Kurthia BD1−5 uncultured candidate division GN02 bacterium Lysinibacillus BD1−5 uncultured candidate division OP11 bacterium Paenisporosarcina BD1−5 uncultured deep−sea bacterium Rummeliibacillus BD1−5 uncultured marine bacterium Solibacillus BD1−5 uncultured organism Sporosarcina BD1−5 uncultured prokaryote uncultured bacterium 19 BD1−5 uncultured proteobacterium uncultured bacterium 20 BD1−5 uncultured soil bacterium Alkalibacillus TM7 phylum sp. oral clone DR034 Candidate Division TM7 Amphibacillus metal−contaminated soil clone K20−12 Anoxybacillus Candidate division TM7 uncultured bacterium Aquisalibacillus Candidate division TM7 uncultured bacterium AH040 Bacillus Candidate division TM7 uncultured bacterium oral clone BE109 Cerasibacillus Candidate division TM7 uncultured bacterium oral clone BS003 Halalkalibacillus uncultured candidate division TM7 bacterium Halolactibacillus Candidate division TM7 uncultured compost bacterium Deferribacteres Lentibacillus Candidate division TM7 uncultured rumen bacterium Natronobacillus Caldithrix Oceanobacillus LCP−89 uncultured bacterium Ornithinibacillus PAUC34f uncultured bacterium Paraliobacillus Calditerrivibrio Paucisalibacillus uncultured Deferribacteres bacterium Piscibacillus uncultured bacterium 14 Salirhabdus uncultured SAR406 cluster bacterium Sediminibacillus SAR406 clade(Marine group A) uncultured bacterium Virgibacillus SAR406 clade(Marine group A) uncultured marine bacterium Fibrobacteres Geomicrobium halophilum 09D2Z46 uncultured candidate division TG3 bacterium uncultured Firmicutes bacterium B122 uncultured bacterium uncultured bacterium 16 TSCOR003−O20 uncultured bacterium Thermolithobacter termite gut group uncultured Fibrobacteres bacterium 1 D8A−2 uncultured bacterium termite gut group uncultured candidate division TG3 bacterium Ammonifex Fibrobacter Gelria Coprothermobacter Fibrobacteraceae uncultured bacterium termite gut group uncultured Fibrobacteres bacterium Candidate Division RF3 Caldanaerobius uncultured bacterium 15 Caldicellulosiruptor bacterium enrichment culture clone R4−81B Tepidanaerobacter RF3 uncultured Bacillales bacterium Dethiosulfatibacter RF3 uncultured Bacillus sp. Syntrophomonas RF3 uncultured bacterium TSAC18 uncultured bacterium RF3 uncultured compost bacterium livecontrolB21 uncultured bacterium RF3 uncultured low G+C Gram−positive bacterium Acidaminobacter Fusibacter unidentified rumen bacterium RFN82 Lutispora Thermobaculum uncultured bacterium 28 AKIW781 uncultured bacterium Heliobacterium S085 uncultured bacterium uncultured Peptococcaceae bacterium Caldilinea uncultured bacterium 29 uncultured bacterium 11 OPB54 uncultured Firmicutes bacterium JG30−KF−CM66 uncultured Chloroflexi bacterium OPB54 uncultured bacterium JG30−KF−CM66 uncultured bacterium OPB54 uncultured low G+C Gram−positive bacterium TK10 uncultured Chloroflexi bacterium Alkaliphilus TK10 uncultured bacterium Clostridium Longilinea Sarcina uncultured Chloroflexi bacterium uncultured bacterium 22 uncultured bacterium 10 Peptococcus uncultured organism Acidobacteria Thermincola RB25 uncultured bacterium bacterium MB7−1 Acanthopleuribacter uncultured bacterium 33 NS72 uncultured bacterium Anaerofustis iii1−8 uncultured bacterium Eubacterium Candidatus Solibacter Pseudoramibacter 11−24 uncultured Acidobacterium sp. Eubacteriaceae uncultured bacterium 32−21 uncultured bacterium uncultured bacterium 23 BPC102 uncultured bacterium Mogibacterium DA023 uncultured bacterium uncultured bacterium 27 Elev−16S−573 uncultured bacterium uncultured bacterium 26 JH−WHS99 uncultured bacterium uncultured rumen bacterium SJA−149 uncultured bacterium Eubacterium sp. WAL 17363 DA052 uncultured bacterium Eubacterium sp. WAL 18692 DA052 uncultured forest soil bacterium uncultured bacterium 25 Mollicutes uncultured bacterium Peptostreptococcus Acholeplasma Tenericutes Tepidibacter Anaeroplasma Peptostreptococcaceae uncultured bacterium DMI uncultured bacterium uncultured bacterium 35 EMP−G18 uncultured bacterium Clostridium bartlettii DSM 16795 Haloplasma Clostridium difficile CD196 Candidatus Hepatoplasma Clostridium glycolicum Spiroplasma Clostridium irregulare Candidatus Bacilloplasma uncultured bacterium 34 Mycoplasma uncultured low G+C Gram−positive bacterium 1 uncultured bacterium 47 Anaerococcus RF9 uncultured Clostridiales bacterium Finegoldia RF9 uncultured bacterium Gallicola RF9 uncultured bacterium adhufec202 Fusobacteria Helcococcus RF9 uncultured rumen bacterium Peptoniphilus BS1−0−74 uncultured bacterium Sedimentibacter SHA−35 uncultured bacterium Soehngenia Fusobacteriales uncultured bacterium Tepidimicrobium CFT112H7 uncultured bacterium uncultured bacterium 24 boneC3G7 uncultured bacterium Parvimonas uncultured Peptostreptococcus sp. Granulicatella sp. oral clone ASC02 Parvimonas uncultured bacterium ASCC02 uncultured bacterium Acidaminococcus Fusobacterium sp. oral taxon D47 Allisonella Hados.Sed.Eubac.3 uncultured bacterium Anaerosinus Fusobacterium Anaerospora Ilyobacter Dendrosporobacter Propionigenium Dialister Sneathia Megamonas Leptotrichia−like sp. oral clone BB135 Megasphaera uncultured Fusobacteria bacterium Mitsuokella uncultured Fusobacterium sp. Phascolarctobacterium Acaryochloris Succiniclasticum Mastigocladopsis Veillonella Synechococcus sp. UH7 Veillonellaceae uncultured bacterium ML635J−21 uncultured bacterium Cyanobacteria uncultured bacterium 38 MLE1−12 uncultured bacterium Acetitomaculum Aphanizomenon gracile Anaerosporobacter 4C0d−2 uncultured bacterium Anaerostipes 4C0d−2 uncultured rumen bacterium Blautia Cyanothece Butyrivibrio uncultured bacterium 12 Caldicoprobacter uncultured bacterium 13 Catabacter uncultured cyanobacterium Catonella Arthrospira Cellulosilyticum Lyngbya Coprococcus Phormidiu m Dorea SubsectionIII uncultured cyanobacterium Epulopiscium Dinophysis mitra Hespellia Solanum bulbocastanum Howardella Solanum tuberosum (potato) Johnsonella Sorghum bicolor (sorghum) Lachnospira Chloroplast uncultured bacterium Moryella Chloroplast uncultured diatom Oribacterium Chloroplast uncultured prokaryote Pseudobutyrivibrio uncultured bacterium Robinsoniella MB−A2−108 uncultured bacterium Roseburia Rubrobacter Shuttleworthia Adlercreutzia Syntrophococcus Asaccharobacter Actinobacteria Lachnospiraceae uncultured bacterium Atopobium uncultured bacterium 31 Collinsella Marvinbryantia formatexigens Coriobacterium Marvinbryantia uncultured bacterium Denitrobacterium Marvinbryantia uncultured rumen bacterium Eggerthella Firmicutes oral clone MCE7 107 Enterorhabdus human gut metagenome 1 Gordonibacter metagenome sequence 1 Olsenella uncultured Clostridia bacterium Olsenella sp. S13−10 uncultured Clostridiaceae bacterium 1 Paraeggerthella uncultured Clostridiales bacterium 1 Slackia uncultured Clostridium sp. 1 Coriobacteriaceae uncultured Olsenella sp. uncultured Firmicutes bacterium 2 Coriobacteriaceae uncultured bacterium uncultured Thermoanaerobacteriales bacterium Coriobacteriaceae bacterium WAL 18889 uncultured anaerobic bacterium uncultured Actinobacteridae bacterium uncultured bacterium 32 uncultured actinobacterium uncultured low G+C Gram−positive bacterium uncultured bacterium 2 uncultured rumen bacterium 1 unidentified Clostridium hathewayi Bogoriellaceae bacterium YIM 93306 Clostridium scindens Bifidobacterium Clostridium sp. Corynebacterium Clostridium sp. SS2.1 Elev−16S−976 uncultured bacterium Eubacterium hallii Actinobaculum Ruminococcus gnavus Actinomyces Ruminococcus torques ATCC 27756 Mobiluncus butyrate−producing bacterium SSC.2 Varibaculum metagenome sequence Dermatophilus uncultured Clostridiaceae bacterium Microbacterium uncultured Clostridiales bacterium Arthrobacter uncultured Clostridium sp. Rothia uncultured Eubacteriaceae bacterium Aeromicrobium uncultured Firmicutes bacterium 1 Kribbella uncultured Ruminococcus sp. Marmoricola uncultured bacterium 30 Nocardioides unidentified 1 Aestuariimicrobium Acetanaerobacterium Brooklawnia Acetivibrio Propionibacterium Anaerofilum Tessaracoccus Anaerotruncus Propionibacteriaceae uncultured bacterium Butyricicoccus uncultured bacterium 1 Ethanoligenens Candidatus Thiobios Faecalibacterium ARKICE−90 uncultured bacterium Fastidiosipila Elev−16S−509 uncultured bacterium Hydrogenoanaerobacterium Campylobacter Oscillibacter MACA−EFT26 uncultured archaeon Oscillospira SC3−20 uncultured bacterium Papillibacter SK259 uncultured proteobacterium Proteobacteria Clostridia Anaerobranca RF3 uncultured Firmicutes bacterium RF3 uncultured organism Chloroflexi Firmicutes Facklamia uncultured Nitrospirae bacterium V2072−189E03 uncultured bacterium Synergistetes Bacilli Atopococcus BD2−11 terrestrial group uncultured organism Candidate division WS3 uncultured bacterium Deinococcus-Thermus Nitrospirae Chlorobi Spirochaetes U Erysipelothrix Solobacterium human gut metagenome 5 TA06 uncultured bacterium Chrysiogenetes Gemmatimonadetes Ob Coprobacillus Candidate division OP9 uncultured bacterium EM19 uncultured bacterium MVP−21 uncultured bacterium Verrucomicrobia Elusimicrobia Ov Catenibacterium BHI80−139 uncultured bacterium LF045 uncultured Caldiserica bacterium OC31 uncultured bacterium Lentisphaerae V2 Ob L Ov L V6 Ob Ob Ov L Ov FL V2 Ob L OV L V6 Ob L Ov U FL Ruminococcus TA18 uncultured proteobacterium Sporobacter SPOTSOCT00m83 uncultured bacterium Subdoligranulum SPOTSOCT00m83 uncultured sulfur−oxidizing symbiont bacterium Ruminococcaceae uncultured bacterium Magnetococcus sp. MC−1 human gut metagenome 3 magnetic coccus MP17 uncultured Clostridia bacterium 2 CF2 uncultured Magnetococcus sp. uncultured Clostridiaceae bacterium 3 DB1−14 uncultured alpha proteobacterium uncultured Clostridiales bacterium 3 Parvularcula uncultured Clostridium sp. 3 Thalassospira uncultured Firmicutes bacterium 4 Candidatus Captivus uncultured Lachnospiraceae bacterium Sphingomonas sp. MN 122.2a uncultured Ruminococcaceae bacterium 1 Methylobacterium uncultured bacterium 37 uncultured alpha proteobacterium uncultured bacterium adhufec108 Thioprofundum uncultured bacterium adhufec311 Colwellia uncultured rumen bacterium 3 PB1−aai25f07 uncultured bacterium uncultured rumen bacterium 4C28d−12 Anaerobiospirillum uncultured rumen bacterium 5C0d−12 Succinivibrio uncultured rumen bacterium 5C3d−17 Oryza sativa Indica Group unidentified 2 B38 uncultured bacterium unidentified rumen bacterium 12−110 Actinobacillus unidentified rumen bacterium 12−124 Aggregatibacter unidentified rumen bacterium RF6 Cronobacter unidentified rumen bacterium RFN16 Escherichia fergusonii unidentified rumen bacterium RFN25 Escherichia−Shigella uncultured bacterium unidentified rumen bacterium RFN33 Pseudomonas Bacteroides capillosus Acinetobacter Clostridiaceae bacterium DJF LS40 uncultured bacterium 44 Clostridium leptum DR−16 uncultured bacterium Clostridium orbiscindens Neisseria Clostridium sp. 1 SC−I−84 uncultured bacterium Clostridium sp. NML 04A032 SC−I−84 uncultured beta proteobacterium Clostridium sporosphaeroides UCT N117 uncultured bacterium Eubacterium siraeum UCT N117 uncultured beta proteobacterium Eubacterium siraeum DSM 15702 Nitrosomonas Firmicutes bacterium DJF VP44 uncultured Burkholderiales bacterium Ruminococcaceae bacterium D16 uncultured bacterium 43 Ruminococcus sp. 16442 uncultured beta proteobacterium Ruminococcus sp. YE281 Thiomonas bacterium mpn−isolate group 25 Oxalobacter human gut metagenome 2 Sutterella metagenome sequence 2 Parasutterella uncultured bacterium uncultured Clostridia bacterium 1 Cupriavidus uncultured Clostridiaceae bacterium 2 Limnobacter uncultured Clostridiales bacterium 2 Inhella uncultured Clostridium sp. 2 Roseateles uncultured Firmicutes bacterium 3 Flexibacter sp. AMV16 uncultured Ruminococcaceae bacterium SM1A07 uncultured bacterium uncultured bacterium 36 Sufflavibacter uncultured bacterium adhufec168 uncultured bacterium 6 uncultured rumen bacterium 2 VC2.1 Bac22 uncultured bacterium VC2.1 Bac22 uncultured rumen bacterium Bacteroidetes uncultured Bacteroidetes bacterium 1 Algoriphagus ST−12K33 uncultured bacterium TAA−5−07 uncultured Flavobacterium sp. env.OPS 17 uncultured bacterium vadinHA17 uncultured bacterium Reichenbachiella uncultured bacterium 8 SB−1 uncultured Bacteroidetes bacterium SB−1 uncultured bacterium SB−1 uncultured organism Arcicella Cytophaga Dyadobacter Flexibacter Hymenobacter Siphonobacter uncultured bacterium 7 Bacteroidales uncultured bacterium Bacteroides CAP−aah99b04 uncultured bacterium RF16 uncultured bacterium gir−aah93h0 uncultured bacterium ratAN060301C uncultured bacterium BS11 gut group uncultured bacterium BS11 gut group uncultured rumen bacterium Marinilabilia uncultured Bacteroidetes bacterium uncultured bacterium 3 mouse gut metagenome S24−7 uncultured bacterium S24−7 uncultured rumen bacterium Paraprevotella Prevotella Xylanibacter Prevotellaceae uncultured bacterium human gut metagenome uncultured bacterium 5 Alistipes Key: U - Ubiquity Number of datasets containing that genus. 9 8 7 6 5 4 3 2 1 Regular Roman font - phylum. Italics - class. Genera found in all sequence datasets. BCf9−17 termite group uncultured Bacteroidales bacterium SP3−e08 uncultured bacterium hoa5−07d05 gut group uncultured bacterium vadinBC27 wastewater−sludge group uncultured bacterium RC9 gut group uncultured bacterium RC9 gut group uncultured rumen bacterium Barnesiella Butyricimonas Candidatus Symbiothrix Odoribacter Paludibacter Parabacteroides Porphyromonas Proteiniphilu m uncultured bacterium 4 Leaves are coloured by phylum where ARB phylum name matches entry in NCBI taxonomy database. testing. Some of the named genera have already been reported as beneficial to health when found in human intestinal tracts (e.g. Bifidobacterium [Hao et al., 2011], Akkermansia [Derrien et al., 2007], etc.). However, functional genetic information is needed to elucidate if each provides unique metabolic capabilities that would justify their ubiquitous nature. 4.3.2 Healthy and IBD-infected gut microbiotas Data analysed from the Qin et al. [2010] dataset produced a minimal number of sequence matches (see Table 4.2) compared with the number of sequences analysed. Only 0.22% of sequences from this study could be given even a domain-level assignment by SSuMMo, as a result of the sequences being assembled from a WGS sequencing experiment and that a single gene takes up such a small proportion of an entire genome. However, that still equates to 14,190 sequences being annotated with a genus-level assignment overall (Table 4.2), from which the population distribution was visualised (Figure 4.2) and biodiversity indices calculated. Out of the 124 individuals sampled, 99 of those were described as having healthy guts, compared with 25 having inflammatory bowel disease. As can be expected from more thoroughly sampled environments, a greater number of genera were discovered amongst individuals with healthy guts. This is to be expected and is a well-established phenomenon in ecological studies [Magurran, 2009]. For instance, two samples taken from the same environment but differing in size can lead to different conclusions on their diversity [Pielou, 1975]. Simpson’s index is said to be one of the least sensitive biodiversity metrics to differences in sample size [Magurran, 2009], but for this comparative dataset, both values are extremely close to the maximum Simpson diversity of 1.0: 0.9956 ± 0.0038 for healthy guts (n=99) and 0.9954 ± 0.0022 (n=25) for those with IBD (Table 4.3). A one way analysis of variance (ANOVA) test, or one-way F-test on the Simpson index values gives a p-value of 0.854, indicating that the species diversities in the IBD dataset almost certainly 64 IBD Healthy uncultured archaeon Archaea uncultured bacterium BHI80-139 uncultured bacterium CK-1C4-19 uncultured bacterium Candidate division OD1 Chrysiogenetes Elusimicrobia Gemmatimonadetes Acidobacteria Nitrospirae Spirochaetes Deferribacteres Fusobacteria Metallosphaera Chrysiogenaceae uncultured archaeon pHGPA13 Thermoproteaceae uncultured bacterium Lineage I (Endomicrobia) Methanomicrobia uncultured delta proteobacterium GOUTA4 Methanobrevibacter uncultured Gemmatimonadetes bacterium S0134 terrestrial group uncultured rumen archaeon uncultured bacterium NPL-UPA2 MKCST-A3 uncultured bacterium OC31 uncultured archaeon Terrestrial Miscellaneous Gp(TMEG) uncultured bacterium RsaHF231 uncultured Methanobacteriales archaeon vadinCA11 gut group uncultured bacterium SM2F11 uncultured archaeon vadinCA11 gut group uncultured bacterium TA06 Dimorpha uncultured candidate division TM7 bacterium WCHB1-60 Trimastix uncultured bacterium BPC102 Amastigomonas uncultured deep-sea bacterium RB25 Acanthocystis uncultured actinobacterium Candidate division OP10 Rhodomonas uncultured bacterium SJA-176 Candidate division OP10 Rhabdomonas Chlamydia Reclinomonas Waddlia uncultured Nucleariidae environmental samples Leptospirillum Saccinobaculus uncultured bacterium 8 Brachyspira Verrucomicrobia Chloroflexi Dientamoeba Haplosporidium uncultured spirochete LH041 Florideophyceae uncultured bacterium BD1-5 Telonema uncultured candidate division GN02 bacterium BD1-5 Apicomplexa uncultured proteobacterium BD1-5 uncultured ciliate environmental samples uncultured Verrucomicrobia bacterium Candidate division WS3 Enteromonas uncultured bacterium Candidate division WS3 Spironucleus uncultured organism Candidate division WS3 Paravahlkampfia Caldithrix Sawyeria uncultured SAR406 cluster bacterium Vahlkampfiidae sp. SK1 uncultured bacterium SAR406 clade(Marine group A) uncultured eukaryote environmental samples uncultured bacterium BS1-0-74 uncultured eukaryotic picoplankton environmental samples Fusobacterium uncultured marine eukaryote environmental samples uncultured bacterium boneC3G7 Blastocystis Euryarchaeota Euglenida Haplosporidia Apicomplexa Pleurochloridella Mycetozoa uncultured bacterium Candidate division BRC1 Tubulinea uncultured bacterium SJP-2 Candidate division BRC1 Thecamoeba uncultured bacterium SJP-3 Candidate division BRC1 Entamoeba uncultured organism Candidate division BRC1 Mastigamoeba uncultured bacterium TM6 Mastigella uncultured bacterium C1-3 TM6 fungal sp. GMG PPb1 uncultured deep-sea bacterium TM6 uncultured fungus environmental samples uncultured organism TM6 Ascomycota (ascomycetes) uncultured bacterium Crenarchaeota Phaeothamnion uncultured bacterium RF3 Collinsella Synergistetes unidentified archaeon Thermoproteus uncultured Firmicutes bacterium RF3 Lentisphaerae uncultured archaeon Terrestrial Hot Spring Gp(THSCG) uncultured bacterium Candidate division OP9 uncultured Bacillales bacterium RF3 Actinobacteria Miscellaneous Crenarchaeotic Group uncultured bacterium Candidate division OP11 Saccharomyces hot spring yeast RND13 Bogoriellaceae bacterium YIM 93306 Schizosaccharomyces Gardnerella Taphrina Nitriliruptor Siboglinidae Lentisphaera Buthidae uncultured bacterium RFP12 gut group Charistephane uncultured rumen bacterium RFP12 gut group Pliciloricus Victivallis Myliobatoidei uncultured bacterium 7 Narcine Aminobacterium Dasypus Synergistes Myotis Thermanaerovibrio Sorex Thermovirga Pseudoscourfieldia uncultured Synergistes sp. Zamia uncultured bacterium DA101 soil group Zea Coraliomargarita Calycanthus uncultured bacterium vadinHA64 Saururus Akkermansia Grevillea uncultured Verrucomicrobiales bacterium Solanum Ktedonobacter Vitis uncultured bacterium JG37-AG-4 Carica uncultured Chloroflexi bacterium 2 Cucumis uncultured bacterium 4 Glycine uncultured Chloroflexi bacterium Populus Ascomycota Annelida Arthropoda Loricifera Chordata Chlorophyta Streptophyta uncultured Chloroflexus sp. uncultured bacterium 3 Catenibacterium Thermoanaerobacterium uncultured bacterium 5 Moryella uncultured bacterium 6 Firmicutes Dethiobacter Succiniclasticum Anaerotruncus uncultured rumen bacterium 2 human gut metagenome 2 uncultured Clostridiaceae bacterium Candidatus Phytoplasma uncultured bacterium Mollicutes Acholeplasma Incertae Sedis Entomoplasmataceae Tenericutes Mesoplasma uncultured bacterium RF9 uncultured rumen bacterium RF9 Mycoplasma uncultured Firmicutes bacterium Mycoplasmataceae uncultured bacterium Mycoplasmataceae uncultured bacterium 9 Petalonema sp. ANT.GENTNER2.8 Synechocystis sp. CCALA 700 uncultured Bacillales bacterium ML635J-21 Alsophila spinulosa Dolichomastix tenuilepis Cyanobacteria Glaucosphaera vacuolata Heterosigma akashiwo Lemna minor Plasmodium berghei Plasmodium chabaudi chabaudi Plasmodium falciparum (malaria parasite P. falciparum) marine metagenome uncultured gamma proteobacterium ARKICE-90 uncultured bacterium Elev-16S-509 uncultured bacterium B38 uncultured gamma proteobacterium SC3-20 uncultured bacterium pItb-vmat-80 Caulobacterales Proteobacteria uncultured organism DB1-14 uncultured alpha proteobacterium Deep 1 Caedibacter uncultured bacterium mitochondria uncultured beta proteobacterium uncultured bacterium SC-I-84 uncultured bacterium UCT N117 uncultured bacterium oca12 uncultured bacterium Parasutterella uncultured beta proteobacterium Parasutterella Flexibacter sp. AMV16 uncultured bacterium VC2.1 Bac22 uncultured organism NS7 marine group Capnocytophaga Zunongwangia Bacteroides uncultured Bacteroidetes bacterium unidentified rumen bacterium RF16 uncultured bacterium S24-7 Bacteroidetes uncultured bacterium ratAN060301C uncultured bacterium dgA-11 gut group uncultured bacterium RC9 gut group uncultured rumen bacterium RC9 gut group Barnesiella Butyricimonas Odoribacter Parabacteroides Prevotella human gut metagenome uncultured bacterium 2 uncultured rumen bacterium Figure 4.2: Gut microbiota of 99 healthy vs. 25 IBD-suffering individuals. SSU rRNA-containing sequences produced from Qin et al.’s [2010] WGS sequencing experiment were annotated to genus specificity using SSuMMo. Sequence data from healthy and IBD-inflicted individuals were tallied separately and relative abundances plotted. could have been drawn from the same species distribution as that for the healthy gut samples, assuming a normally distributed range of values. The non-parametric equivalent, the Kruskal-Wallis H-test calculated for the same set of Simpson indices gives a p-value of 0.104, which conversely indicates there to be an 89.6% chance of the samples being drawn from independent environments, assuming a Chi-squared distribution. The former test supports the null-hypothesis that there is no difference in gut microbial populations, but the latter suggests that there could be a marked population difference, provided that gut biodiversities follow a Chi-squared distribution. In macro-ecological studies, however, species populations are “often approximately normally distributed” [Magurran, 2009], further support that the two sets of samples are not markedly different, according to the analysis. The original study [Qin et al., 2010] and at least one other [Manichanh et al., 2006] has reported significant differences between the microbial populations of IBD-sufferers and those with healthy guts. In both studies, sequence reads were based on sequence similarity, and clustered into OTUs before performing Principal Components Analysis on the resulting sequence sample clusters. Here, more traditional ecological metrics are Shannon H ′ Shannon H max Simpson D IBD (n = 25) 3.8962 ± 0.1420 3.9426 ± 0.1428 0.9954 ± 0.0022 Healthy (n = 99) 3.9639 ± 0.1420 4.0112 ± 0.1325 0.9956 ± 0.0037 One-way ANOVA F-value p-value Kruskal-Wallis H-value p-value 4.4631 0.0367 4.6441 0.0312 5.0923 0.0258 4.1348 0.0420 0.0341 0.8538 2.6427 0.1040 Table 4.3: Analysis of Variance of biodiversity index calculations. Shannon and Simpson biodiversity metrics were calculated for each of 124 individuals and are shown with standard deviations. The variances in sample biodiversity were analysed for statistical significance between 24 individuals suffering from IBD and 99 others who do not, using a one-way ANOVA and Kruskal-Wallis tests. 66 used to calculate sample species diversities. The ANOVA tests show that for our results, Simpson diversity is not increased if considering the degrees of freedom in the samples. However, a comparably higher statistical confidence (p < 0.04) is demonstrated for species evenness across the healthy gut assemblage, according to Shannon indices. As can be seen in the comparative genus abundances shown in Figure 4.2, taxa evenness is one attribute that is visibly more apparent in the healthy dataset, which is confirmed as statistically significant by the ANOVA and Kruskal-Wallis tests (Table 4.3). 4.3.3 Gut microbiome differences relating to adiposity The dataset produced by Qin et al. [2010] was the only available dataset that published quantitative BMI indices associated with each individual. To test the hypothesis that a person’s BMI index is proportional to the ratio of Firmicutes / Bacteriodetes (F/B), a scatter plot was generated to determine if any correlation existed (Figure 4.3). Unfortunately, the number of sequences assigned to Bacteroidetes and Firmicutes was so low that no confident conclusion could be made with regard to this hypothesis, using the results obtained from this dataset. However, SSuMMo analyses of V2 regions and full-length 16S rRNA sequences concur with the observations made in the original study by Turnbaugh et al. [2009]: that obese subject samples have significantly fewer Bacteroidetes, more Actinobacteria and less of a difference in Firmicutes abundance relative to lean individuals (Table 4.4). Similar trends were observed across the dataset at lower taxonomic ranks, with no single genus dominating any subset of the data (Figure 4.1). Sequences sampled from the V6 hypervariable region of 16S rRNA were not as conclusive. In analysing the sequence annotations, it was noted that Bacteroidetes were only identified in a small handful of the V6 samples. This can be seen in Figure 4.1 and more clearly in Table 4.4a, where the V6 sequence reads almost completely lack Bacteroidetes 67 sequences. For V2 and Sanger sequence reads, enough Bacteroidetes and Firmicutes were present in the 154 samples to calculate ratios between those phyla. These are displayed as box and whisker plots in Figure 4.3b. Although there were far more V2 sequence reads in the dataset than there were dideoxy sequences, there appears to be no correlation between V2 reads and host BMI category. The same could be said of the V6 sequences, for which no F/B ratio could be calculated (due to the division by zero). However, the dideoxy sequences are more interesting, in that a striking correlation in the range of ratio values is visible. Clearly, obese individuals show a much larger range of F/B ratios than their lean and overweight counterparts. When the mean value is taken (Table 4.4a), the difference is not nearly as apparent, due to some extreme outliers, which are omitted from the boxplot. The median value and interquartile range increases noticeably however, with more adipose BMI categories. Shannon and Simpson biodiversity indices, biological diversity measures incorporating evenness and richness, respectively [Magurran, 2009], were calculated for each BMI category based on species-level taxa assignments (Table 4.4b). These statistics were used to investigate whether notable changes in biodiversity could be identified when sequences were grouped at species rank. No consistent changes were observed across all three BMI categories and sequence targets, as pooled samples obfuscate more subtle differences which might be observed between individuals. For example, gut populations were shown to be more similar between family members in the original publication, so characterising species assemblages from lean and obese members of the same family (rather than all families pooled together) should be a fairer method of delineating differences between BMI categories. Furthermore, variation in the number of defined species per genus across the tree of life will cause differences in primer specificity to drastically affect Shannon and Simpson index calculations, which are functions of the number of observed taxa (section 2.13). 68 a) Full length Phylum Lean Over. V6 targeted Obese Lean Over. V2 targeted Obese Lean Over. Obese - - - - - - 0.00 - 0.00 Actinobacteria 2.60 1.10 4.63 0.02 0.05 0.11 0.66 0.66 1.58 Bacteroidetes 10.45 10.39 7.14 - - 0.01 28.30 25.68 26.88 - - - 10.35 6.96 8.57 0.00 0.00 - Candidate Division RF3 0.46 - 0.36 12.15 10.45 16.83 0.32 0.09 0.19 Candidate division TM7 - - - 1.35 8.25 3.60 0.00 0.00 0.00 Candidate division WS3 - - - 0.12 0.22 0.52 0.82 0.57 0.87 Chlorobi - - - - - - 0.01 0.00 0.00 Chloroflexi - - - 0.02 - 0.01 0.08 0.10 0.09 Acidobacteria Candidate Division BD1-5 - - - 4.00 5.48 2.92 0.00 0.00 0.00 Cyanobacteria 0.03 - 0.02 0.08 0.61 0.22 0.02 0.06 0.01 Deferribacteres 0.25 0.24 0.17 - - - 0.05 0.04 0.35 - - - 9.26 2.68 8.45 - - - 82.78 86.94 83.64 58.70 47.21 37.96 67.25 70.34 67.47 Fusobacteria - - - 0.45 0.34 0.40 0.13 0.06 0.07 Gemmatimonadetes - - - 0.69 2.73 7.94 0.00 - 0.00 Proteobacteria 0.62 0.79 0.66 2.09 13.83 10.52 0.71 0.60 0.97 Tenericutes 1.14 0.31 0.76 0.02 - 0.01 0.58 0.19 0.25 Verrucomicrobia 1.64 0.24 2.60 0.07 0.24 0.06 0.13 0.15 0.12 Others 0.03 - 0.02 0.63 0.95 1.86 0.94 1.46 1.16 280,131 107,802 430,009 291,993 123,157 704,369 232.0 ± 13.8 230.3 ± 10.0 Chrysiogenetes Deinococcus-Thermus Firmicutes b) 3,234 1,271 5,268 1,208.8 ± 247.3 1,234.8 ± 235.8 1,239.5 ± 236.9 59.7 ± 1.7 59.7 ± 1.4 59.7 ± 1.6 230.8 ± 10.7 Shannon Index, H' 3.91 3.48 3.69 3.47 3.02 3.59 4.01 3.82 4.04 Shannon Max Evenness, Hmax 5.06 4.56 5.26 5.53 4.97 5.44 12.58 6.14 6.72 J' (H' / Hmax) 0.77 0.76 0.70 0.63 0.61 0.66 0.32 0.62 0.60 Simpson Index, D 0.96 0.94 0.94 0.94 0.89 0.95 0.96 0.95 0.96 N. sequences Mean Seq. length ± std. dev. Table 4.4: SSuMMo assignment statistics of Human Microbiome sequence data. SSuMMo assigned phyla and Candidate Divisions for Turnbaugh et al.’s [2009] 16S rRNA data show similar trends between BMI categories, including Obese individuals having fewer Bacteroidetes and more Actinobacteria. Sequencing method most significantly affects the proportions of detected phyla, with V6 sequences resulting in drastically different taxonomic distributions compared with V2 and Sanger-sequenced reads. a) Percentage of sequences assigned to each phylum. Dark cells indicate populous phyla, with darkest cells indicative of most abundant phyla per sample pool. b) Sequence statistics and biodiversity indices for each of the sample pools at species rank. 80 2.5 2.0 1.5 1.0 0.5 0 5 10 15 20 25 30 35 BMI (a) Qin et al. [2010] dataset. 40 Firmicutes / Bacteroidetes ratio 70 Dideoxy sequences 60 50 40 30 20 10 0 0.0 V2 reads Lean Ov. Ob. (b) Turnbaugh et al. [2009] dataset. Figure 4.3: Body Mass Index vs. Firmicutes / Bacteroidetes ratio. Body mass indices were compared against the ratio of Firmicutes / Bacteroidetes (F/B) phyla, annotated from sequence samples according to SSuMMo analyses. (a) Quantitative BMI data was only available with the Qin et al. [2010] dataset. Due to the extremely low proportion of SSU rRNA sequences found within the WGS dataset, very few of the samples were found to contain both Firmicutes and Bacteroidetes and of these, none of the samples had more than 2 sequences assigned to the Firmicutes phylum. A scatter plot is shown of Firmicutes / Bacteroidetes ratio against Body Mass Index. (b) The dataset made available by Turnbaugh et al. [2009] included many more sequences assigned to Firmicutes and Bacteroidetes. A box and whisker plot is shown of 16S rRNA sequence read annotations against the BMI category assigned to those individuals. Boxes show the F/B ratios at the 25th and 75th percentiles, with a line in the middle showing the median F/B ratio value. Whiskers extending from the boxes show the range of the data. Outliers are not shown, which are calculated as lying beyond 1.5 times the interquartile range (i.e. the difference between the 25th and 75th percentiles). White boxes show values calculated from the V2 sequence reads, and grey boxes show values calculated from reads sequenced using Sanger’s dideoxy seqeuncing method. In order to correct for differences between sequence sample sizes, rarefaction analyses were run on each member dataset, selecting random subsets of each. By plotting calculated Shannon and jackknife indices from random subsamples of Turnbaugh et al.’s data [2009], trends in the V2 and full-length sequence datasets are observed that follow the size of each set of sequences. As mentioned above, these trends are likely affected by the number of individuals sampled and pooled into a combined sequence dataset, as with more sampled individuals, more singleton taxa are introduced. The V6 dataset is unique in that there are fewer sequences in total sampled from lean individuals, yet more genera are observed 70 V2-targeted sequences Number Genera a) Lean Overweight Obese Jackknife Shannon H' Shannon Hmax Number randomly sampled sequences b) Lean Overweight Obese Shannon H' Shannon Hmax Jackknife Figure 4.4: Biodiversity analyses of Lean, Overweight and Obese individuals’ gut microflora. Rarefaction analyses were performed on ‘Lean’, ‘Overweight’ and ‘Obese’ sequence datasets, with random subsamples selected from each complete dataset and biodiversity indices calculated for randomly selected subsets. For each sequence type (V2-targeted, V6-targeted and full-length), 5% of the total sequences were selected from the largest dataset per BMI type. From these subsets, the observed number of genera was counted and following statistics calculated: Shannon index (H ′ ), Shannon value at maximum evenness (H max ) and jackknife values. This was repeated with 50 replicates, each with a sample size 5% of the largest sequence dataset in each type. a, c and e) Rarefaction box and whisker plots showing the number of genera observed by each random sequence selection for V2, V6 and full length sequences, respectively. Box lower and upper limits show 25th and 75th percentiles, respectively. Central horizontal lines show median values, and whiskers show the range of the data, with outliers drawn as ‘+’ symbols. b, d and f) Biodiversity indices were calculated for each of the 50 replicates and mean values were plotted along with 95% confidence intervals. V6-targeted sequences Number Genera c) Lean Overweight Obese Number randomly sampled sequences Lean Overweight Obese Shannon H' Shannon Hmax Jackknife Jackknife Shannon H' Shannon Hmax d) Full length sequences Number Genera e) Lean Overweight Obese Number randomly sampled sequences f) 200 100 50 Lean Overweight Obese Shannon H' Shannon Hmax Jackknife 0 Jackknife Shannon H’ Shannon Hmax 150 (Figure 4.4). This corresponds with a slightly higher species evenness, or Shannon H ′ value (Table 4.4, Figure 4.4D), and a noticeably higher H max value, suggesting that those taxa targeted by V6 primers (Figure 4.1) are more evenly distributed in Lean individuals than in their counterparts with higher BMI ratios. 4.3.4 Annotating WGS sequences Data obtained from a WGS sequence experiment shows far less sampling bias for bacteria than those of the primer-targeted sequencing experiments. Although the proportion of sequences found to contain small subunit rRNA were substantially fewer (0.3% cf. > 98.5%; Table 4.1), the WGS sequencing experiment uniquely shows a ubiquitous presence of Archaea among samples (Figure 4.5). Amongst the primer-targeted sequencing experiments however, Archaea are consistent only in their absence. Archaea are known to provide unique metabolic capabilities in a range of extreme environments [Jarrell et al., 2011]. If they are entirely missing them from primer-targeted sequence samples, surely other wide ranges of taxa are not surveyed either. Primers are known to anneal preferentially with certain taxa over others [Chakravorty et al., 2007], leading to a sampling bias dependent on the DNA primers chosen. This effect is apparent in Turnbaugh et al.’s data [2009], where presence and absence information show V2- and V6- specific primers to have more influence on observed population structure than host BMI category (see Table 4.4a and Figure 4.1). Although V6 taxon assignments appear anomalous compared with assignments based on V2 fragments and full-length sequences, V6 results show high resolution in members otherwise missed. This is demonstrated by the fact that the V6 sequence data identified so few Bacteroidetes sequences, even though it is the second most abundant phylum in all other sequence sets (Table 4.4). Similar evidence at the class level is observed, as many members of the class Bacilli are ubiquitously present in all V6 sequence sets in high proportions, yet are not 74 Nanoarchaeota Euryarchaeota Crenarchaeota Deferribacteres Cyanobacteria Tenericutes Actinobacteria Proteobacteria Bacteroidetes Firmicutes 12 Ubiquity 10 8 6 4 2 InM InR InD InE InB F2 -X F2 -Y InA F2 -V F2 -W F1 -S F1 -T F1 -U Korarchaeota Breviata Carpediemonas Cavosteliaceae Giardia Paravahlkampfia environmental samples uncultured eukaryote Candidatus Korarchaeum Marine Hydrothermal Vent Group 1(MHVG-1) uncultured euryarchaeote unidentified archaeon Nanoarchaeum Methanobacteriaceae uncultured archaeon Methanopyrus Thermococcus kodakarensis KOD1 Marine Group III uncultured archaeon pCIRA-13 uncultured archaeon Methanocaldococcus Methanococcus Deep Sea Euryarcheotic Group(DSEG) uncultured archaeon Halobacterium Marine Hydrothermal Vent Group(MHVG) uncultured archaeon Marine Hydrothermal Vent Group(MHVG) uncultured euryarchaeote Miscellaneous Euryarcheotic Group(MEG) uncultured archaeon Miscellaneous Euryarcheotic Group(MEG) uncultured euryarchaeote Candidatus Nitrosocaldus Crenarchaeota uncultured archaeon AK8 uncultured archaeon D-F10 uncultured archaeon Group C3 uncultured archaeon Miscellaneous Crenarchaeotic Group uncultured archaeon pMC2A209 uncultured archaeon Terrestrial Hot Spring Gp(THSCG) uncultured archaeon Terrestrial Hot Spring Gp(THSCG) uncultured crenarchaeote Caldisphaera uncultured Desulfurococcales archaeon Desulfurococcus Thermodiscus Thermofilum Pyrobaculum Thermoproteaceae uncultured archaeon pHGPA13 Caldithrix TA06 uncultured bacterium marine metagenome Pseudanabaena Mollicutes uncultured bacterium Haloplasma Bifidobacterium Gardnerella Parascardovia Collinsella Eggerthella Enterorhabdus uncultured bacterium ARKDMS-49 uncultured bacterium Enterobacter aerogenes Raoultella Oxalobacter uncultured beta proteobacterium UCT N117 uncultured bacterium oca12 uncultured bacterium VC2.1 Bac22 uncultured bacterium Bacteroides uncultured Bacteroidetes bacterium RC9 gut group uncultured rumen bacterium Parabacteroides uncultured bacterium 1 Paraprevotella Prevotella human gut metagenome human gut metagenome 3 uncultured bacterium 4 Enterococcus Lactobacillus Streptococcus Thermoanaerobacterium Gelria Clostridium Megasphaera Mitsuokella Phascolarctobacterium Anaerotruncus Butyricicoccus Faecalibacterium Hydrogenoanaerobacterium Oscillospira Papillibacter Ruminococcus Sporobacter Subdoligranulum Clostridium orbiscindens uncultured bacterium 2 human gut metagenome 2 uncultured bacterium 3 uncultured rumen bacterium Anaerostipes Blautia Coprococcus Dorea Hespellia Lachnospira Moryella Oribacterium Pseudobutyrivibrio Robinsoniella Roseburia Lachnospiraceae uncultured bacterium Clostridium sp. SS2.1 Marvinbryantia uncultured bacterium human gut metagenome 1 Figure 4.5: Genera identified in Japanese guts, from WGS experiment. Sequences sampled from 13 Japanese individuals were annotated with SSuMMo and displayed using IToL [Letunic & Bork, 2006]. Overall genus ubiquity is shown as a heatmap, to the right of the leaves. Relative abundances within each individual sample are displayed also. Reference IDs were allocated to each host individual by the original authors, which are shown above each dataset column. a) 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 b) 7 c) 1.5 d) 1.2 Shannon H' Shannon Hmax 6 5 4 3 2 1 USA 0 EU BF JPN 0.8 Simpson D 1.0 0.6 0.5 0.4 0.2 ie s s ec sp ge nu ily m fa r de or s sp ec ie s nu ge m fa de ily 0.0 r 0.0 or Shannon J' 1.0 Figure 4.6: Biodiversity indices calculated for geographical datasets. Sequence datasets obtained from various studies were annotated using SSuMMO and biological diversity metrics calculated from resulting taxon annotations. Standard deviations are shown for each plotted biological diversity index. Raw data is presented in Table A1.1. present in the other sequence datasets at all. Consequently, many members of the class Clostridia, in the phylum Firmicutes, are observed in high proportions with full-length and V2 sequences, but are not identified at all with V6 reads. 4.3.5 Gut microbiome diversity relating to geographic location The hypothesis that geographic location (and diet) plays a part in shaping the species diversity of an individual’s gut microbiome was tested by analysing four different sequence experiment datasets (Table 4.1). Rarefaction analyses of genus counts were performed 76 800 300 USA 700 200 Number of distinct taxa Japan Bur. Faso 600 100 500 400 0 0 10000 20000 300 Italy 200 100 0 0 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 110000 120000 130000 140000 Number of randomly chosen sequences Figure 4.7: Rarefaction curves of 66 healthy individuals’ gut microbiota. Sequences obtained from the guts of 66 healthy individuals, sampled from four distinct geographic locations, were annotated with SSuMMo. Random sub-samples were selected from the resulting genus annotations and number of genera counted for each. Genus annotations were resampled ten times for each subset size and median values plotted along with boxes showing 25th and 75th percentiles. Whiskers extend to show the range of the data. (section 2.13) on each microbiome sample and results were plotted comparatively (Figure 4.7). Again, it is hard to draw a conclusion from the results, as healthy American individuals appear to have both the highest and lowest levels of biodiversity in their gut microbiomes. The comparison is not a strictly fair one however, as the amount of taxonomically informative sequences provided by Peterson et al.’s [2009] study outnumbers the other sequence datasets by over five times. The result is that Figure 4.7 is completely overwhelmed by this dataset. As stated by Magurran [2009], sampling depth tends to increase the measured species diversity and richness of an environment. As Peterson et al.’s [2009] study generated so much more sequence data than the others, it is no surprise that members 77 of his sampling cohort have the highest calculated species richness (Figure 4.7). Bearing this in mind, what is perhaps more surprising, is that other members of his study had the lowest diversity in terms of number of genera, out of the four geographic locations. Although not disclosed along in the sequence metadata, the lower biodiversity indices might be explainable by the ease of access Westerners have to modern medicines including antibiotics. Antibiotics, as their name implies, are designed to wipe-out bacterial infections, and as a side-effect can completely alter the microbial landscape of the human gut, sometimes with lasting effect [Dethlefsen et al., 2008]. The relative biodiversities for four complete sequence datasets are shown in Figure 4.6. Strikingly, the Japanese sequence dataset shows the highest taxa diversity according to its Simpson index, when compared against the USA, Burkina Faso and Italian datasets. It is not so surprising that it also has the highest taxon evenness, as the Japanese WGS experiment contains the fewest number of taxa and the highest relative number of taxa with just single sequences assigned. The 16S rRNA primer-targeted datasets are more directly comparable due to having more similar numbers of sequence annotations (Table 4.1). Amongst these, the Burkina Faso dataset consistently has the lowest mean taxon diversity and the largest standard deviation of biodiversity values (Figure 4.6 and Table A1.1). The Shannon H max value is exempt from this comparison, as it only amounts to a theoretical maximum value, reached only if all species were present in even numbers, which will never be the case in real biological systems. Both Shannon’s and Simpson’s diversity indices are similar between American and European individuals, demonstrating similar species richness and evenness distributions in the guts of Western individuals. It is too early to conclude whether similarities arise as a result of diet, medicine, another factor, pure coincidence or a combination of several factors. However, as sequencing experiments continue to grow in size and scope, information required to realise causal relationships between the gut 78 microbiome and how it is affected will be and are being brought to light [Cho & Blaser, 2012; Gevers et al., 2012; Marchesi, 2011; Peterson et al., 2009]. 79 5 Discussion Small subunit rRNA has frequently been referred to as the “gold standard” gene for phylogenetic inference [McHardy & Rigoutsos, 2007], but it is fairly controversial to imply that a single gene can provide enough genetic information to infer taxonomic identity up to species specificity, let alone a fraction of a gene up to species specificity or higher. Higher resolution phylogenetic discrimination can be achieved with longer sequence reads and SSuMMo proves to be no exception (subsection 3.2.1). It follows that even better phylogenetic discrimination can be achieved by comparing multiple genes conserved and sequenced amongst all target species [Dunn et al., 2008; Sjölander, 2004; Wu & Eisen, 2008]. This can be used to great effect for inferring phylogenetic differences between fully sequenced organisms, but poses problems if trying to use the same methods on uncultivated organisms from environmental samples and complex communities. First, as the number of genes being targeted increases, the number of species containing those genes will be reduced. Very few genes are ubiquitous amongst living organisms, one of the reasons why SSU rRNA was such a wise choice of phylogenetic marker gene [Pace et al., 2012]. Second, with more genes being targeted, it is impossibly unlikely that for each target gene sequenced, there would be a matching number of sequence reads for other targeted genes from the same organism. This would skew the abundance of each sequenced gene an unknown amount, owing to unknown copy numbers of each gene, genome and cell cycle state. Thirdly, associating each set of genes to the correct 81 species, dissecting a set of genes for each host organism from hundreds of thousands of non-overlapping sequence reads poses a tremendous theoretical and computational challenge [Krause et al., 2008; Mande et al., 2012]. Together, these problems make any inference on a microbial community an estimate at best, especially while the majority of environmental sequences are assigned to uncultured organisms [Sharma et al., 2012]. 5.1 Sequence clustering methods There are a number of ways in which microbial environments can be analysed in order to better understand them. Conceptually, there are two: the so-called “top-down” and “bottom-up” approaches [Nisbet & Weiss, 2010]. The former treats the system as a black-box, measuring overall output while controlling the input, while the latter involves analysing the most fundamental components of the system, piecing each individual component together, like pieces of a puzzle. Historically, the top-down approach was the only method available for enquiring about microbiological systems; it was practically impossible to know what was happening inside the cell, let alone the nucleus. But in the wake of the sequencing revolution, it has become possible to obtain data on many of the most elusive components of microbial systems and populations, to the point of redundancy. However, a difficulty that remains is getting meaningful information out of comparative sequence analyses. Nearest-neighbour and alignment-based search algorithms often give meaningless results, with functional annotations of a majority of metagenomic genes in recent studies annotated only as having “putative” or “unknown” function [Gosalbes et al., 2011; Harrington et al., 2007; Qin et al., 2010]. Pairwise alignments are also notoriously slow for metagenomic sequence datasets, with search times increasing exponentially with the number of query sequences, target 82 sequences and sequence lengths [Wang & Jiang, 1994]. Unsupervised clustering methods, often based on Markov chains are a favourite amongst high-throughput annotation projects, with algorithms such as Dotur [Schloss & Handelsman, 2005], Esprit [Sun et al., 2009] OrthoMCL [Li et al., 2003] and many others proving popular and increasingly quick at sifting through redundant, overlapping and repetitive sequences. Alternatively, there are the supervised clustering techniques, which can also be based on Markov chain algorithms, but require pre-annotated data to train a database with requisite information. The more “good” training data the better, akin to education. Incorrect training data can lead to so-called “false-positives”, whilst increasing the amount of correct training data usually leads to an increase in accuracy and decrease in false-negatives. A useful metric of accuracy, is the ratio of True Positives against False Positives, which can be used to compare the accuracy of different software implementations in similar conditions [Söding, 2005]. Some popular supervised training software packages, based on similar Markov-chain based algorithms, include: HMMER [Eddy, 1998], Glimmer [Delcher et al., 1999] and HHblits [Remmert et al., 2011]. Although all are based on Markov models, the way in which comparisons are made and databases trained differ markedly. HMMER has evolved since its first release [Eddy, 2011], but still uses sequences to train profile hidden Markov models against which new sequences are compared (see section 2.1 for a further introduction). Glimmer trains “Interpolated Markov models”, which are Markov chains trained with variable length words, changing depending on the local composition of the sequence [Salzberg et al., 1998]. HHblits is more similar to HMMER, but instead of comparing raw sequences to HMMs, query sequences are grouped iteratively before being used to build more hidden Markov models. These new HMMs are then aligned to a pre-built database of HMMs [Remmert et al., 2011]. All authors claim to have made their 83 software better than other competing tools, naturally. 5.2 Comparing microbiological community diversities The concept of quantifying distance between built hidden Markov models is of particular interest, especially in terms of SSuMMo’s future development. One shortfall in SSuMMo’s generated cladograms is the lack of quantitative distance information between different taxonomic ranks. With such quantitative information, so-called “UniFrac” scores can be calculated, to compare the taxonomic diversities of two or more microbial communities [Lozupone & Knight, 2005]. This allows discrimination between communities where species richness and evenness are identical. However, where quantitative distance information is not known between clades, taxonomic diversities can still be compared, simply by considering each path length ν as equal to a constant integer value [Pienkowski et al., 1998a,b]. In the simplest of cases, ν can be set to 1. Since the taxonomic distinctness measure was first introduced, it has been used and developed extensively to assess community differences under various differing environmental situations. It was recognised early on that it is not always appropriate to treat ν as constant [Clarke & Warwick, 1999; Magurran, 2009]. For instance, some taxonomic groups will contribute little or no additional information to the diversity of the sample. In such a case, was suggested to weight each step with the proportion of taxon richness attributed to each grouping. 5.3 Summary Our novel software solution, SSuMMo, provided a novel approach to annotating taxonomic information to sequence reads from primer-targeted high-throughput sequencing 84 experiments. While its efficacy was limited to 16S rRNA gene sequences, these were and still are one of the most popular methods for describing the community and structure of microbial assemblages. However, a bias towards taxa that have already been thoroughly sequenced is an inherent problem with primer-targeted studies. Whole genome shotgun sequencing experiments were shown to be less biased in sampling entire microbial communities as a whole. The number of taxa that can be identified using SSU rRNA targeted sequencing has provided unprecedented species-level coverage of communities in recent years and may still provide the best value for time and money for identifying the majority of microbial species within a community. Highly conserved regions of 16S rRNA were identified as part of this work that are adjacent to highly divergent and taxonomically informative sequence regions. As sequencing technologies continue to provide better value and bioinformatics solutions improve, it is expected that WGS sequencing experiments will from now be the method of choice for interrogating microbial communities in previously uncharacterised habitats. 85 Appendix I A1.1 Code Listings calc_entropy.py #!/usr/bin/env python """ Plot informational entropy for values between 0 and 1 """ import numpy as np from math import log import plot_H # Compute −K(p i log4 p i + q i log4 q i ) for case where q i = (1 − p i ) def defined_sample(p_i): not_p = 1. - p_i return - 2. * ( p_i * log(p_i, 4) + not_p * log(not_p, 4) ) # Calculate Shannon’s entropy for an array of probabilities p def informational_entropy(p): return [defined_sample(p_i) for p_i in p] if __name__ == ’__main__’: p = np.arange(0.01, 1.0, 0.01) H = shannon_entropy(p) plot_H.setup_axes(max(H)) plot_H.plot_entropy(p, H) Listing A1.1: A Python script to plot informational entropy of a DNA sequence 87 scatter_plot.py """ A little module to help plot scatter graphs""" import matplotlib.pyplot as plt fig = plt.figure() # Set up graph axes and labels def setup_axes(y_max): plt.axis([0, 1, 0, y_max]) ax = fig.axes[0] ax.set_xlabel("GC ratio") ax.set_ylabel("H") # Plot informational entropy H against probabilities def plot_entropy(p, H): plt.scatter(p, H, color="k", marker=".", s=1) plt.grid(True) plt.show() fig.savefig("new_simulated.png") Listing A1.2: A shared module containing common plotting functions 88 plot_dna_probs.py #!/usr/bin/env python # Find and parse sequence files for the probability of a specific nucleotide’s # occurrence import argparse, re, os from Bio import SeqIO import matplotlib.pyplot as plt import plot_H # Opens a fasta sequence file, and calculate the probability of a specific # nucleotide within all sequences in that file. Search is case-insensitive # and only parses fasta-formatted sequences files containing "complete genome" # sequences. # # file_name - File to open # nucl - The nucleotide whose probability should be returned. def calc_nuc_probs(file_name, nucl=’g’): nucl = nucl.lower() count = 0 cum_len = 0 with file(file_name, ’r’) as seq_stream: for record in SeqIO.parse(seq_stream, ’fasta’): if ’plasmid’ in record.description \ or ’complete genome’ not in record.description: continue seq = record.seq.tostring().lower() count += seq.count(nucl) cum_len += len(seq) if cum_len == 0: return return float(count) / cum_len # Yield each file from the directory path, whose name ends in ext def find_seq_files(path, ext=’.fna’): join = os.path.join for path, dirs, files in os.walk(path): for f in files: if f.endswith(ext): yield join(path, f) 89 def parse_cmdline(): parser = argparse.ArgumentParser(description=__doc__) parser.add_argument(’path’, help="paths to search", nargs=’+’, type=str) return parser.parse_args().path def gen_data(): seq_file_folders = parse_cmdline() probs = [] dirname = os.path.dirname for path in seq_file_folders: for seq_file in find_seq_files(path): # Double probability, as G~T and A~C. p = 2. * calc_nuc_probs(seq_file) if p is None: continue probs.append(p) print("Parsed {0} genome sequences".format(len(p))) H = informational_entropy(probs) return (probs, H) def plot_data(p, H): plot_H.setup_axes(max(H)) plot_H.plot_entropy(p, H) plot_H.fig.savefig(’{0}_genomes.pgf’.format(len(p))) if __name__ == ’__main__’: import cPickle as pickle if os.path.exists(’data.pkl’): # Load pre-processed data with file(’data.pkl’, ’rb’) as data_file: (p, H) = pickle.load(data_file) else: (p, H) = gen_data() # Save processed data with file(’data.pkl’, ’wb’) as data_file: pickle.dump((p, H), data_file, -1) plot_data(p, H) Listing A1.3: A Python script that calculates informational entropy from genomic DNA sequence files 90 extract_HMP_sequences.py #!/usr/bin/env python """ Extract specific sequence files from HMP Pilot study data. """ import tarfile import os import re class OverviewInfo(): def __init__(self, directory, filters, header=None, name_file_by=None): # directory - directory to check for sequence and overview files # filters - Only output sequences where the data in column with title # header is equal to filters. Sequences will be saved in a # directory with the same name as the filter being applied. # header - Choose column filter to filter data by. Default is ‘environment’ # name_file_by - Each sequence file will be named by data in the column # with header name_file_by. i.e. Chooses how to split # the sequence data. self._dir = directory self.filters = filters self.header = header self.name_file_by = name_file_by if self.header is None: self.header = ’environment’ if self.name_file_by is None: self.name_file_by = ’subject_id’ # First group is any set of characters, excluding tabs. self.splitter = re.compile(r’([\w\d,\.\+\- ]+)’) self.line = re.compile(r’[\r\n]+’) def get_overviews(self): # Checks self._dir for any file with the word ‘overview’ in it. # compressed_files are files with the suffix ‘.tgz’, and any other # files are assumed to be uncompressed. # Returns the tuple: (compressed_files, uncompressed_files). compressed_files = [] uncompressed_files = [] 91 for file_name in os.listdir(self._dir): if ’overview’ in file_name: if file_name.endswith(’.tgz’): compressed_files.append(file_name) else: uncompressed_files.append(file_name) else: continue return (compressed_files, uncompressed_files, ) def get_headers(self, file_name): handle = tarfile.open(file_name, ’r’) headers = [] print(’Iterating through contents of {0}’.format(file_name)) for tarinfo in handle: if not tarinfo.isreg(): # If the tar’d item is not a file (e.g. a directory), skip it. continue buf = handle.extractfile(tarinfo) this_header = self.splitter.findall(buf.readline()) buf.close() if len(headers) == 0: headers = this_header handle.close() return headers def iter_contents(self, file_name): # Given a tar archive name, iterate through the archive’s contents # and yield buffer objects to each contained, compressed file. # This will close each yielded file handle, so must be used as a # generator function. handle = tarfile.open(file_name, ’r’) for tarinfo in handle: if tarinfo.isreg(): sub_handle = handle.extractfile(tarinfo) yield sub_handle sub_handle.close() elif tarinfo.isdir(): continue else: 92 print("What is {0}??".format(tarinfo.name)) handle.close() return def check_files(self): # Checks for the existance of each file in self._dir. # This looks for overview files as well as sequence files. overviews = set() data = set() names = set([’F6RMMXF’, ’F6JVTJB’, ’F6J9Z3U’, ’F6J46LU’, ’F6AVWTA’, ’F6AVU3G’, ’F6ASE4X’, ’F5MMO90’, ’F5K51YR’, ’F57CATM’, ’F5672XE’, ’F51YIRY’, ’F48MJBB’, ’F47USSH’, ’F47LS8B’, ’F475432’, ’F5GZGTO’, ’F5MNGLX’, ’F5MPOZS’, ’F5BSE3M’]) # data_id # - First group is the pilot experiment number. # - Second group is the long extension e.g. overview.tgz # or fasta_and_qual.tgz. data_id = re.compile(’hmp_pilot_([\w\d]+)\.{1}(.+)’) for thing in os.listdir(self._dir): reg = data_id.search(thing) if reg: _id, ext = reg.groups() if ext.startswith(’overview’): overviews.add(_id) else: data.add(_id) missing_data = names.difference(data) missing_overview = names.difference(overviews) if 0 == (len(missing_data) + len(missing_overview)): print("Found all files in {0}".format(self._dir)) else: self._print_missing_files(missing_data, ’fasta_and_qual’) self._print_missing_files(missing_overview, ’overview’) def _print_missing_files(self, files, sub_ext): if len(files > 0): print(’Missing {0} {1} files:-’.format(len(files), sub_ext)) 93 for file_id in files: print(’hmp_pilot_{0}.{1}.tgz’.format(file_id, sub_ext)) def get_set(self, header_name, headers=None): # Returns a set of all the column entries for a particular header. # header_name must be found in the header entries. This will # iterate through each compressed files contents and check for all # unique entries to the column of interest. # # If headers is None, then this will check the headers of only # the first file, and use them as an index for each column entry. unique_set = set() gzs, nongzs = self.get_overviews() for gzipped in gzs: iterator = self.iter_contents(gzipped) for handle in iterator: if headers == None: headers = self.splitter.findall(handle.readline()) else: handle.readline() index = headers.index(header_name) for line in handle: line_list = self.splitter.findall(line) unique_set.add(line_list[index]) # Break and do again so we don’t need to check for headers # every iteration. break for handle in iterator: handle.readline() # Skip the header line. for line in handle: line_list = self.splitter.findall() unique_set.add(line_list[index]) return unique_set def get_data(self, handle, header_line=True): if header_line: line = handle.readline() n_cols = len(self.splitter.findall(line)) all_data = (set() for dummy in xrange(n_cols)) for line in handle: 94 line_data = self.splitter.findall(line) for i in xrange(n_cols): all_data[i].add(line_data[i]) return all_data def sep_data(self, handle, split_index=None): # Give a handle to a tab-delimited data file, and this will read the # data into a list of lists (end_data), and a list of sets # (data_sets), which contain all the unique values per # column. # Optional split_index will separate the data sets into multiple lists # for each different entry in the column indexed by split_index. handle.seek(0) handle.readline() # Header line. first_line = handle.readline().rstrip().split(’\t’) if split_index == None: end_data = [first_line] n_cols = len(first_line) for line in handle: vals = line.rstrip().split(’\t’) split_value = first_line[int(split_index)] end_data = {split_value : [first_line]} # Initiate the data dictionary, with split_value as key; the # value is an array containing the data. n_cols = len(first_line) data_sets = [set() for i in xrange(n_cols)] for line in handle: vals = line.rstrip().split(’\t’) # Turn tab-delimited line to list. if vals[split_index] == split_value: # If same as previous line, append to same key’s value. end_data[split_value].append(vals) else: # Or create a new key / value pair. split_value = vals[split_index] end_data.update({split_value : [vals]}) for i in xrange(n_cols): data_sets[i].add(vals[i]) return end_data, data_sets def extract_sequences(self): 95 from multiprocessing import Process, Queue out_q = Queue() _extractor = Process(target=extractor, args=(out_q, self._dir)) _extractor.start() try: for overview_info in self.yield_info(): if len(overview_info[’sequence_library_IDs’]) > 0: # Get other process to extract the seqeunces. out_q.put(overview_info) finally: out_q.put(’END’) _extractor.join() def get_sizes(self): total = 0. for overview_info in self.yield_info(): exp_id = overview_info[’experiment_ID’] seq_lib_ids = overview_info[’sequence_library_IDs’] if len(seq_lib_ids) > 0: file_name = ’hmp_pilot_{0}.fasta_and_qual.tgz’.format(exp_id) lib_re = re.compile(’|’.join(’({0})’.format(_id) for _id in seq_lib_ids)) # lib_re - group is the library number. archive_handle = tarfile.open(file_name, ’r’) for tarinfo in archive_handle: lib = lib_re.search(tarinfo.name) if not (tarinfo.name.endswith(’.fsa’) and lib): continue size = tarinfo.size total += size kbs = size / (1024.) if kbs < 1000.: size = ’{0} kB’.format(kbs) else: size = ’{0} MB’.format(kbs / 1024.) assert len(overview_info[’subject_ID’]) == 1 print(os.path.join(file_name, tarinfo.name).ljust(60) + \ overview_info[’subject_ID’][0].ljust(20) + size) print(’\n Total: {0} MB’.format(total / (1024.**2))) return total def yield_info(self): 96 # Looks through all overview and sequence archives in the present # directory, and extracts all sequences according to the allowed # filters. gz_overviews, nongz_overviews = self.get_overviews() id_finder = re.compile(’hmp_pilot_([\w\d]+)\.{1}(.+)’) # id_finder:# - First group is the pilot experiment number. # - Second group is the long extension e.g. overview.tgz or # fasta_and_qual.tgz for filter in self.filters: if not os.path.exists(os.path.join(self._dir, filter)): os.makedirs(os.path.join(self._dir, filter)) for file_name in gz_overviews: handles = self.iter_contents(file_name) exp = id_finder.search(file_name).groups()[0] # Experiment ID taken from archive name. print(’Checking {0}’.format(file_name)) for handle in handles: # Yielding handles to compressed tabular data. info = self.extract_info(handle, exp, file_name) if info is None: continue yield info return def extract_info(self, handle, experiment, archive_name): filters = self.filters headers = self.splitter.findall(handle.readline()) # interesting_col: Index of header column ‘environment’, usually. interesting_col = headers.index(self.header) info = {’experiment_ID’ : experiment, ’sequence_library_IDs’ : [], ’subject_ID’ : [], ’save_dir’ : filters[0] # Changed if len(filters) > 1 } file_name_col = headers.index(self.name_file_by) file_data, set_data = self.sep_data(handle, file_name_col) col_data = set_data[interesting_col] lib_finder = re.compile(r’lib(\d+)’) 97 if len(filters) > 1: for filter in filters: if filter in col_data: print(’\tFilter {0} matches in {1}’.format(filter, handle.name)) lib = lib_finder.search(handle.name).group() info[’sequence_library_IDs’].append(lib) info[’subject_ID’].append(set_data[file_name_col].pop()) info[’save_dir’] = filter elif len(col_data) == 1 and set(filters) == col_data: lib = lib_finder.search(handle.name).group() info[’sequence_library_IDs’].append(lib) if len(set_data[file_name_col]) == 1: info[’subject_ID’].append(set_data[file_name_col].pop()) else: print(’More than one subject_ID for this sample: ’ \ ’{0}//{1}!!’.format(archive_name, handle.name)) else: col_names = ’, ’.join(list(col_data)) if filters[0] in col_data: print("Archive {0}, file {1} has mixed data in the column " "{2}, including: {3}"\ .format(archive_name, handle.name, self.header, col_names)) else: print("Skipping archive {0}, file {1}. " "Data in column {2}, is: {3}"\ .format(archive_name, handle.name, self.header, col_names)) return return info def extractor(in_queue, file_dir): inval = in_queue.get() contents = os.listdir(file_dir) while inval != ’END’: seq_lib_ids = inval[’sequence_library_IDs’] exp_id = inval[’experiment_ID’] lib_re = re.compile(’|’.join(’({0})’.format(id) for id in seq_lib_ids)) file_name = ’hmp_pilot_{0}.fasta_and_qual.tgz’.format(exp_id) if file_name in contents: pass else: # Just in case we made file_name wrong. for file_name in contents: 98 if inval[’experiment_ID’] in file_name and \ ’fasta_and_qual’ in file_name: # If the file_name is a fastq archive. break else: continue archive_handle = tarfile.open(file_name, ’r’) for tarinfo in archive_handle: lib = lib_re.search(tarinfo.name) if not (lib and tarinfo.name.endswith(’.fsa’)): ## Skip if not a qual file, or a library of interest continue write_seq_file(archive_handle, tarinfo, lib, inval) inval = in_queue.get() in_queue.close() return def write_seq_file(archive, tarinfo, lib, data): ## Figure out which subject ID matches that library file == subject_ind = 0 try: for group in lib.groups(): if group: break subject_ind += 1 except AttributeError: raise("Error with {0} in {1}".format(lib, tarinfo.name)) save_dir = data[’save_dir’] subject_id = data[’subject_ID’][subject_ind] file_name = os.path.join(save_dir, subject_id + ’.fas’) file_handle = archive.extractfile(tarinfo) with file(file_name, ’a’) as out_handle: out_handle.write(file_handle.read()) def main(): import argparse arg_parser = argparse.ArgumentParser(description=__doc__) arg_parser.add_argument(’-d’, ’--dir’, dest=’dir’, 99 index of lib. default=os.path.dirname(os.path.realpath(__file__))) arg_parser.add_argument(’-env’, dest=’env’, nargs=’+’, default=[’Stool’], help="Body environments for which to extract sequences. " "Default: Stool") arg_parser.add_argument(’-e’, ’--extract’, dest=’extract’, action=’store_true’, help=’Extract sequences from sequence files’) arg_parser.add_argument(’-l’, ’--list’, dest=’list’, action=’store_true’, help=’List headers in overview files’) arg_parser.add_argument(’-ls’, ’--sizes’, dest=’sizes’, action=’store_true’, help=’List file sizes’) arg_parser.add_argument(’-set’, dest=’set’, nargs=’1’, default=None, help=’Print all possible options from an overview column’) args = arg_parser.parse_args() processor = OverviewInfo(args.dir, args.env) processor.check_files() if args.list: gzs, nongzs = processor.get_overviews() headers = processor.get_headers(gzs[0]) print(’\n’.join(headers)) if args.sizes: processor.get_sizes() if args.extract: processor.extract_sequences() if args.set is not None: processor.get_set(args.set) if __name__ == ’__main__’: main() Listing A1.4: A Python script to extract sequences of interest from the Clinical Production Pilot Study (PPS) of the NIH Human Microbiome Project 100 species genus family order A1.2 Tables USA EU BF JPN USA EU BF JPN USA EU BF JPN USA EU BF JPN No. taxa Shannon H ′ 113.27 ± 38.67 1.1 ± 0.37 42.86 ± 8.82 1.09 ± 0.32 57.43 ± 16.45 0.93 ± 0.38 18.48 ± 4.51 1.87 ± 0.35 186.59 ± 62.68 1.72 ± 0.51 80.67 ± 18.53 1.78 ± 0.36 96.14 ± 25.33 1.29 ± 0.48 24.07 ± 6.36 2.35 ± 0.53 317.68 ± 99.89 2.26 ± 0.72 154.63 ± 30.76 2.69 ± 0.54 179.03 ± 43.74 1.93 ± 0.78 39.61 ± 13.43 3.18 ± 0.62 468.58 ± 143.42 3.23 ± 0.47 217.63 ± 47.44 3.1 ± 0.61 255.02 ± 64.1 2.46 ± 0.72 49.95 ± 17.13 3.56 ± 0.56 Shannon H max 4.66 ± 0.36 3.72 ± 0.21 4 ± 0.31 2.85 ± 0.28 5.17 ± 0.35 4.36 ± 0.24 4.53 ± 0.27 3.12 ± 0.3 5.71 ± 0.32 5.02 ± 0.2 5.16 ± 0.25 3.6 ± 0.39 6.1 ± 0.32 5.36 ± 0.22 5.51 ± 0.25 3.84 ± 0.39 Simpson D 0.5 ± 0.15 0.52 ± 0.15 0.46 ± 0.19 0.72 ± 0.11 0.67 ± 0.16 0.73 ± 0.11 0.53 ± 0.22 0.82 ± 0.14 0.71 ± 0.18 0.85 ± 0.1 0.61 ± 0.23 0.93 ± 0.08 0.9 ± 0.06 0.88 ± 0.09 0.76 ± 0.14 0.96 ± 0.04 Table A1.1: Biodiversity indices for geological datasets at different ranks. The number of taxa shown at each rank is estimated using the jackknife estimate. Each table value was resampled 50 times and the means are shown with standard deviations. 101 Nomenclature Roman Symbols k Boltzmann constant: 1.3806 ⋅ 10−23 J ⋅ K −1 Acronyms API Application Programming Interface BMI Body Mass Index HMM Hidden Markov Model HTS High Throughput Sequencing IBD Inflammatory Bowel Disease ITS Intergenic Transcribed Spacer MSA Multiple Sequence Alignment OED Oxford English Dictionary OTU Operational Taxonomic Unit PCR Polymerase Chain Reaction PSSM Position-Specific Scoring Matrix 103 rRNA ribosomal RiboNucleic Acid SSU Small SubUnit WGS Whole Genome Shotgun 104 Bibliography Ahmadian, Z., Mohajeri, J., Salmasizadeh, M., Hakala, R.M. & Nyberg, K. (2010). A practical distinguisher for the Shannon cipher. Journal of Systems and Software, 83, 543–547. 11 Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K.L., Tyson, G.W. & Nielsen, P.H. (2013). Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology, 31, 533– 538. 12 Almeida, J.S. & Vinga, S. (2006). Computing distribution of scale independent motifs in biological sequences. Algorithms for Molecular Biology, 1, 18. 19 Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A. & Fletcher, M. (2001). Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17, 429–437. 19 Amann, R.I., Ludwig, W. & Schleifer, K.H. (1995). Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiological Reviews, 59, 143–169. 1 Asai, K., Hayamizu, S. & Handa, K. (1993). Prediction of protein secondary structure by the hidden Markov model. Computer applications in the biosciences: CABIOS, 9, 105 141–146. 20 Barnett, J.A. (2003). Beginnings of microbiology and biochemistry: the contribution of yeast research. Microbiology, 149. 5 Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn, R.D. & Sonnhammer, E.L. (1999). Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Research, 27, 260–262. 20 Blair, D.F. (1995). How bacteria sense and swim. Annual Reviews in Microbiology, 49, 489–520. 1 Blaisdell, B.E. (1989). Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences. Journal of Molecular Evolution, 29, 526–537. 19 Bradshaw, D. (2001). A new look at the prime mover. Journal of the History of Philosophy, 39, 1–22. 7, 8 Brady, A. & Salzberg, S.L. (2009). Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6, 673–676. 47 Brandl, Georg (2009), accessed 2013, Sphinx: Python Documentation Generator, http: //sphinx-doc.org. xiii Brock, T. (1999). Milestones in Microbiology: 1546 To 1940. American Society Mic Series, ASM Press. 9 Bystroff, C., Thorsson, V. & Baker, D. (2000). HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology, 301, 173–190. 20 106 Castresana, J. (2000). Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Molecular Biology and Evolution, 17, 540–552. 19 Chakravorty, S., Helb, D., Burday, M., Connell, N. & Alland, D. (2007). A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. Journal of Microbiological Methods, 69, 330–339. 57, 74 Chargaff, E. (1950). Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Cellular and Molecular Life Sciences, 6, 201–209. 11 Cho, I. & Blaser, M.J. (2012). The human microbiome: at the interface of health and disease. Nature Reviews Genetics, 13, 260–270. 79 Claesson, M.J., Cusack, S., O’Sullivan, O., Greene-Diniz, R., de Weerd, H., Flannery, E., Marchesi, J.R., Falush, D., Dinan, T., Fitzgerald, G. et al. (2011). Composition, variability, and temporal stability of the intestinal microbiota of the elderly. Proceedings of the National Academy of Sciences, 108, 4586–4591. 56 Clarke, K. & Warwick, R. (1999). The taxonomic distinctness measure of biodiversity: weighting of step lengths between hierarchical levels. Marine Ecology Progress Series, 184, 21–29. 84 Cobb, M. (2013). 1953: When genes became “information”. Cell, 153, 503–506. 12 Dancoff, S.M. & Quastler, H. (1953). Information content and error rate of living things. Essays on the Use of Information Theory in Biology, 263–274. 10 Darwin, C. (1859). On the Origin of Species. John Murray. 6 Dawkins, R. (2008). The Oxford book of modern science writing. Oxford University Press. 6, 7 107 Dawkins, R. (2009). The Greatest Show on Earth: The Evidence for Evolution. Simon and Schuster. 6 De Filippo, C., Cavalieri, D., Di Paola, M., Ramazzotti, M., Poullet, J.B., Massart, S., Collini, S., Pieraccini, G. & Lionetti, P. (2010). Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proceedings of the National Academy of Sciences. 56, 57, 58, 62 DeAngelis, K.M., Allgaier, M., Chavarria, Y., Fortney, J.L., Hugenholtz, P., Simmons, B., Sublette, K., Silver, W.L. & Hazen, T.C. (2011). Characterization of Trapped Lignin-Degrading Microbes in Tropical Forest Soil. PLoS One, 6, e19306. 37 Delbrück, M. (1935). Über die natur der genmutetion und der genetruktur. der Mutationsforsehung, Einige Tatsachen. 8 Delbrück, M. (1971). Aristotle-totle-totle. Of microbes and life, 50–5. 8 Delcher, A.L., Harmon, D., Kasif, S., White, O. & Salzberg, S.L. (1999). Improved microbial gene identification with GLIMMER. Nucleic Acids Research, 27, 4636–4641. 83 Derrien, M., Collado, M.C., Ben-Amor, K., Salminen, S. & de Vos, W.M. (2007). The Mucin-Degrader Akkermansia muciniphila Is an Abundant Member of the Human Intestinal Tract. Applied and Environmental Microbiology. 64 Dethlefsen, L., Huse, S., Sogin, M.L. & Relman, D.A. (2008). The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biology, 6, e280. 78 Dewhirst, F.E., Chen, T., Izard, J., Paster, B.J., Tanner, A.C.R., Yu, W.H., Laksh108 manan, A. & Wade, W.G. (2010). The Human Oral Microbiome. Journal of Bacteriology, 192, 5002–5017. 38, 40, 45, 51 Diamandis, P. (2013). Outpaced by Innovation: Canceling an XPRIZE. Huffington Post. 14 Dini-Andreote, F., Andreote, F.D., Araújo, W.L., Trevors, J.T. & van Elsas, J.D. (2012). Bacterial genomes: Habitat specificity and uncharted organisms. Microbial Ecology, 64, 1–7. 15 Drews, G. (2000). The roots of microbiology and the influence of Ferdinand Cohn on microbiology of the 19th century. FEMS Microbiology Reviews, 24, 225–49. 6 Droege, M. & Hill, B. (2008). The Genome Sequencer FLX™ System–Longer reads, more applications, straight forward bioinformatics and more complete data sets. Journal of Biotechnology, 136, 3–10. 45 Dunn, C.W., Hejnol, A., Matus, D.Q., Pang, K., Browne, W.E., Smith, S.A., Seaver, E., Rouse, G.W., Obst, M., Edgecombe, G.D., Sørensen, M.V., Haddock, S.H.D., Schmidt-Rhaesa, A., Okusu, A., Kristensen, R.M., Wheeler, W.C., Martindale, M.Q. & Giribet, G. (2008). Broad phylogenomic sampling improves resolution of the animal tree of life. Nature, 452, 745–749. 81 Eddy, S.R. (1996). Hidden Markov models. Current Opinion in Structural Biology, 6, 361–365. 20, 22 Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755–763. 22, 38, 44, 83 Eddy, S.R. (2004). What is a hidden Markov model? Nature Biotechnology, 22, 1315–1316. 22 109 Eddy, S.R. (2011). Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195. 83 Edgar, R.C. & Sjölander, K. (2004). A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20, 1301–1308. 20 Ehrlich, S.D. (2011). MetaHIT: The European Union Project on metagenomics of the human intestinal tract. In Metagenomics of the Human Body, 307–316, Springer. 55, 56 Elias, J.E., Gibbons, F.D., King, O.D., Roth, F.P. & Gygi, S.P. (2004). Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nature Biotechnology, 22, 214–219. 11 Falkow, S. (2004). Molecular Koch’s postulates applied to bacterial pathogenicity – a personal recollection 15 years later. Nature Reviews Microbiology, 2, 67–72. 1 Falkowski, P.G., Fenchel, T. & Delong, E.F. (2008). The microbial engines that drive Earth’s biogeochemical cycles. Science, 320, 1034–1039. 4 Feero, W.G., Guttmacher, A.E., Feero, W.G., Guttmacher, A.E. & Collins, F.S. (2010). Genomic medicine - an updated primer. New England Journal of Medicine, 362, 2001–2011. 55 Feng, D.F. & Doolittle, R. (1987). Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees. Journal of Molecular Evolution, 25, 351–360. 19 Fernald, G.H., Capriotti, E., Daneshjou, R., Karczewski, K.J. & Altman, R.B. (2011). Bioinformatics challenges for personalized medicine. Bioinformatics, 27, 1741–1748. 55 Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E.L.L., Eddy, 110 S.R. & Bateman, A. (2010). The Pfam protein families database. Nucleic Acids Research, 38, D211–D222. 20 Flicek, P. & Birney, E. (2009). Sense from sequence reads: methods for alignment and assembly. Nature Methods, 6, S6–S12. 45 Fritz, M.H.Y., Leinonen, R., Cochrane, G. & Birney, E. (2011). Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research, 21, 734–740. 4 Fuller, C.W., Middendorf, L.R., Benner, S.A., Church, G.M., Harris, T., Huang, X., Jovanovich, S.B., Nelson, J.R., Schloss, J.A., Schwartz, D.C. et al. (2009). The challenges of sequencing by synthesis. Nature Biotechnology, 27, 1013–1023. 14 Gaarder, J. (1991). Sophie’s World: A Novel About the History of Philosophy. Farrar, Straus and Giroux. 6, 7 Galperin, M.Y. & Fernández-Suárez, X.M. (2012). The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 40, D1–D8. 3 Gest, H. (2004). The discovery of microorganisms by Robert Hooke and Antonie van Leeuwenhoek, Fellows of The Royal Society. Notes and Records of the Royal Society of London, 58, 187–201. 5 Gevers, D., Knight, R., Petrosino, J.F., Huang, K., McGuire, A.L., Birren, B.W., Nelson, K.E., White, O., Methé, B.A. & Huttenhower, C. (2012). The Human Microbiome Project: A Community Resource for the Healthy Human Microbiome. PLoS Biology, 10, e1001377. 79 111 Glansdorff, N., Xu, Y. & Labedan, B. (2008). The last universal common ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol Direct, 3, 56–125. 7 Gosalbes, M.J., Durbán, A., Pignatelli, M., Abellan, J.J., Jiménez-Hernández, N., Pérez-Cobas, A.E., Latorre, A. & Moya, A. (2011). Metatranscriptomic approach to analyze the functional human gut microbiota. PloS One, 6, e17447. 82 Gould, S.J. (1983). Hen’s teeth and horse’s toes. WW Norton & Company. 8 Gould, S.J. (2002). The Structure of Evolutionary Theory. Harvard University Press. 9 Gram, C. (1884). The differential staining of Schizomycetes in tissue sections and in dried preparations. Fortschritte der Medizin, 2, 185–9. 9 Hamburg, M.A. & Collins, F.S. (2010). The path to personalized medicine. New England Journal of Medicine, 363, 301–304. 55 Hao, Y., Huang, D., Guo, H., Xiao, M., An, H., Zhao, L., Zuo, F., Zhang, B., Hu, S., Song, S., Chen, S. & Ren, F. (2011). Complete Genome Sequence of Bifidobacterium longum subsp. longum BBMN68, a New Strain from a Healthy Chinese Centenarian. J. Bacteriol., 193, 787–788. 64 Harrington, E., Singh, A., Doerks, T., Letunic, I., Von Mering, C., Jensen, L., Raes, J. & Bork, P. (2007). Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proceedings of the National Academy of Sciences, 104, 13913–13918. 82 Hartmann, M., Howes, C.G., Abarenkov, K., Mohn, W.W. & Nilsson, R.H. (2010). V-Xtractor: An open-source, high-throughput software tool to identify and extract hy112 pervariable regions of small subunit (16S/18S) ribosomal RNA gene sequences. Journal of Microbiological Methods, 83, 250–253. 25, 44 Hehemann, J.H., Correc, G., Barbeyron, T., Helbert, W., Czjzek, M. & Michel, G. (2010). Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature, 464, 908–912. 55, 56 Hess, M., Sczyrba, A., Egan, R., Kim, T.W., Chokhawala, H., Schroth, G., Luo, S., Clark, D.S., Chen, F., Zhang, T. et al. (2011). Metagenomic discovery of biomassdegrading genes and genomes from cow rumen. Science, 331, 463–467. 3 Hirschhorn, J.N. & Daly, M.J. (2005). Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6, 95–108. 14 Höhl, M. & Ragan, M.A. (2007). Is multiple-sequence alignment required for accurate inference of phylogeny? Systematic biology, 56, 206–221. 19 Holden, R., ed. (2013). The Oxford English Dictionary Online. Oxford University Press. 4, 5 Holder, J.W., Ulrich, J.C., DeBono, A.C., Godfrey, P.A., Desjardins, C.A., Zucker, J., Zeng, Q., Leach, A.L.B., Ghiviriga, I., Dancel, C., Abeel, T., Gevers, D., Kodira, C.D., Desany, B., Affourtit, J.P., Birren, B.W. & Sinskey, A.J. (2011). Comparative and functional genomics of Rhodococcus opacus PD630 for biofuels development. PLoS Genetics, 7. 4 Hooke, R. (1665). Micrographia: or, Some physiological descriptions of minute bodies made by magnifying glasses. Royal Society, London. 5 Huse, S.M., Ye, Y., Zhou, Y. & Fodor, A.A. (2012). A core human microbiome as viewed through 16S rRNA sequence clusters. PloS One, 7, e34242. 56 113 Jackson, D.A., Symons, R.H. & Berg, P. (1972). Biochemical method for inserting new genetic information into DNA of Simian Virus 40: circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli. Proceedings of the National Academy of Sciences, 69, 2904–2909. 12 Jarrell, K.F., Walters, A.D., Bochiwal, C., Borgia, J.M., Dickinson, T. & Chong, J.P. (2011). Major players on the microbial stage: why archaea are important. Microbiology, 157, 919–936. 74 Kasting, J.F. & Siefert, J.L. (2002). Life and the evolution of Earth’s atmosphere. Science, 296, 1066–1068. 1 Kay, L.E. (2000). Who Wrote the Book of Life?: A History of the Genetic Code. Stanford University Press. 9, 10, 11, 12 Kedes, L. & Liu, E.T. (2010). The Archon Genomics X PRIZE for whole human genome sequencing. Nature Genetics, 42, 917–918. 13 Kemena, C. & Notredame, C. (2009). Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics, 25, 2455–2465. 19 Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences, 78, 454–458. 8 Klappenbach, J.A., Dunbar, J.M. & Schmidt, T.M. (2000). rRNA Operon Copy Number Reflects Ecological Strategies of Bacteria. Applied and Environmental Microbiology, 66, 1328–1333. 49 Klappenbach, J.A., Saxman, P.R., Cole, J.R. & Schmidt, T.M. (2001). rrndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Research, 29, 181–184. 34, 50 114 Koch, R. (1890). Aetiology of tuberculosis. William R. Jenkins. 1 Korneel, R., Jorge, R., Blackall Linda, L., Jurg, K., Pamela, G., Damien, B., Willy, V. & Nealson Kenneth, H. (2007). Microbial ecology meets electrochemistry: electricitydriven and driving communities. ISME J, 1, 9–18. 37 Krause, L., Diaz, N.N., Goesmann, A., Kelley, S., Nattkemper, T.W., Rohwer, F., Edwards, R.A. & Stoye, J. (2008). Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Research, 36, 2230–2239. 82 Krogh, A., Brown, M., Mian, I.S., Sjölander, K. & Haussler, D. (1994). Hidden Markov models in computational biology: Applications to protein modeling. Journal of molecular biology, 235, 1501–1531. 20, 22 Kurokawa, K., Itoh, T., Kuwahara, T., Oshima, K., Toh, H., Toyoda, A., Takami, H., Morita, H., Sharma, V.K., Srivastava, T.P., Taylor, T.D., Noguchi, H., Mori, H., Ogura, Y., Ehrlich, D.S., Itoh, K., Takagi, T., Sakaki, Y., Hayashi, T. & Hattori, M. (2007). Comparative Metagenomics Revealed Commonly Enriched Gene Sets in Human Gut Microbiomes. DNA Research, 14, 169–181. 58, 61, 62 Lagesen, K., Hallin, P., Rødland, E.A., Stærfeldt, H.H., Rognes, T. & Ussery, D.W. (2007). RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Research, 35, 3100–3108. 45 Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2011a). A novel method for phylotyping complex populations. Nordic Archaeal conference, Helskinki. xvii Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2011b). Microbial community auditing with ss-RNA. Searching for a core microbial community in the guts of healthy humans. Modelling & Microbiology conference, Edinburgh. xvii 115 Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2011c). Searching for a core microbial community in the human gut microbiome. SGM Autumn conference, York. xvii Leach, A.L.B., Chong, J.P.J. & Redeker, K.R. (2012). SSuMMo: rapid analysis, comparison and visualization of microbial communities. Bioinformatics, 28, 679–686. xvii, 56 Ledergerber, C. & Dessimoz, C. (2011). Base-calling for next-generation sequencing platforms. Briefings in Bioinformatics. 37 Leroy, F. & De Vuyst, L. (2004). Lactic acid bacteria as functional starter cultures for the food fermentation industry. Trends in Food Science & Technology, 15, 67–78. 4 Letunic, I. & Bork, P. (2006). Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics, 23, 127–128. 26, 39, 62, 75 Ley, R.E., Turnbaugh, P.J., Klein, S. & Gordon, J.I. (2006). Microbial ecology: Human gut microbes associated with obesity. Nature, 444, 1022–1023. 37 Li, H. & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, 11, 473–483. 19, 20 Li, L., Stoeckert, C.J. & Roos, D.S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome research, 13, 2178–2189. 83 Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P. & Zhang, H. (2001). An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17, 149–154. 19 Liu, Z., DeSantis, T.Z., Andersen, G.L. & Knight, R. (2008). Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research, 36, e120. 37, 38 116 Loscalzo, J. (2006). The NIH budget and the future of biomedical research. New England Journal of Medicine, 354, 1665–1667. 3 Lozupone, C. & Knight, R. (2005). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71, 8228–8235. 84 MacLean, D., Jones, J.D. & Studholme, D.J. (2009). Application of ‘next-generation’ sequencing technologies to microbial genetics. Nature Reviews Microbiology, 7, 287–296. 3 Magurran, A.E. (2009). Measuring Biological Diversity. Blackwell Publishing. 11, 31, 32, 64, 66, 68, 77, 84 Mande, S.S., Mohammed, M.H. & Ghosh, T.S. (2012). Classification of metagenomic sequences: methods and challenges. Briefings in Bioinformatics, bbs054. 82 Manichanh, C., Rigottier-Gois, L., Bonnaud, E., Gloux, K., Pelletier, E., Frangeul, L., Nalin, R., Jarrin, C., Chardon, P., Marteau, P. et al. (2006). Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut, 55, 205–211. 66 Manichanh, C., Chapple, C.E., Frangeul, L., Gloux, K., Guigo, R. & Dore, J. (2008). A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library. Nucleic Acids Research, 36, 5180–5188. 37 Marchesi, J.R. (2011). Human distal gut microbiome. Environmental Microbiology, 13, 3088–3102. 79 Mardis, E.R. (2011). A decade’s perspective on DNA sequencing technology. Nature, 470, 198–203. 2 117 Matthaei, J.H. & Nirenberg, M.W. (1961). Characteristics and stabilization of DNAasesensitive protein synthesis in E. coli extracts. Proceedings of the National Academy of Sciences of the United States of America, 47, 1580. 12 Maxam, A.M. & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences, 74, 560–564. 12 Mayr, E. (1959). Darwin and the evolutionary theory in biology. Evolution and anthropology: A centennial appraisal, 1–10. 6, 7 Mayr, E. (1982). The growth of biological thought. Harvard University Press. 6 Mazzarello, P. (1999). A unifying concept: the history of cell theory. Nature Cell Biology, E13–E15. 5 McHardy, A.C. & Rigoutsos, I. (2007). What’s in the mix: phylogenetic classification of metagenome sequence samples. Current Opinion in Microbiology, 10, 499–503. 14, 15, 81 Metzker, M.L. (2009). Sequencing technologies – the next generation. Nature Reviews Genetics, 11, 31–46. 2, 13, 14 Mitsuoka, T. (1990). Bifidobacteria and their role in human health. Journal of Industrial Microbiology, 6, 263–267. 4 Nakamura, Y., Cochrane, G. & Karsch-Mizrachi, I. (2013). The International Nucleotide Sequence Database Collaboration. Nucleic Acids Research, 41, D21–D24. 3 National Human Genome Research Institute (2003), accessed 2011, The human genome project completion: Frequently asked questions, http://www.genome.gov/11006943. 13 NCBI (2010), accessed 2010, NCBI FTP server, ftp://ftp.ncbi.nlm.nih.gov/genomes/TARGET/. 40, 43 118 Nielsen, C.B., Cantor, M., Dubchak, I., Gordon, D. & Wang, T. (2010). Visualizing genomes: techniques and challenges. Nature Methods, 7, S5–S15. 3 Nirenberg, M.W. & Matthaei, J.H. (1961). The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proceedings of the National Academy of Sciences of the United States of America, 47, 1588. 12 Nisbet, E. & Weiss, R. (2010). Top-down versus bottom-up. Science, 328, 1241–1243. 82 Notredame, C. (2007). Recent evolutions of multiple sequence alignment algorithms. PLoS Computational Biology, 3, e123. 20 Orlowski, M. (1991). Mucor dimorphism. Microbiological Reviews, 55, 234–258. 5 Pace, N.R. (2009). Mapping the tree of life: progress and prospects. Microbiology and Molecular Biology Reviews, 73, 565–576. 12 Pace, N.R., Sapp, J. & Goldenfeld, N. (2012). Phylogeny and beyond: Scientific, historical, and conceptual significance of the first tree of life. Proceedings of the National Academy of Sciences, 109, 1011–1018. 81 Park, P.J. (2009). ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10, 669–680. 14 Pecoraro, V., Karolin, Z., Christian, L. & Jörg, S. (2011). Quantification of Ploidy in Proteobacteria Revealed the Existence of Monoploid, (Mero-)Oligoploid and Polyploid Species. PLoS One, 6, e16392. 49 Peplies, J., Kottmann, R., Ludwig, W. & Glöckner, F.O. (2008). A standard operating procedure for phylogenetic inference (SOPPI) using (rRNA) marker genes. Systematic and Applied Microbiology, 31, 251–257. 42 119 Peterson, J., Garges, S., Giovanni, M., McInnes, P., Wang, L., Schloss, J.A., Bonazzi, V., McEwen, J.E., Wetterstrand, K.A., Deal, C., Baker, C.C., Di Francesco, V., Howcroft, T.K., Karp, R.W., Lunsford, R.D., Wellington, C.R., Belachew, T., Wright, M., Giblin, C., David, H., Mills, M., Salomon, R., Mullins, C., Akolkar, B., Begg, L., Davis, C., Grandison, L., Humble, M., Khalsa, J., Little, A.R., Peavy, H., Pontzer, C., Portnoy, M., Sayre, M.H., Starke-Reed, P., Zakhari, S., Read, J., Watson, B. & Guyer, M. (2009). The NIH Human Microbiome Project. Genome Research, 19, 2317–2323. 55, 56, 58, 59, 62, 77, 79 Pham, T.D. & Zuegg, J. (2004). A probabilistic measure for alignment-free sequence comparison. Bioinformatics, 20, 3455–3461. 19 Pielou, E.C. (1975). Ecological diversity. Wiley New York. 32, 64 Pienkowski, M., Watkinson, A., Kerby, G., Clarke, K. & Warwick, R. (1998a). A taxonomic distinctness index and its statistical properties. Journal of Applied Ecology, 35, 523–531. 84 Pienkowski, M., Watkinson, A., Kerby, G., Warwick, R. & CLARKE, K.R. (1998b). Taxonomic distinctness and environmental assessment. Journal of Applied Ecology, 35, 532–543. 84 Pop, M. & Salzberg, S.L. (2008). Bioinformatics challenges of new sequencing technology. Trends in Genetics, 24, 142–149. 3 Porter, J.R. (1976). Antonie van Leeuwenhoek: tercentenary of his discovery of bacteria. Bacteriological Reviews, 40, 260–269. 5 Pruesse, E., Quast, C., Knittel, K., Fuchs, B.M., Ludwig, W., Peplies, J. & Glöckner, F.O. (2007). SILVA: a comprehensive online resource for quality checked and aligned 120 ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research, 35, 7188– 7196. 16, 23, 38 Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T. et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464, 59–65. 3, 56, 57, 58, 61, 64, 65, 66, 67, 70, 82 Raes, J. & Bork, P. (2008). Molecular eco-systems biology: towards an understanding of community function. Nature Reviews Microbiology, 6, 693–699. 37, 38 Remmert, M., Biegert, A., Hauser, A. & Söding, J. (2011). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods, 9, 173–175. 83 Richter, B.G. & Sexton, D.P. (2009). Managing and analyzing next-generation sequence data. PLoS Computational Biology, 5, e1000369. 4 Roesch, L.F.W., Fulthorpe, R.R., Riva, A., Casella, G., Hadwin, A.K.M., Kent, A.D., Daroub, S.H., Camargo, F.A.O., Farmerie, W.G. & Triplett, E.W. (2007). Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J, 1, 283–290. 37 Rothberg, J.M. & Leamon, J.H. (2008). The development and impact of 454 sequencing. Nature Biotechnology, 26, 1117–1124. 14 Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, J., Mileski, W., Davey, M., Leamon, J.H., Johnson, K., Milgrew, M.J., Edwards, M. et al. (2011). An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475, 348–352. 4 121 Salzberg, S.L., Delcher, A.L., Kasif, S. & White, O. (1998). Microbial gene identification using interpolated Markov models. Nucleic acids research, 26, 544–548. 83 Sanger, F. & Coulson, A.R. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of molecular biology, 94, 441–448. 12 Sattler, B., Puxbaum, H. & Psenner, R. (2001). Bacterial growth in supercooled cloud droplets. Geophysical Research Letters, 28, 239–242. 1 Schloss, P.D. & Handelsman, J. (2005). Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness. Applied and Environmental Microbiology, 71, 1501–1506. 38, 51, 83 Schloss, P.D. & Handelsman, J. (2006). Toward a census of bacteria in soil. PLoS Computational Biology, 2, e92. 1 Schrödinger, E. (1944). What is life? Lectures at the Dublin Institute for Advanced Studies. Trinity College, Dublin. 9 Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O. & Huttenhower, C. (2012). Metagenomic microbial community profiling using unique cladespecific marker genes. Nature Methods, 9, 811–814. 14 Shannon, C.E. (1949). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5, 3–55. 10 Shannon, C.E. & Weaver, W. (1949). Recent contributions to the mathematical theory of communication. University of Illinois Press, 19, 1. 10 122 Sharma, V.K., Kumar, N., Prakash, T. & Taylor, T.D. (2012). Fast and accurate taxonomic assignments of metagenomic sequences using MetaBin. PLoS One, 7, e34030. 82 Shendure, J. & Aiden, E.L. (2012). The expanding scope of DNA sequencing. Nature Biotechnology, 30, 1084–1094. 2, 3 Shendure, J. & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology, 26, 1135–1145. 2, 37 Siepel, A. & Haussler, D. (2004). Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology, 11, 413–428. 20 Siezen, R.J. & van Hijum, S.A.F.T. (2010). Genome (re-)annotation and open-source annotation pipelines. Microbial Biotechnology, 3, 362–369. 51 Silk, J. (1999). II. The case for the Big Bang. Comptes Rendus de l’Académie des Sciences Series IIB - Mechanics-Physics-Astronomy, 327, 829–840. 8 Siva, N. (2008). 1000 genomes project. Nature Biotechnology, 26, 256–256. 13 Sjölander, K. (2004). Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics, 20, 170–179. 81 Smit, P. & Heniger, J. (1975). Antonie van Leeuwenhoek (1632-1723) and the discovery of bacteria. Antonie van Leeuwenhoek, 41, 217–228. 5 Sober, E. (1980). Evolution, population thinking, and essentialism. Philosophy of Science, 47, pp. 350–383. 7 Söding, J. (2005). Protein homology detection by HMM–HMM comparison. Bioinformatics, 21, 951–960. 20, 83 123 Sogin, M.L., Morrison, H.G., Huber, J.A., Welch, D.M., Huse, S.M., Neal, P.R., Arrieta, J.M. & Herndl, G.J. (2006). Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proceedings of the National Academy of Sciences, 103, 12115–12120. 37 Sonnenburg, J.L. (2010). Microbiology: genetic pot luck. Nature, 464, 837–838. 56 Stackebrandt, E. & Goebel, B. (1994). Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. International Journal of Systematic Bacteriology, 44, 846–849. 16 Stent, G.S. (1968). That was the molecular biology that was. Science, 160, 390–395. 8, 9, 10 Sun, Y., Cai, Y., Liu, L., Yu, F., Farrell, M.L., McKendree, W. & Farmerie, W. (2009). ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Research, 37, e76. 83 Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. & Kumar, S. (2011). MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular biology and evolution, 28, 2731–2739. 8 Tsai, S.H., Liu, C.P. & Yang, S.S. (2007). Microbial conversion of food wastes for biofertilizer production with thermophilic lipolytic microbes. Renewable Energy, 32, 904–915. 4 Turnbaugh, P.J., Ley, R.E., Mahowald, M.A., Magrini, V., Mardis, E.R. & Gordon, J.I. (2006). An obesity-associated gut microbiome with increased capacity for energy harvest. Nature, 444, 1027–131. 57 124 Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R. & Gordon, J.I. (2007). The Human Microbiome Project. Nature, 449, 804–810. 56 Turnbaugh, P.J., Hamady, M., Yatsunenko, T., Cantarel, B.L., Duncan, A., Ley, R.E., Sogin, M.L., Jones, W.J., Roe, B.A., Affourtit, J.P., Egholm, M., Henrissat, B., Heath, A.C., Knight, R. & Gordon, J.I. (2009). A core gut microbiome in obese and lean twins. Nature, 457, 480–484. 37, 48, 56, 58, 60, 61, 62, 67, 69, 70, 74 Vetriani, C., Jannasch, H.W., MacGregor, B.J., Stahl, D.A. & Reysenbach, A.L. (1999). Population structure and phylogenetic characterization of marine benthic archaea in deep-sea sediments. Applied and Environmental Microbiology, 65, 4375–4384. 1 Vinga, S. & Almeida, J. (2003). Alignment-free sequence comparison–a review. Bioinformatics, 19, 513–523. 19 Wadman, M. (2008). James Watson’s genome sequenced at high speed. Nature, 452, 788. 13 Wang, L. & Jiang, T. (1994). On the complexity of multiple sequence alignment. Journal of Computational Biology, 1, 337–348. 83 Wang, Q., Garrity, G.M., Tiedje, J.M. & Cole, J.R. (2007). Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73, 5261–5267. 48 Watson, J.D. & Crick, F.H. (1953). Molecular structure of nucleic acids. Nature, 171, 737–738. 12 Wheelis, M. (1998). First shots fired in biological warfare. Nature, 395, 213–213. 4 125 Whitman, W.B., Coleman, D.C. & Wiebe, W.J. (1998). Prokaryotes: The unseen majority. Proceedings of the National Academy of Sciences, 95, 6578–6583. 1 Wiener, N. (1948). Cybernetics: or control and communication in the animal and the machine. MIT Press. 10 Williams, P., Winzer, K., Chan, W.C. & Cámara, M. (2007). Look who’s talking: communication and quorum sensing in the bacterial world. Philosophical Transactions of the Royal Society B: Biological Sciences, 362, 1119–1134. 1 Woese, C. (1998). The universal ancestor. Proceedings of the National Academy of Sciences, 95, 6854–6859. 7 Woese, C.R. & Fox, G.E. (1977). Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences, 74, 5088–5090. 12 Wolinsky, H. (2007). The thousand-dollar genome. Genetic brinkmanship or personalized medicine? EMBO reports, 8, 900. 13 Wu, M. & Eisen, J. (2008). A simple, fast, and accurate method of phylogenomic inference. Genome Biology, 9, R151. 16, 81 Wu, T.J., Burke, J.P. & Davison, D.B. (1997). A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics, 1431–1439. 19 126