UNIT I

LESSON – 1: INTRODUCTION TO BIOINFORMATICS

1.0 Aims and Objectives
1.1 Introduction to Bioinformatics
1.2 Landmark Sequences Completed
1.3 Sequence Analysis: Sequence to Potential Function
1.4 The Creation of Sequence Databases
1.5 Searching for Genes
1.6 Let us sum up
1.7 Lesson end activities
1.8 Check your progress
1.9 Points for Discussion
1.10 References

1.0 Aims and Objectives:
This lesson introduces Bioinformatics, surveys the landmark sequences completed so far, and describes computational biology and sequence databases.

1.1 Introduction to Bioinformatics:
Bioinformatics and computational biology involve the use of techniques from applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the molecular level. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.

The terms Bioinformatics and computational biology are often used interchangeably. However, Bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Computational biology, on the other hand, refers to hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge. Put more simply, Bioinformatics is concerned with the information, while computational biology is concerned with the hypotheses. A similar distinction is made by the National Institutes of Health in its working definitions of Bioinformatics and Computational Biology, where it is further emphasized that there is a tight coupling of developments and knowledge between the more hypothesis-driven research in computational biology and the technique-driven research in Bioinformatics. Bioinformatics is also often described as an applied subfield of the more general discipline of biomedical informatics.

A common thread in Bioinformatics and computational biology projects is the use of mathematical tools to extract useful information from data produced by high-throughput biological techniques such as genome sequencing. A representative problem in Bioinformatics is the assembly of high-quality genome sequences from fragmentary "shotgun" DNA sequencing (a toy sketch of this idea appears below). Other common problems include the study of gene regulation using data from microarrays or mass spectrometry.

In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes of several species. In fact, to date, several bacterial genomes, as well as those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast) and more complex eukaryotes (C. elegans and Drosophila), have been sequenced in full.
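To make the shotgun-assembly idea concrete, here is a toy, hypothetical sketch in Python: it greedily merges the pair of fragments with the longest suffix-prefix overlap until nothing overlaps any more. Real assemblers use far more sophisticated graph-based methods and must cope with sequencing errors and repeats; the names greedy_assemble and overlap are illustrative only.

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that equals a prefix of b."""
        start = 0
        while True:
            start = a.find(b[:min_len], start)  # candidate start position in a
            if start == -1:
                return 0
            if b.startswith(a[start:]):
                return len(a) - start
            start += 1

    def greedy_assemble(fragments):
        """Toy greedy assembly: repeatedly merge the best-overlapping pair."""
        reads = list(fragments)
        while len(reads) > 1:
            best_len, best_i, best_j = 0, None, None
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        olen = overlap(a, b)
                        if olen > best_len:
                            best_len, best_i, best_j = olen, i, j
            if best_len == 0:
                break  # no remaining overlaps; fragments cannot be joined
            merged = reads[best_i] + reads[best_j][best_len:]
            reads = [r for k, r in enumerate(reads)
                     if k not in (best_i, best_j)] + [merged]
        return reads

    # Three overlapping "reads" reassemble into one contig:
    print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]))
    # -> ['ATTAGACCTGCCGGAATAC']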
The Human Genome Project, designed to sequence all 24 of the human chromosomes, is also progressing, and a rough draft was completed in the spring of 2000. Popular sequence databases, such as GenBank and EMBL, have been growing at exponential rates. This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.

1.2 Landmark Sequences Completed
· tRNA (1964) - 75 bases (by an old, slow, complicated method)
· First complete DNA genome: phage ΦX174 DNA (1977) - 5,386 bases
· Human mitochondrial DNA (1981) - 16,569 bases
· Tobacco chloroplast DNA (1986) - 155,844 bases
· First complete bacterial genome, H. influenzae (1995) - 1.9 x 10^6 bases
· Yeast genome (a eukaryote, at ~1.5 x 10^7 bases), completed in 1996
· Several archaebacteria
· E. coli - 4 x 10^6 bases (1997 and 1998)
· Several pathogenic bacterial genomes sequenced:
  o Helicobacter pylori (ulcers)
  o Treponema pallidum (syphilis)
  o Borrelia burgdorferi (Lyme disease)
  o Chlamydia trachomatis (trachoma - blindness)
  o Rickettsia prowazekii (epidemic typhus)
  o Mycobacterium tuberculosis (tuberculosis)
· Nematode C. elegans (~10^8 bases) - December 1998
· Drosophila (fruit fly) (2000)
· Human genome (rough draft completed 5/00) - 3 x 10^9 bases

Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence, and structural information. This includes databases of the sequences and structural information as well as methods to access, search, visualize and retrieve the information. Sequence data can be used to make predictions of the functions of newly identified genes, estimate evolutionary distance in phylogeny reconstruction, determine the active sites of enzymes, construct novel mutations, and characterize alleles of genetic diseases, to name just a few uses.

Sequence data facilitates:
· Analysis of the organization of genes and genomes and their evolution
· Prediction of protein sequence from DNA sequence, which in turn facilitates prediction of protein properties, structure, and function (proteins are rarely sequenced in their entirety today)
· Identification of regulatory elements in genes or RNAs
· Identification of mutations that lead to disease, etc.

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within Bioinformatics involving computational biology:
· the development of new algorithms and statistics with which to assess relationships among members of large data sets;
· the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures;
· the development and implementation of tools that enable efficient access and management of different types of information.

One of the simpler tasks in Bioinformatics concerns the creation and maintenance of databases of biological information.
Nucleic acid sequences (and the protein sequences derived from them) comprise the majority of such databases. While the storage and organization of millions of nucleotides is far from trivial, designing a database and developing an interface whereby researchers can both access existing information and submit new entries is only the beginning. The most pressing tasks in Bioinformatics involve the analysis of sequence information. Computational Biology is the name given to this process, and it involves the following:
· Finding the genes in the DNA sequences of various organisms
· Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
· Clustering protein sequences into families of related sequences and developing protein models
· Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships

Data mining is the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better characterized organisms. For example, new insight into the molecular basis of a disease may come from investigating the function of homologs of the disease gene in model organisms. Equally exciting is the potential for uncovering phylogenetic relationships and evolutionary patterns. The process of evolution has produced DNA sequences that encode proteins with very specific functions. It is possible to predict the three-dimensional structure of a protein using algorithms that have been derived from our knowledge of physics, chemistry and, most importantly, from the analysis of other proteins with similar amino acid sequences.

1.3 Sequence Analysis: Sequence to Potential Function
Sequence to potential function (see flow chart scheme and handout):
· ORF prediction and gene identification (see handout for eukaryotic gene organization)
· Search databases for potential protein function or a homologue
· Protein structure prediction and multiple sequence alignment (conserved regions)
· Analysis of potential gene regulatory elements
· Gene knockout or inhibition (RNA interference) for phenotypic analysis

Overview of sequence analysis (see handout). Sequence data facilitates:
· Analysis of the organization of genes and genomes
· Prediction of protein properties, functions, and structure from gene sequence or cDNA (proteins are rarely sequenced in their entirety today) -- e.g., cystic fibrosis
· Identification of regulatory elements
· Identification of mutations that lead to disease, etc.

1.4 The Creation of Sequence Databases
Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or a section thereof), respectively. Sequences are represented in shorthand, using single-letter designations. This decreases the space necessary to store the information and increases the processing speed for analysis (a small sketch of compact sequence storage follows below). While most biological databases contain nucleotide and protein sequence information, there are also databases that include taxonomic information, such as the structural and biochemical characteristics of organisms.
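As an illustration of how compactly sequences can be stored, the sketch below packs each DNA base into two bits, so four bases fit in one byte. This is an illustrative toy, not any real database's actual storage scheme.

    # Two bits suffice for the four DNA letters, so four bases fit in one byte.
    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BASE = {v: k for k, v in CODE.items()}

    def pack(seq):
        """Pack an A/C/G/T string into bytes, four bases per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            chunk = seq[i:i + 4]
            byte = 0
            for base in chunk:
                byte = (byte << 2) | CODE[base]
            byte <<= 2 * (4 - len(chunk))   # left-align a partial final chunk
            out.append(byte)
        return bytes(out)

    def unpack(data, length):
        """Recover the original string; length says how many bases are real."""
        bases = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                bases.append(BASE[(byte >> shift) & 0b11])
        return "".join(bases[:length])

    seq = "ACGTGATTACA"
    packed = pack(seq)
    assert unpack(packed, len(seq)) == seq
    print(len(seq), "bases stored in", len(packed), "bytes")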
The power and ease of using sequence information has, however, made it the method of choice in modern analysis. In the last three decades, contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into bacteria. In this way, rapid mass production of particular DNA sequences, a necessary prelude to sequence determination, became possible. Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used to probe vast libraries of DNA to extract genes containing that sequence. Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA sequences or to modify these sequences.

With these techniques in place, progress in biological research increased exponentially. For researchers to benefit from all this information, however, two additional things were required: (1) ready access to the collected pool of sequence information, and (2) a way to extract from this pool only those sequences of interest to a given researcher. Simply collecting, by hand, all the sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of the data still remained; it would take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins.

Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models. The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it.

Check your progress:
1. List the processes involved in computational biology.
Notes:
a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Description and links to the NCBI Entrez database (an example of a complex database): an illustration of the nucleotide sequence database with Entrez (selected Davis Lab sequence entries). You should explore the NCBI and its services on your own after browsing these entries.
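Assuming the Biopython library is installed, a record such as the yeast chromosome VI entry mentioned below (accession D50617) can be fetched from Entrez programmatically. This is a hedged sketch of typical Bio.Entrez usage, not an official NCBI recipe; the e-mail address is a placeholder that NCBI asks each user to replace with their own.

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # placeholder; NCBI requests a real contact

    # Fetch the GenBank-formatted text of accession D50617 (yeast chromosome VI)
    handle = Entrez.efetch(db="nucleotide", id="D50617",
                           rettype="gb", retmode="text")
    record_text = handle.read()
    handle.close()
    print(record_text[:300])   # the LOCUS/DEFINITION lines of the entry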
Databases of protein and nucleic acid sequences
· In the US, the repository of this information is the National Center for Biotechnology Information (NCBI).
· The database at the NCBI is a collated and interlinked dataset known as the Entrez databases.
  o Description of the Entrez databases
  o Examples of selected database files:
    § Protein (most protein sequence is derived from conceptual translation)
    § Chromosome with genes and predicted proteins (accession no. D50617, yeast chromosome VI; Entrez)
    § Genome (C. elegans)
    § Protein structure (TPI database file or Chime structure)
    § Expressed sequence tags (ESTs) (summary of current data)
  o Neighboring

Searching databases to identify sequences and to predict functions or properties of predicted proteins:
· Searching by keyword, accession number, etc.
· Searching for homologous sequences
  o See the NCBI BLAST. BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.

1.5 Searching for Genes
The collecting, organizing and indexing of sequence information into a database, a challenging task in itself, provides the scientist with a wealth of information, albeit of limited use. The power of a database comes not from the collection of information, but from its analysis. A sequence of DNA does not necessarily constitute a gene; it may constitute only a fragment of a gene or, alternatively, it may contain several genes. Luckily, in agreement with evolutionary principles, scientific research to date has shown that all genes share common elements. For many genetic elements, it has been possible to construct consensus sequences, those sequences best representing the norm for a given class of organisms (e.g., bacteria, eukaryotes). Common genetic elements include promoters, enhancers, polyadenylation signal sequences and protein binding sites. These elements have been further characterized into subelements. Genetic elements share common sequences, and it is this fact that allows mathematical algorithms to be applied to the analysis of sequence data (a simple illustration follows below).
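A minimal illustration of such an algorithm is the classic open-reading-frame scan sketched below: it looks for an ATG start codon followed, in the same reading frame, by a stop codon. This is a deliberately simplified, hypothetical example (forward strand only, no introns, no promoter or other consensus elements), not a production gene finder.

    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def find_orfs(dna, min_codons=100):
        """Return (start, end) positions of simple forward-strand ORFs."""
        orfs = []
        for frame in range(3):                       # three forward reading frames
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                        # first start codon in this frame
                elif codon in STOP_CODONS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((start, i + 3))  # include the stop codon
                    start = None
        return orfs

    # Short demo with a low threshold; real scans use ~100 codons or more
    print(find_orfs("CCATGAAATGGTTTAAGGG", min_codons=2))
    # -> [(7, 16)]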
1.6 Let us sum up:
Bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. This lesson gives a brief introduction to Bioinformatics and its areas of specialization, and explains the basic difference between Bioinformatics and computational biology. It also gives details of the various sequence resources available and other facilities.

1.7 Lesson end activities:
i. Find out the details of the various genomes sequenced so far.
ii. What are the various databases for protein and DNA sequence?
iii. Give details of the various options available at NCBI.

1.8 Check your progress: Model answers
1. Your answer must include these points:
· Finding the genes in the DNA sequences of various organisms
· Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
· Clustering protein sequences into families of related sequences and developing protein models
· Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships

1.9 Points for Discussion
1. Substantiate the significance of Bioinformatics.
2. Compare and contrast Bioinformatics and computational biology.

1.10 References:
1. Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Bairoch A., Bucher P., and Hofmann K. 1997. The PROSITE database, its status in 1997. Nucleic Acids Res. 25: 217–221.
4. Barker W.C. and Dayhoff M.O. 1982. Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. 79: 2836–2839.
5. Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. 93: 9188–9193.

LESSON – 2: CLASSIFICATION OF BIOLOGICAL DATABASES

2.0 Aims and Objectives
2.1 Classification of Biological Databases
2.2 Primary sequence databases
2.3 Protein sequence databases
2.4 Protein structure databases
2.5 Other databases
2.6 Specialized databases
2.7 Let us sum up
2.8 Lesson end activities
2.9 Check your progress
2.10 Points for Discussion
2.11 References

2.0 Aims and Objectives:
This lesson describes the different biological databases: primary sequence databases, meta-databases, genome browsers, protein sequence databases, and protein structure databases.

2.1 Classification of Biological Databases
Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications, and aids in discovering basic relationships among species in the history of life.

Biological knowledge is usually distributed among many different specialized databases. This makes it difficult to ensure the consistency of information, which sometimes leads to low data quality. By far the most important resource for locating biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The NAR Database Issue is freely available and categorizes all the publicly available online databases related to computational biology (or Bioinformatics).

2.2 Primary sequence databases
The International Nucleotide Sequence Database (INSD) consists of the following databases:
1. DDBJ (DNA Data Bank of Japan)
2. EMBL Nucleotide DB (European Molecular Biology Laboratory)
3. GenBank (National Center for Biotechnology Information)
These databanks represent the current knowledge about the sequences of all organisms. They interchange the stored information and are the source for many other databases.

Meta-databases
Strictly speaking, a meta-database can be considered a database of databases, rather than any one integration project or technology.
It collects information from other sources and usually makes it available in a new, more convenient form. The following are some examples of meta-databases:
1. Entrez (National Center for Biotechnology Information)
2. euGenes (Indiana University)
3. GeneCards (Weizmann Institute)
4. SOURCE (Stanford University)
5. mGen, containing four of the world's biggest databases (GenBank, RefSeq, EMBL and DDBJ), with easy and simple program-friendly gene extraction
6. Harvester III (Karlsruhe Institute of Technology), integrating 26 major protein/gene resources

Genome browsers
Genome browsers enable researchers to visualize and browse entire genomes (most host many complete genomes) with annotated data, including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotated data usually come from multiple diverse sources. The following are some examples of genome browsers:
1. Integrated Microbial Genomes (IMG) system, by the DOE Joint Genome Institute
2. UCSC Genome Bioinformatics Genome Browser and Tools (UCSC)
3. Ensembl: the Ensembl Genome Browser (Sanger Institute and EBI)
4. GBrowse: the GMOD GBrowse project
5. Pathway Tools Genome Browser
6. X:Map, a genome browser that shows Affymetrix exon microarray hit locations alongside the gene, transcript and exon data on a Google Maps API

2.3 Protein sequence databases
1. UniProt: Universal Protein Resource (UniProt Consortium: EBI, Expasy, PIR)
2. PIR: Protein Information Resource (Georgetown University Medical Center (GUMC))
3. Swiss-Prot: Protein Knowledgebase (Swiss Institute of Bioinformatics)
4. PEDANT: Protein Extraction, Description and ANalysis Tool (Forschungszentrum f. Umwelt & Gesundheit)
5. PROSITE: Database of Protein Families and Domains
6. DIP: Database of Interacting Proteins (Univ. of California)
7. Pfam: Protein families database of alignments and HMMs (Sanger Institute)
8. ProDom: Comprehensive set of Protein Domain Families (INRA/CNRS)
9. SignalP: Server for signal peptide prediction

2.4 Protein structure databases
Protein structure databases:
1. Protein Data Bank (PDB) (Research Collaboratory for Structural Bioinformatics (RCSB))
2. CATH: Protein Structure Classification
3. SCOP: Structural Classification of Proteins
4. SWISS-MODEL: Server and Repository for Protein Structure Models
5. ModBase: Database of Comparative Protein Structure Models (Sali Lab, UCSF)

Protein-protein interactions:
1. BioGRID: A General Repository for Interaction Datasets (Samuel Lunenfeld Research Institute)
2. STRING: a database of known and predicted protein-protein interactions (EMBL)
3. Database of Interacting Proteins

2.5 Other Databases
Pathway databases:
1. BioCyc Database Collection, including EcoCyc and MetaCyc
2. KEGG PATHWAY Database (Univ. of Kyoto)
3. Reactome (Cold Spring Harbor Laboratory, EBI, Gene Ontology Consortium)

Microarray databases:
1. ArrayExpress (European Bioinformatics Institute)
2. Gene Expression Omnibus (National Center for Biotechnology Information)
3. maxd (Univ. of Manchester)
4. SMD (Stanford University)
5. GPX (Scottish Centre for Genomic Technology and Informatics)
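As a small example of pulling an entry out of one of these repositories, the sketch below downloads a coordinate file from the RCSB Protein Data Bank over HTTP. The download URL pattern and the example identifier 1TIM (triosephosphate isomerase, the "TPI" structure mentioned in Lesson 1) are believed correct at the time of writing but may change; treat them as assumptions.

    from urllib.request import urlretrieve

    pdb_id = "1TIM"   # triosephosphate isomerase; substitute any PDB identifier
    url = "https://files.rcsb.org/download/%s.pdb" % pdb_id
    urlretrieve(url, pdb_id + ".pdb")            # save a local copy

    with open(pdb_id + ".pdb") as structure:
        print(structure.readline().rstrip())     # the HEADER record of the entry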
Check your progress:
1. List the protein structure databases available.
Notes:
a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

2.6 Specialized databases
1. CGAP: Cancer Genes (National Cancer Institute)
2. Clone Registry: Clone Collections (National Center for Biotechnology Information)
3. DBGET: H. sapiens (Univ. of Kyoto)
4. GDB: Human Genome Db (Human Genome Organisation)
5. MGI: Mouse Genome (Jackson Lab.)
6. SHMPD: The Singapore Human Mutation and Polymorphism Database
7. NCBI UniGene (National Center for Biotechnology Information)
8. OMIM: Inherited Diseases (Online Mendelian Inheritance in Man)
9. Official Human Genome Db (HUGO Gene Nomenclature Committee)
10. HGMD: disease-causing mutations (Human Gene Mutation Database)
11. Lists of SNP databases
12. p53: The p53 Knowledgebase
13. Edinburgh Mouse Atlas
14. Corn (Maize Genetics and Genomics Database)

2.7 Let us sum up:
Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. Biological knowledge is usually distributed among many different specialized databases. This lesson surveys the various biological databases available and their web locations, listing the available databases under various categories.

2.8 Lesson end activities:
i. Give details of the various primary sequence databases.
ii. Give details of the various protein structure databases.
iii. Give details of the NCBI and KEGG.

2.9 Check your progress: Model answers
1. Your answer must include any of these points:
1. Protein Data Bank (PDB)
2. CATH
3. SCOP
4. SWISS-MODEL
5. ModBase

2.10 Points for Discussion
1. Make a critical analysis of the primary sequence databases.
2. Make a comparative study of the protein structure databases.

2.11 References
1. Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Bairoch A., Bucher P., and Hofmann K. 1997. The PROSITE database, its status in 1997. Nucleic Acids Res. 25: 217–221.
4. Barker W.C. and Dayhoff M.O. 1982. Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. 79: 2836–2839.
5. Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. 93: 9188–9193.

LESSON – 3: BIOLOGICAL DATA FORMATS

3.0 Aims and Objectives
3.1 Biological Data Formats
3.2 GenBank DNA Sequence Entry
3.3 European Molecular Biology Laboratory Data Library Format
3.4 SwissProt Sequence Format
3.5 FASTA Sequence Format
3.6 National Biomedical Research Foundation/Protein Information Resource Sequence Format
3.7 Stanford University/Intelligenetics Sequence Format
3.8 Genetics Computer Group Sequence Format
3.9 Plain/ASCII Staden Sequence Format
3.10 Abstract Syntax Notation Sequence Format
3.11 Genetic Data Environment Sequence Format
3.12 Multiple Sequence Formats
3.13 Let us sum up
3.14 Lesson end activities
3.15 Check your progress
3.16 Points for Discussion
3.17 References

3.0 Aims and Objectives:
This lesson describes the different biological data formats used by sequence databases and sequence analysis programs, and the tools available for converting between them.

3.1 Biological Data Formats
One major difficulty encountered in running sequence analysis software is the use of differing sequence formats by different programs. These formats are all standard ASCII files, but they may differ in the presence of certain characters and words that indicate where different types of information and the sequence itself are to be found. The more commonly used sequence formats are discussed below.

3.2 GenBank DNA Sequence Entry
The format of a database entry in GenBank, the NCBI nucleic acid and protein sequence database, is as follows. Information describing each sequence entry is given, including literature references, information about the function of the sequence, locations of mRNAs and coding regions, and positions of important mutations. This information is organized into fields, each with an identifier, shown as the first text on each line. In some entries, these identifiers may be abbreviated to two letters (e.g., RF for reference), and some identifiers may have additional subfields. The CDS subfield in the FEATURES field gives the amino acid sequence, obtained by translation of known and potential open reading frames, i.e., a consecutive set of three-letter words that could be codons specifying the amino acid sequence of a protein. The sequence entry is assumed by computer programs to lie between the identifiers "ORIGIN" and "//". The sequence includes numbers on each line so that sequence positions can be located by eye. Because the sequence count or a sequence checksum value may be used by a computer program to verify the sequence composition, the sequence should not be modified except by programs that also modify the count. The GenBank sequence format often has to be changed for use with sequence analysis software (a parsing sketch follows below).
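Assuming Biopython is installed, its Bio.SeqIO module reads GenBank-format entries directly, exposing the fields described above. A hedged sketch follows; the file name entry.gb is a placeholder for any saved GenBank record.

    from Bio import SeqIO

    # "genbank" tells the parser to expect the LOCUS/FEATURES/ORIGIN layout
    record = SeqIO.read("entry.gb", "genbank")   # one entry per file here
    print(record.id, record.description)

    for feature in record.features:
        if feature.type == "CDS":                # the CDS subfield of FEATURES
            protein = feature.qualifiers.get("translation", ["(none)"])[0]
            print(feature.location, protein[:30])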
3.3 European Molecular Biology Laboratory Data Library Format
The European Molecular Biology Laboratory (EMBL) maintains DNA and protein sequence databases. As with GenBank entries, a large amount of information describing each sequence entry is given, including literature references, information about the function of the sequence, locations of mRNAs and coding regions, and positions of important mutations. This information is organized into fields, each with an identifier, shown as the first text on each line. These identifiers are abbreviated to two letters (e.g., RF for reference), and some identifiers may have additional subfields. The sequence entry is assumed by computer programs to lie between the identifiers "SEQUENCE" and "//" and includes numbers on each line to locate parts of the sequence visually. The sequence count or a checksum value for the sequence may be used by computer programs to make sure that the sequence is complete and accurate. For this reason, the sequence part of the entry should usually not be modified except with programs that also modify this count. The EMBL sequence format is very similar to the GenBank format. The main differences are the use of the term ORIGIN in the GenBank format to indicate the start of the sequence, and the fact that an EMBL entry does not include the sequence of any translation products, which are shown instead as separate entries in the database. This sequence format often has to be changed for use with sequence analysis software.

3.4 SwissProt Sequence Format
The format of an entry in the SwissProt protein sequence database is very similar to the EMBL format, except that considerably more information about the physical and biochemical properties of the protein is provided.

3.5 FASTA Sequence Format
The FASTA sequence format includes three parts: (1) a comment line, identified by a ">" character in the first column, followed by the name and origin of the sequence; (2) the sequence in standard one-letter symbols; and (3) an optional "*", which indicates the end of the sequence and which may or may not be present. The presence of the "*" may be essential for some sequence analysis programs to read the sequence correctly. The FASTA format is the one most often used by sequence analysis software. This format provides a very convenient way to copy just the sequence part from one window to another because there are no numbers or other non-sequence characters within the sequence (a minimal FASTA reader is sketched below). The FASTA sequence format is similar to the Protein Information Resource (PIR) format, except that the PIR format includes a first line with a ">" character in the first column followed by information about the sequence, a second line containing an identification name for the sequence, and the third to last lines containing the sequence, as described below.
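Because the format is so simple, a FASTA reader fits in a few lines of Python. The sketch below is illustrative (the file name is a placeholder), and it tolerates the optional trailing "*" described above.

    def read_fasta(path):
        """Yield (header, sequence) pairs from a FASTA-format file."""
        header, parts = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(parts)
                    header, parts = line[1:], []    # text after ">" names the entry
                elif line:
                    parts.append(line.rstrip("*"))  # drop the optional terminator
            if header is not None:
                yield header, "".join(parts)

    for name, seq in read_fasta("sequences.fasta"):  # placeholder file name
        print(name, len(seq))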
3.6 National Biomedical Research Foundation/Protein Information Resource Sequence Format
This sequence format, sometimes also called the PIR format, has been used by the National Biomedical Research Foundation/Protein Information Resource (NBRF/PIR) and also by other sequence analysis programs. Note that sequences retrieved from the PIR database on its web site (http://www-nbrf.georgetown.edu) are not in this compact format, but in an expanded format with much more information about the sequence. The NBRF/PIR format is similar to the FASTA sequence format but with significant differences. The first line includes an initial ">" character followed by a two-letter code, such as P for complete sequence or F for fragment, followed by a 1 or 2 to indicate the type of sequence, then a semicolon, then a four- to six-character unique name for the entry. There is also an essential second line with the full name of the sequence, a hyphen, and then the species of origin. In FASTA format, by contrast, the second line is the start of the sequence, and the first line gives the sequence identifier after a ">" sign. The sequence terminates with an "*".

3.7 Stanford University/Intelligenetics Sequence Format
Started by a molecular genetics group at Stanford University and subsequently continued by a company, Intelligenetics, the IG format is similar to the PIR format, except that a semicolon is usually placed before the comment line. The identifier on the second line is also present. At the end of the sequence, a 1 is placed if the sequence is linear, and a 2 if the sequence is circular.

3.8 Genetics Computer Group Sequence Format
Earlier versions of the Genetics Computer Group (GCG) programs required a unique sequence format, and the package includes programs that convert other sequence formats into GCG format. Later versions of GCG accept several sequence formats. Information about the sequence from the GenBank entry is included first, followed by a line of information about the sequence and a checksum value. This value is provided as a check on the accuracy of the sequence, computed from the ASCII values of the sequence characters (a sketch of such a checksum follows below). If the sequence has not been changed, this value should stay the same. If one or more sequence characters are changed through error, a program reading the sequence will be able to determine that a change has occurred because the checksum value in the sequence entry will no longer be correct. Lines of information are terminated by two periods, which mark the end of the information and the start of the sequence on the next line. The rest of the text in the entry is treated as the sequence; since there is no symbol to indicate the end of the sequence, no text other than sequence should be added beyond this point. The sequence should not be altered except by programs that also adjust the checksum score. The GCG sequence format may have to be changed for use with other sequence analysis software. GCG also includes programs for reformatting sequence files.
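The text above describes the checksum simply as an addition of ASCII values; the widely used GCG algorithm, as implemented for example in Biopython's Bio.SeqUtils.CheckSum, additionally weights each character by a position index that cycles from 1 to 57 and reduces the sum modulo 10,000. A sketch along those lines:

    def gcg_checksum(seq):
        """GCG-style checksum: position-weighted sum of ASCII values, mod 10,000."""
        total = 0
        for i, char in enumerate(seq.upper()):
            total += ((i % 57) + 1) * ord(char)   # weights cycle 1..57
        return total % 10000

    seq = "GATTACA"
    print(gcg_checksum(seq))                      # changes if any character changes
    print(gcg_checksum(seq[:3] + "C" + seq[4:]))  # a single substitution alters it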
3.9 Plain/ASCII Staden Sequence Format
This sequence format is a computer file that includes only the sequence, with no accessory information. This particular format is used by the Staden sequence analysis programs (http://www.mrc-lmb.cam.ac.uk/pubseq) produced by Roger Staden at Cambridge University (Staden et al. 2000). The sequence must be further formatted before it can be used with most other sequence analysis programs.

3.10 Abstract Syntax Notation Sequence Format
Abstract Syntax Notation 1 (ASN.1) is a formal data description language developed by the computer industry. ASN.1 (http://www-sop.inria.fr/rodeo/personnel/hoschka/asn1.html; NCBI 1993) has been adopted by the National Center for Biotechnology Information (NCBI) to encode data such as sequences, maps, taxonomic information, molecular structures, and bibliographic information. These data sets may then be easily connected and accessed by computers. The ASN.1 sequence format is a highly structured and detailed format especially designed for computer access to the data. All the information found in other forms of sequence storage, e.g., the GenBank format, is present. For example, sequences can be retrieved in this format by ENTREZ. However, the information is much more difficult to read by eye than a GenBank-formatted sequence. One would normally not be required to use the ASN.1 format except when running a computer program that uses this format as input.

3.11 Genetic Data Environment Sequence Format
The Genetic Data Environment (GDE) format is used by a sequence analysis system called the Genetic Data Environment, which was designed by Steven Smith and collaborators (Smith et al. 1994) around a multiple sequence alignment editor that runs on UNIX machines. The GDE features are incorporated into the SEQLAB interface of the GCG software, version 9. GDE format is a tagged-field format similar to ASN.1 that is used for storing all available information about a sequence, including residue colour. The file consists of various fields, each enclosed by brackets, and each field has specific lines, each with a given name tag. The information following each tag is placed in double quotes or follows the tag name after one or more spaces.

Check your progress:
1. What does an NBRF sequence format look like?
Notes:
a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

READSEQ to Switch between Sequence Formats
READSEQ is an extremely useful sequence formatting program developed by D. G. Gilbert at Indiana University, Bloomington (gilbertd@bio.indiana.edu). READSEQ can recognize a DNA or protein sequence file in any of the formats discussed above, identify the format, and write a new file in an alternative format (a scripted equivalent is sketched after the format list below). Some of these formats are used for special types of analysis, such as multiple sequence alignment and phylogenetic analysis. READSEQ may be reached at the Baylor College of Medicine site at http://dot.imgen.bcm.tmc.edu:9331/seq-util/readseq.html and also by anonymous FTP from ftp.bio.indiana.edu/molbio/readseq or ftp.bio.indiana.edu/molbio/mac to obtain the appropriate files. Data files that have multiple sequences, such as those required for multiple sequence alignment and phylogenetic analysis using parsimony (PAUP), are also converted. Options to reverse-complement sequences and to remove gaps from them are included. SEQIO, another sequence conversion program for UNIX machines, is described at http://bioweb.pasteur.fr/docs/seqio/seqio.html and is available for download at http://www.cs.ucdavis.edu/~gusfield/seqio.html.

Sequence formats recognized by the format conversion program READSEQ:
1. Abstract Syntax Notation (ASN.1)
2. DNA Strider
3. European Molecular Biology Laboratory (EMBL)
4. FASTA/Pearson
5. Fitch (for phylogenetic analysis)
6. GenBank
7. Genetics Computer Group (GCG)
8. Intelligenetics/Stanford
9. Multiple sequence format (MSF)
10. National Biomedical Research Foundation (NBRF)
11. Olsen (input only)
12. Phylogenetic Analysis Using Parsimony (PAUP) NEXUS format
13. Phylogenetic Inference Package (PHYLIP v3.3, v3.4)
14. Phylogenetic Inference Package (PHYLIP v3.2)
15. Plain text/Staden
16. Pretty format for publication (output only)
17. Protein Information Resource (PIR or CODATA)
18. Zuker for RNA analysis (input only)
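READSEQ itself is a stand-alone program, but the same job can be scripted. Assuming Biopython is installed, Bio.SeqIO.convert translates between many of the formats in the list above; the file names here are placeholders.

    from Bio import SeqIO

    # GenBank -> FASTA, one of the most common conversions in practice
    count = SeqIO.convert("entry.gb", "genbank", "entry.fasta", "fasta")
    print(count, "record(s) converted")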
GCG Programs for Conversion of Sequence Formats
The "FROM" programs convert sequence files from GCG format into the named format, and the "TO" programs convert the alternative format into GCG format. Shown below are the actual program names, with no spaces included. There are no programs that convert to GenBank or EMBL formats.

FROMEMBL
FROMFASTA
FROMGENBANK
FROMIG
FROMPIR
FROMSTADEN
TOFASTA
TOIG
TOPIR
TOSTADEN

In addition, the GCG package includes the following sequence formatting programs: (1) GETSEQ, which converts a simple ASCII file being received from a remote PC to GCG format; (2) REFORMAT, which will format a GCG file that has been edited, and will also perform other functions; and (3) SPEW, which sends a GCG sequence file as an ASCII file to a remote PC.

3.12 Multiple Sequence Formats
Most of the sequence formats listed above can be used to store multiple sequences in tandem in the same computer file. Exceptions are the GCG and raw sequence formats, which are designed only for single sequences. GCG has an alternative multiple sequence format, which is described below. In addition, there are formats especially designed for multiple sequences that can also be used to show their alignments or to perform types of multiple sequence analysis such as phylogenetic analysis. In the case of PAUP, the program will accept MSA format and convert it to the NEXUS format. These formats are illustrated below using the same short sequences, given first in FASTA format. In the aligned representation, the aligned sequence characters occupy the same line and column, and gaps are indicated by a dash.

>gi|730305|
MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFL
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
>gi|404390|
----------------------APEAQVSVQPNFQPDKFL
RTQTPRAELKEKFTAFCKAQGFTEDSIVFLPQTDKCMTEQ
>gi|895868
MAALRMLWMGLVLLGLLGFPQTPAQGHDTVQPNFQQDKFL
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE

This represents the same alignment as:

MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFL
----------------------APEAQVSVQPNFQPDKFL
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE

Storage of Information in a Sequence Database
As shown by the above examples, each DNA or protein sequence database entry carries much information, including assigned accession number(s); source organism; name of locus; reference(s); keywords that apply to the sequence; features in the sequence such as coding regions, intron splice sites, and mutations; and, finally, the sequence itself. This information is organized into a tabular form very much like that found in a relational database. If one imagines a large table with each sequence entry occupying one row, then each column will include one of the above types of information for each sequence; each column is called a FIELD, and the last column contains the sequences themselves (a small sketch of such a table follows at the end of this section). It is very easy to make an index of the information in each of these fields so that a search query can locate all the occurrences through the index. Even related sequences are cross-referenced. In addition, the information in one database can be cross-referenced to that in another database. The DNA, protein, and reference databases have all been cross-referenced so that moving between them is readily accomplished.
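The row/column picture above maps directly onto an ordinary relational table. Here is a minimal sketch using Python's built-in sqlite3 module; the table layout is invented for illustration, and real sequence databases are far richer.

    import sqlite3

    con = sqlite3.connect("sequences.db")
    # One row per sequence entry; each column is one FIELD of the entry.
    con.execute("""CREATE TABLE IF NOT EXISTS entry (
                       accession TEXT PRIMARY KEY,
                       organism  TEXT,
                       keywords  TEXT,
                       sequence  TEXT)""")
    # An index on a field lets a query locate all occurrences quickly.
    con.execute("CREATE INDEX IF NOT EXISTS idx_organism ON entry (organism)")
    con.execute("INSERT OR REPLACE INTO entry VALUES (?, ?, ?, ?)",
                ("D50617", "Saccharomyces cerevisiae", "chromosome VI", "ATG..."))
    con.commit()

    for (accession,) in con.execute(
            "SELECT accession FROM entry WHERE organism = ?",
            ("Saccharomyces cerevisiae",)):
        print(accession)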
Database Types
There are several types of databases; the two principal types are relational and object-oriented databases. The relational database orders data in tables made up of rows giving specific items in the database, and columns giving the features as attributes of those items. These tables are carefully indexed and cross-referenced with each other, sometimes using additional tables, so that each item in the database has a unique set of identifying features. A relational model for the GenBank sequence database has been devised at the National Center for Genome Resources (http://www.ncgr.org/research/sequence/schema.html).

The object-oriented database structure has been useful in the development of biological databases. The objects, such as genetic maps, genes, or proteins, each have an associated set of utilities for analysis and display of the object, and a set of attributes such as an identifying name or references. In developing the database, relationships among these objects are identified. To standardize some commonly arising objects in biological databases, e.g., maps, the Object Management Group (http://www.omg.org) has formed a Life Science Research Group, a consortium of commercial companies, academic institutions, and software vendors that is trying to establish standards for displaying biological information from Bioinformatics and genomics analysis (http://www.omg.org/homepages/lsr). The Common Object Request Broker Architecture (CORBA) is the Object Management Group's interface for objects that allows different computer applications to communicate with each other through a common language called the Interface Definition Language (IDL). To plan an object-oriented database by defining the classes of objects and the relationships among them, a specific set of procedures called the Unified Modeling Language (UML) has been devised by the OMG.

DNA sequence analysis software packages often include sequence databases that are updated regularly. The organizations that manage sequence databases also provide public access through the Internet. Using a browser such as Netscape Navigator or Internet Explorer on a personal computer, these sites may be visited through the Internet and a form filled in with the sequence name. Once the correct sequence has been identified, the sequence is delivered to the browser and may be saved as a local computer file, cut and pasted from the browser window to another window of an analysis program or editor, or even pasted into another browser page for analysis on another web site. A useful feature of browser programs for sequence analysis is the ability to have more than one browser window running at a time: one browser window may retrieve sequences from a database while another analyzes them. At the time of retrieving the sequence, several sequence formats may be available. The FASTA format, which is readily converted into other formats and is also smaller and simpler, containing just a line of sequence identifiers followed by the sequence without numbers, is very useful for this purpose (the sketch below shows the same retrieval done directly over HTTP).
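The same retrieval a browser form performs can be scripted over HTTP. The sketch below uses NCBI's E-utilities URL scheme to save the FASTA version of accession D50617 as a local file; the exact URL parameters are an assumption that should be checked against current NCBI documentation.

    from urllib.request import urlopen

    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=nucleotide&id=D50617&rettype=fasta&retmode=text")

    with urlopen(url) as response, open("D50617.fasta", "wb") as local_file:
        local_file.write(response.read())    # saved as a local computer file

    print(open("D50617.fasta").readline())   # the ">" comment line of the record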
Using the Database Access Program Entrez
One straightforward way to access the sequence databases is through ENTREZ, a resource prepared by the staff of the National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, and available through their web site at http://ncbi.nlm.nih.gov/Entrez. ENTREZ provides a series of forms that can be filled in to retrieve a DNA or protein sequence, or a Medline reference related to the molecular biology sequence databases. After a search for either a protein or a DNA sequence is chosen at the above address, another web page provides a form to be filled in for the search. On the ENTREZ form, make a selection in the data entry window after the term "Search", then enter search terms in the longer data entry window after "for". The database will be searched for sequence database entries that contain all of these terms or related ones. Using Boolean logic, the search looks for database entries that include the first term AND the second term, and so on for each subsequent term through the last.

The "Limits" link on the ENTREZ form page is used to limit the GenBank field to be searched, and various logical combinations of search terms may be designed by this method. These fields refer to the GenBank fields. When searching for terms in a particular field, some knowledge of the terms that are in the database can be helpful. To assist in finding suitable terms, ENTREZ provides a list of index entries for each field. For a protein search, for example, current choices for fields include accession (number), all fields, author name, E.C. number, issue, journal name, keyword, modification date, organism, page number, primary accession (number), properties, protein name, publication date (of reference), SeqID string, sequence length, substance name, text word, title word, volume, and sequence ID. Similar fields are available for the DNA database search. Later, the results of searches in separate fields may be combined to narrow down the choices.

The number of terms to be searched for and the fields to be searched are the main decisions to be made. In doing so, keep in mind that it is important to be as specific as possible, or else there may be a great many matches. Thus, knowing the accession number, protein name, or name of the gene should be enough to find the required entry quickly. If the same protein has been sequenced in several organisms, providing an organism name is also helpful. When the chosen search terms and fields have been decided and submitted, a database comprising all of the currently available sequences (called the nonredundant, or NR, database) will be searched. Other database selections may also be made. The program returns the number of matches found and provides an opportunity to narrow this list by including more terms. When the number of matching sequences has been narrowed to a reasonable number, the sequences may be retrieved in a chosen format in several straightforward steps. It is important to look through the sequences to locate the one intended. There may be several different copies of the sequence because it may have been sequenced from more than one organism, or the sequence may be a mutant sequence, a particular clone, or a fragment. There is no simple way to find the correct sequence without manually checking the information provided in each sequence, but this usually takes only a short time.
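Assuming Biopython is installed, the same fielded, Boolean searches can be scripted through Bio.Entrez. The query below combines a protein-name field with an organism field, mirroring the form-based search described above; the e-mail address is a placeholder, and the field syntax is an assumption to verify against NCBI's help pages.

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # placeholder contact address

    # Two fielded terms joined with AND, as in the ENTREZ "Limits" form
    handle = Entrez.esearch(
        db="protein",
        term="cystic fibrosis[Protein Name] AND human[Organism]",
        retmax=10)
    result = Entrez.read(handle)
    handle.close()

    print(result["Count"], "matches; first IDs:", result["IdList"])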
Before leaving ENTREZ, it is often useful to check for sequence database entries that are similar to the one of interest, called "neighbors" by ENTREZ. This expanded query searches for other database entries of interest, such as the same protein in another organism, a large chromosomal sequence that includes the gene, or members of the same gene family. While visiting the site, note that ENTREZ has been adapted to search through a number of other biological databases, and also through Medline; these searches are available from the initial ENTREZ web page.

Retrieving a Specific Sequence
Even when strictly following the above instructions, it may be difficult to retrieve the sequence of a specific gene or protein, simply because of the sheer number of sequences in the GenBank database and the complex problem of indexing them. For projects that require the most currently available sequences, the NR databases should be searched. Other projects may benefit from the availability of better curated and annotated protein sequence databases, including PIR and SwissProt. The genomic databases can also provide the sequence of a particular gene or protein. Protein sequences in the GenPept database are generated by automatic translation of DNA sequences. When read from cDNA copies of mRNA sequences, they provide a reliable sequence, given a certain amount of uncertainty as to the translational start site. Many protein sequences are now predicted by translation of genomic sequences, which requires a prediction of exons, a somewhat error-prone step described later in this material. The origin of a protein sequence entry thus needs to be determined, and if it is not from a cDNA sequence, it may be necessary to obtain and sequence a cDNA copy of the gene.

3.13 Let us sum up:
One major difficulty encountered in running sequence analysis software is the use of differing sequence formats by different programs. These formats are all standard ASCII files, but they may differ in the presence of certain characters and words that indicate where different types of information and the sequence itself are to be found. The various biological data formats were discussed in this lesson.

3.14 Lesson end activities:
i. Visit the NCBI web site and write details of the data formats available there.
ii. Give details of the data formats available at SwissProt.
iii. Mention a few data formats available for multiple sequences.

3.15 Check your progress: Model answers
1. Your answer must include these points: The first line includes an initial ">" character followed by a two-letter code, such as P for complete sequence or F for fragment, followed by a 1 or 2 to indicate the type of sequence, then a semicolon, then a four- to six-character unique name for the entry.

3.16 Points for Discussion
1. Elucidate the highlighting features of the FASTA format.
2. How do you rate the GCG format when compared to the other formats? Elaborate your discussion.

3.17 References:
1. Blattner F.R., Plunkett III G., Bloch C.A., Perna N.T., Burland V., Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F., Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J., Mau B., and Shao Y. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453–1474.
2. Bowie J.U., Luthy R., and Eisenberg D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
3. C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012–2018.
4. Cherry J.M. and Cartinhour S.W. 1993. ACEDB, a tool for biological information. In Automated DNA sequencing and analysis (ed. M. Adams et al.). Academic Press, New York.
5. Cherry J.M., Ball C., Weng S., Juvik G., Schmidt R., Adler C., Dunn B., Dwight S., Riles L., Mortimer R.K., and Botstein D. 1997. Genetic and physical maps of Saccharomyces cerevisiae. Nature (suppl. 6632) 387: 67–73.
6. Chothia C. 1992. Proteins. One thousand families for the molecular biologist. Nature 357: 543–544.
7. Chou P.Y. and Fasman G.D. 1978. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 47: 45–147.

LESSON – 4: APPLICATIONS OF BIOINFORMATICS

4.0 Aims and Objectives
4.1 Applications of Bioinformatics in Various Fields
4.1.1 Sequence analysis
4.1.2 Genome annotation
4.1.3 Computational evolutionary biology
4.1.4 Measuring biodiversity
4.1.5 Analysis of gene expression
4.1.6 Analysis of regulation
4.1.7 Analysis of protein expression
4.1.8 Analysis of mutations in cancer
4.1.9 Prediction of protein structure
4.2 Let us sum up
4.3 Lesson end activities
4.4 Check your progress
4.5 Points for Discussion
4.6 References

4.0 Aims and Objectives:
This lesson describes the applications of Bioinformatics in various fields: genome annotation, measuring biodiversity, analysis of gene expression, and prediction of protein structure.

4.1 Applications of Bioinformatics in Various Fields

4.1.1 Sequence analysis
Since the phage ΦX174 was sequenced in 1977, the DNA sequences of hundreds of organisms have been decoded and stored in databases. The information is analyzed to determine the genes that encode polypeptides, as well as regulatory sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it is quite impractical to analyze DNA sequences manually. Today, computer programs are used to search the genomes of thousands of organisms, containing billions of nucleotides. These programs compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, in order to identify sequences that are related but not identical. A variant of this sequence alignment is used in the sequencing process itself. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600-800 nucleotides long). The ends of these fragments overlap and, when aligned in the right way, make up the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. In the case of the Human Genome Project, it took several months of CPU time (on a circa-2000 vintage DEC Alpha computer) to assemble the fragments.
Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of Bioinformatics research.

Another aspect of Bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. Not all of the nucleotides within a genome are genes. Within the genomes of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequences for protein identification.

4.1.2 Genome annotation

In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNAs, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving.

4.1.3 Computational Evolutionary Biology

Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways. It has enabled researchers to:

* trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone;
* more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, lateral gene transfer, and the prediction of factors important in bacterial speciation;
* build complex computational models of populations to predict the outcome of the system over time;
* track and share information on an increasingly large number of species and organisms.

Future work endeavours to reconstruct the increasingly complex tree of life. The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are unrelated.

4.1.4 Measuring Biodiversity

Biodiversity of an ecosystem might be defined as the total genomic complement of a particular environment, from all of the species present, whether it is a biofilm in an abandoned mine, a drop of sea water, a scoop of soil, or the entire biosphere of the planet Earth. Databases are used to collect the species' names, descriptions, distributions, genetic information, status and size of populations, habitat needs, and how each organism interacts with other species. Specialized software programs are used to find, visualize, and analyze the information, and most importantly, to communicate it to other people. Computer simulations model such things as population dynamics, or calculate the cumulative genetic health of a breeding pool (in agriculture) or endangered population (in conservation).
One very exciting potential of this field is that entire DNA sequences, or genomes, of endangered species can be preserved, allowing the results of Nature's genetic experiment to be recorded in silico, and possibly reused in the future, even if that species is eventually lost.

4.1.5 Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple techniques, including microarrays, expressed sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), and various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.

4.1.6 Analysis of regulation

Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-celled organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.

4.1.7 Analysis of protein expression

Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, while the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple but incomplete peptides from each protein are detected.

4.1.8 Analysis of mutations in cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms.
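As a deliberately simplified illustration of this comparison step, the following Python sketch scans a tumor-derived sequence against a reference and reports point mutations that are not known germline polymorphisms. The sequences and the SNP position are invented for illustration; real pipelines work on aligned sequencing reads with quality scores and far richer variant annotations.

# Naive point-mutation comparison: report positions where the tumor
# sequence differs from the reference, unless the observed base is a
# known inherited (germline) variant at that position.

def somatic_point_mutations(reference, tumor, germline_snps):
    """Return (position, ref_base, tumor_base) for candidate somatic SNVs.

    germline_snps maps position -> set of bases known as inherited variants.
    Assumes the two sequences are already aligned and of equal length.
    """
    calls = []
    for pos, (r, t) in enumerate(zip(reference, tumor)):
        if r != t and t not in germline_snps.get(pos, set()):
            calls.append((pos, r, t))
    return calls

reference = "ACGTACGTAC"
tumor     = "ACGAACGTTC"
germline  = {3: {"A"}}   # the A at position 3 is a known inherited variant
print(somatic_point_mutations(reference, tumor, germline))
# -> [(8, 'A', 'T')]; the difference at position 3 is filtered as germline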
New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high throughput to measure thousands of samples, generate terabytes of data per experiment. Again, the massive amounts and new types of data generate new opportunities for Bioinformaticians. The data are often found to contain considerable variability, or noise, and thus hidden Markov model and change-point analysis methods are being developed to infer real copy number changes. Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent across many tumors.

4.1.9 Prediction of protein structure

Protein structure prediction is another important application of Bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the prion responsible for bovine spongiform encephalopathy, also known as Mad Cow Disease.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem. As of now, most efforts have been directed towards heuristics that work most of the time.

One of the key ideas in Bioinformatics is the notion of homology. In the genomics branch of Bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of Bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably. One example of this is the homology between haemoglobin in humans and haemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near-identical purposes.

Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.

Check your progress:

1. Explain how Bioinformatics has assisted evolutionary biologists in their research.

Notes: g) Write your answer in the space given below. h) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Comparative Genomics

The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models. Many of these studies are based on homology detection and the computation of protein families.

Modeling biological systems

Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.

High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:

* high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology)
* morphometrics
* clinical image analysis and visualization
* determining the real-time air-flow patterns in the breathing lungs of living animals
* quantifying occlusion size in real-time imagery from the development of, and recovery during, arterial injury
* making behavioral observations from extended video recordings of laboratory animals
* infra-red measurements for metabolic activity determination

Protein-Protein docking

In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR).
One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions based only on these 3D shapes, without doing protein-protein interaction experiments. A variety of methods have been developed to tackle the protein-protein docking problem, though it seems that there is still much room for further work in this field.

4.2 Let us sum up

Bioinformatics is the use of IT in biotechnology for data storage, data warehousing and the analysis of DNA sequences. Bioinformatics requires knowledge of many branches of science: biology, mathematics, computer science, the laws of physics and chemistry, and of course a sound knowledge of IT to analyze biotech data. Bioinformatics is not limited to computing data; in reality it can be used to solve many biological problems and to find out how living things work.

4.3 Lesson end activities

1. Mention the other fields where Bioinformatics is used, other than those mentioned in this lesson.

4.4 Check your progress: Model answers

1. Your answer may include these points:
* trace the evolution of a large number of organisms
* compare entire genomes
* build complex computational models of populations to predict the outcome of the system over time
* track and share information on an increasingly large number of species and organisms.

4.5 Points for Discussion

1. Bioinformatics will be ruling all the sciences in the near future - Comment on this statement.
2. Do you agree that Bioinformatics will be able to provide a wealth of information to researchers in all the different areas mentioned in this lesson?

4.6 References

1. Dayhoff M.O., Ed. 1972. Atlas of protein sequence and structure, vol. 5. National Biomedical Research Foundation, Georgetown University, Washington, D.C. ———. 1978. Survey of new data and computer methods of analysis. In Atlas of protein sequence and structure, vol. 5, suppl. 2. National Biomedical Research Foundation, Georgetown University, Washington, D.C.
2. Doolittle R.F., Hunkapiller M.W., Hood L.E., Devare S.G., Robbins K.C., Aaronson S.A., and Antoniades H.N. 1983. Simian sarcoma onc gene v-sis is derived from the gene (or genes) encoding a platelet-derived growth factor. Science 221: 275–277.
3. Eddy S.R., Mitchison G., and Durbin R. 1995. Maximum discrimination hidden Markov models of sequence consensus. J. Comput. Biol. 2: 9–23.
4. Ewing B. and Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186–194.
5. Ewing B., Hillier L., Wendl M.C., and Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175–185.
6. Felsenstein J. 1988. Phylogenies from molecular sequences: Inferences and reliability. Annu. Rev. Genet. 22: 521–565.
7. Fitch W.M. and Margoliash E. 1967. Construction of phylogenetic trees. Science 155: 279–284.
8. Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F., Kerlavage A.R., Bult C.J., Tomb J.F., Dougherty B.A., Merrick J.M., et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
9. Garnier J., Osguthorpe D.J., and Robson B. 1978.
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97–120.

LESSON – 5 STRUCTURE OF PROTEINS

5.0 Aims and Objectives
5.1 Single-letter code of amino acids
5.2 Functions in proteins
5.3 Non-protein functions
5.4 General structure
5.5 Peptide bond formation
5.6 Hydrophilic and hydrophobic amino acids
5.7 Entrez and SRS
5.8 Let us Sum up
5.9 Lesson end activities
5.10 Check your progress
5.11 Points for Discussion
5.12 References

5.0 Aims and Objectives:

This unit describes the single-letter codes of amino acids, the functions of proteins, the general structure of amino acids, isomerism, and peptide bond formation.

5.1 Single-letter Code of Amino Acids

Alpha-amino acids are the building blocks of proteins. A protein forms via the condensation of amino acids to form a chain of amino acid "residues" linked by peptide bonds. Proteins are defined by their unique sequence of amino acid residues; this sequence is the primary structure of the protein. Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked in varying sequences to form a huge variety of proteins.

Twenty standard amino acids are used by cells in protein biosynthesis, and these are specified by the genetic code. These twenty amino acids are biosynthesized from other molecules, but organisms differ in which ones they can synthesize and which ones must be provided in their diet. The ones that cannot be synthesized by an organism are called essential amino acids.

5.2 Functions in proteins

A polypeptide is a chain of amino acids. Amino acids are the basic structural building units of proteins. They form short polymer chains called peptides, or longer chains called either polypeptides or proteins. The process of such formation from an mRNA template is known as translation, which is part of protein biosynthesis. Twenty amino acids are encoded by the standard genetic code and are called proteinogenic or standard amino acids. Other amino acids contained in proteins are usually formed by posttranslational modification, which is modification after translation in protein synthesis. These modifications are often essential for the function or regulation of a protein; for example, the carboxylation of glutamate allows for better binding of calcium cations, and the hydroxylation of proline is critical for maintaining connective tissues and responding to oxygen starvation. Such modifications can also determine the localization of the protein, e.g., the addition of long hydrophobic groups can cause a protein to bind to a phospholipid membrane.

5.3 Non-protein functions

The twenty standard amino acids are either used to synthesize proteins and other biomolecules, or oxidized to urea and carbon dioxide as a source of energy. The oxidation pathway starts with the removal of the amino group by a transaminase; the amino group is then fed into the urea cycle. The other product of transamination is a keto acid that enters the citric acid cycle. Glucogenic amino acids can also be converted into glucose through gluconeogenesis.

Hundreds of types of non-protein amino acids have been found in nature and they have multiple functions in living organisms. Microorganisms and plants can produce uncommon amino acids.
In microbes, examples include 2-aminoisobutyric acid and lanthionine, which is a sulfide-bridged alanine dimer; both of these amino acids are found in peptidic lantibiotics such as alamethicin. In plants, 1-aminocyclopropane-1-carboxylic acid is a small disubstituted cyclic amino acid that is a key intermediate in the production of the plant hormone ethylene.

In humans, non-protein amino acids also have biologically important roles. Glycine, gamma-aminobutyric acid and glutamate are neurotransmitters, and many amino acids are used to synthesize other molecules. For example:

* Tryptophan is a precursor of the neurotransmitter serotonin
* Glycine is a precursor of porphyrins such as heme
* Arginine is a precursor of nitric oxide
* Carnitine is used in lipid transport within a cell
* Ornithine and S-adenosylmethionine are precursors of polyamines
* Homocysteine is an intermediate in S-adenosylmethionine recycling

Also present are hydroxyproline, hydroxylysine, and sarcosine. The thyroid hormones are also alpha-amino acids. Some amino acids have even been detected in meteorites, especially in a type known as carbonaceous chondrites. This observation has prompted the suggestion that life may have arrived on earth from an extraterrestrial source.

5.4 General structure

Fig 5. The general structure of an α-amino acid, with the amino group on the left and the carboxyl group on the right.

In the structure shown in Fig 5, R represents a side chain specific to each amino acid. The central carbon atom, called Cα, is a chiral carbon atom (with the exception of glycine) to which the two termini and the R-group are attached. Amino acids are usually classified by the properties of the side chain into four groups: the side chain can make an amino acid behave like a weak acid, a weak base, a hydrophile (if it is polar), or a hydrophobe (if it is nonpolar). The chemical structures of the 20 standard amino acids, along with their chemical properties, are catalogued in the list of standard amino acids. The phrase "branched-chain amino acids" or BCAA is sometimes used to refer to the amino acids having aliphatic side chains that are non-linear; these are leucine, isoleucine and valine. Proline is the only proteinogenic amino acid whose side group links to the α-amino group, and thus is also the only proteinogenic amino acid containing a secondary amine at this position. Proline has sometimes been termed an imino acid, but this is not correct in the current nomenclature.

Isomerism

Most amino acids can exist in either of two optical isomers, called D and L. The L-amino acids represent the vast majority of amino acids found in proteins. D-amino acids are found in some proteins produced by exotic sea-dwelling organisms, such as cone snails. They are also abundant components of the peptidoglycan cell walls of bacteria.

The L and D conventions for amino acid configuration refer not to the optical activity of the amino acid itself, but rather to the optical activity of the isomer of glyceraldehyde having the same stereochemistry as the amino acid. S-glyceraldehyde is levorotatory, and R-glyceraldehyde is dextrorotatory, and so S-amino acids are called L even if they are not levorotatory, and R-amino acids are likewise called D even if they are not dextrorotatory. There are two exceptions to these general rules of amino acid isomerism.
First, in glycine, where R = H, no isomerism is possible because the alpha-carbon bears two identical groups (hydrogen). Second, in cysteine, the L = S and D = R assignment is reversed to L = R and D = S. Cysteine is structured similarly (with respect to glyceraldehyde) to the other amino acids, but the sulfur atom alters the interpretation of the Cahn-Ingold-Prelog priority rule.

Reactions

As amino acids have both a primary amine group and a primary carboxyl group, these chemicals can undergo most of the reactions associated with these functional groups. These include nucleophilic addition, amide bond formation and imine formation for the amine group, and esterification, amide bond formation and decarboxylation for the carboxylic acid group. The multiple side chains of amino acids can also undergo chemical reactions. The types of these reactions are determined by the groups on these side chains and are discussed in the articles dealing with each specific type of amino acid.

5.5 Peptide bond formation

As both the amine and carboxylic acid groups of amino acids can react to form amide bonds, one amino acid molecule can react with another and become joined through an amide linkage. This polymerization of amino acids is what creates proteins. This condensation reaction yields the newly formed peptide bond and a molecule of water. In cells, this reaction does not occur directly; instead the amino acid is first activated by attachment to a transfer RNA molecule through an ester bond. This aminoacyl-tRNA is produced in an ATP-dependent reaction carried out by an aminoacyl-tRNA synthetase. The aminoacyl-tRNA is then a substrate for the ribosome, which catalyzes the attack of the amino group of the elongating protein chain on the ester bond. As a result of this mechanism, all proteins are synthesized starting at their N-terminus and moving towards their C-terminus.

However, not all peptide bonds are formed in this way. In a few cases, peptides are synthesized by specific enzymes. For example, the tripeptide glutathione is an essential part of the defenses of cells against oxidative stress. This peptide is synthesized in two steps from free amino acids. In the first step, gamma-glutamylcysteine synthetase condenses cysteine and glutamic acid through a peptide bond formed between the side-chain carboxyl of the glutamate (the gamma carbon of this side chain) and the amino group of the cysteine. This dipeptide is then condensed with glycine by glutathione synthetase to form glutathione.

In chemistry, peptides are synthesized by a variety of reactions. One of the most used is solid-phase peptide synthesis, which uses the aromatic oxime derivatives of amino acids as activated units. These are added in sequence onto the growing peptide chain, which is attached to a solid resin support.

As amino acids have both the active groups of an amine and a carboxylic acid, they can be considered both acid and base (though their behaviour in solution is influenced by the R group). At a certain pH, known as the isoelectric point, the amine group gains a positive charge (is protonated) and the acid group a negative charge (is deprotonated). The exact value is specific to each different amino acid. This ion is known as a zwitterion, which comes from the German word Zwitter, meaning "hybrid".
A zwitterion can be extracted from the solution as a white crystalline structure with a very high melting point, due to its dipolar nature. Near-neutral physiological pH allows most free amino acids to exist as zwitterions.

5.6 Hydrophilic and hydrophobic amino acids

Depending on the polarity of the side chain, amino acids vary in their hydrophilic or hydrophobic character. These properties are important in protein structure and protein-protein interactions. The importance of the physical properties of the side chains comes from the influence they have on the amino acid residues' interactions with other structures, both within a single protein and between proteins. The distribution of hydrophilic and hydrophobic amino acids determines the tertiary structure of the protein, and their physical location on the outside structure of the protein influences its quaternary structure. For example, soluble proteins have surfaces rich with polar amino acids like serine and threonine, while integral membrane proteins tend to have an outer ring of hydrophobic amino acids that anchors them into the lipid bilayer, and proteins anchored to the membrane have a hydrophobic end that locks into the membrane. Similarly, proteins that have to bind to positively charged molecules have surfaces rich with negatively charged amino acids like glutamate and aspartate, while proteins binding to negatively charged molecules have surfaces rich with positively charged chains like lysine and arginine. Recently a new scale of hydrophobicity based on the free energy of hydrophobic association has been proposed. Hydrophilic and hydrophobic interactions of the proteins do not have to rely only on the side chains of the amino acids themselves. By various posttranslational modifications, other chains can be attached to the proteins, forming hydrophobic lipoproteins or hydrophilic glycoproteins.

The standard amino acids, with their one-letter and three-letter codes, average molar mass (g/mol), isoelectric point (pI), and pKa values of the α-carboxyl (pK1) and α-amino (pK2) groups, are listed below:

Amino acid       Code  Abbr.  Mass       pI     pK1    pK2
Alanine          A     Ala    89.09404   6.01   2.35   9.87
Cysteine         C     Cys    121.15404  5.05   1.92   10.70
Aspartic acid    D     Asp    133.10384  2.85   1.99   9.90
Glutamic acid    E     Glu    147.13074  3.15   2.10   9.47
Phenylalanine    F     Phe    165.19184  5.49   2.20   9.31
Glycine          G     Gly    75.06714   6.06   2.35   9.78
Histidine        H     His    155.15634  7.60   1.80   9.33
Isoleucine       I     Ile    131.17464  6.05   2.32   9.76
Lysine           K     Lys    146.18934  9.60   2.16   9.06
Leucine          L     Leu    131.17464  6.01   2.33   9.74
Methionine       M     Met    149.20784  5.74   2.13   9.28
Asparagine       N     Asn    132.11904  5.41   2.14   8.72
Proline          P     Pro    115.13194  6.30   1.95   10.64
Glutamine        Q     Gln    146.14594  5.65   2.17   9.13
Arginine         R     Arg    174.20274  10.76  1.82   8.99
Serine           S     Ser    105.09344  5.68   2.19   9.21
Threonine        T     Thr    119.12034  5.60   2.09   9.10
Selenocysteine   U     Sec    169.06
Valine           V     Val    117.14784  6.00   2.39   9.74
Tryptophan       W     Trp    204.22844  5.89   2.46   9.41
Tyrosine         Y     Tyr    181.19124  5.64   2.20   9.21

5.7 Entrez and SRS

The Entrez Global Query Cross-Database Search System is a powerful federated search engine, or web portal, that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. NCBI is part of the National Library of Medicine (NLM), itself a department of the National Institutes of Health (NIH) of the United States government. Entrez also happens to be the French word for the second-person plural form of the verb "to enter", meaning literally "come in".
Entrez Global Query is an integrated search and retrieval system that provides access to all of its databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.

The Entrez front page provides, by default, access to the global query. All databases indexed by Entrez can be searched via a single query string, supporting boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page that shows the number of hits for the search in each of the databases, which are also links to the actual search results for that particular database. Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search via a web-forms interface. The History feature gives a numbered list of recently performed queries. Results of previous queries can be referred to by number and combined via boolean operators. Search results can be saved temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely and can also choose to have updates with new search results e-mailed for saved queries of most databases. Entrez is widely used in the field of biotechnology to enhance the knowledge of students worldwide.

Entrez searches the following databases:

* PubMed: biomedical literature citations and abstracts, including Medline (articles from mainly medical journals, often including abstracts; links to PubMed Central and other full-text resources are provided for articles from the 1990s)
* PubMed Central: free, full-text journal articles
* Site Search: NCBI web and FTP sites
* Books: online books
* OMIM: Online Mendelian Inheritance in Man
* OMIA: Online Mendelian Inheritance in Animals
* Nucleotide: sequence database (GenBank)
* Protein: sequence database
* Genome: whole genome sequences and mapping
* Structure: three-dimensional macromolecular structures
* Taxonomy: organisms in GenBank Taxonomy
* SNP: Single Nucleotide Polymorphism
* Gene: gene-centered information
* HomoloGene: eukaryotic homology groups
* PubChem Compound: unique small molecule chemical structures
* PubChem Substance: deposited chemical substance records
* Genome Project: genome project information
* UniGene: gene-oriented clusters of transcript sequences
* CDD: conserved protein domain database
* 3D Domains: domains from Entrez Structure
* UniSTS: markers and mapping data
* PopSet: population study data sets (epidemiology)
* GEO Profiles: expression and molecular abundance profiles
* GEO DataSets: experimental sets of GEO data
* Cancer Chromosomes: cytogenetic databases
* PubChem BioAssay: bioactivity screens of chemical substances
* GENSAT: gene expression atlas of mouse central nervous system
* Probe: sequence-specific reagents
* NLM Catalog: NLM bibliographic data for over 1.2 million journals, books, audiovisuals, computer software, electronic resources, and other materials resident in LocatorPlus (updated every weekday)

Check your progress:

1. What are the single-letter codes for the following amino acids?
Tryptophan, leucine, tyrosine, glutamine, asparagine

Notes: i) Write your answer in the space given below. j) Check your answer with the one given at the end of this lesson.

……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Accessing Entrez and SRS

In addition to using the search engine forms to query the data in Entrez, NCBI provides the Entrez Programming Utilities (eUtils) for more direct access to query results. The eUtils are accessed by posting specially formed URLs to the NCBI server and parsing the XML response, as sketched at the end of this lesson. There is also an eUtils SOAP interface.

5.8 Let us Sum up

Proteins are an important class of biological macromolecules present in all biological organisms, made up of such elements as carbon, hydrogen, nitrogen, oxygen, and sulfur. All proteins are polymers of amino acids. The polymers, also known as polypeptides, consist of a sequence of 20 different L-α-amino acids, also referred to as residues. For chains under 40 residues, the term peptide is frequently used instead of protein. To be able to perform their biological function, proteins fold into one or more specific spatial conformations, driven by a number of noncovalent interactions such as hydrogen bonding, ionic interactions, van der Waals forces and hydrophobic packing. In order to understand the functions of proteins at a molecular level, it is often necessary to determine their three-dimensional structure. This is the topic of the scientific field of structural biology, which employs techniques such as X-ray crystallography and NMR spectroscopy to determine the structure of proteins.

5.9 Lesson end activities

1. Biochemistry refers to four distinct aspects of a protein's structure. Find out those different structures.
2. Find out the nutritional importance of various amino acids.

5.10 Check your progress: Model answers

1. Your answer must include these:
Tryptophan - W
Leucine - L
Tyrosine - Y
Glutamine - Q
Asparagine - N

5.11 Points for Discussion

1. Elaborate on the features of NCBI.
2. Make a comparative study of hydrophilic and hydrophobic amino acids.

5.12 References

1. Gibbs A.J. and McIntyre G.A. 1970. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16: 1–11.
2. Gibrat J.F., Madej T., and Bryant S.H. 1996. Surprising similarity in structure comparison. Curr. Opin. Struct. Biol. 6: 377–385.
3. Gribskov M., McLachlan A.D., and Eisenberg D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355–4358.
4. Henikoff S. and Henikoff J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
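As an illustration of the eUtils access pattern described in section 5.7 (posting a specially formed URL and parsing the XML reply), the following minimal Python sketch queries the esearch utility. The query term is an arbitrary example; for real use, NCBI asks clients to identify themselves (via email/tool parameters) and to respect its rate limits, and the script naturally requires network access.

# Minimal eUtils example: search the protein database and list a few
# matching record identifiers from the XML response.

from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = urlencode({"db": "protein",
                    "term": "hemoglobin AND human[orgn]",
                    "retmax": 5})

with urlopen(f"{base}?{params}") as response:
    xml_reply = response.read()

root = ET.fromstring(xml_reply)
count = root.findtext("Count")             # total number of matches
ids = [e.text for e in root.iter("Id")]    # first few record identifiers
print(count, ids)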
UNIT II

LESSON - 6 SEQUENCE ALIGNMENT

6.0 Aims and Objectives
6.1 Introduction
6.2 Definition of sequence alignment
6.2.1 Global alignment
6.2.2 Local alignment
6.3 Significance of sequence alignment
6.4 Let us Sum up
6.5 Lesson end activities
6.6 Check your progress
6.7 Points for Discussion
6.8 References

6.0 Aims and Objectives

This unit discusses the definition of sequence alignment, global alignment, local alignment, the significance of sequence alignment, an overview of methods of sequence alignment, alignment of pairs of sequences, and multiple sequence alignment.

6.1 Introduction to Sequence Alignment

In 1970, A.J. Gibbs and G.A. McIntyre (1970) described a new method for comparing two amino acid or nucleotide sequences in which a graph was drawn with one sequence written across the page and the other down the left-hand side. Whenever the same letter appeared in both sequences, a dot was placed at the intersection of the corresponding sequence positions on the graph. The resulting graph was then scanned for a series of dots that formed a diagonal, which revealed similarity, or a string of the same characters, between the sequences. Long sequences can also be compared in this manner on a single page by using smaller dots. A minimal illustrative sketch of this method is given below, after the definition of sequence alignment.

The dot matrix method quite readily reveals the presence of insertions or deletions between sequences, because they shift the diagonal horizontally or vertically by the amount of change. Comparing a single sequence to itself can reveal the presence of a repeat of the same sequence in the same (direct repeat) or reverse (inverted repeat or palindrome) orientation. This method of self-comparison can reveal several features, such as similarity between chromosomes, tandem genes, repeated domains in a protein sequence, regions of low sequence complexity where the same characters are often repeated, or self-complementary sequences in RNA that can potentially base-pair to give a double-stranded structure. Because diagonals may not always be apparent on the graph due to weak similarity, Gibbs and McIntyre counted all possible diagonals and compared these counts to those of random sequences to identify the most significant alignments. Maizel and Lenk (1981) later developed various filtering and color display schemes that greatly increased the usefulness of the dot matrix method. This dot matrix representation of sequence comparisons continues to play an important role in the analysis of DNA and protein sequence similarity, as well as of repeats in genes and very long chromosomal sequences.

6.2 Definition of Sequence Alignment

Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and nonidentical characters can either be placed in the same column as a mismatch or opposite a gap in the other sequence. In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register. Sequences that can be readily aligned in this manner are said to be similar.
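As a minimal illustration of the dot matrix method described in section 6.1, the following Python sketch prints an asterisk wherever two sequences share the same character; diagonals of asterisks indicate similar regions. The two test sequences are invented, and real dot matrix programs add window and stringency filtering to suppress background noise.

# A bare-bones dot matrix comparison in the spirit of Gibbs and McIntyre:
# one sequence runs across the page, the other down the left-hand side.

def dot_matrix(seq1, seq2):
    print("  " + " ".join(seq1))
    for ch2 in seq2:
        row = ["*" if ch1 == ch2 else "." for ch1 in seq1]
        print(ch2 + " " + " ".join(row))

dot_matrix("GATTACA", "GATGACA")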
There are two types of sequence alignment, namely global and local. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.

6.2.1 Global Alignment

For two hypothetical protein sequence fragments, a global alignment is stretched over the entire sequence length to include as many matching amino acids as possible, up to and including the sequence ends. Vertical bars between the sequences indicate the presence of identical amino acids. Although there is an obvious region of identity in this example (the sequence GKG preceded by a commonly observed substitution of T for A), a global alignment may not align such regions so that more amino acids along the entire sequence lengths can be matched.

6.2.2 Local Alignment

In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a much higher priority is given to finding these local regions than to extending the alignment to include more neighbouring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favours finding conserved nucleotide patterns, DNA sequences, or amino acid patterns in protein sequences.

Check your progress:

1. Mention the algorithms used for global alignment and local alignment.

Notes: k) Write your answer in the space given below. l) Check your answer with the one given at the end of this lesson.

……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

6.3 SIGNIFICANCE OF SEQUENCE ALIGNMENT

Sequence alignment is useful for discovering functional, structural, and evolutionary information in biological sequences. It is important to obtain the best possible, or so-called "optimal", alignment to discover this information. Sequences that are very much alike, or "similar" in the parlance of sequence analysis, probably have the same function, be it a regulatory role in the case of similar DNA molecules, or a similar biochemical function and three-dimensional structure in the case of proteins. Additionally, if two sequences from different organisms are similar, there may have been a common ancestor sequence, and the sequences are then defined as being homologous. The alignment indicates the changes that could have occurred between the two homologous sequences and a common ancestor sequence during evolution. With the advent of genome analysis and large-scale sequence comparisons, it becomes important to recognize that sequence similarity may be an indicator of several possible types of ancestor relationships, or there may be no ancestor relationship at all.
For example, new gene evolution is often thought to occur by gene duplication, creating two tandem copies of the gene, followed by mutations in these copies. In rare cases, new mutations in one of the copies provide an advantageous change in function. The two copies may then evolve along separate pathways. Although the resulting separation of function will generate two related sequence families, sequences among both families will still be similar due to the single gene ancestor. In addition, genetic rearrangements can reassort domains in proteins, leading to more complex proteins with an evolutionary history that is difficult to reconstruct (Henikoff et al. 1997).

Evolutionary theory provides terms that may be used to describe sequence relationships. Homologous genes that share a common ancestry and function in the absence of any evidence of gene duplication are called orthologs. When there is evidence for gene duplication, the genes in an evolutionary lineage derived from one of the copies and with the same function are also referred to as orthologs. The two copies of the duplicated gene and their progeny in the evolutionary lineage are referred to as paralogs. In other cases, similar regions in sequences may not have a common ancestor but may have arisen independently by two evolutionary pathways converging on the same function, called convergent evolution. There are some remarkable examples in protein structures. For instance, although the enzymes chymotrypsin and subtilisin have totally different three-dimensional structures and folds, the active sites show similar structural features, including histidine (H), serine (S), and aspartic acid (D) in the catalytic sites of the enzymes (for discussion, see Branden and Tooze 1991). Additional examples are given. In such cases, the similarity will be highly localized. Such sequences are referred to as analogous (Fitch 1970).

A closer examination of alignments can help to sort out possible evolutionary origins among similar sequences (Tatusov et al. 1997). As pointed out by Fitch and Smith (1983), sequences can be either homologous or nonhomologous, but not in between. The genetic rearrangements referred to above can give rise to chimeric genes, in which some regions are homologous and others are not. Referring to the entire sequences as homologous in such situations leads to an inaccurate and incomplete description of the sequence lineage. Another complication in tracing the origins of similar sequences is that individual genes may not share the same evolutionary origin as the rest of the genome in which they presently reside. Genetic events such as symbioses and virus-induced transduction can cause horizontal transfer of genetic material between unrelated organisms. In such cases, the evolutionary history of the transferred sequences and that of the organisms will be different. Again, with the capability of detecting such events in the genomes of organisms comes the responsibility to describe these changes with the correct evolutionary terminology. In this case, the sequences are xenologous (Gray and Fitch 1983). Recently, Lawrence and Ochman (1997) have shown that horizontal transfer of genes between species is as common in enteric bacteria as mutation, if not more common. Describing such changes requires a careful description of sequence origins.
6.4 Let us sum up:

Many Bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as traces. When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed, they are said to be aligned. In a symbolic sequence, each base or residue monomer in each sequence is represented by a letter.

6.5 Lesson end activities

1. Find out the methodology for (i) Global Alignment (ii) Local Alignment.

6.6 Check your progress: Model answers

1. Your answer must include these points:
Global Alignment - Needleman-Wunsch algorithm
Local Alignment - Smith-Waterman algorithm

6.7 Points for Discussion

1. "Sequence alignment has made the task of the biological scientist easy" - Comment.
2. How do you rate the local and global alignments?

6.8 References

1. Altschul S.F. 1989. Gap costs for multiple sequence alignment. J. Theor. Biol. 138: 297–309.
2. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Barton G.J. and Sternberg M.J. 1987. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol. 198: 327–337.
4. Boguski M.S. and Schuler G. 1995. ESTablishing a human transcript map. Nature Genet. 10: 369–371.

LESSON – 7 METHODS OF SEQUENCE ALIGNMENT

7.0 Aims and Objectives
7.1 Overview of methods of sequence alignment
7.1.1 Alignment of pairs of sequences
7.1.2 Multiple sequence alignment
7.2 Let us Sum up
7.3 Lesson end activities
7.4 Check your progress
7.5 Points for Discussion
7.6 References

7.0 Aims and Objectives

This unit discusses the methods of sequence alignment, alignment of pairs of sequences, and multiple sequence alignment.

7.1 Overview of Methods of Sequence Alignment

7.1.1 Alignment of Pairs of Sequences

Alignment of two sequences is performed using the following methods:

1. Dot matrix analysis
2. The dynamic programming (or DP) algorithm
3. Word or k-tuple methods, such as those used by the programs FASTA and BLAST, described below

Unless the sequences are known to be very much alike, the dot matrix method should be used first, as this method displays any possible sequence alignments as diagonals on the matrix. Dot matrix analysis can readily reveal the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods. The major limitation of the method is that most dot matrix computer programs do not show an actual alignment.

The dynamic programming method, first used for global alignment of sequences by Needleman and Wunsch (1970) and for local alignment by Smith and Waterman (1981a), provides one or more alignments of the sequences. An alignment is generated by starting at the ends of the two sequences and attempting to match all possible pairs of characters between the sequences, following a scoring scheme for matches, mismatches, and gaps. This procedure generates a matrix of numbers that represents all possible alignments between the sequences. The highest set of sequential scores in the matrix defines an optimal alignment.
For proteins, an amino acid substitution matrix, such as the Dayhoff percent accepted mutation matrix (PAM250) or the blocks substitution matrix (BLOSUM62), is used to score matches and mismatches. Similar matrices are available for aligning DNA sequences. The dynamic programming method is guaranteed in a mathematical sense to provide the optimal (very best or highest-scoring) alignment for a given set of user-defined variables, including the choice of scoring matrix and gap penalties. Fortunately, experience with the dynamic programming method has provided much help for making the best choices, and dynamic programming has become widely used.

The dynamic programming method can also be slow due to the very large number of computational steps, which increase approximately as the square or cube of the sequence lengths. The computer memory requirement also increases as the square of the sequence lengths. Thus, it is difficult to use the method for very long sequences. Fortunately, computer scientists have greatly reduced the time and space requirements to near-linear relationships without compromising the reliability of the dynamic programming method, and these improvements are widely used in the available dynamic programming applications to sequence alignment. Other shortcuts have been developed to speed up the early phases of finding an alignment.

The word or k-tuple methods are used by the FASTA and BLAST algorithms. They align two sequences very quickly, by first searching for identical short stretches of sequence (called words or k-tuples) and by then joining these words into an alignment by the dynamic programming method. These methods are fast enough to be suitable for searching an entire database for the sequences that align best with an input test sequence. The FASTA and BLAST methods are heuristic, i.e., they follow an empirical method of computer programming in which rules of thumb are used to find solutions and feedback is used to improve performance. However, these methods are reliable in a statistical sense, and usually provide a reliable alignment.

Dynamic Programming

The following is an example of global sequence alignment using the Needleman-Wunsch technique. For this example, the two sequences to be globally aligned are:

G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

So M = 11 and N = 7 (the lengths of sequence #1 and sequence #2, respectively). A simple scoring scheme is assumed where

· S(i,j) = 1 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise
· S(i,j) = 0 (mismatch score)
· w = 0 (gap penalty)

There are three steps in dynamic programming:

1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)

Initialization Step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows, where M and N correspond to the sizes of the sequences to be aligned. Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score M(i,j) for each position in the matrix.
In order to find M(i,j) for any i,j, it is necessary to know the scores for the matrix positions to the left of, above, and diagonal to i,j. In terms of matrix positions, one must know M(i-1,j), M(i,j-1) and M(i-1,j-1). For each position, M(i,j) is defined to be the maximum score at position i,j, i.e.

M(i,j) = MAXIMUM [
M(i-1,j-1) + S(i,j) (match/mismatch in the diagonal),
M(i,j-1) + w (gap in sequence #1),
M(i-1,j) + w (gap in sequence #2) ]

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S(1,1) = 1, and by the assumptions stated at the beginning, w = 0. Thus, M(1,1) = MAX[M(0,0) + 1, M(1,0) + 0, M(0,1) + 0] = MAX[1, 0, 0] = 1. A value of 1 is then placed in position 1,1 of the scoring matrix.

Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1. Take the example of row 1. At column 2, the value is the maximum of 0 (for a mismatch), 0 (for a vertical gap) and 1 (for a horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8. At this point, there is a G in both sequences. Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) and 1 (for a horizontal gap). The value will again be 1. The rest of row 1 and column 1 can be filled with 1 using the above reasoning.

Now look at column 2. The location at row 2 will be assigned the value of the maximum of 1 (mismatch), 1 (horizontal gap) and 1 (vertical gap), so its value is 1. At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap) and 1 (vertical gap), so its value is 2. Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap) and 2 (vertical gap), so its value is 2. Note that for all of the remaining positions except the last one in column 2, the choices for the value will be exactly the same as in row 4, since there are no matches. The final row will contain the value 2, since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap). Using the same techniques as described for column 2, we can fill in column 3. After filling in all of the values, the score matrix is as follows:

        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   1   1   1   1   1   1   1   1   1   1   1
G   0   1   1   1   1   1   1   1   2   2   2   2
A   0   1   2   2   2   2   2   2   2   2   2   3
T   0   1   2   2   3   3   3   3   3   3   3   3
C   0   1   2   2   3   3   4   4   4   4   4   4
G   0   1   2   2   3   3   4   4   5   5   5   5
A   0   1   2   3   3   3   4   5   5   5   5   6

Check your progress:

1. List the 3 steps involved in dynamic programming.

Notes: m) Write your answer in the space given below. n) Check your answer with the one given at the end of this lesson.

……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Traceback Step

After the matrix fill step, the maximum alignment score for the two test sequences is 6. The traceback step determines the actual alignment(s) that result in the maximum score. Note that with a simple scoring algorithm such as the one used here, there are likely to be multiple maximal alignments. The traceback step begins at position M,N in the matrix, i.e. the position that leads to the maximal score. In this case, there is a 6 in that location.
Traceback Step
After the matrix fill step, the maximum alignment score for the two test sequences is 6. The traceback step determines the actual alignment(s) that result in the maximum score. Note that with a simple scoring scheme such as the one used here, there are likely to be multiple maximal alignments. The traceback step begins in position M,N of the matrix, i.e., the position holding the maximal score; in this case, that cell contains a 6. Traceback takes the current cell and looks to the neighbouring cells that could be its direct predecessors: the neighbour to the left (gap in sequence #2), the diagonal neighbour (match/mismatch), and the neighbour above it (gap in sequence #1). The algorithm chooses as the next cell in the sequence one of these possible predecessors. In this example, all three neighbours of the starting cell are equal to 5. Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal (match/mismatch) neighbour. If more than one possible predecessor exists, any can be chosen. This gives a current alignment of

(Seq #1) A
         |
(Seq #2) A

We then look at the new current cell and determine which cell is its direct predecessor. In this case, the predecessor lies to the left, which adds a gap to sequence #2, so the current alignment is

(Seq #1) TA
          |
(Seq #2) _A

Once again, the direct predecessor produces a gap in sequence #2. After this step, the current alignment is

(Seq #1) TTA
           |
(Seq #2) __A

Continuing with the traceback step, we eventually reach position 0,0 of the matrix, which tells us that traceback is complete. One possible maximum alignment is:

GAATTCAGTTA
| | || |  |
GGA_TC_G__A

An alternative solution is:

G_AATTCAGTTA
|  | || |  |
GG_A_TC_G__A

There are further alternative solutions, each resulting in a maximal global alignment score of 6. Because the number of such alignments can grow exponentially, most dynamic programming implementations print out only a single solution.

7.1.2 Multiple Sequence Alignment
From a multiple alignment of three or more protein sequences, the highly conserved residues that define structural and functional domains in protein families can be identified. New members of such families can then be found by searching sequence databases for other sequences with these same domains. Alignment of DNA sequences can assist in finding conserved regulatory patterns in DNA sequences. Despite the great value of multiple sequence alignments, obtaining one presents a very difficult algorithmic problem.

Introduction
For many genes, a database search will reveal a whole set of homologous sequences. One then wishes to learn about the evolution and the sequence conservation in such a group. This question surpasses what can reasonably be achieved by the pairwise sequence comparison methods described in the previous sections. Pairwise comparisons do not readily show positions that are conserved among a whole set of sequences, and tend to miss subtle similarities that become visible only when observed simultaneously across many sequences. Thus one wants to compare several sequences simultaneously. A multiple alignment arranges a set of sequences in a scheme in which positions believed to be homologous are written in a common column. As in a pairwise alignment, when a sequence does not possess an amino acid in a particular position, this is again denoted by a dash. As for pairwise alignments, there are conventions regarding the scoring of a multiple alignment.
In one approach, one simply adds the scores of all the induced pairwise alignments contained in a multiple alignment. For a linear gap penalty, this amounts to scoring each column of the alignment by the sum of the amino acid pair scores in that column. The corresponding score is called the sum of pairs (SP) score; a small sketch of this computation is given at the end of this subsection. Although it would be biologically meaningful, the distinction between global, local, and other forms of alignment is rarely made for multiple alignments. The reason for this will become apparent below, where we describe the computational difficulties in computing multiple alignments. Note that the full set of optimal pairwise alignments among a given set of sequences will generally overdetermine the multiple alignment. If one wishes to assemble a multiple alignment from pairwise alignments, one has to avoid "closing loops"; i.e., one can put together pairwise alignments as long as no new pairwise alignment is added to a sequence that is already part of the multiple alignment. In particular, pairwise alignments can be merged when they align one sequence to all others, when a linear order of the given sequences is maintained, or when the sequence pairs with pairwise alignments form a tree. While all these schemes allow for the ready definition of algorithms that output multiply aligned sequences, they do not include any information stemming from the simultaneous analysis of several sequences. The alternative approach is to generalize the dynamic programming optimization procedure applied to pairwise alignment to the delineation of a multiple alignment that maximizes a score. The algorithm used is a straightforward generalization of the global alignment algorithm presented in the section on algorithms for the comparison of two sequences. This is easy to see, in particular, for the case of a column-oriented scoring function that avoids the affine gap penalty in favor of the simpler linear one. With this scoring, the arrangement of gaps and letters in a column can be represented by a boolean vector indicating which sequences contain a gap in that column. Given the letters that are being compared, one needs to evaluate the scores for all these arrangements. However, the computational complexity of this algorithm is rather forbidding: for n sequences it is proportional to 2^n times the product of the lengths of all the sequences. In practice, this algorithm can only be run for a modest number of sequences. There exists software to compare three sequences with this algorithm that additionally implements a space-saving technique. For more than three sequences, algorithms have been developed that aim at reducing the search space while still optimizing the given scoring function. The most prominent program of this kind is MSA2 [1]. An alternative approach is used by DCA [2], which implements a divide-and-conquer philosophy: the search space is repeatedly subdivided by identifying strongholds for the alignment. For a more detailed description of these concepts, also see the section on algorithms for SP-optimal multiple alignments. None of these approaches, however, works independently of the number of sequences to be aligned. The most common remedy is to reduce the multiple alignment problem to an iterated application of the pairwise alignment algorithm. However, in doing so, one also aims at drawing on the increased amount of information contained in a set of sequences.
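As a concrete illustration of the SP score defined at the start of this discussion, the following is a minimal Python sketch that scores a multiple alignment column by column, summing substitution scores over all induced pairs. The toy scoring function and the example alignment are illustrative assumptions, not taken from the text; a real application would use a PAM or BLOSUM table.

from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy substitution score; a real implementation would use PAM/BLOSUM."""
    if a == '-' and b == '-':
        return 0          # a common convention: gap-gap pairs score 0
    if a == '-' or b == '-':
        return gap        # linear gap penalty, charged per pair
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum-of-pairs score: add the pair scores of every column's residue pairs."""
    total = 0
    for column in zip(*alignment):            # iterate over alignment columns
        for a, b in combinations(column, 2):  # all induced pairwise comparisons
            total += pair_score(a, b)
    return total

# Hypothetical three-sequence alignment (all rows must have equal length).
aln = ["GARFIELD-",
       "GARF-ELDS",
       "GA-FIELDS"]
print(sp_score(aln))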
Instead of simply merging pairwise alignments of sequences, the notion of a profile has been introduced in order to capture the conservation patterns within subgroups of sequences. A profile is essentially a representation of an already computed multiple alignment of a subgroup (see the sketch at the end of this subsection). This alignment is "frozen" for the remaining computation. Other sequences or other profiles can be compared to a given profile based on a generalized scoring scheme defined for this purpose. Two such schemes are in use, one based on average scores and one based on an information theoretic score [3]. Note that profile scoring schemes respect conservation patterns. Given a profile and a single sequence, the two can be aligned using the basic dynamic programming algorithm together with the accompanying scoring scheme. The result is an alignment between the two that can readily be converted into a multiple alignment, now comprising the sequences underlying the profile plus the new one. Likewise, two profiles can be aligned with each other, resulting in a multiple alignment containing all sequences from both profiles. With these tools, various multiple alignment strategies can be implemented. Most commonly, a hierarchical tree is generated for the given sequences, which is then used as a guide for iterative profile construction and alignment. The construction of such a tree is described in the section Phylogenetic Trees and Multiple Alignments. The above alignment strategy was introduced in papers by Taylor, Barton, Corpet, and Higgins. Higgins' program Clustal has meanwhile become the de facto standard for multiple sequence alignment. Another program in use is Dialign, which differs in that it aims at the delineation of regions of similarity among the given sequences. Since the iterative profile alignment tends to be guided by a hierarchical tree, this step of the computation also influences the final result. Usually this tree is computed based on pairwise comparisons and the resulting scores. Subsequently, this score matrix is used as input to a clustering procedure such as single linkage clustering or UPGMA. However, it is well understood that, in an evolutionary sense, such a hierarchic clustering does not necessarily result in a biologically valid tree. Thus, when allowing this tree to determine the multiple alignment, there is the danger of directing further evolutionary analysis of this alignment in the wrong direction. Consequently, the question has arisen of a common formulation of evolutionary reconstruction and multiple sequence alignment. The cleanest, although biologically somewhat simplistic, model attempts to reconstruct ancestral sequences to attribute to the inner nodes of a tree. Such reconstructed sequences at the same time determine the multiple alignment among the sequences. In this generalized tree alignment, one aims at minimizing the sum over the edges of the tree, where each edge is annotated with the alignment distance between the sequences at its incident nodes. As is to be expected, the computational complexity of this problem again makes an exact solution impractical, but approximation algorithms are known. Practical efforts in this direction go back to the work of Sankoff; Jotun Hein, Schwikowski and Vingron, and Tao Jiang produced programs relying on these ideas.
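To make the notion of a profile more concrete, here is a minimal Python sketch that builds a column-wise frequency profile from an already computed alignment of a subgroup and scores a residue against a profile column using a simple average-score scheme. The scoring details are simplified assumptions rather than the exact schemes cited above.

from collections import Counter

def build_profile(alignment):
    """Column-wise relative frequencies of residues (and gaps) in a
    frozen sub-alignment; a simple realization of a 'profile'."""
    n = len(alignment)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*alignment)]

def average_score(profile_col, residue, match=1, mismatch=-1, gap=-2):
    """Average-score scheme: expected pair score of `residue` against a
    residue drawn from the profile column's frequency distribution."""
    total = 0.0
    for res, freq in profile_col.items():
        if res == '-':
            total += freq * gap
        else:
            total += freq * (match if res == residue else mismatch)
    return total

# Hypothetical frozen sub-alignment and a new residue to compare.
profile = build_profile(["GKV-A", "GRV-A", "GKI-S"])
print(average_score(profile[1], 'K'))  # how well K fits column 2

In a progressive strategy, this kind of column scoring is what lets the basic pairwise dynamic programming algorithm align a new sequence against a frozen profile rather than against a single sequence.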
7.2 Let us sum up
This lesson gave a detailed account of dot matrix analysis, the dynamic programming (DP) algorithm, the word or k-tuple methods used by the programs FASTA and BLAST, and multiple sequence alignment.

7.3 Lesson end activities
1. Find out the various software and tools available for pairwise and multiple sequence alignment.

7.4 Check your progress: Model answers
1. Your answer must include these points:
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)

7.5 Points for Discussion
1. The dynamic programming method is the best one for the alignment of pairs of sequences – give your views on this statement.

7.6 References
1. Wang L, Jiang T. (1994). On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.
2. Just W. (2001). Computational complexity of multiple sequence alignment with SP-score. J Comput Biol 8(6):615-23.
3. Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44.
4. Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
5. Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17.

LESSON – 8 DYNAMIC PROGRAMMING
8.0 Aims and Objectives
8.1 Dot matrix sequence comparison
8.1.1 Pair-wise sequence comparison
8.2 Dynamic programming algorithm for sequence alignment
8.2.1 Description of the algorithm
8.2.2 Formal description of the dynamic programming algorithm
8.3 Let us Sum up
8.4 Lesson end activities
8.5 Check your progress
8.6 Points for Discussion
8.7 References

8.0 Aims and Objectives
This unit discusses dot matrix sequence comparison, pair-wise sequence comparison, the dynamic programming algorithm for sequence alignment, a description of the algorithm, and a formal description of the dynamic programming algorithm.

8.1 Dot Matrix Sequence Comparison
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignment of characters between the sequences, first described by Gibbs and McIntyre (1970). The method is also used for finding direct or inverted repeats in protein and DNA sequences, and for predicting regions in RNA that are self-complementary and that, therefore, have the potential of forming secondary structure. Every laboratory that does sequence analysis should have at least one dot matrix program available. In choosing a program, look for as many of the features described below as possible. The dot matrix should be visible on the computer terminal, thus providing an interactive environment in which different types of analyses may be tried. Use of colored dots can enhance the detection of regions of similarity (Maizel and Lenk 1981). Additional descriptions of the dot matrix method have appeared elsewhere (Doolittle 1986; States and Boguski 1991). The examples given below use the dot matrix module of DNA Strider (version 1.3) on a Macintosh computer.
The program DOTTER has interactive features for the UNIX X-Windows environment (Sonnhammer and Durbin 1995; http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html). The Genetics Computer Group programs COMPARE and DOTPLOT also perform a dot matrix analysis. Although not a dot matrix method, the program PLALIGN in the FASTA suite may be used to display the alignments found by the dynamic programming method between two sequences on a graph (http://fasta.bioch.virginia.edu/fasta/fasta_list.html; Pearson 1990). A dot matrix program that may be used with a Web browser is described in Junier and Pagni (2000) (http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html).

8.1.1 Pair-wise Sequence Comparison
The major advantage of the dot matrix method for finding sequence alignments is that all possible matches of residues between two sequences are found, leaving the investigator the choice of identifying the most significant ones. The sequences of the actual regions that align can then be detected by using one of the other methods for performing sequence alignments, e.g., dynamic programming. Those methods are automatic and usually show one best or optimal alignment, even though there may be several different, nearly equivalent alignments. Alignments generated by these programs can be compared to the dot matrix alignment to determine whether the longest regions are being matched and whether insertions and deletions are located in the most reasonable places.

In the dot matrix method of sequence comparison, one sequence (A) is listed across the top of a page and the other sequence (B) is listed down the left side. Starting with the first character in B, one then moves across the page, keeping in the first row and placing a dot in any column where the character in A is the same. The second character in B is then compared to the entire A sequence, and a dot is placed in row 2 wherever a match occurs. This process is continued until the page is filled with dots representing all the possible matches of A characters with B characters. Any region of similar sequence is revealed by a diagonal row of dots. Isolated dots not on the diagonal represent random matches that are probably not related to any significant alignment.

Detection of matching regions may be improved by filtering out random matches in a dot matrix. Filtering is achieved by using a sliding window to compare the two sequences: instead of comparing single sequence positions, a window of adjacent positions in the two sequences is compared at the same time, and a dot is printed on the page only if a certain minimal number of matches occurs (a small sketch of this windowed comparison is given below). The window starts at the positions in A and B to be compared and includes characters in a diagonal line going down and to the right, comparing each pair in turn, as in making an alignment. A larger window size is generally used for DNA sequences than for protein sequences because the number of random matches is much greater, due to the use of only four DNA symbols as compared to 20 amino acid symbols. A typical window size for DNA sequences is 15, and a suitable match requirement in this window is 10. For protein sequences, the matrix is often not filtered, but a window size of 2 or 3 and a match requirement of 2 will highlight matching regions.
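The following is a minimal Python sketch of a filtered dot matrix along the lines just described: a dot is recorded only when, within a window sliding diagonally from each position, at least `stringency` identities occur. The function name and the text-based output are illustrative assumptions, not any particular package's interface.

def dot_matrix(a, b, window=3, stringency=2):
    """Return a boolean matrix: rows follow sequence B, columns sequence A.
    A cell is True when the diagonal window starting there contains at
    least `stringency` identical character pairs."""
    rows, cols = len(b), len(a)
    dots = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Compare pairs along the diagonal going down and to the right.
            matches = sum(1 for k in range(window)
                          if i + k < rows and j + k < cols
                          and b[i + k] == a[j + k])
            dots[i][j] = matches >= stringency
    return dots

# Toy example; real analyses would use window/stringency choices such as
# 15/10 for DNA, as recommended above.
a, b = "HEAGAWGHEE", "PAWHEAE"
for row in dot_matrix(a, b):
    print(''.join('*' if d else '.' for d in row))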
If two proteins are expected to be related but to have long regions of dissimilar sequence with only a small proportion of identities, such as similar active sites, a large window, e.g., 20, and a small stringency, e.g., 5, should be useful for seeing any similarity. Identification of sequence alignments by the dot matrix method can be aided by performing a count of dots in all possible diagonal lines through the matrix to determine statistically which diagonals have the most matches, and by comparing these match scores with the results of random sequence comparisons (Gibbs and McIntyre 1970; Argos 1987).

Consider, as an example, a dot matrix analysis between the DNA sequences that encode the Escherichia coli phage lambda cI and phage P22 c2 repressor proteins. With a window of 1 and a stringency of 1, there is so much noise that no diagonals can be seen, but with a window of 11 and a stringency of 7, diagonals appear in the lower right. The analysis reveals that there are regions of similarity in the 3' ends of the coding regions, which, in turn, suggests similarity in the carboxy-terminal domains of the encoded repressors. Note that sequential diagonals in the filtered matrix do not line up exactly, indicating the presence of extra nucleotides in one sequence (the lambda cI gene). The diagonals in the lower part of the matrix reveal a region of sequence similarity in the carboxy-terminal domains of the proteins; a small insertion in the cI protein, approximately in the middle of this region, shifts the diagonal slightly downward and accounts for this pattern.

A corresponding dot matrix analysis between the amino acid sequences of the same two E. coli phage lambda cI and phage P22 c2 repressor proteins can be filtered with a window of 1 and a stringency of 1. As found with the DNA sequence alignment of the corresponding genes, diagonals in the lower part of the matrix reveal a region of sequence similarity in the carboxy-terminal domains of the proteins. The small insertion in the cI protein approximately in the middle of this region, which shifts the diagonal slightly downward and which is also observed in the DNA alignment of the corresponding genes, is also visible. Note that these windows are much smaller than those required for DNA sequence comparisons, due to the greater number of possible symbols (20 amino acids) and therefore fewer random matches.

In conclusion, for DNA sequence dot matrix comparisons, use long windows and high stringencies, e.g., a stringency of 7 with a window of 11, or a stringency of 11 with a window of 15. For protein sequences, use short windows, e.g., 1 and 1 for window and stringency, respectively, except when looking for a short domain of partial similarity in otherwise dissimilar sequences; in that case, use a longer window and a small stringency, e.g., 15 and 5 for window and stringency, respectively.

There are three types of variations in the analysis of two protein sequences by the dot matrix method. First, chemical similarity of the amino acid R group, or some other feature for distinguishing amino acids, may be used to score similarity. Second, a symbol comparison table such as the PAM250 or BLOSUM62 table may be used (States and Boguski 1991). These tables provide scores for matches based on their occurrence in aligned protein families, and are discussed later.
When these tables are used, a dot is placed in the matrix only if a minimum similarity score is found. These table values may also be used in a sliding window option, which averages the score within the window and prints a dot only above a certain average score. Finally, several different matrices can be made, each with a different scoring system, and the scores can be averaged. This method should be useful for aligning more distantly related proteins. The scores of each possible diagonal through the matrix are then calculated, and the most significant ones are identified and shown on a computer screen (Argos 1987).

8.2 Dynamic Programming Algorithm for Sequence Alignment
Dynamic programming is a computational method that is used to align two protein or nucleic acid sequences. The method is very important for sequence analysis because it provides the very best, or optimal, alignment between sequences. Programs that perform this analysis on sequences are readily available, and there are Web sites that will perform the analysis. However, the method requires the intelligent use of several variables in the program. Thus, it is important to understand how the program works in order to make informed choices for these variables. The method compares every pair of characters in the two sequences and generates an alignment. This alignment will include matched and mismatched characters and gaps in the two sequences that are positioned so that the number of matches between identical or related characters is the maximum possible. The dynamic programming algorithm provides a reliable computational method for aligning DNA and protein sequences. The method has been proven mathematically to produce the best or optimal alignment between two sequences under a given set of match conditions. Optimal alignments provide useful information to biologists concerning sequence relationships by giving the best possible information as to which characters in a sequence should be in the same column in an alignment, and which are insertions in one of the sequences (or deletions in the other). This information is important for making functional, structural, and evolutionary predictions on the basis of sequence alignments.

Both global and local types of alignment may be made by simple changes in the basic dynamic programming algorithm. A global alignment program is based on the Needleman-Wunsch algorithm, and a local alignment program on the Smith-Waterman algorithm, described later (p. 72). The predicted alignment is given a score that reflects the odds of obtaining that score between sequences known to be related, relative to the score obtained by chance alignment of unrelated sequences. There is a method to calculate whether or not an alignment obtained this way is statistically significant: one of the sequences may be scrambled many times, and each randomly generated sequence realigned with the second sequence, to demonstrate that the original alignment is unique (a small sketch of this test follows below). The statistical significance of alignment scores is discussed in detail later (p. 96).
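The scrambling test just described can be sketched in a few lines of Python. The alignment scorer below is a deliberately simple global scorer standing in for whatever scoring scheme is actually in use; everything here, including the default scores, is an illustrative assumption.

import random

def align_score(a, b, match=1, mismatch=0, gap=0):
    """Simple global alignment score by dynamic programming."""
    prev = [j * gap for j in range(len(a) + 1)]   # row 0: leading gaps
    for i, ch_b in enumerate(b, 1):
        cur = [i * gap]                           # column 0: leading gaps
        for j, ch_a in enumerate(a, 1):
            s = match if ch_a == ch_b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]

def scramble_test(a, b, trials=100, seed=0):
    """Realign many shuffled versions of `a` against `b`; report how the
    real score compares with the distribution of random scores."""
    rng = random.Random(seed)
    real = align_score(a, b)
    random_scores = []
    for _ in range(trials):
        shuffled = ''.join(rng.sample(a, len(a)))
        random_scores.append(align_score(shuffled, b))
    better = sum(1 for s in random_scores if s >= real)
    return real, max(random_scores), better / trials

real, best_random, frac = scramble_test("GAATTCAGTTA", "GGATCGA")
print(real, best_random, frac)  # real score vs. best random score and
                                # the fraction of random scores >= real

If only a small fraction of the shuffled sequences reach the real score, the original alignment is unlikely to be a chance result.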
Another feature of the dynamic programming algorithm is that the alignments obtained depend on the choice of a scoring system for comparing character pairs and on penalty scores for gaps. For protein sequences, the simplest system of comparison is one based on identity: a match in an alignment is scored only if the two aligned amino acids are identical. However, one can also examine related protein sequences that can be aligned easily and find which amino acids are commonly substituted for each other. The probability of a substitution between any pair of the 20 amino acids may then be used to produce alignments. Recent improvements and experience with the dynamic programming programs and the scoring systems have greatly simplified their use. These enhancements are discussed below and at http://www.bioinformaticsonline.org. It is important to recognize that several different alignments may provide approximately the same alignment score; i.e., there are alignments almost as good as the highest-scoring one reported by the alignment program. Some programs, e.g., LALIGN, provide several entirely different alignments, with different sequence positions matched, that can be compared to improve confidence in the best-scoring one. Alignment programs have also been greatly improved in algorithmic design and performance. With the advent of faster machines, it is possible to perform a dynamic programming alignment between a query sequence and an entire sequence database and to find the similar sequences in several minutes. Dynamic programming has also been used to perform multiple sequence alignment, but only for a small number of sequences, because the complexity of the calculation increases substantially for more than two sequences. Sequence alignment programs are available as part of most sequence analysis packages, such as the widely used Genetics Computer Group GAP (global alignment) and BESTFIT (local alignment) programs. Sequences can also be pasted into a text area on a guest Web page of a remote host machine that will perform a dynamic programming alignment, and there are also versions of alignment programs that will run on a microcomputer.

In deciding to perform a sequence alignment, it is important to keep the goal of the analysis in mind. Is the investigator interested in trying to find out whether two proteins have similar domains or structural features, whether they are in the same family with a related biological function, or whether they share a common ancestor? The desired objective will influence the way the analysis is done. There are several decisions to be made along the way, including the type of program, whether to produce a global or local alignment, the type of scoring matrix, and the values of the gap penalties to be used. There is a very large number of amino acid scoring matrices in use, some much more popular than others, and these scoring matrices are designed for different purposes. Some, such as the Dayhoff PAM matrices, are based on an evolutionary model of protein change, whereas others, such as the BLOSUM matrices, are designed to identify members of the same family. Alignments between DNA sequences require similar kinds of considerations. It is often worth the effort to try several approaches to find out which choice of scoring system and gap penalty gives the most reasonable result. Fortunately, most alignment programs come with a recommended scoring matrix and gap penalties that are useful for most situations. A more recent development is the simultaneous use of a set of scoring matrices and gap penalties by a method that generates the most probable alignments.
The final choice as to the most believable alignment is up to the investigator, subject to the condition that reasonable decisions have been made regarding the methods used. For sequences that are very similar, e.g., 95% identical or more, the sequence alignment is usually quite obvious, and a computer program may not even be needed to produce the alignment. As the sequences become less similar, the alignment becomes more difficult to produce and one is less confident of the result. For protein sequences, similarity can still be recognized down to a level of approximately 25% amino acid identity. At this level of identity, the relative numbers of mismatched amino acids and gaps in the alignment have to be decided empirically, and a decision made as to which gap penalties work best for a given scoring matrix. Alignment of sequences at this level of identity was called the "twilight zone" of sequence alignment by Doolittle (1981). The alignment program may provide a quite convincing alignment, which suggests that the two sequences are homologous.

Check your progress:
1. What are the three types of variations in the analysis of two protein sequences by the dot matrix method?
Notes: o) Write your answer in the space given below.
p) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

8.2.1 Description of the Algorithm
Alignment of two sequences without allowing gaps requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, as in a dot matrix comparison. If the alignment is to include gaps of any length at any position in either sequence, the number of comparisons that must be made becomes astronomical and is not achievable by direct comparison methods. Dynamic programming is a method of sequence alignment that can take gaps into account while requiring a manageable number of comparisons. To understand how the method works, and why it provides an optimal (highest-scoring) alignment, first recall what is meant by an alignment, using two protein sequences as an example. The two sequences are written across the page, one under the other, the object being to bring as many amino acids as possible into register. In some regions, amino acids in one sequence are placed directly below identical amino acids in the second. In other regions, this may not be possible, and nonidentical amino acids may have to be placed next to each other, or else gaps must be introduced into one of the sequences. Gaps are added to the alignment in a manner that increases the matching of identical or similar amino acids at subsequent positions in the alignment. Ideally, when two similar protein sequences are aligned, the alignment should have long regions of identical or related amino acid pairs and very few gaps. As the sequences become more distant, more mismatched amino acid pairs and gaps should appear. The quality of the alignment between two sequences is calculated using a scoring system that favors the matching of related or identical amino acids and penalizes poorly matched amino acids and gaps.
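Before turning to how such scores are derived, here is a small illustration of applying one. The following Python sketch scores an existing gapped alignment using a toy substitution table and a linear gap penalty; the miniature three-amino-acid table is an invented stand-in for a real matrix such as BLOSUM62.

# Toy log-odds substitution scores for a three-letter alphabet; a real
# application would use the full 20 x 20 BLOSUM62 or PAM250 table.
SUBST = {
    ('A', 'A'): 4, ('L', 'L'): 4, ('S', 'S'): 4,
    ('A', 'L'): -1, ('A', 'S'): 1, ('L', 'S'): -2,
}

def subst_score(x, y):
    return SUBST.get((x, y), SUBST.get((y, x), 0))

def alignment_score(top, bottom, gap_penalty=-11):
    """Score two already-aligned sequences of equal length, where '-'
    marks a gap; each gapped column costs `gap_penalty`."""
    total = 0
    for x, y in zip(top, bottom):
        if x == '-' or y == '-':
            total += gap_penalty   # penalize gapped columns
        else:
            total += subst_score(x, y)
    return total

print(alignment_score("ALSA-LS", "ALSALLS"))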
To decide how to score these regions, information on the types of changes found in related protein sequences is needed. These changes may be expressed by the following probabilities: (1) that a particular amino acid pair is found in alignments of related proteins; (2) that the same amino acid pair is aligned by chance in the sequences, given that some amino acids are abundant in proteins and others rare; and (3) that the insertion of a gap of one or more residues in one of the sequences (equivalent to an insertion of the same length in the other sequence), thus forcing the alignment of each partner of the amino acid pair with another amino acid, would be a better choice. The ratio of the first two probabilities is usually provided in an amino acid substitution matrix. Each table entry gives the ratio of the observed frequency of substitution between each possible amino acid pair in related proteins to the frequency expected by chance, given the frequencies of the amino acids in proteins. These ratios are called odds scores. The ratios are transformed to logarithms of the odds scores, called log odds scores, so that the scores of sequential pairs may be added to reflect the overall odds that an alignment is real rather than due to chance. Examples are the Dayhoff PAM250 and BLOSUM62 substitution matrices described later (p. 76). These matrices contain positive and negative values, reflecting the likelihood of each amino acid substitution in related proteins. Using these tables, an alignment of a sequential set of amino acid pairs with no gaps receives an overall score that is the sum of the positive and negative log odds scores for each individual amino acid pair in the alignment. The higher this score, the more significant the alignment, or the more it resembles alignments in related proteins. The score given for gaps in aligned sequences is negative, because such misaligned regions should be uncommon in sequences of related proteins. Such a score reduces the score obtained from an adjacent, matching region upstream in the sequences. For example, using values from the BLOSUM62 amino acid substitution matrix and a gap penalty score of -11 for a gap of length 1, an alignment whose amino acid pair scores sum to 26 receives an overall score of 26 - 11 = 15. The value of -11 as a penalty for a gap of length 1 is used because this value is already known from experience to favor the alignment of similar regions when the BLOSUM62 comparison matrix is used. The choice of gap penalty is discussed further below, where a table giving suitable choices is presented. As shown in the example, the presence of the gap significantly decreases the overall score of the alignment. Although one may be able to align two short sequences by eye and place the gap where shown, the dynamic programming algorithm will automatically place gaps in much longer sequence alignments so as to achieve the best possible alignment.

The derivation of the dynamic programming algorithm can be illustrated using the above alignment as an example. Consider building this alignment in steps, starting with an initial matching aligned pair of characters from the sequences (V/V) and then sequentially adding a new pair until the alignment is complete, at each stage choosing a pair from all the possible matches that provides the highest score for the alignment up to that point.
If the full alignment finally reached (I) has the highest possible, or optimal, score, then the earlier alignment from which it was derived (A) by addition of the aligned Y/Y pair must also have been optimal up to that point in the alignment. If this were not so, and a different preceding alignment other than A were the highest-scoring one, then the full alignment would also not be the highest-scoring alignment, contradicting the starting assumption. Similarly (II), alignment A must itself have been derived from an optimal alignment (B) by addition of a C/C pair. In this manner, the alignment can be traced back sequentially to the first aligned pair, which was also an optimal alignment. One concludes that building an optimal alignment in this stepwise fashion can provide an optimal alignment of the entire sequences. The example also illustrates two of the three choices that can be made in extending an alignment between two sequences: match the next two characters in the next positions in each sequence, or match the next character to a gap in the upper sequence. The last possibility, not illustrated, is to add a gap to the lower sequence. This situation is analogous to performing a dot matrix analysis of the sequences and either continuing a diagonal or shifting the diagonal sideways or downward to produce a gap in one of the sequences.

8.2.2 Formal Description of the Dynamic Programming Algorithm
The algorithm may be written in mathematical form. There are three paths in the scoring matrix for reaching a particular position (i,j): a diagonal move from position (i-1, j-1), with no gap penalty, or a move from any other position in column j or row i, with a gap penalty that depends on the size of the gap. For two sequences a = a1 a2 ... an and b = b1 b2 ... bm, let S(i,j) denote the best score for aligning a1 a2 ... ai with b1 b2 ... bj. Then (Smith and Waterman 1981a,b)

S(i,j) = max[ S(i-1,j-1) + s(ai,bj),
              max over x >= 1 of { S(i-x,j) - wx },
              max over y >= 1 of { S(i,j-y) - wy } ]

where s(ai,bj) is the score for the pair of characters ai and bj, wx is the penalty for a gap of length x in sequence a, and wy is the penalty for a gap of length y in sequence b. Note that S(i,j) is a type of running best score as the algorithm moves through every position in the matrix. Eventually, when all of the matrix positions (all S(i,j)) have been filled, the best score of the alignment will be found as the highest-scoring position in the last row and column (for a global alignment), after correcting for any remaining gap penalties needed to align the sequence ends, if applicable. To determine an optimal alignment of the sequences from the scoring matrix, a second matrix, called the trace-back matrix, is used. The trace-back matrix keeps track of the positions in the scoring matrix that contributed to the highest overall score found. The sequence characters corresponding to these high-scoring positions may align, or may be next to a gap, depending on the information in the trace-back matrix. An example of this procedure can be found on the book Web site. Use of the dynamic programming method requires a scoring system for the comparison of symbol pairs (nucleotides for DNA sequences and amino acids for protein sequences) and a scheme for insertion/deletion (gap) penalties. Once those parameters have been set, the resulting alignment for two sequences should always be the same.
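For comparison with the global recurrence above, the following is a minimal Python sketch of the local (Smith-Waterman) variant with a linear gap penalty: cell values are floored at zero, and the traceback starts from the highest-scoring cell and stops at the first zero. The scores used (match +2, mismatch -1, gap -2) are illustrative assumptions, not recommended parameters.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores are never allowed to fall below zero.
            S[i][j] = max(0, S[i - 1][j - 1] + s,
                          S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append('-'); i -= 1
        else:
            top.append('-'); bottom.append(b[j - 1]); j -= 1
    return best, ''.join(reversed(top)), ''.join(reversed(bottom))

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))

The floor at zero is what distinguishes the local from the global algorithm: a poorly scoring prefix is simply discarded rather than dragging the alignment score down.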
8.3 Let us Sum up
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignment of characters between them. Dynamic programming is a computational method that is used to align two protein or nucleic acid sequences. The method is very important for sequence analysis because it provides the very best, or optimal, alignment between sequences. Programs that perform this analysis on sequences are readily available, and there are Web sites that will perform the analysis. Alignment of two sequences without allowing gaps requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, as in a dot matrix comparison.

8.4 Lesson end activities
1. Find out some more algorithms that use dynamic programming.

8.5 Check your progress: Model answers
1. Your answer must include these points:
1. Chemical similarity of the amino acid R group, or some other feature for distinguishing amino acids, may be used to score similarity.
2. A symbol comparison table such as the PAM250 or BLOSUM62 tables may be used.
3. Several different matrices can be made, each with a different scoring system, and the scores can be averaged.

8.6 Points for Discussion
1. Comment on the significant features of the dynamic programming algorithm with suitable examples.

8.7 References
1. Wagner, David B. (1995). "Dynamic Programming." An introductory article on dynamic programming in Mathematica.
2. King, Ian (2002 [1987]). "A Simple Introduction to Dynamic Programming in Macroeconomic Models." An introduction to dynamic programming as an important tool in economic theory.
3. Dreyfus, Stuart (2002). "Richard Bellman on the Birth of Dynamic Programming." Operations Research 50(1):48-51.

LESSON – 9 SCORING MATRICES AND GAP PENALTY
9.0 Aims and Objectives
9.1 Use of scoring matrices and gap penalties in sequence alignments
9.1.1 Amino acid substitution matrices
9.1.2 Nucleic acid PAM scoring matrices
9.1.3 Gap penalties
9.1.4 Optimal combinations of scoring matrices and gap penalties for finding related proteins
9.2 Let us Sum up
9.3 Lesson end activities
9.4 Check your progress
9.5 Points for Discussion
9.6 References

9.0 Aims and Objectives
This unit discusses the use of scoring matrices and gap penalties in sequence alignments, amino acid substitution matrices, nucleic acid PAM scoring matrices, gap penalties, and optimal combinations of scoring matrices and gap penalties for finding related proteins.

9.1 Use of Scoring Matrices and Gap Penalties in Sequence Alignments
Amino Acid Substitution Matrices
Protein chemists discovered early on that certain amino acid substitutions commonly occur in related proteins from different species. Because the protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function. Often, these substitutions are to a chemically similar amino acid, but other changes also occur. Yet other substitutions are relatively rare. Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences.
If related protein sequences are quite similar, they are easy to align, and the single-step amino acid changes can readily be determined. If ancestor relationships among a group of proteins are assessed, the most likely amino acid changes that occurred during evolution can be predicted. This type of analysis was pioneered by Margaret Dayhoff (1978). Amino acid substitution matrices, or symbol comparison tables as they are sometimes called, are used for such purposes. Although the most common use of such tables is for the comparison of protein sequences, tables of nucleic acid symbols are also used for the comparison of nucleic acid sequences, in order to accommodate ambiguous nucleotide characters or models of expected sequence change, over different periods of evolutionary time, that score transitions and transversions differently. In an amino acid substitution matrix, the amino acids are listed both across the top and down the side of the matrix, and each matrix position is filled with a score that reflects how often one amino acid would have been paired with the other in an alignment of related protein sequences. The probability of changing amino acid A into B is always assumed to be identical to the reverse probability of changing B into A. This assumption is made because, for any two sequences, the ancestral amino acid in the phylogenetic tree is usually not known. Additionally, the likelihood of replacement should depend on the product of the frequency of occurrence of the two amino acids and on their chemical and physical similarities. A prediction of this model is that amino acid frequencies will not change over evolutionary time (Dayhoff 1978).

9.1.1 Dayhoff Amino Acid Substitution Matrices (Percent Accepted Mutation or PAM Matrices)
This family of matrices lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution. There is presently no other type of scoring matrix that is based on such sound evolutionary principles as these matrices. Even though they were originally based on a relatively small data set, the PAM matrices remain a useful tool for sequence alignment. Each matrix gives the changes expected for a given period of evolutionary time, evidenced by decreased sequence similarity as genes encoding the same protein diverge with increasing evolutionary time. Thus, one matrix gives the changes expected in homologous proteins that have diverged only a small amount from each other, in a relatively short period of time, so that they are still 50% or more similar. Another gives the changes expected of proteins that have diverged over a much longer period, leaving only 20% similarity. These predicted changes are used to produce optimal alignments between two protein sequences and to score the alignments. The assumption in this evolutionary model is that the amino acid substitutions observed over short periods of evolutionary history can be extrapolated to longer distances. The BLOSUM matrices, which are based on scoring substitutions found over a range of evolutionary periods, reveal that substitutions are not always as predicted by the PAM model. In deriving the PAM matrices, each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site (Dayhoff 1978).
Thus, the probability of change of any amino acid a to amino acid b is taken to be the same, regardless of the previous changes at that site and regardless of the position of amino acid a in the protein sequence. Amino acid substitutions in a protein sequence are thus viewed as a Markov model, characterized by a series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state. Use of this model makes it possible to extrapolate amino acid substitutions observed over a relatively short period of evolutionary time to longer periods. To prepare the Dayhoff PAM matrices, the amino acid substitutions that occur in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar. Because these changes are observed in closely related proteins, they represent amino acid substitutions that do not significantly change the function of the protein. Hence, they are called "accepted mutations," defined as amino acid changes "accepted" by natural selection. Similar sequences were first organized into a phylogenetic tree. The number of changes of each amino acid into every other amino acid was then counted. To make these numbers useful for sequence analysis, information on the relative amount of change for each amino acid was needed. Relative mutabilities were evaluated by counting, in each group of related sequences, the number of changes of each amino acid and by dividing this number by a factor called the exposure to mutation of the amino acid. This factor is the product of the frequency of occurrence of the amino acid in that group of sequences and the total number of all amino acid changes that occurred in that group per 100 sites. This factor normalizes the data for variations in amino acid composition, mutation rate, and sequence length. The normalized frequencies were then summed over all sequence groups. By these scores, Asn, Ser, Asp, and Glu were the most mutable amino acids, and Cys and Trp were the least mutable. The amino acid exchange counts and mutability values were then used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes. Because amino acid change was modeled as a Markov process, with the mutation at each site independent of previous mutations, the changes predicted for more distantly related proteins that have undergone N mutations could be calculated: the PAM1 matrix is multiplied by itself N times, giving transition matrices for comparing sequences with lower and lower levels of similarity due to separation over longer periods of evolutionary history. Thus, the commonly used PAM250 matrix represents the level of 250% change expected over roughly 2500 million years. Although this amount of change seems very large, sequences at this level of divergence still have about 20% similarity. For example, alanine will be matched with alanine 13% of the time and with another amino acid 87% of the time. The percentage of remaining similarity for any PAM matrix can be calculated by summing the percentages for amino acids not changing (Ala versus Ala, etc.) after multiplying each by the frequency of that amino acid pair in the database (e.g., 0.089 for Ala) (Dayhoff 1978). The PAM120, PAM80, and PAM60 matrices should be used for aligning sequences that are approximately 40%, 50%, and 60% similar, respectively. A small sketch of the matrix extrapolation is given below.
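The extrapolation just described is simply repeated matrix multiplication. Below is a minimal Python sketch using a toy 3-state mutation matrix in place of the real 20 x 20 PAM1 table (whose values are not reproduced here); numpy's matrix power plays the role of applying the Markov model N times.

import numpy as np

# Toy 3-letter "PAM1-like" mutation probability matrix: each row sums to 1,
# with 99% chance of no change per unit of evolutionary time. A real PAM1
# matrix is 20 x 20 with empirically derived off-diagonal values.
pam1 = np.array([
    [0.99,  0.006, 0.004],
    [0.005, 0.99,  0.005],
    [0.004, 0.006, 0.99 ],
])

# Extrapolate to PAM250 by multiplying the matrix by itself 250 times.
pam250 = np.linalg.matrix_power(pam1, 250)

# Fraction of positions left unchanged at this distance: weight each
# diagonal entry by the (here uniform) background frequency of the letter.
background = np.array([1/3, 1/3, 1/3])
identity_remaining = float(background @ np.diag(pam250))
print(np.round(pam250, 3))
print(f"expected identity at PAM250: {identity_remaining:.2f}")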
Simulations by George et al. (1990) have shown that, as predicted, the PAM250 matrix provides a better-scoring alignment than lower-numbered PAM matrices for distantly related proteins of 14–27% similarity. At one time, the PAM250 scoring matrix was modified in an attempt to improve the alignments obtained. All scores for matching a particular amino acid were normalized to the same mean and standard deviation, and all amino acid identities were given the same score, to provide an equal contribution for each amino acid in a sequence alignment (Gribskov and Burgess 1986). These modifications were included as the default matrices for the GCG sequence alignment programs in versions 8 and earlier and are optional in later versions. They are not recommended, because they will not give an optimal alignment that is in accord with the evolutionary model.

Choosing the Best PAM Scoring Matrices for Detecting Sequence Similarity. The ability of PAM scoring matrices to distinguish statistically between chance and biologically meaningful alignments has been analyzed using a more recently developed statistical theory for sequences (Altschul 1991) that is discussed later in this material. As discussed above, each PAM matrix is designed to score alignments between sequences that have diverged by a particular degree of evolutionary distance. Altschul (1991) has examined how well the PAM matrices actually distinguish proteins that have diverged to a greater or lesser extent, when these proteins are subjected to a local alignment. Initially, when using a scoring matrix to produce an alignment, the amount of similarity between the sequences may not be known. However, the ungapped alignment scores obtained are maximal when the correct PAM matrix, i.e., the one corresponding to the degree of similarity in the target sequences, is used (Altschul 1991). Altschul (1991) has also examined the ability of PAM matrices to provide a reliable enough indication of an ungapped local alignment score between sequences on an initial attempt at alignment. For such alignments, the PAM200 matrix is able to detect a significant ungapped alignment of 16–62 amino acids whose score is within 87% of the optimal one. Alternatively, several combinations, such as PAM80 and PAM250, or PAM120 and PAM350, can be used. Altschul (1993) has also proposed using a single matrix and adjusting a statistical parameter in the scoring system to reach more distantly related sequences, but this change would primarily be for database searches. Scoring matrices are also used in database searches for similar sequences, and the optimal matrices for these searches have likewise been determined. It is important to remember that these predictions assume that the amino acid distributions in the set of protein families used to make the scoring matrix are representative of all families likely to be encountered. The original PAM matrices represent only a small number of families. Scoring matrices obtained more recently, such as the BLOSUM matrices, are based on a much larger number of protein families. BLOSUM matrices are not based on a PAM evolutionary model in which changes at large evolutionary distances are predicted by extrapolation of changes found at small distances. Instead, matrix values are based on the observed frequency of change in a large set of diverse proteins.
As discussed on the book Web site, the BLOSUM scoring matrices (especially BLOSUM62) appear to capture more of the distant types of variation found in protein families. In addition to the aforementioned differences among PAM scoring matrices for scoring alignments of more- or less-related proteins, the ability of each PAM matrix to discriminate real local alignments from chance alignments also varies. To calculate the ability of an entire matrix to discriminate related from unrelated sequences (H, the relative entropy), the score for each amino acid pair, sij (in units of log2, called bits), is multiplied by the probability of occurrence of that pair in the original data set, qij (Altschul 1991). This weighted score is then summed over all of the amino acid pairs to produce a score that represents the ability of the average amino acid pair in the matrix to discriminate actual from chance alignments:

H = sum over all pairs i,j of ( qij x sij )

In information theory, this score is called the average mutual information content per pair, and the sum over all pairs is the relative entropy of the matrix (termed H); a small sketch of this computation is given at the end of this subsection. The relative entropy will be a small positive number: for the PAM250 matrix it is approximately 0.36, for PAM120, 0.98, and for PAM160, 0.70. In general, all other factors being equal, the higher the value of H for a scoring matrix, the more likely it is to be able to distinguish real from chance alignments.

Blocks Amino Acid Substitution Matrices (BLOSUM)
The BLOSUM62 substitution matrix (Henikoff and Henikoff 1992) is widely used for scoring protein sequence alignments. The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid patterns, called blocks. These blocks have been found in a database of protein sequences representing more than 500 families of related proteins (Henikoff and Henikoff 1992) and act as signatures of these protein families. The BLOSUM matrices are thus based on an entirely different type of sequence analysis, and on a much larger data set, than the Dayhoff PAM matrices. The protein families were originally identified by Bairoch in the Prosite catalog. This catalog provides lists of proteins that are in the same family because they have a similar biochemical function. For each family, a pattern of amino acids that is characteristic of that function is provided. Henikoff and Henikoff (1991) examined each Prosite family for the presence of ungapped amino acid patterns (blocks) that were present in each family and that could be used to identify members of that family. To locate these patterns, the sequences of each protein family were searched for similar amino acid patterns with the MOTIF program of H. Smith (Smith et al. 1990), which can find patterns of the type aa1 d1 aa2 d2 aa3, where aa1, aa2, and aa3 are conserved amino acids and d1 and d2 are stretches of intervening sequence up to 24 amino acids long, located in all sequences. These initial patterns were organized into larger ungapped patterns (blocks) between 3 and 60 amino acids long by the Henikoffs' PROTOMAT program (http://www.blocks.fhcrc.org). Because these blocks were present in all of the sequences in each family, they could be used to identify other members of the same family. Thus, the family collections were enlarged by searching the sequence databases for more proteins with these same conserved blocks.
The blocks that characterized each family provided a type of multiple sequence alignment for that family. The amino acid changes that were observed in each column of the alignment could then be counted. The types of substitutions were then scored for all aligned patterns in the database and used to prepare a scoring matrix, the BLOSUM matrix, indicating the frequency of each type of substitution. As previously described for the PAM matrices, BLOSUM matrix values are given as logarithms of odds scores: the ratio of the observed frequency of each amino acid substitution divided by the frequency expected by chance. This procedure of counting all of the amino acid changes in the blocks, however, can lead to an overrepresentation of amino acid substitutions that occur in the most closely related members of each family. To reduce this dominant contribution from the most alike sequences, such sequences were grouped together into one sequence before scoring the amino acid substitutions in the aligned blocks, and the amino acid changes within these clustered sequences were averaged. Patterns that were 60% identical were grouped together to make one substitution matrix, called BLOSUM60, those 80% alike to make another matrix, called BLOSUM80, and so on. As with the PAM matrices, these matrices differ in the degree to which the more common amino acid pairs are scored relative to the less common pairs. Thus, when used for aligning protein sequences, they provide a greater or lesser distinction between the more common and less common amino acid pairs. The ability of these different BLOSUM matrices to distinguish real from chance alignments and to identify as many members as possible of a protein family has been determined (Henikoff and Henikoff 1992). Two types of analysis were performed: (1) an information content analysis of each matrix, as described above for the PAM matrices, and (2) an actual comparison of the ability of each matrix to find members of the same families in a database search, discussed below. As the clustering percentage was increased, the ability of the resulting matrix to distinguish actual from chance alignments, defined as the relative entropy of the matrix or the average information content per residue pair, also increased. As clustering increased from 45% to 62%, the information content per residue increased from approximately 0.4 to 0.7 bits per residue, and was approximately 1.0 bit at 80% clustering. However, at the same time, the number of blocks that contributed information decreased by 25% between no clustering and 62% clustering. BLOSUM62 represents a balance between information content and data size. Henikoff and Henikoff (1993) have also prepared a set of interval BLOSUM matrices that represent the changes observed between more closely or more distantly related representatives of each block. Rather than representing the changes observed in sequences ranging from very alike up to n% alike, as in a BLOSUM-n matrix, the new BLOSUM-nm matrix represents the changes observed in sequences that are between n% and m% alike. The idea behind these matrices was to have a set of matrices corresponding to amino acid changes in sequence blocks separated by different evolutionary distances.
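Returning to the relative entropy H defined above, the following is a minimal Python sketch that computes H for a toy scoring scheme from pair frequencies; the tiny two-letter alphabet and its numbers are invented for illustration, not taken from any published matrix.

import math

# Hypothetical observed (target) pair frequencies q_ij and background
# letter frequencies p_i for a two-letter alphabet; q must sum to 1.
q = {('A', 'A'): 0.30, ('A', 'B'): 0.20, ('B', 'B'): 0.50}
p = {'A': 0.40, 'B': 0.60}

def log_odds_bits(pair):
    """s_ij = log2(q_ij / e_ij), where e_ij is the chance expectation."""
    i, j = pair
    expected = p[i] * p[j] * (1 if i == j else 2)  # factor 2: unordered pair
    return math.log2(q[pair] / expected)

# Relative entropy: H = sum_ij q_ij * s_ij (average information per pair).
H = sum(q_ij * log_odds_bits(pair) for pair, q_ij in q.items())
print(f"H = {H:.3f} bits per aligned pair")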
Comparison of the PAM and BLOSUM Amino Acid Substitution Matrices

There are several important differences in the ways that the PAM and BLOSUM scoring matrices were derived, and these differences should be appreciated in order to interpret the results of protein sequence alignments obtained with these matrices. First, the PAM matrices are based on a mutational model of evolution that assumes amino acid changes occur as a Markov process, each amino acid change at a site being independent of previous changes at that site. Changes are scored in sequences that are at least 85% similar, after predicting a phylogenetic history of the changes in each family. Thus, the PAM matrices are based on the first changes that occur as proteins diverge from a common ancestor during the evolution of a protein family. Matrices that may be used to compare more distantly related proteins are then derived by extrapolation from these short-term changes, assuming that the more distant changes are a reflection of the short-term changes occurring over and over again: for each longer evolutionary interval, each amino acid can change to any other with the same frequency as observed in the short term. In contrast, the BLOSUM matrices are not based on an explicit evolutionary model. They are derived from all of the amino acid changes observed in an aligned region of a related family of proteins, regardless of the overall degree of similarity between the protein sequences. However, these proteins are known to be related biochemically and, hence, should share common ancestry. The evolutionary model implied in such a scheme is that the proteins in each family share a common origin, but closer versus more distal relationships are ignored, as if all members were derived equally from the same ancestor, a scheme called a starburst model of protein evolution. Second, the PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most alike common regions in related sequences. Thus, the PAM model is designed to track the evolutionary origins of proteins, whereas the BLOSUM model is designed to find their conserved domains.

9.1.2 Nucleic acid PAM scoring matrices

Just as amino acid scoring matrices have been used to score protein sequence alignments, nucleotide scoring matrices for scoring DNA sequence alignments have also been developed. A DNA matrix can incorporate ambiguous DNA symbols and information from mutational analysis, which reveals that transitions (substitutions between the purines A and G or between the pyrimidines C and T) are more probable than transversions (substitutions from purine to pyrimidine or from pyrimidine to purine) (Li and Graur 1991). These substitution matrices may be used to produce global or local alignments of DNA sequences. States et al. (1991) have developed a series of nucleic acid PAM matrices based on a Markov transition model similar to that used to generate the Dayhoff PAM scoring matrices. Although designed to improve the sensitivity of similarity searches of sequence databases, these matrices may also be used to score nucleic acid alignments. The advantages of using these matrices are that they are based on a defined evolutionary model and that the statistical significance of alignment scores obtained by local alignment programs may be evaluated, as described later in this lesson.
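The extrapolation just described, deriving a matrix for a longer evolutionary interval from the short-term PAM1 matrix, amounts to repeated matrix multiplication. A minimal sketch follows, using the uniform-rate nucleotide PAM1 values derived in the next paragraph (0.99 on the diagonal, 0.00333 elsewhere); the same idea applies to the 20 x 20 amino acid mutation matrix.

def mat_mult(a, b):
    # multiply two square matrices given as lists of rows
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n):
    # PAMn mutation matrix = PAM1 multiplied by itself n times
    pam1 = [[0.99 if i == j else 0.00333 for j in range(4)] for i in range(4)]
    m = pam1
    for _ in range(n - 1):
        m = mat_mult(m, pam1)
    return m

# after 100 PAMs the chance that a nucleotide is unchanged has dropped
# well below 0.99 (this prints about 0.446)
print(round(pam(100)[0][0], 3))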
To prepare these DNA PAM matrices, a PAM1 mutation matrix representing 99% sequence conservation and one PAM of evolutionary distance (1% mutations) was first calculated. For a model in which all mutations from any nucleotide to any other are equally likely, and in which the four nucleotides are present at equal frequencies, the four diagonal elements of the PAM1 matrix representing no change are 0.99, whereas the off-diagonal elements representing change are each 0.00333. The values are chosen so that the sum of all possible changes for a given nucleotide in the PAM1 matrix is 1% (3 × 0.00333 ≈ 0.01). For a biased mutation model in which a given transition is threefold more likely than a transversion, the off-diagonal matrix elements corresponding to the one possible transition for each nucleotide are 0.006 and those for the two possible transversions are 0.002, and the sum for each nucleotide is again 1% (0.006 + 0.002 + 0.002 = 0.01). As with the amino acid matrices, the above matrix values are then used to produce log odds scoring matrices that represent the frequency of substitutions expected at increasing evolutionary distances. In terms of an alignment, the odds score for matching nucleotides i and j is the probability of obtaining the match in related sequences divided by the random probability of aligning i and j, giving sij = log (Mij / pj), where Mij is the value in the mutation matrix and pi and pj are the fractional compositions of each nucleotide, assumed to be 0.25. The base of the logarithm can be any value, corresponding to multiplying every value in the matrix by the same constant; with such scaling variations, the ability of the matrix to distinguish significant from chance alignments is not altered. The resulting sij values are usually expressed in units of bits (logarithm to the base 2) and rounded off to the nearest whole integer. From these PAM1 matrices, additional log odds matrices at an evolutionary distance of n PAMs may be obtained by multiplying the PAM1 matrix by itself n times. The ability of each matrix to distinguish real from random nucleotide matches in an alignment, designated H and measured in bit units (log2), can be calculated as the relative entropy described above, H = Σ qij sij, where qij is the frequency of the i-j pair and the sij scores are also expressed in bit units. At increasing evolutionary distances, the log odds values of the match and mismatch scores change, as does the percentage of nucleotides expected to have changed at that distance; the identity score will be 100 minus this value. This percentage is not as great as the PAM score because of expected back-mutation over longer time periods, and H decreases as the PAM value increases.

Check your progress:
1. List out the important differences between PAM and BLOSUM.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

9.1.3 Gap penalties

The inclusion of gaps and gap penalties is necessary in order to obtain the best possible alignment between two sequences.
A gap opening penalty for any gap (g) and a gap extension penalty for each element in the gap (r) are most often used, to give a total gap score wx = g + rx, where x is the length of the gap. Note that in some formulations of the gap penalty, the equation wx = g + r (x − 1) is used instead; the gap extension penalty is then not added to the gap opening penalty until the gap size is 2. Although this difference does not affect the alignment obtained, one needs to know which method is being used by a particular computer program if the correct results are to be obtained. In the former case, the penalty for a gap of size 1 is g + r, whereas in the latter case this value is g. The values for these penalties have to be chosen to balance the scores in the scoring matrix that is used. Thus, the Dayhoff log odds matrix at PAM250 is expressed in units of approximately 1/3 bit, but if this matrix were converted to 1/2-bit units, the same gap penalties would no longer be appropriate. If too high a gap penalty is used relative to the range of scores in the substitution matrix, gaps will never appear in the alignment. Conversely, if the gap penalty is too low compared to the matrix scores, gaps will appear everywhere in the alignment in order to align as many of the same characters as possible. Fortunately, most alignment programs suggest gap penalties that are appropriate for a given scoring matrix in most situations. In the GCG and FASTA program suites, the scoring matrix itself is formatted in a way that includes default gap penalties. Examples of the values of g and r used by various alignment programs are shown on the book Web site. When deciding gap penalties for local alignment programs, another consideration is that the penalties should be large enough to produce a truly local alignment of the sequences. Examples of suitable values are given in Altschul and Gish (1996), and Pearson (1996, 1998) has found that the use of appropriate gap penalties provides an improved local alignment based on statistical analysis. These studies are described in detail in the following section.

The mathematician Peter Sellers (1974) showed that if sequence alignment is formulated in terms of distance instead of similarity between sequences, a biologically more appealing interpretation of gaps is possible. The distance is the number of changes that must be made to convert one sequence into the other and represents the number of mutations that will have occurred following separation of the genes during evolution; the greater the distance, the more distantly related are the sequences in evolution. In this scheme, an identity contributes a distance of 0 and a substitution a positive distance of 1, so that for each position the distance score plus the similarity score is equal to 1. Sellers proved that this distance formulation of sequence alignment has a desirable mathematical property that also makes evolutionary sense. If three sequences, a, b, and c, are compared using the above scoring scheme, the distance score is a metric that satisfies the triangle inequality, d(a,c) ≤ d(a,b) + d(b,c), where d(a,b) is the distance between sequences a and b, and likewise for the other two distances. Expressed another way, if the three possible distances between three sequences are obtained, then the distance between any first pair plus that for any second pair cannot be less than the distance for the third pair.
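The two gap-penalty conventions described above differ only in whether the extension penalty is charged for the first gap position. A minimal sketch, with arbitrary illustrative values of g and r:

def gap_score_a(x, g=11, r=1):
    # w_x = g + r * x : extension penalty charged for every gap position
    return g + r * x

def gap_score_b(x, g=11, r=1):
    # w_x = g + r * (x - 1) : extension charged only from the second position
    return g + r * (x - 1)

# for a gap of length 1 the two conventions differ by exactly r
print(gap_score_a(1), gap_score_b(1))   # prints 12 11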
Violating this triangle inequality would not be consistent with the expected evolutionary origin of the sequences. To satisfy the metric requirement, the scoring of individual matches, mismatches, and gaps must be such that, in an alignment of two identical sequences a and a′, d(a,a′) must equal 0, and for two totally different sequences b and b′, d(b,b′) must equal 1. For any two sequences a and b, d(a,b) = d(b,a). Hence, it is important that the distance score for changing one sequence character into a second be the same as the converse score for changing the second into the first, if the distance score of the alignment is to remain a metric and to make evolutionary sense. The above relationships were shown by Sellers to be true for gaps of length 1 in a sequence alignment. He also showed that the smallest number of steps required to change one sequence into the other can be calculated by the dynamic programming algorithm. The method is similar to those discussed above for the Needleman-Wunsch global and Smith-Waterman local alignments, except that those methods find the maximum similarity between two sequences, as opposed to the minimum distance found by the Sellers analysis. Subsequently, Smith et al. (1981) and Smith and Waterman (1981a,b) showed that gaps of any length could also be included in an alignment and still provide a distance metric for the alignment score. In this formulation, the gap penalty was required to increase as a function of the gap length. The argument was made that a single mutational event involving one gap of n residues should be more likely to have occurred than n single gaps. Thus, to increase the likelihood of such gaps of length greater than 1 being found, the penalty for a gap of length n was made smaller than the score for n individual gaps. The simplest way of implementing this feature was to have the gap score wx be a linear function of gap length consisting of two parts, a larger gap opening penalty (g) and a smaller gap extension penalty (r) for each extra position in the gap, or wx = g + rx, where x is the length of the gap, as described above. This type of gap penalty is referred to in the literature as an affine gap penalty. Any other formula for scoring gap penalties should also work, provided that the score increases with the length of the gap but remains less than the score of x individual gaps. Scoring of gaps by the above linear function of gap length has now become widely used in sequence alignment, although more complex gap penalty functions have also been used (Miller and Myers 1988).

9.1.4 Optimal combinations of scoring matrices and gap penalties for finding related proteins

The usefulness of combinations of scoring matrices and gap penalties for identifying related proteins, including distantly related ones, has been compared (Feng et al. 1985; Doolittle 1986; Henikoff and Henikoff 1993; Pearson 1995, 1996, 1998; Agarwal and States 1998; Brenner et al. 1998). The method generally used is to start with a database of protein sequences organized into families, based either on sequence similarity or on structural similarity. A member of a family is then selected and used as a query sequence in a search of the entire database from which the sequence came, using a database similarity search method (FASTA, BLAST, SSEARCH).
These methods basically use the dynamic programming algorithm and a choice of scoring matrix and gap penalties to produce alignment scores. Details of these studies are described on the book Web site. In summary, the following general observations have been made: (1) Some scoring matrices are superior to others at finding related proteins based on either sequence or structure. For example, matrices prepared by examining the full range of amino acid substitutions in families of related proteins, such as the BLOSUM62 matrix, perform better than matrices based on variations in closely related proteins that are extrapolated to produce matrices for more distantly related sequences, such as the Dayhoff PAM250 matrix. (2) Gap penalties that are adjusted, for a given scoring matrix, to produce a local alignment are the most suitable. (3) To identify related sequences, the significance of the alignment scores should be estimated, as described in the following section. These methods provide the means to demonstrate sequence similarity in even the most distantly related proteins. For closely related proteins, a PAM-type scoring matrix that matches the evolutionary separation of the sequences may provide a higher-scoring alignment. Another set of studies has suggested that a global alignment algorithm, in combination with scoring matrices that have all positive values and suitable gap penalties, can be used to align proteins that have limited sequence similarity (i.e., about 25% identity) but similar structure (Vogt et al. 1995; Abagyan and Batalov 1997).

9.2 Let us Sum up

In this lesson, we discussed (i) the use of scoring matrices and gap penalties in sequence alignments, (ii) amino acid substitution matrices, (iii) nucleic acid PAM scoring matrices, (iv) gap penalties, and (v) optimal combinations of scoring matrices and gap penalties.

9.3 Lesson end activities
1. Where did the BLOSUM62 alignment score matrix come from?

9.4 Check your progress: Model answers
1. Your answer must include these points:
PAM matrices are based on a mutational model of evolution; BLOSUM matrices are not based on an explicit evolutionary model.
PAM matrices are based on scoring all amino acid positions in related sequences; BLOSUM matrices are based on substitutions and conserved positions in blocks.
The PAM model is designed to track the evolutionary origins of proteins; the BLOSUM model is designed to find their conserved domains.

9.6 Points for Discussion
1. "Scoring matrices play a vital role in sequence alignment" - Make a critical analysis of this statement.
2. "The inclusion of gaps and gap penalties is necessary to get the best possible alignment" - Justify.

9.7 References
1. Altschul, S.F. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 555-565 (1991).
2. Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. A model of evolutionary change in proteins. In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352 (1978).
3. Henikoff, S. and Henikoff, J. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915-10919 (1992).
4. Eddy, S.R. Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 22(8), 1035-1036 (2004).
LESSON - 10 ASSESSING THE SIGNIFICANCE OF SEQUENCE ALIGNMENTS
10.0 Aims and Objectives
10.1 Assessing the significance of sequence alignments
10.1.1 Significance of global alignments
10.1.2 Modeling a random DNA sequence alignment
10.1.3 Alignments with gaps
10.1.4 The Gumbel extreme value distribution
10.1.5 Methods for Calculating the Parameters of the Extreme Value Distribution
10.1.6 The Statistical Significance of Individual Alignment Scores between Sequences and the Significance of Scores Found in a Database Search Are Calculated Differently
10.1.7 FASTA and BLAST
10.2 Let us Sum up
10.3 Lesson end activities
10.4 Check your progress
10.5 Points for Discussion
10.6 References

10.0 Aims and Objectives:
This lesson discusses assessing the significance of sequence alignments: the significance of global alignments, modeling a random DNA sequence alignment, alignments with gaps, the Gumbel extreme value distribution, a quick determination of the significance of an alignment score, the importance of the type of scoring matrix for statistical analyses, the significance of gapped local alignments, methods for calculating the parameters of the extreme value distribution, and FASTA and BLAST.

10.1 Assessing the Significance of Sequence Alignments

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. For sequences that are quite similar, such as two proteins that are clearly in the same family, such an analysis is not necessary. A significance question arises when comparing two sequences that are not so clearly similar but are shown to align in a promising way. In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences or would just as likely be found if the sequences were not related. A significance test is also needed to evaluate the results of a database search for sequences similar to a query sequence by the BLAST and FASTA programs; the test is applied to every sequence matched so that the most significant matches are reported. Finally, a significance test can also help to identify regions in a single sequence that have an unusual composition suggestive of an interesting function. Our present purpose is to examine the significance of sequence alignment scores obtained by the dynamic programming method. Originally, the significance of sequence alignment scores was evaluated on the basis of the assumption that alignment scores followed a normal statistical distribution. If sequences are randomly generated in a computer by a Monte Carlo or sequence-shuffling method (as in generating a sequence by picking marbles representing the four bases or the 20 amino acids out of a bag, with the number of each type proportional to the frequency found in real sequences), the distribution of alignment scores may look normal at first glance. However, further analysis of the alignment scores of random sequences reveals that the scores follow a different distribution than the normal distribution, called the Gumbel extreme value distribution.
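A small simulation makes this point concrete. The sketch below scores pairs of random DNA sequences by their best ungapped segment (match +1, mismatch −1) for one fixed alignment register, an assumption chosen to keep the example short, and shows that the distribution of best scores is skewed toward high values rather than symmetric like a normal curve. The sequence length and number of trials are arbitrary.

import random

def best_segment_score(a, b):
    # best-scoring ungapped run for one fixed alignment of equal-length strings
    best = 0
    run = 0
    for x, y in zip(a, b):
        run = max(0, run + (1 if x == y else -1))
        best = max(best, run)
    return best

random.seed(1)
scores = []
for _ in range(2000):
    a = "".join(random.choice("ACGT") for _ in range(200))
    b = "".join(random.choice("ACGT") for _ in range(200))
    scores.append(best_segment_score(a, b))

# because each value is itself a maximum, the right tail is stretched:
# the largest best score sits far above the median
scores.sort()
print("median best score:", scores[len(scores) // 2])
print("largest best score:", scores[-1])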
In this section, we review some of the earlier methods used for assessing the significance of alignments, then describe the extreme value distribution, and finally discuss some useful programs for this type of analysis, with some illustrative examples. The statistical analysis of alignment scores is much better understood for local alignments than for global alignments. Recall that the Smith-Waterman alignment algorithm and the scoring system used to produce a local alignment are designed to reveal regions of closely matching sequence with a positive alignment score. In random or unrelated sequence alignments, such regions are rarely found. Hence, their presence in real sequence alignments is significant, and the probability of their occurring by chance alignment of unrelated sequences can be readily calculated. The significance of the scores of global alignments, on the other hand, is more difficult to determine. Using the Needleman-Wunsch algorithm and a suitable scoring system, there are many ways to produce a global alignment between any pair of sequences, and the scores of many different alignments may be quite similar. When random or unrelated sequences are compared using a global alignment method, they can have very high scores, reflecting the tendency of the global algorithm to match as many characters as possible. Thus, assessing the statistical significance of a global alignment is a much more difficult task. Rather than being used as a strict test for sequence homology, a global alignment is more appropriately used to align sequences that are of approximately the same length and are already known to be related. The method conveniently shows which sequence characters align; one can then use this information to perform other types of analyses, such as structural modeling or evolutionary analysis.

10.1.1 Significance of Global Alignments

In general, global alignment programs use the Needleman-Wunsch alignment algorithm and a scoring system that scores the average match of an aligned nucleotide or amino acid pair as a positive number. Hence, the score of the alignment of random or unrelated sequences grows proportionally to the length of the sequences. In addition, there are many possible different global alignments depending on the scoring system chosen, and small changes in the scoring system can produce a different alignment. Thus, finding the best global alignment and knowing how to assess its significance is not a simple task, as reflected by the absence of studies in the literature. Waterman (1989) provided a set of means and standard deviations of global alignment scores between random DNA sequences, using mismatch and gap penalties that produce a linear increase in score with sequence length, a distinguishing feature of global alignments. However, these values are of limited use because they are based on a simple gap scoring system. Abagyan and Batalov (1997) suggested that global alignment scores between unrelated protein sequences followed the extreme value distribution, similar to local alignment scores. However, since the scoring system they used favored local alignments, the alignments they produced may not have been global but local. Unfortunately, there is no equivalent theory on which to base an analysis of global alignment scores as there is for local alignment scores.
For zero mismatch and gap penalties, which is the most extreme condition for a global alignment, giving the longest subsequence common to two sequences, the score P between two random or unrelated sequences is proportional to sequence length n, P ≈ cn (Chvátal and Sankoff 1975), but it has not proven possible to calculate the proportionality constant c (Waterman and Vingron 1994a). To evaluate the significance of a Needleman-Wunsch global alignment score, Dayhoff (1978) and Dayhoff et al. (1983) evaluated Needleman-Wunsch alignment scores for a large number of randomized and unrelated but real protein sequences, using their log odds scoring matrix at 250 PAMs and a constant gap penalty. The distribution of the resulting random scores matched a normal distribution. On the basis of this analysis, the significance of an alignment score between two apparently related sequences A and B was determined by obtaining a mean and standard deviation of the alignment scores of 100 random permutations or shufflings of A against 100 of B, conserving the length and amino acid composition of each. For the score between A and B to be considered significant, the authors specified that the real score should be at least 3-5 standard deviations greater than the mean of the random scores. This level of significance means that the probability that two unrelated sequences would give such a high score is 1.35 × 10⁻³ (3 S.D.) and 2.87 × 10⁻⁷ (5 S.D.). In evaluating an alignment, two parameters were varied to maximize the alignment score: first, a constant called the matrix bias was added to each value in the scoring matrix, and second, the gap penalty was varied. The statistical analysis was then performed after the score between A and B had been maximized. Recall that the log odds PAM250 matrix values vary from −7 to 17 in units of approximately 1/3 bit. The bias varied from 2 to 20 and had the effect of increasing the score by the bias times the number of alignment positions where one amino acid is matched to another. As a result, the alignment frequently decreases in length because there are fewer gaps, assuming the gap penalty is not also changed. It was on these optimized alignments that the significance test was performed. Feng et al. (1985) used the same method to compare the significance of alignment scores obtained with different scoring matrices; they used 25-100 pairs of randomized sequences for each test of an alignment. There are several potential problems with this approach, some of which apply to other methods as well. First, the method is expensive in terms of the number of computational steps, which increase at least as fast as the square of the sequence length, because many Needleman-Wunsch alignments must be done; however, this problem is much reduced with the faster computers and more efficient algorithms of today. Second, if the amino acid composition is unusual, or if there is a region of low complexity (for example, many occurrences of one or two amino acids), the analysis will be oversimplified. Third, when natural sequences were compared more closely, the patterns found did not conform to a random assortment of the basic building blocks of sequences but rather to a random assortment of varying sequence segments. Consider the use of the 26-letter alphabet in English sentences: alphabet letters do not appear in random order in these sentences but rather within a vocabulary of meaningful words.
What happens if sentences, which are made up of words, are compared? On the one hand, if just the alphabet composition of many sentences is compared, not much variation is seen. On the other hand, if words are compared, much greater variation is found, because there are many more words than alphabet characters. If random sequences are produced from segments of sequences, rather than from individual residues, more variation is observed, more like that seen when unrelated natural sequences are compared. The increased variation found among natural sequences is not surprising when one thinks of DNA and proteins as sources of information. For example, protein-encoding regions of DNA sequences are constrained by the genetic code and by amino acid patterns that produce functional domains in proteins. Lipman et al. (1984) analyzed the distribution of scores among 100 vertebrate nucleic acid sequences and compared these scores with those of randomized sequences prepared in different ways. When the randomized sequences were prepared by shuffling the sequence to conserve base composition, as was done by Dayhoff and others, the standard deviation was approximately one-third less than that of the distribution of scores of the natural sequences. Thus, natural sequences are more variable than randomized ones, and using such randomized sequences for a significance test may lead to an overestimation of the significance. If, instead, the random sequences were prepared in a way that maintained the local base composition, by producing them from overlapping fragments of sequence, the distribution of scores had a higher standard deviation, closer to that of the natural sequences. The conclusion is that the presence of conserved local patterns can influence the score in statistical tests, such that an alignment can appear to be more significant than it actually is. Although this study was done using the Smith-Waterman algorithm with nucleic acids, the same cautionary note applies to other types of alignments. The final problem with the above methods is that the correct statistical model for alignment scores was not used; nevertheless, these earlier types of statistical analysis set the stage for later ones. The GCG alignment programs have a RANDOMIZATION option, which shuffles the second sequence and calculates similarity scores between the unshuffled sequence and each of the shuffled copies. If the new similarity scores are significantly smaller than the real alignment score, the alignment is considered significant. This analysis is only useful for providing a rough approximation of the significance of an alignment score and can easily be misleading. Dayhoff (1978) and Dayhoff et al. (1983) devised a second method for testing the relatedness of two protein sequences that can accommodate some local variation. This method is useful for finding repeated regions within a sequence, similar regions that are in a different order in two sequences, or a small conserved region such as an active site. As used in a computer program called RELATE (Dayhoff 1978), all possible segments of a given length of one sequence are compared with all segments of the same length from another. An alignment score using a scoring matrix is obtained for each comparison to give a score distribution among all of the segments.
A segment comparison score, in standard deviation units, is calculated as the difference between the value for the real sequences and the average value for the random sequences, divided by the standard deviation of the scores from the random sequences. A version of the RELATE program that runs on many computer platforms is included with the FASTA distribution package of W. Pearson. Because this program evaluates scores against a normal distribution, it provides only an approximate indication of the significance of an alignment.

10.1.2 Modeling a Random DNA Sequence Alignment

The above types of analyses assume that alignment scores between random sequences follow a normal distribution that can be used to test the significance of a score between two test sequences. For a number of reasons, mathematicians were concerned that this statistical model might not be correct. Let us start by creating two aligned random DNA sequences by drawing pairs of marbles from a large bag filled with four kinds of labeled marbles. The marbles are in equal proportions and labeled A, T, G, and C to represent an assumed equal representation of the four nucleotides in DNA. Now consider the probability of removing six identical pairs, representing six columns in an alignment between two random sequences. The probability of removing one identical pair (an A and another A) is 1/4 × 1/4, but there are four possible identical pairs (A/A, C/C, G/G, and T/T), so that the probability of removing any identical pair is 4 × 1/4 × 1/4 = 1/4, and that of removing six identical pairs in a row is (1/4)⁶ ≈ 2.4 × 10⁻⁴. The probability of drawing a mismatched pair is 1 − 1/4 = 3/4, and that of drawing six mismatched pairs in a row is (3/4)⁶ ≈ 0.178. Most random alignments produced in this manner will have a mixture of a few matches and many mismatches. The calculations are a little more complex if the four nucleotides are not equally represented, but the results will be approximately the same. In general, the probability of drawing a matched pair is p = pA² + pC² + pG² + pT², where pX is the proportion of nucleotide X; p is an important parameter to remember for the discussion below. An even more complicated situation arises when the two random sequences to be aligned have different nucleotide distributions; one approach is then to use an average p for the two sequences. This example illustrates the difficulty of modeling sequence alignments between two organisms that have different base compositions. The above model is not suitable for predicting the number of sequentially matched positions between random sequences of a given length. To estimate this number, a DNA sequence alignment may also be modeled by coin-tossing experiments (Arratia and Waterman 1989; Arratia et al. 1986, 1990). Random alignments will normally comprise mixtures of matches and mismatches, just as a series of coin tosses will produce a mixture of heads and tails. The chance of producing a series of matches in a sequence alignment with no mismatches is similar to the chance of tossing a coin and coming up with a series of only heads. The numbers of interest are the highest possible score that can be obtained and the probability of obtaining such a score in a certain number of trials. In such models, coins are usually considered to be "fair" in that the probability of a head is equal to that of a tail.
The coin in this example has a certain probability p of scoring a head (H) and q = 1 − p of scoring a tail (T). The longest expected run of heads R in n tosses was shown by Erdös and Rényi to be given by R = log1/p(n). If p = 0.5, as for a normal coin, then the base of the logarithm is 1/p = 2. For the example of n = 100 tosses, R = log2(100) = loge(100)/loge(2) = 4.605/0.693 = 6.65. To use the coin model, an alignment of two random sequences a = a1, a2, a3, ..., an and b = b1, b2, b3, ..., bn, each of the same length n, is converted to a series of heads and tails: if ai = bi, the equivalent toss result is an H; otherwise, the result is a T. The longest run of matches in the alignment is then equivalent to the longest run of heads in the coin-tossing sequence, and the Erdös and Rényi law can be used to predict the longest run of matches. This prediction, however, applies only to one particular alignment of random sequences, such as the one generated above by the marble draw. In performing a sequence alignment, two sequences are in effect shifted back and forth with respect to each other to find regions that can be aligned; in addition, the sequences may be of different lengths. If two random sequences of lengths m and n are aligned in this manner, the same law still applies, but the length of the predicted longest match becomes log1/p(mn) (Arratia et al. 1986). If m = n, the expected longest run of matches is doubled. Thus, for DNA sequences of length 100 and p = 0.25 (equal representation of each nucleotide), the longest expected run of matches is 2 × log1/p(n) = 2 × log4(100) = 2 × loge(100)/loge(4) = 2 × 4.605/1.386 = 6.65, the same number as in the coin-tossing experiment. This number corresponds to the longest subalignment that can be expected between two random sequences of this length and composition. A more precise formula for the expectation value, or mean, of the longest match M and its variance has been derived (Arratia et al. 1986; Waterman et al. 1987; Waterman 1989). It also applies when there are k mismatches in the alignment, except that an additional term proportional to k log1/p log1/p(qmn) appears in the equation, with a constant of proportionality that depends on k (Arratia et al. 1986). The log log term is small and can be replaced by a constant (Mott 1992), and simulations also suggest that it is not important (Altschul and Gish 1996). Altschul and Gish (1996) have found a better fit to the formula when the length of each sequence is reduced by the expected length of a match. In the example given above with two sequences of length 100, the expected length of a match was 6.65. As the sequences slide along each other, it is not possible to have end overlaps shorter than 7 because there is not enough sequence remaining. Hence, the effective length of the sequences is 100 − 7 = 93 (Altschul and Gish 1996). This correction is also used in the calculation of statistical significance by the BLAST algorithm. This logarithmic relationship between the expected longest match and sequence length is fundamentally important for calculating the statistical significance of alignment scores.
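The Erdös and Rényi estimate described above is simple to compute. A minimal sketch, reproducing the numbers used in the text:

import math

def match_probability(freqs):
    # p = sum of squared character frequencies (pA^2 + pC^2 + pG^2 + pT^2)
    return sum(f * f for f in freqs.values())

def expected_longest_run(m, n, p):
    # Erdos-Renyi law: expected longest run of matches is log base 1/p of (m*n)
    return math.log(m * n) / math.log(1.0 / p)

p = match_probability({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})  # 0.25
print(round(expected_longest_run(100, 100, p), 2))   # about 6.64, as in the text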
Basically, this relationship states that as the lengths of random or unrelated sequences increase, the mean of the highest possible local alignment scores will be proportional to the logarithm of the product of the sequence lengths, or twice the logarithm of the sequence length if the lengths are equal (since log (nn) = 2 log n). It also predicts a constant variance among the scores of random or unrelated sequences, and this prediction is also borne out by experiment. It is important to emphasize once again that this relationship depends on the use of scoring parameters appropriate for a local alignment algorithm, such as +1 for a match and −0.9 for a mismatch, or a scoring matrix that scores the average aligned position as negative, and also on the use of sufficiently large gap penalties. Such a scoring system gives rise to positive scoring regions only rarely, and the significance of their scores can then be estimated as described herein. Another way of describing the result uses a different parameter, λ, where λ = loge(1/p) (Karlin and Altschul 1990). Recall that p is the probability of a match between the same two characters, given above as 1/4 for matching a random pair of DNA bases, assuming equal representation of each base in the sequences. p may also be calculated as the probability of a match averaged over scoring matrix and sequence composition values. In practice, it is λ that is more commonly used with scoring matrix values; the calculation of λ and also of K is described below and in more detail on the book Web site. In sequence analysis, it is more useful to work with alignment scores than with alignment lengths. The expected or mean length of the longest match between two random sequences, given by the above relationships, can easily be converted to an alignment score by using the match and mismatch or scoring matrix values, along with some simple normalization; thus, in addition to predicting the length, these relationships can also predict the mean score.

10.1.3 Alignments with Gaps

It was predicted on mathematical grounds, and shown experimentally, that a similar type of analysis holds for sequence alignments that include gaps (Smith et al. 1985). When Smith et al. (1985) optimally aligned a large number of unrelated vertebrate and viral DNA sequences of different lengths (n and m) and their complements to each other, using a dynamic programming local alignment method with a score of +1 for matches, −0.9 for mismatches, and −2 for a single gap penalty (longer gaps were not considered, in order to simplify the analysis), a plot of the similarity score (S) versus log1/p(nm) produced a straight line with approximately constant variance. This result is as expected from the above model, except that with the inclusion of gaps the slope was increased; the fitted relationship was of the form Smean = 2.55 log1/p(mn) − 8.99, with constant standard deviation σ = 1.78. This result was then used to calculate how many standard deviations separated the predicted mean of the local alignment scores for unrelated sequences from the scores for test pairs of sequences. If the actual alignment score exceeds the predicted Smean by several standard deviations, the alignment score should be significant.
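A minimal sketch of this significance estimate, using the fitted constants quoted above (slope 2.55, intercept −8.99, σ = 1.78) and the sequence data from the worked example in the next paragraph:

import math

def s_mean(m, n, p, slope=2.55, intercept=-8.99):
    # mean local alignment score expected between unrelated sequences,
    # from the linear fit of Smith et al. (1985)
    return slope * math.log(m * n) / math.log(1.0 / p) + intercept

mean = s_mean(2948, 431, 0.279)
z = (37.20 - mean) / 1.78          # standard deviations above the random mean
print(round(mean, 1), round(z, 1)) # about 19.1 and 10.2, as in the text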
For example, the expected score between two unrelated sequences of lengths 2948 and 431, with average p = 0.279, is Smean = 2.55 × log1/0.279(2948 × 431) − 8.99 = 2.55 × loge(2948 × 431)/loge(1/0.279) − 8.99 = 2.55 × 14.1/1.28 − 8.99 = 28.1 − 8.99 = 19.1. The actual optimal alignment score between the two real sequences of these lengths was 37.20, which exceeds the score expected for random sequences by (37.20 − 19.1)/1.78 = 10.2 standard deviations. Is this number of standard deviations significant? Smith et al. (1985) and Waterman (1989) suggested the use of a conservative statistic known as Chebyshev's inequality, which is valid for many probability distributions: the probability that a random variable exceeds its mean by k standard deviations is less than or equal to 1/k². In this example, where the actual score is 10 standard deviations above the mean, the probability is (1/10)² = 0.01. Waterman (1989) noted that for low mismatch and gap penalties, e.g., +1 for matches, −0.5 for mismatches, and −0.5 for a single gap penalty, the alignment scores between random sequences estimated as above are not accurate, because the score increases linearly with sequence length instead of with the logarithm of the length. The linear relationship arises when the alignment is more global in nature, and the logarithmic relationship when it is local. Waterman (1989) fitted alignment scores from a large number of randomly generated DNA sequences of varying lengths to either the log(n) relationship expected for high-valued mismatch and gap penalties or the linear relationship expected for low-valued ones. The results provide the mean and standard deviation of an alignment score for several scoring schemes, assuming a constant gap penalty. With further mathematical analysis, it became apparent that the scores of alignments between random or unrelated sequences follow a distribution called the Gumbel extreme value distribution (Arratia et al. 1986; Karlin and Altschul 1990). This type of distribution is typical of values that are the highest or best scores of a variable, such as the longest run of heads in the coin-tossing experiments discussed previously. Subsequently, S. Karlin and S. Altschul (1990, 1993) further developed the use of this distribution for evaluating the significance of ungapped segments in comparisons between a test sequence and a sequence database using the BLAST program (for review, see Altschul et al. 1994). The method is also used for evaluating the statistical features of repeats and of amino acid patterns and clusters within the same sequence (Karlin and Altschul 1990; Karlin et al. 1991). The SAPS program, developed by S. Karlin and colleagues at Stanford University and available at http://ulrec3.unil.ch/software/software.html, provides this type of analysis. The extreme value distribution is now widely used for evaluating the significance of the scores of local alignments of DNA and protein sequences, especially in the context of database similarity searches.

10.1.4 The Gumbel Extreme Value Distribution

When two sequences have been aligned optimally, the significance of a local alignment score can be tested on the basis of the distribution of scores expected by aligning two random sequences of the same length and composition as the two test sequences (Karlin and Altschul 1990; Altschul et al. 1994; Altschul and Gish 1996).
These random sequence alignment scores follow a distribution called the extreme value distribution, which is somewhat like a normal distribution but with a positively skewed tail in the higher score range. When a set of values of a variable is obtained in an experiment, biologists are used to calculating the mean and standard deviation of the entire set, assuming that the distribution of values follows the normal distribution. For sequence alignments, this procedure would be like obtaining many different alignments, both good and bad, and averaging all of their scores. However, the biologically interesting alignments are those that give the highest possible scores; lower scores are not of interest. The experiment, then, is one of obtaining a set of values, keeping only the highest value, and discarding the rest. The focus changes from the statistical question of knowing the average score of random sequences to that of knowing how high a value will be obtained the next time another set of alignment scores of random sequences is generated. The distribution of alignment scores between random sequences therefore follows the extreme value distribution, not the normal distribution. After many alignments, a probability distribution of highest values will be obtained. The goal is to evaluate the probability that a score between random or unrelated sequences will reach the score found between two real sequences of interest. If that probability is very low, the alignment score between the real sequences is significant. Compared to the normal probability distribution, whose density is Yn = (1/√(2π)) exp(−x²/2), the extreme value distribution has density Yev = exp(−x − e⁻ˣ), whose tail decays much more slowly on the high-score side.

10.1.5 Methods for Calculating the Parameters of the Extreme Value Distribution

In the analysis of Altschul and Gish (1996), 10,000 random amino acid sequences of variable lengths were aligned using the Smith-Waterman method and a combination of a scoring matrix and a reasonable set of gap penalties for that matrix. The scores found by this method followed the extreme value distribution predicted by the underlying statistical theory. Values of K and λ were then estimated for each combination by fitting the data to the predicted extreme value distribution; readers should consult Tables V-VII in Altschul and Gish (1996) for a detailed list of the gap penalties tested. Altschul and Gish (1996) have cautioned users of these statistical parameters. First, the parameters were generated by alignment of random sequences that were produced assuming a particular amino acid distribution, which may be a poor model for some proteins. Second, the accuracy of λ and K cannot be estimated easily. Finally, for gap costs that give values of H < 0.15, the optimal alignment length is a significant fraction of the sequence lengths and produces a source of error called the edge effect. The effect occurs when the expected length of an alignment is a significant fraction of the sequence length and, as discussed earlier, alignments between sequences that overlap at their ends cannot be completed. The expected length is then subtracted from the sequence length before λ is estimated.
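Once K and λ have been estimated, the probability that the best local alignment score between two random sequences reaches a given value S follows from the extreme value distribution as P = 1 − exp(−Kmn e^(−λS)) (Karlin and Altschul 1990). A minimal sketch; the parameter values and sequence lengths below are arbitrary illustrations, not values from any published table:

import math

def evd_probability(S, m, n, K, lam):
    # Karlin-Altschul formula: probability that a chance local alignment
    # between random sequences of lengths m and n scores S or more
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * S))

# hypothetical parameters for some scoring system
K, lam = 0.13, 0.32
print(evd_probability(60, 250, 300, K, lam))   # a very small P suggests significance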
If no such edge correction is made, λ may be overestimated. These tabulated gap penalty values should also not be construed as the best or only choices for a given pair of sequences simply because the statistical parameters are available for them; choosing a gap penalty remains a matter of reasoned judgment. In exploring the effects of varying the gap penalty, it is important to recognize that as the gap penalty is lowered, the alignments produced will have more gaps and will eventually change from a local to a global type of alignment, even though a local alignment program is being used. In contrast, very large gap penalties give higher H values and produce alignments with no gaps, suggesting an increased ability to discriminate between related and unrelated sequences. In this respect, Altschul and Gish (1996) note that beyond a certain point, increasing the gap extension penalty does not change the parameters, indicating that most gaps in their simulations are probably of length 1. However, reducing the gap penalty can also allow an alignment to be extended and thereby create a higher-scoring alignment. Eventually, though, the optimal local alignment score between unrelated sequences loses its logarithmic relationship to sequence length and becomes a linear function of it; at that point, the gap penalties are no longer useful for obtaining local alignments, and the above statistical relationships are no longer valid. The higher the H value, the better the matrix can distinguish related from unrelated sequences. The lower the value of H, the longer the expected alignment; such conditions may be preferable if a longer alignment region is required, such as when testing a structural or functional model of a sequence by producing an alignment. Conversely, scoring parameters giving higher values of H should produce shorter, more compact alignments. If H < 0.15, the alignments may be very long, and the sequences then have a shorter effective length, since alignments starting near the ends of the sequences cannot be completed; this edge effect can lead to an overestimation of λ but can be corrected (Altschul and Gish 1996). Unfortunately, the above method for calculating the significance of an alignment score may not be used to test the significance of a global alignment score; the theory does not apply when these same substitution matrices are used for global alignments. Transformation of these matrices by adding a fixed constant value to each entry, or by multiplying each value by a constant, has no effect on the relative scores of a series of global alignments, so there is no theoretical basis for a statistical analysis of such scores as there is for local alignments (Altschul 1991). As discussed, two programs are commonly used for database similarity searches: FASTA and BLAST. Both calculate the statistical significance of the higher scores found with similar sequences, but the types of analyses used to determine that significance are somewhat different. BLAST uses the values of K and λ found by aligning random sequences, with n and m shortened to compensate for the inability of sequence ends to align. FASTA calculates the statistical significance from the distribution of scores with unrelated sequences found during the database search itself.
In effect, the mean and standard deviation of the low scores found in a given length range are calculated. These scores represent the expected range of scores of unrelated sequences for that sequence length (recall that local alignment scores increase as the logarithm of the sequence length). The number of standard deviations from this mean to the high scores of related sequences in the same length range (the z score) is then determined, and the significance of this z score is calculated according to the extreme value distribution expected of z scores. This method is discussed in greater detail below. Pearson (1996) showed that these two methods are equally useful in database similarity searches for detecting sequences more distantly related to the input query sequence. Pearson (1996) also determined the influence of scoring matrices and gap penalties on the alignment scores of moderately related and distantly related protein sequences in the same family. For two examples of moderately related sequences, the choice of scoring matrix and gap penalties (gap opening penalty followed by the penalty for each additional gap position) did not matter: BLOSUM50 −12/−2, BLOSUM62 −8/−2, Gonnet93 −10/−2, and PAM250 −12/−2 all produced statistically significant scores. The scores of distantly related proteins in the same family depended more on the choice of scoring matrix and gap penalty, and some scores were significant while others were not. Pearson recommends caution in evaluating alignment scores based on only one particular combination of scoring matrix and gap penalties. He also suggests that using a larger gap penalty, e.g., −14/−2 with BLOSUM50, can increase the selectivity of a database search for similarity (fewer sequences known to be unrelated will receive a significant alignment score). A difficulty encountered by FASTA in calculating statistical parameters during a database search is that of distinguishing unrelated from related sequences, because only the scores of unrelated sequences should be used. As score and sequence length information is accumulated during the search, the scores will include high, intermediate, and sometimes low scores of sequences that are related to the query sequence, as well as low scores and sometimes intermediate and even high scores of unrelated sequences. For example, a high score with an unrelated database sequence can occur because the database sequence has a region of low complexity, such as a high proportion of one amino acid. Regardless of the reason, these high scores must be pruned from the search if accurate statistical estimates are to be made. Pearson (1998) devised several such pruning schemes and determined the influence of each scheme on the success of a database search at demonstrating statistically significant alignment scores among members of the same protein family or superfamily; no particular scheme proved better than another. The above method does not necessarily ensure that the choice of scoring matrix and gap penalties provides a realistic set of local alignment scores. In the comparable situation of matching a test sequence to a database of sequences, the scores also follow the extreme value distribution.
For this situation, Mott (1992) has explained that for local alignments the end point of the alignment should, on average, be about halfway along the query sequence, whereas for global alignments the end point will lie beyond that halfway point. Pearson (1996) has pointed out that the presence of known, unrelated sequences in the upper part of the curve, where E < 1, can be an indication of an inappropriate scoring system.

10.1.6 The Statistical Significance of Individual Alignment Scores between Sequences and the Significance of Scores Found in a Database Search Are Calculated Differently

In performing a database search between a query sequence and a sequence database, a new comparison is made for each sequence in the database. Alignment scores between unrelated sequences are used by FASTA to calculate the parameters of the extreme value distribution; the probability that scores between unrelated sequences could reach as high as those found for matched sequences can then be calculated (Pearson 1998). Similarly, in the database similarity search program BLAST, estimates of the statistical parameters are calculated from the scoring matrix and the sequence composition, and the parameters are then used to calculate the probability of finding conserved patterns by chance alignment of unrelated sequences (Altschul et al. 1994). When performing such database searches, many trials are made in order to find the most strongly matching sequences. As more and more comparisons between unrelated sequences are made, the chance that one of the alignment scores will be the highest one yet found increases. The probability of finding a match by chance is therefore higher than the value calculated for a single sequence pair. The length of the query sequence is about the same as it would be in a normal sequence alignment, but the effective database sequence is very large and represents many different sequences, each one a different test alignment. Theory shows that the Poisson distribution should apply (Karlin and Altschul 1990, 1993; Altschul et al. 1994), as it did above for estimating the parameters of the extreme value distribution from many alignments between random sequences. If the probability of observing a given score in a single comparison is s, the probability of observing, in a database of D sequences, no alignment reaching that score is given by e^(−Ds), and that of observing at least one such score is P = 1 − e^(−Ds). For the range of values of P that are of interest, i.e., P < 0.1, P ≈ Ds. If two sequences are aligned by PRSS, as in the example given earlier, two levels of significance must be considered. The probability of the score may first be calculated using the estimates of λ and K: in the phage repressor alignment, P(s ≥ 401) = 3.7 × 10⁻²⁷. However, to estimate the extreme value parameters, 1000 shuffled sequences were compared, and the probability that one of those sequences would score as high as 401 is given by Ds, or 1000 × 3.7 × 10⁻²⁷ = 3.7 × 10⁻²⁴. These numbers are also shown in the statistical estimates computed by PRSS. Finally, if the score had arisen from a database search of 50,000 sequences, the probability of a score of 401 among this many sequence alignments would be 50,000 × 3.7 × 10⁻²⁷ ≈ 1.9 × 10⁻²², still a very small number, but 50,000 times larger than that for a single comparison.
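A minimal sketch of this multiple-comparison correction, reproducing the phage repressor numbers given above:

import math

def corrected_p(p_single, trials):
    # Poisson model: chance of at least one such score in `trials` comparisons;
    # approximately trials * p_single when that product is small
    return 1.0 - math.exp(-trials * p_single)

p1 = 3.7e-27                       # P(s >= 401) for a single comparison
print(corrected_p(p1, 1000))       # about 3.7e-24 for the 1000 shuffles
print(corrected_p(p1, 50000))      # about 1.9e-22 for a 50,000-sequence search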
These probability calculations are used for reporting the significance of scores with database sequences by FASTA and BLAST, as described.

10.1.7 FASTA and BLAST

The FASTA heuristic belongs to the FAST family of programs for sequence database search. All the heuristics in the FAST family follow three basic steps: I. Use a heuristic to find a good local alignment of the query U with each database sequence V1, ..., VN, and return only those alignments whose score is greater than T, where T is a threshold parameter chosen in advance. Let σi be the initial score of the alignment (returned by the heuristic) between U and Vi. II. Sort the σi in non-increasing order: σ1 ≥ σ2 ≥ ... ≥ σM > T. III. For the highest-ranking sequences, compute the optimized score using the dynamic programming algorithm, and return the top few alignments ranked according to their optimized score. "In practice, when the sequences are truly related, the optimized score is usually significantly higher than the initial score. This observation often helps distinguish between good alignments occurring by chance and true relationships." [3] The general idea behind the heuristic is very simple: first, we quickly compute a coarse estimate of the scores of the sequences, and then we refine the scores for the few top-ranking ones (using dynamic programming). Clearly, the success of this technique depends largely on the effectiveness of the heuristic chosen for step I.

Intuition for the heuristic. As the dynamic programming algorithm for sequence alignment fills out its table, the back-references along the diagonals correspond to matches in the alignment, while the vertical and horizontal back-references correspond to gaps. Intuitively, we want to find long diagonals in the dynamic programming table without actually going through the trouble of filling it out. When good matches (long, possibly broken diagonals) are found by the heuristic, they are merged together. This behavior characterizes the whole FAST family; FASTA itself takes an extra step: "after the best regions have been selected, [FASTA] tries to join nearby regions, even if they do not belong to the same diagonal". We therefore arrive at the following layout for an algorithm that computes σi (aligning U with Vi): I. Identify long diagonals in the dynamic programming table; if two diagonals are close by (the next one starts after the previous one ends), merge them. II. Choose the k0 largest diagonals. III. For each of the diagonals, find its score using some scoring matrix chosen in advance. IV. Return the top score. The most time-consuming part of the above procedure again lies in step I, so let us examine how this step could be performed efficiently. The key observation is that the alphabet over which the sequences are given is much smaller than the sequences themselves, so the short words occurring in the query can be indexed in a lookup table.

BLAST

BLAST is an acronym for Basic Local Alignment Search Tool. "The BLAST programs are among the most frequently used to search sequence databases worldwide". BLAST works by finding seeds, "which are short segment pairs between the query and a database sequence".
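Before the extension step is described in detail, the following minimal Python sketch shows the simplest form of seed finding: exact matches of words of length w, found through a hash table of query words. It is a deliberate simplification of what the next paragraphs describe; for protein searches, BLAST's word list also includes "neighborhood" words scoring at least T against a query word, which this sketch omits, and the example sequences are invented.

    # Hedged sketch: exact w-mer seeds between a query and one subject sequence.
    from collections import defaultdict

    def find_seeds(query, subject, w=3):
        """Return (query_pos, subject_pos) pairs where identical w-mers occur."""
        index = defaultdict(list)               # hash table of query w-mers
        for i in range(len(query) - w + 1):
            index[query[i:i + w]].append(i)
        seeds = []
        for j in range(len(subject) - w + 1):
            for i in index.get(subject[j:j + w], []):
                seeds.append((i, j))            # a hit: the same word at i and j
        return seeds

    print(find_seeds("GACGACCATAGACCAG", "TTCGACCATTTT"))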
The seeds are then extended in both directions until the maximum possible score for an extension of the particular seed is reached (mechanisms are in place to decide when an extension should terminate without getting lost in local maxima). The algorithm follows three major steps:

I. Preprocessing the query. Find all words x of length w (called w-mers, where w is a parameter of the algorithm that varies according to whether DNA or protein sequences are being compared) such that there is a substring y of length w in U with score(x, y) ≥ T, where T is another (threshold) parameter of the algorithm. In other words, a list of high-scoring w-mers is computed; note that this list may not contain all query w-mers, e.g., if a w-mer consists of common amino acids, its score with itself may fall below T, and it may be left out [3]. All such words are stored in a hash table X = {x1, ..., xn}, so that membership queries y ∈ X take O(1) time.

II. Finding hits. Each database sequence Vi is scanned, and all substrings of length w in Vi that are in X (the seeds) are recorded in a list {y1, ..., yk}.

III. Extending seeds. Each yi is extended in both directions: the seed grows as long as matches can be found; eventually, the score function starts decreasing. However, the cut-off does not happen immediately, since one wants to avoid local maxima; instead, the extension stops when the value of the score function drops below cSmax for some constant c < 1, where Smax is the maximum score seen so far. A sketch of the statistical analysis of how the cut-off point should be determined is given below. Clearly, this step is the most expensive one, since the number of hits and the sizes of the extensions can be large.

An improvement for step II can be achieved by replacing the hash table with a deterministic finite automaton that has transitions for each character in the alphabet and that recognizes, with its states, the words from the list of high-scoring words. In effect, the automaton moves a window over the sequences in the database in a very computationally cheap manner: one transition is made per character. However, since the most expensive step is step III, the improvement obtained from using a finite automaton is less significant than one might expect.

[3] gives a sketch of the main points of the statistical theory behind the extensions performed by BLAST, which allows one to derive the various parameters needed by BLAST. Given two random sequences s and t of lengths m and n, the following approximations can be obtained. Given a matrix of replacement costs sij for the pairs of characters in the alphabet and the probability pi of occurrence of each individual character in the sequences, we first compute a value λ by solving the equation

    Σi,j pi pj e^(λ sij) = 1

The parameter λ is the unique positive solution to this equation and can be obtained by Newton's method. Once λ is known, the expected number of distinct segment pairs between s and t with score above S is

    K m n e^(-λS)

where K is a calculable constant. In fact, the distribution of the number of segment pairs scoring above S is a Poisson distribution with mean given by the previous formula. From this, it is easy to derive expressions for useful quantities such as the average score, intervals within which the score will fall 90% of the time, and so on. (A small numerical sketch of this calculation follows the note below on obtaining BLAST.)

BLAST Programs

The BLAST program can either be downloaded and run as a command-line utility, "blastall", or accessed for free over the web.
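Returning to the statistical theory above, here is a hedged numerical sketch of obtaining λ by Newton's method and evaluating the expected number of hits. The two-letter alphabet, its frequencies, the toy score matrix, and the value of K are all made-up illustrations; K, in particular, is simply assumed rather than computed from the theory.

    # Hedged sketch: solve sum_ij p_i p_j e^(lambda*s_ij) = 1 for lambda,
    # then evaluate E = K*m*n*e^(-lambda*S). All numbers are toy values.
    import math

    p = {"A": 0.5, "B": 0.5}                         # character frequencies
    s = {("A", "A"): 1, ("B", "B"): 1,
         ("A", "B"): -2, ("B", "A"): -2}             # toy score matrix

    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * s[a, b])
                   for a in p for b in p) - 1.0

    def fprime(lam):
        return sum(p[a] * p[b] * s[a, b] * math.exp(lam * s[a, b])
                   for a in p for b in p)

    lam = 1.0                                        # start right of the root
    for _ in range(50):                              # Newton iterations
        step = f(lam) / fprime(lam)
        lam -= step
        if abs(step) < 1e-12:
            break

    K, m, n, S = 0.1, 200, 10**6, 40                 # K assumed for illustration
    print(round(lam, 4), K * m * n * math.exp(-lam * S))

For this toy matrix the equation can even be solved exactly (λ = ln((1 + √5)/2) ≈ 0.4812), which provides a check on the Newton iteration.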
The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms. BLAST is actually a family of programs (all included in the blastall executable). The following are some of the programs, ranked mostly in order of importance:

· Nucleotide-nucleotide BLAST (blastn): Given a DNA query, this program returns the most similar DNA sequences from the DNA database that the user specifies.

· Protein-protein BLAST (blastp): Given a protein query, this program returns the most similar protein sequences from the protein database that the user specifies.

· Position-Specific Iterative BLAST (PSI-BLAST): One of the more recent BLAST programs, used for finding distant relatives of a protein. First, a list of closely related proteins is created. These proteins are then combined into a "profile", a sort of average sequence. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive at picking up distant evolutionary relationships than the standard protein-protein BLAST.

· Nucleotide 6-frame translation-protein (blastx): This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

· Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.

· Protein-nucleotide 6-frame translation (tblastn): This program compares a protein query against the six-frame translations of a nucleotide sequence database.

· Large numbers of query sequences (megablast): When comparing large numbers of input sequences via the command-line BLAST, megablast is much faster than running BLAST multiple times. It concatenates many input sequences to form one large sequence before searching the BLAST database, then post-analyzes the search results to glean the individual alignments and statistical values.

Check your progress:
1. List the various types of BLAST programs available.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

10.2 Let us Sum up

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. A significance question arises when comparing two sequences that are not so clearly similar but are shown to align in a promising way.
In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences, or would just as likely be found if the sequences were not related. A significance test is also needed to evaluate the results of a database search for sequences that are similar to a query sequence. In this lesson, we have tried to describe the various programs and methods for the assessment of significance.

10.3 Lesson end activities
1. What is the significance of global alignment and local alignment?

10.4 Check your progress: Model answers
1. Your answer must include these points: BLASTP compares an amino acid query sequence against a protein sequence database; BLASTN compares a nucleotide query sequence against a nucleotide sequence database; BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database; TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands); TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

10.5 Points for Discussion
1. Closely analyse and comment on the problems with global alignment of sequences.
2. Explain the statistical significance of individual alignment scores between sequences.

10.6 References
1. Burke,J., Wang,H., Hide,W. and Davison,D.B. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.
2. Eddy,S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.
3. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.
4. Gordon,D., Desmarais,C. and Green,P. (2001) Automated finishing with autofinish. Genome Res., 11, 614–625.
5. Gotoh,O. (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. CABIOS, 9, 361–370.
6. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
7. Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
8. Irizarry,K., Kustanovich,V., Li,C., Brown,N., Nelson,S., Wong, and Lee,C. (2000) Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nature Genet., 26, 233–236.
9. Irizarry,K., Hu,G., Wong,M.L., Licinio,J. and Lee,C. (2001) Single-nucleotide polymorphism identification in candidate gene systems of obesity. Pharmacogenomics J., 1, 193–203.
10. Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415.
UNIT III

LESSON - 11 PHYLOGENETIC ANALYSIS

11.0 Aims and Objectives
11.1 Cluster Analysis
11.1.1 Statistical Significance Testing
11.1.2 Area of Application
11.1.3 Joining (Tree Clustering)
11.1.4 Two-way Joining
11.1.5 k-Means Clustering
11.1.6 EM (Expectation Maximization) Clustering
11.1.7 Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
11.2 Let us Sum up
11.3 Lesson end activities
11.4 Check your progress
11.5 Points for Discussion
11.6 References

11.0 Aims and Objectives

This lesson discusses evolutionary analysis and cluster analysis: the general purpose of clustering, statistical significance testing, areas of application, joining (tree clustering), two-way joining, k-means clustering, EM (expectation maximization) clustering, and finding the right number of clusters in k-means and EM clustering by v-fold cross-validation.

11.1 Cluster analysis

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation or interpretation; in other words, cluster analysis simply discovers structures in data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. There are countless examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation, the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.

11.1.1 Statistical Significance Testing

Note that the above discussion refers to clustering algorithms and does not mention anything about statistical significance testing.
In fact, cluster analysis is not so much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well-defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in k-means clustering).

11.1.2 Area of Application

Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archaeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.

11.1.3 Joining (Tree Clustering)
· Hierarchical Tree
· Distance Measures
· Amalgamation or Linkage Rules

General Logic

The example in the General Purpose introduction illustrates the goal of the joining or tree clustering algorithm. The purpose of this algorithm is to join objects (e.g., animals) together into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree.

Hierarchical Tree

Consider a horizontal hierarchical tree plot. On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster. As a result, we link more and more objects together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in vertical icicle plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, then this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, one is able to detect clusters (branches) and interpret those branches.

Distance Measures

The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items.
In the previous example, the rule for grouping a number of diners was whether they shared the same table or not. These distances (similarities) can be based on a single dimension or on multiple dimensions, with each dimension representing a rule or condition for grouping objects. For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances. If we had a two- or three-dimensional space, this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler). However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances or some other derived measure of distance that is more meaningful to the researcher; it is up to the researcher to select the right method for his or her specific application.

Euclidean distance. This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:

    distance(x,y) = [Σi (xi - yi)²]^½

Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.

Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:

    distance(x,y) = Σi (xi - yi)²

City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

    distance(x,y) = Σi |xi - yi|

Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they differ on any one of the dimensions. The Chebychev distance is computed as:

    distance(x,y) = maxi |xi - yi|

Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance, computed as:

    distance(x,y) = (Σi |xi - yi|^p)^(1/r)

where r and p are user-defined parameters.
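As a concrete illustration of the measures just defined, here is a minimal Python sketch; the example vectors are invented, and the function names are ours rather than any particular package's.

    # Minimal sketch of the distance measures defined above.
    def euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def squared_euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    def city_block(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def chebychev(x, y):
        return max(abs(a - b) for a, b in zip(x, y))

    def power_distance(x, y, p, r):
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

    x, y = (1.0, 4.0, 2.0), (3.0, 1.0, 2.0)   # made-up example points
    print(euclidean(x, y), city_block(x, y), chebychev(x, y))
    print(power_distance(x, y, p=2, r=2))     # equals the Euclidean distance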
A few example calculations may demonstrate how this measure "behaves": parameter p controls the progressive weight that is placed on differences on individual dimensions, while parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are both equal to 2, then this distance is equal to the Euclidean distance.

Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:

    distance(x,y) = (number of xi ≠ yi) / i

where i is the number of dimensions.

Amalgamation or Linkage Rules

At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other; this method is called complete linkage. Numerous other linkage rules like these have been proposed.

Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."

Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.

Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous one) should be used when the cluster sizes are suspected to be greatly uneven.
Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.

Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions; in a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the distance between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.

Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.

Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details concerning this method. In general, this method is regarded as very efficient; however, it tends to create clusters of small size. For an overview of the other two general categories of clustering methods, see Two-way Joining and k-Means Clustering.

11.1.4 Two-way Joining
· Introduction
· Two-way Joining

Introduction

Previously, we have discussed this method in terms of "objects" that are to be clustered. In all other types of analyses, the research question of interest is usually expressed in terms of cases (observations) or variables. It turns out that the clustering of both may yield useful results. For example, imagine a study where a medical researcher has gathered data on different measures of physical fitness (variables) for a sample of heart patients (cases). The researcher may want to cluster cases (patients) to detect clusters of patients with similar syndromes. At the same time, the researcher may want to cluster variables (fitness measures) to detect clusters of measures that appear to tap similar physical abilities.

Two-way Joining

Given the discussion in the paragraph above concerning whether to cluster cases or variables, one may wonder why not cluster both simultaneously? Two-way joining is useful in the (relatively rare) circumstances when one expects that both cases and variables simultaneously contribute to the uncovering of meaningful patterns of clusters. For example, returning to the example above, the medical researcher may want to identify clusters of patients that are similar with regard to particular clusters of similar measures of physical fitness. The difficulty with interpreting these results may arise from the fact that the similarities between different clusters may pertain to (or be caused by) somewhat different subsets of variables. Thus, the resulting structure (clusters) is by nature not homogeneous. This may seem a bit confusing at first, and, indeed, compared to the other clustering methods described (Joining (Tree Clustering) and k-Means Clustering), two-way joining is probably the one least commonly used.
However, some researchers believe that this method offers a powerful exploratory data analysis tool (for more information, see the detailed description of this method in Hartigan, 1975).

11.1.5 k-Means Clustering
· Example
· Computations
· Interpretation of results

General logic

This method of clustering is very different from Joining (Tree Clustering) and Two-way Joining. Suppose that you already have hypotheses concerning the number of clusters in your cases or variables. You may want to "tell" the computer to form exactly 3 clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k-means clustering algorithm. In general, the k-means method will produce exactly k different clusters of greatest possible distinction. It should be mentioned that the best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data.

Example

In the physical fitness example, the medical researcher may have a "hunch" from clinical experience that her heart patients fall basically into three different categories with regard to physical fitness. She might wonder whether this intuition can be quantified, that is, whether a k-means cluster analysis of the physical fitness measures would indeed produce the three clusters of patients as expected. If so, the means on the different measures of physical fitness for each cluster would represent a quantitative way of expressing the researcher's hypothesis or intuition (i.e., patients in cluster 1 are high on measure 1, low on measure 2, etc.).

Computations

Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse." The program will start with k random clusters, and then move objects between those clusters with the goal to (1) minimize variability within clusters and (2) maximize variability between clusters. In other words, the similarity rules will apply maximally to the members of one cluster and minimally to members belonging to the rest of the clusters. This is analogous to "ANOVA in reverse" in the sense that the significance test in ANOVA evaluates the between-group variability against the within-group variability when computing the significance test for the hypothesis that the means in the groups are different from each other. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.

Interpretation of results

Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.

11.1.6 EM (Expectation Maximization) Clustering
· Introductory Overview
· The EM Algorithm

Introductory Overview

The methods described here are similar to the k-means algorithm described above, and you may want to review that section for a general overview of these techniques and their applications.
The general purpose of these techniques is to detect clusters in observations (or variables) and to assign those observations to the clusters. A typical example application for this type of analysis is a marketing research study in which a number of consumer behavior related variables are measured for a large sample of respondents. The purpose of the study is to detect "market segments," i.e., groups of respondents that are somehow more similar to each other (to all other members of the same cluster) than to respondents that "belong to" other clusters. In addition to identifying such clusters, it is usually equally of interest to determine how the clusters are different, i.e., to determine the specific variables or dimensions that vary, and how they vary, in regard to members of different clusters.

k-means clustering. To reiterate, the classic k-means algorithm was popularized and refined by Hartigan (1975; see also Hartigan and Wong, 1978). The basic operation of that algorithm is relatively simple: given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.

Extensions and generalizations. The EM (expectation maximization) algorithm extends this basic approach to clustering in two important ways:

1. Instead of assigning cases or observations to clusters so as to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster membership based on one or more probability distributions. The goal of the clustering algorithm is then to maximize the overall probability, or likelihood, of the data, given the (final) clusters.

2. Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).

The EM Algorithm

The EM algorithm for clustering is described in detail in Witten and Frank (2001). The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution.

Mixtures of distributions. The resulting distribution of values in the population is then a mixture of two normal distributions with different means and different standard deviations; only the mixture (sum) of the two distributions would be observed. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
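The logic just described can be made concrete with a minimal Python sketch of EM for a two-component mixture of normal distributions in one dimension; the data are simulated, the initialization is naive, and this is an illustration of the idea rather than how any particular statistics package implements it.

    # Hedged sketch of EM for a two-component 1-D Gaussian mixture.
    import math
    import random

    def pdf(x, mu, sd):
        return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    random.seed(1)
    data = ([random.gauss(0, 1) for _ in range(200)] +
            [random.gauss(5, 1.5) for _ in range(200)])   # simulated mixture

    # Naive initial guesses for weights, means, and standard deviations:
    w, mu, sd = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]

    for _ in range(50):
        # E-step: probability that each point belongs to each cluster
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], sd[k]) for k in range(2)]
            tot = sum(p)
            resp.append([pk / tot for pk in p])
        # M-step: re-estimate the parameters from the soft assignments
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sd[k] = math.sqrt(max(var, 1e-9))

    print([round(m, 2) for m in mu], [round(s, 2) for s in sd])

Note that the E-step produces exactly the classification probabilities discussed below: each observation belongs to each cluster with some probability, rather than being assigned outright.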
With the implementation of the EM algorithm in some computer programs, you may be able to select (for continuous variables) different distributions such as the normal, log-normal, and Poisson distributions. You can select different distributions for different variables and thus derive clusters for mixtures of different types of distributions.

Categorical variables. The EM algorithm can also accommodate categorical variables. The method will at first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters.

Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.

11.1.7 Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation

An important question that needs to be answered before applying the k-means or EM clustering algorithms is how many clusters there are in the data. This is not known a priori and, in fact, there might be no definite or unique answer as to what value k should take. In other words, k is a nuisance parameter of the clustering model. Luckily, an estimate of k can be obtained from the data using the method of cross-validation. Remember that the k-means and EM methods will determine cluster solutions for a particular user-defined number of clusters. The k-means and EM clustering techniques (described above) can be optimized and enhanced for typical applications in data mining. The general metaphor of data mining implies a situation in which an analyst searches for useful structures and "nuggets" in the data, usually without any strong a priori expectations of what might be found (in contrast to the hypothesis-testing approach of scientific research). In practice, the analyst usually does not know ahead of time how many clusters there might be in the sample. For that reason, some programs include an implementation of a v-fold cross-validation algorithm for automatically determining the number of clusters in the data. This algorithm is immensely useful in all general "pattern-recognition" tasks: to determine the number of market segments in a marketing research study, the number of distinct spending patterns in studies of consumer behavior, the number of clusters of different medical symptoms, the number of different types (clusters) of documents in text mining, the number of weather patterns in meteorological research, the number of defect patterns on silicon wafers, and so on.

The v-fold cross-validation algorithm applied to clustering. The v-fold cross-validation algorithm is described in some detail in Classification Trees and General Classification and Regression Trees (GC&RT). The general idea of this method is to divide the overall sample into a number of v folds.
The same type of analysis is then successively applied to the observations belonging to v - 1 folds (the training sample), and the results of the analyses are applied to sample v (the fold that was not used to estimate the parameters, build the tree, determine the clusters, etc.; this is the testing sample) to compute some index of predictive validity. The results for the v replications are aggregated (averaged) to yield a single measure of the stability of the respective model, i.e., the validity of the model for predicting new observations. Cluster analysis is an unsupervised learning technique, and we cannot observe the (real) number of clusters in the data. However, it is reasonable to replace the usual notion of "accuracy" (applicable to supervised learning) with that of "distance." In general, we can apply the v-fold cross-validation method to a range of numbers of clusters in k-means or EM clustering and observe the resulting average distance of the observations (in the cross-validation or testing samples) from their cluster centers (for k-means clustering); for EM clustering, an appropriate equivalent measure would be the average negative (log-)likelihood computed for the observations in the testing samples.

Check your progress:
1. List the different distance matrix methods.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Reviewing the results of v-fold cross-validation. The results of v-fold cross-validation are best reviewed in a simple line graph of the cost function against the number of clusters. Consider the result of analyzing a data set widely known to contain three clusters of observations (specifically, the well-known Iris data file reported by Fisher, 1936, and widely referenced in the literature on discriminant function analysis), compared with the results for analyzing simple normal random numbers. The "real" data exhibit the characteristic scree-plot pattern, where the cost function (in this case, twice the negative log-likelihood of the cross-validation data, given the estimated parameters) quickly decreases as the number of clusters increases, but then (past 3 clusters) levels off, and even increases as the data are overfitted. The random numbers, in contrast, show no such pattern; in fact, there is basically no decrease in the cost function at all, and it quickly begins to increase as the number of clusters increases and overfitting occurs. It is easy to see from this simple illustration how useful the v-fold cross-validation technique, applied to k-means and EM clustering, can be for determining the "right" number of clusters in the data.

11.2 Let us Sum up

Evolution is regarded as a branching process, whereby populations are altered over time and may speciate into separate branches, hybridize together again, or terminate by extinction. This may be visualized as a multidimensional character-space that a population moves through over time. The problem posed by phylogenetics is that genetic data are only available for the present, and fossil records (osteometric data) are sporadic and less reliable.
Our knowledge of how evolution operates is used to reconstruct the full tree. This lesson has discussed evolutionary analysis and cluster analysis: its general purpose, statistical significance testing, and areas of application.

11.3 Lesson end activities

Students will work in groups in an attempt to identify the evolutionary history of a particular group of organisms based on the protein sequences of various molecular analyses. Students will try to show how similar organisms are related using rooted and unrooted phylogenetic trees.

Procedure:
1. Form groups of 3 or 4 students to do the investigation.
2. Students will select a group of organisms and then get approval from the instructor. Organisms should be similar-appearing or taxonomically related organisms. It is first come, first served. You must have a minimum of 4 organisms that you will study. Possible animal and plant groups for study (some may be similar in appearance though not necessarily related by analysis):
Canines (Dog, Wolves, Coyote, Dingo, Hyena, African Wild dog, Foxes, etc.)
Felines (House cat, Lynx, Tiger, African lion, Mt. Lion, Jaguar, Leopard, Panther, Cheetah, etc.)
Bears (Polar, Black, Kodiak, Grizzly, Panda, etc.)
Trees (Red oak, White oak, Sugar maple, Norway maple, Am. elm, Ginkgo, Green spruce, etc.)
Mollusks (Slugs, Snails, Squid, Octopus, Cuttlefish, Clams, Oysters, etc.)
Arthropods (Insects, Crabs, Spiders, Centipedes, etc.)
Flowers (Roses, Tulip, Tiger lily, Daylily, Carnation, etc.)
Any other collections or subdivisions of plants or animals (specific insects, specific members of a particular taxonomic genus, family or order, types of ferns, etc.)
3. Identify each of the organisms by their scientific name.
4. Using at least 3 different protein compounds appropriate for the organisms (e.g., hemoglobin for higher animals, myoglobin in muscle, enolase for most organism comparisons, cytochrome c in all organisms, etc.), create and print (or save to disk) the BW evolutionary trees for those organisms and the respective proteins.
5. Based on these trees, create a composite tree on a web page with pictures obtained from websites.

11.4 Check your progress: Model answers
1. Your answer must include these points:
i. Neighbor-joining
ii. Fitch-Margoliash method
iii. Using outgroups

11.5 Points for Discussion
1. The role of statistics in clustering is vital - Discuss.
2. "Phylogenetic analysis helps us understand the relationship between ourselves and our ancestors in a lucid manner" - Substantiate.

11.6 References
1. Ackerly, D. D. 1999. Comparative plant ecology and the role of phylogenetic information. Pages 391-413 in M. C. Press, J. D. Scholes, and M. G. Braker, eds. Physiological plant ecology. The 39th symposium of the British Ecological Society held at the University of York 7-9 September 1998. Blackwell Science, Oxford, U.K.
2. Berenbrink, M., P. Koldkjær, O. Kepp, and A. R. Cossins. 2005. Evolution of oxygen secretion in fishes and the emergence of a complex physiological system. Science 307:1752-1757.
3. Blomberg, S. P., T. Garland, Jr., and A. R. Ives. 2003. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution 57:717-745.
4. Brooks, D. R., and D. A. McLennan. 1991. Phylogeny, ecology, and behavior: a research program in comparative biology. Univ. Chicago Press, Chicago. 434 pp.
5. Cheverud, J. M., M. M. Dow, and W. Leutenegger. 1985. The quantitative assessment of phylogenetic constraints in comparative analyses: sexual dimorphism in body weight among primates. Evolution 39:1335-1351.
6. Eggleton, P., and R. I. Vane-Wright, eds. 1994. Phylogenetics and ecology. Linnean Society Symposium Series Number 17. Academic Press, London.
7. Felsenstein, J. 1985. Phylogenies and the comparative method. American Naturalist 125:1-15.

LESSON - 12 ROOTED AND UNROOTED TREE

12.0 Aims and Objectives
12.1 Rooted and Unrooted Tree
12.1.1 Definition of a phylogenetic tree
12.1.2 Features of a phylogenetic tree
12.1.3 Unrooted trees
12.1.4 Rooted trees
12.2 Let us Sum up
12.3 Lesson end activities
12.4 Check your progress
12.5 Points for Discussion
12.6 References

12.0 Aims and Objectives

This lesson discusses rooted and unrooted trees: the definition of a phylogenetic tree, the features of a phylogenetic tree (branches, external and internal nodes), unrooted trees, and rooted trees.

12.1 Rooted and Unrooted tree

12.1.1 Definition of a phylogenetic tree

A tree is an acyclic connected graph that consists of a collection of nodes (internal and external) and branches connecting them, so that every node can be reached by a unique path from every other node.

[Figure 12.1: An unrooted phylogenetic tree joining four taxonomic units A, B, C, and D, with its branches, external nodes, and internal nodes labelled.]

12.1.2 Features of a phylogenetic tree

In the area of phylogenetic inference, trees are used as visual displays that represent hypothetical, reconstructed evolutionary events. The tree in this case consists of:

v nodes, which represent taxonomic units such as species or genes: the external nodes, those at the ends of the branches, represent living organisms, while the internal nodes represent ancestral units.
v The lengths of the branches, which usually represent elapsed time, measured in years, or may represent the number of molecular changes (e.g., mutations) that have taken place between the two nodes. This is calculated from the degree of difference when sequences are compared (refer to "alignments" later).
v Sometimes the lengths are irrelevant and the tree represents only the order of evolution. (In a dendrogram, only the lengths of the horizontal, or vertical as the case may be, branches count.)
v Finally, the tree may be rooted or unrooted.

Check your progress:
1. List the features of a phylogenetic tree.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

12.1.3 Unrooted trees

An unrooted tree simply represents phylogenetic relationships but does not define an evolutionary path. In an unrooted tree, an external node represents a contemporary organism. Internal nodes represent common ancestors of some of the external nodes. In this case, the tree shows the relationship between organisms A, B, C and D but does not tell us anything about the series of evolutionary events that led to these genes. There is also no way to tell whether or not a given internal node is a common ancestor of any external nodes.
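Before turning to rooted trees, the connection between the two kinds of tree can be made concrete with a small counting sketch in Python: an unrooted, fully bifurcating tree on n external nodes has 2n - 3 branches, and placing a root on any one branch yields a distinct rooted tree. The quartet below, with external nodes A-D and internal nodes X and Y, is a made-up example encoded as a simple edge list.

    # Hedged sketch: every branch of an unrooted tree is a possible root position.
    edges = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]

    leaves = {v for e in edges for v in e if v not in ("X", "Y")}
    assert len(edges) == 2 * len(leaves) - 3    # 5 branches for 4 external nodes

    for i, (u, v) in enumerate(edges, 1):
        print(f"rooted tree {i}: root placed on branch {u}-{v}")

The five rootings printed here correspond to the five rooted trees of Figure 12.2 below.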
12.1.4 Rooted trees

In the case of a rooted tree, one of the external nodes is used as an outgroup and, in essence, the root that it identifies becomes the common ancestor of all the other external nodes. The outgroup therefore enables the root of a tree to be located and the correct evolutionary pathway to be identified. In the above case, five different evolutionary pathways are possible, depending on where the root is placed, each depicted by a different rooted tree.

[Figure 12.2: The five rooted trees that can be drawn from the unrooted tree (box). The positions of the roots are indicated by numbers on the outline of the unrooted tree (box).]

12.2 Let us Sum up

This lesson has given a brief introduction to rooted and unrooted trees: the definition of a phylogenetic tree, the features of a phylogenetic tree (branches, external and internal nodes), unrooted trees, and rooted trees, with diagrams.

12.3 Lesson end activities
1. Why use several molecules to show the phylogeny of organisms?
2. Did you encounter any unusual patterns or branches? What might be a possible explanation for these?

12.4 Check your progress: Model answers
1. Your answer must include these points:
i. Internal nodes
ii. External nodes
iii. Lengths of the branches
iv. May be rooted or unrooted

12.5 Points for Discussion
1. "Phylogenetic trees are very useful in understanding our ancestors" - Justify.

12.6 References
1. Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Mass. xx + 664 pp.
2. Freckleton, R. P., P. H. Harvey, and M. Pagel. 2002. Phylogenetic analysis and comparative data: a test and review of evidence. American Naturalist 160:712-726.
3. Garland, T., Jr., and A. R. Ives. 2000. Using the past to predict the present: confidence intervals for regression equations in phylogenetic comparative methods. American Naturalist 155:346-364.
4. Garland, T., Jr., A. F. Bennett, and E. L. Rezende. 2005. Phylogenetic approaches in comparative physiology. Journal of Experimental Biology 208:3015-3035.
5. Garland, T., Jr., A. W. Dickerman, C. M. Janis, and J. A. Jones. 1993. Phylogenetic analysis of covariance by computer simulation. Systematic Biology 42:265-292.
6. Garland, T., Jr., P. H. Harvey, and A. R. Ives. 1992. Procedures for the analysis of comparative data using phylogenetically independent contrasts. Systematic Biology 41:18-32.
7. Gittleman, J. L., and M. Kot. 1990. Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology 39:227-241.
8. Grafen, A. 1989. The phylogenetic regression. Philosophical Transactions of the Royal Society of London B 326:119-157.
9. Harvey, P. H., and M. D. Pagel. 1991. The comparative method in evolutionary biology. Oxford University Press, Oxford. 239 pp.

LESSON - 13 BOOTSTRAPPING

13.0 Aims and Objectives
13.1 Inferred and true trees
13.2 Gene trees are not the same as species trees
13.3 Bootstrapping
13.4 Molecular sequences
13.5 Sequence alignment is the essential preliminary to tree construction
13.6 Let us Sum up
13.7 Lesson end activities
13.8 Check your progress
13.9 Points for Discussion
13.10 References

13.0 Aims and Objectives

This lesson discusses inferred and true trees, gene trees, bootstrapping, molecular sequences, and tree construction.

13.1 Inferred and true trees

The criteria used to choose an outgroup depend very much on the type of analysis that is carried out. Suppose that homologous (orthologous) genes in a tree come from human, chimpanzee, gorilla and orangutan. A useful homologous primate outgroup sequence is that from the baboon, as palaeontological evidence suggests that baboons branched away from the lineage leading to human, chimpanzee, gorilla and orangutan before the time of the common ancestor of the four species (Fig 13.1).

[Fig 13.1: The use of an outgroup (baboon) to root a phylogenetic tree of human, chimpanzee, gorilla and orangutan.]

We refer to the rooted tree given above as an inferred tree. This is to emphasise that it depicts the series of evolutionary events that are inferred from the data that were analysed, and may not necessarily be the same as the true tree, the one that depicts the actual series of events that occurred. Sometimes we can be fairly confident that the inferred tree is the true tree, but most phylogenetic data analyses are prone to uncertainties. Degrees of confidence can be assigned to the branching patterns in an inferred tree using bootstrap analysis (discussed in a later section). Due to the imprecise nature of phylogenetic analysis, controversies have arisen.

13.2 Gene trees are not the same as species trees

The above tree is a gene tree, i.e., a tree derived by comparing orthologous sequences (those derived from the same ancestral sequence). The assumption is that this gene tree is a more accurate reflection of a species tree than one that can be inferred from morphological data. This assumption is generally correct, but it does not mean that the gene tree is the same as the species tree. Mutation and speciation are not expected to occur at the same time. For example, the mutation event could precede the speciation event. This would mean that, to begin with, both alleles will still be present in the same unsplit population of the ancestral species. When the population split occurs, it is likely that both alleles will be present in each of the resulting groups. After the split, the new populations evolve independently. One possibility is that, as a result of random genetic drift, one allele is lost from one population and the other allele is lost from the second population. This establishes the two separate genetic lineages that were inferred from the phylogenetic analysis of the gene. How do these considerations affect the coincidence between a gene tree and a species tree? (a) If a molecular clock is used to date the time at which gene divergence took place, then it cannot be assumed that this is also the time of the speciation event. A significant difference between a gene event and a species event can exist even though the species tree and gene tree look the same (Fig 13.2). (b) If the first speciation event is followed closely by a second speciation event in one of the two populations, then the branching order of the gene tree might be different from that of the species tree. This can occur if the genes in the modern
This can occur if the genes in the modern species are derived from alleles that had already appeared before the first of the two speciation events (Fig 13.3).

[Fig 13.2: Gene tree and species tree look the same. However, mutation might precede speciation, giving an incorrect time for the latter if a molecular clock is used.]

[Fig 13.3: A gene tree can have a different branching order from a species tree.]

13.3 Bootstrapping
Tree reconstruction. In any molecular phylogenetic reconstruction the following points need to be addressed:
i. Molecular sequences
ii. Sequence alignment is the essential preliminary to tree reconstruction
iii. Converting the alignment data into a phylogenetic tree
iv. Assessing accuracy of a reconstructed tree
v. Molecular clocks enable the time of divergence of ancestral sequences to be estimated
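Point iv, assessing the accuracy of a reconstructed tree, is most often addressed by bootstrapping: the columns of the multiple alignment are resampled with replacement to give many pseudo-replicate alignments, a tree is built from each, and the fraction of replicates in which a branch reappears becomes a confidence value for that branch (the PHYLIP program seqboot, described in Lesson 14, automates this resampling). The following is a minimal illustrative sketch in Python of the resampling step only; the toy alignment is hypothetical and the tree-building step is omitted.

    import random

    def bootstrap_columns(alignment, seed=None):
        # Resample alignment columns with replacement: one bootstrap replicate.
        # alignment: list of equal-length sequence strings, one per taxon.
        rng = random.Random(seed)
        n_cols = len(alignment[0])
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        return ["".join(seq[c] for c in cols) for seq in alignment]

    # Toy alignment (hypothetical); build a tree from each replicate and
    # count how often each branch recurs to obtain bootstrap percentages.
    aln = ["AGCAATGGCC", "AGCTATGGAC", "AGCTATGCCC"]
    replicates = [bootstrap_columns(aln, seed=i) for i in range(100)]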
13.4 Molecular sequences
Nucleic acids (rRNA, DNA) and protein sequences are used in molecular phylogenetic tree construction. DNA yields more phylogenetic information than protein and has become by far the predominant molecule for phylogeny:
§ More statistical information from DNA data: The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes affect only the DNA sequence. Hence coding as well as non-coding regions of the genome can be examined. The following pair of sequences illustrates this: at the protein level there is only one difference, but at the nucleic acid level there are three.

Protein -gly-ala-ile-leu-asp-arg-
DNA     -gga-gcc-ata-tta-gat-aga-
DNA     -gga-gca-att-ttt-gat-aga-
Protein -gly-ala-ile-phe-asp-arg-

§ Ease of sequencing DNA: Samples for DNA sequencing can be prepared by PCR, which is an extremely easy technique.
Protein electrophoresis data, restriction fragment length polymorphisms (RFLP), simple sequence length polymorphisms (SSLP), single nucleotide polymorphisms (SNP) and DNA-DNA hybridization data have also been used for molecular phylogenetic reconstruction. Immunological data from cross-reactivity studies have been used for such work as well.

13.5 Sequence alignment is the essential preliminary to tree construction
This is the most important step in molecular phylogeny, and a number of issues have to be considered:
· Sequence homologs: Sequences that are to be aligned should be homologs; an example is the β-globin genes of different vertebrates. This satisfies the phylogeny criterion, which states that the sequences should be derived from a common ancestral sequence.
· Non-homologous sequences: If the sequences are not homologous, and hence do not share a common ancestor, phylogenetic construction methods will always produce a tree, but the tree will not be of any biological relevance. This type of error commonly occurs when undertaking homology analysis to assign functions to newly generated gene sequences. BLAST is used extensively as one of the homology analysis methods, and hence interpretation of the data arising from such analyses should be undertaken with care.
· Easy alignments: Correctly aligning the homologous sequences is the next task. In some cases it is easy. A simple sequence alignment is shown below:

Sequence AGCAATGGCCAGACAATAATG
Sequence AGCTATGGACAGACATTAATG
         *** **** ****** *****

· Difficult alignments: If sequences have evolved and diverged by accumulating insertions and deletions as well as point mutations, then these sequences are not always easy to align. Insertions and deletions cannot be distinguished when pairs of sequences are aligned, so we refer to them as indels. Below is a pair of difficult sequences for alignment, where placing the indel at the correct location can become a problem; two possible positions for the indel are shown:

Sequence GACGACCATAGACCAGCATAG
Sequence GACTACCATAGA-CTGCAAAG
         *** ******** * *** **

Sequence GACGACCATAGACCAGCATAG
Sequence GACTACCATAGACT-GCAAAG
         *** ********* *** **

· The dot matrix technique for alignment: Some alignments can easily be done by "eye-balling" the sequences, yet others may require pen and paper. The simplest method is known as the dot matrix method. The two sequences are written out on the x- and y-axes of graph paper, and dots are placed at the positions corresponding to identical nucleotides of the two sequences. The alignment is indicated by a diagonal series of dots, broken by empty squares where the sequences have nucleotide differences, and shifting from one column to another where indels occur. (A minimal computational sketch of this method is given after this list.)
· Similarity approach is a mathematically based alignment technique: The similarity approach (Needleman and Wunsch) aims to maximise the number of identically matched nucleotides in the two sequences. The distance method (Waterman), on the other hand, minimises the number of mismatches. Often the two approaches will identify the same alignment as being the best one.
· Multiple alignments are generated for more than two sequences: Rarely can one do multiple alignments with pen and paper, and all the steps required for phylogenetic analysis are undertaken on a computer. Several computer programs are available for automatically generating multiple alignments.
· rRNA genes (aka rDNA) and rRNA have been used as molecular chronometers in phylogenetic studies. Refer to the section on rRNA for detailed notes on the methods of aligning these types of nucleic acids.

Check your progress:
1. List the steps involved in bootstrapping tree construction.
Notes: a) Write your answer in the space given below. b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
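As flagged under the dot matrix bullet above, the method is straightforward to reproduce computationally. A minimal sketch in Python, using two short hypothetical DNA strings; a practical implementation would add a word size and a sliding window to suppress background dots:

    def dot_matrix(seq_x, seq_y):
        # Print a simple dot plot: '*' where the nucleotides match, '.' elsewhere.
        print("  " + " ".join(seq_x))
        for y_base in seq_y:
            row = ["*" if x_base == y_base else "." for x_base in seq_x]
            print(y_base + " " + " ".join(row))

    dot_matrix("AGCAATGGCC", "AGCTATGGAC")

The diagonal run of '*' characters in the printout corresponds to the aligned region; a diagonal that shifts to a neighbouring column marks an indel.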
Methods of Phylogenetic Analysis
Two major groups of analyses exist to examine phylogenetic relationships: phenetic methods and cladistic methods. It is important to note that phenetics and cladistics have had an uneasy relationship over the last 40 years or so. Most of today's evolutionary biologists favor cladistics, although a strictly cladistic approach may result in counterintuitive results.

Phenetic Method of Analysis
Phenetics, also known as numerical taxonomy, involves the use of various measures of overall similarity for the ranking of species. There is no restriction on the number or type of characters (data) that can be used, although all data must first be converted to a numerical value, without any character "weighting". Each organism is then compared with every other for all characters measured, and the number of similarities (or differences) is calculated. The organisms are then clustered in such a way that the most similar are grouped close together and the more different ones are linked more distantly. The taxonomic clusters, called phenograms, that result from such an analysis do not necessarily reflect genetic similarity or evolutionary relatedness. The lack of evolutionary significance in phenetics has meant that this system has had little impact on animal classification, and as a consequence, interest in and use of phenetics has been declining in recent years.

Cladistic Method of Analysis
An alternative approach to diagramming relationships between taxa is called cladistics. The basic assumption behind cladistics is that members of a group share a common evolutionary history. Thus, they are more closely related to one another than they are to other groups of organisms. Related groups of organisms are recognized because they share a set of unique features (apomorphies) that were not present in distant ancestors but which are shared by most or all of the organisms within the group. These shared derived characteristics are called synapomorphies. Therefore, in contrast to phenetics, cladistic groupings do not depend on whether organisms share physical traits but depend on their evolutionary relationships. Indeed, in cladistic analyses two organisms may share numerous characteristics but still be considered members of different groups.

Cladistic analysis entails a number of assumptions. For example, species are assumed to arise primarily by bifurcation, or separation, of the ancestral lineage; species are often considered to become extinct upon hybridization (crossbreeding); and hybridization is assumed to be rare or absent. In addition, cladistic groupings must possess the following characteristics: all species in a grouping must share a common ancestor, and all species derived from a common ancestor must be included in the taxon. The application of these requirements results in the following terms being used to describe the different ways in which groupings can be made:
· A monophyletic grouping is one in which all species share a common ancestor, and all species derived from that common ancestor are included. This is the only form of grouping accepted as valid by cladists.
· A paraphyletic grouping is one in which all species share a common ancestor, but not all species derived from that common ancestor are included.
· A polyphyletic grouping is one in which species that do not share an immediate common ancestor are lumped together, while excluding other members that would link them.

13.6 Let us Sum up
Macromolecular data, meaning gene (DNA) and protein sequences, are accumulating at an increasing rate because of recent advances in molecular biology. For the evolutionary biologist, the rapid accumulation of sequence data from whole genomes has been a major advance, because the very nature of DNA allows it to be used as a "document" of evolutionary history.
Comparisons of the DNA sequences of various genes between different organisms can tell a scientist a lot about the relationships of organisms that cannot otherwise be inferred from morphology, an organism's outer form and inner structure. Because genomes evolve by the gradual accumulation of mutations, the amount of nucleotide sequence difference between a pair of genomes from different organisms should indicate how recently those two genomes shared a common ancestor. Two genomes that diverged in the recent past should have fewer differences than two genomes whose common ancestor is more ancient. Therefore, by comparing different genomes with each other, it should be possible to derive evolutionary relationships between them, the major objective of molecular phylogenetics.

Molecular phylogenetics attempts to determine the rates and patterns of change occurring in DNA and proteins and to reconstruct the evolutionary history of genes and organisms. Two general approaches may be taken to obtain this information. In the first approach, scientists use DNA to study the evolution of an organism. In the second approach, different organisms are used to study the evolution of DNA. Whatever the approach, the general goal is to infer process from pattern: the processes of organismal evolution are deduced from patterns of DNA variation, and the processes of molecular evolution are inferred from the patterns of variation in the DNA itself.

13.7 Lesson end activities
1. Are there any ideas you have that might simplify the process of making the combined or composite tree to help others in the future?
2. Are there other ways of displaying the information obtained such that you or others can more easily understand it?

13.8 Check your progress: Model answers
1. Your answer must include these points:
1. Molecular sequences
2. Sequence alignment is the essential preliminary to tree reconstruction
3. Converting the alignment data into a phylogenetic tree
4. Assessing accuracy of a reconstructed tree
5. Molecular clocks enable the time of divergence of ancestral sequences to be estimated

13.9 Points for Discussion
1. "Sequence alignment is an essential process in tree construction" - Substantiate.

13.10 References
1. Housworth, E. A., E. P. Martins, and M. Lynch. 2004. The phylogenetic mixed model. American Naturalist 163:84-96.
2. Ives, A. R., P. E. Midford, and T. Garland, Jr. 2007. Within-species variation and measurement error in phylogenetic comparative methods. Systematic Biology 56:252-270.
3. Maddison, D. R. 1994. Phylogenetic methods for inferring the evolutionary history and process of change in discretely valued characters. Annual Review of Entomology 39:267-292.
4. Maddison, W. P. 1990. A method for testing the correlated evolution of two binary characters: are gains or losses concentrated on certain branches of a phylogenetic tree? Evolution 44:539-557.
5. Maddison, W. P., and D. R. Maddison. 1992. MacClade: analysis of phylogeny and character evolution. Version 3. Sinauer Associates, Sunderland, Mass. 398 pp.
6. Martins, E. P., ed. 1996. Phylogenies and the comparative method in animal behavior. Oxford University Press, Oxford. 415 pp.
7. Martins, E. P., and T. Garland, Jr. 1991. Phylogenetic analyses of the correlated evolution of continuous characters: a simulation study. Evolution 45:534-557.
LESSON - 14 USE OF CLUSTAL AND PHYLIP
14.0 Aims and Objectives
14.1 CLUSTAL Introduction
14.2 New Features
14.3 Use of CLUSTAL
14.4 PHYLIP
14.4.1 Setup
14.4.2 Usage
14.5 Let us Sum up
14.6 Lesson end activities
14.7 Check your progress
14.8 Points for Discussion
14.9 References

14.0 Aims and Objectives
This chapter discusses the Clustal programs, their features and use, and the PHYLIP package, its features, setup and use.

14.1 CLUSTAL Introduction
One of the cornerstones of modern Bioinformatics is the comparison or alignment of protein sequences. With the aid of multiple sequence alignments, biologists are able to study the sequence patterns conserved through evolution and the ancestral relationships between different organisms. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). The most widely used programs for global multiple sequence alignment are from the Clustal series of programs.

The first Clustal program was written by Des Higgins in 1988 and was designed specifically to work efficiently on personal computers, which at that time had feeble computing power by today's standards. It combined a memory-efficient dynamic programming algorithm with the progressive alignment strategy developed by Feng and Doolittle and by Willie Taylor. The multiple alignment is built up progressively by a series of pairwise alignments, following the branching order in a guide tree. The initial pre-comparison used a rapid word-based alignment algorithm, and the guide tree was constructed using the UPGMA method.

In 1992, a new release was made, called ClustalV, which incorporated profile alignments (alignments of existing alignments) and the facility to generate trees from the multiple alignment using the Neighbour-Joining (NJ) method. The third generation of the series, ClustalW, released in 1994, incorporated a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties and the automatic choice of a suitable residue comparison matrix at each stage in the multiple alignment. In addition, the approximate word search used for the pre-comparison step was replaced by a more sensitive dynamic programming algorithm, and the dendrogram construction by UPGMA was replaced by NJ. The ClustalW program looked very similar to ClustalV, with simple text menus for interactive use and the possibility of running the program in batch mode by specifying the input file and the parameter options on the command line.

The rationale behind the development of the Clustal series has been to provide robust, portable programs that are capable of providing good, biologically accurate alignments within a reasonable time limit. A close collaboration between biologists and computer scientists is probably one of the main reasons for the success and continued widespread use of the Clustal programs. ClustalW has given rise to a number of developments, including the latest member of the family, ClustalX. Although the alignments produced are the same as those produced by the current release of ClustalW, the user can better evaluate alignments in ClustalX. The program displays the multiple alignment in a scrollable window, and all parameters are available using pull-down menus.
Within alignments, conserved columns are highlighted using a customizable colour scheme, and quality analysis tools are available to highlight potentially misaligned regions. ClustalX is easy to install, is user-friendly and maintains the portability of the previous generations through the NCBI Vibrant toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbitools/). Numerous options are provided, such as the realignment of selected sequences or selected blocks of the alignment and the possibility of building up difficult alignments piecemeal, making ClustalX an ideal tool for working interactively on alignments.

Parallel versions of ClustalW and ClustalX have been developed by SGI (http://www.sgi.com/industries/sciences/chembio/resources/clustalw/parallel_clustalw.html), which show increased speeds of up to 10x when running ClustalW/X on 16 CPUs and significantly reduce the time required for data analysis.

A number of other significant developments have been based on the ClustalW program. For example, ClustalNet is a Clustal alignment CORBA server, and DbClustal is a program for aligning sequences detected by database searches, which uses local alignment information to anchor the global multiple alignment. DbClustal is available on the Web at http://www-igbmc.u-strasbg.fr/BioInfo/DbClustal and forms part of the WU-Blast2 (Washington University BLAST version 2.0) server at the EBI (http://www.ebi.ac.uk/blast2/). Numerous Web servers have exploited the command line interface of ClustalW, notably the EBI's ClustalWWW Web server, which currently runs between 2,000 and 10,000 jobs/day, and the SRS server at the same site (http://srs.ebi.ac.uk/), which has ClustalW built in.

The EBI ClustalWWW interface provides extensive help, ranging from an introduction to multiple alignments for new users to detailed descriptions of each alignment option. An important factor in obtaining a high-quality alignment is the ability to change the numerous alignment parameters available in ClustalW. While the default values of the parameters have been optimised to work in the majority of cases, they are not necessarily optimal for any given alignment problem. In the ClustalWWW interface, all the options are easily accessible on the top page.

Sequences can be entered either by pasting them or by uploading a file from the user's local computer. In both cases, the sequences should be in one of seven different formats (GCG, FASTA, EMBL, GenBank, PIR/NBRF, Phylip or SWISS-PROT). Although users are encouraged to submit large numbers of sequences, there is no guarantee that the alignment will be completed within the job run limits. Therefore, users who experience problems when attempting to make very large alignments are advised to download the software and run it locally. In addition to the input format, the user can also specify the preferred output format for the multiple sequence alignment. The options are currently ALN, GCG, PHYLIP, PIR and GDE. It is also possible to configure the browser to automatically load the results files from ClustalW into a suitable external application. Many commercial packages, e.g. the GCG package (Wisconsin Package, Genetics Computer Group, Madison, WI) and its X Window graphical user interface, SeqLab, can also accept ClustalW alignments.
A recent enhancement to the ClustalW WWW interface has been the addition of an option that allows the user to upload the results of ClustalW into an alignment editor, using a Java applet called JalView (http://www.compbio.dundee.ac.uk/). JalView is a fully featured multiple sequence alignment editor which allows the user to perform further alignment analysis. Special features include the definition of sequence sub-groups, links to the SRS server at the EBI, and an option to output the alignment as a colour PostScript file for printing purposes. ClustalWWW can also calculate trees from a multiple alignment using the NJ method, a widely used and relatively fast algorithm that clusters sequences by minimising the sum of branch lengths. The resulting evolutionary relationships can be viewed either as cladograms or phylograms, with the option to display branch lengths (or 'tree graph distances').

14.2 New Features
Both ClustalW and ClustalX are being actively maintained and updated. Recent enhancements have included the possibility of saving both alignments and phylogenetic trees in the NEXUS format for compatibility with a number of phylogeny programs. Some work has also been done to optimise the alignment parameters; for example, the Gonnet series of residue comparison matrices is now used by default for protein sequence alignments. The latest version of the programs (version 1.83) contains four main enhancements. The first modification is the facility to save the multiple alignment result as a FASTA format file, for compatibility with a number of other software packages. Another is to provide a percent identity matrix, which some users have asked for. A third new option is the possibility of saving the residue range in the output file when saving a user-specified range of the alignment. This is particularly useful when extracting a single domain from the alignment of multi-domain proteins. The increased speeds obtained mean that it is now possible to construct phylogenetic trees for very large sets of sequences, which was previously only feasible on very large computer systems.

14.3 Use of CLUSTAL
CLUSTAL X is a new window-based interface for the widely used progressive multiple sequence alignment program CLUSTAL W. The new system is easy to use, providing an integrated environment for performing multiple sequence and profile alignments and analysing the results. CLUSTAL X displays the sequence alignment in a window on the screen. A versatile sequence colouring scheme allows the user to highlight conserved features in the alignment. Pull-down menus provide all the options required for traditional multiple sequence and profile alignment. New features include:
§ the ability to cut-and-paste sequences to change the order of the alignment,
§ selection of a subset of the sequences to be realigned,
§ selection of a sub-range of the alignment to be realigned and inserted back into the original alignment,
§ alignment quality analysis, with low-scoring segments or exceptional residues highlighted,
§ quality analysis and realignment of selected residue ranges, providing the user with a powerful tool to improve and refine difficult alignments and to trap errors in input sequences.
CLUSTAL X has been compiled on SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
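Because ClustalW retains the batch mode described in 14.1, where the input file and parameter options are given on the command line, it is easy to drive from a script. A minimal sketch in Python, assuming a clustalw executable is installed locally; the flag spellings below follow the ClustalW documentation but should be verified against your installed version (for example by running the program with no arguments and inspecting the help text):

    import subprocess

    def run_clustalw(in_file, out_file):
        # Align sequences in batch mode with a locally installed ClustalW.
        # Flag names are assumptions based on the ClustalW manual; check
        # them against your local installation before relying on this.
        subprocess.run(
            ["clustalw",
             f"-INFILE={in_file}",    # input sequences (FASTA, GCG, EMBL, ...)
             f"-OUTFILE={out_file}",  # where to write the alignment
             "-OUTPUT=PHYLIP"],       # PHYLIP output feeds directly into PHYLIP (14.4)
            check=True)

    run_clustalw("globins.fasta", "globins.phy")  # hypothetical file names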
Multiple sequence alignment with the Clustal series of programs: the Clustal programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing of range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).

14.4 PHYLIP
PHYLIP is a free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies). The name is an acronym for PHYLogeny Inference Package. It consists of 35 portable programs: the source code is written in C, and precompiled executables are available for Windows (95/98/NT/2000/me/XP), MacOS 8 and 9, MacOS X, and Linux systems. Complete documentation is provided for all the programs in the package and is part of the package. The author of the package is Joseph Felsenstein, Professor in the Department of Genome Sciences and the Department of Biology at the University of Washington, Seattle.

Methods (implemented by each program) that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

Each program is controlled through a menu, which asks the users which options they want to set, and allows them to start the computation. The data are read into the program from a text file, which the user can prepare using any word processor or text editor (but it is important that this text file not be in the special format of that word processor -- it should instead be in flat ASCII or Text Only format). Some sequence analysis programs, such as the ClustalW alignment program, can write data files in the PHYLIP format. Most of the programs look for the data in a file called infile -- if they do not find this file, they ask the user to type in the file name of the data file. Output is written onto files with names like outfile and outtree. Trees written onto outtree are in the Newick format, an informal standard agreed to in 1986 by the authors of a number of major phylogeny packages.

14.4.1 Setup
To use PHYLIP, it is necessary to set the PHYLIP environment by running a special command sequence once per login session. You may optionally place these commands in your .cshrc (C Shell users) or .bash_profile (Bourne Shell users) to avoid having to manually run these commands on login.
For csh and tcsh:
source /usr/local/setup/phylip.setup.csh
For sh and bash:
. /usr/local/setup/phylip.setup.sh

14.4.2 Usage
PHYLIP contains the following commands:
clique
o finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states.
consense
o computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree.
contml
o estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations.
dnacomp
o estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites which could have all states (nucleotides) uniquely evolved on the same tree.
dnadist
o computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs.
dnainvar
o for nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns.
dnaml
o estimates phylogenies from nucleotide sequences by maximum likelihood. The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different (prespecified) rates of change in different categories of sites, with the program inferring which sites have which rates.
dnamlk
o same as dnaml but assumes a molecular clock. The use of the two programs together permits a likelihood ratio test of the molecular clock hypothesis to be made.
dnamove
o interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and the display of reconstructed ancestral bases. This can be used to find parsimony or compatibility estimates by hand.
dnapars
o estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states.
dnapenny
o finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search.
dollop
o estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters.
dolmove
o interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.
dolpenny
o finds all most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria, using the branch-and-bound method of exact search.
factor
o takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1).
fitch
o estimates phylogenies from distance matrix data under the "additive tree model", according to which the distances are expected to equal the sums of branch lengths between the species. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.
gendist
o computes one of three different genetic distance formulas from gene frequency data.
The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The former is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift. The distances are written to a file in a format appropriate for input to the distance matrix programs.
kitsch
o estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.
mix
o estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters.
move
o interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.
neighbor
o an implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method" and of the UPGMA (Average Linkage clustering) method.
penny
o finds all most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria, using the branch-and-bound method of exact search.
protpars
o estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.
protdist
o computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can then be used in the distance matrix programs.
restml
o estimation of phylogenies by maximum likelihood using restriction sites data (not restriction fragments but presence/absence of individual sites).
seqboot
o reads in a data set, and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program consense to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters.

Check your progress:
1. List any five PHYLIP commands.
Notes: a) Write your answer in the space given below. b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
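All of the sequence programs above read the same plain-text infile layout: a header line giving the number of taxa and the number of sites, then one line per taxon consisting of a name padded to ten characters followed by its sequence. A minimal sketch in Python of writing the sequential variant of this format (the interleaved variant, and any local quirks, should be checked against the PHYLIP documentation); the records used here are hypothetical:

    def write_phylip(records, path="infile"):
        # Write {name: sequence} records as a sequential PHYLIP data file.
        # Names are truncated or padded to the 10-character field PHYLIP expects.
        n_taxa = len(records)
        n_sites = len(next(iter(records.values())))
        with open(path, "w") as fh:
            fh.write(f" {n_taxa} {n_sites}\n")
            for name, seq in records.items():
                fh.write(f"{name[:10]:<10}{seq}\n")

    write_phylip({"human": "AGCAATGGCC",
                  "chimp": "AGCTATGGAC",
                  "gorilla": "AGCTATGCCC"})

Writing the file as "infile" means the PHYLIP programs will pick it up automatically, as described in 14.4 above.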
14.5 Let us Sum up
Clustal is a widely used multiple sequence alignment computer program; the latest version is 1.83. PHYLIP is a free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies). The name is an acronym for PHYLogeny Inference Package. It consists of 35 portable programs: the source code is written in C, and precompiled executables are available for Windows (95/98/NT/2000/me/XP), MacOS 8 and 9, MacOS X, and Linux systems. Complete documentation is provided for all the programs in the package and is part of the package. The author of the package is Joseph Felsenstein, Professor in the Department of Genome Sciences and the Department of Biology at the University of Washington, Seattle. Methods (implemented by each program) that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

14.6 Lesson end activities
1. Collect any 10 related sequences and perform multiple sequence alignment using CLUSTAL.
2. Build a phylogenetic tree using PHYLIP.

14.7 Check your progress: Model answers
1. Your answer must include any five commands given in this lesson above.

14.8 Points for Discussion
1. Do a comparative analysis of ClustalX and ClustalW and elaborate on which is better.

14.9 References
1. Martins, E. P., and T. F. Hansen. 1997. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. American Naturalist 149:646-667. Erratum Am. Nat. 153:448.
2. Nunn, C. L., and R. A. Barton. 2001. Comparative methods for studying primate adaptation and allometry. Evolutionary Anthropology 10:81-98.
3. Oakley, T. H., Z. Gu, E. Abouheif, N. H. Patel, and W.-H. Li. 2005. Comparative methods for the analysis of gene-expression evolution: an example using yeast functional genomic data. Molecular Biology and Evolution 22:40-50.
4. O'Meara, B. C., C. M. Ané, M. J. Sanderson, and P. C. Wainwright. 2006. Testing for different rates of continuous trait evolution in different groups using likelihood. Evolution 60:922-933.
5. Organ, C. L., A. M. Shedlock, A. Meade, M. Pagel, and S. V. Edwards. 2007. Origin of avian genome size and structure in non-avian dinosaurs. Nature 446:180-184.
6. Page, R. D. M., ed. 2003. Tangled trees: phylogeny, cospeciation, and coevolution. University of Chicago Press, Chicago.
7. Pagel, M. D. 1993. Seeking the evolutionary regression coefficient: an analysis of what comparative methods measure. Journal of Theoretical Biology 164:191-205.
8. Pagel, M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877-884.
9. Paradis, E. 2005. Statistical analysis of diversification with species traits. Evolution 59:1-12.
10. Paradis, E., and J. Claude. 2002. Analysis of comparative data using generalized estimating equations. Journal of Theoretical Biology 218:175-185.
11. Purvis, A., and T. Garland, Jr. 1993. Polytomies in comparative analyses of continuous characters. Systematic Biology 42:569-575.
12. Rezende, E. L., and T. Garland, Jr. 2003. Comparaciones interespecíficas y métodos estadísticos filogenéticos. Pages 79-98 in F. Bozinovic, ed. Fisiología Ecológica & Evolutiva: teoría y casos de estudios en animales.
Ediciones Universidad Católica de Chile, Santiago.
13. Ridley, M. 1983. The explanation of organic diversity: the comparative method and adaptations for mating. Clarendon, Oxford, U.K.
14. Rohlf, F. J. 2001. Comparative methods for the analysis of continuous variables: geometric interpretations. Evolution 55:2143-2160.

UNIT IV
LESSON - 15 GENE PREDICTION
15.0 Aims and objectives
15.1 Gene prediction
15.2 Extrinsic Approaches
15.3 Ab Initio Approaches
15.4 Other Signals
15.4.1 Signal Sensors
15.4.2 Content Sensors
15.4.3 Integrated Gene Finding Methods
15.5 Comparative Genomics Approaches
15.6 Let us sum up
15.7 Check your progress
15.8 Points for Discussion
15.9 References

15.0 Aims and objectives
This chapter discusses the various gene prediction methods, signal sensors, content sensors, extrinsic and ab initio approaches, and integrated gene finding methods.

15.1 Gene prediction
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequences and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem. Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. The latter still demands in vivo experimentation through gene knockout and other assays, although frontiers of Bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.

15.2 Extrinsic Approaches
In extrinsic gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose. A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene.
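Because the genetic code is degenerate, the reverse translation mentioned above yields a family of candidate coding sequences rather than a single one. A minimal sketch in Python, using a deliberately abbreviated codon table for illustration (a complete table covers all twenty amino acids):

    from itertools import product

    # Abbreviated reverse codon table, for illustration only.
    CODONS = {"M": ["ATG"], "W": ["TGG"], "F": ["TTT", "TTC"],
              "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]}

    def reverse_translate(peptide):
        # Yield every coding DNA sequence that could encode the peptide.
        for combo in product(*(CODONS[aa] for aa in peptide)):
            yield "".join(combo)

    # Even the tripeptide MFL already has 1 * 2 * 6 = 12 candidate sequences.
    print(sum(1 for _ in reverse_translate("MFL")))

The number of candidates grows multiplicatively with peptide length, which is why inexact search tools such as BLAST, rather than exhaustive enumeration, are used in practice.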
However, to apply this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms, only a subset of all genes in the organism's genome are expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or foetus, which might be difficult to study for ethical reasons. Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequence from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

15.3 Ab Initio Approaches
Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
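The long-ORF heuristic sketched above is easy to implement. A minimal illustrative version in Python that scans one strand in all three reading frames and reports ATG-to-stop stretches above a length cutoff (a complete scanner would also examine the reverse complement; the toy sequence is hypothetical):

    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def long_orfs(dna, min_codons=100):
        # Yield (frame, start, end) for each ATG..stop ORF of >= min_codons.
        dna = dna.upper()
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield frame, start, i + 3
                    start = None

    # Toy sequence; prokaryotic scans would use a cutoff like min_codons=100.
    print(list(long_orfs("CCATGAAATTTGGGTAAA", min_codons=2)))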
Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail. Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models, in order to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs.

15.4 Other Signals
Among the derived signals used for prediction are sub-sequence statistics such as k-mer counts, the Fourier transform of pseudo-number-coded DNA, Z-curve parameters and certain run features. It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported. In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.

15.4.1 Signal Sensors
The most basic signal sensor is a simple consensus sequence, or an expression that describes a consensus sequence along with allowable variations, such as a PROSITE expression. More sensitive sensors can be designed using weight matrices in place of the consensus, in which each position in the pattern allows a match to any residue, but different costs are associated with matching each residue in each position. The score returned by a weight matrix sensor for a candidate site is the sum of the costs of the individual residue matches over that site. If this score exceeds a given threshold, the candidate site is predicted to be a true site. Such sensors have a natural probabilistic interpretation in which the score returned is a log likelihood ratio under a simple statistical model in which each position in the site is characterized by an independent and distinct distribution over possible residues. A mathematically equivalent interpretation of the score is that it is the discrimination energy for site recognition. Weight matrices can also be viewed as a simple type of neural network, sometimes called a perceptron.

Many investigators have also applied more complex neural networks, such as multi-layer feed-forward networks and time delay networks, to various DNA signal recognition problems. Multi-layer nets have the ability to capture statistical dependency between the residues at different positions in a site, an ability that perceptrons (and hence weight matrices) lack. Time delay neural networks also allow insertions and deletions while evaluating a match to a prospective site, whereas weight matrices and feed-forward neural networks do not.
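Returning to the weight matrix sensor described at the start of this section: scoring a candidate site is simply a sum of per-position costs compared against a threshold. A minimal sketch in Python with a hypothetical four-position matrix of log likelihood ratio scores (the values and the threshold are invented for illustration; real matrices are estimated from verified sites):

    # Hypothetical log likelihood ratio scores, one column per site position.
    WEIGHTS = [
        {"A": 1.2, "C": -0.8, "G": 0.1, "T": -1.0},
        {"A": -0.5, "C": 1.5, "G": -1.2, "T": 0.2},
        {"A": 0.9, "C": -0.3, "G": 0.4, "T": -0.7},
        {"A": -1.1, "C": 0.6, "G": 1.3, "T": -0.2},
    ]

    def score_site(site):
        # Sum the per-position scores over the candidate site.
        return sum(col[base] for col, base in zip(WEIGHTS, site))

    def is_site(site, threshold=2.0):
        # Predict a true site when the summed score clears the threshold.
        return score_site(site) >= threshold

    print(score_site("ACAG"), is_site("ACAG"))  # 4.9 True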
Other statistical/pattern models besides neural networks, such as nonhomogeneous Markov models (a weight matrix where the distribution at position i depends on the residue at position i-1, sometimes called "WAM" models), decision trees, quadratic discriminant functions, and graphical models, have also been used as biosequence signal sensors. In general, the penalty for these more sophisticated models is that much more training data is needed to estimate the many parameters that they contain, so they are unsuitable in cases where relatively few verified examples are known of the site to be modeled.

15.4.2 Content Sensors
The most important and most studied content sensor is the sensor that predicts coding regions. An extensive review of computational methods to detect coding regions is given by Fickett and Tung [23] (see also [20, 21]). In prokaryotes, it is still common to locate genes by simply looking for long open reading frames (ORFs); this is certainly not adequate for higher eukaryotes. To discriminate coding from non-coding regions in eukaryotes, exon content sensors often use in-frame hexamer counts or, what is nearly equivalent, a set of 3 fifth-order Markov models, one for each of the three nucleotide positions within a codon, as pioneered in the genefinder GeneMark [7]. (A minimal sketch of in-frame hexamer counting is given below.) It is also important to consider local compositional biases, as the codon preferences are quite different between genes in G+C rich regions and genes in A+T rich regions. While many other measures of coding potential have been investigated (Fickett tested 19 different measures taken from the literature), few others have been proven to be as effective. However, combinations of several measures can be effective, as in the popular GRAIL exon detector, in which several coding measures are combined along with base composition and signal sensor output for flanking splice sites, and fed into a neural net to predict exons.

Other content sensors include sensors for CpG islands, which are regions that often occur near the beginnings of genes, where the frequency of the dinucleotide CG is not as low as it typically is in the rest of the genome, and sensors for repetitive DNA, such as ALU sequences. The latter sensors are often used as masks or filters that completely remove the repetitive DNA, leaving the remaining DNA to be analyzed.

Check your progress:
1. Describe the ab initio method.
Notes: a) Write your answer in the space given below. b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
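The in-frame hexamer idea from 15.4.2 (flagged above) can be made concrete: hexamers are counted separately according to the codon position at which they begin, using known coding sequence as training data. A minimal sketch of the counting step in Python; turning the counts into log-odds scores against a non-coding background model, which is how a real sensor would use them, is omitted:

    from collections import Counter

    def inframe_hexamer_counts(coding_seq):
        # Count hexamers keyed by the codon position (0, 1 or 2) where they start.
        counts = [Counter(), Counter(), Counter()]
        for i in range(len(coding_seq) - 5):
            counts[i % 3][coding_seq[i:i + 6]] += 1
        return counts

    # Toy training sequence (hypothetical); real training uses known exons.
    counts = inframe_hexamer_counts("ATGGCCATTGTAATGGGCCGC")
    print(counts[0].most_common(3))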
A linguistic metaphor is sometimes applied here, likening the process of breaking down a sequence of DNA into genes, each of which is a series of exons and introns, to the process of parsing a sentence by breaking it down into its constituent grammatical parts. Indeed this parsing metaphor can be pushed deeper. Searls was the first major proponent of describing gene structure in linguistic terms using a formal grammar. His genefinding program, GenLang, was one of the earliest integrated genefinders, following on the pioneering work of Gelfand, Gelfand and Roytberg Fields and Soderlund, and Phil Green's GeneFinder, and was one of the inspirations for significant later work and the HMM methods described below.) Nearly all integrated genefinders use dynamic programming to combine candidate exons and other scored regions and sites into an complete gene prediction with maximal total score. A brief and lucid tutorial on this topic can be found in] and a more detailed exposition in. Gelfand, et al, proposed a dynamic programming scheme, embodied in the genefinder GREAT, that calculates the set of all so-called Pareto-optimal gene structure predictions, which include the optimal predictions for a wide variety of different scoring functions. Dynamic programming methods are also used in Grail II, GeneParser, FGENEH, and recent versions of GeneID. Dynamic programming methods find the candidate gene structure with the best overall score. The key to success in these methods is developing the right score function. A fruitful approach here has been to define a statistical model of genes that includes parameters describing codon dependencies in exons, characteristics of splice sites (e.g. the parameters of a weight matrix for splice sites), as well as ``linguistic" information on what functional features are likely to follow other features (see Figure 1). In this approach the observed DNA sequences are actually modeled as if they were manifestations of a stochastic process that generates gene-containing DNA. This process includes a latent (or ``hidden") variable associated with each nucleotide that represents the This watermark does not appear in the registered version - http://www.clicktoconvert.com 177 functional role or position of that nucleotide, e.g. a G residue might be part of a GT consensus donor splice site or it might be in the third position of a start codon. Taken together, the states of these hidden variables define a candidate gene structure. The linguistic rules for what functional features follow what other features are expressed by the parameters of a Markov process on the hidden variables. For this reason, these models are called hidden Markov models, or HMMs. Because a Markov process is just a finite state machine with probabilities on the state transitions, genefinding HMMs are merely a stochastic version of the genefinding finite state machines (regular grammars) introduced by Searls. Fig 15.1: A simplified diagram representing the liguistic rules for what might follow what when parsing a sequence consisting of a multiple exon gene. The arcs represent contents and the nodes represent signals. The contents are J5' : 5' UTR, EI : Initial Exon, E : Exon, I : Intron, E : Internal Exon, EF: Final Exon, ES : Single Exon, and J3' : 3' UTR. The signals are B : Begin sequence, S : Start Translation, D : Donor splice site, A : Acceptor splice site, T : Stop Translation, F : End sequence. A candidate gene structure is created by tracing a path in this figure from B to F. 
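The "what can follow what" rules of Fig 15.1 amount to a finite state machine, which can be written down directly. A minimal sketch in Python, using a hypothetical encoding of the figure's simplified forward-strand grammar; legality checks like this are what the dynamic programming described below performs, with probabilities attached to each transition:

    # Hypothetical encoding of Fig 15.1: each signal maps to the signals that
    # may legally follow it (the intervening contents label the arcs).
    FOLLOWS = {
        "B": ["S"],       # begin sequence -> start translation (via 5' UTR)
        "S": ["D", "T"],  # start -> donor (initial exon) or stop (single exon)
        "D": ["A"],       # donor -> acceptor (via intron)
        "A": ["D", "T"],  # acceptor -> donor (internal exon) or stop (final exon)
        "T": ["F"],       # stop translation -> end sequence (via 3' UTR)
    }

    def is_legal(path):
        # Check that a sequence of signals obeys the grammar.
        return all(b in FOLLOWS[a] for a, b in zip(path, path[1:]))

    print(is_legal(["B", "S", "D", "A", "D", "A", "T", "F"]))  # two-intron gene: True
    print(is_legal(["B", "D", "S"]))  # donor before start translation: False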
The advantage of HMMs is that, being probabilistic models, they define a natural score function. Let X denote the DNA sequence, Q denote a possible sequence of hidden states, one for each nucleotide in X, and θ denote the parameters of the HMM. Since Q represents a candidate gene structure for X, to find the genes in X we want to find the Q that is most likely given the DNA sequence, i.e., the Q that maximizes P(Q | X, θ), the probability of the gene structure Q given the DNA sequence X and the parameters θ. Equivalently, we can maximize the joint probability P(Q, X | θ). This is the score function that is optimized in a genefinding HMM. It can be optimized using standard dynamic programming methods.
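The standard dynamic programming method for recovering the best state sequence Q is the Viterbi algorithm. A minimal sketch in Python over a hypothetical two-state model; the states, transition and emission probabilities below are invented for illustration and are far simpler than a real genefinding HMM. The computation is done in log space to avoid numerical underflow:

    import math

    def viterbi(seq, states, start_p, trans_p, emit_p):
        # Return the most probable hidden state path Q for observed sequence X.
        V = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
        back = []
        for x in seq[1:]:
            scores, ptr = {}, {}
            for s in states:
                prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
                scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][x]))
                ptr[s] = prev
            V.append(scores)
            back.append(ptr)
        path = [max(states, key=lambda s: V[-1][s])]  # trace back from best end state
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return path[::-1]

    # Hypothetical model: the "coding" state emits G+C-rich sequence.
    states = ("coding", "noncoding")
    start = {"coding": 0.5, "noncoding": 0.5}
    trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
             "noncoding": {"coding": 0.1, "noncoding": 0.9}}
    emit = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
            "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
    print(viterbi("ATGCGCGCAT", states, start, trans, emit))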
This approach has been taken to its extreme limit in a genefinding program developed by Gelfand, Mironov, and Pevzner. This system, called Procrustes, requires the user to provide a close protein homolog of the gene to be predicted. Then a "spliced alignment" algorithm, similar to a Smith-Waterman alignment, is used to derive a putative gene structure by aligning the DNA to the homolog. The major disadvantage of this method is the requirement of a close homolog: it is often the case that homologs are unknown or are remote, in which case this system would be inappropriate. Nevertheless, in the presence of a very close homolog, Procrustes is an extremely effective gene finding method. Recent related methods, based on HMM models, have been developed by Birney and Durbin and are currently being developed by Kulp.

In 1995, a number of different integrated genefinders were tested on a benchmark set of 570 vertebrate genes by Burset and Guigó. They looked not only at how many bases were predicted correctly as either coding or non-coding, but also at how many exons were predicted exactly, with both splice sites located correctly. In the former case, accuracy was about 75-80%; in the latter, about 40-60%. These numbers are for systems that do not employ protein database homology searches; when database homology is employed, the upper limit for the accuracy increases by about 10% in both categories. Integrated eukaryotic genefinding systems based on HMM and GHMM models, starting with Genie and followed by Veil, Genscan and HMMgene, have pushed beyond these early performance numbers, with the latter two programs now obtaining upwards of 90% accuracy at the level of individual nucleotides and 80% for exact exon prediction, without the use of database homologies. A new category of completely correct gene prediction has been added to the list of performance measurements, and Genscan achieves an accuracy of about 40% on the Burset and Guigó dataset in this category. Tests have also been conducted on the identification of promoters, showing that the accuracy of currently available methods is much lower on this task.

The currently available genefinding performance results must be approached with extreme caution. The primary reason is that they depend very strongly on the difficulty of the genes in the test set and, for some genefinders, on the homology overlap between the genes in the test set and those in the training set used to optimize the parameters of the models. The latter is a factor even when no homology is explicitly used by the genefinding method. To avoid this problem, it is best to compare genefinders by training and testing on the same genes, and to avoid homologies between genes used for training and testing. Reese has constructed benchmark sets of this type for human and for Drosophila genes, randomly partitioned into specified parts for use in cross-validated train-test experiments. These have been used by Genie, Genscan and HMMgene. Reese's human dataset is also a bit harder than the original Burset and Guigó dataset, so genefinding programs get lower overall scores on it. Furthermore, the variance in performance from one train-test partition to another is quite high, since some parts by chance ended up with more "hard-to-predict" genes (usually genes with many exons and/or long introns) than others.
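The two levels of accuracy used in these evaluations can be made precise. Below is a minimal sketch, on hypothetical exon interval lists rather than any published benchmark, of nucleotide-level accuracy and exact-exon accuracy.

def coding_bases(exons):
    """Expand (start, end) exon intervals into a set of coding positions."""
    return {p for start, end in exons for p in range(start, end + 1)}

def nucleotide_accuracy(true_exons, predicted_exons, seq_len):
    """Fraction of positions labelled correctly as coding or non-coding."""
    true_set, pred_set = coding_bases(true_exons), coding_bases(predicted_exons)
    correct = sum((p in true_set) == (p in pred_set) for p in range(seq_len))
    return correct / seq_len

def exact_exon_accuracy(true_exons, predicted_exons):
    """Fraction of true exons predicted with both boundaries exactly right."""
    return len(set(true_exons) & set(predicted_exons)) / len(true_exons)

# Hypothetical annotation and prediction for a 1000 bp sequence.
true_exons = [(100, 199), (400, 499), (800, 899)]
pred_exons = [(100, 199), (390, 499), (800, 899)]
print(nucleotide_accuracy(true_exons, pred_exons, 1000))  # 0.99
print(exact_exon_accuracy(true_exons, pred_exons))        # 2/3

Note how a prediction can score very well at the nucleotide level (here 99%) while missing an exact exon, which is why both measures are reported.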
This high variance graphically demonstrates the unreliability of the currently available genefinding performance figures: if by chance a different set of human genes had been included in GenBank, the numbers would have been quite different, and probably lower, since GenBank is biased towards genes with fewer exons and shorter introns. We need a much larger sample of human genes before we can get stable performance numbers. Reese's datasets, like those of Burset and Guigó, contain exactly one gene per sequence; little is known about the accuracy of genefinders on large genomic sequences containing multiple genes, although some harder and more realistic human genomic data, consisting of large annotated contigs, is available. The Sanger Centre also proposes a standardized format, Gene Finding Format or GFF, for both gene annotation and comparing the results of various genefinders. It would greatly aid the maturation of this field if we could agree on a simple standard data interchange format like this. Once this is established, we could then share a set of tools for the display, comparison, analysis and combination of different gene predictions, along with auxiliary sequence annotation.

15.5 Comparative Genomics Approaches

As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is the comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan. Comparative gene finding can also be used to project high-quality annotations from one genome to another; notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.

15.6 Let us sum up

We briefly reviewed computational methods for finding genes in genomic DNA sequences. Specific programs are now available to find genes in the genomic DNA of many organisms; we discussed the approaches used by these programs, their performance, and future directions for this field. It is important to distinguish two different goals in genefinding research. The first goal is to provide computational methods to aid in the annotation of the large volume of genomic data that is produced by genome sequencing efforts. The second goal is to provide a computational model to help elucidate the mechanisms involved in transcription, splicing, polyadenylation and other critical processes in the pathway from genome to proteome. While there is some overlap in these goals, there is also some conflict, and no one computational genefinding approach will be optimal for both. A "purist" system that mimics the cellular processes cannot take advantage of homologies with other proteins and matches to EST sequences when deciding where to splice.
Such a purist system presumably should not use codon statistics, frame consistency between exons, or lack of in-frame stop codons to predict overall gene structure, although there is some evidence that the absence of early in-frame stop codons may be involved in biological start site selection. One would think that these restrictions would completely cripple computational genefinding methods. However, Guigó has shown that just using simple weight matrices to find the best combination of splice site signals and translation start and stop signals, along with the standard syntactic constraints on gene structure (frame consistency, no in-frame stop codons, minimum intron size), gives results on his benchmark data set that are comparable to those obtained by most of the genefinders he and Burset tested in 1995. These results are not competitive with the older genefinders that use protein homology, nor with the newer methods that use exon coding potential but not homology, but they nevertheless indicate a surprising potential for purist genefinding models. More detailed models of the splicing process, the selection of translation start, and the process of polyadenylation may significantly improve such purist models. These models may prove useful in human genome annotation for finding rapidly evolving and rarely expressed genes, especially those with unusual codon usage. However, if we simply want to produce genefinders that give the most reliable annotation in "everyday" genome center annotation efforts, it is clear that more work needs to be done to incorporate EST information along with protein homology and powerful statistical models.

There are other key issues that will affect future research in both of the above computational genefinding paradigms. One is the issue of alternative splicing: no currently available genefinder handles alternative splicing in an effective manner. Intimately tied with this issue is that of gene regulation. The abundant regulatory signals flanking genes, and appearing in introns (and sometimes in exons), combined with regulatory proteins specific to the cell type and cell state, determine the expression of the gene. Gene annotation is not complete until these signals are identified, and the cellular conditions that give rise to differing expression levels for different transcripts are worked out. This implies, among other things, that future genefinders will need to explicitly take into account experimental data relating to differential expression, along with the other types of data we have discussed. It may be anticipated that this task will occupy genefinding researchers for some years to come.

15.7 Check your progress: Model answers

1. Your answer must include these points: Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well understood than in prokaryotes, making them more difficult to recognize reliably. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and poly(A) (polyadenylation) signals.

15.8 Points for Discussion

1. "Gene prediction is closely related to computational biology" - Substantiate.

15.9 References
"In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists". Bioinformatics 23 (4): 414-420. doi:10.1093/Bioinformatics /btl639. 2. Hiller M, Pudimat R, Busch A, Backofen R (2006). "Using RNA secondary structures to guide sequence motif finding towards single-stranded regions". Nucleic Acids Res 34 (17): e117. Entrez PubMed 16987907. 3. Patterson DJ, Yasuhara K, Ruzzo WL (2002). "Pre- mRNA secondary structure prediction aids splice site prediction". Pac Symp Biocomput: 223-234. Entrez PubMed 11928478. 4. Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H (2006). "Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks". Comput Biol Chem 30 (1): 50-57. Entrez PubMed 16386465. This watermark does not appear in the registered version - http://www.clicktoconvert.com 184 LESSON – 16 FRAGMENT ASSEMBLY 16.0 Aims and objectives 16.1 Fragment assembly 16.1.1 Introduction 16.1.2 New ideas 16.1.3 Error correction 16.1.4 Error correction or data corruption? 16.1.5 Eulerian superpath problem 16.2 Let us Sum up 16.3 Lesson end activities 16.4 Check your progress 16.5 Points for Discussion 16.6 References 16.0 Aims and Objectives: This watermark does not appear in the registered version - http://www.clicktoconvert.com 185 To study about the fragment assembly, introduction, new ideas, error correction, error correction or data corruption?, eulerian superpath problem. 16.1 Fragment Assembly 16.1.1 Introduction For the last twenty years fragment assembly in DNA sequencing mainly followed the “overlap layout - consensus” paradigm, that is used in all currently available software tools for fragment assembly. Although this approach proved to be useful in assembling contigs of moderate sizes, it faces difficulties while assembling prokaryotic genomes a few million bases long. These difficulties led to introduction of the double-barreled DNA sequencing that uses additional experimental information for assembling large genomes in the framework of the same “overlap layout - consensus” paradigm. Although the classical approach culminated in some excellent fragment assembly tools (Phrap, CAP3, TIGR, and Celera assemblers are among them), critical analysis of the “overlap - layout - consensus” paradigm reveals some weak points. First, the overlap stage finds pairwise similarities that do not always provide true information on whether the fragments (sequencing reads) overlap. A better approach would be to reveal multiple similarities between fragments since sequencing errors tend to occur at random positions while the differences between repeats are always at the same positions. However, this approach is infeasible due to high computational complexity of the multiple alignment problem. Another problem with the conventional approach to fragment assembly is that finding the correct path in the overlap graph with many false edges (layout problem) becomes very difficult. Unfortunately, these problems are difficult to overcome in the framework of the “overlap - layout - consensus” approach and the existing fragment assembly algorithms are often unable to resolve the repeats even in prokaryotic genomes. Inability to resolve repeats and to figure out the order of contigs leads to additional experimental work to complete the assembly. 
Indeed, all the programs we tested made errors while assembling shotgun reads from the bacterial sequencing projects Campylobacter jejuni, Neisseria meningitidis, and Lactococcus lactis. Biologists at large sequencing centers are well aware of potential assembly errors and are forced to carry out additional experimental tests to verify the assembled contigs. Bioinformaticians are also aware of assembly errors, as evidenced by finishing software that supports experiments correcting these errors.

How can one resolve these problems? Surprisingly enough, an unrelated area, DNA arrays, provides a hint. Sequencing by Hybridization (SBH) is a ten-year-old idea that never became practical but (indirectly) created the DNA arrays industry. Conceptually, SBH is similar to fragment assembly; the only difference is that the "reads" in SBH are much shorter l-tuples. In fact, the very first attempts to solve the SBH fragment assembly problem [3, 9] followed the "overlap-layout-consensus" paradigm. However, even in the simple case of error-free SBH data, the corresponding layout problem leads to the NP-complete Hamiltonian Path Problem. Pevzner (1989) proposed a different approach that reduces SBH to an easy-to-solve Eulerian Path Problem in the de Bruijn graph by abandoning the "overlap-layout-consensus" paradigm. Since the Eulerian path approach transforms a once difficult layout problem into a simple one, a natural question is: "Could the Eulerian path approach be applied to fragment assembly?" Idury and Waterman (1995) answered this question by recasting the fragment assembly problem as an SBH problem. They represented every read of length n as a collection of n − l + 1 l-mers and applied an Eulerian path algorithm to the set of l-tuples formed by the union of such collections over all reads. At first glance this transformation of every read into a collection of l-tuples is a very short-sighted procedure, since information about the sequencing reads is lost. However, the loss of information is minimal for large l and is well paid for by the computational advantages of the Eulerian path approach in the resulting easy-to-analyze graph; moreover, the lost information can be restored at later stages. Unfortunately, the Idury-Waterman approach, while very promising, did not scale up well. The problem is that sequencing errors transform a simple de Bruijn graph (corresponding to error-free SBH) into a tangle of erroneous edges. For a typical sequencing project, the number of erroneous edges is a few times larger than the number of real edges, and finding the correct path in this graph is an extremely difficult, if not impossible, task. Moreover, repeats in prokaryotic genomes pose serious challenges even in the case of error-free data, since the de Bruijn graph gets very tangled and difficult to analyze. Here we abandon the classical "overlap-layout-consensus" approach in favor of a new Eulerian superpath approach. The main result is the reduction of the fragment assembly problem to a variation of the classical Eulerian path problem. This reduction opens new possibilities for repeat resolution and leads to the EULER software, which generated optimal solutions for the large-scale assembly projects that were studied.
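A minimal sketch of the Idury-Waterman construction on error-free reads: every l-tuple in a read becomes a directed edge from its (l−1)-tuple prefix to its (l−1)-tuple suffix. This illustrates the idea only; it is not the EULER implementation.

from collections import defaultdict

def de_bruijn_graph(reads, l):
    """Build the de Bruijn graph: every l-tuple in any read becomes a
    directed edge from its (l-1)-prefix to its (l-1)-suffix."""
    graph = defaultdict(list)  # (l-1)-tuple -> list of successor (l-1)-tuples
    for read in reads:
        for i in range(len(read) - l + 1):
            ltuple = read[i:i + l]
            graph[ltuple[:-1]].append(ltuple[1:])
    return graph

reads = ["ACGTACGA", "GTACGAAC"]
for node, succs in sorted(de_bruijn_graph(reads, 4).items()):
    print(node, "->", ", ".join(succs))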
16.1.2 New Ideas

Given two similar reads, how can we decide whether they correspond to the same region (i.e., the differences between them are due to sequencing errors) or to two copies of a repeat located in different parts of the genome? This problem is crucial for all fragment assembly algorithms, and the pairwise comparison used in conventional algorithms does not adequately resolve it. Our error-correction procedure implicitly uses multiple comparison of reads and successfully distinguishes these two situations. Both Idury and Waterman (1995) and Myers (1995) tried to deal with errors and repeats via graph reductions, but neither method explores multiple alignment of reads to fix sequencing errors at the pre-processing stage. Of course, multiple alignment of reads is costly, and pairwise alignment is the only realistic option at the overlap stage of conventional fragment assembly algorithms. However, multiple alignment becomes feasible when we deal with perfect or nearly perfect matches of short l-tuples, which is exactly the case in the SBH approach to fragment assembly. Our error correction idea uses the multiple alignment of short substrings to modify the original reads and to create a new instance of the fragment assembly problem with a greatly reduced number of errors. The error correction makes our reads almost error-free and transforms the original very large graph into a graph with very few erroneous edges. In some sense, the error correction is a variation of the consensus step, taken at the very first step of fragment assembly rather than at the last one as in the conventional approach.

Imagine an ideal situation in which the error-correction procedure eliminated all errors and we deal with a collection of error-free reads. Is there an algorithm to reliably assemble such error-free reads in a large-scale sequencing project? At first glance the problem looks simple, but surprisingly enough the answer is no: we are unaware of any algorithm that solves this problem. For example, the Phrap, CAP3 and TIGR assemblers make 17, 14, and 9 assembly errors respectively while assembling real reads from the N. meningitidis genome, and all of them still make errors while assembling the error-free reads from this genome (although the number of errors is reduced to 5, 4, and 2 respectively). Although the TIGR assembler makes fewer errors than the other programs, this accuracy does not come for free, since it produces twice as many contigs as the other programs do. EULER made no assembly errors and produced fewer contigs with real data than the other programs produced with error-free data! EULER can also be used to immediately improve the accuracy of the Phrap, CAP3 and TIGR assemblers: these programs produce better assemblies if they use error-corrected reads from EULER. To achieve such accuracy, EULER has to overcome the bottleneck of the Idury-Waterman approach and restore the information about sequencing reads that was lost in the construction of the de Bruijn graph. Our second idea, the Eulerian Superpath, addresses this problem. Every sequencing read corresponds to a path in the de Bruijn graph called a read-path. An attempt to take into account the information about the sequencing reads leads to the problem of finding an Eulerian path that is consistent with all read-paths, the Eulerian Superpath Problem. Below we show how to solve this problem.
This simple description hides some algorithmic challenges.

16.1.3 Error Correction

Sequencing errors make implementation of the SBH-style approach to fragment assembly difficult. To bypass this problem we reduce the error rate by a factor of 35-50 at the preprocessing stage, making the data almost error-free, by solving the Error Correction Problem. We use the N. meningitidis (NM) sequencing project completed at the Sanger Centre as an example. NM is one of the most "difficult-to-assemble" bacterial genomes completed so far: it has 126 long perfect repeats up to 3832 bp in length (not to mention many imperfect repeats). The length of the genome is 2,184,406 bp. The sequencing project resulted in 53,263 reads of average length 400 (average coverage 9.7). There were 255,631 errors overall distributed over these reads, i.e., 4.8 errors per read (an error rate of 1.2%).

Let s be a sequencing read (with errors) derived from a genome G. If the sequence of G were known, then the errors in s could be corrected by aligning the read s against the genome G. In real life, the sequence of G is not known until the very last "consensus" stage of fragment assembly. It is a catch-22: to assemble a genome it is highly desirable to correct errors in reads first, but to correct errors in reads one has to assemble the genome first. To bypass this catch-22, let us assume that, although the sequence of G is unknown, the set Gl of all contiguous strings of fixed length l (l-tuples) present in G is known. Of course, Gl is unknown as well, but Gl can be reliably approximated without knowing the sequence of G. An l-tuple is called solid if it belongs to more than M reads (where M is a threshold) and weak otherwise. A natural approximation for Gl is the set of all solid l-tuples from a sequencing project.

Let T be a collection of l-tuples called a spectrum. A string s is called a T-string if all its l-tuples belong to T. Our approach to error correction leads to the following Spectral Alignment Problem: given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string. A similar problem was considered by Pe'er and Shamir (2000) in the different context of resequencing by hybridization. In the context of error correction, the solution of the Spectral Alignment Problem makes sense only if the number of mutations is small; in this case the Spectral Alignment Problem can be efficiently solved by dynamic programming, even for large l. Spectral alignment of a read against the set of all solid l-tuples from a sequencing project suggests error corrections, which may in turn change the sets of weak and solid l-tuples. Iterative spectral alignment with the set of all reads and all solid l-tuples gradually reduces the number of weak l-tuples, increases the number of solid l-tuples, and leads to the elimination of many errors in bacterial sequencing projects. Although the Spectral Alignment Problem helps to eliminate errors (and we use it as one of the steps in EULER), it does not adequately capture the specifics of fragment assembly. The Error Correction Problem described below is somewhat less natural than the Spectral Alignment Problem, but it is probably a better model for fragment assembly (although it is not a perfect model either).
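A minimal sketch of how the spectrum approximation above might be computed: count, for each l-tuple, the number of reads containing it, and keep those exceeding the threshold M. The parameters and toy reads are illustrative.

from collections import Counter

def solid_ltuples(reads, l, M):
    """Return the set of l-tuples occurring in more than M reads ('solid');
    all other observed l-tuples are 'weak'."""
    multiplicity = Counter()
    for read in reads:
        # Count each l-tuple once per read, since multiplicity is per read.
        for t in {read[i:i + l] for i in range(len(read) - l + 1)}:
            multiplicity[t] += 1
    return {t for t, m in multiplicity.items() if m > M}

reads = ["ACGTACGT", "CGTACGTT", "GTACGTTA", "ACGTACGA"]
print(sorted(solid_ltuples(reads, l=4, M=2)))
# ['ACGT', 'CGTA', 'GTAC', 'TACG']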
The greedy heuristic for the Error Correction Problem eliminates up to 97% of errors in a typical bacterial project. Given a collection of reads (strings) S = {s1, . . . , sn} from a sequencing project and an integer l, the spectrum of S is the set Sl of all l-tuples from the reads s1, . . . , sn and s̄1, . . . , s̄n, where s̄ denotes the reverse complement of read s. Let ∆ be an upper bound on the number of errors in each DNA read. This motivates the Error Correction Problem: given S, ∆, and l, introduce up to ∆ corrections in each read in S in such a way that |Sl| is minimized.

An error in a read s affects at most l l-tuples in s and l l-tuples in s̄, and usually creates 2l erroneous l-tuples that point to the same sequencing error (2d for positions within a distance d < l from the endpoint of the read). Therefore, a greedy approach to the Error Correction Problem is to look for an error correction in the read s that reduces the size of Sl by 2l (or 2d for positions close to the endpoints of the read). This simple procedure already eliminates 86.5% of errors in sequencing reads. Below we describe a more involved approach that eliminates 97.7% of sequencing errors. This approach transforms the original fragment assembly problem, with 4.8 errors per read on average, into an almost error-free problem with 0.11 errors per read on average.

Two l-tuples are called neighbors if they are one mutation apart. For an l-tuple a, define its multiplicity m(a) as the number of reads in S containing this l-tuple. An l-tuple a is called an orphan if (i) it has small multiplicity, i.e., m(a) ≤ M, where M is a threshold, (ii) it has only one neighbor b, and (iii) m(b) > m(a). The position where an orphan and its neighbor differ is called an orphan position. A sequencing read is orphan-free if it contains no orphans. An important observation is that an erroneous l-tuple created by a sequencing error usually does not appear in other reads and is usually one mutation apart from a real l-tuple (for an appropriately chosen l); therefore, a mutation in a read usually creates 2l orphans. This observation leads to an approach that corrects errors in orphan positions within the sequencing reads, provided the overall number of error corrections needed to make a given read orphan-free is at most ∆. The greedy orphan elimination approach to the Error Correction Problem starts with error corrections at the orphan positions that reduce the size of Sl by 2l (or 2d for positions at distance d < l from the endpoints of the reads). After all such errors are corrected, the "2l condition" gradually transforms into a weaker "2l − δ" condition.

16.1.4 Error Correction or Data Corruption?

A word of caution is in order. Our error-correction procedure is not perfect when deciding which nucleotide, say A or T, is correct in a given l-tuple within a read. If the correct nucleotide is A, but T is also present in some reads covering the same region, the error-correction procedure may assign T instead of A to all reads, i.e., introduce an error rather than correct one (particularly in low-coverage regions).
Since our algorithm sometimes introduces errors, "data corruption" is probably a more appropriate name for this approach! Introducing an error in a read is not such a bad thing as long as the errors from overlapping reads covering the same position are consistent (i.e., they correspond to a single mutation in a genome). An important insight is that, at this stage of the algorithm, we do not care much whether we correct or introduce errors in the sequencing reads. From an algorithmic perspective, introducing an error, which simply corresponds to changing a nucleotide in the final assembly, is not a big deal. It is much more important to make sure that we eliminate the competition between A and T at this stage, thus reducing the complexity of the de Bruijn graph. In this way we eliminate false edges in our graph and deal with the problem later: the correct nucleotide can easily be reconstructed at the final consensus stage of the algorithm. For the N. meningitidis sequencing project, orphan elimination corrects 234,410 errors and introduces 1,452 errors, a tenfold reduction in the number of sequencing errors (0.44 errors per read).

The orphan elimination procedure is usually run with M = 2, since orphan elimination with M = 1 leaves some errors uncorrected. For a sequencing project with coverage 10 and error rate 1%, every solid 20-tuple has on average 2 orphans o1 and o2, each with multiplicity 1 (i.e., the expected multiplicity of this 20-tuple is 8 rather than 10, as it would be for error-free reads). With some probability, the same errors in (different) reads correspond to the same position in the genome, thus "merging" o1 and o2 into a single l-tuple o with m(o) = 2. Although the probability of such an event is relatively small, the overall number of such cases is large for large genomes. In our studies of bacterial genomes, setting M = 2 and simultaneously correcting up to M multiple errors worked well in practice. With M = 2, we eliminated an additional 705 errors and created 131 errors (21,837 errors, or 0.41 errors per read, remain).

Orphan elimination is a more conservative procedure than spectral alignment. Orphans were defined as l-tuples of low multiplicity that have only one neighbor. The latter condition (which is not captured by spectral alignment) is important, since in the case of multiple neighbors it is not clear how to correct an error in an orphan. For the N. meningitidis genome there were 1,862 weak 20-mers (with M = 2) that had multiple neighbors. Our approach to this problem is to increase l in the hope that there is only one "competing" neighbor for longer l. After increasing l from 20 to 100, the number of orphans with multiple neighbors was reduced from 1,862 to 17. Orphan elimination should be done with caution, since errors in reads are sometimes hard to distinguish from differences between repeats. If we treated the differences between repeats (particularly repeats with low coverage) as errors, then orphan elimination would correct the differences between repeats instead of correcting errors, which may lead to an inability to resolve the repeats at later stages. It is also important to realize that error corrections in orphan positions often create new orphans. Imagine a read containing an imperfect low-coverage (≤ M) copy of a repeat that differs from a high-coverage (> M) copy of this repeat by a substitution of a block of t consecutive nucleotides.
Without knowing that we are dealing with a repeat, the orphan elimination procedure would first detect two orphans, one ending in the first position of the block and the other starting in the last position of the block. If the orphans are eliminated without checking the "at most ∆ corrections per read" condition, these two error corrections will shrink the block to size t − 2 and create two new orphans at the beginning and end of the shrunken block. At the next step, the procedure would correct the first and last nucleotides of the shrunken block and, in just t/2 steps, erase the differences between the two copies of the repeat.

Of course, for long bacterial genomes many "bad" events that may look improbable do happen, and there are two types of errors that resist greedy orphan elimination. They require a few coordinated error corrections, since single error corrections do not lead to a significant reduction in the size of Sl and thus may be missed by the greedy procedure. These errors are: (i) consecutive or closely spaced errors in the same read, and (ii) the same error with high multiplicity (> M) at the same genome position in different reads.

The first type of error is best addressed by solving the Spectral Alignment Problem to identify reads that require fewer than ∆ error corrections. We found that some reads from the N. meningitidis project have very poor spectral alignments. These reads are likely to represent contamination, vector, isolated reads, or an error in the sequencing pipeline. All such reads are of limited interest and should be discarded; in fact, it is common practice in sequencing centers to discard such "poor-quality" reads, and we adopt this approach. Although deleting poor-quality reads may slightly reduce the amount of available sequencing information, it greatly simplifies the assembly process. Another important advantage of spectral alignment is the ability to identify chimeric reads. Such reads are characterized by good spectral alignments of the prefix and suffix parts that, however, cannot be extended to a good spectral alignment of the entire read. EULER breaks the chimeric reads into two or more pieces and preserves the pieces.

The second type of error reflects the situation with M identical errors in different reads corresponding to the same genome position and generating an erroneous l-tuple with high multiplicity. For example, if both the correct and the erroneous l-tuple have multiplicity 3 (with the default threshold M = 2), it is hard to decide whether we are dealing with a unique region (with coverage 6) or with two copies of an imperfect repeat (each with coverage 3). In the N. meningitidis project there were 1,610 errors with multiplicity 3 or larger. The algorithm to correct such high-multiplicity errors is described elsewhere.

Check your progress:

1. What is an orphan position?

Notes: ee) Write your answer in the space given below. ff) Check your answer with the one given at the end of this lesson.
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………

16.1.5 Eulerian Superpath Problem

As we have discussed, the idea of the Eulerian path approach to SBH is to construct a graph whose edges correspond to l-tuples and to find a path visiting every edge of this graph exactly once. Given a set of reads S = {s1, . . . , sn}, define the de Bruijn graph G(Sl) with vertex set Sl−1 (the set of all (l−1)-tuples from S) as follows. An (l−1)-tuple v ∈ Sl−1 is joined by a directed edge to an (l−1)-tuple w ∈ Sl−1 if Sl contains an l-tuple whose first l−1 nucleotides coincide with v and whose last l−1 nucleotides coincide with w. Each l-tuple from Sl thus corresponds to an edge in G. If S contains only one sequence s1, then this sequence corresponds to a path visiting each edge of G exactly once, an Eulerian path. Finding Eulerian paths is a well-known problem that can be solved efficiently. The reduction from SBH to the Eulerian path problem described above assumes unit multiplicities of edges (no repeated l-tuples) in the de Bruijn graph. We usually assume that S contains the reverse complement of every read; in this case G(Sl) includes the reverse complement of every l-tuple, and the de Bruijn graph can be partitioned into two subgraphs, one corresponding to a "canonical" sequence and the other to its reverse complement.

With real data, the errors hide the correct path among many erroneous edges. The overall number of vertices in the graph corresponding to the error-free data from the NM project is 4,039,248 (roughly twice the length of the genome), while the overall number of vertices in the graph corresponding to the real sequencing reads is 9,474,411 (for 20-mers). After the error-correction procedure, this number is reduced to 4,081,857. A vertex v is called a source if indegree(v) = 0, a sink if outdegree(v) = 0, and a branching vertex if indegree(v) · outdegree(v) > 1. For the N. meningitidis genome, the de Bruijn graph has 502,843 branching vertices for the original reads (for l-tuple size 20). Error correction simplifies this graph, leading to a graph with 382 sources and sinks and 12,175 branching vertices; the error-free reads lead to a graph with 11,173 branching vertices. Since the de Bruijn graph gets very complicated even in the error-free case, taking into account the information about which l-tuples belong to the same reads (information that was lost in the construction of the de Bruijn graph) helps us to untangle this graph.

A path v1 . . . vn in the de Bruijn graph is called a repeat if indegree(v1) > 1, outdegree(vn) > 1, and outdegree(vi) = 1 for 1 ≤ i ≤ n − 1. Edges entering the vertex v1 are called entrances into the repeat, while edges leaving the vertex vn are called exits from the repeat. An Eulerian path visits a repeat a few times, and every such visit defines a pairing between an entrance and an exit. Repeats may create problems in fragment assembly, since there are a few entrances into a repeat and a few exits from a repeat, but it is not clear which exit is visited after which entrance in the Eulerian path. However, most repeats can be resolved by read-paths (i.e., paths in the de Bruijn graph that correspond to sequencing reads) covering these repeats.
A read-path covers a repeat if it contains an entrance into this repeat and an exit from this repeat. Every covering read-path reveals some information about the correct pairings between entrances and exits. However, some parts of the de Bruijn graph are impossible to untangle, due to long perfect repeats that are not covered by any read-path. A repeat is called a tangle if there is no read-path containing it. Tangles create problems in fragment assembly, since the pairings of entrances and exits in a tangle cannot be resolved via the analysis of read-paths. To address this issue we formulate the following generalization of the Eulerian Path Problem.

Eulerian Superpath Problem: Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

The classical Eulerian Path Problem is the particular case of the Eulerian Superpath Problem in which every path is a single edge. To solve the Eulerian Superpath Problem, we transform both the graph G and the system of paths P in this graph into a new graph G1 with a new system of paths P1. Such a transformation is called equivalent if there exists a one-to-one correspondence between Eulerian superpaths in (G, P) and (G1, P1). Our goal is to make a series of equivalent transformations (G, P) → (G1, P1) → . . . → (Gk, Pk) that lead to a system of paths Pk in which every path is a single edge. Since all transformations on the way from (G, P) to (Gk, Pk) are equivalent, every solution of the Eulerian Path Problem in (Gk, Pk) provides a solution of the Eulerian Superpath Problem in (G, P).

Below we describe a simple equivalent transformation that solves the Eulerian Superpath Problem in the case when the graph G has no multiple edges. Let x = (vin, vmid) and y = (vmid, vout) be two consecutive edges in the graph G, and let Px,y be the collection of all paths from P that include both these edges as a subpath. Define P→x as the collection of paths from P that end with x, and Py→ as the collection of paths from P that start with y. The x,y-detachment is a transformation that adds a new edge z = (vin, vout) and deletes the edges x and y from G. This detachment alters the system of paths P as follows: (i) substitute z for x, y in all paths from Px,y; (ii) substitute z for x in all paths from P→x; and (iii) substitute z for y in all paths from Py→. Informally, the detachment bypasses the edges x and y via the new edge z and directs all paths in P→x, Py→, and Px,y through z. Since every detachment reduces the number of edges in G, the detachments will eventually shorten all paths from P to single edges and will reduce the Eulerian Superpath Problem to the Eulerian Path Problem.

However, in the case of graphs with multiple edges, the detachment procedure described above may lead to non-equivalent transformations. In this case the edge x may be visited many times in an Eulerian path, and it may or may not be followed by the edge y on some of these visits. That is why, in the case of multiple edges, "directing" all paths from the set P→x through a new edge z may not be an equivalent transformation. However, if the vertex vmid has no incoming edges other than x and no outgoing edges other than y, then the x,y-detachment is an equivalent transformation even if x and y are multiple edges. In particular, detachments can be used to reduce every repeat to a single edge.
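A minimal sketch of the x,y-detachment on a toy edge-list representation, assuming vmid has x and y as its only incident edges (the case in which the transformation is equivalent); this illustrates the definition rather than EULER's actual data structures.

def detach(edges, paths, x, y, z):
    """Apply an x,y-detachment: replace the consecutive edges x and y by a
    new edge z, and reroute the affected paths through z.
    edges: list of edge names; paths: lists of edge names."""
    edges = [e for e in edges if e not in (x, y)] + [z]
    new_paths = []
    for p in paths:
        q, i = [], 0
        while i < len(p):
            if p[i] == x and i + 1 < len(p) and p[i + 1] == y:
                q.append(z); i += 2      # path in P_{x,y}: x,y -> z
            elif p[i] == x and i + 1 == len(p):
                q.append(z); i += 1      # path ending with x (P->x) -> z
            elif p[i] == y and i == 0:
                q.append(z); i += 1      # path starting with y (Py->) -> z
            else:
                q.append(p[i]); i += 1
        new_paths.append(q)
    return edges, new_paths

edges = ["a", "x", "y", "b"]  # x = (vin, vmid), y = (vmid, vout)
paths = [["a", "x", "y"], ["x", "y", "b"], ["a", "x"], ["y", "b"]]
print(detach(edges, paths, "x", "y", "z"))
# (['a', 'b', 'z'], [['a', 'z'], ['z', 'b'], ['a', 'z'], ['z', 'b']])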
It is important to realize that even when the graph G has no multiple edges, the detachments may create multiple edges in the graphs G1, . . . , Gk (for example, if the edge (vin, vout) was present in the graph prior to the detachment). However, such multiple edges do not pose problems, since in this case it is clear which instance of the multiple edge is used in every path. For illustration, consider the simple case in which the vertex vmid has only one incoming edge x = (vin, vmid), with multiplicity 2, and two outgoing edges y1 = (vmid, vout1) and y2 = (vmid, vout2), each with multiplicity 1. In this case the Eulerian path visits the edge x twice: in one case it is followed by y1 and in the other by y2. Consider an x,y1-detachment that adds a new edge z = (vin, vout1) after deleting the edge y1 and one of the two copies of the edge x. This detachment (i) shortens all paths in Px,y1 by substituting the single edge z for x, y1 and (ii) substitutes z for y1 in every path from Py1→. This detachment is an equivalent transformation if the set P→x is empty. However, if P→x is not empty, it is not clear whether the last edge of a path P ∈ P→x should be assigned to the edge z or to the (remaining copy of) edge x. To resolve this dilemma, one has to analyze every path P ∈ P→x and decide whether it "relates" to Px,y1 (in which case it should be directed through z) or to Px,y2 (in which case it should be directed through x). By "relates" to Px,y1 (Px,y2) we mean that every Eulerian superpath visits y1 (y2) right after visiting P. Two paths are called consistent if their union is a path again (there are no branching vertices in their union). A path P is consistent with a set of paths P if it is consistent with all paths in P, and inconsistent otherwise (i.e., if it is inconsistent with at least one path in P). There are three possibilities:

• P is consistent with exactly one of the sets Px,y1 and Px,y2.
• P is inconsistent with both Px,y1 and Px,y2.
• P is consistent with both Px,y1 and Px,y2.

In the first case, the path P is called resolvable, since it can be unambiguously related to either Px,y1 or Px,y2. If P is consistent with Px,y1 and inconsistent with Px,y2, then P should be assigned to the edge z after the x,y1-detachment (substitute z for x in P). If P is inconsistent with Px,y1 and consistent with Px,y2, then P should be assigned to the edge x (no action taken). An edge x is called resolvable if all paths in P→x are resolvable. If the edge x is resolvable, then the described x,y-detachment is an equivalent transformation after the correct assignment of the last edge in every path from P→x. In our analysis of the NM genome we found that 18,026 of the 18,962 edges in the de Bruijn graph are resolvable. Although we defined the notion of a resolvable path for the simple case in which the edge x has multiplicity 2, it can be generalized to edges with arbitrary multiplicities. The second condition (P is inconsistent with both Px,y1 and Px,y2) implies that the Eulerian Superpath Problem has no solution, i.e., the sequencing data are inconsistent: informally, in this case P, Px,y1 and Px,y2 impose three different scenarios for just two visits of the edge x.
After discarding the poor-quality and chimeric reads, we did not encounter this condition in our analysis of the NM genome. The last condition (P is consistent with both Px,y1 and Px,y2) corresponds to the most difficult situation and deserves special discussion. If this condition holds for at least one path in P→x, the edge x is called unresolvable, and we postpone the analysis of this edge until all resolvable edges have been analyzed. It may turn out that equivalent transformations of other resolvable edges will make the edge x resolvable; that is, equivalent transformations may resolve previously unresolvable edges. However, some edges cannot be resolved even after the detachments of all resolvable edges are completed. Such situations usually correspond to tangles, and they have to be addressed by another equivalent transformation called a cut.

Consider a fragment of the graph G with five edges and four paths y3-x, y4-x, x-y1 and x-y2, each path consisting of two edges. In this case P→x consists of the two paths y3-x and y4-x, and each of those paths is consistent with both Px,y1 and Px,y2. In fact, in this symmetric situation x is a tangle, and there is no information available to relate either of the paths y3-x and y4-x to either of the paths x-y1 and x-y2. Therefore, it may happen that no detachment is an equivalent transformation in this case. To address this problem, we introduce another equivalent transformation that affects the system of paths P but does not affect the graph G itself. An edge x = (v, w) is removable if (i) it is the only outgoing edge for v and the only incoming edge for w, and (ii) x is either the initial or the terminal edge of every path P ∈ P containing x. An x-cut transforms P into a new system of paths by simply removing x from all paths in P→x and Px→. In the case above, the x-cut shortens the paths x-y1, x-y2, y3-x, and y4-x to the single-edge paths y1, y2, y3, and y4. It is easy to check that an x-cut of a removable edge is an equivalent transformation, i.e., every Eulerian superpath in (G, P) corresponds to an Eulerian superpath in (G, P1) and vice versa.

Cuts proved to be a powerful technique for analyzing tangles that are not amenable to detachments. Detachments reduce such tangles to single unresolvable edges, which turned out to be removable in our analysis of bacterial genomes. This allowed us to reduce the Eulerian Superpath Problem to the Eulerian Path Problem for all studied bacterial genomes. Although detachments and cuts are sufficient for the bacterial genomes studied, there is still a gap in the theoretical analysis of the Eulerian Superpath Problem in the case when the system of paths is amenable to neither detachments nor cuts. The idea of equivalent graph transformations for fragment assembly is conceptually similar to the idea of equivalent graph transformations for genome rearrangements. We also emphasize that our equivalent transformation approach is very different from previously suggested graph reduction techniques for fragment assembly.

16.2 Let us Sum up

This lesson gives a detailed description and explanation of fragment assembly: the introduction, new ideas, error correction, error correction or data corruption, and the Eulerian superpath problem.

16.3 Lesson end activities

1. Could the Eulerian path approach be applied to fragment assembly?
2. Given two similar reads, how can we decide whether they correspond to the same region or to two copies of a repeat located in different parts of the genome?

16.4 Check your progress: Model answers

Your answer must include these points: The position where an orphan and its neighbor differ is called an orphan position. A sequencing read is orphan-free if it contains no orphans.

16.5 Points for Discussion

1. "Fragment assembly is a superior task in gene production" - Discuss.

16.6 References

1. Saeys Y, Rouzé P, Van de Peer Y (2007). "In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists". Bioinformatics 23 (4): 414-420. doi:10.1093/bioinformatics/btl639.
2. Hiller M, Pudimat R, Busch A, Backofen R (2006). "Using RNA secondary structures to guide sequence motif finding towards single-stranded regions". Nucleic Acids Res 34 (17): e117. PubMed 16987907.
3. Patterson DJ, Yasuhara K, Ruzzo WL (2002). "Pre-mRNA secondary structure prediction aids splice site prediction". Pac Symp Biocomput: 223-234. PubMed 11928478.
4. Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H (2006). "Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks". Comput Biol Chem 30 (1): 50-57. PubMed 16386465.

LESSON – 17 GENOME SEQUENCE ASSEMBLY

17.0 Aims and objectives
17.1 Genome sequence assembly
17.1.1 Whole genome shotgun sequencing
17.1.2 Clones and coverage
17.1.3 Assembly
17.1.4 Finishing
17.1.5 Assembly algorithms
17.1.6 Overlap-layout-consensus
17.1.7 Eulerian path
17.1.8 Handling repeats
17.1.9 Assessing assembly quality
17.2 Let us Sum up
17.3 Lesson end activities
17.4 Check your progress
17.5 Points for Discussion
17.6 References

17.0 Aims and Objectives:

This unit describes genome sequence assembly: algorithms and issues, whole genome shotgun sequencing, clones and coverage, assembly, finishing, assembly algorithms, overlap-layout-consensus, the Eulerian path, handling repeats, detecting repeats, unresolved repeats, scaffolding software, and assessing assembly quality.

17.1 Genome Sequence Assembly

Each cell of a living organism contains chromosomes composed of a sequence of DNA base pairs. This sequence, the genome, represents a set of instructions that controls the replication and function of each organism. The automated DNA sequencer gave birth to genomics, the analytic and comparative study of genomes, by allowing scientists to decode entire genomes. Although genomes vary in size from millions of nucleotides in bacteria to billions of nucleotides in humans and most animals and plants, the chemical reactions researchers use to decode the DNA base pairs are accurate for only about 600 to 700 nucleotides at a time. The process of sequencing begins by physically breaking the DNA into millions of random fragments, which are then "read" by a DNA sequencing machine. Next, a computer program called an assembler pieces together the many overlapping reads and reconstructs the original sequence. This general technique, called shotgun sequencing, was introduced by Fred Sanger in 1982.
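The shotgun idea can be illustrated by simulating reads from a toy "genome"; the read length, read count and the absence of sequencing errors are simplifying assumptions of this sketch.

import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Simulate shotgun sequencing: sample n_reads random substrings of
    length read_len from the genome (errors and strands ignored here)."""
    rng = random.Random(seed)
    return [genome[p:p + read_len]
            for p in (rng.randrange(len(genome) - read_len + 1)
                      for _ in range(n_reads))]

genome = "ATGCGTACGTTAGCGGATCCGTTACGATCGGA"
reads = shotgun_reads(genome, read_len=10, n_reads=12)
print(reads)
# An assembler must now reconstruct `genome` from these overlapping reads.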
The technique took a quantum leap forward in 1995, when a team led by Craig Venter and Robert Fleischmann of The Institute for Genomic Research (TIGR) and Hamilton Smith of Johns Hopkins University used it on a large scale to sequence the 1.83 million base pair (Mbp) genome of the bacterium Haemophilus influenzae. Much like a large jigsaw puzzle, the DNA reads that shotgun sequencing produces must be assembled into a complete picture of the genome. This seemingly simple process is not without technical challenges. For one thing, the data contains errors: some from limitations in sequencing technology and others from human mistakes during laboratory work. Even in the absence of errors, DNA sequences have features that complicate the assembly process, most notably the repetitive sections called repeats. The human genome, for example, includes some repeats that occur in more than 100,000 copies each. Similar to pieces of sky in jigsaw puzzles, reads belonging to repeats are difficult to position correctly. Further complicating assembly, some DNA fragments from each genome are impossible to sequence, resulting in gaps in coverage. The resolution of these problems entails an additional finishing phase involving a large amount of human intervention. Finishing is very costly, as it requires specialized laboratory techniques and highly trained personnel. Assembly programs can dramatically reduce this cost by taking into account additional information obtained during finishing, yet most current assemblers disregard this information and generate the best possible assembly solely from the initial shotgun reads. Advances in assembly algorithms must include features that help finishing efforts.

17.1.1 Whole Genome Shotgun Sequencing

While shotgun sequencing remains the basic strategy for all genome sequencing projects, its applicability to large genomes has been controversial. Until recently it was applied only at the end of a hierarchical process in which the genome is first broken into large segments cloned into bacterial artificial chromosomes (BACs). The BACs are then mapped to the genome to obtain a tiling path, after which the shotgun method is used to sequence each BAC in the tiling path separately. In contrast, whole-genome shotgun sequencing (WGSS) assembles the genome from the initial fragments without using a BAC map, which requires enormous computational resources. The sheer size of the data argued against WGSS for large projects, as did the presence of repeats: if positioning the long repeat stretches correctly is difficult, automating the process is even harder. In 2000, however, Eugene Myers and colleagues put most doubts to rest when they published a whole-genome assembly of the fruit fly Drosophila melanogaster. Using a new assembler built specifically for very large genomes, the Myers team successfully sequenced and assembled the 135-Mbp genome. The project was 25 times larger than any previous WGSS project, and the team went on to apply the WGSS strategy to sequence and assemble the draft human genome in 2001.

17.1.2 Clones and coverage

A WGSS project begins in the laboratory, where ultrasound or a high-pressure air stream randomly shatters the DNA into pieces that researchers then insert into cloning vectors, or clones. The clone in this case is a circular piece of DNA called a plasmid; it has a known sequence of base pairs and can accept a clone insert of foreign DNA. The bacterium Escherichia coli is then used to multiply the plasmid, thus amplifying the clone insert.
In most projects, researchers sequence both ends of each clone insert, yielding a set of sequencing reads that defines the clone-pairing data for that insert. This process links each read from a clone insert to its clone mate from the opposite end of the insert. The resulting clone-pairing data is extremely valuable, not only in guiding the assembly process but also in correctly ordering the contiguous sequences, or contigs, resulting from assembly. The ultimate goal of sequencing is to determine all the base pairs contained in the DNA. In practice, however, we try to achieve the goal of having more than 99 percent of the genome covered by reads after the initial shotgun phase. To achieve this we need to sequence clones until the reads (averaging 600 to 700 base pairs) provide an eightfold (8X) oversampling of the genome. For example, a 2-Mbp bacterial genome sequenced to 8X coverage requires 16 Mbp, or approximately 27,000 reads. Researchers choose the inserts from among several "libraries" of clone collections generated in the laboratory. The insert size specifies the average distance separating each pair of clone mates, and sizes vary from one library to the next. Typical projects contain at least two insert libraries of sizes 2 to 3 kbp and 8 to 10 kbp, respectively, and may include others, such as BAC libraries of 100 to 150 kbp. The sequenced portion of each insert averages 1,200 bp out of 3,000 bp total, so the clone inserts of a 3-kbp library sequenced to 8X cover 2.5 times as much distance as the sequences themselves. These libraries provide a "clone coverage" of more than 20-fold, meaning that, on average, 20 clones span each of the genome's bases, thus offering the theoretical guarantee that each base is contained in at least one of the clones. This guarantee assumes uniformly random sampling of clones from the genome; in practice this requirement is seldom perfectly satisfied, and cloning biases lead to a nonrandom clone distribution, causing areas of the genome to remain unsequenced regardless of the amount of sequencing performed.

17.1.3 Assembly

A WGSS assembler's task is to combine all the reads into contigs based on sequence similarity between the individual reads. The basic principle is that two overlapping reads, that is, reads where a suffix of one is a prefix of the other, presumably originate from the same region of the genome and can be assembled together. This assumption is invalid, however, for repetitive sequences, where it is impossible to distinguish reads coming from two or more distinct places in the genome. An assembler can thus incorrectly combine the reads from two copies (rpt1A, rpt1B) of a repeat, producing a misassembled contig and throwing out the unique region between the two copies. Repeats represent a major challenge to assembly software, and an assembler's utility depends in large part on detecting and correctly resolving repeat regions; resolving misassemblies in the finishing phases can be costly. Information about clone mates, combined with knowledge about the distribution of clone sizes, may help assembly programs to put some classes of repeats together correctly.
If a repeat is shorter than the length of a clone insert, mate-pair information is enough to separate the individual repeat copies, because each read within the repeat has an anchoring clone mate in the nearby nonrepetitive region.

17.1.4 Finishing
In practice, imperfect coverage, repeats, and sequencing errors cause the assembler to produce not one but hundreds or even thousands of contigs. The task of closing the gaps between contigs and obtaining a complete molecule is called finishing. First, a program called a scaffolder uses clone-mate information to order and orient the contigs with respect to each other into larger structures called scaffolds. Within a scaffold, pairs of reads spanning the gaps between contigs determine the order and orientation of the contigs. Note that the physical DNA molecule has an easily determined direction, even though the textual representation of DNA as a string of A, C, T, or G characters appears to be directionless. The gaps between contigs belonging to the same scaffold are called sequence gaps. Although they represent genuine gaps in the sequence, researchers can retrieve the original clone inserts spanning each gap and use a straightforward "walking" technique to fill in the sequence. Determining the order and orientation of the scaffolds with respect to each other is more difficult. The gaps between scaffolds are called physical gaps because the physical DNA that would span them is either not present in the clone inserts or indeterminable due to misassemblies. Filling these gaps involves a large amount of manual labor and complex laboratory techniques.

Check your progress:
1. Give details about metagenomics.
Notes: gg) Write your answer in the space given below. hh) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

17.1.5 Assembly Algorithms
Researchers first approximated the shotgun sequence assembly problem as one of finding the shortest common superstring of a set of sequences: given a set of input strings {s1, s2, ...}, find the shortest string T such that every si is a substring of T. While this problem has been shown to be NP-hard, there is an efficient approximation algorithm. This greedy algorithm starts by computing all possible overlaps between the strings and assigning a score to each potential overlap. The algorithm then merges strings iteratively, combining the pair whose overlap has the highest score, and continues until no more strings can be merged. While it can be argued that the shortest superstring problem does not correctly model the assembly problem, the first successful assembly algorithms applied this greedy merging heuristic in their design; TIGR Assembler, Phrap, and CAP3, for example, followed this paradigm. Greedy algorithms are relatively easy to implement, but they are inherently local in nature and ignore long-range relationships between reads, which could be useful in detecting and resolving repeats. In addition, all current implementations of the greedy method require up to one gigabyte of RAM for each megabase of assembled sequence, assuming the genome was sequenced at 8X coverage. This limits their applicability on currently available hardware to organisms with genomes of 32 Mbp or less.
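To make the greedy merging heuristic concrete, here is a toy sketch. It is an illustration only—real assemblers such as TIGR Assembler or Phrap score overlaps with alignment quality and tolerate sequencing errors—and here the overlap score is simply the length of an exact suffix-prefix match:

    # Toy greedy assembler: repeatedly merge the pair of strings with the
    # highest-scoring overlap (here, the longest exact suffix-prefix match).

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a equal to a prefix of b."""
        best = 0
        for k in range(min_len, min(len(a), len(b)) + 1):
            if a[-k:] == b[:k]:
                best = k
        return best

    def greedy_assemble(reads, min_len=3):
        reads = list(reads)
        while len(reads) > 1:
            best = (0, None, None)
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        k = overlap(a, b, min_len)
                        if k > best[0]:
                            best = (k, i, j)
            k, i, j = best
            if k == 0:          # no remaining overlaps: stop merging
                break
            merged = reads[i] + reads[j][k:]
            reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
            reads.append(merged)
        return reads            # one or more "contigs"

    print(greedy_assemble(["ACGTTG", "TTGCAA", "CAATGG"]))  # ['ACGTTGCAATGG']

Even at this toy scale, the repeated search over all pairs of reads hints at why real implementations of the method are so memory- and time-hungry.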
Organisms in that size range include bacteria and a few single-celled eukaryotes, but not plants, mammals, or other multicellular organisms. These limitations spurred the development of new algorithms. Two approaches exploit techniques developed in the field of graph theory: one represents the sequence reads as graph nodes, the other as graph edges.

17.1.6 Overlap-layout-consensus
The first approach, overlap-layout-consensus, constructs a graph in which nodes represent reads and edges indicate that the corresponding reads overlap. Each contig is represented as a simple path—that is, a path through the graph that contains each node at most once. An assembler following this paradigm must first build the graph by computing all possible alignments between the reads. A second stage cleans up the graph by removing transitive edges and resolving ambiguities. The output of this stage comprises a set of nonintersecting simple paths in the refined graph, each such path corresponding to a contig. A final step generates a consensus sequence for each contig by constructing the multiple alignment of the reads that is consistent with the chosen path. Full information about each read in the input is only necessary for the overlap and consensus stages. The graph-refinement stage stores only a limited amount of information about each overlap, such as its coordinates and length, which allows a memory-efficient implementation. Not surprisingly, recent WGSS assemblers use this approach. The overlap-layout-consensus technique has the additional value of encoding other relationships between reads, such as clone-mate information, which an assembler can use in correctly assembling repetitive areas.

17.1.7 Eulerian path
The second graph-theoretical approach to shotgun sequence assembly uses a sequencing-by-hybridization (SBH) technique. The idea is to create a virtual SBH problem by breaking the reads into overlapping n-mers, where an n-mer is a substring of length n from the original sequence. Next, the assembler builds a directed de Bruijn graph in which each edge corresponds to an n-mer from one of the original sequence reads. The source and destination nodes correspond, respectively, to the (n-1)-base prefix and (n-1)-base suffix of the corresponding n-mer. For example, an edge connecting the nodes ACTTA and CTTAG represents the 6-mer ACTTAG. Under this formulation, the problem of reconstructing the original DNA molecule corresponds to finding a path that uses all the edges—that is, an Eulerian path. In theory, the Eulerian path approach is computationally far more efficient than overlap-layout-consensus, because the assembler can find Eulerian paths in linear time, while the problems associated with the overlap-layout-consensus paradigm are NP-complete. Despite this dramatic theoretical difference, the actual performance of existing algorithms indicates that overlap-layout-consensus is just as fast as the SBH-based approach.

17.1.8 Handling Repeats
If genomic data included no repeats, an assembler could use any assembly algorithm to put all the pieces together correctly, even in the presence of sequencing errors. The repeats found in real genomes can, however, prohibit correct automated assembly, at least solely from information contained in the original reads.
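Returning to the Eulerian-path formulation of section 17.1.7, building the de Bruijn graph is simple enough to sketch directly. The n-mer size and read below are made-up illustrations:

    # Build a de Bruijn graph: each n-mer in a read becomes an edge from its
    # (n-1)-base prefix to its (n-1)-base suffix. An Eulerian path through
    # the edges spells out a reconstruction of the original sequence.
    from collections import defaultdict

    def de_bruijn(reads, n):
        graph = defaultdict(list)   # (n-1)-mer -> list of successor (n-1)-mers
        for read in reads:
            for i in range(len(read) - n + 1):
                nmer = read[i:i + n]
                graph[nmer[:-1]].append(nmer[1:])
        return graph

    # The text's example: the 6-mer ACTTAG becomes an edge ACTTA -> CTTAG.
    g = de_bruijn(["GACTTAGC"], 6)
    for src, dsts in g.items():
        for dst in dsts:
            print(src, "->", dst)

Repeats are precisely what tangle such graphs in practice, as the following example illustrates.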
A large tandem repeat found in the bacterium Streptococcus pneumoniae, for example, consists of a 24-bp unit repeated in identical and nearly identical copies, in tandem, for a stretch covering approximately 14,000 bp. Given that individual reads average 600 bp in length and that all reads obtained from this region are essentially identical, no assembler can determine a unique tiling across this repeat. The resulting assembly is likely to contain a reconstruction of only a 600-bp section of the repeat, in which all the reads have collapsed on top of one another—much like the rpt1A/rpt1B misassembly described earlier. In this particular case, clone mates also fail to resolve the problem, because the largest clone insert for this project covered only 10 kbp. This example highlights several issues that assembly programs must address. First, they must identify repeats, preferably during the assembly process, to avoid mistakes caused by overcollapsing repeat copies. Detecting such misassemblies is much more difficult after assembly is completed, and the misassemblies can lead to incorrect genome reconstructions. Second, assemblers must attempt to correctly assemble as many repeats as possible, to reduce the amount of human labor involved in completing the genome. For short repeats, this step can be as simple as using anchored reads, meaning those having mates in the unique areas surrounding the repeat. For more complex repeats, the assembler must be able to use additional information obtained through laboratory experiments.

Detecting repeats
A simple solution to the repeat-detection problem identifies the pileup caused by a misassembly. Because the reads come from a random sampling of the genomic DNA, typically with 8X coverage of the genome, areas covered by a significantly larger number of reads indicate an overcollapsed repeat. Most assemblers use variations on this simple idea. Although the idea is useful, it assumes that the reads are sampled uniformly at random from the genome. In reality, certain areas tend to be poorly represented or absent from the sample—for example, when the insert is toxic to the laboratory organism used to amplify it. In addition, low-copy repeats, which appear only two to three times in a genome, may escape detection because they do not appear to be statistically oversampled. While statistical methods can provide a rough filter, assembly programs must use other techniques to accurately separate out the repeats. As an example, the recently developed assembly program Euler detects repeats by finding complex areas, or tangles, in the graph constructed during assembly. Researchers can use the information contained in the tangle to guide experiments to resolve the repeat. Assemblers that simply mask out repeats—another common strategy—lose this information and must obtain it by other means. Because the cloning process generates reads in pairs from opposite ends of clone inserts, assemblers can use information about clone mates to help detect areas that have been incorrectly assembled due to repeats. Such areas usually contain many instances of clone mates that were assembled either too close together or too far apart, or whose relative orientation is incorrect. This information must be used with care, however, since clone length estimates are usually imprecise, especially for larger clones.
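The read-depth heuristic described under "Detecting repeats" can be sketched in a few lines. The window size and threshold factor below are illustrative assumptions, not values from the text:

    # Flag windows whose read depth is suspiciously high relative to the
    # expected coverage -- a crude signal of an overcollapsed repeat.

    def flag_repeat_windows(depths, expected, factor=2.0, window=100):
        """depths: per-base read depth along a contig. Returns start
        positions of windows whose mean depth exceeds factor * expected."""
        flagged = []
        for start in range(0, len(depths) - window + 1, window):
            chunk = depths[start:start + window]
            if sum(chunk) / window > factor * expected:
                flagged.append(start)
        return flagged

    # Toy contig at 8X expected coverage, with a collapsed three-copy
    # repeat spanning positions 200-299.
    depths = [8] * 200 + [24] * 100 + [8] * 200
    print(flag_repeat_windows(depths, expected=8))   # [200]

Deciding what counts as "too close" or "too far" for clone mates is itself a subtle question, as discussed next.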
The difficult problem here is finding outliers in a data set whose distribution is unknown. The most reliable information comes from the relative orientation of the sequencing reads, which can nearly always be tracked correctly. When repeats are widely separated in the genome, clone-pairing data can resolve them effectively for reads whose mates are anchored in the neighboring nonrepetitive areas. Although some repeat copies are identical, it is more common to find some differences between them. These differences sometimes provide enough information for the assembler to distinguish the copies from one another. In the absence of sequencing errors, a single-nucleotide difference between two copies of a repeat is enough to distinguish them. Researchers have developed several techniques to correct sequencing errors during repeat resolution. All current techniques are based on finding statistically significant clusters of reads, where the clusters are defined by shared differences among the reads. This approach assumes that sequencing errors are independent and, therefore, that an identical difference appearing at the same position in multiple reads is likely to be a real difference typical of that copy of the repeat. For example, if four reads contain an A in position 200 and four other reads contain a G in that position, the assembler can infer with high confidence that the first four reads come from one copy of the repeat, while the second four represent a different copy. One drawback of this approach is the need for relatively deep coverage to detect true differences between repeat copies: if a repeat region is difficult to clone—a common phenomenon—the coverage of that repeat will be low. Moreover, true polymorphisms—such as those between different copies of nearly identical chromosomes (each human chromosome, for example, occurs in two copies) or from nonclonal source DNA—further complicate the problem.

Unresolved repeats
Even using all these information sources, an assembler cannot resolve every repeat, and humans must intervene to finish some complex areas. The basic technique for this task is to separate out the reads coming from distinct repeat copies. In directed sequencing experiments, researchers amplify stretches of DNA anchored in unique areas around the repeat. By considering each copy of the repeat in isolation from the others, an assembly program can then put the genome together while holding these repeat contigs intact. Assembling a mixture of contigs and reads, while guaranteeing that the contigs will not break up in the process, is known as a jumpstart assembly. Only TIGR Assembler currently supports this capability.

Scaffolding Software
The scaffolding process groups contigs together into subsets with a known order and orientation. Researchers generally infer relationships between contigs from clone-mate information, and most recent assemblers include a scaffolding step. Moreover, the Human Genome Project BAC collections were ordered and oriented through scaffolding. To reformulate the scaffolding problem in graph-theoretic terms, we can construct a graph in which the nodes correspond to contigs, and a directed edge links two nodes when mate pairs bridge the gap between them. In this case, each pair of reads implies a particular orientation and spacing of the contigs needed to form a correct pair.
The scaffolding program must now solve three problems:
• Find all connected components in the defined graph.
• Find a consistent orientation for all nodes in the graph, where the edges are of two types: those requiring the two nodes to have the same orientation and those forcing the two nodes to have different orientations. We call the latter reversal edges. A consistent orientation of all the nodes is possible only if every undirected cycle contains an even number of reversal edges. Because errors in the pairing data or misassemblies can invalidate this condition, we must solve an optimization problem: find the smallest number of edges that must be removed so that no cycle has an odd number of reversal edges. This optimization problem is NP-complete.
• Given the length estimates of the edges, embed the graph on a line or—for some bacterial and archaeal genomes—on a circle, such that the fewest constraints are invalidated. This problem is a special case of the optimal linear arrangement problem, which is also NP-complete.
While the last two problems are difficult from a theoretical standpoint, simple heuristics can easily handle the instances encountered in practice. Moreover, in practice we can relax the optimality criteria. During the finishing phase, for example, a linear embedding of the contigs is not necessary. In fact, ambiguities in the graph can highlight possible misassemblies, and finishing teams can use this information in designing experiments to confirm a particular embedding of the graph. The complexity of scaffolding stems specifically from the presence of errors in the data. Again, simple heuristics can reduce the effect of such errors. For example, we can reduce errors caused by data-tracking problems or misassemblies by requiring at least two sources of linking information between contigs, or by ignoring links anchored in repeat areas. For any but the smallest genomes, it is unlikely that a single scaffold will hold all the contigs, so additional information is needed to order and orient the scaffolds themselves. Two common sources of such information are physical maps and comparisons to related organisms. Physical mapping encompasses a variety of laboratory techniques for characterizing a set of markers along a DNA strand. Markers include known genes and short, unique sequences of a few hundred nucleotides, called tags, that researchers have fluorescently labeled and mapped to an approximate point on a chromosome. Determining the layout of these markers before sequencing provides an independent information source for scaffolding software. Using contigs created by an assembler, researchers can simulate the mapping experiment computationally by searching for the tag locations in the contig sequence. The comparison between this electronic map and the physical map provides ordering information that the scaffolding program can use. The sequence of a closely related organism is another source of scaffolding information. For example, by aligning the scaffolds from a preliminary assembly of the mouse genome to the human genome, we can obtain the likely order of the mouse contigs. Of course, this information will be incorrect where major genome rearrangements have occurred in the evolutionary divergence of the two species. This technique therefore works best with a very closely related genome that has been sequenced to completion.
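Returning to the orientation subproblem listed above: in the error-free case it has a clean sketch—propagate orientations across the contig graph, flipping at reversal edges, and report a conflict when some cycle has an odd number of reversal edges. The graph encoding below is an illustrative assumption:

    # Assign consistent orientations to contigs linked by mate-pair edges.
    # Each edge is (u, v, reversal): reversal=True means u and v must have
    # opposite orientations. Returning None means some cycle has an odd
    # number of reversal edges, so no consistent assignment exists.
    from collections import deque

    def orient_contigs(nodes, edges):
        adj = {n: [] for n in nodes}
        for u, v, rev in edges:
            adj[u].append((v, rev))
            adj[v].append((u, rev))
        orient = {}
        for start in nodes:                  # handle each connected component
            if start in orient:
                continue
            orient[start] = +1
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v, rev in adj[u]:
                    want = -orient[u] if rev else orient[u]
                    if v not in orient:
                        orient[v] = want
                        queue.append(v)
                    elif orient[v] != want:
                        return None          # odd cycle of reversal edges
        return orient

    edges = [("A", "B", False), ("B", "C", True)]
    print(orient_contigs(["A", "B", "C"], edges))  # {'A': 1, 'B': 1, 'C': -1}

Real scaffolders must instead remove a minimum set of conflicting edges, which is where the NP-completeness noted above enters.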
The sources of linking information used to construct scaffolds vary in quality. In particular, the error in determining the length of inserts, and thus the distance between clone mates, increases with the insert size. Physical map data is inherently error prone. Finally, large-scale genome rearrangements can affect the homology data. TIGR has developed Bambus, a scaffolder that factors the confidence in the linking information into hierarchically constructed scaffolds. The algorithm first builds a set of scaffolds based on the highest-confidence links, then incorporates the lower-confidence information to combine the scaffolds into larger structures. This hierarchical method reduces the effect of incorrect linking data while still using all the information sources.

17.1.9 Assessing Assembly Quality
Correcting misassemblies is expensive, especially if they go undetected until the late stages of a sequencing project. Assemblers highlight problematic areas by outputting the confidence level of each base in the consensus. Because this simple quality-control method is an inherently local measure, it fails to capture larger scale phenomena, such as whole DNA sections that are incorrectly spliced together. The assembly pipeline must therefore contain a validation module that uses additional information to determine contig quality. Finding errors in assemblies is easy when the complete sequence is already known, and we can use known benchmark data sets to fine-tune assembly software. These data sets, either artificially generated or representing real sequencing reads from completed projects, provide both the correct consensus sequence and the exact location of all reads in the true DNA sequence. With such benchmarks, we can detect both local errors in the consensus base calls and large-scale rearrangements, such as reversals and insertions, in the genome assembly. Some of these ideas carry over to the real challenge in practice: finding assembly errors when the true layout is unknown. For example, physical maps provide markers that we can use to validate large contigs. Similarly, we can use the sequence of a closely related organism to confirm areas that we do not expect to have significantly diverged. In the absence of any other types of information, clone mates have been used to detect assembly errors: areas of the genome that violate the orientation and distance constraints imposed by the clone mates indicate potential misassemblies. Most reported measures of assembly quality are aggregate measures, such as the number and sizes of contigs. They assume that an assembly consisting of a few large contigs is better than one composed of many small contigs. This assumption is partly true, in that the number of contigs indicates the number of gaps, which in turn correlates with the amount of work needed to finish the genome. Aggregate size measures do not, however, account for the possibility of misassemblies, and they are therefore only marginally useful; if anything, an assembler can generate large contigs at the expense of misassemblies. To demonstrate these concepts, we performed a series of tests on the genome of Wolbachia, an endosymbiotic bacterium found in Drosophila (the fruit fly) and other insects.
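Before turning to those tests, the clone-mate validation just described reduces to checking orientation and separation constraints against assembled read positions. A minimal sketch follows; the tolerance and data layout are illustrative assumptions:

    # Check assembled mate pairs against expected insert size and
    # orientation. Each pair: (pos1, strand1, pos2, strand2), positions in
    # one contig and strands '+'/'-'. A correctly assembled pair faces
    # inward and is separated by roughly the library insert size.

    def validate_mates(pairs, insert_size, tolerance=0.2):
        violations = []
        lo = insert_size * (1 - tolerance)
        hi = insert_size * (1 + tolerance)
        for i, (p1, s1, p2, s2) in enumerate(pairs):
            (left, ls), (right, rs) = sorted([(p1, s1), (p2, s2)])
            span = right - left
            if (ls, rs) != ('+', '-'):          # mates must face each other
                violations.append((i, 'orientation'))
            elif not (lo <= span <= hi):        # spacing far from insert size
                violations.append((i, 'distance'))
        return violations

    pairs = [(100, '+', 3050, '-'),   # fine for a 3-kbp library
             (100, '+', 9000, '-'),   # too far apart: possible misassembly
             (100, '+', 3050, '+')]   # wrong orientation
    print(validate_mates(pairs, insert_size=3000))
    # [(1, 'distance'), (2, 'orientation')]

The Wolbachia comparison below shows what such checks reveal in practice.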
We recently completed this genome at TIGR, so we had a "true" DNA sequence to which we could compare assembly results. We assembled the genome from the original shotgun reads using Phrap, TIGR Assembler, and Celera Assembler, running all three with their default settings, and verified the assemblies by aligning the resulting contigs to the finished sequence. According to the number and length of contigs, Phrap appears to produce the best output, followed closely by TIGR Assembler. The Phrap assembly contains about one-fourth as many contigs as Celera Assembler's, and its contigs are about four times larger on average. In addition, the total size of these contigs (1.26 Mbp) matches the actual size of the Wolbachia genome (1.26 Mbp). In contrast, if we look at the proportion of the sequence covered by correct assemblies, the Celera Assembler's output spans more than 99 percent of all bases, while the TIGR Assembler contigs cover just over 93 percent, and Phrap covers barely 36 percent. These results lead to a couple of conclusions. First, Phrap and TIGR Assembler appear to have misassembled some repeats, which explains the lack of coverage. At the same time, the small size of the Celera Assembler's contigs, combined with their large total size, leads us to believe that it failed to combine many contigs that should have been assembled together. Closer examination—and similar experience with many other genomes—indicates that this usually results from poor-quality data at the ends of sequences. To get Celera Assembler to combine more contigs, we performed an additional step of more aggressively trimming poor-quality data from the ends of the input sequences. The results of this additional run indicate that the technique closed a number of gaps, yielding larger contigs overall. At the same time, the coverage of the genome decreased, indicating a potential drawback of the technique. These observations correlate with our understanding of the assembly algorithms used by the three programs. TIGR Assembler and Phrap are more tolerant of incorrect data at the sequence ends, which allows them to create bigger contigs. At the same time, their handling of clone-mate information is less sophisticated than Celera Assembler's. In particular, TIGR Assembler uses a greedy approach that lets it walk through a repeat, occasionally violating clone-link constraints. The Phrap program simply does not take these constraints into consideration, leading to the overcollapse of repeat regions. Closer analysis verified the hypothesis that all the observed misassemblies were correlated with repeats in the Wolbachia genome. The ultimate goal of genome sequencing is the complete DNA sequence of an organism, and a good assembler can greatly aid the human effort involved in the finishing phase. An assembler designed for finishing should use multiple sources of information, such as data from finishing experiments in the laboratory; it should also let human experts place restrictions on the assembly, such as regions that need to be held together or repeats that should be kept separate. Better quality-control tools are essential, and defining quality measures that make it possible to evaluate assembly algorithms is a first step toward their improvement.
This issue is particularly critical for incomplete genomes, such as the various and constantly changing versions of the draft human genome sequence. Assemblers that can report "weak" areas in the assembly and highlight potential misassembly sites are essential not only for the subsequent analysis of assembly data but also for guiding the efforts of finishing experts. Moreover, well-defined objective quality measures will provide an additional level of validation even in the case of completely finished genomes.

17.2 Let us sum up
Although the assembly of bacterial genomes has become a routine task at major sequencing centers, the assembly problem is far from solved. Many new challenges are uncovered as scientists tackle diverse new organisms, and new sequencing technologies will change the assumptions currently made about the characteristics of the data being assembled. Current sequencing technologies only allow us to "read" up to 1,000-2,000 bases of DNA at a time. To overcome this limitation, sequencing of entire organisms is performed through a process called shotgun sequencing, wherein the DNA is sheared into smaller fragments whose ends are then sequenced. The reconstruction of the original DNA sequence is handled by specialized computer programs called assemblers. The output of assembly programs consists of a collection of contiguous pieces (contigs)—rarely are entire chromosomes reconstructed in a single piece. An additional computer program, the scaffolder, uses the information linking sequencing reads from the ends of fragments to order and orient the contigs with respect to each other along a chromosome.

17.3 Lesson end activities
1. Find out details about
· Automatic finishing techniques
· Automatic sequencing error correction
· Handling of polymorphic data
· Repeat resolution
· Representation of assembly data in public databases

17.4 Check your progress: Model answers
1. Your answer may include these points: Metagenomics is a new field of research in which scientists analyze the genomes of organisms recovered directly from the environment. Most naturally occurring bacteria cannot be cultured and therefore cannot be analyzed by traditional means. Metagenomic studies, however, overcome this limitation, provide a mechanism for analyzing previously unknown organisms, and have a wide range of applications, from environmental studies to human health.

17.5 Points for Discussion
1. How do you rate genome sequence assembly's contribution to Bioinformatics?

17.6 References
1. Schatz, M.C., Phillippy, A.M., Shneiderman, B., Salzberg, S.L. Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34. 2007.
2. M. Roberts, B.R. Hunt, J.A. Yorke, R.A. Bolanos and A.L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology 11(4):734-752. 2004.
3. M. Roberts, W. Hayes, B.R. Hunt, S.M. Mount and J.A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18):3363-3369. 2004.
4. M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics 5(3):237-248. 2004.
5. M. Pop. Shotgun sequence assembly. Advances in Computers vol. 60, M. Zelkowitz ed. June 2004.
6. M. Pop, D. Kosack. Using the TIGR Assembler in shotgun-sequencing projects. In Bacterial Artificial Chromosomes vol. 1, S. Zhao and M.
Stodolsky, eds. Humana Press, pp. 279-294, March 2004.
7. M. Pop, D.S. Kosack, S.L. Salzberg. Hierarchical scaffolding with Bambus. Genome Research 14(1):149-159. 2004.
8. P. Gajer, M. Schatz, S.L. Salzberg. Automated correction of genome sequence errors. Nucleic Acids Research 32(2):562-569. 2004.
9. M. Pop, S.L. Salzberg, M. Shumway. Genome Sequence Assembly: Algorithms and Issues. IEEE Computer 35(7):47-54. 2002. Copyright 2002 IEEE. Reproduced with permission from IEEE.

LESSON – 18 GENE PREDICTION PROGRAMS
18.0 Aims and objectives
18.1 Gene prediction programs
18.1.1 Introduction
18.1.2 GLIMMER (Gene Locator and Interpolated Markov Modeler)
18.1.3 Interpolated Markov models (IMMs)
18.1.4 The GLIMMER system
18.1.5 Fungal gene predictions by GLIMMER, GLIMMERM and GenScan
18.2 Let us Sum up
18.3 Lesson end activities
18.4 Check your progress
18.5 Points for Discussion
18.6 References

18.0 Aim and Objective: This unit describes gene prediction programs: an introduction, GLIMMER (Gene Locator and Interpolated Markov Modeler), interpolated Markov models (IMMs), the GLIMMER system, and fungal gene predictions by GLIMMER, GLIMMERM, and GenScan.

18.1 Gene prediction programs
18.1.1 Introduction
A major goal of genome projects is to identify all genes in a given organism. Consequently, the development of automated gene-finding procedures has become one of the most active areas of research in Bioinformatics. Protein-coding DNA sequences exhibit characteristics that distinguish them from non-coding sequences. For prokaryotic organisms, the task of gene identification is relatively easy, as prokaryotic genomes are rather small and genes are not interrupted by introns. Here, all open reading frames (ORFs) exceeding some threshold length are likely to code for proteins. The gene-finding problem is much more complicated for eukaryotic organisms, where the density of genes in the genome is about two orders of magnitude lower than in bacterial genomes and genes typically consist of multiple exons separated by introns of varying length. The commonly used approach for gene prediction is to train computer programs to recognize sequences that are characteristic of known exons in genomic DNA sequences. The patterns used to predict genes include intron-exon boundaries and upstream promoter sequences. However, in eukaryotes these signals are poorly defined and therefore cannot be found by the simple pattern-matching techniques used for prokaryotes. During the past few years, various prediction methods have been developed to identify genes in eukaryotic genome sequences. Recent studies show, however, that the reliability of these methods is limited for large genomic sequences, as they cannot locate all possible exons encoded in the sequence. Moreover, many gene-prediction programs were originally tested on genomic sequences of only a few kilobases (kb) in length, where each sequence contained only a single gene. The performance of standard gene-prediction methods drops significantly when they are tested under more realistic conditions, usually involving multiple genes. Practically all existing gene-prediction programs rely on information derived from known genes.
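Before surveying those methods, the prokaryotic baseline mentioned above—report every sufficiently long ORF—is simple enough to sketch. The following is a toy illustration (the length threshold is an assumption, and real gene finders such as GLIMMER score ORFs with trained models rather than accepting them by length alone):

    # Naive prokaryotic gene calling: report every ORF (ATG ... in-frame
    # stop) longer than a threshold, scanning the three forward frames.

    STOPS = {"TAA", "TAG", "TGA"}

    def find_orfs(seq, min_len=300):
        seq = seq.upper()
        orfs = []
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    if i + 3 - start >= min_len:
                        orfs.append((start, i + 3))
                    start = None
        return orfs

    # Toy example with a low threshold so the single short ORF is reported.
    print(find_orfs("CCATGAAATTTGGGTAACC", min_len=15))  # [(2, 17)]

Eukaryotic gene finders must go far beyond this, and the methods below differ mainly in how they do so.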
Major differences between existing methods lie in how they assess whether a stretch of genomic DNA looks like known genes. Two approaches are used. Ab-initio (intrinsic) methods use content statistics such as ORF length or codon usage, together with sequence signals like splice junctions, to distinguish coding from non-coding regions. GLIMMER, GRAIL, GENEID, GenScan, and GeneMark are among the most popular ab-initio programs. By contrast, extrinsic methods work by comparing the genomic sequence to known ESTs or proteins in databases, checking whether a piece of the genomic sequence is similar to any known genes or proteins. This idea has been implemented in GENEWISE and PROCRUSTES. Neither ab-initio nor extrinsic methods can perfectly elucidate the complex and variable genomic structure of higher eukaryotic organisms. Their genes contain a large number of small exons separated by long intervening sequences (introns). Furthermore, in actual genomes some non-coding sequences can exhibit features of typical coding sequences (e.g., pseudogenes) and vice versa. Moreover, a large fraction of higher eukaryotic coding exons are very short and cannot be effectively detected by commonly used gene-prediction programs. The following sections describe three gene-prediction methods—GLIMMER, GLIMMERM, and GenScan—used in this study for detecting mainly small exons in the Neurospora crassa genome. They were chosen from among several others based on their likely ability to detect small coding regions and on their sensitivity and specificity values when used on other genomes.

18.1.2 GLIMMER (Gene Locator and Interpolated Markov Modeler)
GLIMMER is a computational gene finder that was initially developed to predict genes in prokaryotic genomes. Gene finders for prokaryotes have an advantage in that the genomes tend to be gene-rich, containing about 90% coding sequence. One major problem is to correctly identify the genes when two or more open reading frames (ORFs) overlap. GLIMMER uses a technique called the interpolated Markov model (IMM), a generalization of Markov chain methods, to identify coding regions in microbial sequences. GLIMMER 1.0 was used as the gene finder for several bacterial genomes (Borrelia burgdorferi, Treponema pallidum, Chlamydia trachomatis, and Thermotoga maritima). GLIMMER 2.0 makes several technical improvements to the GLIMMER 1.0 algorithm and works better at resolving overlapping ORFs. GLIMMER uses an approach based on the frequency of occurrence of nucleotides in DNA to determine the relative weights of oligomers of lengths from 1 to 9 bp. First, IMMs are created for the six reading frames (three frames for each of the two strands, forward and reverse), and these are then used to score entire ORFs. When there is an overlap between two high-scoring ORFs, the overlapped region is scored separately to determine the more likely gene.

18.1.3 Interpolated Markov models (IMMs)
A Markov chain is a sequence of random variables Xi (where i is the position in the sequence) in which the probability distribution of each variable depends only on the preceding k variables Xi-1, ..., Xi-k, for some constant k. In the case of DNA sequences, each random variable Xi takes a value from the set of four nucleotide bases (a, c, g, and t). Depending on the order of the Markov chain used, the constant k takes values from 0 to 8.
For example, a fixed first-order Markov chain is specified completely by a matrix of 16 conditional probabilities: p(a|a), p(a|c), p(a|g), ..., p(t|t), where each term represents the probability of the current base given the previous base. A second-order Markov chain predicts a base by looking at the two previous bases. In general, a kth-order Markov model requires 4^(k+1) probabilities for each reading frame. In a 0th-order model, the matrix contains only the individual probabilities of the four nucleotides (a, c, g, and t). In a first-order model, the 16 dinucleotide probabilities (aa, ac, ag, at, ..., tg, tt) are calculated by looking at the previous base. A second-order model gives the probabilities of the 64 trinucleotides (aaa, aag, aac, aat, ..., ttg, ttt). In principle, using longer oligomers is always preferable to using shorter ones, but only if sufficient data is available to produce reliable probability estimates. Currently most gene finders use fixed 5th-order Markov chains (that is, hexamer nucleotide or di-codon frequencies), as these have proven effective for gene prediction. IMMs are a generalization of fixed-order Markov chains. The main difference between IMMs and fixed Markov models is that IMMs use a varying number of context bases for each prediction rather than deciding in advance how many bases to consider. This allows IMMs to adapt to the frequencies of particular oligomers in a genome. For example, if some 5-mers (oligomers of five bases) occur too infrequently, their probabilities cannot be estimated reliably and they will not be used in the model. On the other hand, if some 8-mers occur sufficiently frequently, the IMM uses this longer context to make better predictions. The model thus exploits whatever context length the data can support. From the training data sets, GLIMMER computes the probability of each nucleotide base (a, c, g, or t) following every k-mer (0 ≤ k ≤ 8). For each k-mer, weights are computed for use in the different models. These weights and Markov models are interpolated to produce a score for each base in any potential coding region, and the logs of the scores are summed to score the coding region. The (log-)probability that the model M generates the sequence S is computed as

log P(S|M) = Σ (x = 1 to n) log IMM_8(S_x)

where S_x is the oligomer ending at position x and n is the length of the sequence. IMM_8(S_x) is the 8th-order interpolated Markov model score, computed recursively as

IMM_k(S_x) = λ_k(S_{x-1}) · P_k(S_x) + [1 − λ_k(S_{x-1})] · IMM_{k-1}(S_x)

where λ_k(S_{x-1}) is the numeric weight associated with the k-mer ending at position x−1 in the sequence S, and P_k(S_x) is the estimate, obtained from the training data, of the probability of the base at position x given its k preceding bases:

P_k(S_x) = P(s_x | S_{x,k}) = f(S_{x,k} · s_x) / Σ (b ∈ {a,c,g,t}) f(S_{x,k} · b)

where f(·) denotes the number of occurrences of a string in the training data and S_{x,k} is the k-base context preceding position x. GLIMMER uses two criteria to determine λ_k(S_x). The first criterion is simply frequency of occurrence: the current default threshold is 400 occurrences, which gives 95% confidence that the sample probabilities are within 5% of the true probabilities from which the sample was taken. When there are insufficient sample occurrences of a context string (oligomer), an additional criterion is employed to assign a λ value.
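The interpolation recursion above can be sketched directly. In this Python illustration, the probability tables and weights are toy stand-ins for values that GLIMMER would estimate from training data:

    # Interpolated Markov model score for one position, following the
    # recursion IMM_k = lambda_k * P_k + (1 - lambda_k) * IMM_{k-1}.
    # prob[k] maps a k-base context to {base: probability}; weight[k] maps
    # a k-base context to its lambda. All values here are illustrative.

    def imm_score(seq, x, k, prob, weight):
        """Probability of base seq[x] using contexts of order k down to 0."""
        if k == 0:
            return prob[0][""].get(seq[x], 0.25)
        context = seq[x - k:x]
        lam = weight[k].get(context, 0.0)     # unseen context: fall back
        p_k = prob[k].get(context, {}).get(seq[x], 0.25)
        return lam * p_k + (1 - lam) * imm_score(seq, x, k - 1, prob, weight)

    # Toy model of orders 0-2 for scoring the base after the context "ac".
    prob = {0: {"": {"a": 0.3, "c": 0.2, "g": 0.3, "t": 0.2}},
            1: {"c": {"g": 0.5, "a": 0.2, "c": 0.2, "t": 0.1}},
            2: {"ac": {"g": 0.7, "a": 0.1, "c": 0.1, "t": 0.1}}}
    weight = {1: {"c": 0.8}, 2: {"ac": 0.6}}

    # Score 'g' at position 2 of "acg" with a 2nd-order interpolated model.
    print(imm_score("acg", 2, 2, prob, weight))
    # 0.6*0.7 + 0.4*(0.8*0.5 + 0.2*0.3) = 0.604

The second criterion for assigning λ, described next, compares the observed context frequencies with the next-shorter model.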
For a given context string S_{x,i} of length i, the observed frequencies of the next base, f(S_{x,i}, a), f(S_{x,i}, c), f(S_{x,i}, g), and f(S_{x,i}, t), are compared with the previously calculated IMM probabilities using the next-shorter context, IMM_{i-1}(S_{x,i-1}, a), IMM_{i-1}(S_{x,i-1}, c), IMM_{i-1}(S_{x,i-1}, g), and IMM_{i-1}(S_{x,i-1}, t). The two sets of values are compared using a χ² test. If the values differ significantly, the observed values are used; if they are consistent with the IMM values, a lower weight is given, since the longer context then offers little additional predictive value. The value of λ_k(S_x) associated with P_k(S_x) can thus be regarded as a measure of our confidence in estimating the true probability. The number of parameters to estimate grows exponentially with the order of the model, and the higher the order, the less reliable the parameter estimates can become.

18.1.4 The GLIMMER system
The GLIMMER system consists of two programs: build-imm and glimmer (or glimmer2). The program build-imm takes an input set of sequences, which can be complete genes or partial ORFs, and builds and outputs the interpolated Markov model. The second program, glimmer, uses this IMM to identify genes in a genomic sequence. GLIMMER does not use sliding windows to score coding regions. Instead, it identifies all ORFs longer than a threshold value and scores them in the six possible reading frames. An ORF is assumed to contain no stop codon between its start codon and its terminal stop codon. Glimmer selects the frame that scores highest for further examination of overlaps. If there is an overlap between reading frames, it selects the overlapped region and scores it separately. Overlapping ORFs are resolved based on their lengths and the separate scores computed for their overlapped regions. Suppose that A and B are two ORFs that overlap. If the overlap scores higher in A's reading frame and A is longer than B, we reject B. If the overlap scores higher in B's reading frame and B is longer than A, we reject A. Otherwise, both A and B are marked as "suspect". GLIMMER 2.0 has resolved some of the prediction problems of GLIMMER 1.0, which occasionally discarded a gene because the placement of its start codon in the 5' direction created an overlap with another gene. GLIMMER 2.0 resolves overlap problems by incorporating extra rules. The scoring of potential overlapping genes is similar to that of GLIMMER 1.0, but the system attempts to move the locations of the start codons much more aggressively. When ORFs A and B overlap, there are four different orientations to consider, and the process of evaluating overlaps is performed iteratively to prevent unnecessary rejection of genes. The current version also helps to find genes that were missed earlier because of the high probability threshold score. GLIMMER is the primary microbial gene finder at The Institute for Genomic Research (TIGR) and has been used to annotate the complete genomes of Borrelia burgdorferi, Treponema pallidum, Thermotoga maritima, Deinococcus radiodurans, and Mycobacterium tuberculosis, as well as non-TIGR projects including Chlamydia trachomatis and Chlamydophila pneumoniae. The accuracy of its gene predictions for bacterial and archaebacterial genomes (including Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Bacillus subtilis, and Mycoplasma genitalium) has been close to 98%. It has not generally been used for eukaryotic genomes.
Even though GLIMMER was developed for prokaryotic genomes, it can still be useful for finding eukaryotic genes. Many small eukaryotes, for example, have relatively high gene density and contain short genes uninterrupted by introns, and such genes are often missed by commonly used prediction programs. The GLIMMER system, including the source code, was downloaded from The Institute for Genomic Research website (http://www.tigr.org). Example output of the program is listed in Appendix D.

The GLIMMERM system
The GLIMMERM algorithm uses the same IMM scoring method used in GLIMMER 2.0 and was developed specifically for eukaryotes having a gene density of less than 20%. The splice-site predictor algorithm in GLIMMERM captures dependencies among neighboring bases in a small window around each splice junction (16 and 29 bp around the 5' donor and 3' acceptor sites, respectively). The algorithm takes advantage of the fact that coding and non-coding sequence switch at the splice junction and detects this switch with two second-order Markov chains, one modeling coding sequence and the other non-coding sequence. The length of each of these coding or non-coding context windows is currently fixed at 80 bp. Potential coding regions are evaluated by a scoring function based on decision trees that estimate the probability that a DNA subsequence is coding. Subsequences are evaluated according to their putative type: intron, initial exon, internal exon, final exon, or single-exon gene. Each such subsequence is run through ten different decision trees built with the OC1 induction system, which can take multiple numeric feature values. The probabilities obtained from the decision trees are averaged to produce a smoothed estimate of the probability that the given subsequence is of a certain type. A putative gene model is then accepted only if the IMM score for the coding sequence in the correct reading frame exceeds a fixed threshold. The main assumptions of this program are:
• the coding region of every gene begins with the start codon ATG,
• the gene has no in-frame stop codons except the very last codon, and
• each exon is in a consistent reading frame with the previous exon.
These constraints significantly enhance the efficiency of computing the optimal gene models by restricting the search space of the dynamic programming algorithm. The dynamic programming algorithm processes sequences from left to right, searching for stop codons. At each stop codon, it searches back in the 5' direction (right to left), finding all possible genes using this stop codon, and chooses the highest-scoring gene. The only positions considered as possible intron donor and acceptor sites are those that score above the threshold determined by the Markov chains. The algorithm is run separately on the direct and complementary strands of the input. GLIMMERM rejects overlapping genes by going through the list of putative genes. An overlap occurs when two models share a common stop codon but have different exon locations. If two genes overlap by less than 30 bp (the default value), the overlap is ignored and both are considered possible genes. If the overlap is more than 30 bp, they are rescored using the IMM and the gene with the best score is retained.
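The overlap rule just described is a small decision procedure. Here is a minimal sketch in which a gene model is reduced to a coordinate interval and a score—a simplification of what GLIMMERM actually stores, with the scores standing in for IMM rescoring:

    # GLIMMERM-style resolution of two overlapping gene models: overlaps of
    # at most 30 bp (the default) are tolerated; longer overlaps keep only
    # the higher-scoring model.

    def resolve_overlap(gene_a, gene_b, max_overlap=30):
        """Each gene is (start, end, score). Returns the surviving genes."""
        a_start, a_end, a_score = gene_a
        b_start, b_end, b_score = gene_b
        overlap = min(a_end, b_end) - max(a_start, b_start)
        if overlap <= max_overlap:        # small overlap: keep both
            return [gene_a, gene_b]
        return [gene_a] if a_score >= b_score else [gene_b]

    print(resolve_overlap((0, 900, 40.0), (880, 1700, 55.0)))  # keeps both
    print(resolve_overlap((0, 900, 40.0), (700, 1700, 55.0)))  # keeps second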
GLIMMERM was used for the genome of a malaria parasite (Plasmodium falciparum) and showed sensitivity and specificity rates for nucleotide-level recognition above 94% and 97%, respectively. GLIMMERM's accuracy of 93% on a plant genome, Arabidopsis thaliana, was comparable to the accuracies of 95% and 94% for GeneMark.hmm and GenScan, respectively.

GenScan
GenScan is a general-purpose gene identification program used to analyze genomic DNA sequences from a variety of organisms including human, other vertebrates, invertebrates, and plants. For each genomic sequence, the program determines the most likely gene structure under a probabilistic model of the gene structural and compositional properties of the genomic DNA for the given organism. The probabilistic model used by GenScan accounts for many of the essential structural properties of genomic sequences—typical gene density, typical number of exons per gene, and the distribution of exon sizes for different types of exons—and also for many of the important compositional properties of genes, such as the reading-frame-specific hexamer composition of coding regions versus the reading-frame-independent hexamer composition of introns and intergenic regions, and the position-specific composition of the translation initiation and termination signals, TATA box, cap site, and polyadenylation signals. Importantly, novel models of the donor and acceptor splice sites are used, which capture potentially important dependencies between positions in these signals. For human and other vertebrate sequences, separate sets of model parameters are used, which account for the many differences in gene density and structure observed in genomic regions of distinct nucleotide composition (G+C%). GenScan has an additional feature that draws a representation of the resulting prediction, showing all putative exons in their respective positions on both strands, whether they are leading, internal, or terminal, and a simplified scoring scheme. GenScan uses a homogeneous 5th-order Markov model of non-coding regions and three periodic 5th-order Markov models of coding regions. The parameters are typically estimated by the maximum likelihood method, that is, by using the observed conditional frequencies obtained from an appropriate training set of known genes to estimate the corresponding conditional probabilities. Nucleotides are generated according to probabilistic rules derived from an underlying hidden Markov process, and the model is parameterized for G+C content. The training set of exons and introns is divided into four categories depending on the G+C content of the sequence: I (< 43% G+C), II (43-51% G+C), III (51-57% G+C), and IV (> 57% G+C). For each of these categories, separate initial state probabilities are computed by estimating the relative frequencies of the various functional units in that category. Unlike programs that analyze one strand at a time and assume the input sequence contains a single complete gene, GenScan uses double-stranded models to allow for the occurrence of multiple genes on either or both DNA strands.
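As an aside, the G+C parameterization above amounts to a simple lookup. A minimal sketch, using the four category boundaries given in the text (how the boundary values 51% and 57% are rounded is our assumption):

    # Choose a GenScan-style parameter category from a sequence's G+C
    # content, using the four ranges given in the text.

    def gc_category(seq):
        seq = seq.upper()
        gc = 100.0 * sum(base in "GC" for base in seq) / len(seq)
        if gc < 43:
            return "I", gc
        elif gc <= 51:
            return "II", gc
        elif gc <= 57:
            return "III", gc
        return "IV", gc

    print(gc_category("ATATGCGCAT"))   # ('I', 40.0)
    print(gc_category("GCGCGCATAT"))   # ('IV', 60.0)

Beyond this parameter choice, the heart of GenScan is the probabilistic model itself.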
The essential idea is that a precise probabilistic model of what a gene/genomic sequence looks like is specified in advance; then, given a sequence, one determines which of the vast number of possible gene structures (involving any valid combination of states and lengths) has the highest likelihood given the sequence. GenScan cannot handle overlapping transcription units and does not address alternative splicing. The program was designed primarily to predict genes in human/vertebrate genomic sequences, so its accuracy may be lower for other organisms. However, the vertebrate version of the program performed fairly well on sequences of an invertebrate (Drosophila melanogaster), with a per-exon accuracy of 68%. The maize and Arabidopsis versions (both plants) also performed fairly well on their respective organisms, with per-exon accuracies of 78% and 67%, respectively. GenScan differs from the majority of existing gene-finding algorithms in that it allows partial genes as well as complete genes, and the occurrence of multiple genes in a single sequence, on either or both DNA strands. For prokaryotic or yeast sequences, the programs GLIMMER and/or GeneMark are better choices than GenScan.

Check your progress:
1. Mention the 2 programs available for the GLIMMER system.
Notes: ii) Write your answer in the space given below. jj) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

18.1.5 Fungal gene predictions by GLIMMER, GLIMMERM, and GenScan
The gene-prediction methods were chosen based on their ability to identify small genes. GLIMMER has been widely used for prokaryotes and is highly accurate for their gene detection; the program allows models to be built on any dataset. GLIMMERM, a modified version of GLIMMER, was used because it was developed mainly for eukaryotes with small genome sizes and can identify short genes in gene-dense genomes. GenScan was chosen for comparison purposes, since it is a widely used prediction program.
• GLIMMER was trained using cDNA datasets of N. crassa, S. cerevisiae, and S. pombe obtained from NCBI. The trained model was then used to extract putative genes from the N. crassa genomic sequences. The programs get-putative, extract, build-icm, and glimmer2 are all part of the GLIMMER package.
USAGE: build-icm < tmp.train > tmp.model
This builds the model using the training dataset in FASTA format (tmp.train) and stores it in tmp.model.
USAGE: glimmer2 Sequence tmp.model -g n | get-putative > g2.coord
Using the trained model (tmp.model) and a genomic sequence (Sequence), glimmer2 predicts all possible gene locations. The most likely gene coordinates are extracted by the program get-putative, included in the GLIMMER package, and stored in g2.coord. The -g option denotes the minimum gene length; 30 bp is used in this study.
USAGE: extract Sequence g2.coord > Nucleotide_Output
Using the stored coordinates (g2.coord), the ORFs are extracted from the genomic sequence (Sequence).
• GLIMMERM was run using pre-trained models of one filamentous fungus species (Aspergillus fumigatus) and two plant species (Arabidopsis thaliana and Oryza sativa) available from the GLIMMERM software package.
USAGE: glimmerm Sequence -d <directory of trained model> > Output
• GenScan was run using pre-trained models of human and two plant species (A. thaliana and maize) available from the downloaded GenScan software package.
USAGE: genscan Parameter_file_of_organism Sequence -cds > Output
The program takes a parameter file of the trained model and a genomic sequence file (Sequence), and outputs the predicted ORFs. The -cds option prints the predicted nucleotide sequences.

18.2 Let us sum up
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

18.3 Lesson end activities
Step 1. Getting the DNA sequence
For the practical we have chosen a human genomic sequence that may perhaps encode an adenylate kinase...
· Open a new navigator window and load this page into it.
· Click here to get the ~12 kb sequence (press the middle mouse button to open a new browser window).
Step 2. Looking for a promoter candidate
Promoter prediction is heavily dependent on finding good matches to the TATAAAA motif. Further clues may be provided by the CCAAT box, CpG islands, and other transcription factor binding sites.
· Open a new navigator window and load this page into it.
· Load the LBL Promoter query submission page.
· It is worth familiarising yourself with the layout; note the on-line help.
· Cut and paste the query sequence into the sequence box and submit the job.
· When the result arrives, look at the set of predicted promoters.
o Can you see matches to the TATA-box consensus tATA(A/T)A(A/T)?
o Which promoter has the silliest TATA box?
· Open a new navigator window and load this page into it.
· Load the TSSG query submission page.
· Toggle on the TSSG button.
· Don't use the sequence pastebox (it can't strip numbers out of the sequence).
· Type (or paste) /home/seqanal/public_html/courses/spring99/seq.fasta in the load box and then perform the search.
· When the result arrives, look at the set of predicted promoters.
o How many TATA boxes were found?
o Are the listed transcription factor binding sites informative?
o How well do the two searches agree on candidate promoters?
o How many candidates do they both find?
o Is there a single best candidate from combining the searches?
Step 3. Poly(A) site prediction
In mammalian genes, polyadenylation sites are usually preceded by AATAAA or ATTAAA ~20 bases before the cleavage site and followed by a more weakly conserved GT-based motif. While these motifs are trivial to find, they only function in the right context—which is harder to define and includes regulation by upstream splicing factors. An important rule to remember is that there must not be an in-frame stop codon in an internal exon, i.e., the true translation termination will be in the last exon. (Violations of this rule suppress mRNA production, to the cost of many experimentalists, and are occasionally used for differential mRNA regulation, e.g. for certain Ig splice variants.)
· (As needed, open a new navigator window and load this page into it.)
· Load the POLYAH query submission page.
· Toggle on the POLYAH button.
· Look at the POLYAH help and note the quoted prediction accuracy.
· Don't use the sequence pastebox (it can't strip numbers out of the sequence).
· Type (or paste) /home/seqanal/public_html/courses/spring99/seq.fasta in the load box and then perform the search.
· When the result arrives, look at the predicted poly(A) sites.
o How many candidate sites were found?
o If one or more of these sites are false, is the prediction accuracy as good as claimed?
o How might overprediction of poly(A) sites be avoided?
Step 4. Predicting splice sites and coding exons
There are a number of servers that separately predict splice sites and coding-sequence bias, but this information needs to be analysed together. We found that the CBS site in Denmark could provide all the information, though from two different servers. The NetGene2 server provides a graphical postscript output that we can print out and mark our predictions on. From the same group, the HMMgene server (using different algorithms) provides list output including potential start and stop codons. Both servers overpredict splice site candidates. In case you need reminding, classical splice sites look something like:
o Donor consensus: (c/a)AG^GT(a/g)agt
o Acceptor consensus: (T>C)nN(C>T)AG^gt
· (As needed, open a new navigator window and load this page into it.)
· Load the NetGene2 query submission page.
· Paste in the sequence and submit the job, which takes a few minutes to run.
· The output provides a list of candidate splice sites (on both strands) and a graphical coding/splicing prediction. However, it is not clear which translation frame is supposed to be coding. It is worth printing this figure out and using it to summarise our prediction attempts!
o Click on "Direct strand" and save the compressed postscript output (it has a .Z suffix).
o Open a UNIX X-window (terminal from the desktop).
o Uncompress the file by typing the UNIX command gunzip filename.ps.Z
o Print the file to the printer outside room V111 by typing lpr -Plw-v111 filename.ps
· Now load the HMMgene query submission page.
· Paste /home/seqanal/public_html/courses/spring99/seq.fasta in the local file box.
· Select the 5 best predictions and toggle on "predict signals".
· Submit the job.
· Click on the Explanation link to understand the output format.
· We can now begin to assemble a complete gene prediction.
Step 5. Combining the server outputs into an overall prediction
We now have predictions for all the components needed to assemble the gene, rather inconveniently spread over many separate web outputs. We have to assemble all this manually into one prediction. This can be done on the NetGene2 and DNA sequence outputs using a biro and a fluorescent marker. The following guidelines may help.
· Start from a strong point, such as a well-predicted internal coding exon with good splice borders.
· Work forwards and backwards toward the promoter and poly(A) boundary signals.
· Reported splice site quality is not a completely robust guide to usage.
o Context dependence is also important.
o Splice sites tend to be overpredicted.
o Some (true) splice sites might be better predicted by the HMMgene algorithm than by NetGene2...
· The terminal exon should be partially coding, including the stop codon and the poly(A) signature.
· The initiation codon should obey the Kozak rules:
§ It is normally the first methionine from the 5' end of the mRNA.
§ At least one of the key residues (the purine at -3 or the G at +4) should be present in the consensus A(Pu)XXAUGG.
· Once the prediction is completed, we can check it in the next exercise.
· Good luck!

Step 6. Gene prediction by homology using GeneWise
Usually nowadays, related sequences are already present in the databases. When available, these may be the fastest way to get a good gene prediction. Often this prediction will be more reliable than the coding bias predictions, though one should be aware of the possibility of sequence error, differential splicing, etc., and of course finding the coding exons is not a complete gene prediction. The GeneWise program has an exhaustive (slow) algorithm to align a protein to a DNA sequence, allowing for splice site recognition. (In a real situation, BLAST programs would be useful for first picking up the matches in a DB search.)
· Open an X-window (or terminal on Tau's desktop).
· Type prepare wise2 in the window.
· We've prepared files with the human DNA and a homologous chicken protein to compare.
· Now you can type or cut and paste the following command into the UNIX window:
genewise /home/seqanal/public_html/courses/spring99/kad1_chick /home/seqanal/public_html/courses/spring99/hsak1.dna
· The program will run with default parameters and after a couple of minutes will print out the matched exons.
· Now compare the results to the predictions so far.
o How many exons are found?
o Are the splice sites between or within codons?
o Did you find all these coding regions earlier?
o Have we now found all the coding exons (the chicken homologue has 194 AA)?

18.4 Check your progress: Model answers
Your answer must include these points:
· build-imm and glimmer (or glimmer2)
· build-imm takes an input set of sequences
· glimmer uses this IMM to identify genes in a genomic sequence

18.5 Points for Discussion
1. Do a comparative study of the various gene prediction programs.

18.6 References:
1. Benfey, P.; Protopapas, A.D. (2004). Essentials of Genomics. Prentice Hall.
2. Brown, Terence A. (2002). Genomes 2. Oxford: Bios Scientific Publishers. ISBN 9781859960295.
3. Gibson, Greg; Muse, Spencer V. (2004). A Primer of Genome Science, Second Edition. Sunderland, Mass.: Sinauer Assoc. ISBN 0-87893-234-8.
4. Gregory, T. Ryan (ed.) (2005). The Evolution of the Genome. Elsevier. ISBN 0-12-301463-8.
5. Reece, Richard J. (2004). Analysis of Genes and Genomes. Chichester: John Wiley & Sons. ISBN 0-470-84379-9.
6. Saccone, Cecilia; Pesole, Graziano (2003). Handbook of Comparative Genomics. Chichester: John Wiley & Sons. ISBN 0-471-39128-X.
7. Werner, E. (2003). "In silico multicellular systems biology and minimal genomes". Drug Discov Today 8 (24): 1121-1127. PMID 14678738.
8. Witzany, G. (2006). "Natural Genome Editing Competences of Viruses". Acta Biotheoretica 54 (4): 235-253. PMID 17347785.
UNIT V
LESSON – 19 SECONDARY STRUCTURE PREDICTION
19.0 Aims and Objectives
19.1 Secondary structure prediction
19.1.1 Prediction of protein secondary structure from the amino acid sequence
19.1.2 Accuracy of secondary structure prediction
19.1.3 Methods for secondary structure prediction
19.2 Let us Sum up
19.3 Lesson end activities
19.4 Check your progress
19.5 Points for Discussion
19.6 References

19.0 Aims and Objectives:
This unit discusses secondary structure prediction, prediction of protein secondary structure from the amino acid sequence, the accuracy of secondary structure prediction, and methods for secondary structure prediction.

19.1 Secondary Structure Prediction
Secondary structure prediction is a set of techniques in Bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure - amino acid or nucleotide sequence, respectively. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins, or canonical microRNA structures in RNA.

The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions in fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA. The problems of predicting RNA secondary structure are broadly related but depend mainly on base pairing and base stacking interactions; many RNA molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of RNA molecules, such as transfer RNA or microRNA, is observed. Many RNA secondary structure prediction methods rely on variations of dynamic programming and therefore are unable to efficiently identify pseudoknots.

19.1.1 Prediction of Protein Secondary Structure from the Amino Acid Sequence
Accurate prediction of where alpha helices, beta strands, and other secondary structures will form along the amino acid chain of proteins is one of the greatest challenges in sequence analysis. At present, it is not possible to predict these features with very high reliability. As methods have improved, prediction has reached an average accuracy of 64–75%, with a higher accuracy for alpha helices, depending on the method used. These predictive methods can be made especially useful when combined with other types of analyses discussed in this lesson.
For example, a search of a sequence database or a protein motif database for matches to a candidate sequence may discover a family or superfamily relationship with a protein of known structure. If significant matches are found in regions of known secondary or three-dimensional structure, the candidate protein may share the three-dimensional structural features of the matched protein. Several Web sites provide such an enhanced analysis of secondary structure or other secondary structure analyses of a query protein. The main methods of analysis used at these sites are described below.

Methods of structure prediction from amino acid sequence begin with an analysis of a database of known structures. These databases are examined for possible relationships between sequence and structure. When secondary structure predictions were first being made in the 1970s and 1980s, only a few dozen structures were available. This situation has now changed, with present databases including approximately 500 independent structural folds. The combination of more structural and sequence information presents a new challenge to investigators who wish to develop more powerful predictive methods.

The ability to predict secondary structure also depends on identifying types of secondary structural elements in known structures and determining the location and extent of these elements. The main types of secondary structures that are examined for sequence variation are alpha helices and beta strands. Early efforts focused on more types of structures, including other types of helices, turns, and coils. To simplify secondary structure prediction, these additional structures that are not an alpha helix or beta strand were subsequently classified as coils. Assignment of secondary structure to particular amino acids is sometimes included in the PDB file by the investigator who has solved the three-dimensional structure. In other cases, secondary structure must be assigned to amino acids by examination of the structural coordinates of the atoms in the PDB file. Methods for comparing three-dimensional structures, described above, frequently assign these features automatically, but not always
in the same manner. Hence, some variation is possible, and deciding which is the best method can be difficult. The DSSP database of secondary structures and solvent accessibilities is a useful and widely used resource for this purpose (Kabsch and Sander 1983; http://www.sander.ebi.ac.uk/dssp/). This database, which is based on recognition of hydrogen-bonding patterns in known structures, distinguishes eight secondary structural classes that can be grouped into alpha helices, beta strands, and coils (Rost and Sander 1993). A more recently described automatic method makes predictions in accord with published assignments (Frishman and Argos 1995).

The assumption on which all the secondary structure prediction methods are based is that there should be a correlation between amino acid sequence and secondary structure. The usual assumption is that a given short stretch of sequence may be more likely to form one kind of secondary structure than another. Thus, many methods examine a sequence window of 13–17 residues and assume that the central amino acid in the window will adopt a conformation that is determined by the side groups of all the amino acids in the window. This window size is within the range of lengths of alpha helices (5–40 residues) and beta strands (5–10 residues).

There is evidence that more distant interactions within the primary amino acid chain may influence local secondary structure. The same amino acid sequence up to 5 (Kabsch and Sander 1984) and 8 (Sudarsanam 1998) residues in length can be found in different secondary structures. An 11-residue-long amino acid "chameleon" sequence has been found to form an alpha helix when inserted into one part of a primary protein sequence and a beta sheet when inserted into another part of the sequence (Minor and Kim 1996). More distant interactions may account for the observation that beta strands are predicted more poorly by analysis of local regions (Garnier et al. 1996). However, methods that use amino acids more distant than those in the small window of sequence all perform less well at predicting the secondary structure of an amino acid residue.
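All of the window-based schemes described above start by cutting the sequence into overlapping windows centered on each residue. A minimal Python sketch of this common first step follows; the window size, padding character, and function name are illustrative assumptions, not taken from any particular program.

def sequence_windows(seq, size=17, pad='X'):
    """Yield (center_index, window) pairs for a sliding window.

    The ends of the sequence are padded so that every residue,
    including the first and last, gets a full-length window
    centered on it.
    """
    half = size // 2
    padded = pad * half + seq + pad * half
    for i in range(len(seq)):
        yield i, padded[i:i + size]

for center, window in sequence_windows("MKVLAANGG", size=5):
    print(center, window)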
The number of possible amino acid combinations in a sequence window of 17 amino acids is very large (20^17, or about 10^22). If many combinations influence one type of secondary structure, examination of a large number of protein structures is required to discover the significant patterns and correlations within this window. Earlier methods for predicting secondary structure assumed that each amino acid within the sequence window of 13–17 residues influences the local secondary structure independently of other nearby amino acids; i.e., there is no interaction between amino acids in influencing local secondary structure. Later methods assumed that interactions between amino acids within the window could play a role. Neural network models described below have the ability to detect interactions between amino acids in a sequence window, including conditional interactions. A hypothetical example of the interactions that might be discovered illustrates the possibilities. If the central amino acid in the sequence window is Leu and if the second upstream amino acid toward the amino terminus is Asn, the Leu is in an alpha helix; however, if the neighboring amino acid is not Asn, the Leu is in a beta strand. In another method of secondary structure prediction, the nearest-neighbor method, sequence windows in known structures that are most like the query sequence are identified. This method bypasses the need to discover complex amino acid patterns associated with secondary structure. Protein secondary structure has also been modeled by hidden Markov models, also described as discrete state-space models, which are described below (Stultz et al. 1993; White et al. 1994).

19.1.2 Accuracy of Secondary Structure Prediction
One method of assessing the accuracy of secondary structure prediction is to give the percentage of correctly predicted residues in sequences of known structure, called Q3. This measure, however, is not very effective by itself, because even a random assignment of structure can achieve a high score by this test (Holley and Karplus 1991). Another measure is to report the fraction of each type of predicted structure that is correct. A third method is to calculate a correlation coefficient for each type of predicted secondary structure (Matthews 1975). The coefficient indicating success in predicting residues in the α-helical configuration, Cα, is given by

Cα = (pα nα − oα uα) / √[(pα + oα)(pα + uα)(nα + oα)(nα + uα)]

where pα is the number of correct positive predictions, nα is the number of correct negative predictions, oα is the number of overpredicted positive predictions (false positives), and uα is the number of underpredicted residues (misses). The closer this coefficient is to a value of 1, the more successful the method at predicting a helical residue. An overall level of prediction accuracy does not provide information on the accuracy of the number of predicted secondary structures, or of their lengths and locations in the sequence. One simple index of success is to compare the average of the predicted lengths with the known average (Rost and Sander 1993). Another factor to consider in prediction accuracy is that some protein structures are more readily predictable than others, such that the spectrum of test proteins chosen will influence the frequency of success. A representative set of proteins that have limited similarity will provide the most objective test.
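As a worked illustration of the coefficient just defined, a minimal Python sketch follows; the counts in the example are invented for illustration.

import math

def correlation_coefficient(p, n, o, u):
    """Matthews (1975) correlation coefficient for one structural state.

    p = correct positive predictions, n = correct negative predictions,
    o = overpredictions (false positives), u = underpredictions (misses).
    Returns a value near 1 for good predictions and near 0 for random ones.
    """
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return (p * n - o * u) / denominator if denominator else 0.0

# Invented counts for helix prediction on a small test protein.
print(correlation_coefficient(p=40, n=120, o=10, u=8))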
Rost and Sander (1993) have chosen a set of 126 globular and 4 membrane proteins that have less than 25% pair-wise similarity and have used this set for training and testing neural network models. A newer set of 540 structurally distinct fold types in the FSSP database provides an even larger set of training and test structures of unique structure and sequence (Holm and Sander 1998). In the often-used jackknife test, one protein in a set of known structure is left out of a calibration or training step of the program being tested. The rest of the proteins are used to predict the structure of the left-out one, and the procedure is cycled through all of the sequences. The overall frequency of success in predicting the secondary structural features of the left-out sequence is used as an indicator of success. An even more comprehensive approach to the problem of accuracy is to examine the predictions for different structural classes of proteins. Because some classes are much more difficult to predict, the overall success rate with respect to protein class is an important index of success. Prediction accuracy is discussed further below.

A valuable addition to secondary structure prediction is giving the degree of reliability of the prediction at each position. Some prediction methods produce a score for each of the three types of structures (helix, strand, coil or loop) at each residue position. If one of these scores is much higher than the other two, the score is considered to be more reliable, and a high reliability index may be assigned that reflects high confidence in the prediction. If the scores are more similar, the index is lower. By examining predictions for known structures, as in a jackknife experiment, the accuracy of these reliability indices may be determined. What has been found is that a prediction with a high index score is much more accurate (Yi and Lander 1993; and see the PHD server below), thus increasing confidence in the prediction of these residues.

19.1.3 Methods for Secondary Structure Prediction
Three widely used methods of protein secondary structure prediction are (1) the Chou-Fasman and GOR methods, (2) neural network models, and (3) nearest-neighbor methods. An additional method that models structural families by hidden Markov models is then described. These methods can be further enhanced by examining the distribution of hydrophobic, charged, and polar amino acids in protein sequences.

Check your progress:
1. List the widely used secondary structure prediction methods.
Notes:
kk) Write your answer in the space given below.
ll) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

19.2 Let us Sum up
Secondary structure prediction is a set of techniques in Bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure - amino acid or nucleotide sequence. Accurate prediction of where alpha helices, beta strands, and other secondary structures will form along the amino acid chain of proteins is one of the greatest challenges in sequence analysis.
19.3 Lesson end activities
(i) Download any protein sequence from PDB.
(ii) Do secondary structure prediction using all the methods.

19.4 Check your progress: Model answers
1. Your answer must include these points: (1) the Chou-Fasman and GOR methods, (2) neural network models, and (3) nearest-neighbor methods.

19.5 Points for Discussion
"Secondary structure prediction of proteins requires a thorough knowledge of primary structure" – Substantiate.

19.6 References
1. C. Branden and J. Tooze (1999). Introduction to Protein Structure, 2nd ed. Garland Publishing: New York, NY.
2. W. Kabsch and C. Sander. Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen Bonded and Geometrical Features. Biopolymers 22: 2577-2637 (1983). PMID 6667333.
3. M. Zuker. "Computer prediction of RNA structure", Methods in Enzymology, 180: 262-88 (1989). (The classic paper on dynamic programming algorithms to predict RNA secondary structure.)
4. L. Pauling and R.B. Corey. Configurations of polypeptide chains with favored orientations of the polypeptide around single bonds: Two pleated sheets. Proc. Natl. Acad. Sci. Wash., 37: 729-740 (1951). (The original beta-sheet conformation article.)
5. L. Pauling, R.B. Corey and H.R. Branson. Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. Wash., 37: 205-211 (1951). (The alpha- and pi-helix conformations; the authors predicted that 3₁₀ helices would not be possible.)

LESSON – 20 METHODS FOR SECONDARY STRUCTURE PREDICTION
20.0 Aims and Objectives
20.1 Chou-Fasman/GOR method
20.2 Patterns of hydrophobic amino acids can aid structure prediction
20.3 Secondary structure prediction by neural network models
20.4 Let us Sum up
20.5 Lesson end activities
20.6 Check your progress
20.7 Points for Discussion
20.8 References

20.0 Aims and Objectives
This unit discusses the secondary structure prediction methods, namely the Chou-Fasman/GOR method, how patterns of hydrophobic amino acids can aid structure prediction, and secondary structure prediction by neural network models.

20.1 Chou-Fasman/GOR Method
The Chou-Fasman method (Chou and Fasman 1978) was based on analyzing the frequency of each of the 20 amino acids in alpha helices, beta sheets, and turns of the then-known, relatively small number of protein structures. It was found, for example, that amino acids Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of alpha helices, but that Pro (P) and Gly (G) are predictors of a break in a helix. A table of predictive values for each type of secondary structure was made for each of the alpha helices, beta strands, and turns. To produce these values, the frequency of amino acid i in structure s is divided by the frequency of all residues in structure s. The resulting three structural parameters (Pα, Pβ, and Pt) vary roughly from 0.5 to 1.5 for the 20 amino acids. To predict a secondary structure, the following set of rules is used. The sequence is first scanned to find a short stretch of amino acids that has a high probability of starting a nucleation event that could form one type of structure.
For alpha helices, a prediction is made when four of six amino acids have a high probability (≥1.03) of being in an alpha helix. For beta strands, the presence in a sequence of three of five amino acids with a probability of ≥1.00 of being in a beta strand predicts a nucleation event for a beta strand. These nucleated regions are extended along the sequence in each direction until the prediction values for four amino acids drop below 1. If both α-helical and β-strand regions are predicted, the higher-probability prediction is used.

Turns are predicted somewhat differently. Turns are modeled as a tetrapeptide, and two probabilities are calculated. First, the average of the probabilities for each of the four amino acids being in a turn is calculated, as for alpha helix and beta strand predictions. Second, the probabilities of amino acid combinations being present at each position in the turn tetrapeptide (i.e., the probability that a particular amino acid such as Pro is at position 1, 2, 3, or 4 in the tetrapeptide) are determined. These probabilities for the four amino acids in the candidate sequence are multiplied to calculate the probability that the particular tetrapeptide is a turn. A turn is predicted when the first probability value is greater than the probabilities for an alpha helix and a beta strand in the region and when the second probability value is greater than 7.5 × 10⁻⁵. In practice, the Chou-Fasman method is only about 50–60% accurate in predicting secondary structural domains.

Garnier et al. (1978) developed a somewhat more involved method for protein secondary structure prediction that is based on a more sophisticated analysis. The method is called the GOR (Garnier, Osguthorpe, and Robson) method. Whereas the Chou-Fasman method is based on the assumption that each amino acid individually influences secondary structure within a window of sequence, the GOR method is based on the assumption that amino acids flanking the central amino acid residue influence the secondary structure that the central residue is likely to adopt. In addition, the GOR method uses principles of information theory to derive predictions (Garnier et al. 1996). As in the Chou-Fasman method, known secondary structures are scanned for the occurrence of amino acids in each type of structure. However, the frequency of each type of amino acid at the next 8 amino-terminal and carboxy-terminal positions is also determined, making the total number of positions examined equal to 17, including the central one. In the original GOR method, four scoring matrices, containing in each column the probability of finding each amino acid at one of the 17 positions, are prepared. One matrix corresponds to the central (ninth) amino acid being found in an alpha helix, the second to the amino acid being in a beta strand, the third a coil, and the fourth a turn. Later versions omitted the turn calculation because turns were the most variable features and were consequently the most difficult to predict. A candidate sequence is analyzed by each of the three to four matrices using a sliding window of 17 residues. Each matrix is positioned along the candidate sequence, and the matrix giving the highest score predicts the structural state of the central amino acid. At least 4 residues in a row have to be predicted as an alpha helix and 2 in a row as a beta strand for a prediction to be validated.
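Before turning to the GOR calculation in more detail, the Chou-Fasman helix-nucleation rule described above can be sketched in Python. The propensity table below covers only a few residues (a real implementation needs all 20), and the values, default for unlisted residues, and function name are given for illustration only.

# Illustrative subset of Chou-Fasman helix propensities P(alpha).
P_ALPHA = {'A': 1.42, 'E': 1.51, 'L': 1.21, 'M': 1.45,
           'K': 1.16, 'S': 0.77, 'P': 0.57, 'G': 0.57}

def helix_nucleation_sites(seq, threshold=1.03):
    """Start positions where four of six residues exceed the threshold,
    the Chou-Fasman condition for nucleating an alpha helix."""
    sites = []
    for i in range(len(seq) - 5):
        strong = sum(1 for aa in seq[i:i + 6]
                     if P_ALPHA.get(aa, 1.0) >= threshold)
        if strong >= 4:
            sites.append(i)
    return sites

# Nucleation predicted at positions 0-2 and 9 of this toy sequence.
print(helix_nucleation_sites("AELMAKGPSGSAEML"))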
Matrix values are calculated in somewhat the same manner as amino acid substitution matrices, in that matrix values are calculated as log odds units representing units of information. The information available as to the joint occurrence of secondary structural conformation S and amino acid a is given by (Garnier et al. 1996)

I(S; a) = log [ P(S | a) / P(S) ]

where P(S | a) is the conditional probability of conformation S given residue a, and P(S) is the probability of conformation S. By Bayes' rule, the probability of conformation S given amino acid a is

P(S | a) = P(S, a) / P(a)

where P(S, a) is the joint probability of S and a, and P(a) is the probability of a. These probabilities can be estimated from the frequency of each amino acid found in each structure and the frequency of each amino acid in the structural database. Given these frequencies,

I(S; a) = log ( fS,a / fS )

where fS,a is the fraction of occurrences of amino acid a found in conformation S and fS is the fraction of all amino acid residues found to be in conformation S. The GOR method maximizes the information available in the values of fS,a and avoids data size and sampling variations by calculating the information difference between the competing hypotheses that residue a is in structure S, I(S; a), or that a is in a different conformation (not S), I(not S; a). This difference, I(ΔS; a) = I(S; a) − I(not S; a), is derived from the observed amino acid data as

I(ΔS; a) = log [ fS,a / (1 − fS,a) ] + log [ (1 − fS) / fS ]

where the frequency of finding amino acid a not in conformation S is 1 − fS,a and the frequency of not finding any amino acid in conformation S is 1 − fS. These values are used to calculate the information difference I(ΔSm; a1, ..., ax) for a series of x consecutive positions flanking sequence position m, from which the ratio of the joint probability of conformation Sm given a1, ..., ax to the joint probability of any other conformation may be calculated.

Searching for all possible patterns in the structural database would require an enormous number of proteins. Hence, three simplifying approaches have been taken. First, it was assumed in earlier versions of GOR that there is no correlation between amino acids in any of the 17 positions (both the flanking 8 positions on each side and the central amino acid position), i.e., that each amino acid position has a separate and independent influence on the structural conformation of the central amino acid. The steps are then: (1) values of I(ΔS; a) are calculated for each of the 17 positions; (2) these values are summed to approximate the value of I(ΔSm; a1, ..., ax); and (3) the probability ratios are calculated. The second assumption, used in later versions of GOR, was that certain pair-wise combinations of an amino acid in the flanking region and the central amino acid influence the conformation of the central amino acid. This model requires a determination of the frequency of amino acid pairs between each of the 16 flanking positions and the central one, both when the central residue is in conformation S and when it is not. Finally, in the most recent version of GOR, the assumption is made that certain pair-wise combinations of amino acids in the flanking region, or of a flanking amino acid and the central one, influence the conformation of the central one. Thus, there are 17 × 16 / 2 = 136 possible pairs to use for frequency measurements and to examine for correlation with the conformation of the central residue.
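A minimal Python sketch of the single-position information difference defined above; the frequencies in the example are invented for illustration.

import math

def information_difference(f_sa, f_s):
    """I(dS; a) = log[f_sa / (1 - f_sa)] + log[(1 - f_s) / f_s].

    f_sa is the fraction of occurrences of residue a that fall in
    conformation S; f_s is the fraction of all residues in S.
    Positive values favor conformation S for the central residue.
    """
    return math.log(f_sa / (1.0 - f_sa)) + math.log((1.0 - f_s) / f_s)

# Invented frequencies: half of all Glu residues sit in helices,
# while helices hold only 30% of residues overall.
print(information_difference(f_sa=0.5, f_s=0.3))

Summing such terms over the 17 window positions gives the additive approximation used in the earliest GOR version described above.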
With the advent of a large number of protein structures, it has become possible to assess the frequencies of amino acid combinations and to use this information for secondary structure predictions. The GOR method predicts 64% of the residue conformations in known structures but quite drastically (by 36.5%) underpredicts the number of residues in beta strands. Use of the Chou-Fasman and GOR methods for predicting the secondary structure of the beta subunit of Salmonella typhimurium tryptophan synthase is illustrative. In this particular case, the positions of the secondary structures predicted by either of these methods are very similar to those in the solved crystal structure (Branden and Tooze 1991). However, tests of the accuracy of these methods using sequences of other proteins whose structures are known have shown that the Chou-Fasman method is only about 50–60% accurate in predicting structural domains. The methods are most useful in the hands of a knowledgeable structural biologist, and have been used most successfully in polypeptide design and in analysis of motifs for organelle transport (Branden and Tooze 1991). A useful approach is to analyze each of a series of aligned amino acid sequences and then to derive a consensus structural prediction.

20.2 Patterns of Hydrophobic Amino Acids Can Aid Structure Prediction
Prediction of secondary structure can be aided by examining the periodicity of amino acids with hydrophobic side chains in the protein chain. This type of analysis was discussed above in the prediction of transmembrane α-helical domains in proteins. Hydrophobicity tables that give hydrophobicity values for each amino acid are used to locate the most hydrophobic regions of the protein. As for secondary structure prediction, a sliding window is moved across the sequence and the average hydrophobicity value of the amino acids within the window is plotted. These methods use the chemical properties of amino acid side chains to predict whether these amino acids are located on the surface or buried within the core structure.

The location of hydrophobic amino acids within a predicted secondary structure can also be used to predict the location of the structure. One type of display of this distribution is the helical wheel or spiral display of the amino acids in an alpha helix. The use of this display was described above as a way to visualize the location of leucine residues on one face of the helix in a leucine zipper structure. There is also a tendency for hydrophobic residues located in alpha helices on the surface of protein structures to face the core of the protein and for polar and charged amino acids to face the aqueous environment on the outside of the alpha helix. This arrangement is also revealed by the helical wheel display. The contours in a hydrophobic moment plot show positions in the amino acid sequence where hydrophobic amino acids tend to segregate to opposite sides of a structure, plotted against various angles of rotation from one residue to the next along the protein chain. For alpha helices, the angle of rotation is 100 degrees, and for beta strands, 160 degrees. The analysis predicts, for example, an alpha helix at approximately sequence position 165 that has segregated hydrophobic amino acids on one helix face. Helix α5 runs from positions 160 to 168 in the crystal structure of this protein.
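The sliding-window hydrophobicity average described above is straightforward to sketch in Python. The hydropathy values below are a small subset of the Kyte-Doolittle table; the window size, function name, and example sequence are illustrative assumptions.

# Illustrative subset of Kyte-Doolittle hydropathy values.
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'A': 1.8,
      'G': -0.4, 'S': -0.8, 'D': -3.5, 'K': -3.9, 'R': -4.5}

def hydropathy_profile(seq, size=7):
    """Average hydropathy over a sliding window; sustained peaks
    suggest buried or membrane-associated segments."""
    return [sum(KD.get(aa, 0.0) for aa in seq[i:i + size]) / size
            for i in range(len(seq) - size + 1)]

for value in hydropathy_profile("KDSGAVILLFVAGSKDR"):
    print(round(value, 2))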
Check your progress:
1. Mention the main criteria of the Chou-Fasman method.
Notes:
mm) Write your answer in the space given below.
nn) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

20.3 Secondary Structure Prediction by Neural Network Models
The most sophisticated methods that have been devised to make secondary structure predictions for proteins use artificial intelligence, or so-called neural net algorithms. An earlier method of this type examined patterns that represent secondary structural features, like the Chou-Fasman method. However, this method went farther and tried to locate these patterns in a particular order that coincides with a known domain structure. Patterns typical of proteins (Cohen et al. 1983), turns in globular proteins (Cohen et al. 1986), or helices in helical proteins (Presnell et al. 1992) may be located and used to predict secondary structure with increased confidence. The program MACMATCH, which combines these methods with a neural network approach to predict the secondary structure of globular proteins on a Macintosh computer, has been described (Presnell et al. 1993).

In the neural network approach, computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in these structures. There are many examples of the use of this method to predict protein structures, which have been reviewed (Holley and Karplus 1991; Hirst and Sternberg 1992). The early methods are reported to be up to 63–64% accurate. These methods have been improved to a level of over 70% for globular proteins by the use of information from multiple sequence alignments (Rost and Sander 1993, 1994). Two Web sites that perform a neural network analysis for protein secondary structure prediction are PHD (Rost and Sander 1993; Rost 1996; http://www.embl-heidelberg.de/predictprotein/predictprotein.html) and NNPREDICT (Kneller et al. 1990; http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html). These neural network models are theoretically able to extract more information from sequences than the information theory method described above (Qian and Sejnowski 1988). Neural networks have also been used to model translational initiation sites and promoter sites in E. coli, splice junctions, and specific structural features in proteins, such as α-helical transmembrane domains. These applications are discussed elsewhere in this text.

Neural network models are meant to simulate the operation of the brain. The complex patterns of synaptic connections among a large number of neurons are presumed to underlie the functions of the brain. Some groups of neurons are involved in collecting data as environmental signals, others in processing data, and yet others in providing a response to the signals. Neural networks are an attempt to build a similar kind of learning machine where the input is a 13–17-amino-acid length of sequence and the output is the predicted secondary structure of the central amino acid residue. The object is to train the neural network to respond correctly to a set of such flanking sequence fragments when the secondary structural features of the centrally located amino acid are known.
The training is designed to achieve recognition of amino acid patterns associated with secondary structure. If the neural network has sufficient capacity for learning, these patterns may potentially include complex interactions among the flanking amino acids in determining secondary structures. However, two studies with neural networks described below have so far not found evidence for such interactions.

A sliding window of 13–17 amino acid residues is moved along a sequence. The sequence within each window is read and used as input to a neural network model previously trained to recognize the secondary structure most likely to be associated with that pattern. The model then predicts the secondary structural configuration of the central amino acid as alpha helix, beta strand, or other. Rules, or another trained network, are then applied to make the prediction for a series of residues reasonable. For example, at least 4 amino acids in a row should be predicted as being in an alpha helix if the prediction is to make structural sense.

The model comprises three layers of processing units: the input layer, the output layer, and the so-called hidden layer between them. Signals are sent from the input layer to the hidden layer and from the hidden layer to the output layer through junctions between the units. This configuration is referred to as a feed-forward multilayer network. The input layer of units reads the sequence, one unit per amino acid residue, and transmits information on the amino acid at that location. A small window of sequence is read at a time, and information is sent as signals through junctions to a number of sequential units in the hidden layer by all of the input units within the window. These signals are each individually modified by a weighting factor and then added together to give a total input signal into each hidden unit. Sometimes a bias is added to this sum to influence the response of the unit. The resulting signal is then transformed by the hidden unit into a number that is very close either to a 1 or to a 0 (or sometimes to a −1). A mathematical function known as a sigmoid trigger function, simulating the firing or nonfiring states of a neuron, is used for this transformation. Signals from the hidden units are then sent to three individual output units, each output unit representing one type of secondary structure (helix, strand, or other). Each signal is again weighted, the input signals are summed, and each of the three output units then converts the combined signal into a number that is approximately a 1 or a 0. An output signal that is close to 1 represents a prediction of the secondary structural feature represented by that output unit, and a signal near 0 means that the structure is not predicted.

When hidden layers are included, a neural network model is capable of detecting higher levels of interaction among amino acids that influence secondary structure. For example, particular combinations of amino acids may produce a particular type of secondary structure. To resolve these patterns, a sufficient number of hidden units is needed (Holley and Karplus 1991); the number used varies from 2 to a range of 10–40.
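A minimal Python sketch of the feed-forward computation just described: a window of inputs is weighted and summed into hidden units, passed through the sigmoid trigger function, and combined again into three output units. The one-number-per-residue encoding and the random (untrained) weights are assumptions made to keep the sketch short.

import math
import random

def sigmoid(x):
    # The trigger function that pushes a unit's output toward 0 or 1.
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(inputs, hidden_weights, output_weights):
    """One pass through the input -> hidden -> output layers.

    Real networks encode each residue as ~21 binary inputs; here
    each window position is a single number for brevity.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_weights]

random.seed(1)
window = [random.random() for _ in range(13)]  # toy encoded 13-residue window
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(13)] for _ in range(3)]
w_output = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
print(feed_forward(window, w_hidden, w_output))  # helix, strand, other scores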
An interesting side effect of adding more hidden units is that the neural network memorizes the training set but at the same time becomes less accurate with test sequences. This effect is revealed by using the trained network to predict the same structures used for training: the number correct increases by over 20% as the number of hidden units increases from 0 to 10. In contrast, the accuracy of prediction of test sequences not used for training decreased by 3% (Holley and Karplus 1991). Without hidden layers, the neural network model is known as a perceptron and has a more limited capacity to detect such combinations. In two studies, networks with no hidden units were as successful in predicting secondary structure as those with hidden units. In addition, the number of hidden units was increased to as many as 60 in one study (Qian and Sejnowski 1988) and 20 in another (Holley and Karplus 1991) without significantly changing the level of success. These observations imply that the influence of local sequence on secondary structure is the additive influence of individual residues and that there is no higher level of interaction among these residues. To detect such interactions, however, requires a training set large enough to provide a significant number of examples, and these conditions may not have been met. These same studies examined the effect of input window size and found that the maximum information for secondary structure prediction seems to be located within a window of 13–17 amino acids, as larger windows do not increase accuracy. However, small windows were less effective, suggesting that they carry insufficient information, and below a window size of 5, success at predicting beta strands decreased.

Training the neural network model is the process of adjusting the values of the weights used to modify the signals from the input layer to the hidden layer and from the hidden layer to the output layer. The object is to have these weights balance the input signals so that the model output correctly identifies the known secondary structure of the central amino acid in a sequence window of a protein of known structure. Because there may be thousands of connections between the various units in the network, a systematic method is needed to adjust these values. Initially, the weights are assigned a constant or random value (typical range −0.1 to 0.1). The sliding window is then positioned along one of the training sequences. The predicted output for a given sequence window is then compared to the known structure of the central amino acid residue. The model is adjusted to increase the chance of predicting the correct residue. The adjustment involves changing the weighting of propagated signals by a method called the back-propagation algorithm. This procedure is repeated for all windows in all of the training sequences. The better the model, the more predicted structures will be correct. Conversely, the worse the model, the more predictions will be incorrect. The object then becomes to minimize the number of incorrect predictions. The error E is expressed as the square of the total number of incorrect predictions by the output units. When the back-propagation algorithm is applied, the weights are adjusted by a small amount to decrease the errors. A window of a training sequence is used as input to the network, and the predicted and expected (known) structures of the central residue are compared.
A set of small corrections is then made to the weights to improve an incorrect prediction, or the weights are left relatively unchanged for a correct prediction. This procedure is repeated using another training sequence until the number of errors cannot be reduced further. A large number of training cycles, representing a slow training rate, is an important factor in training the network to produce the smallest number of incorrect predictions. Not all of the training sequences need be used: a random input of training patterns may be used, and sometimes these may be chosen from subsets of sequences that represent one type of secondary structure, to balance the training for each type of structure. The back-propagation algorithm examines the contribution of each connection in the network to the subsequent levels and adjusts the weight of this connection, if needed, to improve the predictions.

The PHDsec program in the PHD system, described above in the section on prediction of transmembrane-spanning proteins, is an example of a neural network program for protein secondary structure prediction (Rost and Sander 1993; Rost 1996). The Web address of this resource is http://www.embl-heidelberg.de/predictprotein/predictprotein.html. PHDsec uses a procedure similar to that used by PHD. A BLAST search of the input sequence is conducted to identify similar but not closely identical sequences, and a multiple alignment of the sequences is transformed into a sequence profile. This profile is then used as input to a neural network trained to recognize correlations between a window of 13 amino acids and the secondary structure of the central amino acid in the window. Program output includes a reliability index for each estimate on a scale of 1 (low reliability) to 9 (high reliability). These reliabilities are obtained as normalized scores derived from the output values of the three units in the output layer of the network. The highest output value is compared to the next lowest value and the difference is normalized to give the reliability index. These indices are a useful way to examine the predictions in closer detail.

20.4 Let us Sum up
The Chou-Fasman method (Chou and Fasman 1978) was based on analyzing the frequency of each of the 20 amino acids in alpha helices, beta sheets, and turns of the then-known, relatively small number of protein structures. This unit discussed the secondary structure prediction methods, namely the Chou-Fasman/GOR method, the use of patterns of hydrophobic amino acids to aid structure prediction, and secondary structure prediction by neural network models.

20.5 Lesson end activities
(i) What are the other methods used?
(ii) Find out the major differences between these methods.

20.6 Check your progress: Model answers
1. Your answer must include these points: based on analyzing the frequency of each of the 20 amino acids in alpha helices, beta sheets, and turns of the then-known relatively small number of protein structures.

20.7 Points for Discussion
Make a sensitivity analysis of the Chou-Fasman method of secondary structure prediction.

20.8 References
1. Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77 (2), p. 257-286, February 1989.
2.
Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999. ISBN 0-521-62971-3.
3. Lior Pachter and Bernd Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press, 2005. ISBN 0-521-85700-7.
4. Olivier Cappé, Eric Moulines, Tobias Rydén. Inference in Hidden Markov Models. Springer, 2005. ISBN 0-387-40264-0.
5. Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning Hidden Markov Model Structure for Information Extraction. AAAI 99 Workshop on Machine Learning for Information Extraction, 1999.

LESSON – 21 NEAREST-NEIGHBOR METHOD
21.0 Aims and Objectives
21.1 Nearest-neighbor Methods of Secondary Structure Prediction
21.2 Hidden Markov model
21.2.1 Architecture of a hidden Markov model
21.2.2 Probability of an observed sequence
21.2.3 Using hidden Markov models
21.2.4 History
21.3 Probabilistic Hidden Markov Model (Discrete-Space Model)
21.4 Prediction of Three-dimensional Protein Structure
21.5 Let us Sum up
21.6 Lesson end activities
21.7 Check your progress
21.8 Points for Discussion
21.9 References

21.0 Aims and Objectives:
This unit discusses nearest-neighbor methods of secondary structure prediction; hidden Markov models, including their architecture, the probability of an observed sequence, their uses, and their history; the probabilistic hidden Markov model (discrete-space model); and the prediction of three-dimensional protein structure.

21.1 Nearest-neighbor Methods of Secondary Structure Prediction
Like neural networks, nearest-neighbor methods are also a type of machine learning method. They predict the secondary structural conformation of an amino acid in the query sequence by identifying sequences of known structures that are similar to the query sequence (Levin et al. 1986; Salzberg and Cost 1992; Zhang et al. 1992; Yi and Lander 1993; Salamov and Solovyev 1995, 1997; Frishman and Argos 1996). A large list of short sequence fragments is made by sliding a window of length n (e.g., n = 16) along a set of approximately 100–400 training sequences of known structure but minimal sequence similarity to each other, and the secondary structure of the central amino acid in each window is recorded. A window of the same size is then selected from the query sequence and compared to each of the above sequence fragments, and the 50 best-matching fragments are identified. The frequencies of the known secondary structure of the middle amino acid in each of these matching fragments (fα, fβ, and fcoils) are then used to predict the secondary structure of the middle amino acid in the query window. As with other secondary structure prediction programs, the predicted secondary structure of a series of residues in the query sequence is subjected to a set of rules, or used as input to a neural network, to make a final prediction for each amino acid position. Although not implemented in most available programs, a true estimate of the probability of the above set of frequencies may be obtained by identifying sets of training sequences that give the same value of (fα fβ fcoils)^(1/2).
The frequencies of the secondary structures predicted by this group then give true estimates of pα, pβ, and pcoils for the targeted amino acid in the query sequence (Yi and Lander 1993). Predictions based on the highest probabilities have been shown to be the most accurate, with the top 28% of the predictions being 86% accurate and the top 43% being 81% accurate. In addition, this method of calculating probability provides more information than single-state predictions. Using this method, therefore, a substantial proportion of protein secondary structures can be predicted with high accuracy (Yi and Lander 1993, 1996).

The several nearest-neighbor programs that have been developed for secondary structure prediction differ largely in the method used to identify related sequences in the training set. Originally, an amino acid scoring matrix such as a BLOSUM scoring matrix was used (Zhang et al. 1992). Distances between sequences based on a statistical analysis of the training sequences have also been proposed (Salzberg and Cost 1992). Use of a scoring matrix (Bowie et al. 1991, 1996) based on a categorization of amino acids into local structural environments, discussed below, in conjunction with a standard amino acid scoring matrix increased the success of the predictions (Yi and Lander 1993; Salamov and Solovyev 1995, 1997). Yet further increases in success have been achieved by aligning the query sequence with the training sequences to obtain a set of nonintersecting alignments with windows of the query sequence (as described earlier), and by using a multiple sequence alignment as input, with amino-terminal and carboxy-terminal positions of alpha helices and beta strands and β turns treated as distinctive types of secondary structure (Salamov and Solovyev 1997). The program PREDATOR is based on an analysis of amino acid patterns in structures that form H-bond interactions between adjacent beta strands (β bridges) and between amino acids n and n + 4 in alpha helices (Frishman and Argos 1995, 1996). The H-bond pattern between parallel and antiparallel beta strands is different, and two types of antiparallel patterns have been recognized. By utilizing such information combined with substitutions found in sequence alignments, the prediction success of PREDATOR has been increased to 75% (Frishman and Argos 1997). NNSSP (Salamov and Solovyev 1997) and PREDATOR (Frishman and Argos 1997) are examples of such nearest-neighbor programs.

21.2 Hidden Markov model
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered the simplest dynamic Bayesian network.

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
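To make the distinction between hidden states and observations concrete, the total probability of an observation sequence can be computed with the forward algorithm introduced in 21.2.2 below. The following self-contained Python sketch uses the two-state weather model from the concrete example later in this section; the function and parameter names are illustrative, and the code is a sketch rather than part of any particular HMM package.

def forward(obs, states, start_p, trans_p, emit_p):
    """P(obs): the probability of the observations summed over all
    hidden state paths, computed stepwise (the forward algorithm)
    instead of by brute-force enumeration of paths."""
    # Probability of each state after the first observation.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] *
                    sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

p = forward(obs=('walk', 'shop', 'clean'),
            states=('Rainy', 'Sunny'),
            start_p={'Rainy': 0.6, 'Sunny': 0.4},
            trans_p={'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                     'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}},
            emit_p={'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}})
print(p)  # total probability of observing walk, shop, clean over three days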
Hidden Markov models are especially known for their applications in temporal pattern recognition such as speech, handwriting, and gesture recognition, musical score following, partial discharges, and Bioinformatics.

21.2.1 Architecture of a hidden Markov model
In the usual architecture diagram of an instantiated HMM (often called a trellis diagram), each oval shape represents a random variable that can adopt a number of values. The random variable x(t) is the hidden state at time t. The random variable y(t) is the observation at time t. The arrows in the diagram denote conditional dependencies. The value of the hidden variable x(t) (at time t) depends only on the value of the hidden variable x(t − 1) (at time t − 1). This is called the Markov property. Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t).

21.2.2 Probability of an observed sequence
The probability of observing a sequence Y = y(1), ..., y(L) of length L is given by

P(Y) = Σ_X P(Y | X) P(X)

where the sum runs over all possible hidden state sequences X = x(1), ..., x(L). Brute-force calculation of P(Y) is intractable for most real-life problems, as the number of possible hidden state sequences is typically extremely high. The calculation can, however, be sped up enormously using an algorithm called the forward algorithm.

21.2.3 Using hidden Markov models
There are three canonical problems associated with HMMs:
· Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence. This problem is solved by the forward-backward algorithm.
· Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.
· Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities; in other words, discover the parameters of the HMM given a dataset of sequences. This problem is solved by the Baum-Welch algorithm.

A concrete example
Assume you have a friend who lives far away and to whom you talk daily over the telephone about what he did that day. Your friend is interested in only three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. You have no definite information about the weather where your friend lives, but you know general trends. Based on what he tells you he did each day, you try to guess what the weather must have been like.

You believe that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but you cannot observe them directly; that is, they are hidden from you. On each day, there is a certain chance that your friend will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since your friend tells you about his activities, those are the observations. The entire system is that of a hidden Markov model (HMM). You know the general weather trends in the area, and what your friend likes to do on average. In other words, the parameters of the HMM are known.
You can write them down in the Python programming language:

states = ('Rainy', 'Sunny')

observations = ('walk', 'shop', 'clean')

start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}

In this piece of code, start_probability represents your belief about which state the HMM is in when your friend first calls you (all you know is that it tends to be rainy on average). The particular probability distribution used here is not the equilibrium one, which (given the transition probabilities) is approximately {'Rainy': 0.571, 'Sunny': 0.429}. The transition_probability represents the change of the weather in the underlying Markov chain. In this example, there is only a 30% chance that tomorrow will be sunny if today is rainy. The emission_probability represents how likely your friend is to perform a certain activity on each day. If it is rainy, there is a 50% chance that he is cleaning his apartment; if it is sunny, there is a 60% chance that he is outside for a walk.
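Given these parameters, the evaluation and decoding problems listed in 21.2.3 can be solved directly. Below is a minimal, illustrative sketch of the forward and Viterbi algorithms for this weather model; the function names are ours, and the code assumes the four dictionaries defined above.

def forward(obs, states, start_p, trans_p, emit_p):
    # Total probability of the observations, summed over all state paths.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Probability and identity of the single most likely state path.
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((best[r][0] * trans_p[r][s] * emit_p[s][o], best[r][1] + [s])
                        for r in states), key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

observed = ('walk', 'shop', 'clean')
print(forward(observed, states, start_probability,
              transition_probability, emission_probability))
print(viterbi(observed, states, start_probability,
              transition_probability, emission_probability))

For this three-day observation sequence, the sketch reports a total probability of about 0.0336, and Sunny, Rainy, Rainy as the single most likely weather sequence (probability about 0.0134).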
21.2.4 History

Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of Bioinformatics.

21.3 Probabilistic Hidden Markov Model (Discrete-Space Model)

HMMs have been used to model alignments of three-dimensional structure in proteins (Stultz et al. 1993; Hubbard and Park 1995; Di Francesco et al. 1997, 1999; FORREST Web server at http://absalpha.dcrt.nih.gov:8008/). In one example of this approach, the models are trained on patterns of alpha helices, beta strands, tight turns, and loops in specific structural classes (Stultz et al. 1993, 1997; White et al. 1994), and may then be used to provide the most probable secondary structure and structural class of a protein. Protein three-dimensional domains can be modeled in a similar manner.

21.4 Prediction of Three-dimensional Protein Structure

Because the number of ways that proteins can fold appears to be limited, there is considerable optimism that ways will be found to predict the fold of any protein, given just its amino acid sequence. Structural alignment studies have revealed that there are more than 500 common structural folds found in the domains of the more than 12,500 three-dimensional structures in the Brookhaven Protein Data Bank. These studies have also revealed that many different sequences will adopt the same fold. Thus, there are many combinations of amino acids that can fit together into the same three-dimensional conformation, filling the available space and making suitable contacts with neighboring amino acids to adopt a common three-dimensional structure. There is also a reasonable probability that a new sequence will possess an already identified fold.

The object of fold recognition is to discover which fold is best matched. Considerable headway toward this goal has been made. Sequence alignment can be used to identify a family of homologous proteins that share sequence similarity, and presumably a similar three-dimensional structure. As discussed above, there are many databases that link sequence families to the known three-dimensional structure of a family member. The structure of even a remote family or superfamily member can be predicted through such sequence alignment methods.

When the sequence of a protein of unknown structure has no detectable similarity to other proteins, other methods of three-dimensional structure prediction may be employed. One such method is sequence threading. In threading, the amino acid sequence of a query protein is examined for compatibility with the structural core of a known protein structure. Recall that the protein core is made up of alpha helices, beta strands, and other structural elements folded into a compact structure. The environment of the core is strongly hydrophobic, with little room for water molecules, extra amino acids, or amino acid side chains that cannot fit into the available space. Side chains must also make contact with neighboring amino acid side chains in the structure, and these contacts are needed for folding and stability. Threading methods examine the sequence of a protein for compatibility of the side groups with a known protein core. The sequence is "threaded" into a database of protein cores to look for matches. If a reasonable degree of compatibility is found with a given structural core, the protein is predicted to fold into a similar three-dimensional configuration. Threading methods are undergoing a considerable degree of evolution at the present time. An excellent description of algorithms for threading is found in Lathrop et al. (1998). Presently available methods require considerable expertise with protein structure and with programming; however, there are some sites where the analysis may be performed on a Web server.

Check your progress:
1. Mention the three common probabilities.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

HMM

It turns out that sequence profiles are a special case of a more general mathematical approach called hidden Markov models (HMMs). These methods were originally used in speech recognition before they were applied to biological sequence analysis. A well-defined formalism exists, which helps with the theoretical understanding of what can be expected when applying HMMs to sequence analysis. This is an important advantage of using HMMs instead of sequence profiles: the underlying theoretical basis is much more solid. Also, Bayesian statistics is used in several aspects of the method.

A Markov process is a physical process of a special, but common, kind. The basic idea is that we have a physical system that goes stepwise through some kind of change. For example, it may be a die that we throw time and again; the change is the transition from one value to the next. An essential characteristic of a Markov process is that the change depends only on the current state. The history of the system does not matter: the states that the system has been in before are not relevant, since only the current state determines what will happen next. The system has no memory.
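As a minimal illustration of this memoryless behavior, the following sketch simulates a two-state Markov chain. The state names and probabilities are invented for illustration; note that the next state is drawn using only the current state, never the earlier history.

import random

random.seed(7)  # illustrative seed

# A hypothetical two-state Markov chain; the probabilities are invented.
transitions = {'GC-rich': {'GC-rich': 0.8, 'AT-rich': 0.2},
               'AT-rich': {'GC-rich': 0.3, 'AT-rich': 0.7}}

state = 'GC-rich'
for step in range(8):
    print(step, state)
    # The distribution for the next state depends only on the current
    # state; nothing about earlier steps is consulted.
    nxt = transitions[state]
    state = 'GC-rich' if random.random() < nxt['GC-rich'] else 'AT-rich'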
One may view a protein (or DNA) sequence as the record of such a process. There is some hidden process that generates a sequence of amino acid residues, in which chance (based on specific probabilities) plays an essential role in determining the exact sequence being produced. This is one (very crude) way of describing an HMM.

This approach can be applied to sequence motif searches. Given a multiple sequence alignment of a particular domain family, one uses statistical methods to build a specific HMM for that domain family. The probabilities that are required are estimated from the frequencies in the alignment, together with other data. This HMM can then be used to test whether other sequences match the domain family or not.

HMMs can be set up so that insertions, deletions, and substitutions are handled in sensible ways and their probabilities estimated properly. The plan (or topology) of an HMM determines which probabilities need to be estimated and what kinds of matches are allowed. For instance, it is perfectly possible to design an HMM plan that strictly forbids insertions and deletions. This means that it is very important for the HMM designer (i.e., the software programmer, usually not the user) to decide which type of topology should be implemented. This determines which kinds of sequence profiles can be matched by the HMM.

In an HMM plan designed for matching a sequence, each state corresponds to a residue in the sequence. The transitions between the states are assigned probabilities that are determined from the multiple sequence alignment used as the training set. In order to test whether a new sequence contains a segment that matches the HMM profile, an algorithm that works essentially like a dynamic programming algorithm is used to find the best match between the HMM profile and the sequence. The best match is the one that maximizes the transition probabilities given those particular residues.

Here is an example of what an HMM plan may look like. This is the plan used in the popular HMMER software, and the image was taken from its documentation.

Fig 21.1: A typical HMM model

The abbreviations for the states are as follows:
· [Mx] Match state x. Has K emission probabilities.
· [Dx] Delete state x. Non-emitter.
· [Ix] Insert state x. Has K emission probabilities.
· [S] Start state. Non-emitter.
· [N] N-terminal unaligned sequence state. Emits on transition with K emission probabilities.
· [B] Begin state (for entering the main model). Non-emitter.
· [E] End state (for exiting the main model). Non-emitter.
· [C] C-terminal unaligned sequence state. Emits on transition with K emission probabilities.
· [J] Joining segment unaligned sequence state. Emits on transition with K emission probabilities.

Compared with the HMM plan shown in the course book, this one is slightly more complicated. The reason is that the creator of HMMER (Sean Eddy) wanted a method that could locate a domain in a sequence even when the true domain is flanked by possibly very large regions of nonmatching sequence. Therefore the states N and C were added, which are used to match such completely irrelevant parts of a sequence.
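To make the estimation step mentioned above concrete, the sketch below turns the residue frequencies in the columns of a toy ungapped alignment block into match-state emission probabilities, using a simple pseudocount so that no residue gets probability zero. The alignment, the pseudocount, and the function name are illustrative assumptions; real tools such as HMMER use more sophisticated priors and also estimate insert- and delete-state probabilities from the gap patterns in the alignment.

from collections import Counter

def match_emissions(alignment, alphabet, pseudocount=1.0):
    # One emission distribution per alignment column (one per match state),
    # estimated from observed residue counts plus a uniform pseudocount.
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Toy four-sequence ungapped alignment block over a reduced alphabet.
block = ['ACDE',
         'ACDE',
         'SCDE',
         'ACEE']
for i, dist in enumerate(match_emissions(block, alphabet='ACDES'), start=1):
    print('match state', i, dist)

The same frequency-counting principle, with better priors, underlies how profile HMMs are trained from curated alignments in practice.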
21.5 Let us sum up

Nearest-neighbor methods are also a type of machine learning method. They predict the secondary structural conformation of an amino acid in the query sequence by identifying sequences of known structures that are similar to the query sequence. This unit discussed nearest-neighbor methods of secondary structure prediction, the hidden Markov model, the architecture of a hidden Markov model, the probability of an observed sequence, the use and history of hidden Markov models, the probabilistic hidden Markov model (discrete-space model), and the prediction of three-dimensional protein structure.

21.6 Lesson end activities

(i) Find out how the nearest-neighbor method differs from the other methods discussed.
(ii) Which do you consider the best method among all those discussed above? Why?

21.7 Check your progress: Model answers

1. Your answer must include these points:
(i) Start probability
(ii) Transition probability
(iii) Emission probability

21.9 Points for Discussion

1. Comment on the applications of HMM.

21.10 References

1. Tutorial from the University of Leeds.
2. J. Li, A. Najmi, and R. M. Gray, "Image classification by a two-dimensional hidden Markov model," IEEE Transactions on Signal Processing, 48(2):517-533, February 2000.
3. Y. Ephraim and N. Merhav, "Hidden Markov processes," IEEE Transactions on Information Theory, 48:1518-1569, June 2002.
4. B. Pardo and W. Birmingham, "Modeling Form for On-line Following of Musical Performances," AAAI-05 Proceedings, July 2005.
5. http://citeseer.ist.psu.edu/starner95visual.html
6. L. Satish and B. I. Gururaj, "Use of hidden Markov models for partial discharge pattern classification," IEEE Transactions on Dielectrics and Electrical Insulation, April 1993.