UNIT I

LESSON – 1: INTRODUCTION TO BIOINFORMATICS

1.0 Aims and Objectives
1.1 Introduction to Bioinformatics
1.2 Landmark Sequences Completed
1.3 Sequence Analysis: Sequence to Potential Function
1.4 The Creation of Sequence Databases
1.5 Searching for Genes
1.6 Let us sum up
1.7 Lesson end activities
1.8 Check your progress
1.9 Points for Discussion
1.10 References

1.0 Aims and Objectives:
This lesson introduces Bioinformatics, surveys the landmark sequences completed so far, and describes computational biology and sequence databases.

1.1 Introduction to Bioinformatics:
Bioinformatics and computational biology involve the use of techniques from applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the molecular level. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.

The terms Bioinformatics and computational biology are often used interchangeably. However, Bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. Computational biology, on the other hand, refers to hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge. Put more simply, Bioinformatics is concerned with the information, while computational biology is concerned with the hypotheses. A similar distinction is made by the National Institutes of Health in its working definitions of Bioinformatics and Computational Biology, where it is further emphasized that there is a tight coupling of developments and knowledge between the more hypothesis-driven research in computational biology and the technique-driven research in Bioinformatics. Bioinformatics is also often described as an applied subfield of the more general discipline of biomedical informatics.

A common thread in Bioinformatics and computational biology projects is the use of mathematical tools to extract useful information from data produced by high-throughput biological techniques such as genome sequencing. A representative problem in Bioinformatics is the assembly of high-quality genome sequences from fragmentary "shotgun" DNA sequencing (a toy sketch of this idea appears below). Other common problems include the study of gene regulation using data from microarrays or mass spectrometry.

In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes of several species. In fact, to date, several bacterial genomes, as well as those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast) and more complex eukaryotes (C. elegans and Drosophila), have been sequenced in full.
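To make the shotgun-assembly idea concrete, here is a toy, hypothetical sketch in Python: it greedily merges the pair of fragments with the longest suffix-prefix overlap until nothing overlaps any more. Real assemblers use far more sophisticated graph-based methods and must cope with sequencing errors and repeats; the names greedy_assemble and overlap are illustrative only.

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that equals a prefix of b."""
        start = 0
        while True:
            start = a.find(b[:min_len], start)  # candidate start position in a
            if start == -1:
                return 0
            if b.startswith(a[start:]):
                return len(a) - start
            start += 1

    def greedy_assemble(fragments):
        """Toy greedy assembly: repeatedly merge the best-overlapping pair."""
        reads = list(fragments)
        while len(reads) > 1:
            best_len, best_i, best_j = 0, None, None
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        olen = overlap(a, b)
                        if olen > best_len:
                            best_len, best_i, best_j = olen, i, j
            if best_len == 0:
                break  # no remaining overlaps; fragments cannot be joined
            merged = reads[best_i] + reads[best_j][best_len:]
            reads = [r for k, r in enumerate(reads)
                     if k not in (best_i, best_j)] + [merged]
        return reads

    # Three overlapping "reads" reassemble into one contig:
    print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]))
    # -> ['ATTAGACCTGCCGGAATAC']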
The Human Genome Project, designed to sequence all 24 of the human chromosomes, is also progressing, and a rough draft was completed in the spring of 2000. Popular sequence databases, such as GenBank and EMBL, have been growing at exponential rates. This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.

1.2 Landmark Sequences Completed
· tRNA (1964) - 75 bases (by an old, slow, complicated method)
· First complete DNA genome: phage ΦX174 DNA (1977) - 5,386 bases
· Human mitochondrial DNA (1981) - 16,569 bases
· Tobacco chloroplast DNA (1986) - 155,844 bases
· First complete bacterial genome, H. influenzae (1995) - 1.9 x 10^6 bases
· Yeast genome (a eukaryote, at ~1.5 x 10^7 bases), completed in 1996
· Several archaebacteria
· E. coli - 4 x 10^6 bases (1997 and 1998)
· Several pathogenic bacterial genomes sequenced:
  o Helicobacter pylori (ulcers)
  o Treponema pallidum (syphilis)
  o Borrelia burgdorferi (Lyme disease)
  o Chlamydia trachomatis (trachoma - blindness)
  o Rickettsia prowazekii (epidemic typhus)
  o Mycobacterium tuberculosis (tuberculosis)
· Nematode C. elegans (~10^8 bases) - December 1998
· Drosophila (fruit fly) (2000)
· Human genome (rough draft completed 5/00) - 3 x 10^9 bases

Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence, and structural information. This includes databases of the sequences and structural information as well as methods to access, search, visualize and retrieve the information. Sequence data can be used to make predictions of the functions of newly identified genes, estimate evolutionary distance in phylogeny reconstruction, determine the active sites of enzymes, construct novel mutations, and characterize alleles of genetic diseases, to name just a few uses.

Sequence data facilitates:
· Analysis of the organization of genes and genomes and their evolution
· Prediction of protein sequence from DNA sequence, which in turn facilitates prediction of protein properties, structure, and function (proteins are rarely sequenced in their entirety today)
· Identification of regulatory elements in genes or RNAs
· Identification of mutations that lead to disease, etc.

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within Bioinformatics involving computational biology:
· the development of new algorithms and statistics with which to assess relationships among members of large data sets;
· the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures;
· the development and implementation of tools that enable efficient access and management of different types of information.

One of the simpler tasks in Bioinformatics concerns the creation and maintenance of databases of biological information.
Nucleic acid sequences (and the protein sequences derived from them) comprise the majority of such databases. While the storage and organization of millions of nucleotides is far from trivial, designing a database and developing an interface whereby researchers can both access existing information and submit new entries is only the beginning. The most pressing tasks in Bioinformatics involve the analysis of sequence information. Computational Biology is the name given to this process, and it involves the following:
· Finding the genes in the DNA sequences of various organisms
· Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
· Clustering protein sequences into families of related sequences and developing protein models
· Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships

Data mining is the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better characterized organisms. For example, new insight into the molecular basis of a disease may come from investigating the function of homologs of the disease gene in model organisms. Equally exciting is the potential for uncovering phylogenetic relationships and evolutionary patterns. The process of evolution has produced DNA sequences that encode proteins with very specific functions. It is possible to predict the three-dimensional structure of a protein using algorithms that have been derived from our knowledge of physics, chemistry and, most importantly, from the analysis of other proteins with similar amino acid sequences.

1.3 Sequence Analysis: Sequence to Potential Function
Sequence to potential function (see flow chart scheme and handout):
· ORF prediction and gene identification (see handout for eukaryotic gene organization)
· Search databases for potential protein function or a homologue
· Protein structure prediction and multiple sequence alignment (conserved regions)
· Analysis of potential gene regulatory elements
· Gene knockout or inhibition (RNA interference) for phenotypic analysis

Overview of sequence analysis (see handout). Sequence data facilitates:
· Analysis of the organization of genes and genomes
· Prediction of protein properties, functions, and structure from gene sequence or cDNA (proteins are rarely sequenced in their entirety today) -- e.g., cystic fibrosis
· Identification of regulatory elements
· Identification of mutations that lead to disease, etc.

1.4 The Creation of Sequence Databases
Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or a section thereof), respectively. Sequences are represented in shorthand, using single-letter designations. This decreases the space necessary to store the information and increases the processing speed for analysis (a small sketch of compact sequence storage follows below). While most biological databases contain nucleotide and protein sequence information, there are also databases that include taxonomic information, such as the structural and biochemical characteristics of organisms.
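As an illustration of how compactly sequences can be stored, the sketch below packs each DNA base into two bits, so four bases fit in one byte. This is an illustrative toy, not any real database's actual storage scheme.

    # Two bits suffice for the four DNA letters, so four bases fit in one byte.
    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BASE = {v: k for k, v in CODE.items()}

    def pack(seq):
        """Pack an A/C/G/T string into bytes, four bases per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            chunk = seq[i:i + 4]
            byte = 0
            for base in chunk:
                byte = (byte << 2) | CODE[base]
            byte <<= 2 * (4 - len(chunk))   # left-align a partial final chunk
            out.append(byte)
        return bytes(out)

    def unpack(data, length):
        """Recover the original string; length says how many bases are real."""
        bases = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                bases.append(BASE[(byte >> shift) & 0b11])
        return "".join(bases[:length])

    seq = "ACGTGATTACA"
    packed = pack(seq)
    assert unpack(packed, len(seq)) == seq
    print(len(seq), "bases stored in", len(packed), "bytes")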
The power and ease of using sequence information has, however, made it the method of choice in modern analysis. In the last three decades, contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into bacteria. In this way, rapid mass production of particular DNA sequences, a necessary prelude to sequence determination, became possible. Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used to probe vast libraries of DNA to extract genes containing that sequence. Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA sequences or to modify these sequences.

With these techniques in place, progress in biological research increased exponentially. For researchers to benefit from all this information, however, two additional things were required: (1) ready access to the collected pool of sequence information, and (2) a way to extract from this pool only those sequences of interest to a given researcher. Simply collecting, by hand, all the sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of the data still remained; it would take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins.

Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models. The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it.

Check your progress:
1. List the processes involved in computational biology.
Notes:
a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Description and links to the NCBI Entrez database (an example of a complex database): an illustration of the nucleotide sequence database with Entrez (selected Davis Lab sequence entries). You should explore the NCBI and its services on your own after browsing these entries.
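Assuming the Biopython library is installed, a record such as the yeast chromosome VI entry mentioned below (accession D50617) can be fetched from Entrez programmatically. This is a hedged sketch of typical Bio.Entrez usage, not an official NCBI recipe; the e-mail address is a placeholder that NCBI asks each user to replace with their own.

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # placeholder; NCBI requests a real contact

    # Fetch the GenBank-formatted text of accession D50617 (yeast chromosome VI)
    handle = Entrez.efetch(db="nucleotide", id="D50617",
                           rettype="gb", retmode="text")
    record_text = handle.read()
    handle.close()
    print(record_text[:300])   # the LOCUS/DEFINITION lines of the entry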
Databases of protein and nucleic acid sequences
· In the US, the repository of this information is the National Center for Biotechnology Information (NCBI).
· The database at the NCBI is a collated and interlinked dataset known as the Entrez databases.
  o Description of the Entrez databases
  o Examples of selected database files:
    § Protein (most protein sequence is derived from conceptual translation)
    § Chromosome with genes and predicted proteins (accession no. D50617, yeast chromosome VI; Entrez)
    § Genome (C. elegans)
    § Protein structure (TPI database file or Chime structure)
    § Expressed sequence tags (ESTs) (summary of current data)
  o Neighboring

Searching databases to identify sequences and to predict functions or properties of predicted proteins:
· Searching by keyword, accession number, etc.
· Searching for homologous sequences
  o See the NCBI BLAST. BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.

1.5 Searching for Genes
The collecting, organizing and indexing of sequence information into a database, a challenging task in itself, provides the scientist with a wealth of information, albeit of limited use. The power of a database comes not from the collection of information, but from its analysis. A sequence of DNA does not necessarily constitute a gene; it may constitute only a fragment of a gene or, alternatively, it may contain several genes. Luckily, in agreement with evolutionary principles, scientific research to date has shown that all genes share common elements. For many genetic elements, it has been possible to construct consensus sequences, those sequences best representing the norm for a given class of organisms (e.g., bacteria, eukaryotes). Common genetic elements include promoters, enhancers, polyadenylation signal sequences and protein binding sites. These elements have been further characterized into subelements. Genetic elements share common sequences, and it is this fact that allows mathematical algorithms to be applied to the analysis of sequence data (a simple illustration follows below).
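A minimal illustration of such an algorithm is the classic open-reading-frame scan sketched below: it looks for an ATG start codon followed, in the same reading frame, by a stop codon. This is a deliberately simplified, hypothetical example (forward strand only, no introns, no promoter or other consensus elements), not a production gene finder.

    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def find_orfs(dna, min_codons=100):
        """Return (start, end) positions of simple forward-strand ORFs."""
        orfs = []
        for frame in range(3):                       # three forward reading frames
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                        # first start codon in this frame
                elif codon in STOP_CODONS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((start, i + 3))  # include the stop codon
                    start = None
        return orfs

    # Short demo with a low threshold; real scans use ~100 codons or more
    print(find_orfs("CCATGAAATGGTTTAAGGG", min_codons=2))
    # -> [(7, 16)]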
1.6 Let us sum up:
Bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. This lesson gives a brief introduction to Bioinformatics and its areas of specialization, and explains the basic difference between Bioinformatics and computational biology. It also gives details of the various sequence resources available and other facilities.

1.7 Lesson end activities:
i. Find out the details of the various genomes sequenced so far.
ii. What are the various databases for protein and DNA sequence?
iii. Give details of the various options available at NCBI.

1.8 Check your progress: Model answers
1. Your answer must include these points:
· Finding the genes in the DNA sequences of various organisms
· Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
· Clustering protein sequences into families of related sequences and developing protein models
· Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships

1.9 Points for Discussion
1. Substantiate the significance of Bioinformatics.
2. Compare and contrast Bioinformatics and computational biology.

1.10 References:
1. Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Bairoch A., Bucher P., and Hofmann K. 1997. The PROSITE database, its status in 1997. Nucleic Acids Res. 25: 217–221.
4. Barker W.C. and Dayhoff M.O. 1982. Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. 79: 2836–2839.
5. Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. 93: 9188–9193.

LESSON – 2: CLASSIFICATION OF BIOLOGICAL DATABASES

2.0 Aims and Objectives
2.1 Classification of Biological Databases
2.2 Primary sequence databases
2.3 Protein sequence databases
2.4 Protein structure databases
2.5 Other databases
2.6 Specialized databases
2.7 Let us sum up
2.8 Lesson end activities
2.9 Check your progress
2.10 Points for Discussion
2.11 References

2.0 Aims and Objectives:
This lesson describes the different biological databases: primary sequence databases, meta-databases, genome browsers, protein sequence databases, and protein structure databases.

2.1 Classification of Biological Databases
Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications, and aids in discovering basic relationships among species in the history of life.

Biological knowledge is usually distributed among many different specialized databases. This makes it difficult to ensure the consistency of information, which sometimes leads to low data quality. By far the most important resource for locating biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The NAR Database Issue is freely available and categorizes all the publicly available online databases related to computational biology (or Bioinformatics).

2.2 Primary sequence databases
The International Nucleotide Sequence Database (INSD) consists of the following databases:
1. DDBJ (DNA Data Bank of Japan)
2. EMBL Nucleotide DB (European Molecular Biology Laboratory)
3. GenBank (National Center for Biotechnology Information)
These databanks represent the current knowledge about the sequences of all organisms. They interchange the stored information and are the source for many other databases.

Meta-databases
Strictly speaking, a meta-database can be considered a database of databases, rather than any one integration project or technology.
It collects information from other sources and usually makes it available in a new, more convenient form. The following are some examples of meta-databases:
1. Entrez (National Center for Biotechnology Information)
2. euGenes (Indiana University)
3. GeneCards (Weizmann Institute)
4. SOURCE (Stanford University)
5. mGen, containing four of the world's biggest databases (GenBank, RefSeq, EMBL and DDBJ), with easy and simple program-friendly gene extraction
6. Harvester III (Karlsruhe Institute of Technology), integrating 26 major protein/gene resources

Genome browsers
Genome browsers enable researchers to visualize and browse entire genomes (most host many complete genomes) with annotated data, including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotated data usually come from multiple diverse sources. The following are some examples of genome browsers:
1. Integrated Microbial Genomes (IMG) system, by the DOE Joint Genome Institute
2. UCSC Genome Bioinformatics Genome Browser and Tools (UCSC)
3. Ensembl: the Ensembl Genome Browser (Sanger Institute and EBI)
4. GBrowse: the GMOD GBrowse project
5. Pathway Tools Genome Browser
6. X:Map, a genome browser that shows Affymetrix exon microarray hit locations alongside the gene, transcript and exon data on a Google Maps API

2.3 Protein sequence databases
1. UniProt: Universal Protein Resource (UniProt Consortium: EBI, Expasy, PIR)
2. PIR: Protein Information Resource (Georgetown University Medical Center (GUMC))
3. Swiss-Prot: Protein Knowledgebase (Swiss Institute of Bioinformatics)
4. PEDANT: Protein Extraction, Description and ANalysis Tool (Forschungszentrum f. Umwelt & Gesundheit)
5. PROSITE: Database of Protein Families and Domains
6. DIP: Database of Interacting Proteins (Univ. of California)
7. Pfam: Protein families database of alignments and HMMs (Sanger Institute)
8. ProDom: Comprehensive set of Protein Domain Families (INRA/CNRS)
9. SignalP: Server for signal peptide prediction

2.4 Protein structure databases
Protein structure databases:
1. Protein Data Bank (PDB) (Research Collaboratory for Structural Bioinformatics (RCSB))
2. CATH: Protein Structure Classification
3. SCOP: Structural Classification of Proteins
4. SWISS-MODEL: Server and Repository for Protein Structure Models
5. ModBase: Database of Comparative Protein Structure Models (Sali Lab, UCSF)

Protein-protein interactions:
1. BioGRID: A General Repository for Interaction Datasets (Samuel Lunenfeld Research Institute)
2. STRING: a database of known and predicted protein-protein interactions (EMBL)
3. Database of Interacting Proteins

2.5 Other Databases
Pathway databases:
1. BioCyc Database Collection, including EcoCyc and MetaCyc
2. KEGG PATHWAY Database (Univ. of Kyoto)
3. Reactome (Cold Spring Harbor Laboratory, EBI, Gene Ontology Consortium)

Microarray databases:
1. ArrayExpress (European Bioinformatics Institute)
2. Gene Expression Omnibus (National Center for Biotechnology Information)
3. maxd (Univ. of Manchester)
4. SMD (Stanford University)
5. GPX (Scottish Centre for Genomic Technology and Informatics)
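As a small example of pulling an entry out of one of these repositories, the sketch below downloads a coordinate file from the RCSB Protein Data Bank over HTTP. The download URL pattern and the example identifier 1TIM (triosephosphate isomerase, the "TPI" structure mentioned in Lesson 1) are believed correct at the time of writing but may change; treat them as assumptions.

    from urllib.request import urlretrieve

    pdb_id = "1TIM"   # triosephosphate isomerase; substitute any PDB identifier
    url = "https://files.rcsb.org/download/%s.pdb" % pdb_id
    urlretrieve(url, pdb_id + ".pdb")            # save a local copy

    with open(pdb_id + ".pdb") as structure:
        print(structure.readline().rstrip())     # the HEADER record of the entry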
Check your progress:
1. List the protein structure databases available.
Notes:
a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

2.6 Specialized databases
1. CGAP: Cancer Genes (National Cancer Institute)
2. Clone Registry: Clone Collections (National Center for Biotechnology Information)
3. DBGET: H. sapiens (Univ. of Kyoto)
4. GDB: Human Genome Db (Human Genome Organisation)
5. MGI: Mouse Genome (Jackson Lab.)
6. SHMPD: The Singapore Human Mutation and Polymorphism Database
7. NCBI UniGene (National Center for Biotechnology Information)
8. OMIM: Inherited Diseases (Online Mendelian Inheritance in Man)
9. Official Human Genome Db (HUGO Gene Nomenclature Committee)
10. HGMD: disease-causing mutations (Human Gene Mutation Database)
11. Lists of SNP databases
12. p53: The p53 Knowledgebase
13. Edinburgh Mouse Atlas
14. Corn (Maize Genetics and Genomics Database)

2.7 Let us sum up:
Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. Biological knowledge is usually distributed among many different specialized databases. This lesson surveys the various biological databases available and their web locations, listing the available databases under various categories.

2.8 Lesson end activities:
i. Give details of the various primary sequence databases.
ii. Give details of the various protein structure databases.
iii. Give details of the NCBI and KEGG.

2.9 Check your progress: Model answers
1. Your answer must include any of these points:
1. Protein Data Bank (PDB)
2. CATH
3. SCOP
4. SWISS-MODEL
5. ModBase

2.10 Points for Discussion
1. Make a critical analysis of the primary sequence databases.
2. Make a comparative study of the protein structure databases.

2.11 References
1. Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Bairoch A., Bucher P., and Hofmann K. 1997. The PROSITE database, its status in 1997. Nucleic Acids Res. 25: 217–221.
4. Barker W.C. and Dayhoff M.O. 1982. Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. 79: 2836–2839.
5. Barns S.M., Delwiche C.F., Palmer J.D., and Pace N.R. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. 93: 9188–9193.

LESSON – 3: BIOLOGICAL DATA FORMATS

3.0 Aims and Objectives
3.1 Biological Data Formats
3.2 GenBank DNA Sequence Entry
3.3 European Molecular Biology Laboratory Data Library Format
3.4 SwissProt Sequence Format
3.5 FASTA Sequence Format
3.6 National Biomedical Research Foundation/Protein Information Resource Sequence Format
3.7 Stanford University/Intelligenetics Sequence Format
3.8 Genetics Computer Group Sequence Format
3.9 Plain/ASCII Staden Sequence Format
3.10 Abstract Syntax Notation Sequence Format
3.11 Genetic Data Environment Sequence Format
3.12 Multiple Sequence Formats
3.13 Let us sum up
3.14 Lesson end activities
3.15 Check your progress
3.16 Points for Discussion
3.17 References

3.0 Aims and Objectives:
This lesson describes the different biological data formats used by sequence databases and sequence analysis programs, and the tools available for converting between them.

3.1 Biological Data Formats
One major difficulty encountered in running sequence analysis software is the use of differing sequence formats by different programs. These formats are all standard ASCII files, but they may differ in the presence of certain characters and words that indicate where different types of information and the sequence itself are to be found. The more commonly used sequence formats are discussed below.

3.2 GenBank DNA Sequence Entry
The format of a database entry in GenBank, the NCBI nucleic acid and protein sequence database, is as follows. Information describing each sequence entry is given, including literature references, information about the function of the sequence, locations of mRNAs and coding regions, and positions of important mutations. This information is organized into fields, each with an identifier, shown as the first text on each line. In some entries, these identifiers may be abbreviated to two letters (e.g., RF for reference), and some identifiers may have additional subfields. The CDS subfield in the FEATURES field gives the amino acid sequence, obtained by translation of known and potential open reading frames, i.e., a consecutive set of three-letter words that could be codons specifying the amino acid sequence of a protein. The sequence entry is assumed by computer programs to lie between the identifiers "ORIGIN" and "//". The sequence includes numbers on each line so that sequence positions can be located by eye. Because the sequence count or a sequence checksum value may be used by a computer program to verify the sequence composition, the sequence should not be modified except by programs that also modify the count. The GenBank sequence format often has to be changed for use with sequence analysis software (a parsing sketch follows below).
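Assuming Biopython is installed, its Bio.SeqIO module reads GenBank-format entries directly, exposing the fields described above. A hedged sketch follows; the file name entry.gb is a placeholder for any saved GenBank record.

    from Bio import SeqIO

    # "genbank" tells the parser to expect the LOCUS/FEATURES/ORIGIN layout
    record = SeqIO.read("entry.gb", "genbank")   # one entry per file here
    print(record.id, record.description)

    for feature in record.features:
        if feature.type == "CDS":                # the CDS subfield of FEATURES
            protein = feature.qualifiers.get("translation", ["(none)"])[0]
            print(feature.location, protein[:30])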
3.3 European Molecular Biology Laboratory Data Library Format
The European Molecular Biology Laboratory (EMBL) maintains DNA and protein sequence databases. As with GenBank entries, a large amount of information describing each sequence entry is given, including literature references, information about the function of the sequence, locations of mRNAs and coding regions, and positions of important mutations. This information is organized into fields, each with an identifier, shown as the first text on each line. These identifiers are abbreviated to two letters (e.g., RF for reference), and some identifiers may have additional subfields. The sequence entry is assumed by computer programs to lie between the identifiers "SEQUENCE" and "//" and includes numbers on each line to locate parts of the sequence visually. The sequence count or a checksum value for the sequence may be used by computer programs to make sure that the sequence is complete and accurate. For this reason, the sequence part of the entry should usually not be modified except with programs that also modify this count. The EMBL sequence format is very similar to the GenBank format. The main differences are the use of the term ORIGIN in the GenBank format to indicate the start of the sequence, and the fact that an EMBL entry does not include the sequence of any translation products, which are shown instead as separate entries in the database. This sequence format often has to be changed for use with sequence analysis software.

3.4 SwissProt Sequence Format
The format of an entry in the SwissProt protein sequence database is very similar to the EMBL format, except that considerably more information about the physical and biochemical properties of the protein is provided.

3.5 FASTA Sequence Format
The FASTA sequence format includes three parts: (1) a comment line, identified by a ">" character in the first column, followed by the name and origin of the sequence; (2) the sequence in standard one-letter symbols; and (3) an optional "*", which indicates the end of the sequence and which may or may not be present. The presence of the "*" may be essential for some sequence analysis programs to read the sequence correctly. The FASTA format is the one most often used by sequence analysis software. This format provides a very convenient way to copy just the sequence part from one window to another because there are no numbers or other non-sequence characters within the sequence (a minimal FASTA reader is sketched below). The FASTA sequence format is similar to the Protein Information Resource (PIR) format, except that the PIR format includes a first line with a ">" character in the first column followed by information about the sequence, a second line containing an identification name for the sequence, and the third to last lines containing the sequence, as described below.
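Because the format is so simple, a FASTA reader fits in a few lines of Python. The sketch below is illustrative (the file name is a placeholder), and it tolerates the optional trailing "*" described above.

    def read_fasta(path):
        """Yield (header, sequence) pairs from a FASTA-format file."""
        header, parts = None, []
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(parts)
                    header, parts = line[1:], []    # text after ">" names the entry
                elif line:
                    parts.append(line.rstrip("*"))  # drop the optional terminator
            if header is not None:
                yield header, "".join(parts)

    for name, seq in read_fasta("sequences.fasta"):  # placeholder file name
        print(name, len(seq))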
3.6 National Biomedical Research Foundation/Protein Information Resource Sequence Format
This sequence format, sometimes also called the PIR format, has been used by the National Biomedical Research Foundation/Protein Information Resource (NBRF/PIR) and also by other sequence analysis programs. Note that sequences retrieved from the PIR database on its web site (http://www-nbrf.georgetown.edu) are not in this compact format, but in an expanded format with much more information about the sequence. The NBRF/PIR format is similar to the FASTA sequence format but with significant differences. The first line includes an initial ">" character followed by a two-letter code, such as P for complete sequence or F for fragment, followed by a 1 or 2 to indicate the type of sequence, then a semicolon, then a four- to six-character unique name for the entry. There is also an essential second line with the full name of the sequence, a hyphen, and then the species of origin. In FASTA format, by contrast, the second line is the start of the sequence, and the first line gives the sequence identifier after a ">" sign. The sequence terminates with an "*".

3.7 Stanford University/Intelligenetics Sequence Format
Started by a molecular genetics group at Stanford University and subsequently continued by a company, Intelligenetics, the IG format is similar to the PIR format, except that a semicolon is usually placed before the comment line. The identifier on the second line is also present. At the end of the sequence, a 1 is placed if the sequence is linear, and a 2 if the sequence is circular.

3.8 Genetics Computer Group Sequence Format
Earlier versions of the Genetics Computer Group (GCG) programs required a unique sequence format, and the package includes programs that convert other sequence formats into GCG format. Later versions of GCG accept several sequence formats. Information about the sequence from the GenBank entry is included first, followed by a line of information about the sequence and a checksum value. This value is provided as a check on the accuracy of the sequence, computed from the ASCII values of the sequence characters (a sketch of such a checksum follows below). If the sequence has not been changed, this value should stay the same. If one or more sequence characters are changed through error, a program reading the sequence will be able to determine that a change has occurred because the checksum value in the sequence entry will no longer be correct. Lines of information are terminated by two periods, which mark the end of the information and the start of the sequence on the next line. The rest of the text in the entry is treated as the sequence; since there is no symbol to indicate the end of the sequence, no text other than sequence should be added beyond this point. The sequence should not be altered except by programs that also adjust the checksum score. The GCG sequence format may have to be changed for use with other sequence analysis software. GCG also includes programs for reformatting sequence files.
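The text above describes the checksum simply as an addition of ASCII values; the widely used GCG algorithm, as implemented for example in Biopython's Bio.SeqUtils.CheckSum, additionally weights each character by a position index that cycles from 1 to 57 and reduces the sum modulo 10,000. A sketch along those lines:

    def gcg_checksum(seq):
        """GCG-style checksum: position-weighted sum of ASCII values, mod 10,000."""
        total = 0
        for i, char in enumerate(seq.upper()):
            total += ((i % 57) + 1) * ord(char)   # weights cycle 1..57
        return total % 10000

    seq = "GATTACA"
    print(gcg_checksum(seq))                      # changes if any character changes
    print(gcg_checksum(seq[:3] + "C" + seq[4:]))  # a single substitution alters it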
3.9 Plain/ASCII Staden Sequence Format
This sequence format is a computer file that includes only the sequence, with no accessory information. This particular format is used by the Staden sequence analysis programs (http://www.mrc-lmb.cam.ac.uk/pubseq) produced by Roger Staden at Cambridge University (Staden et al. 2000). The sequence must be further formatted before it can be used with most other sequence analysis programs.

3.10 Abstract Syntax Notation Sequence Format
Abstract Syntax Notation 1 (ASN.1) is a formal data description language developed by the computer industry. ASN.1 (http://www-sop.inria.fr/rodeo/personnel/hoschka/asn1.html; NCBI 1993) has been adopted by the National Center for Biotechnology Information (NCBI) to encode data such as sequences, maps, taxonomic information, molecular structures, and bibliographic information. These data sets may then be easily connected and accessed by computers. The ASN.1 sequence format is a highly structured and detailed format especially designed for computer access to the data. All the information found in other forms of sequence storage, e.g., the GenBank format, is present. For example, sequences can be retrieved in this format by ENTREZ. However, the information is much more difficult to read by eye than a GenBank-formatted sequence. One would normally not be required to use the ASN.1 format except when running a computer program that uses this format as input.

3.11 Genetic Data Environment Sequence Format
The Genetic Data Environment (GDE) format is used by a sequence analysis system called the Genetic Data Environment, which was designed by Steven Smith and collaborators (Smith et al. 1994) around a multiple sequence alignment editor that runs on UNIX machines. The GDE features are incorporated into the SEQLAB interface of the GCG software, version 9. GDE format is a tagged-field format similar to ASN.1 that is used for storing all available information about a sequence, including residue colour. The file consists of various fields, each enclosed by brackets, and each field has specific lines, each with a given name tag. The information following each tag is placed in double quotes or follows the tag name after one or more spaces.

Check your progress:
1. What does an NBRF sequence format look like?
Notes:
a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

READSEQ to Switch between Sequence Formats
READSEQ is an extremely useful sequence formatting program developed by D. G. Gilbert at Indiana University, Bloomington (gilbertd@bio.indiana.edu). READSEQ can recognize a DNA or protein sequence file in any of the formats discussed above, identify the format, and write a new file in an alternative format (a scripted equivalent is sketched after the format list below). Some of these formats are used for special types of analysis, such as multiple sequence alignment and phylogenetic analysis. READSEQ may be reached at the Baylor College of Medicine site at http://dot.imgen.bcm.tmc.edu:9331/seq-util/readseq.html and also by anonymous FTP from ftp.bio.indiana.edu/molbio/readseq or ftp.bio.indiana.edu/molbio/mac to obtain the appropriate files. Data files that have multiple sequences, such as those required for multiple sequence alignment and phylogenetic analysis using parsimony (PAUP), are also converted. Options to reverse-complement sequences and to remove gaps from them are included. SEQIO, another sequence conversion program for UNIX machines, is described at http://bioweb.pasteur.fr/docs/seqio/seqio.html and is available for download at http://www.cs.ucdavis.edu/~gusfield/seqio.html.

Sequence formats recognized by the format conversion program READSEQ:
1. Abstract Syntax Notation (ASN.1)
2. DNA Strider
3. European Molecular Biology Laboratory (EMBL)
4. FASTA/Pearson
5. Fitch (for phylogenetic analysis)
6. GenBank
7. Genetics Computer Group (GCG)
8. Intelligenetics/Stanford
9. Multiple sequence format (MSF)
10. National Biomedical Research Foundation (NBRF)
11. Olsen (input only)
12. Phylogenetic Analysis Using Parsimony (PAUP) NEXUS format
13. Phylogenetic Inference Package (PHYLIP v3.3, v3.4)
14. Phylogenetic Inference Package (PHYLIP v3.2)
15. Plain text/Staden
16. Pretty format for publication (output only)
17. Protein Information Resource (PIR or CODATA)
18. Zuker for RNA analysis (input only)
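READSEQ itself is a stand-alone program, but the same job can be scripted. Assuming Biopython is installed, Bio.SeqIO.convert translates between many of the formats in the list above; the file names here are placeholders.

    from Bio import SeqIO

    # GenBank -> FASTA, one of the most common conversions in practice
    count = SeqIO.convert("entry.gb", "genbank", "entry.fasta", "fasta")
    print(count, "record(s) converted")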
GCG Programs for Conversion of Sequence Formats
The "FROM" programs convert sequence files from GCG format into the named format, and the "TO" programs convert the alternative format into GCG format. Shown below are the actual program names, with no spaces included. There are no programs that convert to GenBank or EMBL formats.

FROMEMBL
FROMFASTA
FROMGENBANK
FROMIG
FROMPIR
FROMSTADEN
TOFASTA
TOIG
TOPIR
TOSTADEN

In addition, the GCG package includes the following sequence formatting programs: (1) GETSEQ, which converts a simple ASCII file being received from a remote PC to GCG format; (2) REFORMAT, which will format a GCG file that has been edited, and will also perform other functions; and (3) SPEW, which sends a GCG sequence file as an ASCII file to a remote PC.

3.12 Multiple Sequence Formats
Most of the sequence formats listed above can be used to store multiple sequences in tandem in the same computer file. Exceptions are the GCG and raw sequence formats, which are designed only for single sequences. GCG has an alternative multiple sequence format, which is described below. In addition, there are formats especially designed for multiple sequences that can also be used to show their alignments or to perform types of multiple sequence analysis such as phylogenetic analysis. In the case of PAUP, the program will accept MSA format and convert it to the NEXUS format. These formats are illustrated below using the same short sequences, given first in FASTA format. In the aligned representation, the aligned sequence characters occupy the same line and column, and gaps are indicated by a dash.

>gi|730305|
MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFL
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
>gi|404390|
----------------------APEAQVSVQPNFQPDKFL
RTQTPRAELKEKFTAFCKAQGFTEDSIVFLPQTDKCMTEQ
>gi|895868
MAALRMLWMGLVLLGLLGFPQTPAQGHDTVQPNFQQDKFL
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE

This represents the same alignment as:

MATHHTLWMGLALLGVLGDLQAAPEAQVSVQPNFQQDKFL
----------------------APEAQVSVQPNFQPDKFL
RTQTPRAELKEKFTAFCKAQGFTEDTIVFLPQTDKCMTEQ
RTQTLKDELKEKFTTFSKAQGLTEEDIVFLPQPDKCIQE

Storage of Information in a Sequence Database
As shown by the above examples, each DNA or protein sequence database entry carries much information, including assigned accession number(s); source organism; name of locus; reference(s); keywords that apply to the sequence; features in the sequence such as coding regions, intron splice sites, and mutations; and, finally, the sequence itself. This information is organized into a tabular form very much like that found in a relational database. If one imagines a large table with each sequence entry occupying one row, then each column will include one of the above types of information for each sequence; each column is called a FIELD, and the last column contains the sequences themselves (a small sketch of such a table follows at the end of this section). It is very easy to make an index of the information in each of these fields so that a search query can locate all the occurrences through the index. Even related sequences are cross-referenced. In addition, the information in one database can be cross-referenced to that in another database. The DNA, protein, and reference databases have all been cross-referenced so that moving between them is readily accomplished.
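The row/column picture above maps directly onto an ordinary relational table. Here is a minimal sketch using Python's built-in sqlite3 module; the table layout is invented for illustration, and real sequence databases are far richer.

    import sqlite3

    con = sqlite3.connect("sequences.db")
    # One row per sequence entry; each column is one FIELD of the entry.
    con.execute("""CREATE TABLE IF NOT EXISTS entry (
                       accession TEXT PRIMARY KEY,
                       organism  TEXT,
                       keywords  TEXT,
                       sequence  TEXT)""")
    # An index on a field lets a query locate all occurrences quickly.
    con.execute("CREATE INDEX IF NOT EXISTS idx_organism ON entry (organism)")
    con.execute("INSERT OR REPLACE INTO entry VALUES (?, ?, ?, ?)",
                ("D50617", "Saccharomyces cerevisiae", "chromosome VI", "ATG..."))
    con.commit()

    for (accession,) in con.execute(
            "SELECT accession FROM entry WHERE organism = ?",
            ("Saccharomyces cerevisiae",)):
        print(accession)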
Database Types
There are several types of databases; the two principal types are relational and object-oriented databases. The relational database orders data in tables made up of rows giving specific items in the database, and columns giving the features as attributes of those items. These tables are carefully indexed and cross-referenced with each other, sometimes using additional tables, so that each item in the database has a unique set of identifying features. A relational model for the GenBank sequence database has been devised at the National Center for Genome Resources (http://www.ncgr.org/research/sequence/schema.html).

The object-oriented database structure has been useful in the development of biological databases. The objects, such as genetic maps, genes, or proteins, each have an associated set of utilities for analysis and display of the object, and a set of attributes such as an identifying name or references. In developing the database, relationships among these objects are identified. To standardize some commonly arising objects in biological databases, e.g., maps, the Object Management Group (http://www.omg.org) has formed a Life Science Research Group, a consortium of commercial companies, academic institutions, and software vendors that is trying to establish standards for displaying biological information from Bioinformatics and genomics analysis (http://www.omg.org/homepages/lsr). The Common Object Request Broker Architecture (CORBA) is the Object Management Group's interface for objects that allows different computer applications to communicate with each other through a common language called the Interface Definition Language (IDL). To plan an object-oriented database by defining the classes of objects and the relationships among them, a specific set of procedures called the Unified Modeling Language (UML) has been devised by the OMG.

DNA sequence analysis software packages often include sequence databases that are updated regularly. The organizations that manage sequence databases also provide public access through the Internet. Using a browser such as Netscape Navigator or Internet Explorer on a personal computer, these sites may be visited through the Internet and a form filled in with the sequence name. Once the correct sequence has been identified, the sequence is delivered to the browser and may be saved as a local computer file, cut and pasted from the browser window to another window of an analysis program or editor, or even pasted into another browser page for analysis on another web site. A useful feature of browser programs for sequence analysis is the ability to have more than one browser window running at a time: one browser window may retrieve sequences from a database while another analyzes them. At the time of retrieving the sequence, several sequence formats may be available. The FASTA format, which is readily converted into other formats and is also smaller and simpler, containing just a line of sequence identifiers followed by the sequence without numbers, is very useful for this purpose (the sketch below shows the same retrieval done directly over HTTP).
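The same retrieval a browser form performs can be scripted over HTTP. The sketch below uses NCBI's E-utilities URL scheme to save the FASTA version of accession D50617 as a local file; the exact URL parameters are an assumption that should be checked against current NCBI documentation.

    from urllib.request import urlopen

    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=nucleotide&id=D50617&rettype=fasta&retmode=text")

    with urlopen(url) as response, open("D50617.fasta", "wb") as local_file:
        local_file.write(response.read())    # saved as a local computer file

    print(open("D50617.fasta").readline())   # the ">" comment line of the record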
Using the Database Access Program Entrez
One straightforward way to access the sequence databases is through ENTREZ, a resource prepared by the staff of the National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, and available through their web site at http://ncbi.nlm.nih.gov/Entrez. ENTREZ provides a series of forms that can be filled in to retrieve a DNA or protein sequence, or a Medline reference related to the molecular biology sequence databases. After a search for either a protein or a DNA sequence is chosen at the above address, another web page provides a form to be filled in for the search. On the ENTREZ form, make a selection in the data entry window after the term "Search", then enter search terms in the longer data entry window after "for". The database will be searched for sequence database entries that contain all of these terms or related ones. Using Boolean logic, the search looks for database entries that include the first term AND the second term, and so on for each subsequent term through the last.

The "Limits" link on the ENTREZ form page is used to limit the GenBank field to be searched, and various logical combinations of search terms may be designed by this method. These fields refer to the GenBank fields. When searching for terms in a particular field, some knowledge of the terms that are in the database can be helpful. To assist in finding suitable terms, ENTREZ provides a list of index entries for each field. For a protein search, for example, current choices for fields include accession (number), all fields, author name, E.C. number, issue, journal name, keyword, modification date, organism, page number, primary accession (number), properties, protein name, publication date (of reference), SeqID string, sequence length, substance name, text word, title word, volume, and sequence ID. Similar fields are available for the DNA database search. Later, the results of searches in separate fields may be combined to narrow down the choices.

The number of terms to be searched for and the fields to be searched are the main decisions to be made. In doing so, keep in mind that it is important to be as specific as possible, or else there may be a great many matches. Thus, knowing the accession number, protein name, or name of the gene should be enough to find the required entry quickly. If the same protein has been sequenced in several organisms, providing an organism name is also helpful. When the chosen search terms and fields have been decided and submitted, a database comprising all of the currently available sequences (called the nonredundant, or NR, database) will be searched. Other database selections may also be made. The program returns the number of matches found and provides an opportunity to narrow this list by including more terms. When the number of matching sequences has been narrowed to a reasonable number, the sequences may be retrieved in a chosen format in several straightforward steps. It is important to look through the sequences to locate the one intended. There may be several different copies of the sequence because it may have been sequenced from more than one organism, or the sequence may be a mutant sequence, a particular clone, or a fragment. There is no simple way to find the correct sequence without manually checking the information provided in each sequence, but this usually takes only a short time.
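Assuming Biopython is installed, the same fielded, Boolean searches can be scripted through Bio.Entrez. The query below combines a protein-name field with an organism field, mirroring the form-based search described above; the e-mail address is a placeholder, and the field syntax is an assumption to verify against NCBI's help pages.

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # placeholder contact address

    # Two fielded terms joined with AND, as in the ENTREZ "Limits" form
    handle = Entrez.esearch(
        db="protein",
        term="cystic fibrosis[Protein Name] AND human[Organism]",
        retmax=10)
    result = Entrez.read(handle)
    handle.close()

    print(result["Count"], "matches; first IDs:", result["IdList"])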
Before leaving ENTREZ, it is often useful to check for sequence database entries that are similar to the one of interest, called "neighbors" by ENTREZ. This expanded query searches for other database entries of interest, such as the same protein in another organism, a large chromosomal sequence that includes the gene, or members of the same gene family. While visiting the site, note that ENTREZ has been adapted to search through a number of other biological databases, and also through Medline; these searches are available from the initial ENTREZ web page.

Retrieving a Specific Sequence
Even when strictly following the above instructions, it may be difficult to retrieve the sequence of a specific gene or protein, simply because of the sheer number of sequences in the GenBank database and the complex problem of indexing them. For projects that require the most currently available sequences, the NR databases should be searched. Other projects may benefit from the availability of better curated and annotated protein sequence databases, including PIR and SwissProt. The genomic databases can also provide the sequence of a particular gene or protein. Protein sequences in the GenPept database are generated by automatic translation of DNA sequences. When read from cDNA copies of mRNA sequences, they provide a reliable sequence, given a certain amount of uncertainty as to the translational start site. Many protein sequences are now predicted by translation of genomic sequences, which requires a prediction of exons, a somewhat error-prone step described later in this material. The origin of a protein sequence entry thus needs to be determined, and if it is not from a cDNA sequence, it may be necessary to obtain and sequence a cDNA copy of the gene.

3.13 Let us sum up:
One major difficulty encountered in running sequence analysis software is the use of differing sequence formats by different programs. These formats are all standard ASCII files, but they may differ in the presence of certain characters and words that indicate where different types of information and the sequence itself are to be found. The various biological data formats were discussed in this lesson.

3.14 Lesson end activities:
i. Visit the NCBI web site and write details of the data formats available there.
ii. Give details of the data formats available at SwissProt.
iii. Mention a few data formats available for multiple sequences.

3.15 Check your progress: Model answers
1. Your answer must include these points: The first line includes an initial ">" character followed by a two-letter code, such as P for complete sequence or F for fragment, followed by a 1 or 2 to indicate the type of sequence, then a semicolon, then a four- to six-character unique name for the entry.

3.16 Points for Discussion
1. Elucidate the highlighting features of the FASTA format.
2. How do you rate the GCG format when compared to the other formats? Elaborate your discussion.

3.17 References:
1. Blattner F.R., Plunkett III G., Bloch C.A., Perna N.T., Burland V., Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F., Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J., Mau B., and Shao Y. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453–1474.
2. Bowie J.U., Luthy R., and Eisenberg D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
3. C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012–2018.
4. Cherry J.M. and Cartinhour S.W. 1993. ACEDB, a tool for biological information. In Automated DNA sequencing and analysis (ed. M. Adams et al.). Academic Press, New York.
5. Cherry J.M., Ball C., Weng S., Juvik G., Schmidt R., Adler C., Dunn B., Dwight S., Riles L., Mortimer R.K., and Botstein D. 1997. Genetic and physical maps of Saccharomyces cerevisiae. Nature (suppl. 6632) 387: 67–73.
6. Chothia C. 1992. Proteins. One thousand families for the molecular biologist. Nature 357: 543–544.
7. Chou P.Y. and Fasman G.D. 1978. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 47: 45–147.

LESSON – 4: APPLICATIONS OF BIOINFORMATICS

4.0 Aims and Objectives
4.1 Applications of Bioinformatics in Various Fields
4.1.1 Sequence analysis
4.1.2 Genome annotation
4.1.3 Computational evolutionary biology
4.1.4 Measuring biodiversity
4.1.5 Analysis of gene expression
4.1.6 Analysis of regulation
4.1.7 Analysis of protein expression
4.1.8 Analysis of mutations in cancer
4.1.9 Prediction of protein structure
4.2 Let us sum up
4.3 Lesson end activities
4.4 Check your progress
4.5 Points for Discussion
4.6 References

4.0 Aims and Objectives:
This lesson describes the applications of Bioinformatics in various fields: genome annotation, measuring biodiversity, analysis of gene expression, and prediction of protein structure.

4.1 Applications of Bioinformatics in Various Fields

4.1.1 Sequence analysis
Since the phage ΦX174 was sequenced in 1977, the DNA sequences of hundreds of organisms have been decoded and stored in databases. The information is analyzed to determine the genes that encode polypeptides, as well as regulatory sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it is quite impractical to analyze DNA sequences manually. Today, computer programs are used to search the genomes of thousands of organisms, containing billions of nucleotides. These programs compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, in order to identify sequences that are related but not identical. A variant of this sequence alignment is used in the sequencing process itself. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600-800 nucleotides long). The ends of these fragments overlap and, when aligned in the right way, make up the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. In the case of the Human Genome Project, it took several months of CPU time (on a circa-2000 vintage DEC Alpha computer) to assemble the fragments.
Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of Bioinformatics research.

Another aspect of Bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. Not all of the nucleotides within a genome are genes. Within the genomes of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequences for protein identification.

4.1.2 Genome annotation

In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNAs, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving.

4.1.3 Computational Evolutionary Biology

Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists in several key ways. It has enabled researchers to:

* trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone;
* more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, lateral gene transfer, and the prediction of factors important in bacterial speciation;
* build complex computational models of populations to predict the outcome of the system over time;
* track and share information on an increasingly large number of species and organisms.

Future work endeavours to reconstruct the increasingly complex tree of life. The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are unrelated.

4.1.4 Measuring Biodiversity

Biodiversity of an ecosystem might be defined as the total genomic complement of a particular environment, from all of the species present, whether it is a biofilm in an abandoned mine, a drop of sea water, a scoop of soil, or the entire biosphere of the planet Earth. Databases are used to collect the species' names, descriptions, distributions, genetic information, status and size of populations, habitat needs, and how each organism interacts with other species. Specialized software programs are used to find, visualize, and analyze the information, and most importantly, to communicate it to other people. Computer simulations model such things as population dynamics, or calculate the cumulative genetic health of a breeding pool (in agriculture) or endangered population (in conservation).
One very exciting potential of this field is that entire DNA sequences, or genomes, of endangered species can be preserved, allowing the results of Nature's genetic experiment to be recorded in silico, and possibly reused in the future, even if that species is eventually lost.

4.1.5 Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple techniques, including microarrays, expressed sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), and various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.

4.1.6 Analysis of regulation

Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-celled organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.

4.1.7 Analysis of protein expression

Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, while the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple but incomplete peptides from each protein are detected.

4.1.8 Analysis of mutations in cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms.
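As a deliberately simplified illustration of this comparison step, the following Python sketch scans a tumor-derived sequence against a reference and reports point mutations that are not known germline polymorphisms. The sequences and the SNP position are invented for illustration; real pipelines work on aligned sequencing reads with quality scores and far richer variant annotations.

# Naive point-mutation comparison: report positions where the tumor
# sequence differs from the reference, unless the observed base is a
# known inherited (germline) variant at that position.

def somatic_point_mutations(reference, tumor, germline_snps):
    """Return (position, ref_base, tumor_base) for candidate somatic SNVs.

    germline_snps maps position -> set of bases known as inherited variants.
    Assumes the two sequences are already aligned and of equal length.
    """
    calls = []
    for pos, (r, t) in enumerate(zip(reference, tumor)):
        if r != t and t not in germline_snps.get(pos, set()):
            calls.append((pos, r, t))
    return calls

reference = "ACGTACGTAC"
tumor     = "ACGAACGTTC"
germline  = {3: {"A"}}   # the A at position 3 is a known inherited variant
print(somatic_point_mutations(reference, tumor, germline))
# -> [(8, 'A', 'T')]; the difference at position 3 is filtered as germline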
New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high throughput to measure thousands of samples, generate terabytes of data per experiment. Again, the massive amounts and new types of data generate new opportunities for Bioinformaticians. The data are often found to contain considerable variability, or noise, and thus hidden Markov model and change-point analysis methods are being developed to infer real copy number changes. Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent across many tumors.

4.1.9 Prediction of protein structure

Protein structure prediction is another important application of Bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the prion responsible for bovine spongiform encephalopathy, also known as Mad Cow Disease.) Knowledge of this structure is vital in understanding the function of the protein. For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem. As of now, most efforts have been directed towards heuristics that work most of the time.

One of the key ideas in Bioinformatics is the notion of homology. In the genomics branch of Bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of Bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably. One example of this is the homology between haemoglobin in humans and haemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near-identical purposes.

Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.

Check your progress:

1. Explain how Bioinformatics has assisted evolutionary biologists in their research.

Notes: g) Write your answer in the space given below. h) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Comparative Genomics

The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models. Many of these studies are based on homology detection and the computation of protein families.

Modeling biological systems

Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.

High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:

* high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology)
* morphometrics
* clinical image analysis and visualization
* determining the real-time air-flow patterns in the breathing lungs of living animals
* quantifying occlusion size in real-time imagery from the development of, and recovery during, arterial injury
* making behavioral observations from extended video recordings of laboratory animals
* infra-red measurements for metabolic activity determination

Protein-Protein docking

In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR).
One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions based only on these 3D shapes, without doing protein-protein interaction experiments. A variety of methods have been developed to tackle the protein-protein docking problem, though it seems that there is still much room for further work in this field.

4.2 Let us sum up

Bioinformatics is the use of IT in biotechnology for data storage, data warehousing and the analysis of DNA sequences. Bioinformatics requires knowledge of many branches of science: biology, mathematics, computer science, the laws of physics and chemistry, and of course a sound knowledge of IT to analyze biotech data. Bioinformatics is not limited to computing data; in reality it can be used to solve many biological problems and to find out how living things work.

4.3 Lesson end activities

1. Mention the other fields where Bioinformatics is used, other than those mentioned in this lesson.

4.4 Check your progress: Model answers

1. Your answer may include these points:
* trace the evolution of a large number of organisms
* compare entire genomes
* build complex computational models of populations to predict the outcome of the system over time
* track and share information on an increasingly large number of species and organisms.

4.5 Points for Discussion

1. Bioinformatics will be ruling all the sciences in the near future - Comment on this statement.
2. Do you agree that Bioinformatics will be able to provide a wealth of information to researchers in all the different areas mentioned in this lesson?

4.6 References

1. Dayhoff M.O., Ed. 1972. Atlas of protein sequence and structure, vol. 5. National Biomedical Research Foundation, Georgetown University, Washington, D.C. ———. 1978. Survey of new data and computer methods of analysis. In Atlas of protein sequence and structure, vol. 5, suppl. 2. National Biomedical Research Foundation, Georgetown University, Washington, D.C.
2. Doolittle R.F., Hunkapiller M.W., Hood L.E., Devare S.G., Robbins K.C., Aaronson S.A., and Antoniades H.N. 1983. Simian sarcoma onc gene v-sis is derived from the gene (or genes) encoding a platelet-derived growth factor. Science 221: 275–277.
3. Eddy S.R., Mitchison G., and Durbin R. 1995. Maximum discrimination hidden Markov models of sequence consensus. J. Comput. Biol. 2: 9–23.
4. Ewing B. and Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186–194.
5. Ewing B., Hillier L., Wendl M.C., and Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175–185.
6. Felsenstein J. 1988. Phylogenies from molecular sequences: Inferences and reliability. Annu. Rev. Genet. 22: 521–565.
7. Fitch W.M. and Margoliash E. 1967. Construction of phylogenetic trees. Science 155: 279–284.
8. Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F., Kerlavage A.R., Bult C.J., Tomb J.F., Dougherty B.A., Merrick J.M., et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
9. Garnier J., Osguthorpe D.J., and Robson B. 1978.
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97–120.

LESSON – 5 STRUCTURE OF PROTEINS

5.0 Aims and Objectives
5.1 Single-letter code of amino acids
5.2 Functions in proteins
5.3 Non-protein functions
5.4 General structure
5.5 Peptide bond formation
5.6 Hydrophilic and hydrophobic amino acids
5.7 Entrez and SRS
5.8 Let us Sum up
5.9 Lesson end activities
5.10 Check your progress
5.11 Points for Discussion
5.12 References

5.0 Aims and Objectives:

This unit describes the single-letter codes of amino acids, the functions of proteins, the general structure of amino acids, isomerism, and peptide bond formation.

5.1 Single-letter Code of Amino Acids

Alpha-amino acids are the building blocks of proteins. A protein forms via the condensation of amino acids to form a chain of amino acid "residues" linked by peptide bonds. Proteins are defined by their unique sequence of amino acid residues; this sequence is the primary structure of the protein. Just as the letters of the alphabet can be combined to form an almost endless variety of words, amino acids can be linked in varying sequences to form a huge variety of proteins.

Twenty standard amino acids are used by cells in protein biosynthesis, and these are specified by the genetic code. These twenty amino acids are biosynthesized from other molecules, but organisms differ in which ones they can synthesize and which ones must be provided in their diet. The ones that cannot be synthesized by an organism are called essential amino acids.

5.2 Functions in proteins

A polypeptide is a chain of amino acids. Amino acids are the basic structural building units of proteins. They form short polymer chains called peptides, or longer chains called either polypeptides or proteins. The process of such formation from an mRNA template is known as translation, which is part of protein biosynthesis. Twenty amino acids are encoded by the standard genetic code and are called proteinogenic or standard amino acids. Other amino acids contained in proteins are usually formed by posttranslational modification, which is modification after translation in protein synthesis. These modifications are often essential for the function or regulation of a protein; for example, the carboxylation of glutamate allows for better binding of calcium cations, and the hydroxylation of proline is critical for maintaining connective tissues and responding to oxygen starvation. Such modifications can also determine the localization of the protein, e.g., the addition of long hydrophobic groups can cause a protein to bind to a phospholipid membrane.

5.3 Non-protein functions

The twenty standard amino acids are either used to synthesize proteins and other biomolecules, or oxidized to urea and carbon dioxide as a source of energy. The oxidation pathway starts with the removal of the amino group by a transaminase; the amino group is then fed into the urea cycle. The other product of transamination is a keto acid that enters the citric acid cycle. Glucogenic amino acids can also be converted into glucose through gluconeogenesis.

Hundreds of types of non-protein amino acids have been found in nature and they have multiple functions in living organisms. Microorganisms and plants can produce uncommon amino acids.
In microbes, examples include 2-aminoisobutyric acid and lanthionine, which is a sulfide-bridged alanine dimer; both of these amino acids are found in peptidic lantibiotics such as alamethicin. In plants, 1-aminocyclopropane-1-carboxylic acid is a small disubstituted cyclic amino acid that is a key intermediate in the production of the plant hormone ethylene.

In humans, non-protein amino acids also have biologically important roles. Glycine, gamma-aminobutyric acid and glutamate are neurotransmitters, and many amino acids are used to synthesize other molecules. For example:

* Tryptophan is a precursor of the neurotransmitter serotonin
* Glycine is a precursor of porphyrins such as heme
* Arginine is a precursor of nitric oxide
* Carnitine is used in lipid transport within a cell
* Ornithine and S-adenosylmethionine are precursors of polyamines
* Homocysteine is an intermediate in S-adenosylmethionine recycling

Also present are hydroxyproline, hydroxylysine, and sarcosine. The thyroid hormones are also alpha-amino acids. Some amino acids have even been detected in meteorites, especially in a type known as carbonaceous chondrites. This observation has prompted the suggestion that life may have arrived on earth from an extraterrestrial source.

5.4 General structure

Fig 5. The general structure of an α-amino acid, with the amino group on the left and the carboxyl group on the right.

In the structure shown in Fig 5, R represents a side chain specific to each amino acid. The central carbon atom, called Cα, is a chiral carbon atom (with the exception of glycine) to which the two termini and the R-group are attached. Amino acids are usually classified by the properties of the side chain into four groups: the side chain can make an amino acid behave like a weak acid, a weak base, a hydrophile (if it is polar), or a hydrophobe (if it is nonpolar). The chemical structures of the 20 standard amino acids, along with their chemical properties, are catalogued in the list of standard amino acids. The phrase "branched-chain amino acids" or BCAA is sometimes used to refer to the amino acids having aliphatic side chains that are non-linear; these are leucine, isoleucine and valine. Proline is the only proteinogenic amino acid whose side group links to the α-amino group, and thus is also the only proteinogenic amino acid containing a secondary amine at this position. Proline has sometimes been termed an imino acid, but this is not correct in the current nomenclature.

Isomerism

Most amino acids can exist in either of two optical isomers, called D and L. The L-amino acids represent the vast majority of amino acids found in proteins. D-amino acids are found in some proteins produced by exotic sea-dwelling organisms, such as cone snails. They are also abundant components of the peptidoglycan cell walls of bacteria.

The L and D conventions for amino acid configuration refer not to the optical activity of the amino acid itself, but rather to the optical activity of the isomer of glyceraldehyde having the same stereochemistry as the amino acid. S-glyceraldehyde is levorotatory, and R-glyceraldehyde is dextrorotatory, and so S-amino acids are called L even if they are not levorotatory, and R-amino acids are likewise called D even if they are not dextrorotatory. There are two exceptions to these general rules of amino acid isomerism.
First, in glycine, where R = H, no isomerism is possible because the alpha-carbon bears two identical groups (hydrogen). Second, in cysteine, the L = S and D = R assignment is reversed to L = R and D = S. Cysteine is structured similarly (with respect to glyceraldehyde) to the other amino acids, but the sulfur atom alters the interpretation of the Cahn-Ingold-Prelog priority rule.

Reactions

As amino acids have both a primary amine group and a primary carboxyl group, these chemicals can undergo most of the reactions associated with these functional groups. These include nucleophilic addition, amide bond formation and imine formation for the amine group, and esterification, amide bond formation and decarboxylation for the carboxylic acid group. The multiple side chains of amino acids can also undergo chemical reactions. The types of these reactions are determined by the groups on these side chains and are discussed in the articles dealing with each specific type of amino acid.

5.5 Peptide bond formation

As both the amine and carboxylic acid groups of amino acids can react to form amide bonds, one amino acid molecule can react with another and become joined through an amide linkage. This polymerization of amino acids is what creates proteins. This condensation reaction yields the newly formed peptide bond and a molecule of water. In cells, this reaction does not occur directly; instead the amino acid is first activated by attachment to a transfer RNA molecule through an ester bond. This aminoacyl-tRNA is produced in an ATP-dependent reaction carried out by an aminoacyl-tRNA synthetase. The aminoacyl-tRNA is then a substrate for the ribosome, which catalyzes the attack of the amino group of the elongating protein chain on the ester bond. As a result of this mechanism, all proteins are synthesized starting at their N-terminus and moving towards their C-terminus.

However, not all peptide bonds are formed in this way. In a few cases, peptides are synthesized by specific enzymes. For example, the tripeptide glutathione is an essential part of the defenses of cells against oxidative stress. This peptide is synthesized in two steps from free amino acids. In the first step, gamma-glutamylcysteine synthetase condenses cysteine and glutamic acid through a peptide bond formed between the side-chain carboxyl of the glutamate (the gamma carbon of this side chain) and the amino group of the cysteine. This dipeptide is then condensed with glycine by glutathione synthetase to form glutathione.

In chemistry, peptides are synthesized by a variety of reactions. One of the most used is solid-phase peptide synthesis, which uses the aromatic oxime derivatives of amino acids as activated units. These are added in sequence onto the growing peptide chain, which is attached to a solid resin support.

As amino acids have both the active groups of an amine and a carboxylic acid, they can be considered both acid and base (though their behaviour in solution is influenced by the R group). At a certain pH, known as the isoelectric point, the amine group gains a positive charge (is protonated) and the acid group a negative charge (is deprotonated). The exact value is specific to each different amino acid. This ion is known as a zwitterion, which comes from the German word Zwitter, meaning "hybrid".
A zwitterion can be extracted from the solution as a white crystalline structure with a very high melting point, due to its dipolar nature. Near-neutral physiological pH allows most free amino acids to exist as zwitterions.

5.6 Hydrophilic and hydrophobic amino acids

Depending on the polarity of the side chain, amino acids vary in their hydrophilic or hydrophobic character. These properties are important in protein structure and protein-protein interactions. The importance of the physical properties of the side chains comes from the influence they have on the amino acid residues' interactions with other structures, both within a single protein and between proteins. The distribution of hydrophilic and hydrophobic amino acids determines the tertiary structure of the protein, and their physical location on the outside structure of the protein influences its quaternary structure. For example, soluble proteins have surfaces rich with polar amino acids like serine and threonine, while integral membrane proteins tend to have an outer ring of hydrophobic amino acids that anchors them into the lipid bilayer, and proteins anchored to the membrane have a hydrophobic end that locks into the membrane. Similarly, proteins that have to bind to positively charged molecules have surfaces rich with negatively charged amino acids like glutamate and aspartate, while proteins binding to negatively charged molecules have surfaces rich with positively charged chains like lysine and arginine. Recently a new scale of hydrophobicity based on the free energy of hydrophobic association has been proposed. Hydrophilic and hydrophobic interactions of the proteins do not have to rely only on the side chains of the amino acids themselves. By various posttranslational modifications, other chains can be attached to the proteins, forming hydrophobic lipoproteins or hydrophilic glycoproteins.

The standard amino acids, with their one-letter and three-letter codes, average molar mass (g/mol), isoelectric point (pI), and pKa values of the α-carboxyl (pK1) and α-amino (pK2) groups, are listed below:

Amino acid       Code  Abbr.  Mass       pI     pK1    pK2
Alanine          A     Ala    89.09404   6.01   2.35   9.87
Cysteine         C     Cys    121.15404  5.05   1.92   10.70
Aspartic acid    D     Asp    133.10384  2.85   1.99   9.90
Glutamic acid    E     Glu    147.13074  3.15   2.10   9.47
Phenylalanine    F     Phe    165.19184  5.49   2.20   9.31
Glycine          G     Gly    75.06714   6.06   2.35   9.78
Histidine        H     His    155.15634  7.60   1.80   9.33
Isoleucine       I     Ile    131.17464  6.05   2.32   9.76
Lysine           K     Lys    146.18934  9.60   2.16   9.06
Leucine          L     Leu    131.17464  6.01   2.33   9.74
Methionine       M     Met    149.20784  5.74   2.13   9.28
Asparagine       N     Asn    132.11904  5.41   2.14   8.72
Proline          P     Pro    115.13194  6.30   1.95   10.64
Glutamine        Q     Gln    146.14594  5.65   2.17   9.13
Arginine         R     Arg    174.20274  10.76  1.82   8.99
Serine           S     Ser    105.09344  5.68   2.19   9.21
Threonine        T     Thr    119.12034  5.60   2.09   9.10
Selenocysteine   U     Sec    169.06
Valine           V     Val    117.14784  6.00   2.39   9.74
Tryptophan       W     Trp    204.22844  5.89   2.46   9.41
Tyrosine         Y     Tyr    181.19124  5.64   2.20   9.21

5.7 Entrez and SRS

The Entrez Global Query Cross-Database Search System is a powerful federated search engine, or web portal, that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. NCBI is part of the National Library of Medicine (NLM), itself a department of the National Institutes of Health (NIH) of the United States government. Entrez also happens to be the French word for the second-person plural form of the verb "to enter", meaning literally "come in".
Entrez Global Query is an integrated search and retrieval system that provides access to all of its databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system.

The Entrez front page provides, by default, access to the global query. All databases indexed by Entrez can be searched via a single query string, supporting boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page that shows the number of hits for the search in each of the databases, which are also links to the actual search results for that particular database. Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search via a web-forms interface. The History feature gives a numbered list of recently performed queries. Results of previous queries can be referred to by number and combined via boolean operators. Search results can be saved temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely and can also choose to have updates with new search results e-mailed for saved queries of most databases. Entrez is widely used in the field of biotechnology to enhance the knowledge of students worldwide.

Entrez searches the following databases:

* PubMed: biomedical literature citations and abstracts, including Medline (articles from mainly medical journals, often including abstracts; links to PubMed Central and other full-text resources are provided for articles from the 1990s)
* PubMed Central: free, full-text journal articles
* Site Search: NCBI web and FTP sites
* Books: online books
* OMIM: Online Mendelian Inheritance in Man
* OMIA: Online Mendelian Inheritance in Animals
* Nucleotide: sequence database (GenBank)
* Protein: sequence database
* Genome: whole genome sequences and mapping
* Structure: three-dimensional macromolecular structures
* Taxonomy: organisms in GenBank Taxonomy
* SNP: Single Nucleotide Polymorphism
* Gene: gene-centered information
* HomoloGene: eukaryotic homology groups
* PubChem Compound: unique small molecule chemical structures
* PubChem Substance: deposited chemical substance records
* Genome Project: genome project information
* UniGene: gene-oriented clusters of transcript sequences
* CDD: conserved protein domain database
* 3D Domains: domains from Entrez Structure
* UniSTS: markers and mapping data
* PopSet: population study data sets (epidemiology)
* GEO Profiles: expression and molecular abundance profiles
* GEO DataSets: experimental sets of GEO data
* Cancer Chromosomes: cytogenetic databases
* PubChem BioAssay: bioactivity screens of chemical substances
* GENSAT: gene expression atlas of mouse central nervous system
* Probe: sequence-specific reagents
* NLM Catalog: NLM bibliographic data for over 1.2 million journals, books, audiovisuals, computer software, electronic resources, and other materials resident in LocatorPlus (updated every weekday)

Check your progress:

1. What are the single-letter codes for the following amino acids?
Tryptophan, leucine, tyrosine, glutamine, asparagine

Notes: i) Write your answer in the space given below. j) Check your answer with the one given at the end of this lesson.

……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Accessing Entrez and SRS

In addition to using the search engine forms to query the data in Entrez, NCBI provides the Entrez Programming Utilities (eUtils) for more direct access to query results. The eUtils are accessed by posting specially formed URLs to the NCBI server and parsing the XML response, as sketched at the end of this lesson. There is also an eUtils SOAP interface.

5.8 Let us Sum up

Proteins are an important class of biological macromolecules present in all biological organisms, made up of such elements as carbon, hydrogen, nitrogen, oxygen, and sulfur. All proteins are polymers of amino acids. The polymers, also known as polypeptides, consist of a sequence of 20 different L-α-amino acids, also referred to as residues. For chains under 40 residues, the term peptide is frequently used instead of protein. To be able to perform their biological function, proteins fold into one or more specific spatial conformations, driven by a number of noncovalent interactions such as hydrogen bonding, ionic interactions, van der Waals forces and hydrophobic packing. In order to understand the functions of proteins at a molecular level, it is often necessary to determine their three-dimensional structure. This is the topic of the scientific field of structural biology, which employs techniques such as X-ray crystallography and NMR spectroscopy to determine the structure of proteins.

5.9 Lesson end activities

1. Biochemistry refers to four distinct aspects of a protein's structure. Find out those different structures.
2. Find out the nutritional importance of various amino acids.

5.10 Check your progress: Model answers

1. Your answer must include these:
Tryptophan - W
Leucine - L
Tyrosine - Y
Glutamine - Q
Asparagine - N

5.11 Points for Discussion

1. Elaborate on the features of NCBI.
2. Make a comparative study of hydrophilic and hydrophobic amino acids.

5.12 References

1. Gibbs A.J. and McIntyre G.A. 1970. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16: 1–11.
2. Gibrat J.F., Madej T., and Bryant S.H. 1996. Surprising similarity in structure comparison. Curr. Opin. Struct. Biol. 6: 377–385.
3. Gribskov M., McLachlan A.D., and Eisenberg D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355–4358.
4. Henikoff S. and Henikoff J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915–10919.
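As an illustration of the eUtils access pattern described in section 5.7 (posting a specially formed URL and parsing the XML reply), the following minimal Python sketch queries the esearch utility. The query term is an arbitrary example; for real use, NCBI asks clients to identify themselves (via email/tool parameters) and to respect its rate limits, and the script naturally requires network access.

# Minimal eUtils example: search the protein database and list a few
# matching record identifiers from the XML response.

from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = urlencode({"db": "protein",
                    "term": "hemoglobin AND human[orgn]",
                    "retmax": 5})

with urlopen(f"{base}?{params}") as response:
    xml_reply = response.read()

root = ET.fromstring(xml_reply)
count = root.findtext("Count")             # total number of matches
ids = [e.text for e in root.iter("Id")]    # first few record identifiers
print(count, ids)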
UNIT II

LESSON - 6 SEQUENCE ALIGNMENT

6.0 Aims and Objectives
6.1 Introduction
6.2 Definition of sequence alignment
6.2.1 Global alignment
6.2.2 Local alignment
6.3 Significance of sequence alignment
6.4 Let us Sum up
6.5 Lesson end activities
6.6 Check your progress
6.7 Points for Discussion
6.8 References

6.0 Aims and Objectives

This unit discusses the definition of sequence alignment, global alignment, local alignment, the significance of sequence alignment, an overview of methods of sequence alignment, alignment of pairs of sequences, and multiple sequence alignment.

6.1 Introduction to Sequence Alignment

In 1970, A.J. Gibbs and G.A. McIntyre (1970) described a new method for comparing two amino acid or nucleotide sequences in which a graph was drawn with one sequence written across the page and the other down the left-hand side. Whenever the same letter appeared in both sequences, a dot was placed at the intersection of the corresponding sequence positions on the graph. The resulting graph was then scanned for a series of dots that formed a diagonal, which revealed similarity, or a string of the same characters, between the sequences. Long sequences can also be compared in this manner on a single page by using smaller dots. A minimal illustrative sketch of this method is given below, after the definition of sequence alignment.

The dot matrix method quite readily reveals the presence of insertions or deletions between sequences, because they shift the diagonal horizontally or vertically by the amount of change. Comparing a single sequence to itself can reveal the presence of a repeat of the same sequence in the same (direct repeat) or reverse (inverted repeat or palindrome) orientation. This method of self-comparison can reveal several features, such as similarity between chromosomes, tandem genes, repeated domains in a protein sequence, regions of low sequence complexity where the same characters are often repeated, or self-complementary sequences in RNA that can potentially base-pair to give a double-stranded structure. Because diagonals may not always be apparent on the graph due to weak similarity, Gibbs and McIntyre counted all possible diagonals and compared these counts to those of random sequences to identify the most significant alignments. Maizel and Lenk (1981) later developed various filtering and color display schemes that greatly increased the usefulness of the dot matrix method. This dot matrix representation of sequence comparisons continues to play an important role in the analysis of DNA and protein sequence similarity, as well as of repeats in genes and very long chromosomal sequences.

6.2 Definition of Sequence Alignment

Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and nonidentical characters can either be placed in the same column as a mismatch or opposite a gap in the other sequence. In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register. Sequences that can be readily aligned in this manner are said to be similar.
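As a minimal illustration of the dot matrix method described in section 6.1, the following Python sketch prints an asterisk wherever two sequences share the same character; diagonals of asterisks indicate similar regions. The two test sequences are invented, and real dot matrix programs add window and stringency filtering to suppress background noise.

# A bare-bones dot matrix comparison in the spirit of Gibbs and McIntyre:
# one sequence runs across the page, the other down the left-hand side.

def dot_matrix(seq1, seq2):
    print("  " + " ".join(seq1))
    for ch2 in seq2:
        row = ["*" if ch1 == ch2 else "." for ch1 in seq1]
        print(ch2 + " " + " ".join(row))

dot_matrix("GATTACA", "GATGACA")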
There are two types of sequence alignment, namely global and local. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.

6.2.1 Global Alignment

For two hypothetical protein sequence fragments, a global alignment is stretched over the entire sequence length to include as many matching amino acids as possible, up to and including the sequence ends. Vertical bars between the sequences indicate the presence of identical amino acids. Although there is an obvious region of identity in this example (the sequence GKG preceded by a commonly observed substitution of T for A), a global alignment may not align such regions so that more amino acids along the entire sequence lengths can be matched.

6.2.2 Local Alignment

In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a much higher priority is given to finding these local regions than to extending the alignment to include more neighbouring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favours finding conserved nucleotide patterns, DNA sequences, or amino acid patterns in protein sequences.

Check your progress:

1. Mention the algorithms used for global alignment and local alignment.

Notes: k) Write your answer in the space given below. l) Check your answer with the one given at the end of this lesson.

……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

6.3 SIGNIFICANCE OF SEQUENCE ALIGNMENT

Sequence alignment is useful for discovering functional, structural, and evolutionary information in biological sequences. It is important to obtain the best possible, or so-called "optimal", alignment to discover this information. Sequences that are very much alike, or "similar" in the parlance of sequence analysis, probably have the same function, be it a regulatory role in the case of similar DNA molecules, or a similar biochemical function and three-dimensional structure in the case of proteins. Additionally, if two sequences from different organisms are similar, there may have been a common ancestor sequence, and the sequences are then defined as being homologous. The alignment indicates the changes that could have occurred between the two homologous sequences and a common ancestor sequence during evolution. With the advent of genome analysis and large-scale sequence comparisons, it becomes important to recognize that sequence similarity may be an indicator of several possible types of ancestor relationships, or there may be no ancestor relationship at all.
For example, new gene evolution is often thought to occur by gene duplication, creating two tandem copies of the gene, followed by mutations in these copies. In rare cases, new mutations in one of the copies provide an advantageous change in function. The two copies may then evolve along separate pathways. Although the resulting separation of function will generate two related sequence families, sequences among both families will still be similar due to the single gene ancestor. In addition, genetic rearrangements can reassort domains in proteins, leading to more complex proteins with an evolutionary history that is difficult to reconstruct (Henikoff et al. 1997).

Evolutionary theory provides terms that may be used to describe sequence relationships. Homologous genes that share a common ancestry and function in the absence of any evidence of gene duplication are called orthologs. When there is evidence for gene duplication, the genes in an evolutionary lineage derived from one of the copies and with the same function are also referred to as orthologs. The two copies of the duplicated gene and their progeny in the evolutionary lineage are referred to as paralogs. In other cases, similar regions in sequences may not have a common ancestor but may have arisen independently by two evolutionary pathways converging on the same function, called convergent evolution. There are some remarkable examples in protein structures. For instance, although the enzymes chymotrypsin and subtilisin have totally different three-dimensional structures and folds, the active sites show similar structural features, including histidine (H), serine (S), and aspartic acid (D) in the catalytic sites of the enzymes (for discussion, see Branden and Tooze 1991). Additional examples are given. In such cases, the similarity will be highly localized. Such sequences are referred to as analogous (Fitch 1970).

A closer examination of alignments can help to sort out possible evolutionary origins among similar sequences (Tatusov et al. 1997). As pointed out by Fitch and Smith (1983), sequences can be either homologous or nonhomologous, but not in between. The genetic rearrangements referred to above can give rise to chimeric genes, in which some regions are homologous and others are not. Referring to the entire sequences as homologous in such situations leads to an inaccurate and incomplete description of the sequence lineage. Another complication in tracing the origins of similar sequences is that individual genes may not share the same evolutionary origin as the rest of the genome in which they presently reside. Genetic events such as symbioses and virus-induced transduction can cause horizontal transfer of genetic material between unrelated organisms. In such cases, the evolutionary history of the transferred sequences and that of the organisms will be different. Again, with the capability of detecting such events in the genomes of organisms comes the responsibility to describe these changes with the correct evolutionary terminology. In this case, the sequences are xenologous (Gray and Fitch 1983). Recently, Lawrence and Ochman (1997) have shown that horizontal transfer of genes between species is as common in enteric bacteria as mutation, if not more common. Describing such changes requires a careful description of sequence origins.
6.4 Let us sum up:

Many Bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as traces. When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed, they are said to be aligned. In a symbolic sequence, each base or residue monomer in each sequence is represented by a letter.

6.5 Lesson end activities

1. Find out the methodology for (i) Global Alignment (ii) Local Alignment.

6.6 Check your progress: Model answers

1. Your answer must include these points:
Global Alignment - Needleman-Wunsch algorithm
Local Alignment - Smith-Waterman algorithm

6.7 Points for Discussion

1. "Sequence alignment has made the task of the biological scientist easy" - Comment.
2. How do you rate the local and global alignments?

6.8 References

1. Altschul S.F. 1989. Gap costs for multiple sequence alignment. J. Theor. Biol. 138: 297–309.
2. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
3. Barton G.J. and Sternberg M.J. 1987. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol. 198: 327–337.
4. Boguski M.S. and Schuler G. 1995. ESTablishing a human transcript map. Nature Genet. 10: 369–371.

LESSON – 7 METHODS OF SEQUENCE ALIGNMENT

7.0 Aims and Objectives
7.1 Overview of methods of sequence alignment
7.1.1 Alignment of pairs of sequences
7.1.2 Multiple sequence alignment
7.2 Let us Sum up
7.3 Lesson end activities
7.4 Check your progress
7.5 Points for Discussion
7.6 References

7.0 Aims and Objectives

This unit discusses the methods of sequence alignment, alignment of pairs of sequences, and multiple sequence alignment.

7.1 Overview of Methods of Sequence Alignment

7.1.1 Alignment of Pairs of Sequences

Alignment of two sequences is performed using the following methods:

1. Dot matrix analysis
2. The dynamic programming (or DP) algorithm
3. Word or k-tuple methods, such as those used by the programs FASTA and BLAST, described below

Unless the sequences are known to be very much alike, the dot matrix method should be used first, as this method displays any possible sequence alignments as diagonals on the matrix. Dot matrix analysis can readily reveal the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods. The major limitation of the method is that most dot matrix computer programs do not show an actual alignment.

The dynamic programming method, first used for global alignment of sequences by Needleman and Wunsch (1970) and for local alignment by Smith and Waterman (1981a), provides one or more alignments of the sequences. An alignment is generated by starting at the ends of the two sequences and attempting to match all possible pairs of characters between the sequences, following a scoring scheme for matches, mismatches, and gaps. This procedure generates a matrix of numbers that represents all possible alignments between the sequences. The highest set of sequential scores in the matrix defines an optimal alignment.
For proteins, an amino acid substitution matrix, such as the Dayhoff percent accepted mutation matrix (PAM250) or the blocks substitution matrix (BLOSUM62), is used to score matches and mismatches. Similar matrices are available for aligning DNA sequences. The dynamic programming method is guaranteed in a mathematical sense to provide the optimal (very best or highest-scoring) alignment for a given set of user-defined variables, including the choice of scoring matrix and gap penalties. Fortunately, experience with the dynamic programming method has provided much help for making the best choices, and dynamic programming has become widely used.

The dynamic programming method can also be slow due to the very large number of computational steps, which increase approximately as the square or cube of the sequence lengths. The computer memory requirement also increases as the square of the sequence lengths. Thus, it is difficult to use the method for very long sequences. Fortunately, computer scientists have greatly reduced the time and space requirements to near-linear relationships without compromising the reliability of the dynamic programming method, and these improvements are widely used in the available dynamic programming applications to sequence alignment. Other shortcuts have been developed to speed up the early phases of finding an alignment.

The word or k-tuple methods are used by the FASTA and BLAST algorithms. They align two sequences very quickly, by first searching for identical short stretches of sequence (called words or k-tuples) and by then joining these words into an alignment by the dynamic programming method. These methods are fast enough to be suitable for searching an entire database for the sequences that align best with an input test sequence. The FASTA and BLAST methods are heuristic, i.e., they follow an empirical method of computer programming in which rules of thumb are used to find solutions and feedback is used to improve performance. However, these methods are reliable in a statistical sense, and usually provide a reliable alignment.

Dynamic Programming

The following is an example of global sequence alignment using the Needleman-Wunsch technique. For this example, the two sequences to be globally aligned are:

G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

So M = 11 and N = 7 (the lengths of sequence #1 and sequence #2, respectively). A simple scoring scheme is assumed where

· S(i,j) = 1 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise
· S(i,j) = 0 (mismatch score)
· w = 0 (gap penalty)

There are three steps in dynamic programming:

1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)

Initialization Step

The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows, where M and N correspond to the sizes of the sequences to be aligned. Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step

One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score M(i,j) for each position in the matrix.
In order to find M(i,j) for any i,j, it is necessary to know the scores for the matrix positions to the left of, above, and diagonal to i,j. In terms of matrix positions, one must know M(i-1,j), M(i,j-1) and M(i-1,j-1). For each position, M(i,j) is defined to be the maximum score at position i,j, i.e.

M(i,j) = MAXIMUM [
M(i-1,j-1) + S(i,j) (match/mismatch in the diagonal),
M(i,j-1) + w (gap in sequence #1),
M(i-1,j) + w (gap in sequence #2) ]

Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S(1,1) = 1, and by the assumptions stated at the beginning, w = 0. Thus, M(1,1) = MAX[M(0,0) + 1, M(1,0) + 0, M(0,1) + 0] = MAX[1, 0, 0] = 1. A value of 1 is then placed in position 1,1 of the scoring matrix.

Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1. Take the example of row 1. At column 2, the value is the maximum of 0 (for a mismatch), 0 (for a vertical gap) and 1 (for a horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8. At this point, there is a G in both sequences. Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) and 1 (for a horizontal gap). The value will again be 1. The rest of row 1 and column 1 can be filled with 1 using the above reasoning.

Now look at column 2. The location at row 2 will be assigned the value of the maximum of 1 (mismatch), 1 (horizontal gap) and 1 (vertical gap), so its value is 1. At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap) and 1 (vertical gap), so its value is 2. Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap) and 2 (vertical gap), so its value is 2. Note that for all of the remaining positions except the last one in column 2, the choices for the value will be exactly the same as in row 4, since there are no matches. The final row will contain the value 2, since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap). Using the same techniques as described for column 2, we can fill in column 3. After filling in all of the values, the score matrix is as follows:

        G   A   A   T   T   C   A   G   T   T   A
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   1   1   1   1   1   1   1   1   1   1   1
G   0   1   1   1   1   1   1   1   2   2   2   2
A   0   1   2   2   2   2   2   2   2   2   2   3
T   0   1   2   2   3   3   3   3   3   3   3   3
C   0   1   2   2   3   3   4   4   4   4   4   4
G   0   1   2   2   3   3   4   4   5   5   5   5
A   0   1   2   3   3   3   4   5   5   5   5   6

Check your progress:

1. List the 3 steps involved in dynamic programming.

Notes: m) Write your answer in the space given below. n) Check your answer with the one given at the end of this lesson.

……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Traceback Step

After the matrix fill step, the maximum alignment score for the two test sequences is 6. The traceback step determines the actual alignment(s) that result in the maximum score. Note that with a simple scoring algorithm such as the one used here, there are likely to be multiple maximal alignments. The traceback step begins at position M,N in the matrix, i.e. the position that leads to the maximal score. In this case, there is a 6 in that location.
Traceback Step
After the matrix fill step, the maximum alignment score for the two test sequences is 6. The traceback step determines the actual alignment(s) that result in the maximum score. Note that with a simple scoring scheme such as the one used here, there are likely to be multiple maximal alignments. The traceback step begins in position M,N of the matrix, i.e., the position holding the maximal score; in this case, that cell contains a 6. Traceback takes the current cell and looks to the neighbouring cells that could be its direct predecessors: the neighbour to the left (gap in sequence #2), the diagonal neighbour (match/mismatch), and the neighbour above it (gap in sequence #1). The algorithm chooses as the next cell in the sequence one of these possible predecessors. In this example, all three neighbours of the starting cell are equal to 5. Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal (match/mismatch) neighbour. If more than one possible predecessor exists, any can be chosen. This gives a current alignment of

(Seq #1) A
         |
(Seq #2) A

We then look at the new current cell and determine which cell is its direct predecessor. In this case, the predecessor lies to the left, which adds a gap to sequence #2, so the current alignment is

(Seq #1) TA
          |
(Seq #2) _A

Once again, the direct predecessor produces a gap in sequence #2. After this step, the current alignment is

(Seq #1) TTA
           |
(Seq #2) __A

Continuing with the traceback step, we eventually reach position 0,0 of the matrix, which tells us that traceback is complete. One possible maximum alignment is:

GAATTCAGTTA
| | || |  |
GGA_TC_G__A

An alternative solution is:

G_AATTCAGTTA
|  | || |  |
GG_A_TC_G__A

There are further alternative solutions, each resulting in a maximal global alignment score of 6. Because the number of such alignments can grow exponentially, most dynamic programming implementations print out only a single solution.

7.1.2 Multiple Sequence Alignment
From a multiple alignment of three or more protein sequences, the highly conserved residues that define structural and functional domains in protein families can be identified. New members of such families can then be found by searching sequence databases for other sequences with these same domains. Alignment of DNA sequences can assist in finding conserved regulatory patterns in DNA sequences. Despite the great value of multiple sequence alignments, obtaining one presents a very difficult algorithmic problem.

Introduction
For many genes, a database search will reveal a whole set of homologous sequences. One then wishes to learn about the evolution and the sequence conservation in such a group. This question surpasses what can reasonably be achieved by the pairwise sequence comparison methods described in the previous sections. Pairwise comparisons do not readily show positions that are conserved among a whole set of sequences, and tend to miss subtle similarities that become visible only when observed simultaneously across many sequences. Thus one wants to compare several sequences simultaneously. A multiple alignment arranges a set of sequences in a scheme in which positions believed to be homologous are written in a common column. As in a pairwise alignment, when a sequence does not possess an amino acid in a particular position, this is again denoted by a dash. As for pairwise alignments, there are conventions regarding the scoring of a multiple alignment.
In one approach, one simply adds the scores of all the induced pairwise alignments contained in a multiple alignment. For a linear gap penalty, this amounts to scoring each column of the alignment by the sum of the amino acid pair scores in that column. The corresponding score is called the sum of pairs (SP) score; a small sketch of this computation is given at the end of this subsection. Although it would be biologically meaningful, the distinction between global, local, and other forms of alignment is rarely made for multiple alignments. The reason for this will become apparent below, where we describe the computational difficulties in computing multiple alignments. Note that the full set of optimal pairwise alignments among a given set of sequences will generally overdetermine the multiple alignment. If one wishes to assemble a multiple alignment from pairwise alignments, one has to avoid "closing loops"; i.e., one can put together pairwise alignments as long as no new pairwise alignment is added to a sequence that is already part of the multiple alignment. In particular, pairwise alignments can be merged when they align one sequence to all others, when a linear order of the given sequences is maintained, or when the sequence pairs with pairwise alignments form a tree. While all these schemes allow for the ready definition of algorithms that output multiply aligned sequences, they do not include any information stemming from the simultaneous analysis of several sequences. The alternative approach is to generalize the dynamic programming optimization procedure applied to pairwise alignment to the delineation of a multiple alignment that maximizes a score. The algorithm used is a straightforward generalization of the global alignment algorithm presented in the section on algorithms for the comparison of two sequences. This is easy to see, in particular, for the case of a column-oriented scoring function that avoids the affine gap penalty in favor of the simpler linear one. With this scoring, the arrangement of gaps and letters in a column can be represented by a boolean vector indicating which sequences contain a gap in that column. Given the letters that are being compared, one needs to evaluate the scores for all these arrangements. However, the computational complexity of this algorithm is rather forbidding: for n sequences it is proportional to 2^n times the product of the lengths of all the sequences. In practice, this algorithm can only be run for a modest number of sequences. There exists software to compare three sequences with this algorithm that additionally implements a space-saving technique. For more than three sequences, algorithms have been developed that aim at reducing the search space while still optimizing the given scoring function. The most prominent program of this kind is MSA2 [1]. An alternative approach is used by DCA [2], which implements a divide-and-conquer philosophy: the search space is repeatedly subdivided by identifying strongholds for the alignment. For a more detailed description of these concepts, also see the section on algorithms for SP-optimal multiple alignments. None of these approaches, however, works independently of the number of sequences to be aligned. The most common remedy is to reduce the multiple alignment problem to an iterated application of the pairwise alignment algorithm. However, in doing so, one also aims at drawing on the increased amount of information contained in a set of sequences.
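As a concrete illustration of the SP score defined at the start of this discussion, the following is a minimal Python sketch that scores a multiple alignment column by column, summing substitution scores over all induced pairs. The toy scoring function and the example alignment are illustrative assumptions, not taken from the text; a real application would use a PAM or BLOSUM table.

from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy substitution score; a real implementation would use PAM/BLOSUM."""
    if a == '-' and b == '-':
        return 0          # a common convention: gap-gap pairs score 0
    if a == '-' or b == '-':
        return gap        # linear gap penalty, charged per pair
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum-of-pairs score: add the pair scores of every column's residue pairs."""
    total = 0
    for column in zip(*alignment):            # iterate over alignment columns
        for a, b in combinations(column, 2):  # all induced pairwise comparisons
            total += pair_score(a, b)
    return total

# Hypothetical three-sequence alignment (all rows must have equal length).
aln = ["GARFIELD-",
       "GARF-ELDS",
       "GA-FIELDS"]
print(sp_score(aln))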
Instead of simply merging pairwise alignments of sequences, the notion of a profile has been introduced in order to capture the conservation patterns within subgroups of sequences. A profile is essentially a representation of an already computed multiple alignment of a subgroup (see the sketch at the end of this subsection). This alignment is "frozen" for the remaining computation. Other sequences or other profiles can be compared to a given profile based on a generalized scoring scheme defined for this purpose. Two such schemes are in use, one based on average scores and one based on an information theoretic score [3]. Note that profile scoring schemes respect conservation patterns. Given a profile and a single sequence, the two can be aligned using the basic dynamic programming algorithm together with the accompanying scoring scheme. The result is an alignment between the two that can readily be converted into a multiple alignment, now comprising the sequences underlying the profile plus the new one. Likewise, two profiles can be aligned with each other, resulting in a multiple alignment containing all sequences from both profiles. With these tools, various multiple alignment strategies can be implemented. Most commonly, a hierarchical tree is generated for the given sequences, which is then used as a guide for iterative profile construction and alignment. The construction of such a tree is described in the section Phylogenetic Trees and Multiple Alignments. The above alignment strategy was introduced in papers by Taylor, Barton, Corpet, and Higgins. Higgins' program Clustal has meanwhile become the de facto standard for multiple sequence alignment. Another program in use is Dialign, which differs in that it aims at the delineation of regions of similarity among the given sequences. Since the iterative profile alignment tends to be guided by a hierarchical tree, this step of the computation also influences the final result. Usually this tree is computed based on pairwise comparisons and the resulting scores. Subsequently, this score matrix is used as input to a clustering procedure such as single linkage clustering or UPGMA. However, it is well understood that, in an evolutionary sense, such a hierarchic clustering does not necessarily result in a biologically valid tree. Thus, when allowing this tree to determine the multiple alignment, there is the danger of directing further evolutionary analysis of this alignment in the wrong direction. Consequently, the question has arisen of a common formulation of evolutionary reconstruction and multiple sequence alignment. The cleanest, although biologically somewhat simplistic, model attempts to reconstruct ancestral sequences to attribute to the inner nodes of a tree. Such reconstructed sequences at the same time determine the multiple alignment among the sequences. In this generalized tree alignment, one aims at minimizing the sum over the edges of the tree, where each edge is annotated with the alignment distance between the sequences at its incident nodes. As is to be expected, the computational complexity of this problem again makes an exact solution impractical, but approximation algorithms are known. Practical efforts in this direction go back to the work of Sankoff; Jotun Hein, Schwikowski and Vingron, and Tao Jiang produced programs relying on these ideas.
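To make the notion of a profile more concrete, here is a minimal Python sketch that builds a column-wise frequency profile from an already computed alignment of a subgroup and scores a residue against a profile column using a simple average-score scheme. The scoring details are simplified assumptions rather than the exact schemes cited above.

from collections import Counter

def build_profile(alignment):
    """Column-wise relative frequencies of residues (and gaps) in a
    frozen sub-alignment; a simple realization of a 'profile'."""
    n = len(alignment)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*alignment)]

def average_score(profile_col, residue, match=1, mismatch=-1, gap=-2):
    """Average-score scheme: expected pair score of `residue` against a
    residue drawn from the profile column's frequency distribution."""
    total = 0.0
    for res, freq in profile_col.items():
        if res == '-':
            total += freq * gap
        else:
            total += freq * (match if res == residue else mismatch)
    return total

# Hypothetical frozen sub-alignment and a new residue to compare.
profile = build_profile(["GKV-A", "GRV-A", "GKI-S"])
print(average_score(profile[1], 'K'))  # how well K fits column 2

In a progressive strategy, this kind of column scoring is what lets the basic pairwise dynamic programming algorithm align a new sequence against a frozen profile rather than against a single sequence.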
7.2 Let us sum up
This lesson gave a detailed account of dot matrix analysis, the dynamic programming (DP) algorithm, the word or k-tuple methods used by the programs FASTA and BLAST, and multiple sequence alignment.

7.3 Lesson end activities
1. Find out the various software and tools available for pairwise and multiple sequence alignment.

7.4 Check your progress: Model answers
1. Your answer must include these points:
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)

7.5 Points for Discussion
1. The dynamic programming method is the best one for the alignment of pairs of sequences – give your views on this statement.

7.6 References
1. Wang L, Jiang T. (1994). On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.
2. Just W. (2001). Computational complexity of multiple sequence alignment with SP-score. J Comput Biol 8(6):615-23.
3. Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44.
4. Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
5. Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17.

LESSON – 8 DYNAMIC PROGRAMMING
8.0 Aims and Objectives
8.1 Dot matrix sequence comparison
8.1.1 Pair-wise sequence comparison
8.2 Dynamic programming algorithm for sequence alignment
8.2.1 Description of the algorithm
8.2.2 Formal description of the dynamic programming algorithm
8.3 Let us Sum up
8.4 Lesson end activities
8.5 Check your progress
8.6 Points for Discussion
8.7 References

8.0 Aims and Objectives
This unit discusses dot matrix sequence comparison, pair-wise sequence comparison, the dynamic programming algorithm for sequence alignment, a description of the algorithm, and a formal description of the dynamic programming algorithm.

8.1 Dot Matrix Sequence Comparison
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignment of characters between the sequences, first described by Gibbs and McIntyre (1970). The method is also used for finding direct or inverted repeats in protein and DNA sequences, and for predicting regions in RNA that are self-complementary and that, therefore, have the potential of forming secondary structure. Every laboratory that does sequence analysis should have at least one dot matrix program available. In choosing a program, look for as many of the features described below as possible. The dot matrix should be visible on the computer terminal, thus providing an interactive environment in which different types of analyses may be tried. Use of colored dots can enhance the detection of regions of similarity (Maizel and Lenk 1981). Additional descriptions of the dot matrix method have appeared elsewhere (Doolittle 1986; States and Boguski 1991). The examples given below use the dot matrix module of DNA Strider (version 1.3) on a Macintosh computer.
The program DOTTER has interactive features for the UNIX X-Windows environment (Sonnhammer and Durbin 1995; http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html). The Genetics Computer Group programs COMPARE and DOTPLOT also perform a dot matrix analysis. Although not a dot matrix method, the program PLALIGN in the FASTA suite may be used to display the alignments found by the dynamic programming method between two sequences on a graph (http://fasta.bioch.virginia.edu/fasta/fasta_list.html; Pearson 1990). A dot matrix program that may be used with a Web browser is described in Junier and Pagni (2000) (http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html).

8.1.1 Pair-wise Sequence Comparison
The major advantage of the dot matrix method for finding sequence alignments is that all possible matches of residues between two sequences are found, leaving the investigator the choice of identifying the most significant ones. The sequences of the actual regions that align can then be detected by using one of the other methods for performing sequence alignments, e.g., dynamic programming. Those methods are automatic and usually show one best or optimal alignment, even though there may be several different, nearly equivalent alignments. Alignments generated by these programs can be compared to the dot matrix alignment to determine whether the longest regions are being matched and whether insertions and deletions are located in the most reasonable places.

In the dot matrix method of sequence comparison, one sequence (A) is listed across the top of a page and the other sequence (B) is listed down the left side. Starting with the first character in B, one then moves across the page, keeping in the first row and placing a dot in any column where the character in A is the same. The second character in B is then compared to the entire A sequence, and a dot is placed in row 2 wherever a match occurs. This process is continued until the page is filled with dots representing all the possible matches of A characters with B characters. Any region of similar sequence is revealed by a diagonal row of dots. Isolated dots not on the diagonal represent random matches that are probably not related to any significant alignment.

Detection of matching regions may be improved by filtering out random matches in a dot matrix. Filtering is achieved by using a sliding window to compare the two sequences: instead of comparing single sequence positions, a window of adjacent positions in the two sequences is compared at the same time, and a dot is printed on the page only if a certain minimal number of matches occurs (a small sketch of this windowed comparison is given below). The window starts at the positions in A and B to be compared and includes characters in a diagonal line going down and to the right, comparing each pair in turn, as in making an alignment. A larger window size is generally used for DNA sequences than for protein sequences because the number of random matches is much greater, due to the use of only four DNA symbols as compared to 20 amino acid symbols. A typical window size for DNA sequences is 15, and a suitable match requirement in this window is 10. For protein sequences, the matrix is often not filtered, but a window size of 2 or 3 and a match requirement of 2 will highlight matching regions.
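The following is a minimal Python sketch of a filtered dot matrix along the lines just described: a dot is recorded only when, within a window sliding diagonally from each position, at least `stringency` identities occur. The function name and the text-based output are illustrative assumptions, not any particular package's interface.

def dot_matrix(a, b, window=3, stringency=2):
    """Return a boolean matrix: rows follow sequence B, columns sequence A.
    A cell is True when the diagonal window starting there contains at
    least `stringency` identical character pairs."""
    rows, cols = len(b), len(a)
    dots = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Compare pairs along the diagonal going down and to the right.
            matches = sum(1 for k in range(window)
                          if i + k < rows and j + k < cols
                          and b[i + k] == a[j + k])
            dots[i][j] = matches >= stringency
    return dots

# Toy example; real analyses would use window/stringency choices such as
# 15/10 for DNA, as recommended above.
a, b = "HEAGAWGHEE", "PAWHEAE"
for row in dot_matrix(a, b):
    print(''.join('*' if d else '.' for d in row))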
If two proteins are expected to be related but to have long regions of dissimilar sequence with only a small proportion of identities, such as similar active sites, a large window, e.g., 20, and a small stringency, e.g., 5, should be useful for seeing any similarity. Identification of sequence alignments by the dot matrix method can be aided by performing a count of dots in all possible diagonal lines through the matrix to determine statistically which diagonals have the most matches, and by comparing these match scores with the results of random sequence comparisons (Gibbs and McIntyre 1970; Argos 1987).

Consider, as an example, a dot matrix analysis between the DNA sequences that encode the Escherichia coli phage lambda cI and phage P22 c2 repressor proteins. With a window of 1 and a stringency of 1, there is so much noise that no diagonals can be seen, but with a window of 11 and a stringency of 7, diagonals appear in the lower right. The analysis reveals that there are regions of similarity in the 3' ends of the coding regions, which, in turn, suggests similarity in the carboxy-terminal domains of the encoded repressors. Note that sequential diagonals in the filtered matrix do not line up exactly, indicating the presence of extra nucleotides in one sequence (the lambda cI gene). The diagonals in the lower part of the matrix reveal a region of sequence similarity in the carboxy-terminal domains of the proteins; a small insertion in the cI protein, approximately in the middle of this region, shifts the diagonal slightly downward and accounts for this pattern.

A corresponding dot matrix analysis between the amino acid sequences of the same two E. coli phage lambda cI and phage P22 c2 repressor proteins can be filtered with a window of 1 and a stringency of 1. As found with the DNA sequence alignment of the corresponding genes, diagonals in the lower part of the matrix reveal a region of sequence similarity in the carboxy-terminal domains of the proteins. The small insertion in the cI protein approximately in the middle of this region, which shifts the diagonal slightly downward and which is also observed in the DNA alignment of the corresponding genes, is also visible. Note that these windows are much smaller than those required for DNA sequence comparisons, due to the greater number of possible symbols (20 amino acids) and therefore fewer random matches.

In conclusion, for DNA sequence dot matrix comparisons, use long windows and high stringencies, e.g., a stringency of 7 with a window of 11, or a stringency of 11 with a window of 15. For protein sequences, use short windows, e.g., 1 and 1 for window and stringency, respectively, except when looking for a short domain of partial similarity in otherwise dissimilar sequences; in that case, use a longer window and a small stringency, e.g., 15 and 5 for window and stringency, respectively.

There are three types of variations in the analysis of two protein sequences by the dot matrix method. First, chemical similarity of the amino acid R group, or some other feature for distinguishing amino acids, may be used to score similarity. Second, a symbol comparison table such as the PAM250 or BLOSUM62 table may be used (States and Boguski 1991). These tables provide scores for matches based on their occurrence in aligned protein families, and are discussed later.
When these tables are used, a dot is placed in the matrix only if a minimum similarity score is found. These table values may also be used in a sliding window option, which averages the score within the window and prints a dot only above a certain average score. Finally, several different matrices can be made, each with a different scoring system, and the scores can be averaged. This method should be useful for aligning more distantly related proteins. The scores of each possible diagonal through the matrix are then calculated, and the most significant ones are identified and shown on a computer screen (Argos 1987).

8.2 Dynamic Programming Algorithm for Sequence Alignment
Dynamic programming is a computational method that is used to align two protein or nucleic acid sequences. The method is very important for sequence analysis because it provides the very best, or optimal, alignment between sequences. Programs that perform this analysis on sequences are readily available, and there are Web sites that will perform the analysis. However, the method requires the intelligent use of several variables in the program. Thus, it is important to understand how the program works in order to make informed choices for these variables. The method compares every pair of characters in the two sequences and generates an alignment. This alignment will include matched and mismatched characters and gaps in the two sequences that are positioned so that the number of matches between identical or related characters is the maximum possible. The dynamic programming algorithm provides a reliable computational method for aligning DNA and protein sequences. The method has been proven mathematically to produce the best or optimal alignment between two sequences under a given set of match conditions. Optimal alignments provide useful information to biologists concerning sequence relationships by giving the best possible information as to which characters in a sequence should be in the same column in an alignment, and which are insertions in one of the sequences (or deletions in the other). This information is important for making functional, structural, and evolutionary predictions on the basis of sequence alignments.

Both global and local types of alignment may be made by simple changes in the basic dynamic programming algorithm. A global alignment program is based on the Needleman-Wunsch algorithm, and a local alignment program on the Smith-Waterman algorithm, described later (p. 72). The predicted alignment is given a score that reflects the odds of obtaining that score between sequences known to be related, relative to the score obtained by chance alignment of unrelated sequences. There is a method to calculate whether or not an alignment obtained this way is statistically significant: one of the sequences may be scrambled many times, and each randomly generated sequence realigned with the second sequence, to demonstrate that the original alignment is unique (a small sketch of this test follows below). The statistical significance of alignment scores is discussed in detail later (p. 96).
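The scrambling test just described can be sketched in a few lines of Python. The alignment scorer below is a deliberately simple global scorer standing in for whatever scoring scheme is actually in use; everything here, including the default scores, is an illustrative assumption.

import random

def align_score(a, b, match=1, mismatch=0, gap=0):
    """Simple global alignment score by dynamic programming."""
    prev = [j * gap for j in range(len(a) + 1)]   # row 0: leading gaps
    for i, ch_b in enumerate(b, 1):
        cur = [i * gap]                           # column 0: leading gaps
        for j, ch_a in enumerate(a, 1):
            s = match if ch_a == ch_b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]

def scramble_test(a, b, trials=100, seed=0):
    """Realign many shuffled versions of `a` against `b`; report how the
    real score compares with the distribution of random scores."""
    rng = random.Random(seed)
    real = align_score(a, b)
    random_scores = []
    for _ in range(trials):
        shuffled = ''.join(rng.sample(a, len(a)))
        random_scores.append(align_score(shuffled, b))
    better = sum(1 for s in random_scores if s >= real)
    return real, max(random_scores), better / trials

real, best_random, frac = scramble_test("GAATTCAGTTA", "GGATCGA")
print(real, best_random, frac)  # real score vs. best random score and
                                # the fraction of random scores >= real

If only a small fraction of the shuffled sequences reach the real score, the original alignment is unlikely to be a chance result.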
Another feature of the dynamic programming algorithm is that the alignments obtained depend on the choice of a scoring system for comparing character pairs and on penalty scores for gaps. For protein sequences, the simplest system of comparison is one based on identity: a match in an alignment is scored only if the two aligned amino acids are identical. However, one can also examine related protein sequences that can be aligned easily and find which amino acids are commonly substituted for each other. The probability of a substitution between any pair of the 20 amino acids may then be used to produce alignments. Recent improvements and experience with the dynamic programming programs and the scoring systems have greatly simplified their use. These enhancements are discussed below and at http://www.bioinformaticsonline.org. It is important to recognize that several different alignments may provide approximately the same alignment score; i.e., there are alignments almost as good as the highest-scoring one reported by the alignment program. Some programs, e.g., LALIGN, provide several entirely different alignments, with different sequence positions matched, that can be compared to improve confidence in the best-scoring one. Alignment programs have also been greatly improved in algorithmic design and performance. With the advent of faster machines, it is possible to perform a dynamic programming alignment between a query sequence and an entire sequence database and to find the similar sequences in several minutes. Dynamic programming has also been used to perform multiple sequence alignment, but only for a small number of sequences, because the complexity of the calculation increases substantially for more than two sequences. Sequence alignment programs are available as part of most sequence analysis packages, such as the widely used Genetics Computer Group GAP (global alignment) and BESTFIT (local alignment) programs. Sequences can also be pasted into a text area on a guest Web page of a remote host machine that will perform a dynamic programming alignment, and there are also versions of alignment programs that will run on a microcomputer.

In deciding to perform a sequence alignment, it is important to keep the goal of the analysis in mind. Is the investigator interested in trying to find out whether two proteins have similar domains or structural features, whether they are in the same family with a related biological function, or whether they share a common ancestor? The desired objective will influence the way the analysis is done. There are several decisions to be made along the way, including the type of program, whether to produce a global or local alignment, the type of scoring matrix, and the values of the gap penalties to be used. There is a very large number of amino acid scoring matrices in use, some much more popular than others, and these scoring matrices are designed for different purposes. Some, such as the Dayhoff PAM matrices, are based on an evolutionary model of protein change, whereas others, such as the BLOSUM matrices, are designed to identify members of the same family. Alignments between DNA sequences require similar kinds of considerations. It is often worth the effort to try several approaches to find out which choice of scoring system and gap penalty gives the most reasonable result. Fortunately, most alignment programs come with a recommended scoring matrix and gap penalties that are useful for most situations. A more recent development is the simultaneous use of a set of scoring matrices and gap penalties by a method that generates the most probable alignments.
The final choice as to the most believable alignment is up to the investigator, subject to the condition that reasonable decisions have been made regarding the methods used. For sequences that are very similar, e.g., 95% identical or more, the sequence alignment is usually quite obvious, and a computer program may not even be needed to produce the alignment. As the sequences become less similar, the alignment becomes more difficult to produce and one is less confident of the result. For protein sequences, similarity can still be recognized down to a level of approximately 25% amino acid identity. At this level of identity, the relative numbers of mismatched amino acids and gaps in the alignment have to be decided empirically, and a decision made as to which gap penalties work best for a given scoring matrix. Alignment of sequences at this level of identity was called the "twilight zone" of sequence alignment by Doolittle (1981). The alignment program may provide a quite convincing alignment, which suggests that the two sequences are homologous.

Check your progress:
1. What are the three types of variations in the analysis of two protein sequences by the dot matrix method?
Notes: o) Write your answer in the space given below.
p) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

8.2.1 Description of the Algorithm
Alignment of two sequences without allowing gaps requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, as in a dot matrix comparison. If the alignment is to include gaps of any length at any position in either sequence, the number of comparisons that must be made becomes astronomical and is not achievable by direct comparison methods. Dynamic programming is a method of sequence alignment that can take gaps into account while requiring a manageable number of comparisons. To understand how the method works, and why it provides an optimal (highest-scoring) alignment, first recall what is meant by an alignment, using two protein sequences as an example. The two sequences are written across the page, one under the other, the object being to bring as many amino acids as possible into register. In some regions, amino acids in one sequence are placed directly below identical amino acids in the second. In other regions, this may not be possible, and nonidentical amino acids may have to be placed next to each other, or else gaps must be introduced into one of the sequences. Gaps are added to the alignment in a manner that increases the matching of identical or similar amino acids at subsequent positions in the alignment. Ideally, when two similar protein sequences are aligned, the alignment should have long regions of identical or related amino acid pairs and very few gaps. As the sequences become more distant, more mismatched amino acid pairs and gaps should appear. The quality of the alignment between two sequences is calculated using a scoring system that favors the matching of related or identical amino acids and penalizes poorly matched amino acids and gaps.
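Before turning to how such scores are derived, here is a small illustration of applying one. The following Python sketch scores an existing gapped alignment using a toy substitution table and a linear gap penalty; the miniature three-amino-acid table is an invented stand-in for a real matrix such as BLOSUM62.

# Toy log-odds substitution scores for a three-letter alphabet; a real
# application would use the full 20 x 20 BLOSUM62 or PAM250 table.
SUBST = {
    ('A', 'A'): 4, ('L', 'L'): 4, ('S', 'S'): 4,
    ('A', 'L'): -1, ('A', 'S'): 1, ('L', 'S'): -2,
}

def subst_score(x, y):
    return SUBST.get((x, y), SUBST.get((y, x), 0))

def alignment_score(top, bottom, gap_penalty=-11):
    """Score two already-aligned sequences of equal length, where '-'
    marks a gap; each gapped column costs `gap_penalty`."""
    total = 0
    for x, y in zip(top, bottom):
        if x == '-' or y == '-':
            total += gap_penalty   # penalize gapped columns
        else:
            total += subst_score(x, y)
    return total

print(alignment_score("ALSA-LS", "ALSALLS"))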
To decide how to score these regions, information on the types of changes found in related protein sequences is needed. These changes may be expressed by the following probabilities: (1) that a particular amino acid pair is found in alignments of related proteins; (2) that the same amino acid pair is aligned by chance in the sequences, given that some amino acids are abundant in proteins and others rare; and (3) that the insertion of a gap of one or more residues in one of the sequences (equivalent to an insertion of the same length in the other sequence), thus forcing the alignment of each partner of the amino acid pair with another amino acid, would be a better choice. The ratio of the first two probabilities is usually provided in an amino acid substitution matrix. Each table entry gives the ratio of the observed frequency of substitution between each possible amino acid pair in related proteins to the frequency expected by chance, given the frequencies of the amino acids in proteins. These ratios are called odds scores. The ratios are transformed to logarithms of the odds scores, called log odds scores, so that the scores of sequential pairs may be added to reflect the overall odds that an alignment is real rather than due to chance. Examples are the Dayhoff PAM250 and BLOSUM62 substitution matrices described later (p. 76). These matrices contain positive and negative values, reflecting the likelihood of each amino acid substitution in related proteins. Using these tables, an alignment of a sequential set of amino acid pairs with no gaps receives an overall score that is the sum of the positive and negative log odds scores for each individual amino acid pair in the alignment. The higher this score, the more significant the alignment, or the more it resembles alignments in related proteins. The score given for gaps in aligned sequences is negative, because such misaligned regions should be uncommon in sequences of related proteins. Such a score reduces the score obtained from an adjacent, matching region upstream in the sequences. For example, using values from the BLOSUM62 amino acid substitution matrix and a gap penalty score of -11 for a gap of length 1, an alignment whose amino acid pair scores sum to 26 receives an overall score of 26 - 11 = 15. The value of -11 as a penalty for a gap of length 1 is used because this value is already known from experience to favor the alignment of similar regions when the BLOSUM62 comparison matrix is used. The choice of gap penalty is discussed further below, where a table giving suitable choices is presented. As shown in the example, the presence of the gap significantly decreases the overall score of the alignment. Although one may be able to align two short sequences by eye and place the gap where shown, the dynamic programming algorithm will automatically place gaps in much longer sequence alignments so as to achieve the best possible alignment.

The derivation of the dynamic programming algorithm can be illustrated using the above alignment as an example. Consider building this alignment in steps, starting with an initial matching aligned pair of characters from the sequences (V/V) and then sequentially adding a new pair until the alignment is complete, at each stage choosing a pair from all the possible matches that provides the highest score for the alignment up to that point.
If the full alignment finally reached (I) has the highest possible, or optimal, score, then the earlier alignment from which it was derived (A) by addition of the aligned Y/Y pair must also have been optimal up to that point in the alignment. If this were not so, and a different preceding alignment other than A were the highest-scoring one, then the full alignment would also not be the highest-scoring alignment, contradicting the starting assumption. Similarly (II), alignment A must itself have been derived from an optimal alignment (B) by addition of a C/C pair. In this manner, the alignment can be traced back sequentially to the first aligned pair, which was also an optimal alignment. One concludes that building an optimal alignment in this stepwise fashion can provide an optimal alignment of the entire sequences. The example also illustrates two of the three choices that can be made in extending an alignment between two sequences: match the next two characters in the next positions in each sequence, or match the next character to a gap in the upper sequence. The last possibility, not illustrated, is to add a gap to the lower sequence. This situation is analogous to performing a dot matrix analysis of the sequences and either continuing a diagonal or shifting the diagonal sideways or downward to produce a gap in one of the sequences.

8.2.2 Formal Description of the Dynamic Programming Algorithm
The algorithm may be written in mathematical form. There are three paths in the scoring matrix for reaching a particular position (i,j): a diagonal move from position (i-1, j-1), with no gap penalty, or a move from any other position in column j or row i, with a gap penalty that depends on the size of the gap. For two sequences a = a1 a2 ... an and b = b1 b2 ... bm, let S(i,j) denote the best score for aligning a1 a2 ... ai with b1 b2 ... bj. Then (Smith and Waterman 1981a,b)

S(i,j) = max[ S(i-1,j-1) + s(ai,bj),
              max over x >= 1 of { S(i-x,j) - wx },
              max over y >= 1 of { S(i,j-y) - wy } ]

where s(ai,bj) is the score for the pair of characters ai and bj, wx is the penalty for a gap of length x in sequence a, and wy is the penalty for a gap of length y in sequence b. Note that S(i,j) is a type of running best score as the algorithm moves through every position in the matrix. Eventually, when all of the matrix positions (all S(i,j)) have been filled, the best score of the alignment will be found as the highest-scoring position in the last row and column (for a global alignment), after correcting for any remaining gap penalties needed to align the sequence ends, if applicable. To determine an optimal alignment of the sequences from the scoring matrix, a second matrix, called the trace-back matrix, is used. The trace-back matrix keeps track of the positions in the scoring matrix that contributed to the highest overall score found. The sequence characters corresponding to these high-scoring positions may align, or may be next to a gap, depending on the information in the trace-back matrix. An example of this procedure can be found on the book Web site. Use of the dynamic programming method requires a scoring system for the comparison of symbol pairs (nucleotides for DNA sequences and amino acids for protein sequences) and a scheme for insertion/deletion (gap) penalties. Once those parameters have been set, the resulting alignment for two sequences should always be the same.
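For comparison with the global recurrence above, the following is a minimal Python sketch of the local (Smith-Waterman) variant with a linear gap penalty: cell values are floored at zero, and the traceback starts from the highest-scoring cell and stops at the first zero. The scores used (match +2, mismatch -1, gap -2) are illustrative assumptions, not recommended parameters.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores are never allowed to fall below zero.
            S[i][j] = max(0, S[i - 1][j - 1] + s,
                          S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append('-'); i -= 1
        else:
            top.append('-'); bottom.append(b[j - 1]); j -= 1
    return best, ''.join(reversed(top)), ''.join(reversed(bottom))

print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))

The floor at zero is what distinguishes the local from the global algorithm: a poorly scoring prefix is simply discarded rather than dragging the alignment score down.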
8.3 Let us Sum up
A dot matrix analysis is primarily a method for comparing two sequences to look for possible alignment of characters between them. Dynamic programming is a computational method that is used to align two protein or nucleic acid sequences. The method is very important for sequence analysis because it provides the very best, or optimal, alignment between sequences. Programs that perform this analysis on sequences are readily available, and there are Web sites that will perform the analysis. Alignment of two sequences without allowing gaps requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, as in a dot matrix comparison.

8.4 Lesson end activities
1. Find out some more algorithms that use dynamic programming.

8.5 Check your progress: Model answers
1. Your answer must include these points:
1. Chemical similarity of the amino acid R group, or some other feature for distinguishing amino acids, may be used to score similarity.
2. A symbol comparison table such as the PAM250 or BLOSUM62 tables may be used.
3. Several different matrices can be made, each with a different scoring system, and the scores can be averaged.

8.6 Points for Discussion
1. Comment on the significant features of the dynamic programming algorithm with suitable examples.

8.7 References
1. Wagner, David B. (1995). "Dynamic Programming." An introductory article on dynamic programming in Mathematica.
2. King, Ian (2002 [1987]). "A Simple Introduction to Dynamic Programming in Macroeconomic Models." An introduction to dynamic programming as an important tool in economic theory.
3. Dreyfus, Stuart (2002). "Richard Bellman on the Birth of Dynamic Programming." Operations Research 50(1):48-51.

LESSON – 9 SCORING MATRICES AND GAP PENALTY
9.0 Aims and Objectives
9.1 Use of scoring matrices and gap penalties in sequence alignments
9.1.1 Amino acid substitution matrices
9.1.2 Nucleic acid PAM scoring matrices
9.1.3 Gap penalties
9.1.4 Optimal combinations of scoring matrices and gap penalties for finding related proteins
9.2 Let us Sum up
9.3 Lesson end activities
9.4 Check your progress
9.5 Points for Discussion
9.6 References

9.0 Aims and Objectives
This unit discusses the use of scoring matrices and gap penalties in sequence alignments, amino acid substitution matrices, nucleic acid PAM scoring matrices, gap penalties, and optimal combinations of scoring matrices and gap penalties for finding related proteins.

9.1 Use of Scoring Matrices and Gap Penalties in Sequence Alignments
Amino Acid Substitution Matrices
Protein chemists discovered early on that certain amino acid substitutions commonly occur in related proteins from different species. Because the protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function. Often, these substitutions are to a chemically similar amino acid, but other changes also occur. Yet other substitutions are relatively rare. Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences.
If related protein sequences are quite similar, they are easy to align, and the single-step amino acid changes can readily be determined. If ancestor relationships among a group of proteins are assessed, the most likely amino acid changes that occurred during evolution can be predicted. This type of analysis was pioneered by Margaret Dayhoff (1978). Amino acid substitution matrices, or symbol comparison tables as they are sometimes called, are used for such purposes. Although the most common use of such tables is for the comparison of protein sequences, tables of nucleic acid symbols are also used for the comparison of nucleic acid sequences, in order to accommodate ambiguous nucleotide characters or models of expected sequence change, over different periods of evolutionary time, that score transitions and transversions differently. In an amino acid substitution matrix, the amino acids are listed both across the top and down the side of the matrix, and each matrix position is filled with a score that reflects how often one amino acid would have been paired with the other in an alignment of related protein sequences. The probability of changing amino acid A into B is always assumed to be identical to the reverse probability of changing B into A. This assumption is made because, for any two sequences, the ancestral amino acid in the phylogenetic tree is usually not known. Additionally, the likelihood of replacement should depend on the product of the frequency of occurrence of the two amino acids and on their chemical and physical similarities. A prediction of this model is that amino acid frequencies will not change over evolutionary time (Dayhoff 1978).

9.1.1 Dayhoff Amino Acid Substitution Matrices (Percent Accepted Mutation or PAM Matrices)
This family of matrices lists the likelihood of change from one amino acid to another in homologous protein sequences during evolution. There is presently no other type of scoring matrix that is based on such sound evolutionary principles as these matrices. Even though they were originally based on a relatively small data set, the PAM matrices remain a useful tool for sequence alignment. Each matrix gives the changes expected for a given period of evolutionary time, evidenced by decreased sequence similarity as genes encoding the same protein diverge with increasing evolutionary time. Thus, one matrix gives the changes expected in homologous proteins that have diverged only a small amount from each other, in a relatively short period of time, so that they are still 50% or more similar. Another gives the changes expected of proteins that have diverged over a much longer period, leaving only 20% similarity. These predicted changes are used to produce optimal alignments between two protein sequences and to score the alignments. The assumption in this evolutionary model is that the amino acid substitutions observed over short periods of evolutionary history can be extrapolated to longer distances. The BLOSUM matrices, which are based on scoring substitutions found over a range of evolutionary periods, reveal that substitutions are not always as predicted by the PAM model. In deriving the PAM matrices, each change in the current amino acid at a particular site is assumed to be independent of previous mutational events at that site (Dayhoff 1978).
Thus, the probability of change of any amino acid a to amino acid b is taken to be the same, regardless of the previous changes at that site and regardless of the position of amino acid a in the protein sequence. Amino acid substitutions in a protein sequence are thus viewed as a Markov model, characterized by a series of changes of state in a system such that a change from one state to another does not depend on the previous history of the state. Use of this model makes it possible to extrapolate amino acid substitutions observed over a relatively short period of evolutionary time to longer periods. To prepare the Dayhoff PAM matrices, the amino acid substitutions that occur in a group of evolving proteins were estimated using 1572 changes in 71 groups of protein sequences that were at least 85% similar. Because these changes are observed in closely related proteins, they represent amino acid substitutions that do not significantly change the function of the protein. Hence, they are called "accepted mutations," defined as amino acid changes "accepted" by natural selection. Similar sequences were first organized into a phylogenetic tree. The number of changes of each amino acid into every other amino acid was then counted. To make these numbers useful for sequence analysis, information on the relative amount of change for each amino acid was needed. Relative mutabilities were evaluated by counting, in each group of related sequences, the number of changes of each amino acid and by dividing this number by a factor called the exposure to mutation of the amino acid. This factor is the product of the frequency of occurrence of the amino acid in that group of sequences and the total number of all amino acid changes that occurred in that group per 100 sites. This factor normalizes the data for variations in amino acid composition, mutation rate, and sequence length. The normalized frequencies were then summed over all sequence groups. By these scores, Asn, Ser, Asp, and Glu were the most mutable amino acids, and Cys and Trp were the least mutable. The amino acid exchange counts and mutability values were then used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes. Because amino acid change was modeled as a Markov process, with the mutation at each site independent of previous mutations, the changes predicted for more distantly related proteins that have undergone N mutations could be calculated: the PAM1 matrix is multiplied by itself N times, giving transition matrices for comparing sequences with lower and lower levels of similarity due to separation over longer periods of evolutionary history. Thus, the commonly used PAM250 matrix represents the level of 250% change expected over roughly 2500 million years. Although this amount of change seems very large, sequences at this level of divergence still have about 20% similarity. For example, alanine will be matched with alanine 13% of the time and with another amino acid 87% of the time. The percentage of remaining similarity for any PAM matrix can be calculated by summing the percentages for amino acids not changing (Ala versus Ala, etc.) after multiplying each by the frequency of that amino acid pair in the database (e.g., 0.089 for Ala) (Dayhoff 1978). The PAM120, PAM80, and PAM60 matrices should be used for aligning sequences that are approximately 40%, 50%, and 60% similar, respectively. A small sketch of the matrix extrapolation is given below.
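The extrapolation just described is simply repeated matrix multiplication. Below is a minimal Python sketch using a toy 3-state mutation matrix in place of the real 20 x 20 PAM1 table (whose values are not reproduced here); numpy's matrix power plays the role of applying the Markov model N times.

import numpy as np

# Toy 3-letter "PAM1-like" mutation probability matrix: each row sums to 1,
# with 99% chance of no change per unit of evolutionary time. A real PAM1
# matrix is 20 x 20 with empirically derived off-diagonal values.
pam1 = np.array([
    [0.99,  0.006, 0.004],
    [0.005, 0.99,  0.005],
    [0.004, 0.006, 0.99 ],
])

# Extrapolate to PAM250 by multiplying the matrix by itself 250 times.
pam250 = np.linalg.matrix_power(pam1, 250)

# Fraction of positions left unchanged at this distance: weight each
# diagonal entry by the (here uniform) background frequency of the letter.
background = np.array([1/3, 1/3, 1/3])
identity_remaining = float(background @ np.diag(pam250))
print(np.round(pam250, 3))
print(f"expected identity at PAM250: {identity_remaining:.2f}")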
Simulations by George et al. (1990) have shown that, as predicted, the PAM250 matrix provides a better-scoring alignment than lower-numbered PAM matrices for distantly related proteins of 14–27% similarity. At one time, the PAM250 scoring matrix was modified in an attempt to improve the alignments obtained. All scores for matching a particular amino acid were normalized to the same mean and standard deviation, and all amino acid identities were given the same score, to provide an equal contribution for each amino acid in a sequence alignment (Gribskov and Burgess 1986). These modifications were included as the default matrices for the GCG sequence alignment programs in versions 8 and earlier and are optional in later versions. They are not recommended, because they will not give an optimal alignment that is in accord with the evolutionary model.

Choosing the Best PAM Scoring Matrices for Detecting Sequence Similarity. The ability of PAM scoring matrices to distinguish statistically between chance and biologically meaningful alignments has been analyzed using a more recently developed statistical theory for sequences (Altschul 1991) that is discussed later in this material. As discussed above, each PAM matrix is designed to score alignments between sequences that have diverged by a particular degree of evolutionary distance. Altschul (1991) has examined how well the PAM matrices actually distinguish proteins that have diverged to a greater or lesser extent, when these proteins are subjected to a local alignment. Initially, when using a scoring matrix to produce an alignment, the amount of similarity between the sequences may not be known. However, the ungapped alignment scores obtained are maximal when the correct PAM matrix, i.e., the one corresponding to the degree of similarity in the target sequences, is used (Altschul 1991). Altschul (1991) has also examined the ability of PAM matrices to provide a reliable enough indication of an ungapped local alignment score between sequences on an initial attempt at alignment. For such alignments, the PAM200 matrix is able to detect a significant ungapped alignment of 16–62 amino acids whose score is within 87% of the optimal one. Alternatively, several combinations, such as PAM80 and PAM250, or PAM120 and PAM350, can be used. Altschul (1993) has also proposed using a single matrix and adjusting a statistical parameter in the scoring system to reach more distantly related sequences, but this change would primarily be for database searches. Scoring matrices are also used in database searches for similar sequences, and the optimal matrices for these searches have likewise been determined. It is important to remember that these predictions assume that the amino acid distributions in the set of protein families used to make the scoring matrix are representative of all families likely to be encountered. The original PAM matrices represent only a small number of families. Scoring matrices obtained more recently, such as the BLOSUM matrices, are based on a much larger number of protein families. BLOSUM matrices are not based on a PAM evolutionary model in which changes at large evolutionary distances are predicted by extrapolation of changes found at small distances. Instead, matrix values are based on the observed frequency of change in a large set of diverse proteins.
As discussed on the book Web site, the BLOSUM scoring matrices (especially BLOSUM62) appear to capture more of the distant types of variation found in protein families. In addition to the aforementioned differences among PAM scoring matrices for scoring alignments of more- or less-related proteins, the ability of each PAM matrix to discriminate real local alignments from chance alignments also varies. To calculate the ability of an entire matrix to discriminate related from unrelated sequences (H, the relative entropy), the score for each amino acid pair, sij (in units of log2, called bits), is multiplied by the probability of occurrence of that pair in the original data set, qij (Altschul 1991). This weighted score is then summed over all of the amino acid pairs to produce a score that represents the ability of the average amino acid pair in the matrix to discriminate actual from chance alignments:

H = sum over all pairs i,j of ( qij x sij )

In information theory, this score is called the average mutual information content per pair, and the sum over all pairs is the relative entropy of the matrix (termed H); a small sketch of this computation is given at the end of this subsection. The relative entropy will be a small positive number: for the PAM250 matrix it is approximately 0.36, for PAM120, 0.98, and for PAM160, 0.70. In general, all other factors being equal, the higher the value of H for a scoring matrix, the more likely it is to be able to distinguish real from chance alignments.

Blocks Amino Acid Substitution Matrices (BLOSUM)
The BLOSUM62 substitution matrix (Henikoff and Henikoff 1992) is widely used for scoring protein sequence alignments. The matrix values are based on the observed amino acid substitutions in a large set of approximately 2000 conserved amino acid patterns, called blocks. These blocks have been found in a database of protein sequences representing more than 500 families of related proteins (Henikoff and Henikoff 1992) and act as signatures of these protein families. The BLOSUM matrices are thus based on an entirely different type of sequence analysis, and on a much larger data set, than the Dayhoff PAM matrices. The protein families were originally identified by Bairoch in the Prosite catalog. This catalog provides lists of proteins that are in the same family because they have a similar biochemical function. For each family, a pattern of amino acids that is characteristic of that function is provided. Henikoff and Henikoff (1991) examined each Prosite family for the presence of ungapped amino acid patterns (blocks) that were present in each family and that could be used to identify members of that family. To locate these patterns, the sequences of each protein family were searched for similar amino acid patterns with the MOTIF program of H. Smith (Smith et al. 1990), which can find patterns of the type aa1 d1 aa2 d2 aa3, where aa1, aa2, and aa3 are conserved amino acids and d1 and d2 are stretches of intervening sequence up to 24 amino acids long, located in all sequences. These initial patterns were organized into larger ungapped patterns (blocks) between 3 and 60 amino acids long by the Henikoffs' PROTOMAT program (http://www.blocks.fhcrc.org). Because these blocks were present in all of the sequences in each family, they could be used to identify other members of the same family. Thus, the family collections were enlarged by searching the sequence databases for more proteins with these same conserved blocks.
The blocks that characterized each family provided a type of multiple sequence alignment for that family. The amino acid changes that were observed in each column of the alignment could then be counted. The types of substitutions were then scored for all aligned patterns in the database and used to prepare a scoring matrix, the BLOSUM matrix, indicating the frequency of each type of substitution. As previously described for the PAM matrices, BLOSUM matrix values are given as logarithms of odds scores: the ratio of the observed frequency of each amino acid substitution divided by the frequency expected by chance. This procedure of counting all of the amino acid changes in the blocks, however, can lead to an overrepresentation of amino acid substitutions that occur in the most closely related members of each family. To reduce this dominant contribution from the most alike sequences, such sequences were grouped together into one sequence before scoring the amino acid substitutions in the aligned blocks, and the amino acid changes within these clustered sequences were averaged. Patterns that were 60% identical were grouped together to make one substitution matrix, called BLOSUM60, those 80% alike to make another matrix, called BLOSUM80, and so on. As with the PAM matrices, these matrices differ in the degree to which the more common amino acid pairs are scored relative to the less common pairs. Thus, when used for aligning protein sequences, they provide a greater or lesser distinction between the more common and less common amino acid pairs. The ability of these different BLOSUM matrices to distinguish real from chance alignments and to identify as many members as possible of a protein family has been determined (Henikoff and Henikoff 1992). Two types of analysis were performed: (1) an information content analysis of each matrix, as described above for the PAM matrices, and (2) an actual comparison of the ability of each matrix to find members of the same families in a database search, discussed below. As the clustering percentage was increased, the ability of the resulting matrix to distinguish actual from chance alignments, defined as the relative entropy of the matrix or the average information content per residue pair, also increased. As clustering increased from 45% to 62%, the information content per residue increased from approximately 0.4 to 0.7 bits per residue, and was approximately 1.0 bit at 80% clustering. However, at the same time, the number of blocks that contributed information decreased by 25% between no clustering and 62% clustering. BLOSUM62 represents a balance between information content and data size. Henikoff and Henikoff (1993) have also prepared a set of interval BLOSUM matrices that represent the changes observed between more closely or more distantly related representatives of each block. Rather than representing the changes observed in sequences ranging from very alike up to n% alike, as in a BLOSUM-n matrix, the new BLOSUM-nm matrix represents the changes observed in sequences that are between n% and m% alike. The idea behind these matrices was to have a set of matrices corresponding to amino acid changes in sequence blocks separated by different evolutionary distances.
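Returning to the relative entropy H defined above, the following is a minimal Python sketch that computes H for a toy scoring scheme from pair frequencies; the tiny two-letter alphabet and its numbers are invented for illustration, not taken from any published matrix.

import math

# Hypothetical observed (target) pair frequencies q_ij and background
# letter frequencies p_i for a two-letter alphabet; q must sum to 1.
q = {('A', 'A'): 0.30, ('A', 'B'): 0.20, ('B', 'B'): 0.50}
p = {'A': 0.40, 'B': 0.60}

def log_odds_bits(pair):
    """s_ij = log2(q_ij / e_ij), where e_ij is the chance expectation."""
    i, j = pair
    expected = p[i] * p[j] * (1 if i == j else 2)  # factor 2: unordered pair
    return math.log2(q[pair] / expected)

# Relative entropy: H = sum_ij q_ij * s_ij (average information per pair).
H = sum(q_ij * log_odds_bits(pair) for pair, q_ij in q.items())
print(f"H = {H:.3f} bits per aligned pair")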
Comparison of the PAM and BLOSUM Amino Acid Substitution Matrices

There are several important differences in the ways that the PAM and BLOSUM scoring matrices were derived, and these differences should be appreciated in order to interpret the results of protein sequence alignments obtained with these matrices. First, the PAM matrices are based on a mutational model of evolution that assumes amino acid changes occur as a Markov process, each amino acid change at a site being independent of previous changes at that site. Changes are scored in sequences that are at least 85% similar, after predicting a phylogenetic history of the changes in each family. Thus, the PAM matrices are based on the first changes that occur as proteins diverge from a common ancestor during the evolution of a protein family. Matrices that may be used to compare more distantly related proteins are then derived by extrapolation from these short-term changes, assuming that the more distant changes are a reflection of the short-term changes occurring over and over again: for each longer evolutionary interval, each amino acid can change to any other with the same frequency as observed in the short term. In contrast, the BLOSUM matrices are not based on an explicit evolutionary model. They are derived from all of the amino acid changes observed in an aligned region of a related family of proteins, regardless of the overall degree of similarity between the protein sequences. However, these proteins are known to be related biochemically and, hence, should share common ancestry. The evolutionary model implied in such a scheme is that the proteins in each family share a common origin, but closer versus more distal relationships are ignored, as if all members were derived equally from the same ancestor, a scheme called a starburst model of protein evolution. Second, the PAM matrices are based on scoring all amino acid positions in related sequences, whereas the BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most alike common regions in related sequences. Thus, the PAM model is designed to track the evolutionary origins of proteins, whereas the BLOSUM model is designed to find their conserved domains.

9.1.2 Nucleic acid PAM scoring matrices

Just as amino acid scoring matrices have been used to score protein sequence alignments, nucleotide scoring matrices for scoring DNA sequence alignments have also been developed. A DNA matrix can incorporate ambiguous DNA symbols and information from mutational analysis, which reveals that transitions (substitutions between the purines A and G or between the pyrimidines C and T) are more probable than transversions (substitutions from purine to pyrimidine or from pyrimidine to purine) (Li and Graur 1991). These substitution matrices may be used to produce global or local alignments of DNA sequences. States et al. (1991) have developed a series of nucleic acid PAM matrices based on a Markov transition model similar to that used to generate the Dayhoff PAM scoring matrices. Although designed to improve the sensitivity of similarity searches of sequence databases, these matrices may also be used to score nucleic acid alignments. The advantages of using these matrices are that they are based on a defined evolutionary model and that the statistical significance of alignment scores obtained by local alignment programs may be evaluated, as described later in this lesson.
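The extrapolation just described, deriving a matrix for a longer evolutionary interval from the short-term PAM1 matrix, amounts to repeated matrix multiplication. A minimal sketch follows, using the uniform-rate nucleotide PAM1 values derived in the next paragraph (0.99 on the diagonal, 0.00333 elsewhere); the same idea applies to the 20 x 20 amino acid mutation matrix.

def mat_mult(a, b):
    # multiply two square matrices given as lists of rows
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam(n):
    # PAMn mutation matrix = PAM1 multiplied by itself n times
    pam1 = [[0.99 if i == j else 0.00333 for j in range(4)] for i in range(4)]
    m = pam1
    for _ in range(n - 1):
        m = mat_mult(m, pam1)
    return m

# after 100 PAMs the chance that a nucleotide is unchanged has dropped
# well below 0.99 (this prints about 0.446)
print(round(pam(100)[0][0], 3))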
To prepare these DNA PAM matrices, a PAM1 mutation matrix representing 99% sequence conservation and one PAM of evolutionary distance (1% mutations) was first calculated. For a model in which all mutations from any nucleotide to any other are equally likely, and in which the four nucleotides are present at equal frequencies, the four diagonal elements of the PAM1 matrix representing no change are 0.99, whereas the off-diagonal elements representing change are each 0.00333. The values are chosen so that the sum of all possible changes for a given nucleotide in the PAM1 matrix is 1% (3 × 0.00333 ≈ 0.01). For a biased mutation model in which a given transition is threefold more likely than a transversion, the off-diagonal matrix elements corresponding to the one possible transition for each nucleotide are 0.006 and those for the two possible transversions are 0.002, and the sum for each nucleotide is again 1% (0.006 + 0.002 + 0.002 = 0.01). As with the amino acid matrices, the above matrix values are then used to produce log odds scoring matrices that represent the frequency of substitutions expected at increasing evolutionary distances. In terms of an alignment, the odds score for matching nucleotides i and j is the probability of obtaining the match in related sequences divided by the random probability of aligning i and j, giving sij = log (Mij / pj), where Mij is the value in the mutation matrix and pi and pj are the fractional compositions of each nucleotide, assumed to be 0.25. The base of the logarithm can be any value, corresponding to multiplying every value in the matrix by the same constant; with such scaling variations, the ability of the matrix to distinguish significant from chance alignments is not altered. The resulting sij values are usually expressed in units of bits (logarithm to the base 2) and rounded off to the nearest whole integer. From these PAM1 matrices, additional log odds matrices at an evolutionary distance of n PAMs may be obtained by multiplying the PAM1 matrix by itself n times. The ability of each matrix to distinguish real from random nucleotide matches in an alignment, designated H and measured in bit units (log2), can be calculated as the relative entropy described above, H = Σ qij sij, where qij is the frequency of the i-j pair and the sij scores are also expressed in bit units. At increasing evolutionary distances, the log odds values of the match and mismatch scores change, as does the percentage of nucleotides expected to have changed at that distance; the identity score will be 100 minus this value. This percentage is not as great as the PAM score because of expected back-mutation over longer time periods, and H decreases as the PAM value increases.

Check your progress:
1. List out the important differences between PAM and BLOSUM.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

9.1.3 Gap penalties

The inclusion of gaps and gap penalties is necessary in order to obtain the best possible alignment between two sequences.
A gap opening penalty for any gap (g) and a gap extension penalty for each element in the gap (r) are most often used, to give a total gap score wx = g + rx, where x is the length of the gap. Note that in some formulations of the gap penalty, the equation wx = g + r (x − 1) is used instead; the gap extension penalty is then not added to the gap opening penalty until the gap size is 2. Although this difference does not affect the alignment obtained, one needs to know which method is being used by a particular computer program if the correct results are to be obtained. In the former case, the penalty for a gap of size 1 is g + r, whereas in the latter case this value is g. The values for these penalties have to be chosen to balance the scores in the scoring matrix that is used. Thus, the Dayhoff log odds matrix at PAM250 is expressed in units of approximately 1/3 bit, but if this matrix were converted to 1/2-bit units, the same gap penalties would no longer be appropriate. If too high a gap penalty is used relative to the range of scores in the substitution matrix, gaps will never appear in the alignment. Conversely, if the gap penalty is too low compared to the matrix scores, gaps will appear everywhere in the alignment in order to align as many of the same characters as possible. Fortunately, most alignment programs suggest gap penalties that are appropriate for a given scoring matrix in most situations. In the GCG and FASTA program suites, the scoring matrix itself is formatted in a way that includes default gap penalties. Examples of the values of g and r used by various alignment programs are shown on the book Web site. When deciding gap penalties for local alignment programs, another consideration is that the penalties should be large enough to produce a truly local alignment of the sequences. Examples of suitable values are given in Altschul and Gish (1996), and Pearson (1996, 1998) has found that the use of appropriate gap penalties provides an improved local alignment based on statistical analysis. These studies are described in detail in the following section.

The mathematician Peter Sellers (1974) showed that if sequence alignment is formulated in terms of distance instead of similarity between sequences, a biologically more appealing interpretation of gaps is possible. The distance is the number of changes that must be made to convert one sequence into the other and represents the number of mutations that will have occurred following separation of the genes during evolution; the greater the distance, the more distantly related are the sequences in evolution. In this scheme, an identity contributes a distance of 0 and a substitution a positive distance of 1, so that for each position the distance score plus the similarity score is equal to 1. Sellers proved that this distance formulation of sequence alignment has a desirable mathematical property that also makes evolutionary sense. If three sequences, a, b, and c, are compared using the above scoring scheme, the distance score is a metric that satisfies the triangle inequality, d(a,c) ≤ d(a,b) + d(b,c), where d(a,b) is the distance between sequences a and b, and likewise for the other two distances. Expressed another way, if the three possible distances between three sequences are obtained, then the distance between any first pair plus that for any second pair cannot be less than the distance for the third pair.
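The two gap-penalty conventions described above differ only in whether the extension penalty is charged for the first gap position. A minimal sketch, with arbitrary illustrative values of g and r:

def gap_score_a(x, g=11, r=1):
    # w_x = g + r * x : extension penalty charged for every gap position
    return g + r * x

def gap_score_b(x, g=11, r=1):
    # w_x = g + r * (x - 1) : extension charged only from the second position
    return g + r * (x - 1)

# for a gap of length 1 the two conventions differ by exactly r
print(gap_score_a(1), gap_score_b(1))   # prints 12 11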
Violating this triangle inequality would not be consistent with the expected evolutionary origin of the sequences. To satisfy the metric requirement, the scoring of individual matches, mismatches, and gaps must be such that, in an alignment of two identical sequences a and a′, d(a,a′) must equal 0, and for two totally different sequences b and b′, d(b,b′) must equal 1. For any two sequences a and b, d(a,b) = d(b,a). Hence, it is important that the distance score for changing one sequence character into a second be the same as the converse score for changing the second into the first, if the distance score of the alignment is to remain a metric and to make evolutionary sense. The above relationships were shown by Sellers to be true for gaps of length 1 in a sequence alignment. He also showed that the smallest number of steps required to change one sequence into the other can be calculated by the dynamic programming algorithm. The method is similar to those discussed above for the Needleman-Wunsch global and Smith-Waterman local alignments, except that those methods find the maximum similarity between two sequences, as opposed to the minimum distance found by the Sellers analysis. Subsequently, Smith et al. (1981) and Smith and Waterman (1981a,b) showed that gaps of any length could also be included in an alignment and still provide a distance metric for the alignment score. In this formulation, the gap penalty was required to increase as a function of the gap length. The argument was made that a single mutational event involving one gap of n residues should be more likely to have occurred than n single gaps. Thus, to increase the likelihood of such gaps of length greater than 1 being found, the penalty for a gap of length n was made smaller than the score for n individual gaps. The simplest way of implementing this feature was to have the gap score wx be a linear function of gap length consisting of two parts, a larger gap opening penalty (g) and a smaller gap extension penalty (r) for each extra position in the gap, or wx = g + rx, where x is the length of the gap, as described above. This type of gap penalty is referred to in the literature as an affine gap penalty. Any other formula for scoring gap penalties should also work, provided that the score increases with the length of the gap but remains less than the score of x individual gaps. Scoring of gaps by the above linear function of gap length has now become widely used in sequence alignment, although more complex gap penalty functions have also been used (Miller and Myers 1988).

9.1.4 Optimal combinations of scoring matrices and gap penalties for finding related proteins

The usefulness of combinations of scoring matrices and gap penalties for identifying related proteins, including distantly related ones, has been compared (Feng et al. 1985; Doolittle 1986; Henikoff and Henikoff 1993; Pearson 1995, 1996, 1998; Agarwal and States 1998; Brenner et al. 1998). The method generally used is to start with a database of protein sequences organized into families, based either on sequence similarity or on structural similarity. A member of a family is then selected and used as a query sequence in a search of the entire database from which the sequence came, using a database similarity search method (FASTA, BLAST, SSEARCH).
These methods basically use the dynamic programming algorithm and a choice of scoring matrix and gap penalties to produce alignment scores. Details of these studies are described on the book Web site. In summary, the following general observations have been made: (1) Some scoring matrices are superior to others at finding related proteins based on either sequence or structure. For example, matrices prepared by examining the full range of amino acid substitutions in families of related proteins, such as the BLOSUM62 matrix, perform better than matrices based on variations in closely related proteins that are extrapolated to produce matrices for more distantly related sequences, such as the Dayhoff PAM250 matrix. (2) Gap penalties that are adjusted, for a given scoring matrix, to produce a local alignment are the most suitable. (3) To identify related sequences, the significance of the alignment scores should be estimated, as described in the following section. These methods provide the means to demonstrate sequence similarity in even the most distantly related proteins. For closely related proteins, a PAM-type scoring matrix that matches the evolutionary separation of the sequences may provide a higher-scoring alignment. Another set of studies has suggested that a global alignment algorithm, in combination with scoring matrices that have all positive values and suitable gap penalties, can be used to align proteins that have limited sequence similarity (i.e., about 25% identity) but similar structure (Vogt et al. 1995; Abagyan and Batalov 1997).

9.2 Let us Sum up

In this lesson, we discussed (i) the use of scoring matrices and gap penalties in sequence alignments, (ii) amino acid substitution matrices, (iii) nucleic acid PAM scoring matrices, (iv) gap penalties, and (v) optimal combinations of scoring matrices and gap penalties.

9.3 Lesson end activities
1. Where did the BLOSUM62 alignment score matrix come from?

9.4 Check your progress: Model answers
1. Your answer must include these points:
PAM matrices are based on a mutational model of evolution; BLOSUM matrices are not based on an explicit evolutionary model.
PAM matrices are based on scoring all amino acid positions in related sequences; BLOSUM matrices are based on substitutions and conserved positions in blocks.
The PAM model is designed to track the evolutionary origins of proteins; the BLOSUM model is designed to find their conserved domains.

9.6 Points for Discussion
1. "Scoring matrices play a vital role in sequence alignment" - Make a critical analysis of this statement.
2. "The inclusion of gaps and gap penalties is necessary to get the best possible alignment" - Justify.

9.7 References
1. Altschul, S.F. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 555-565 (1991).
2. Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. A model of evolutionary change in proteins. In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352 (1978).
3. Henikoff, S. and Henikoff, J. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915-10919 (1992).
4. Eddy, S.R. Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 22(8), 1035-1036 (2004).
LESSON - 10 ASSESSING THE SIGNIFICANCE OF SEQUENCE ALIGNMENTS
10.0 Aims and Objectives
10.1 Assessing the significance of sequence alignments
10.1.1 Significance of global alignments
10.1.2 Modeling a random DNA sequence alignment
10.1.3 Alignments with gaps
10.1.4 The Gumbel extreme value distribution
10.1.5 Methods for Calculating the Parameters of the Extreme Value Distribution
10.1.6 The Statistical Significance of Individual Alignment Scores between Sequences and the Significance of Scores Found in a Database Search Are Calculated Differently
10.1.7 FASTA and BLAST
10.2 Let us Sum up
10.3 Lesson end activities
10.4 Check your progress
10.5 Points for Discussion
10.6 References

10.0 Aims and Objectives:
This lesson discusses assessing the significance of sequence alignments: the significance of global alignments, modeling a random DNA sequence alignment, alignments with gaps, the Gumbel extreme value distribution, a quick determination of the significance of an alignment score, the importance of the type of scoring matrix for statistical analyses, the significance of gapped local alignments, methods for calculating the parameters of the extreme value distribution, and FASTA and BLAST.

10.1 Assessing the Significance of Sequence Alignments

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. For sequences that are quite similar, such as two proteins that are clearly in the same family, such an analysis is not necessary. A significance question arises when comparing two sequences that are not so clearly similar but are shown to align in a promising way. In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences or would just as likely be found if the sequences were not related. A significance test is also needed to evaluate the results of a database search for sequences similar to a query sequence by the BLAST and FASTA programs; the test is applied to every sequence matched so that the most significant matches are reported. Finally, a significance test can also help to identify regions in a single sequence that have an unusual composition suggestive of an interesting function. Our present purpose is to examine the significance of sequence alignment scores obtained by the dynamic programming method. Originally, the significance of sequence alignment scores was evaluated on the basis of the assumption that alignment scores followed a normal statistical distribution. If sequences are randomly generated in a computer by a Monte Carlo or sequence-shuffling method (as in generating a sequence by picking marbles representing the four bases or the 20 amino acids out of a bag, with the number of each type proportional to the frequency found in real sequences), the distribution of alignment scores may look normal at first glance. However, further analysis of the alignment scores of random sequences reveals that the scores follow a different distribution than the normal distribution, called the Gumbel extreme value distribution.
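A small simulation makes this point concrete. The sketch below scores pairs of random DNA sequences by their best ungapped segment (match +1, mismatch −1) for one fixed alignment register, an assumption chosen to keep the example short, and shows that the distribution of best scores is skewed toward high values rather than symmetric like a normal curve. The sequence length and number of trials are arbitrary.

import random

def best_segment_score(a, b):
    # best-scoring ungapped run for one fixed alignment of equal-length strings
    best = 0
    run = 0
    for x, y in zip(a, b):
        run = max(0, run + (1 if x == y else -1))
        best = max(best, run)
    return best

random.seed(1)
scores = []
for _ in range(2000):
    a = "".join(random.choice("ACGT") for _ in range(200))
    b = "".join(random.choice("ACGT") for _ in range(200))
    scores.append(best_segment_score(a, b))

# because each value is itself a maximum, the right tail is stretched:
# the largest best score sits far above the median
scores.sort()
print("median best score:", scores[len(scores) // 2])
print("largest best score:", scores[-1])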
In this section, we review some of the earlier methods used for assessing the significance of alignments, then describe the extreme value distribution, and finally discuss some useful programs for this type of analysis, with some illustrative examples. The statistical analysis of alignment scores is much better understood for local alignments than for global alignments. Recall that the Smith-Waterman alignment algorithm and the scoring system used to produce a local alignment are designed to reveal regions of closely matching sequence with a positive alignment score. In random or unrelated sequence alignments, such regions are rarely found. Hence, their presence in real sequence alignments is significant, and the probability of their occurring by chance alignment of unrelated sequences can be readily calculated. The significance of the scores of global alignments, on the other hand, is more difficult to determine. Using the Needleman-Wunsch algorithm and a suitable scoring system, there are many ways to produce a global alignment between any pair of sequences, and the scores of many different alignments may be quite similar. When random or unrelated sequences are compared using a global alignment method, they can have very high scores, reflecting the tendency of the global algorithm to match as many characters as possible. Thus, assessing the statistical significance of a global alignment is a much more difficult task. Rather than being used as a strict test for sequence homology, a global alignment is more appropriately used to align sequences that are of approximately the same length and are already known to be related. The method conveniently shows which sequence characters align; one can then use this information to perform other types of analyses, such as structural modeling or evolutionary analysis.

10.1.1 Significance of Global Alignments

In general, global alignment programs use the Needleman-Wunsch alignment algorithm and a scoring system that scores the average match of an aligned nucleotide or amino acid pair as a positive number. Hence, the score of the alignment of random or unrelated sequences grows proportionally to the length of the sequences. In addition, there are many possible different global alignments depending on the scoring system chosen, and small changes in the scoring system can produce a different alignment. Thus, finding the best global alignment and knowing how to assess its significance is not a simple task, as reflected by the absence of studies in the literature. Waterman (1989) provided a set of means and standard deviations of global alignment scores between random DNA sequences, using mismatch and gap penalties that produce a linear increase in score with sequence length, a distinguishing feature of global alignments. However, these values are of limited use because they are based on a simple gap scoring system. Abagyan and Batalov (1997) suggested that global alignment scores between unrelated protein sequences followed the extreme value distribution, similar to local alignment scores. However, since the scoring system they used favored local alignments, the alignments they produced may not have been global but local. Unfortunately, there is no equivalent theory on which to base an analysis of global alignment scores as there is for local alignment scores.
For zero mismatch and gap penalties, which is the most extreme condition for a global alignment, giving the longest subsequence common to two sequences, the score P between two random or unrelated sequences is proportional to sequence length n, P ≈ cn (Chvátal and Sankoff 1975), but it has not proven possible to calculate the proportionality constant c (Waterman and Vingron 1994a). To evaluate the significance of a Needleman-Wunsch global alignment score, Dayhoff (1978) and Dayhoff et al. (1983) evaluated Needleman-Wunsch alignment scores for a large number of randomized and unrelated but real protein sequences, using their log odds scoring matrix at 250 PAMs and a constant gap penalty. The distribution of the resulting random scores matched a normal distribution. On the basis of this analysis, the significance of an alignment score between two apparently related sequences A and B was determined by obtaining a mean and standard deviation of the alignment scores of 100 random permutations or shufflings of A against 100 of B, conserving the length and amino acid composition of each. For the score between A and B to be considered significant, the authors specified that the real score should be at least 3-5 standard deviations greater than the mean of the random scores. This level of significance means that the probability that two unrelated sequences would give such a high score is 1.35 × 10⁻³ (3 S.D.) and 2.87 × 10⁻⁷ (5 S.D.). In evaluating an alignment, two parameters were varied to maximize the alignment score: first, a constant called the matrix bias was added to each value in the scoring matrix, and second, the gap penalty was varied. The statistical analysis was then performed after the score between A and B had been maximized. Recall that the log odds PAM250 matrix values vary from −7 to 17 in units of approximately 1/3 bit. The bias varied from 2 to 20 and had the effect of increasing the score by the bias times the number of alignment positions where one amino acid is matched to another. As a result, the alignment frequently decreases in length because there are fewer gaps, assuming the gap penalty is not also changed. It was on these optimized alignments that the significance test was performed. Feng et al. (1985) used the same method to compare the significance of alignment scores obtained with different scoring matrices; they used 25-100 pairs of randomized sequences for each test of an alignment. There are several potential problems with this approach, some of which apply to other methods as well. First, the method is expensive in terms of the number of computational steps, which increase at least as fast as the square of the sequence length, because many Needleman-Wunsch alignments must be done; however, this problem is much reduced with the faster computers and more efficient algorithms of today. Second, if the amino acid composition is unusual, or if there is a region of low complexity (for example, many occurrences of one or two amino acids), the analysis will be oversimplified. Third, when natural sequences were compared more closely, the patterns found did not conform to a random assortment of the basic building blocks of sequences but rather to a random assortment of varying sequence segments. Consider the use of the 26-letter alphabet in English sentences: alphabet letters do not appear in random order in these sentences but rather within a vocabulary of meaningful words.
What happens if sentences, which are made up of words, are compared? On the one hand, if just the alphabet composition of many sentences is compared, not much variation is seen. On the other hand, if words are compared, much greater variation is found, because there are many more words than alphabet characters. If random sequences are produced from segments of sequences, rather than from individual residues, more variation is observed, more like that seen when unrelated natural sequences are compared. The increased variation found among natural sequences is not surprising when one thinks of DNA and proteins as sources of information. For example, protein-encoding regions of DNA sequences are constrained by the genetic code and by amino acid patterns that produce functional domains in proteins. Lipman et al. (1984) analyzed the distribution of scores among 100 vertebrate nucleic acid sequences and compared these scores with those of randomized sequences prepared in different ways. When the randomized sequences were prepared by shuffling the sequence to conserve base composition, as was done by Dayhoff and others, the standard deviation was approximately one-third less than that of the distribution of scores of the natural sequences. Thus, natural sequences are more variable than randomized ones, and using such randomized sequences for a significance test may lead to an overestimation of the significance. If, instead, the random sequences were prepared in a way that maintained the local base composition, by producing them from overlapping fragments of sequence, the distribution of scores had a higher standard deviation, closer to that of the natural sequences. The conclusion is that the presence of conserved local patterns can influence the score in statistical tests, such that an alignment can appear to be more significant than it actually is. Although this study was done using the Smith-Waterman algorithm with nucleic acids, the same cautionary note applies to other types of alignments. The final problem with the above methods is that the correct statistical model for alignment scores was not used; nevertheless, these earlier types of statistical analysis set the stage for later ones. The GCG alignment programs have a RANDOMIZATION option, which shuffles the second sequence and calculates similarity scores between the unshuffled sequence and each of the shuffled copies. If the new similarity scores are significantly smaller than the real alignment score, the alignment is considered significant. This analysis is only useful for providing a rough approximation of the significance of an alignment score and can easily be misleading. Dayhoff (1978) and Dayhoff et al. (1983) devised a second method for testing the relatedness of two protein sequences that can accommodate some local variation. This method is useful for finding repeated regions within a sequence, similar regions that are in a different order in two sequences, or a small conserved region such as an active site. As used in a computer program called RELATE (Dayhoff 1978), all possible segments of a given length of one sequence are compared with all segments of the same length from another. An alignment score using a scoring matrix is obtained for each comparison to give a score distribution among all of the segments.
A segment comparison score, in standard deviation units, is calculated as the difference between the value for the real sequences and the average value for the random sequences, divided by the standard deviation of the scores from the random sequences. A version of the RELATE program that runs on many computer platforms is included with the FASTA distribution package of W. Pearson. Because this program evaluates scores against a normal distribution, it provides only an approximate indication of the significance of an alignment.

10.1.2 Modeling a Random DNA Sequence Alignment

The above types of analyses assume that alignment scores between random sequences follow a normal distribution that can be used to test the significance of a score between two test sequences. For a number of reasons, mathematicians were concerned that this statistical model might not be correct. Let us start by creating two aligned random DNA sequences by drawing pairs of marbles from a large bag filled with four kinds of labeled marbles. The marbles are in equal proportions and labeled A, T, G, and C to represent an assumed equal representation of the four nucleotides in DNA. Now consider the probability of removing six identical pairs, representing six columns in an alignment between two random sequences. The probability of removing one identical pair (an A and another A) is 1/4 × 1/4, but there are four possible identical pairs (A/A, C/C, G/G, and T/T), so that the probability of removing any identical pair is 4 × 1/4 × 1/4 = 1/4, and that of removing six identical pairs in a row is (1/4)⁶ ≈ 2.4 × 10⁻⁴. The probability of drawing a mismatched pair is 1 − 1/4 = 3/4, and that of drawing six mismatched pairs in a row is (3/4)⁶ ≈ 0.178. Most random alignments produced in this manner will have a mixture of a few matches and many mismatches. The calculations are a little more complex if the four nucleotides are not equally represented, but the results will be approximately the same. In general, the probability of drawing a matched pair is p = pA² + pC² + pG² + pT², where pX is the proportion of nucleotide X; p is an important parameter to remember for the discussion below. An even more complicated situation arises when the two random sequences to be aligned have different nucleotide distributions; one approach is then to use an average p for the two sequences. This example illustrates the difficulty of modeling sequence alignments between two organisms that have different base compositions. The above model is not suitable for predicting the number of sequentially matched positions between random sequences of a given length. To estimate this number, a DNA sequence alignment may also be modeled by coin-tossing experiments (Arratia and Waterman 1989; Arratia et al. 1986, 1990). Random alignments will normally comprise mixtures of matches and mismatches, just as a series of coin tosses will produce a mixture of heads and tails. The chance of producing a series of matches in a sequence alignment with no mismatches is similar to the chance of tossing a coin and coming up with a series of only heads. The numbers of interest are the highest possible score that can be obtained and the probability of obtaining such a score in a certain number of trials. In such models, coins are usually considered to be "fair" in that the probability of a head is equal to that of a tail.
The coin in this example has a certain probability p of scoring a head (H) and q = 1 − p of scoring a tail (T). The longest expected run of heads R in n tosses was shown by Erdös and Rényi to be given by R = log1/p(n). If p = 0.5, as for a normal coin, then the base of the logarithm is 1/p = 2. For the example of n = 100 tosses, R = log2(100) = loge(100)/loge(2) = 4.605/0.693 = 6.65. To use the coin model, an alignment of two random sequences a = a1, a2, a3, ..., an and b = b1, b2, b3, ..., bn, each of the same length n, is converted to a series of heads and tails: if ai = bi, the equivalent toss result is an H; otherwise, the result is a T. The longest run of matches in the alignment is then equivalent to the longest run of heads in the coin-tossing sequence, and the Erdös and Rényi law can be used to predict the longest run of matches. This prediction, however, applies only to one particular alignment of random sequences, such as the one generated above by the marble draw. In performing a sequence alignment, two sequences are in effect shifted back and forth with respect to each other to find regions that can be aligned; in addition, the sequences may be of different lengths. If two random sequences of lengths m and n are aligned in this manner, the same law still applies, but the length of the predicted longest match becomes log1/p(mn) (Arratia et al. 1986). If m = n, the expected longest run of matches is doubled. Thus, for DNA sequences of length 100 and p = 0.25 (equal representation of each nucleotide), the longest expected run of matches is 2 × log1/p(n) = 2 × log4(100) = 2 × loge(100)/loge(4) = 2 × 4.605/1.386 = 6.65, the same number as in the coin-tossing experiment. This number corresponds to the longest subalignment that can be expected between two random sequences of this length and composition. A more precise formula for the expectation value, or mean, of the longest match M and its variance has been derived (Arratia et al. 1986; Waterman et al. 1987; Waterman 1989). It also applies when there are k mismatches in the alignment, except that an additional term proportional to k log1/p log1/p(qmn) appears in the equation, with a constant of proportionality that depends on k (Arratia et al. 1986). The log log term is small and can be replaced by a constant (Mott 1992), and simulations also suggest that it is not important (Altschul and Gish 1996). Altschul and Gish (1996) have found a better fit to the formula when the length of each sequence is reduced by the expected length of a match. In the example given above with two sequences of length 100, the expected length of a match was 6.65. As the sequences slide along each other, it is not possible to have end overlaps shorter than 7 because there is not enough sequence remaining. Hence, the effective length of the sequences is 100 − 7 = 93 (Altschul and Gish 1996). This correction is also used in the calculation of statistical significance by the BLAST algorithm. This logarithmic relationship between the expected longest match and sequence length is fundamentally important for calculating the statistical significance of alignment scores.
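The Erdös and Rényi estimate described above is simple to compute. A minimal sketch, reproducing the numbers used in the text:

import math

def match_probability(freqs):
    # p = sum of squared character frequencies (pA^2 + pC^2 + pG^2 + pT^2)
    return sum(f * f for f in freqs.values())

def expected_longest_run(m, n, p):
    # Erdos-Renyi law: expected longest run of matches is log base 1/p of (m*n)
    return math.log(m * n) / math.log(1.0 / p)

p = match_probability({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})  # 0.25
print(round(expected_longest_run(100, 100, p), 2))   # about 6.64, as in the text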
Basically, this relationship states that as the lengths of random or unrelated sequences increase, the mean of the highest possible local alignment scores will be proportional to the logarithm of the product of the sequence lengths, or twice the logarithm of the sequence length if the lengths are equal (since log (nn) = 2 log n). It also predicts a constant variance among the scores of random or unrelated sequences, and this prediction is also borne out by experiment. It is important to emphasize once again that this relationship depends on the use of scoring parameters appropriate for a local alignment algorithm, such as +1 for a match and −0.9 for a mismatch, or a scoring matrix that scores the average aligned position as negative, and also on the use of sufficiently large gap penalties. Such a scoring system gives rise to positive scoring regions only rarely, and the significance of their scores can then be estimated as described herein. Another way of describing the result uses a different parameter, λ, where λ = loge(1/p) (Karlin and Altschul 1990). Recall that p is the probability of a match between the same two characters, given above as 1/4 for matching a random pair of DNA bases, assuming equal representation of each base in the sequences. p may also be calculated as the probability of a match averaged over scoring matrix and sequence composition values. In practice, it is λ that is more commonly used with scoring matrix values; the calculation of λ and also of K is described below and in more detail on the book Web site. In sequence analysis, it is more useful to work with alignment scores than with alignment lengths. The expected or mean length of the longest match between two random sequences, given by the above relationships, can easily be converted to an alignment score by using the match and mismatch or scoring matrix values, along with some simple normalization; thus, in addition to predicting the length, these relationships can also predict the mean score.

10.1.3 Alignments with Gaps

It was predicted on mathematical grounds, and shown experimentally, that a similar type of analysis holds for sequence alignments that include gaps (Smith et al. 1985). When Smith et al. (1985) optimally aligned a large number of unrelated vertebrate and viral DNA sequences of different lengths (n and m) and their complements to each other, using a dynamic programming local alignment method with a score of +1 for matches, −0.9 for mismatches, and −2 for a single gap penalty (longer gaps were not considered, in order to simplify the analysis), a plot of the similarity score (S) versus log1/p(nm) produced a straight line with approximately constant variance. This result is as expected from the above model, except that with the inclusion of gaps the slope was increased; the fitted relationship was of the form Smean = 2.55 log1/p(mn) − 8.99, with constant standard deviation σ = 1.78. This result was then used to calculate how many standard deviations separated the predicted mean of the local alignment scores for unrelated sequences from the scores for test pairs of sequences. If the actual alignment score exceeds the predicted Smean by several standard deviations, the alignment score should be significant.
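A minimal sketch of this significance estimate, using the fitted constants quoted above (slope 2.55, intercept −8.99, σ = 1.78) and the sequence data from the worked example in the next paragraph:

import math

def s_mean(m, n, p, slope=2.55, intercept=-8.99):
    # mean local alignment score expected between unrelated sequences,
    # from the linear fit of Smith et al. (1985)
    return slope * math.log(m * n) / math.log(1.0 / p) + intercept

mean = s_mean(2948, 431, 0.279)
z = (37.20 - mean) / 1.78          # standard deviations above the random mean
print(round(mean, 1), round(z, 1)) # about 19.1 and 10.2, as in the text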
For example, the expected score between two unrelated sequences of lengths 2948 and 431, with average p = 0.279, is Smean = 2.55 × log1/0.279(2948 × 431) − 8.99 = 2.55 × loge(2948 × 431)/loge(1/0.279) − 8.99 = 2.55 × 14.1/1.28 − 8.99 = 28.1 − 8.99 = 19.1. The actual optimal alignment score between the two real sequences of these lengths was 37.20, which exceeds the score expected for random sequences by (37.20 − 19.1)/1.78 = 10.2 standard deviations. Is this number of standard deviations significant? Smith et al. (1985) and Waterman (1989) suggested the use of a conservative statistic known as Chebyshev's inequality, which is valid for many probability distributions: the probability that a random variable exceeds its mean by k standard deviations is less than or equal to 1/k². In this example, where the actual score is 10 standard deviations above the mean, the probability is (1/10)² = 0.01. Waterman (1989) noted that for low mismatch and gap penalties, e.g., +1 for matches, −0.5 for mismatches, and −0.5 for a single gap penalty, the alignment scores between random sequences estimated as above are not accurate, because the score increases linearly with sequence length instead of with the logarithm of the length. The linear relationship arises when the alignment is more global in nature, and the logarithmic relationship when it is local. Waterman (1989) fitted alignment scores from a large number of randomly generated DNA sequences of varying lengths to either the log(n) relationship expected for high-valued mismatch and gap penalties or the linear relationship expected for low-valued ones. The results provide the mean and standard deviation of an alignment score for several scoring schemes, assuming a constant gap penalty. With further mathematical analysis, it became apparent that the scores of alignments between random or unrelated sequences follow a distribution called the Gumbel extreme value distribution (Arratia et al. 1986; Karlin and Altschul 1990). This type of distribution is typical of values that are the highest or best scores of a variable, such as the longest run of heads in the coin-tossing experiments discussed previously. Subsequently, S. Karlin and S. Altschul (1990, 1993) further developed the use of this distribution for evaluating the significance of ungapped segments in comparisons between a test sequence and a sequence database using the BLAST program (for review, see Altschul et al. 1994). The method is also used for evaluating the statistical features of repeats and of amino acid patterns and clusters within the same sequence (Karlin and Altschul 1990; Karlin et al. 1991). The SAPS program, developed by S. Karlin and colleagues at Stanford University and available at http://ulrec3.unil.ch/software/software.html, provides this type of analysis. The extreme value distribution is now widely used for evaluating the significance of the scores of local alignments of DNA and protein sequences, especially in the context of database similarity searches.

10.1.4 The Gumbel Extreme Value Distribution

When two sequences have been aligned optimally, the significance of a local alignment score can be tested on the basis of the distribution of scores expected by aligning two random sequences of the same length and composition as the two test sequences (Karlin and Altschul 1990; Altschul et al. 1994; Altschul and Gish 1996).
These random sequence alignment scores follow a distribution called the extreme value distribution, which is somewhat like a normal distribution but with a positively skewed tail in the higher score range. When a set of values of a variable is obtained in an experiment, biologists are used to calculating the mean and standard deviation of the entire set, assuming that the distribution of values follows the normal distribution. For sequence alignments, this procedure would be like obtaining many different alignments, both good and bad, and averaging all of their scores. However, the biologically interesting alignments are those that give the highest possible scores; lower scores are not of interest. The experiment, then, is one of obtaining a set of values, keeping only the highest value, and discarding the rest. The focus changes from the statistical question of knowing the average score of random sequences to that of knowing how high a value will be obtained the next time another set of alignment scores of random sequences is generated. The distribution of alignment scores between random sequences therefore follows the extreme value distribution, not the normal distribution. After many alignments, a probability distribution of highest values will be obtained. The goal is to evaluate the probability that a score between random or unrelated sequences will reach the score found between two real sequences of interest. If that probability is very low, the alignment score between the real sequences is significant. Compared to the normal probability distribution, whose density is Yn = (1/√(2π)) exp(−x²/2), the extreme value distribution has density Yev = exp(−x − e⁻ˣ), whose tail decays much more slowly on the high-score side.

10.1.5 Methods for Calculating the Parameters of the Extreme Value Distribution

In the analysis of Altschul and Gish (1996), 10,000 random amino acid sequences of variable lengths were aligned using the Smith-Waterman method and a combination of a scoring matrix and a reasonable set of gap penalties for that matrix. The scores found by this method followed the extreme value distribution predicted by the underlying statistical theory. Values of K and λ were then estimated for each combination by fitting the data to the predicted extreme value distribution; readers should consult Tables V-VII in Altschul and Gish (1996) for a detailed list of the gap penalties tested. Altschul and Gish (1996) have cautioned users of these statistical parameters. First, the parameters were generated by alignment of random sequences that were produced assuming a particular amino acid distribution, which may be a poor model for some proteins. Second, the accuracy of λ and K cannot be estimated easily. Finally, for gap costs that give values of H < 0.15, the optimal alignment length is a significant fraction of the sequence lengths and produces a source of error called the edge effect. The effect occurs when the expected length of an alignment is a significant fraction of the sequence length and, as discussed earlier, alignments between sequences that overlap at their ends cannot be completed. The expected length is then subtracted from the sequence length before λ is estimated.
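Once K and λ have been estimated, the probability that the best local alignment score between two random sequences reaches a given value S follows from the extreme value distribution as P = 1 − exp(−Kmn e^(−λS)) (Karlin and Altschul 1990). A minimal sketch; the parameter values and sequence lengths below are arbitrary illustrations, not values from any published table:

import math

def evd_probability(S, m, n, K, lam):
    # Karlin-Altschul formula: probability that a chance local alignment
    # between random sequences of lengths m and n scores S or more
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * S))

# hypothetical parameters for some scoring system
K, lam = 0.13, 0.32
print(evd_probability(60, 250, 300, K, lam))   # a very small P suggests significance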
If no such edge correction is made, λ may be overestimated. These tabulated gap penalty values should also not be construed as the best or only choices for a given pair of sequences simply because the statistical parameters are available for them; choosing a gap penalty remains a matter of reasoned judgment. In exploring the effects of varying the gap penalty, it is important to recognize that as the gap penalty is lowered, the alignments produced will have more gaps and will eventually change from a local to a global type of alignment, even though a local alignment program is being used. In contrast, very large gap penalties give higher H values and produce alignments with no gaps, suggesting an increased ability to discriminate between related and unrelated sequences. In this respect, Altschul and Gish (1996) note that beyond a certain point, increasing the gap extension penalty does not change the parameters, indicating that most gaps in their simulations are probably of length 1. However, reducing the gap penalty can also allow an alignment to be extended and thereby create a higher-scoring alignment. Eventually, though, the optimal local alignment score between unrelated sequences loses its logarithmic relationship to sequence length and becomes a linear function of it; at that point, the gap penalties are no longer useful for obtaining local alignments, and the above statistical relationships are no longer valid. The higher the H value, the better the matrix can distinguish related from unrelated sequences. The lower the value of H, the longer the expected alignment; such conditions may be preferable if a longer alignment region is required, such as when testing a structural or functional model of a sequence by producing an alignment. Conversely, scoring parameters giving higher values of H should produce shorter, more compact alignments. If H < 0.15, the alignments may be very long, and the sequences then have a shorter effective length, since alignments starting near the ends of the sequences cannot be completed; this edge effect can lead to an overestimation of λ but can be corrected (Altschul and Gish 1996). Unfortunately, the above method for calculating the significance of an alignment score may not be used to test the significance of a global alignment score; the theory does not apply when these same substitution matrices are used for global alignments. Transformation of these matrices by adding a fixed constant value to each entry, or by multiplying each value by a constant, has no effect on the relative scores of a series of global alignments, so there is no theoretical basis for a statistical analysis of such scores as there is for local alignments (Altschul 1991). As discussed, two programs are commonly used for database similarity searches: FASTA and BLAST. Both calculate the statistical significance of the higher scores found with similar sequences, but the types of analyses used to determine that significance are somewhat different. BLAST uses the values of K and λ found by aligning random sequences, with n and m shortened to compensate for the inability of sequence ends to align. FASTA calculates the statistical significance from the distribution of scores with unrelated sequences found during the database search itself.
In effect, the mean and standard deviation of the low scores found in a given length range are calculated. These scores represent the expected range of scores of unrelated sequences for that sequence length (recall that local alignment scores increase as the logarithm of the sequence length). The number of standard deviations from this mean to the high scores of related sequences in the same length range (the z score) is then determined, and the significance of this z score is calculated according to the extreme value distribution expected of z scores. This method is discussed in greater detail below. Pearson (1996) showed that these two methods are equally useful in database similarity searches for detecting sequences more distantly related to the input query sequence. Pearson (1996) also determined the influence of scoring matrices and gap penalties on the alignment scores of moderately related and distantly related protein sequences in the same family. For two examples of moderately related sequences, the choice of scoring matrix and gap penalties (gap opening penalty followed by the penalty for each additional gap position) did not matter: BLOSUM50 −12/−2, BLOSUM62 −8/−2, Gonnet93 −10/−2, and PAM250 −12/−2 all produced statistically significant scores. The scores of distantly related proteins in the same family depended more on the choice of scoring matrix and gap penalty, and some scores were significant while others were not. Pearson recommends caution in evaluating alignment scores based on only one particular combination of scoring matrix and gap penalties. He also suggests that using a larger gap penalty, e.g., −14/−2 with BLOSUM50, can increase the selectivity of a database search for similarity (fewer sequences known to be unrelated will receive a significant alignment score). A difficulty encountered by FASTA in calculating statistical parameters during a database search is that of distinguishing unrelated from related sequences, because only the scores of unrelated sequences should be used. As score and sequence length information is accumulated during the search, the scores will include high, intermediate, and sometimes low scores of sequences that are related to the query sequence, as well as low scores and sometimes intermediate and even high scores of unrelated sequences. For example, a high score with an unrelated database sequence can occur because the database sequence has a region of low complexity, such as a high proportion of one amino acid. Regardless of the reason, these high scores must be pruned from the search if accurate statistical estimates are to be made. Pearson (1998) devised several such pruning schemes and determined the influence of each scheme on the success of a database search at demonstrating statistically significant alignment scores among members of the same protein family or superfamily; no particular scheme proved better than another. The above method does not necessarily ensure that the choice of scoring matrix and gap penalties provides a realistic set of local alignment scores. In the comparable situation of matching a test sequence to a database of sequences, the scores also follow the extreme value distribution.
For this situation, Mott (1992) has explained that for local alignments the end point of the alignment should, on average, be about halfway along the query sequence, whereas for global alignments the end point will lie beyond that halfway point. Pearson (1996) has pointed out that the presence of known, unrelated sequences in the upper part of the curve, where E < 1, can be an indication of an inappropriate scoring system.

10.1.6 The Statistical Significance of Individual Alignment Scores between Sequences and the Significance of Scores Found in a Database Search Are Calculated Differently

In performing a database search between a query sequence and a sequence database, a new comparison is made for each sequence in the database. Alignment scores between unrelated sequences are used by FASTA to calculate the parameters of the extreme value distribution; the probability that scores between unrelated sequences could reach as high as those found for matched sequences can then be calculated (Pearson 1998). Similarly, in the database similarity search program BLAST, estimates of the statistical parameters are calculated from the scoring matrix and the sequence composition, and the parameters are then used to calculate the probability of finding conserved patterns by chance alignment of unrelated sequences (Altschul et al. 1994). When performing such database searches, many trials are made in order to find the most strongly matching sequences. As more and more comparisons between unrelated sequences are made, the chance that one of the alignment scores will be the highest one yet found increases. The probability of finding a match by chance is therefore higher than the value calculated for a single sequence pair. The length of the query sequence is about the same as it would be in a normal sequence alignment, but the effective database sequence is very large and represents many different sequences, each one a different test alignment. Theory shows that the Poisson distribution should apply (Karlin and Altschul 1990, 1993; Altschul et al. 1994), as it did above for estimating the parameters of the extreme value distribution from many alignments between random sequences. If the probability of observing a given score in a single comparison is s, the probability of observing, in a database of D sequences, no alignment reaching that score is given by e^(−Ds), and that of observing at least one such score is P = 1 − e^(−Ds). For the range of values of P that are of interest, i.e., P < 0.1, P ≈ Ds. If two sequences are aligned by PRSS, as in the example given earlier, two levels of significance must be considered. The probability of the score may first be calculated using the estimates of λ and K: in the phage repressor alignment, P(s ≥ 401) = 3.7 × 10⁻²⁷. However, to estimate the extreme value parameters, 1000 shuffled sequences were compared, and the probability that one of those sequences would score as high as 401 is given by Ds, or 1000 × 3.7 × 10⁻²⁷ = 3.7 × 10⁻²⁴. These numbers are also shown in the statistical estimates computed by PRSS. Finally, if the score had arisen from a database search of 50,000 sequences, the probability of a score of 401 among this many sequence alignments would be 50,000 × 3.7 × 10⁻²⁷ ≈ 1.9 × 10⁻²², still a very small number, but 50,000 times larger than that for a single comparison.
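A minimal sketch of this multiple-comparison correction, reproducing the phage repressor numbers given above:

import math

def corrected_p(p_single, trials):
    # Poisson model: chance of at least one such score in `trials` comparisons;
    # approximately trials * p_single when that product is small
    return 1.0 - math.exp(-trials * p_single)

p1 = 3.7e-27                       # P(s >= 401) for a single comparison
print(corrected_p(p1, 1000))       # about 3.7e-24 for the 1000 shuffles
print(corrected_p(p1, 50000))      # about 1.9e-22 for a 50,000-sequence search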
These probability calculations are used for reporting the significance of scores with database sequences by FASTA and BLAST, as described.

10.1.7 FASTA and BLAST

The FASTA heuristic belongs to the FAST family of programs for sequence database search. All the heuristics in the FAST family follow three basic steps: I. Use a heuristic to find a good local alignment of the query U with each database sequence V1, ..., VN, and return only those alignments whose score is greater than T, where T is a threshold parameter chosen in advance. Let σi be the initial score of the alignment (returned by the heuristic) between U and Vi. II. Sort the σi in non-increasing order: σ1 ≥ σ2 ≥ ... ≥ σM > T. III. For the highest-ranking sequences, compute the optimized score using the dynamic programming algorithm, and return the top few alignments ranked according to their optimized score. "In practice, when the sequences are truly related, the optimized score is usually significantly higher than the initial score. This observation often helps distinguish between good alignments occurring by chance and true relationships." [3] The general idea behind the heuristic is very simple: first, we quickly compute a coarse estimate of the scores of the sequences, and then we refine the scores for the few top-ranking ones (using dynamic programming). Clearly, the success of this technique depends largely on the effectiveness of the heuristic chosen for step I.

Intuition for the heuristic. As the dynamic programming algorithm for sequence alignment fills out its table, the back-references along the diagonals correspond to matches in the alignment, while the vertical and horizontal back-references correspond to gaps. Intuitively, we want to find long diagonals in the dynamic programming table without actually going through the trouble of filling it out. When good matches (long, possibly broken diagonals) are found by the heuristic, they are merged together. This behavior characterizes the whole FAST family; FASTA itself takes an extra step: "after the best regions have been selected, [FASTA] tries to join nearby regions, even if they do not belong to the same diagonal". We therefore arrive at the following layout for an algorithm that computes σi (aligning U with Vi): I. Identify long diagonals in the dynamic programming table; if two diagonals are close by (the next one starts after the previous one ends), merge them. II. Choose the k0 largest diagonals. III. For each of the diagonals, find its score using some scoring matrix chosen in advance. IV. Return the top score. The most time-consuming part of the above procedure again lies in step I, so let us examine how this step could be performed efficiently. The key observation is that the alphabet over which the sequences are given is much smaller than the sequences themselves, so the short words occurring in the query can be indexed in a lookup table.

BLAST

BLAST is an acronym for Basic Local Alignment Search Tool. "The BLAST programs are among the most frequently used to search sequence databases worldwide". BLAST works by finding seeds, "which are short segment pairs between the query and a database sequence".
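Before the extension step is described in detail, the following minimal Python sketch shows the simplest form of seed finding: exact matches of words of length w, found through a hash table of query words. It is a deliberate simplification of what the next paragraphs describe; for protein searches, BLAST's word list also includes "neighborhood" words scoring at least T against a query word, which this sketch omits, and the example sequences are invented.

    # Hedged sketch: exact w-mer seeds between a query and one subject sequence.
    from collections import defaultdict

    def find_seeds(query, subject, w=3):
        """Return (query_pos, subject_pos) pairs where identical w-mers occur."""
        index = defaultdict(list)               # hash table of query w-mers
        for i in range(len(query) - w + 1):
            index[query[i:i + w]].append(i)
        seeds = []
        for j in range(len(subject) - w + 1):
            for i in index.get(subject[j:j + w], []):
                seeds.append((i, j))            # a hit: the same word at i and j
        return seeds

    print(find_seeds("GACGACCATAGACCAG", "TTCGACCATTTT"))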
The seeds are then extended in both directions until the maximum possible score for an extension of the particular seed is reached (mechanisms are in place to decide when an extension should terminate without getting lost in local maxima). The algorithm follows three major steps:

I. Preprocessing the query. Find all words x of length w (called w-mers, where w is a parameter of the algorithm that varies according to whether DNA or protein sequences are being compared) such that there is a substring y of length w in U with score(x, y) ≥ T, where T is another (threshold) parameter of the algorithm. In other words, a list of high-scoring w-mers is computed; note that this list may not contain all query w-mers, e.g., if a w-mer consists of common amino acids, its score with itself may fall below T, and it may be left out [3]. All such words are stored in a hash table X = {x1, ..., xn}, so that membership queries y ∈ X take O(1) time.

II. Finding hits. Each database sequence Vi is scanned, and all substrings of length w in Vi that are in X (the seeds) are recorded in a list {y1, ..., yk}.

III. Extending seeds. Each yi is extended in both directions: the seed grows as long as matches can be found; eventually, the score function starts decreasing. However, the cut-off does not happen immediately, since one wants to avoid local maxima; instead, the extension stops when the value of the score function drops below cSmax for some constant c < 1, where Smax is the maximum score seen so far. A sketch of the statistical analysis of how the cut-off point should be determined is given below. Clearly, this step is the most expensive one, since the number of hits and the sizes of the extensions can be large.

An improvement for step II can be achieved by replacing the hash table with a deterministic finite automaton that has transitions for each character in the alphabet and that recognizes, with its states, the words from the list of high-scoring words. In effect, the automaton moves a window over the sequences in the database in a very computationally cheap manner: one transition is made per character. However, since the most expensive step is step III, the improvement obtained from using a finite automaton is less significant than one might expect.

[3] gives a sketch of the main points of the statistical theory behind the extensions performed by BLAST, which allows one to derive the various parameters needed by BLAST. Given two random sequences s and t of lengths m and n, the following approximations can be obtained. Given a matrix of replacement costs sij for the pairs of characters in the alphabet and the probability pi of occurrence of each individual character in the sequences, we first compute a value λ by solving the equation

    Σi,j pi pj e^(λ sij) = 1

The parameter λ is the unique positive solution to this equation and can be obtained by Newton's method. Once λ is known, the expected number of distinct segment pairs between s and t with score above S is

    K m n e^(-λS)

where K is a calculable constant. In fact, the distribution of the number of segment pairs scoring above S is a Poisson distribution with mean given by the previous formula. From this, it is easy to derive expressions for useful quantities such as the average score, intervals within which the score will fall 90% of the time, and so on. (A small numerical sketch of this calculation follows the note below on obtaining BLAST.)

BLAST Programs

The BLAST program can either be downloaded and run as a command-line utility, "blastall", or accessed for free over the web.
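Returning to the statistical theory above, here is a hedged numerical sketch of obtaining λ by Newton's method and evaluating the expected number of hits. The two-letter alphabet, its frequencies, the toy score matrix, and the value of K are all made-up illustrations; K, in particular, is simply assumed rather than computed from the theory.

    # Hedged sketch: solve sum_ij p_i p_j e^(lambda*s_ij) = 1 for lambda,
    # then evaluate E = K*m*n*e^(-lambda*S). All numbers are toy values.
    import math

    p = {"A": 0.5, "B": 0.5}                         # character frequencies
    s = {("A", "A"): 1, ("B", "B"): 1,
         ("A", "B"): -2, ("B", "A"): -2}             # toy score matrix

    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * s[a, b])
                   for a in p for b in p) - 1.0

    def fprime(lam):
        return sum(p[a] * p[b] * s[a, b] * math.exp(lam * s[a, b])
                   for a in p for b in p)

    lam = 1.0                                        # start right of the root
    for _ in range(50):                              # Newton iterations
        step = f(lam) / fprime(lam)
        lam -= step
        if abs(step) < 1e-12:
            break

    K, m, n, S = 0.1, 200, 10**6, 40                 # K assumed for illustration
    print(round(lam, 4), K * m * n * math.exp(-lam * S))

For this toy matrix the equation can even be solved exactly (λ = ln((1 + √5)/2) ≈ 0.4812), which provides a check on the Newton iteration.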
The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms. BLAST is actually a family of programs (all included in the blastall executable). The following are some of the programs, ranked mostly in order of importance:

· Nucleotide-nucleotide BLAST (blastn): Given a DNA query, this program returns the most similar DNA sequences from the DNA database that the user specifies.

· Protein-protein BLAST (blastp): Given a protein query, this program returns the most similar protein sequences from the protein database that the user specifies.

· Position-Specific Iterative BLAST (PSI-BLAST): One of the more recent BLAST programs, used for finding distant relatives of a protein. First, a list of closely related proteins is created. These proteins are then combined into a "profile", a sort of average sequence. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive at picking up distant evolutionary relationships than the standard protein-protein BLAST.

· Nucleotide 6-frame translation-protein (blastx): This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

· Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.

· Protein-nucleotide 6-frame translation (tblastn): This program compares a protein query against the six-frame translations of a nucleotide sequence database.

· Large numbers of query sequences (megablast): When comparing large numbers of input sequences via the command-line BLAST, megablast is much faster than running BLAST multiple times. It concatenates many input sequences to form one large sequence before searching the BLAST database, then post-analyzes the search results to glean the individual alignments and statistical values.

Check your progress:
1. List the various types of BLAST programs available.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

10.2 Let us Sum up

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. A significance question arises when comparing two sequences that are not so clearly similar but are shown to align in a promising way.
In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences, or would just as likely be found if the sequences were not related. A significance test is also needed to evaluate the results of a database search for sequences that are similar to a query sequence. In this lesson, we have tried to describe the various programs and methods for the assessment of significance.

10.3 Lesson end activities
1. What is the significance of global alignment and local alignment?

10.4 Check your progress: Model answers
1. Your answer must include these points: BLASTP compares an amino acid query sequence against a protein sequence database; BLASTN compares a nucleotide query sequence against a nucleotide sequence database; BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database; TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands); TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

10.5 Points for Discussion
1. Closely analyse and comment on the problems with global alignment of sequences.
2. Explain the statistical significance of individual alignment scores between sequences.

10.6 References
1. Burke,J., Wang,H., Hide,W. and Davison,D.B. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.
2. Eddy,S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.
3. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.
4. Gordon,D., Desmarais,C. and Green,P. (2001) Automated finishing with autofinish. Genome Res., 11, 614–625.
5. Gotoh,O. (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. CABIOS, 9, 361–370.
6. Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
7. Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
8. Irizarry,K., Kustanovich,V., Li,C., Brown,N., Nelson,S., Wong, and Lee,C. (2000) Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nature Genet., 26, 233–236.
9. Irizarry,K., Hu,G., Wong,M.L., Licinio,J. and Lee,C. (2001) Single-nucleotide polymorphism identification in candidate gene systems of obesity. Pharmacogenomics J., 1, 193–203.
10. Lipman,D.J., Altschul,S.F. and Kececioglu,J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA, 86, 4412–4415.
UNIT III

LESSON - 11 PHYLOGENETIC ANALYSIS

11.0 Aims and Objectives
11.1 Cluster Analysis
11.1.1 Statistical Significance Testing
11.1.2 Area of Application
11.1.3 Joining (Tree Clustering)
11.1.4 Two-way Joining
11.1.5 k-Means Clustering
11.1.6 EM (Expectation Maximization) Clustering
11.1.7 Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
11.2 Let us Sum up
11.3 Lesson end activities
11.4 Check your progress
11.5 Points for Discussion
11.6 References

11.0 Aims and Objectives

This lesson discusses evolutionary analysis and cluster analysis: the general purpose of clustering, statistical significance testing, areas of application, joining (tree clustering), two-way joining, k-means clustering, EM (expectation maximization) clustering, and finding the right number of clusters in k-means and EM clustering by v-fold cross-validation.

11.1 Cluster analysis

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation or interpretation; in other words, cluster analysis simply discovers structures in data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of a similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. There are countless examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation, the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.

11.1.1 Statistical Significance Testing

Note that the above discussion refers to clustering algorithms and does not mention anything about statistical significance testing.
In fact, cluster analysis is not so much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well-defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in k-means clustering).

11.1.2 Area of Application

Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archaeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable, meaningful piles, cluster analysis is of great utility.

11.1.3 Joining (Tree Clustering)
· Hierarchical Tree
· Distance Measures
· Amalgamation or Linkage Rules

General Logic

The example in the General Purpose introduction illustrates the goal of the joining or tree clustering algorithm. The purpose of this algorithm is to join objects (e.g., animals) together into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree.

Hierarchical Tree

Consider a horizontal hierarchical tree plot. On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster. As a result, we link more and more objects together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in vertical icicle plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, then this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, one is able to detect clusters (branches) and interpret those branches.

Distance Measures

The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items.
In the previous example, the rule for grouping a number of diners was whether they shared the same table or not. These distances (similarities) can be based on a single dimension or on multiple dimensions, with each dimension representing a rule or condition for grouping objects. For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances. If we had a two- or three-dimensional space, this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler). However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances or some other derived measure of distance that is more meaningful to the researcher; it is up to the researcher to select the right method for his or her specific application.

Euclidean distance. This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:

    distance(x,y) = [Σi (xi - yi)²]^½

Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.

Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:

    distance(x,y) = Σi (xi - yi)²

City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

    distance(x,y) = Σi |xi - yi|

Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they differ on any one of the dimensions. The Chebychev distance is computed as:

    distance(x,y) = maxi |xi - yi|

Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance, computed as:

    distance(x,y) = (Σi |xi - yi|^p)^(1/r)

where r and p are user-defined parameters.
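As a concrete illustration of the measures just defined, here is a minimal Python sketch; the example vectors are invented, and the function names are ours rather than any particular package's.

    # Minimal sketch of the distance measures defined above.
    def euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    def squared_euclidean(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    def city_block(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def chebychev(x, y):
        return max(abs(a - b) for a, b in zip(x, y))

    def power_distance(x, y, p, r):
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

    x, y = (1.0, 4.0, 2.0), (3.0, 1.0, 2.0)   # made-up example points
    print(euclidean(x, y), city_block(x, y), chebychev(x, y))
    print(power_distance(x, y, p=2, r=2))     # equals the Euclidean distance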
A few example calculations may demonstrate how this measure "behaves": parameter p controls the progressive weight that is placed on differences on individual dimensions, while parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are both equal to 2, then this distance is equal to the Euclidean distance.

Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:

    distance(x,y) = (number of xi ≠ yi) / i

where i is the number of dimensions.

Amalgamation or Linkage Rules

At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other; this method is called complete linkage. Numerous other linkage rules like these have been proposed.

Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."

Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.

Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous one) should be used when the cluster sizes are suspected to be greatly uneven.
Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.

Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions; in a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the distance between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.

Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.

Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details concerning this method. In general, this method is regarded as very efficient; however, it tends to create clusters of small size. For an overview of the other two general categories of clustering methods, see Two-way Joining and k-Means Clustering.

11.1.4 Two-way Joining
· Introduction
· Two-way Joining

Introduction

Previously, we have discussed this method in terms of "objects" that are to be clustered. In all other types of analyses, the research question of interest is usually expressed in terms of cases (observations) or variables. It turns out that the clustering of both may yield useful results. For example, imagine a study where a medical researcher has gathered data on different measures of physical fitness (variables) for a sample of heart patients (cases). The researcher may want to cluster cases (patients) to detect clusters of patients with similar syndromes. At the same time, the researcher may want to cluster variables (fitness measures) to detect clusters of measures that appear to tap similar physical abilities.

Two-way Joining

Given the discussion in the paragraph above concerning whether to cluster cases or variables, one may wonder why not cluster both simultaneously? Two-way joining is useful in the (relatively rare) circumstances when one expects that both cases and variables simultaneously contribute to the uncovering of meaningful patterns of clusters. For example, returning to the example above, the medical researcher may want to identify clusters of patients that are similar with regard to particular clusters of similar measures of physical fitness. The difficulty with interpreting these results may arise from the fact that the similarities between different clusters may pertain to (or be caused by) somewhat different subsets of variables. Thus, the resulting structure (clusters) is by nature not homogeneous. This may seem a bit confusing at first, and, indeed, compared to the other clustering methods described (Joining (Tree Clustering) and k-Means Clustering), two-way joining is probably the one least commonly used.
However, some researchers believe that this method offers a powerful exploratory data analysis tool (for more information, see the detailed description of this method in Hartigan, 1975).

11.1.5 k-Means Clustering
· Example
· Computations
· Interpretation of results

General logic

This method of clustering is very different from Joining (Tree Clustering) and Two-way Joining. Suppose that you already have hypotheses concerning the number of clusters in your cases or variables. You may want to "tell" the computer to form exactly 3 clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k-means clustering algorithm. In general, the k-means method will produce exactly k different clusters of greatest possible distinction. It should be mentioned that the best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data.

Example

In the physical fitness example, the medical researcher may have a "hunch" from clinical experience that her heart patients fall basically into three different categories with regard to physical fitness. She might wonder whether this intuition can be quantified, that is, whether a k-means cluster analysis of the physical fitness measures would indeed produce the three clusters of patients as expected. If so, the means on the different measures of physical fitness for each cluster would represent a quantitative way of expressing the researcher's hypothesis or intuition (i.e., patients in cluster 1 are high on measure 1, low on measure 2, etc.).

Computations

Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse." The program will start with k random clusters, and then move objects between those clusters with the goal to (1) minimize variability within clusters and (2) maximize variability between clusters. In other words, the similarity rules will apply maximally to the members of one cluster and minimally to members belonging to the rest of the clusters. This is analogous to "ANOVA in reverse" in the sense that the significance test in ANOVA evaluates the between-group variability against the within-group variability when computing the significance test for the hypothesis that the means in the groups are different from each other. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.

Interpretation of results

Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.

11.1.6 EM (Expectation Maximization) Clustering
· Introductory Overview
· The EM Algorithm

Introductory Overview

The methods described here are similar to the k-means algorithm described above, and you may want to review that section for a general overview of these techniques and their applications.
The general purpose of these techniques is to detect clusters in observations (or variables) and to assign those observations to the clusters. A typical example application for this type of analysis is a marketing research study in which a number of consumer behavior related variables are measured for a large sample of respondents. The purpose of the study is to detect "market segments," i.e., groups of respondents that are somehow more similar to each other (to all other members of the same cluster) than to respondents that "belong to" other clusters. In addition to identifying such clusters, it is usually equally of interest to determine how the clusters are different, i.e., to determine the specific variables or dimensions that vary, and how they vary, in regard to members of different clusters.

k-means clustering. To reiterate, the classic k-means algorithm was popularized and refined by Hartigan (1975; see also Hartigan and Wong, 1978). The basic operation of that algorithm is relatively simple: given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.

Extensions and generalizations. The EM (expectation maximization) algorithm extends this basic approach to clustering in two important ways:

1. Instead of assigning cases or observations to clusters so as to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster membership based on one or more probability distributions. The goal of the clustering algorithm is then to maximize the overall probability, or likelihood, of the data, given the (final) clusters.

2. Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).

The EM Algorithm

The EM algorithm for clustering is described in detail in Witten and Frank (2001). The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution.

Mixtures of distributions. The resulting distribution of values in the population is then a mixture of two normal distributions with different means and different standard deviations; only the mixture (sum) of the two distributions would be observed. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
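The logic just described can be made concrete with a minimal Python sketch of EM for a two-component mixture of normal distributions in one dimension; the data are simulated, the initialization is naive, and this is an illustration of the idea rather than how any particular statistics package implements it.

    # Hedged sketch of EM for a two-component 1-D Gaussian mixture.
    import math
    import random

    def pdf(x, mu, sd):
        return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    random.seed(1)
    data = ([random.gauss(0, 1) for _ in range(200)] +
            [random.gauss(5, 1.5) for _ in range(200)])   # simulated mixture

    # Naive initial guesses for weights, means, and standard deviations:
    w, mu, sd = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]

    for _ in range(50):
        # E-step: probability that each point belongs to each cluster
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], sd[k]) for k in range(2)]
            tot = sum(p)
            resp.append([pk / tot for pk in p])
        # M-step: re-estimate the parameters from the soft assignments
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sd[k] = math.sqrt(max(var, 1e-9))

    print([round(m, 2) for m in mu], [round(s, 2) for s in sd])

Note that the E-step produces exactly the classification probabilities discussed below: each observation belongs to each cluster with some probability, rather than being assigned outright.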
With the implementation of the EM algorithm in some computer programs, you may be able to select (for continuous variables) different distributions such as the normal, log-normal, and Poisson distributions. You can select different distributions for different variables and thus derive clusters for mixtures of different types of distributions.

Categorical variables. The EM algorithm can also accommodate categorical variables. The method will at first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters.

Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.

11.1.7 Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation

An important question that needs to be answered before applying the k-means or EM clustering algorithms is how many clusters there are in the data. This is not known a priori and, in fact, there might be no definite or unique answer as to what value k should take. In other words, k is a nuisance parameter of the clustering model. Luckily, an estimate of k can be obtained from the data using the method of cross-validation. Remember that the k-means and EM methods will determine cluster solutions for a particular user-defined number of clusters. The k-means and EM clustering techniques (described above) can be optimized and enhanced for typical applications in data mining. The general metaphor of data mining implies a situation in which an analyst searches for useful structures and "nuggets" in the data, usually without any strong a priori expectations of what might be found (in contrast to the hypothesis-testing approach of scientific research). In practice, the analyst usually does not know ahead of time how many clusters there might be in the sample. For that reason, some programs include an implementation of a v-fold cross-validation algorithm for automatically determining the number of clusters in the data. This algorithm is immensely useful in all general "pattern-recognition" tasks: to determine the number of market segments in a marketing research study, the number of distinct spending patterns in studies of consumer behavior, the number of clusters of different medical symptoms, the number of different types (clusters) of documents in text mining, the number of weather patterns in meteorological research, the number of defect patterns on silicon wafers, and so on.

The v-fold cross-validation algorithm applied to clustering. The v-fold cross-validation algorithm is described in some detail in Classification Trees and General Classification and Regression Trees (GC&RT). The general idea of this method is to divide the overall sample into a number of v folds.
The same type of analysis is then successively applied to the observations belonging to v - 1 folds (the training sample), and the results of the analyses are applied to sample v (the fold that was not used to estimate the parameters, build the tree, determine the clusters, etc.; this is the testing sample) to compute some index of predictive validity. The results for the v replications are aggregated (averaged) to yield a single measure of the stability of the respective model, i.e., the validity of the model for predicting new observations. Cluster analysis is an unsupervised learning technique, and we cannot observe the (real) number of clusters in the data. However, it is reasonable to replace the usual notion of "accuracy" (applicable to supervised learning) with that of "distance." In general, we can apply the v-fold cross-validation method to a range of numbers of clusters in k-means or EM clustering and observe the resulting average distance of the observations (in the cross-validation or testing samples) from their cluster centers (for k-means clustering); for EM clustering, an appropriate equivalent measure would be the average negative (log-)likelihood computed for the observations in the testing samples.

Check your progress:
1. List the different distance matrix methods.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

Reviewing the results of v-fold cross-validation. The results of v-fold cross-validation are best reviewed in a simple line graph of the cost function against the number of clusters. Consider the result of analyzing a data set widely known to contain three clusters of observations (specifically, the well-known Iris data file reported by Fisher, 1936, and widely referenced in the literature on discriminant function analysis), compared with the results for analyzing simple normal random numbers. The "real" data exhibit the characteristic scree-plot pattern, where the cost function (in this case, twice the negative log-likelihood of the cross-validation data, given the estimated parameters) quickly decreases as the number of clusters increases, but then (past 3 clusters) levels off, and even increases as the data are overfitted. The random numbers, in contrast, show no such pattern; in fact, there is basically no decrease in the cost function at all, and it quickly begins to increase as the number of clusters increases and overfitting occurs. It is easy to see from this simple illustration how useful the v-fold cross-validation technique, applied to k-means and EM clustering, can be for determining the "right" number of clusters in the data.

11.2 Let us Sum up

Evolution is regarded as a branching process, whereby populations are altered over time and may speciate into separate branches, hybridize together again, or terminate by extinction. This may be visualized as a multidimensional character-space that a population moves through over time. The problem posed by phylogenetics is that genetic data are only available for the present, and fossil records (osteometric data) are sporadic and less reliable.
Our knowledge of how evolution operates is used to reconstruct the full tree. This lesson has discussed evolutionary analysis and cluster analysis: its general purpose, statistical significance testing, and areas of application.

11.3 Lesson end activities

Students will work in groups in an attempt to identify the evolutionary history of a particular group of organisms based on the protein sequences of various molecular analyses. Students will try to show how similar organisms are related using rooted and unrooted phylogenetic trees.

Procedure:
1. Form groups of 3 or 4 students to do the investigation.
2. Students will select a group of organisms and then get approval from the instructor. Organisms should be similar-appearing or taxonomically related organisms. It is first come, first served. You must have a minimum of 4 organisms that you will study. Possible animal and plant groups for study (some may be similar in appearance though not necessarily related by analysis):
Canines (Dog, Wolves, Coyote, Dingo, Hyena, African Wild dog, Foxes, etc.)
Felines (House cat, Lynx, Tiger, African lion, Mt. Lion, Jaguar, Leopard, Panther, Cheetah, etc.)
Bears (Polar, Black, Kodiak, Grizzly, Panda, etc.)
Trees (Red oak, White oak, Sugar maple, Norway maple, Am. elm, Ginkgo, Green spruce, etc.)
Mollusks (Slugs, Snails, Squid, Octopus, Cuttlefish, Clams, Oysters, etc.)
Arthropods (Insects, Crabs, Spiders, Centipedes, etc.)
Flowers (Roses, Tulip, Tiger lily, Daylily, Carnation, etc.)
Any other collections or subdivisions of plants or animals (specific insects, specific members of a particular taxonomic genus, family or order, types of ferns, etc.)
3. Identify each of the organisms by their scientific name.
4. Using at least 3 different protein compounds appropriate for the organisms (e.g., hemoglobin for higher animals, myoglobin in muscle, enolase for most organism comparisons, cytochrome c in all organisms, etc.), create and print (or save to disk) the BW evolutionary trees for those organisms and the respective proteins.
5. Based on these trees, create a composite tree on a web page with pictures obtained from websites.

11.4 Check your progress: Model answers
1. Your answer must include these points:
i. Neighbor-joining
ii. Fitch-Margoliash method
iii. Using outgroups

11.5 Points for Discussion
1. The role of statistics in clustering is vital - Discuss.
2. "Phylogenetic analysis helps us understand the relationship between ourselves and our ancestors in a lucid manner" - Substantiate.

11.6 References
1. Ackerly, D. D. 1999. Comparative plant ecology and the role of phylogenetic information. Pages 391-413 in M. C. Press, J. D. Scholes, and M. G. Braker, eds. Physiological plant ecology. The 39th symposium of the British Ecological Society held at the University of York 7-9 September 1998. Blackwell Science, Oxford, U.K.
2. Berenbrink, M., P. Koldkjær, O. Kepp, and A. R. Cossins. 2005. Evolution of oxygen secretion in fishes and the emergence of a complex physiological system. Science 307:1752-1757.
3. Blomberg, S. P., T. Garland, Jr., and A. R. Ives. 2003. Testing for phylogenetic signal in comparative data: behavioral traits are more labile. Evolution 57:717-745.
4. Brooks, D. R., and D. A. McLennan. 1991. Phylogeny, ecology, and behavior: a research program in comparative biology. Univ. Chicago Press, Chicago. 434 pp.
5. Cheverud, J. M., M. M. Dow, and W. Leutenegger. 1985. The quantitative assessment of phylogenetic constraints in comparative analyses: sexual dimorphism in body weight among primates. Evolution 39:1335-1351.
6. Eggleton, P., and R. I. Vane-Wright, eds. 1994. Phylogenetics and ecology. Linnean Society Symposium Series Number 17. Academic Press, London.
7. Felsenstein, J. 1985. Phylogenies and the comparative method. American Naturalist 125:1-15.

LESSON - 12 ROOTED AND UNROOTED TREE

12.0 Aims and Objectives
12.1 Rooted and Unrooted Tree
12.1.1 Definition of a phylogenetic tree
12.1.2 Features of a phylogenetic tree
12.1.3 Unrooted trees
12.1.4 Rooted trees
12.2 Let us Sum up
12.3 Lesson end activities
12.4 Check your progress
12.5 Points for Discussion
12.6 References

12.0 Aims and Objectives

This lesson discusses rooted and unrooted trees: the definition of a phylogenetic tree, the features of a phylogenetic tree (branches, external and internal nodes), unrooted trees, and rooted trees.

12.1 Rooted and Unrooted tree

12.1.1 Definition of a phylogenetic tree

A tree is an acyclic connected graph that consists of a collection of nodes (internal and external) and branches connecting them, so that every node can be reached by a unique path from every other node.

[Figure 12.1: An unrooted phylogenetic tree joining four taxonomic units A, B, C, and D, with its branches, external nodes, and internal nodes labelled.]

12.1.2 Features of a phylogenetic tree

In the area of phylogenetic inference, trees are used as visual displays that represent hypothetical, reconstructed evolutionary events. The tree in this case consists of:

v nodes, which represent taxonomic units such as species or genes: the external nodes, those at the ends of the branches, represent living organisms, while the internal nodes represent ancestral units.
v The lengths of the branches, which usually represent elapsed time, measured in years, or may represent the number of molecular changes (e.g., mutations) that have taken place between the two nodes. This is calculated from the degree of difference when sequences are compared (refer to "alignments" later).
v Sometimes the lengths are irrelevant and the tree represents only the order of evolution. (In a dendrogram, only the lengths of the horizontal, or vertical as the case may be, branches count.)
v Finally, the tree may be rooted or unrooted.

Check your progress:
1. List the features of a phylogenetic tree.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

12.1.3 Unrooted trees

An unrooted tree simply represents phylogenetic relationships but does not define an evolutionary path. In an unrooted tree, an external node represents a contemporary organism. Internal nodes represent common ancestors of some of the external nodes. In this case, the tree shows the relationship between organisms A, B, C and D but does not tell us anything about the series of evolutionary events that led to these genes. There is also no way to tell whether or not a given internal node is a common ancestor of any external nodes.
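Before turning to rooted trees, the connection between the two kinds of tree can be made concrete with a small counting sketch in Python: an unrooted, fully bifurcating tree on n external nodes has 2n - 3 branches, and placing a root on any one branch yields a distinct rooted tree. The quartet below, with external nodes A-D and internal nodes X and Y, is a made-up example encoded as a simple edge list.

    # Hedged sketch: every branch of an unrooted tree is a possible root position.
    edges = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]

    leaves = {v for e in edges for v in e if v not in ("X", "Y")}
    assert len(edges) == 2 * len(leaves) - 3    # 5 branches for 4 external nodes

    for i, (u, v) in enumerate(edges, 1):
        print(f"rooted tree {i}: root placed on branch {u}-{v}")

The five rootings printed here correspond to the five rooted trees of Figure 12.2 below.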
12.1.4 Rooted trees

In the case of a rooted tree, one of the external nodes is used as an outgroup and, in essence, the root that it identifies becomes the common ancestor of all the other external nodes. The outgroup therefore enables the root of a tree to be located and the correct evolutionary pathway to be identified. In the above case, five different evolutionary pathways are possible, depending on where the root is placed, each depicted by a different rooted tree.

[Figure 12.2: The five rooted trees that can be drawn from the unrooted tree (box). The positions of the roots are indicated by numbers on the outline of the unrooted tree (box).]

12.2 Let us Sum up

This lesson has given a brief introduction to rooted and unrooted trees: the definition of a phylogenetic tree, the features of a phylogenetic tree (branches, external and internal nodes), unrooted trees, and rooted trees, with diagrams.

12.3 Lesson end activities
1. Why use several molecules to show the phylogeny of organisms?
2. Did you encounter any unusual patterns or branches? What might be a possible explanation for these?

12.4 Check your progress: Model answers
1. Your answer must include these points:
i. Internal nodes
ii. External nodes
iii. Lengths of the branches
iv. May be rooted or unrooted

12.5 Points for Discussion
1. "Phylogenetic trees are very useful in understanding our ancestors" - Justify.

12.6 References
1. Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Mass. xx + 664 pp.
2. Freckleton, R. P., P. H. Harvey, and M. Pagel. 2002. Phylogenetic analysis and comparative data: a test and review of evidence. American Naturalist 160:712-726.
3. Garland, T., Jr., and A. R. Ives. 2000. Using the past to predict the present: confidence intervals for regression equations in phylogenetic comparative methods. American Naturalist 155:346-364.
4. Garland, T., Jr., A. F. Bennett, and E. L. Rezende. 2005. Phylogenetic approaches in comparative physiology. Journal of Experimental Biology 208:3015-3035.
5. Garland, T., Jr., A. W. Dickerman, C. M. Janis, and J. A. Jones. 1993. Phylogenetic analysis of covariance by computer simulation. Systematic Biology 42:265-292.
6. Garland, T., Jr., P. H. Harvey, and A. R. Ives. 1992. Procedures for the analysis of comparative data using phylogenetically independent contrasts. Systematic Biology 41:18-32.
7. Gittleman, J. L., and M. Kot. 1990. Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology 39:227-241.
8. Grafen, A. 1989. The phylogenetic regression. Philosophical Transactions of the Royal Society of London B 326:119-157.
9. Harvey, P. H., and M. D. Pagel. 1991. The comparative method in evolutionary biology. Oxford University Press, Oxford. 239 pp.

LESSON - 13 BOOTSTRAPPING

13.0 Aims and Objectives
13.1 Inferred and true trees
13.2 Gene trees are not the same as species trees
13.3 Bootstrapping
13.4 Molecular sequences
13.5 Sequence alignment is the essential preliminary to tree construction
13.6 Let us Sum up
13.7 Lesson end activities
13.8 Check your progress
13.9 Points for Discussion
13.10 References

13.0 Aims and Objectives

This lesson discusses inferred and true trees, gene trees, bootstrapping, molecular sequences, and tree construction.

13.1 Inferred and true trees

The criteria used to choose an outgroup depend very much on the type of analysis that is carried out. Suppose that homologous (orthologous) genes in a tree come from human, chimpanzee, gorilla and orangutan. A useful homologous primate outgroup sequence is that from the baboon, as palaeontological evidence suggests that baboons branched away from the lineage leading to human, chimpanzee, gorilla and orangutan before the time of the common ancestor of the four species (Fig 13.1).

[Fig 13.1: The use of an outgroup (baboon) to root a phylogenetic tree of human, chimpanzee, gorilla and orangutan.]

We refer to the rooted tree given above as an inferred tree. This is to emphasise that it depicts the series of evolutionary events that are inferred from the data that were analysed, and may not necessarily be the same as the true tree, the one that depicts the actual series of events that occurred. Sometimes we can be fairly confident that the inferred tree is the true tree, but most phylogenetic data analyses are prone to uncertainties. Degrees of confidence can be assigned to the branching patterns in an inferred tree using bootstrap analysis (discussed in a later section). Due to the imprecise nature of phylogenetic analysis, controversies have arisen.

13.2 Gene trees are not the same as species trees

The above tree is a gene tree, i.e., a tree derived by comparing orthologous sequences (those derived from the same ancestral sequence). The assumption is that this gene tree is a more accurate reflection of a species tree than one that can be inferred from morphological data. This assumption is generally correct, but it does not mean that the gene tree is the same as the species tree. Mutation and speciation are not expected to occur at the same time. For example, the mutation event could precede the speciation event. This would mean that, to begin with, both alleles will still be present in the same unsplit population of the ancestral species. When the population split occurs, it is likely that both alleles will be present in each of the resulting groups. After the split, the new populations evolve independently. One possibility is that, as a result of random genetic drift, one allele is lost from one population and the other allele is lost from the second population. This establishes the two separate genetic lineages that were inferred from the phylogenetic analysis of the gene. How do these considerations affect the coincidence between a gene tree and a species tree? (a) If a molecular clock is used to date the time at which gene divergence took place, then it cannot be assumed that this is also the time of the speciation event. A significant difference between a gene event and a species event can exist even though the species tree and gene tree look the same (Fig 13.2). (b) If the first speciation event is followed closely by a second speciation event in one of the two populations, then the branching order of the gene tree might be different from that of the species tree. This can occur if the genes in the modern
This can occur if the genes in the modern species are derived from alleles that had already appeared before the first of the two speciation events (Fig 13.3).

[Fig 13.2: Gene tree and species tree look the same. However, mutation might precede speciation, giving an incorrect time for the latter if a molecular clock is used.]

[Fig 13.3: A gene tree can have a different branching order from a species tree.]

13.3 Bootstrapping
Tree reconstruction. In any molecular phylogenetic reconstruction the following points need to be addressed:
i. Molecular sequences
ii. Sequence alignment is the essential preliminary to tree reconstruction
iii. Converting the alignment data into a phylogenetic tree
iv. Assessing accuracy of a reconstructed tree
v. Molecular clocks enable the time of divergence of ancestral sequences to be estimated
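Point iv, assessing the accuracy of a reconstructed tree, is most often addressed by bootstrapping: the columns of the multiple alignment are resampled with replacement to give many pseudo-replicate alignments, a tree is built from each, and the fraction of replicates in which a branch reappears becomes a confidence value for that branch (the PHYLIP program seqboot, described in Lesson 14, automates this resampling). The following is a minimal illustrative sketch in Python of the resampling step only; the toy alignment is hypothetical and the tree-building step is omitted.

    import random

    def bootstrap_columns(alignment, seed=None):
        # Resample alignment columns with replacement: one bootstrap replicate.
        # alignment: list of equal-length sequence strings, one per taxon.
        rng = random.Random(seed)
        n_cols = len(alignment[0])
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        return ["".join(seq[c] for c in cols) for seq in alignment]

    # Toy alignment (hypothetical); build a tree from each replicate and
    # count how often each branch recurs to obtain bootstrap percentages.
    aln = ["AGCAATGGCC", "AGCTATGGAC", "AGCTATGCCC"]
    replicates = [bootstrap_columns(aln, seed=i) for i in range(100)]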
13.4 Molecular sequences
Nucleic acids (rRNA, DNA) and protein sequences are used in molecular phylogenetic tree construction. DNA yields more phylogenetic information than protein and has become by far the predominant molecule for phylogeny:
§ More statistical information from DNA data: The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes affect only the DNA sequence. Hence coding as well as non-coding regions of the genome can be examined. The following pair of sequences illustrates this: at the protein level there is only one difference, but at the nucleic acid level there are three.

Protein -gly-ala-ile-leu-asp-arg-
DNA     -gga-gcc-ata-tta-gat-aga-
DNA     -gga-gca-att-ttt-gat-aga-
Protein -gly-ala-ile-phe-asp-arg-

§ Ease of sequencing DNA: Samples for DNA sequencing can be prepared by PCR, which is an extremely easy technique.
Protein electrophoresis data, restriction fragment length polymorphisms (RFLP), simple sequence length polymorphisms (SSLP), single nucleotide polymorphisms (SNP) and DNA-DNA hybridization data have also been used for molecular phylogenetic reconstruction. Immunological data from cross-reactivity studies have been used for such work as well.

13.5 Sequence alignment is the essential preliminary to tree construction
This is the most important step in molecular phylogeny, and a number of issues have to be considered:
· Sequence homologs: Sequences that are to be aligned should be homologs; an example is the β-globin genes of different vertebrates. This satisfies the phylogeny criterion, which states that the sequences should be derived from a common ancestral sequence.
· Non-homologous sequences: If the sequences are not homologous, and hence do not share a common ancestor, phylogenetic construction methods will always produce a tree, but the tree will not be of any biological relevance. This type of error commonly occurs when undertaking homology analysis to assign functions to newly generated gene sequences. BLAST is used extensively as one of the homology analysis methods, and hence interpretation of the data arising from such analyses should be undertaken with care.
· Easy alignments: Correctly aligning the homologous sequences is the next task. In some cases it is easy. A simple sequence alignment is shown below:

Sequence AGCAATGGCCAGACAATAATG
Sequence AGCTATGGACAGACATTAATG
         *** **** ****** *****

· Difficult alignments: If sequences have evolved and diverged by accumulating insertions and deletions as well as point mutations, then these sequences are not always easy to align. Insertions and deletions cannot be distinguished when pairs of sequences are aligned, so we refer to them as indels. Below is a pair of difficult sequences for alignment, where placing the indel at the correct location can become a problem; two possible positions for the indel are shown:

Sequence GACGACCATAGACCAGCATAG
Sequence GACTACCATAGA-CTGCAAAG
         *** ******** * *** **

Sequence GACGACCATAGACCAGCATAG
Sequence GACTACCATAGACT-GCAAAG
         *** ********* *** **

· The dot matrix technique for alignment: Some alignments can easily be done by "eye-balling" the sequences, yet others may require pen and paper. The simplest method is known as the dot matrix method. The two sequences are written out on the x- and y-axes of graph paper, and dots are placed at the positions corresponding to identical nucleotides of the two sequences. The alignment is indicated by a diagonal series of dots, broken by empty squares where the sequences have nucleotide differences, and shifting from one column to another where indels occur. (A minimal computational sketch of this method is given after this list.)
· Similarity approach is a mathematically based alignment technique: The similarity approach (Needleman and Wunsch) aims to maximise the number of identically matched nucleotides in the two sequences. The distance method (Waterman), on the other hand, minimises the number of mismatches. Often the two approaches will identify the same alignment as being the best one.
· Multiple alignments are generated for more than two sequences: Rarely can one do multiple alignments with pen and paper, and all the steps required for phylogenetic analysis are undertaken on a computer. Several computer programs are available for automatically generating multiple alignments.
· rRNA genes (aka rDNA) and rRNA have been used as molecular chronometers in phylogenetic studies. Refer to the section on rRNA for detailed notes on the methods of aligning these types of nucleic acids.

Check your progress:
1. List the steps involved in bootstrapping tree construction.
Notes: a) Write your answer in the space given below. b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
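As flagged under the dot matrix bullet above, the method is straightforward to reproduce computationally. A minimal sketch in Python, using two short hypothetical DNA strings; a practical implementation would add a word size and a sliding window to suppress background dots:

    def dot_matrix(seq_x, seq_y):
        # Print a simple dot plot: '*' where the nucleotides match, '.' elsewhere.
        print("  " + " ".join(seq_x))
        for y_base in seq_y:
            row = ["*" if x_base == y_base else "." for x_base in seq_x]
            print(y_base + " " + " ".join(row))

    dot_matrix("AGCAATGGCC", "AGCTATGGAC")

The diagonal run of '*' characters in the printout corresponds to the aligned region; a diagonal that shifts to a neighbouring column marks an indel.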
Methods of Phylogenetic Analysis
Two major groups of analyses exist to examine phylogenetic relationships: phenetic methods and cladistic methods. It is important to note that phenetics and cladistics have had an uneasy relationship over the last 40 years or so. Most of today's evolutionary biologists favor cladistics, although a strictly cladistic approach may result in counterintuitive results.

Phenetic Method of Analysis
Phenetics, also known as numerical taxonomy, involves the use of various measures of overall similarity for the ranking of species. There is no restriction on the number or type of characters (data) that can be used, although all data must first be converted to a numerical value, without any character "weighting". Each organism is then compared with every other for all characters measured, and the number of similarities (or differences) is calculated. The organisms are then clustered in such a way that the most similar are grouped close together and the more different ones are linked more distantly. The taxonomic clusters, called phenograms, that result from such an analysis do not necessarily reflect genetic similarity or evolutionary relatedness. The lack of evolutionary significance in phenetics has meant that this system has had little impact on animal classification, and as a consequence, interest in and use of phenetics has been declining in recent years.

Cladistic Method of Analysis
An alternative approach to diagramming relationships between taxa is called cladistics. The basic assumption behind cladistics is that members of a group share a common evolutionary history. Thus, they are more closely related to one another than they are to other groups of organisms. Related groups of organisms are recognized because they share a set of unique features (apomorphies) that were not present in distant ancestors but which are shared by most or all of the organisms within the group. These shared derived characteristics are called synapomorphies. Therefore, in contrast to phenetics, cladistic groupings do not depend on whether organisms share physical traits but depend on their evolutionary relationships. Indeed, in cladistic analyses two organisms may share numerous characteristics but still be considered members of different groups.

Cladistic analysis entails a number of assumptions. For example, species are assumed to arise primarily by bifurcation, or separation, of the ancestral lineage; species are often considered to become extinct upon hybridization (crossbreeding); and hybridization is assumed to be rare or absent. In addition, cladistic groupings must possess the following characteristics: all species in a grouping must share a common ancestor, and all species derived from a common ancestor must be included in the taxon. The application of these requirements results in the following terms being used to describe the different ways in which groupings can be made:
· A monophyletic grouping is one in which all species share a common ancestor, and all species derived from that common ancestor are included. This is the only form of grouping accepted as valid by cladists.
· A paraphyletic grouping is one in which all species share a common ancestor, but not all species derived from that common ancestor are included.
· A polyphyletic grouping is one in which species that do not share an immediate common ancestor are lumped together, while excluding other members that would link them.

13.6 Let us Sum up
Macromolecular data, meaning gene (DNA) and protein sequences, are accumulating at an increasing rate because of recent advances in molecular biology. For the evolutionary biologist, the rapid accumulation of sequence data from whole genomes has been a major advance, because the very nature of DNA allows it to be used as a "document" of evolutionary history.
Comparisons of the DNA sequences of various genes between different organisms can tell a scientist a lot about the relationships of organisms that cannot otherwise be inferred from morphology, an organism's outer form and inner structure. Because genomes evolve by the gradual accumulation of mutations, the amount of nucleotide sequence difference between a pair of genomes from different organisms should indicate how recently those two genomes shared a common ancestor. Two genomes that diverged in the recent past should have fewer differences than two genomes whose common ancestor is more ancient. Therefore, by comparing different genomes with each other, it should be possible to derive evolutionary relationships between them, the major objective of molecular phylogenetics.

Molecular phylogenetics attempts to determine the rates and patterns of change occurring in DNA and proteins and to reconstruct the evolutionary history of genes and organisms. Two general approaches may be taken to obtain this information. In the first approach, scientists use DNA to study the evolution of an organism. In the second approach, different organisms are used to study the evolution of DNA. Whatever the approach, the general goal is to infer process from pattern: the processes of organismal evolution are deduced from patterns of DNA variation, and the processes of molecular evolution are inferred from the patterns of variation in the DNA itself.

13.7 Lesson end activities
1. Are there any ideas you have that might simplify the process of making the combined or composite tree to help others in the future?
2. Are there other ways of displaying the information obtained such that you or others can more easily understand it?

13.8 Check your progress: Model answers
1. Your answer must include these points:
1. Molecular sequences
2. Sequence alignment is the essential preliminary to tree reconstruction
3. Converting the alignment data into a phylogenetic tree
4. Assessing accuracy of a reconstructed tree
5. Molecular clocks enable the time of divergence of ancestral sequences to be estimated

13.9 Points for Discussion
1. "Sequence alignment is an essential process in tree construction" - Substantiate.

13.10 References
1. Housworth, E. A., E. P. Martins, and M. Lynch. 2004. The phylogenetic mixed model. American Naturalist 163:84-96.
2. Ives, A. R., P. E. Midford, and T. Garland, Jr. 2007. Within-species variation and measurement error in phylogenetic comparative methods. Systematic Biology 56:252-270.
3. Maddison, D. R. 1994. Phylogenetic methods for inferring the evolutionary history and process of change in discretely valued characters. Annual Review of Entomology 39:267-292.
4. Maddison, W. P. 1990. A method for testing the correlated evolution of two binary characters: are gains or losses concentrated on certain branches of a phylogenetic tree? Evolution 44:539-557.
5. Maddison, W. P., and D. R. Maddison. 1992. MacClade: analysis of phylogeny and character evolution. Version 3. Sinauer Associates, Sunderland, Mass. 398 pp.
6. Martins, E. P., ed. 1996. Phylogenies and the comparative method in animal behavior. Oxford University Press, Oxford. 415 pp.
7. Martins, E. P., and T. Garland, Jr. 1991. Phylogenetic analyses of the correlated evolution of continuous characters: a simulation study. Evolution 45:534-557.
LESSON - 14 USE OF CLUSTAL AND PHYLIP
14.0 Aims and Objectives
14.1 CLUSTAL Introduction
14.2 New Features
14.3 Use of CLUSTAL
14.4 PHYLIP
14.4.1 Setup
14.4.2 Usage
14.5 Let us Sum up
14.6 Lesson end activities
14.7 Check your progress
14.8 Points for Discussion
14.9 References

14.0 Aims and Objectives
This chapter discusses the Clustal programs, their features and use, and the PHYLIP package, its features, setup and use.

14.1 CLUSTAL Introduction
One of the cornerstones of modern Bioinformatics is the comparison or alignment of protein sequences. With the aid of multiple sequence alignments, biologists are able to study the sequence patterns conserved through evolution and the ancestral relationships between different organisms. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). The most widely used programs for global multiple sequence alignment are from the Clustal series of programs.

The first Clustal program was written by Des Higgins in 1988 and was designed specifically to work efficiently on personal computers, which at that time had feeble computing power by today's standards. It combined a memory-efficient dynamic programming algorithm with the progressive alignment strategy developed by Feng and Doolittle and by Willie Taylor. The multiple alignment is built up progressively by a series of pairwise alignments, following the branching order in a guide tree. The initial pre-comparison used a rapid word-based alignment algorithm, and the guide tree was constructed using the UPGMA method.

In 1992, a new release was made, called ClustalV, which incorporated profile alignments (alignments of existing alignments) and the facility to generate trees from the multiple alignment using the Neighbour-Joining (NJ) method. The third generation of the series, ClustalW, released in 1994, incorporated a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties and the automatic choice of a suitable residue comparison matrix at each stage in the multiple alignment. In addition, the approximate word search used for the pre-comparison step was replaced by a more sensitive dynamic programming algorithm, and the dendrogram construction by UPGMA was replaced by NJ. The ClustalW program looked very similar to ClustalV, with simple text menus for interactive use and the possibility of running the program in batch mode by specifying the input file and the parameter options on the command line.

The rationale behind the development of the Clustal series has been to provide robust, portable programs that are capable of providing good, biologically accurate alignments within a reasonable time limit. A close collaboration between biologists and computer scientists is probably one of the main reasons for the success and continued widespread use of the Clustal programs. ClustalW has given rise to a number of developments, including the latest member of the family, ClustalX. Although the alignments produced are the same as those produced by the current release of ClustalW, the user can better evaluate alignments in ClustalX. The program displays the multiple alignment in a scrollable window, and all parameters are available using pull-down menus.
Within alignments, conserved columns are highlighted using a customizable colour scheme, and quality analysis tools are available to highlight potentially misaligned regions. ClustalX is easy to install, is user-friendly and maintains the portability of the previous generations through the NCBI Vibrant toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbitools/). Numerous options are provided, such as the realignment of selected sequences or selected blocks of the alignment and the possibility of building up difficult alignments piecemeal, making ClustalX an ideal tool for working interactively on alignments.

Parallel versions of ClustalW and ClustalX have been developed by SGI (http://www.sgi.com/industries/sciences/chembio/resources/clustalw/parallel_clustalw.html), which show increased speeds of up to 10x when running ClustalW/X on 16 CPUs and significantly reduce the time required for data analysis.

A number of other significant developments have been based on the ClustalW program. For example, ClustalNet is a Clustal alignment CORBA server, and DbClustal is a program for aligning sequences detected by database searches, which uses local alignment information to anchor the global multiple alignment. DbClustal is available on the Web at http://www-igbmc.u-strasbg.fr/BioInfo/DbClustal and forms part of the WU-Blast2 (Washington University BLAST version 2.0) server at the EBI (http://www.ebi.ac.uk/blast2/). Numerous Web servers have exploited the command line interface of ClustalW, notably the EBI's ClustalWWW Web server, which currently runs between 2,000 and 10,000 jobs/day, and the SRS server at the same site (http://srs.ebi.ac.uk/), which has ClustalW built in.

The EBI ClustalWWW interface provides extensive help, ranging from an introduction to multiple alignments for new users to detailed descriptions of each alignment option. An important factor in obtaining a high-quality alignment is the ability to change the numerous alignment parameters available in ClustalW. While the default values of the parameters have been optimised to work in the majority of cases, they are not necessarily optimal for any given alignment problem. In the ClustalWWW interface, all the options are easily accessible on the top page.

Sequences can be entered either by pasting them or by uploading a file from the user's local computer. In both cases, the sequences should be in one of seven different formats (GCG, FASTA, EMBL, GenBank, PIR/NBRF, Phylip or SWISS-PROT). Although users are encouraged to submit large numbers of sequences, there is no guarantee that the alignment will be completed within the job run limits. Therefore, users who experience problems when attempting to make very large alignments are advised to download the software and run it locally. In addition to the input format, the user can also specify the preferred output format for the multiple sequence alignment. The options are currently ALN, GCG, PHYLIP, PIR and GDE. It is also possible to configure the browser to automatically load the results files from ClustalW into a suitable external application. Many commercial packages, e.g. the GCG package (Wisconsin Package, Genetics Computer Group, Madison, WI) and its X Window graphical user interface, SeqLab, can also accept ClustalW alignments.
A recent enhancement to the ClustalW WWW interface has been the addition of an option that allows the user to upload the results of ClustalW into an alignment editor, using a Java applet called JalView (http://www.compbio.dundee.ac.uk/). JalView is a fully featured multiple sequence alignment editor which allows the user to perform further alignment analysis. Special features include the definition of sequence sub-groups, links to the SRS server at the EBI, and an option to output the alignment as a colour PostScript file for printing purposes. ClustalWWW can also calculate trees from a multiple alignment using the NJ method, a widely used and relatively fast algorithm that clusters sequences by minimising the sum of branch lengths. The resulting evolutionary relationships can be viewed either as cladograms or phylograms, with the option to display branch lengths (or 'tree graph distances').

14.2 New Features
Both ClustalW and ClustalX are being actively maintained and updated. Recent enhancements have included the possibility of saving both alignments and phylogenetic trees in the NEXUS format for compatibility with a number of phylogeny programs. Some work has also been done to optimise the alignment parameters; for example, the Gonnet series of residue comparison matrices is now used by default for protein sequence alignments. The latest version of the programs (version 1.83) contains four main enhancements. The first modification is the facility to save the multiple alignment result as a FASTA format file, for compatibility with a number of other software packages. Another is to provide a percent identity matrix, which some users have asked for. A third new option is the possibility of saving the residue range in the output file when saving a user-specified range of the alignment. This is particularly useful when extracting a single domain from the alignment of multi-domain proteins. The increased speeds obtained mean that it is now possible to construct phylogenetic trees for very large sets of sequences, which was previously only feasible on very large computer systems.

14.3 Use of CLUSTAL
CLUSTAL X is a new window-based interface for the widely used progressive multiple sequence alignment program CLUSTAL W. The new system is easy to use, providing an integrated environment for performing multiple sequence and profile alignments and analysing the results. CLUSTAL X displays the sequence alignment in a window on the screen. A versatile sequence colouring scheme allows the user to highlight conserved features in the alignment. Pull-down menus provide all the options required for traditional multiple sequence and profile alignment. New features include:
§ the ability to cut-and-paste sequences to change the order of the alignment,
§ selection of a subset of the sequences to be realigned,
§ selection of a sub-range of the alignment to be realigned and inserted back into the original alignment,
§ alignment quality analysis, with low-scoring segments or exceptional residues highlighted,
§ quality analysis and realignment of selected residue ranges, providing the user with a powerful tool to improve and refine difficult alignments and to trap errors in input sequences.
CLUSTAL X has been compiled on SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
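Because ClustalW retains the batch mode described in 14.1, where the input file and parameter options are given on the command line, it is easy to drive from a script. A minimal sketch in Python, assuming a clustalw executable is installed locally; the flag spellings below follow the ClustalW documentation but should be verified against your installed version (for example by running the program with no arguments and inspecting the help text):

    import subprocess

    def run_clustalw(in_file, out_file):
        # Align sequences in batch mode with a locally installed ClustalW.
        # Flag names are assumptions based on the ClustalW manual; check
        # them against your local installation before relying on this.
        subprocess.run(
            ["clustalw",
             f"-INFILE={in_file}",    # input sequences (FASTA, GCG, EMBL, ...)
             f"-OUTFILE={out_file}",  # where to write the alignment
             "-OUTPUT=PHYLIP"],       # PHYLIP output feeds directly into PHYLIP (14.4)
            check=True)

    run_clustalw("globins.fasta", "globins.phy")  # hypothetical file names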
Multiple sequence alignment with the Clustal series of programs: the Clustal programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing of range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).

14.4 PHYLIP
PHYLIP is a free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies). The name is an acronym for PHYLogeny Inference Package. It consists of 35 portable programs: the source code is written in C, and precompiled executables are available for Windows (95/98/NT/2000/me/XP), MacOS 8 and 9, MacOS X, and Linux systems. Complete documentation is provided for all the programs in the package and is part of the package. The author of the package is Joseph Felsenstein, Professor in the Department of Genome Sciences and the Department of Biology at the University of Washington, Seattle.

Methods (implemented by each program) that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

Each program is controlled through a menu, which asks the users which options they want to set, and allows them to start the computation. The data are read into the program from a text file, which the user can prepare using any word processor or text editor (but it is important that this text file not be in the special format of that word processor -- it should instead be in flat ASCII or Text Only format). Some sequence analysis programs, such as the ClustalW alignment program, can write data files in the PHYLIP format. Most of the programs look for the data in a file called infile -- if they do not find this file, they ask the user to type in the file name of the data file. Output is written onto files with names like outfile and outtree. Trees written onto outtree are in the Newick format, an informal standard agreed to in 1986 by the authors of a number of major phylogeny packages.

14.4.1 Setup
To use PHYLIP, it is necessary to set the PHYLIP environment by running a special command sequence once per login session. You may optionally place these commands in your .cshrc (C Shell users) or .bash_profile (Bourne Shell users) to avoid having to manually run these commands on login.
For csh and tcsh:
source /usr/local/setup/phylip.setup.csh
For sh and bash:
. /usr/local/setup/phylip.setup.sh

14.4.2 Usage
PHYLIP contains the following commands:
clique
o finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states.
consense
o computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree.
contml
o estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations.
dnacomp
o estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites which could have all states (nucleotides) uniquely evolved on the same tree.
dnadist
o computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs.
dnainvar
o for nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies. The program also tabulates the frequencies of occurrence of the different nucleotide patterns.
dnaml
o estimates phylogenies from nucleotide sequences by maximum likelihood. The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different (prespecified) rates of change in different categories of sites, with the program inferring which sites have which rates.
dnamlk
o same as dnaml but assumes a molecular clock. The use of the two programs together permits a likelihood ratio test of the molecular clock hypothesis to be made.
dnamove
o interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and the display of reconstructed ancestral bases. This can be used to find parsimony or compatibility estimates by hand.
dnapars
o estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states.
dnapenny
o finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search.
dollop
o estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters.
dolmove
o interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.
dolpenny
o finds all most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria, using the branch-and-bound method of exact search.
factor
o takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1).
fitch
o estimates phylogenies from distance matrix data under the "additive tree model", according to which the distances are expected to equal the sums of branch lengths between the species. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.
gendist
o computes one of three different genetic distance formulas from gene frequency data.
The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The former is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift. The distances are written to a file in a format appropriate for input to the distance matrix programs.
kitsch
o estimates phylogenies from distance matrix data under the "ultrametric" model, which is the same as the additive tree model except that an evolutionary clock is assumed. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.
mix
o estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters.
move
o interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.
neighbor
o an implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method" and of the UPGMA (Average Linkage clustering) method.
penny
o finds all most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria, using the branch-and-bound method of exact search.
protpars
o estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.
protdist
o computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can then be used in the distance matrix programs.
restml
o estimation of phylogenies by maximum likelihood using restriction sites data (not restriction fragments but presence/absence of individual sites).
seqboot
o reads in a data set, and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program consense to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters.

Check your progress:
1. List any five PHYLIP commands.
Notes: a) Write your answer in the space given below. b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
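All of the sequence programs above read the same plain-text infile layout: a header line giving the number of taxa and the number of sites, then one line per taxon consisting of a name padded to ten characters followed by its sequence. A minimal sketch in Python of writing the sequential variant of this format (the interleaved variant, and any local quirks, should be checked against the PHYLIP documentation); the records used here are hypothetical:

    def write_phylip(records, path="infile"):
        # Write {name: sequence} records as a sequential PHYLIP data file.
        # Names are truncated or padded to the 10-character field PHYLIP expects.
        n_taxa = len(records)
        n_sites = len(next(iter(records.values())))
        with open(path, "w") as fh:
            fh.write(f" {n_taxa} {n_sites}\n")
            for name, seq in records.items():
                fh.write(f"{name[:10]:<10}{seq}\n")

    write_phylip({"human": "AGCAATGGCC",
                  "chimp": "AGCTATGGAC",
                  "gorilla": "AGCTATGCCC"})

Writing the file as "infile" means the PHYLIP programs will pick it up automatically, as described in 14.4 above.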
14.5 Let us Sum up
Clustal is a widely used multiple sequence alignment computer program; the latest version is 1.83. PHYLIP is a free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies). The name is an acronym for PHYLogeny Inference Package. It consists of 35 portable programs: the source code is written in C, and precompiled executables are available for Windows (95/98/NT/2000/me/XP), MacOS 8 and 9, MacOS X, and Linux systems. Complete documentation is provided for all the programs in the package and is part of the package. The author of the package is Joseph Felsenstein, Professor in the Department of Genome Sciences and the Department of Biology at the University of Washington, Seattle. Methods (implemented by each program) that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

14.6 Lesson end activities
1. Collect any 10 related sequences and perform multiple sequence alignment using CLUSTAL.
2. Build a phylogenetic tree using PHYLIP.

14.7 Check your progress: Model answers
1. Your answer must include any five commands given in this lesson above.

14.8 Points for Discussion
1. Do a comparative analysis of ClustalX and ClustalW and elaborate on which is better.

14.9 References
1. Martins, E. P., and T. F. Hansen. 1997. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. American Naturalist 149:646-667. Erratum Am. Nat. 153:448.
2. Nunn, C. L., and R. A. Barton. 2001. Comparative methods for studying primate adaptation and allometry. Evolutionary Anthropology 10:81-98.
3. Oakley, T. H., Z. Gu, E. Abouheif, N. H. Patel, and W.-H. Li. 2005. Comparative methods for the analysis of gene-expression evolution: an example using yeast functional genomic data. Molecular Biology and Evolution 22:40-50.
4. O'Meara, B. C., C. M. Ané, M. J. Sanderson, and P. C. Wainwright. 2006. Testing for different rates of continuous trait evolution in different groups using likelihood. Evolution 60:922-933.
5. Organ, C. L., A. M. Shedlock, A. Meade, M. Pagel, and S. V. Edwards. 2007. Origin of avian genome size and structure in non-avian dinosaurs. Nature 446:180-184.
6. Page, R. D. M., ed. 2003. Tangled trees: phylogeny, cospeciation, and coevolution. University of Chicago Press, Chicago.
7. Pagel, M. D. 1993. Seeking the evolutionary regression coefficient: an analysis of what comparative methods measure. Journal of Theoretical Biology 164:191-205.
8. Pagel, M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877-884.
9. Paradis, E. 2005. Statistical analysis of diversification with species traits. Evolution 59:1-12.
10. Paradis, E., and J. Claude. 2002. Analysis of comparative data using generalized estimating equations. Journal of Theoretical Biology 218:175-185.
11. Purvis, A., and T. Garland, Jr. 1993. Polytomies in comparative analyses of continuous characters. Systematic Biology 42:569-575.
12. Rezende, E. L., and T. Garland, Jr. 2003. Comparaciones interespecíficas y métodos estadísticos filogenéticos. Pages 79-98 in F. Bozinovic, ed. Fisiología Ecológica & Evolutiva: teoría y casos de estudios en animales.
Ediciones Universidad Católica de Chile, Santiago.
13. Ridley, M. 1983. The explanation of organic diversity: the comparative method and adaptations for mating. Clarendon, Oxford, U.K.
14. Rohlf, F. J. 2001. Comparative methods for the analysis of continuous variables: geometric interpretations. Evolution 55:2143-2160.

UNIT IV
LESSON - 15 GENE PREDICTION
15.0 Aims and objectives
15.1 Gene prediction
15.2 Extrinsic Approaches
15.3 Ab Initio Approaches
15.4 Other Signals
15.4.1 Signal Sensors
15.4.2 Content Sensors
15.4.3 Integrated Gene Finding Methods
15.5 Comparative Genomics Approaches
15.6 Let us sum up
15.7 Check your progress
15.8 Points for Discussion
15.9 References

15.0 Aims and objectives
This chapter discusses the various gene prediction methods, signal sensors, content sensors, extrinsic and ab initio approaches, and integrated gene finding methods.

15.1 Gene prediction
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequences and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem. Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. The latter still demands in vivo experimentation through gene knockout and other assays, although frontiers of Bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.

15.2 Extrinsic Approaches
In extrinsic gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose. A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene.
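Because the genetic code is degenerate, the reverse translation mentioned above yields a family of candidate coding sequences rather than a single one. A minimal sketch in Python, using a deliberately abbreviated codon table for illustration (a complete table covers all twenty amino acids):

    from itertools import product

    # Abbreviated reverse codon table, for illustration only.
    CODONS = {"M": ["ATG"], "W": ["TGG"], "F": ["TTT", "TTC"],
              "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]}

    def reverse_translate(peptide):
        # Yield every coding DNA sequence that could encode the peptide.
        for combo in product(*(CODONS[aa] for aa in peptide)):
            yield "".join(combo)

    # Even the tripeptide MFL already has 1 * 2 * 6 = 12 candidate sequences.
    print(sum(1 for _ in reverse_translate("MFL")))

The number of candidates grows multiplicatively with peptide length, which is why inexact search tools such as BLAST, rather than exhaustive enumeration, are used in practice.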
However, to apply this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms, only a subset of all genes in the organism's genome are expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or foetus, which might be difficult to study for ethical reasons. Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequence from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

15.3 Ab Initio Approaches
Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
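The long-ORF heuristic sketched above is easy to implement. A minimal illustrative version in Python that scans one strand in all three reading frames and reports ATG-to-stop stretches above a length cutoff (a complete scanner would also examine the reverse complement; the toy sequence is hypothetical):

    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def long_orfs(dna, min_codons=100):
        # Yield (frame, start, end) for each ATG..stop ORF of >= min_codons.
        dna = dna.upper()
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield frame, start, i + 3
                    start = None

    # Toy sequence; prokaryotic scans would use a cutoff like min_codons=100.
    print(list(long_orfs("CCATGAAATTTGGGTAAA", min_codons=2)))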
Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail. Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models, in order to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs.

15.4 Other Signals
Among the derived signals used for prediction are sub-sequence statistics such as k-mer counts, the Fourier transform of pseudo-number-coded DNA, Z-curve parameters and certain run features. It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported. In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.

15.4.1 Signal Sensors
The most basic signal sensor is a simple consensus sequence, or an expression that describes a consensus sequence along with allowable variations, such as a PROSITE expression. More sensitive sensors can be designed using weight matrices in place of the consensus, in which each position in the pattern allows a match to any residue, but different costs are associated with matching each residue in each position. The score returned by a weight matrix sensor for a candidate site is the sum of the costs of the individual residue matches over that site. If this score exceeds a given threshold, the candidate site is predicted to be a true site. Such sensors have a natural probabilistic interpretation in which the score returned is a log likelihood ratio under a simple statistical model in which each position in the site is characterized by an independent and distinct distribution over possible residues. A mathematically equivalent interpretation of the score is that it is the discrimination energy for site recognition. Weight matrices can also be viewed as a simple type of neural network, sometimes called a perceptron.

Many investigators have also applied more complex neural networks, such as multi-layer feed-forward networks and time delay networks, to various DNA signal recognition problems. Multi-layer nets have the ability to capture statistical dependency between the residues at different positions in a site, an ability that perceptrons (and hence weight matrices) lack. Time delay neural networks also allow insertions and deletions while evaluating a match to a prospective site, whereas weight matrices and feed-forward neural networks do not.
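Returning to the weight matrix sensor described at the start of this section: scoring a candidate site is simply a sum of per-position costs compared against a threshold. A minimal sketch in Python with a hypothetical four-position matrix of log likelihood ratio scores (the values and the threshold are invented for illustration; real matrices are estimated from verified sites):

    # Hypothetical log likelihood ratio scores, one column per site position.
    WEIGHTS = [
        {"A": 1.2, "C": -0.8, "G": 0.1, "T": -1.0},
        {"A": -0.5, "C": 1.5, "G": -1.2, "T": 0.2},
        {"A": 0.9, "C": -0.3, "G": 0.4, "T": -0.7},
        {"A": -1.1, "C": 0.6, "G": 1.3, "T": -0.2},
    ]

    def score_site(site):
        # Sum the per-position scores over the candidate site.
        return sum(col[base] for col, base in zip(WEIGHTS, site))

    def is_site(site, threshold=2.0):
        # Predict a true site when the summed score clears the threshold.
        return score_site(site) >= threshold

    print(score_site("ACAG"), is_site("ACAG"))  # 4.9 True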
Other statistical/pattern models besides neural networks, such as nonhomogeneous Markov models (a weight matrix where the distribution at position i depends on the residue at position i-1, sometimes called "WAM" models), decision trees, quadratic discriminant functions, and graphical models, have also been used as biosequence signal sensors. In general, the penalty for these more sophisticated models is that much more training data is needed to estimate the many parameters that they contain, so they are unsuitable in cases where relatively few verified examples are known of the site to be modeled.

15.4.2 Content Sensors
The most important and most studied content sensor is the sensor that predicts coding regions. An extensive review of computational methods to detect coding regions is given by Fickett and Tung [23] (see also [20, 21]). In prokaryotes, it is still common to locate genes by simply looking for long open reading frames (ORFs); this is certainly not adequate for higher eukaryotes. To discriminate coding from non-coding regions in eukaryotes, exon content sensors often use in-frame hexamer counts or, what is nearly equivalent, a set of 3 fifth-order Markov models, one for each of the three nucleotide positions within a codon, as pioneered in the genefinder GeneMark [7]. (A minimal sketch of in-frame hexamer counting is given below.) It is also important to consider local compositional biases, as the codon preferences are quite different between genes in G+C rich regions and genes in A+T rich regions. While many other measures of coding potential have been investigated (Fickett tested 19 different measures taken from the literature), few others have been proven to be as effective. However, combinations of several measures can be effective, as in the popular GRAIL exon detector, in which several coding measures are combined along with base composition and signal sensor output for flanking splice sites, and fed into a neural net to predict exons.

Other content sensors include sensors for CpG islands, which are regions that often occur near the beginnings of genes, where the frequency of the dinucleotide CG is not as low as it typically is in the rest of the genome, and sensors for repetitive DNA, such as ALU sequences. The latter sensors are often used as masks or filters that completely remove the repetitive DNA, leaving the remaining DNA to be analyzed.

Check your progress:
1. Describe the ab initio method.
Notes: a) Write your answer in the space given below. b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
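The in-frame hexamer idea from 15.4.2 (flagged above) can be made concrete: hexamers are counted separately according to the codon position at which they begin, using known coding sequence as training data. A minimal sketch of the counting step in Python; turning the counts into log-odds scores against a non-coding background model, which is how a real sensor would use them, is omitted:

    from collections import Counter

    def inframe_hexamer_counts(coding_seq):
        # Count hexamers keyed by the codon position (0, 1 or 2) where they start.
        counts = [Counter(), Counter(), Counter()]
        for i in range(len(coding_seq) - 5):
            counts[i % 3][coding_seq[i:i + 6]] += 1
        return counts

    # Toy training sequence (hypothetical); real training uses known exons.
    counts = inframe_hexamer_counts("ATGGCCATTGTAATGGGCCGC")
    print(counts[0].most_common(3))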
A linguistic metaphor is sometimes applied here, likening the process of breaking down a sequence of DNA into genes, each of which is a series of exons and introns, to the process of parsing a sentence by breaking it down into its constituent grammatical parts. Indeed this parsing metaphor can be pushed deeper. Searls was the first major proponent of describing gene structure in linguistic terms using a formal grammar. His genefinding program, GenLang, was one of the earliest integrated genefinders, following on the pioneering work of Gelfand, Gelfand and Roytberg Fields and Soderlund, and Phil Green's GeneFinder, and was one of the inspirations for significant later work and the HMM methods described below.) Nearly all integrated genefinders use dynamic programming to combine candidate exons and other scored regions and sites into an complete gene prediction with maximal total score. A brief and lucid tutorial on this topic can be found in] and a more detailed exposition in. Gelfand, et al, proposed a dynamic programming scheme, embodied in the genefinder GREAT, that calculates the set of all so-called Pareto-optimal gene structure predictions, which include the optimal predictions for a wide variety of different scoring functions. Dynamic programming methods are also used in Grail II, GeneParser, FGENEH, and recent versions of GeneID. Dynamic programming methods find the candidate gene structure with the best overall score. The key to success in these methods is developing the right score function. A fruitful approach here has been to define a statistical model of genes that includes parameters describing codon dependencies in exons, characteristics of splice sites (e.g. the parameters of a weight matrix for splice sites), as well as ``linguistic" information on what functional features are likely to follow other features (see Figure 1). In this approach the observed DNA sequences are actually modeled as if they were manifestations of a stochastic process that generates gene-containing DNA. This process includes a latent (or ``hidden") variable associated with each nucleotide that represents the This watermark does not appear in the registered version - http://www.clicktoconvert.com 177 functional role or position of that nucleotide, e.g. a G residue might be part of a GT consensus donor splice site or it might be in the third position of a start codon. Taken together, the states of these hidden variables define a candidate gene structure. The linguistic rules for what functional features follow what other features are expressed by the parameters of a Markov process on the hidden variables. For this reason, these models are called hidden Markov models, or HMMs. Because a Markov process is just a finite state machine with probabilities on the state transitions, genefinding HMMs are merely a stochastic version of the genefinding finite state machines (regular grammars) introduced by Searls. Fig 15.1: A simplified diagram representing the liguistic rules for what might follow what when parsing a sequence consisting of a multiple exon gene. The arcs represent contents and the nodes represent signals. The contents are J5' : 5' UTR, EI : Initial Exon, E : Exon, I : Intron, E : Internal Exon, EF: Final Exon, ES : Single Exon, and J3' : 3' UTR. The signals are B : Begin sequence, S : Start Translation, D : Donor splice site, A : Acceptor splice site, T : Stop Translation, F : End sequence. A candidate gene structure is created by tracing a path in this figure from B to F. 
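The "what can follow what" rules of Fig 15.1 amount to a finite state machine, which can be written down directly. A minimal sketch in Python, using a hypothetical encoding of the figure's simplified forward-strand grammar; legality checks like this are what the dynamic programming described below performs, with probabilities attached to each transition:

    # Hypothetical encoding of Fig 15.1: each signal maps to the signals that
    # may legally follow it (the intervening contents label the arcs).
    FOLLOWS = {
        "B": ["S"],       # begin sequence -> start translation (via 5' UTR)
        "S": ["D", "T"],  # start -> donor (initial exon) or stop (single exon)
        "D": ["A"],       # donor -> acceptor (via intron)
        "A": ["D", "T"],  # acceptor -> donor (internal exon) or stop (final exon)
        "T": ["F"],       # stop translation -> end sequence (via 3' UTR)
    }

    def is_legal(path):
        # Check that a sequence of signals obeys the grammar.
        return all(b in FOLLOWS[a] for a, b in zip(path, path[1:]))

    print(is_legal(["B", "S", "D", "A", "D", "A", "T", "F"]))  # two-intron gene: True
    print(is_legal(["B", "D", "S"]))  # donor before start translation: False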
The advantage of HMMs is that, being probabilistic models, they define a natural score function. Let X denote the DNA sequence, Q denote a possible sequence of hidden states, one for each nucleotide in X, and θ denote the parameters of the HMM. Since Q represents a candidate gene structure for X, to find the genes in X we want to find the Q that is most likely given the DNA sequence, i.e., the Q that maximizes P(Q | X, θ), the probability of the gene structure Q given the DNA sequence X and the parameters θ. Equivalently, we can maximize the joint probability P(Q, X | θ). This is the score function that is optimized in a genefinding HMM. It can be optimized using standard dynamic programming methods.
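The standard dynamic programming method for recovering the best state sequence Q is the Viterbi algorithm. A minimal sketch in Python over a hypothetical two-state model; the states, transition and emission probabilities below are invented for illustration and are far simpler than a real genefinding HMM. The computation is done in log space to avoid numerical underflow:

    import math

    def viterbi(seq, states, start_p, trans_p, emit_p):
        # Return the most probable hidden state path Q for observed sequence X.
        V = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
        back = []
        for x in seq[1:]:
            scores, ptr = {}, {}
            for s in states:
                prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
                scores[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][x]))
                ptr[s] = prev
            V.append(scores)
            back.append(ptr)
        path = [max(states, key=lambda s: V[-1][s])]  # trace back from best end state
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return path[::-1]

    # Hypothetical model: the "coding" state emits G+C-rich sequence.
    states = ("coding", "noncoding")
    start = {"coding": 0.5, "noncoding": 0.5}
    trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
             "noncoding": {"coding": 0.1, "noncoding": 0.9}}
    emit = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
            "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
    print(viterbi("ATGCGCGCAT", states, start, trans, emit))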
This approach has been taken to its extreme limit in a genefinding program developed by Gelfand, Mironov, and Pevzner. This system, called Procrustes, requires the user to provide a close protein homolog of the gene to be predicted. Then a "spliced alignment" algorithm, similar to a Smith-Waterman alignment, is used to derive a putative gene structure by aligning the DNA to the homolog. The major disadvantage of this method is the requirement of a close homolog: it is often the case that homologs are unknown or are remote, in which case this system would be inappropriate. Nevertheless, in the presence of a very close homolog, Procrustes is an extremely effective gene finding method. Recent related methods, based on HMM models, have been developed by Birney and Durbin and are currently being developed by Kulp.

In 1995, a number of different integrated genefinders were tested on a benchmark set of 570 vertebrate genes by Burset and Guigó. They looked not only at how many bases were predicted correctly as either coding or non-coding, but also at how many exons were predicted exactly, with both splice sites located correctly. In the former case, accuracy was about 75-80%; in the latter, about 40-60%. These numbers are for systems that do not employ protein database homology searches; when database homology is employed, the upper limit for the accuracy increases by about 10% in both categories. Integrated eukaryotic genefinding systems based on HMM and GHMM models, starting with Genie and followed by Veil, Genscan and HMMgene, have pushed beyond these early performance numbers, with the latter two programs now obtaining upwards of 90% accuracy at the level of individual nucleotides and 80% for exact exon prediction, without the use of database homologies. A new category of completely correct gene prediction has been added to the list of performance measurements, and Genscan achieves an accuracy of about 40% on the Burset and Guigó dataset in this category. Tests have also been conducted on the identification of promoters, showing that the accuracy of currently available methods is much lower on this task.

The currently available genefinding performance results must be approached with extreme caution. The primary reason is that they depend very strongly on the difficulty of the genes in the test set and, for some genefinders, on the homology overlap between the genes in the test set and those in the training set used to optimize the parameters of the models. The latter is a factor even when no homology is explicitly used by the genefinding method. To avoid this problem, it is best to compare genefinders by training and testing on the same genes, and to avoid homologies between genes used for training and testing. Reese has constructed benchmark sets of this type for human and for Drosophila genes, randomly partitioned into specified parts for use in cross-validated train-test experiments. These have been used by Genie, Genscan and HMMgene. Reese's human dataset is also a bit harder than the original Burset and Guigó dataset, so genefinding programs get lower overall scores on it. Furthermore, the variance in performance from one train-test partition to another is quite high, since some parts by chance ended up with more "hard-to-predict" genes (usually genes with many exons and/or long introns) than others.
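The two levels of accuracy used in these evaluations can be made precise. Below is a minimal sketch, on hypothetical exon interval lists rather than any published benchmark, of nucleotide-level accuracy and exact-exon accuracy.

def coding_bases(exons):
    """Expand (start, end) exon intervals into a set of coding positions."""
    return {p for start, end in exons for p in range(start, end + 1)}

def nucleotide_accuracy(true_exons, predicted_exons, seq_len):
    """Fraction of positions labelled correctly as coding or non-coding."""
    true_set, pred_set = coding_bases(true_exons), coding_bases(predicted_exons)
    correct = sum((p in true_set) == (p in pred_set) for p in range(seq_len))
    return correct / seq_len

def exact_exon_accuracy(true_exons, predicted_exons):
    """Fraction of true exons predicted with both boundaries exactly right."""
    return len(set(true_exons) & set(predicted_exons)) / len(true_exons)

# Hypothetical annotation and prediction for a 1000 bp sequence.
true_exons = [(100, 199), (400, 499), (800, 899)]
pred_exons = [(100, 199), (390, 499), (800, 899)]
print(nucleotide_accuracy(true_exons, pred_exons, 1000))  # 0.99
print(exact_exon_accuracy(true_exons, pred_exons))        # 2/3

Note how a prediction can score very well at the nucleotide level (here 99%) while missing an exact exon, which is why both measures are reported.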
This high variance graphically demonstrates the unreliability of the currently available genefinding performance figures: if by chance a different set of human genes had been included in GenBank, the numbers would have been quite different, and probably lower, since GenBank is biased towards genes with fewer exons and shorter introns. We need a much larger sample of human genes before we can get stable performance numbers. Reese's datasets, like those of Burset and Guigó, contain exactly one gene per sequence; little is known about the accuracy of genefinders on large genomic sequences containing multiple genes, although some harder and more realistic human genomic data, consisting of large annotated contigs, is available. The Sanger Centre also proposes a standardized format, Gene Finding Format or GFF, for both gene annotation and comparing the results of various genefinders. It would greatly aid the maturation of this field if we could agree on a simple standard data interchange format like this. Once this is established, we could then share a set of tools for the display, comparison, analysis and combination of different gene predictions, along with auxiliary sequence annotation.

15.5 Comparative Genomics Approaches

As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is the comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan. Comparative gene finding can also be used to project high-quality annotations from one genome to another; notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.

15.6 Let us sum up

We briefly reviewed computational methods for finding genes in genomic DNA sequences. Specific programs are now available to find genes in the genomic DNA of many organisms; we discussed the approaches used by these programs, their performance, and future directions for this field. It is important to distinguish two different goals in genefinding research. The first goal is to provide computational methods to aid in the annotation of the large volume of genomic data that is produced by genome sequencing efforts. The second goal is to provide a computational model to help elucidate the mechanisms involved in transcription, splicing, polyadenylation and other critical processes in the pathway from genome to proteome. While there is some overlap in these goals, there is also some conflict, and no one computational genefinding approach will be optimal for both. A "purist" system that mimics the cellular processes cannot take advantage of homologies with other proteins and matches to EST sequences when deciding where to splice.
Such a purist system presumably should not use codon statistics, frame consistency between exons, or lack of in-frame stop codons to predict overall gene structure, although there is some evidence that the absence of early in-frame stop codons may be involved in biological start site selection. One would think that these restrictions would completely cripple computational genefinding methods. However, Guigó has shown that just using simple weight matrices to find the best combination of splice site signals and translation start and stop signals, along with the standard syntactic constraints on gene structure (frame consistency, no in-frame stop codons, minimum intron size), gives results on his benchmark data set that are comparable to those obtained by most of the genefinders he and Burset tested in 1995. These results are not competitive with the older genefinders that use protein homology, nor with the newer methods that use exon coding potential but not homology, but they nevertheless indicate a surprising potential for purist genefinding models. More detailed models of the splicing process, the selection of translation start, and the process of polyadenylation may significantly improve such purist models. These models may prove useful in human genome annotation for finding rapidly evolving and rarely expressed genes, especially those with unusual codon usage. However, if we simply want to produce genefinders that give the most reliable annotation in "everyday" genome center annotation efforts, it is clear that more work needs to be done to incorporate EST information along with protein homology and powerful statistical models.

There are other key issues that will affect future research in both of the above computational genefinding paradigms. One is the issue of alternative splicing: no currently available genefinder handles alternative splicing in an effective manner. Intimately tied with this issue is that of gene regulation. The abundant regulatory signals flanking genes, and appearing in introns (and sometimes in exons), combined with regulatory proteins specific to the cell type and cell state, determine the expression of the gene. Gene annotation is not complete until these signals are identified, and the cellular conditions that give rise to differing expression levels for different transcripts are worked out. This implies, among other things, that future genefinders will need to explicitly take into account experimental data relating to differential expression, along with the other types of data we have discussed. It may be anticipated that this task will occupy genefinding researchers for some years to come.

15.7 Check your progress: Model answers

1. Your answer must include these points: Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well understood than in prokaryotes, making them more difficult to recognize reliably. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and poly(A) (polyadenylation) signals.

15.8 Points for Discussion

1. "Gene prediction is closely related to computational biology" - Substantiate.

15.9 References
"In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists". Bioinformatics 23 (4): 414-420. doi:10.1093/Bioinformatics /btl639. 2. Hiller M, Pudimat R, Busch A, Backofen R (2006). "Using RNA secondary structures to guide sequence motif finding towards single-stranded regions". Nucleic Acids Res 34 (17): e117. Entrez PubMed 16987907. 3. Patterson DJ, Yasuhara K, Ruzzo WL (2002). "Pre- mRNA secondary structure prediction aids splice site prediction". Pac Symp Biocomput: 223-234. Entrez PubMed 11928478. 4. Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H (2006). "Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks". Comput Biol Chem 30 (1): 50-57. Entrez PubMed 16386465. This watermark does not appear in the registered version - http://www.clicktoconvert.com 184 LESSON – 16 FRAGMENT ASSEMBLY 16.0 Aims and objectives 16.1 Fragment assembly 16.1.1 Introduction 16.1.2 New ideas 16.1.3 Error correction 16.1.4 Error correction or data corruption? 16.1.5 Eulerian superpath problem 16.2 Let us Sum up 16.3 Lesson end activities 16.4 Check your progress 16.5 Points for Discussion 16.6 References 16.0 Aims and Objectives: This watermark does not appear in the registered version - http://www.clicktoconvert.com 185 To study about the fragment assembly, introduction, new ideas, error correction, error correction or data corruption?, eulerian superpath problem. 16.1 Fragment Assembly 16.1.1 Introduction For the last twenty years fragment assembly in DNA sequencing mainly followed the “overlap layout - consensus” paradigm, that is used in all currently available software tools for fragment assembly. Although this approach proved to be useful in assembling contigs of moderate sizes, it faces difficulties while assembling prokaryotic genomes a few million bases long. These difficulties led to introduction of the double-barreled DNA sequencing that uses additional experimental information for assembling large genomes in the framework of the same “overlap layout - consensus” paradigm. Although the classical approach culminated in some excellent fragment assembly tools (Phrap, CAP3, TIGR, and Celera assemblers are among them), critical analysis of the “overlap - layout - consensus” paradigm reveals some weak points. First, the overlap stage finds pairwise similarities that do not always provide true information on whether the fragments (sequencing reads) overlap. A better approach would be to reveal multiple similarities between fragments since sequencing errors tend to occur at random positions while the differences between repeats are always at the same positions. However, this approach is infeasible due to high computational complexity of the multiple alignment problem. Another problem with the conventional approach to fragment assembly is that finding the correct path in the overlap graph with many false edges (layout problem) becomes very difficult. Unfortunately, these problems are difficult to overcome in the framework of the “overlap - layout - consensus” approach and the existing fragment assembly algorithms are often unable to resolve the repeats even in prokaryotic genomes. Inability to resolve repeats and to figure out the order of contigs leads to additional experimental work to complete the assembly. 
Indeed, all the programs we tested made errors while assembling shotgun reads from the bacterial sequencing projects Campylobacter jejuni, Neisseria meningitidis, and Lactococcus lactis. Biologists at large sequencing centers are well aware of potential assembly errors and are forced to carry out additional experimental tests to verify the assembled contigs. Bioinformaticians are also aware of assembly errors, as evidenced by finishing software that supports experiments correcting these errors.

How can one resolve these problems? Surprisingly enough, an unrelated area, DNA arrays, provides a hint. Sequencing by Hybridization (SBH) is a ten-year-old idea that never became practical but (indirectly) created the DNA arrays industry. Conceptually, SBH is similar to fragment assembly; the only difference is that the "reads" in SBH are much shorter l-tuples. In fact, the very first attempts to solve the SBH fragment assembly problem [3, 9] followed the "overlap-layout-consensus" paradigm. However, even in the simple case of error-free SBH data, the corresponding layout problem leads to the NP-complete Hamiltonian Path Problem. Pevzner (1989) proposed a different approach that reduces SBH to an easy-to-solve Eulerian Path Problem in the de Bruijn graph by abandoning the "overlap-layout-consensus" paradigm. Since the Eulerian path approach transforms a once difficult layout problem into a simple one, a natural question is: "Could the Eulerian path approach be applied to fragment assembly?" Idury and Waterman (1995) answered this question by recasting the fragment assembly problem as an SBH problem. They represented every read of length n as a collection of n − l + 1 l-mers and applied an Eulerian path algorithm to the set of l-tuples formed by the union of such collections over all reads. At first glance this transformation of every read into a collection of l-tuples is a very short-sighted procedure, since information about the sequencing reads is lost. However, the loss of information is minimal for large l and is well paid for by the computational advantages of the Eulerian path approach in the resulting easy-to-analyze graph; moreover, the lost information can be restored at later stages. Unfortunately, the Idury-Waterman approach, while very promising, did not scale up well. The problem is that sequencing errors transform a simple de Bruijn graph (corresponding to error-free SBH) into a tangle of erroneous edges. For a typical sequencing project, the number of erroneous edges is a few times larger than the number of real edges, and finding the correct path in this graph is an extremely difficult, if not impossible, task. Moreover, repeats in prokaryotic genomes pose serious challenges even in the case of error-free data, since the de Bruijn graph gets very tangled and difficult to analyze. Here we abandon the classical "overlap-layout-consensus" approach in favor of a new Eulerian superpath approach. The main result is the reduction of the fragment assembly problem to a variation of the classical Eulerian path problem. This reduction opens new possibilities for repeat resolution and leads to the EULER software, which generated optimal solutions for the large-scale assembly projects that were studied.
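A minimal sketch of the Idury-Waterman construction on error-free reads: every l-tuple in a read becomes a directed edge from its (l−1)-tuple prefix to its (l−1)-tuple suffix. This illustrates the idea only; it is not the EULER implementation.

from collections import defaultdict

def de_bruijn_graph(reads, l):
    """Build the de Bruijn graph: every l-tuple in any read becomes a
    directed edge from its (l-1)-prefix to its (l-1)-suffix."""
    graph = defaultdict(list)  # (l-1)-tuple -> list of successor (l-1)-tuples
    for read in reads:
        for i in range(len(read) - l + 1):
            ltuple = read[i:i + l]
            graph[ltuple[:-1]].append(ltuple[1:])
    return graph

reads = ["ACGTACGA", "GTACGAAC"]
for node, succs in sorted(de_bruijn_graph(reads, 4).items()):
    print(node, "->", ", ".join(succs))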
16.1.2 New Ideas

Given two similar reads, how can we decide whether they correspond to the same region (i.e., the differences between them are due to sequencing errors) or to two copies of a repeat located in different parts of the genome? This problem is crucial for all fragment assembly algorithms, and the pairwise comparison used in conventional algorithms does not adequately resolve it. Our error-correction procedure implicitly uses multiple comparison of reads and successfully distinguishes these two situations. Both Idury and Waterman (1995) and Myers (1995) tried to deal with errors and repeats via graph reductions, but neither method explores multiple alignment of reads to fix sequencing errors at the pre-processing stage. Of course, multiple alignment of reads is costly, and pairwise alignment is the only realistic option at the overlap stage of conventional fragment assembly algorithms. However, multiple alignment becomes feasible when we deal with perfect or nearly perfect matches of short l-tuples, which is exactly the case in the SBH approach to fragment assembly. Our error correction idea uses the multiple alignment of short substrings to modify the original reads and to create a new instance of the fragment assembly problem with a greatly reduced number of errors. The error correction makes our reads almost error-free and transforms the original very large graph into a graph with very few erroneous edges. In some sense, the error correction is a variation of the consensus step, taken at the very first step of fragment assembly rather than at the last one as in the conventional approach.

Imagine an ideal situation in which the error-correction procedure eliminated all errors and we deal with a collection of error-free reads. Is there an algorithm to reliably assemble such error-free reads in a large-scale sequencing project? At first glance the problem looks simple, but surprisingly enough the answer is no: we are unaware of any algorithm that solves this problem. For example, the Phrap, CAP3 and TIGR assemblers make 17, 14, and 9 assembly errors respectively while assembling real reads from the N. meningitidis genome, and all of them still make errors while assembling the error-free reads from this genome (although the number of errors is reduced to 5, 4, and 2 respectively). Although the TIGR assembler makes fewer errors than the other programs, this accuracy does not come for free, since it produces twice as many contigs as the other programs do. EULER made no assembly errors and produced fewer contigs with real data than the other programs produced with error-free data! EULER can also be used to immediately improve the accuracy of the Phrap, CAP3 and TIGR assemblers: these programs produce better assemblies if they use error-corrected reads from EULER. To achieve such accuracy, EULER has to overcome the bottleneck of the Idury-Waterman approach and restore the information about sequencing reads that was lost in the construction of the de Bruijn graph. Our second idea, the Eulerian Superpath, addresses this problem. Every sequencing read corresponds to a path in the de Bruijn graph called a read-path. An attempt to take into account the information about the sequencing reads leads to the problem of finding an Eulerian path that is consistent with all read-paths, the Eulerian Superpath Problem. Below we show how to solve this problem.
This simple description hides some algorithmic challenges.

16.1.3 Error Correction

Sequencing errors make implementation of the SBH-style approach to fragment assembly difficult. To bypass this problem we reduce the error rate by a factor of 35-50 at the preprocessing stage, making the data almost error-free, by solving the Error Correction Problem. We use the N. meningitidis (NM) sequencing project completed at the Sanger Centre as an example. NM is one of the most "difficult-to-assemble" bacterial genomes completed so far: it has 126 long perfect repeats up to 3832 bp in length (not to mention many imperfect repeats). The length of the genome is 2,184,406 bp. The sequencing project resulted in 53,263 reads of average length 400 (average coverage 9.7). There were 255,631 errors overall distributed over these reads, i.e., 4.8 errors per read (an error rate of 1.2%).

Let s be a sequencing read (with errors) derived from a genome G. If the sequence of G were known, then the errors in s could be corrected by aligning the read s against the genome G. In real life, the sequence of G is not known until the very last "consensus" stage of fragment assembly. It is a catch-22: to assemble a genome it is highly desirable to correct errors in reads first, but to correct errors in reads one has to assemble the genome first. To bypass this catch-22, let us assume that, although the sequence of G is unknown, the set Gl of all contiguous strings of fixed length l (l-tuples) present in G is known. Of course, Gl is unknown as well, but Gl can be reliably approximated without knowing the sequence of G. An l-tuple is called solid if it belongs to more than M reads (where M is a threshold) and weak otherwise. A natural approximation for Gl is the set of all solid l-tuples from a sequencing project.

Let T be a collection of l-tuples called a spectrum. A string s is called a T-string if all its l-tuples belong to T. Our approach to error correction leads to the following Spectral Alignment Problem: given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string. A similar problem was considered by Pe'er and Shamir (2000) in the different context of resequencing by hybridization. In the context of error correction, the solution of the Spectral Alignment Problem makes sense only if the number of mutations is small; in this case the Spectral Alignment Problem can be efficiently solved by dynamic programming, even for large l. Spectral alignment of a read against the set of all solid l-tuples from a sequencing project suggests error corrections, which may in turn change the sets of weak and solid l-tuples. Iterative spectral alignment with the set of all reads and all solid l-tuples gradually reduces the number of weak l-tuples, increases the number of solid l-tuples, and leads to the elimination of many errors in bacterial sequencing projects. Although the Spectral Alignment Problem helps to eliminate errors (and we use it as one of the steps in EULER), it does not adequately capture the specifics of fragment assembly. The Error Correction Problem described below is somewhat less natural than the Spectral Alignment Problem, but it is probably a better model for fragment assembly (although it is not a perfect model either).
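A minimal sketch of how the spectrum approximation above might be computed: count, for each l-tuple, the number of reads containing it, and keep those exceeding the threshold M. The parameters and toy reads are illustrative.

from collections import Counter

def solid_ltuples(reads, l, M):
    """Return the set of l-tuples occurring in more than M reads ('solid');
    all other observed l-tuples are 'weak'."""
    multiplicity = Counter()
    for read in reads:
        # Count each l-tuple once per read, since multiplicity is per read.
        for t in {read[i:i + l] for i in range(len(read) - l + 1)}:
            multiplicity[t] += 1
    return {t for t, m in multiplicity.items() if m > M}

reads = ["ACGTACGT", "CGTACGTT", "GTACGTTA", "ACGTACGA"]
print(sorted(solid_ltuples(reads, l=4, M=2)))
# ['ACGT', 'CGTA', 'GTAC', 'TACG']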
The greedy heuristic for the Error Correction Problem eliminates up to 97% of errors in a typical bacterial project. Given a collection of reads (strings) S = {s1, . . . , sn} from a sequencing project and an integer l, the spectrum of S is the set Sl of all l-tuples from the reads s1, . . . , sn and s̄1, . . . , s̄n, where s̄ denotes the reverse complement of read s. Let ∆ be an upper bound on the number of errors in each DNA read. This motivates the Error Correction Problem: given S, ∆, and l, introduce up to ∆ corrections in each read in S in such a way that |Sl| is minimized.

An error in a read s affects at most l l-tuples in s and l l-tuples in s̄, and usually creates 2l erroneous l-tuples that point to the same sequencing error (2d for positions within a distance d < l from the endpoint of the read). Therefore, a greedy approach to the Error Correction Problem is to look for an error correction in the read s that reduces the size of Sl by 2l (or 2d for positions close to the endpoints of the read). This simple procedure already eliminates 86.5% of errors in sequencing reads. Below we describe a more involved approach that eliminates 97.7% of sequencing errors. This approach transforms the original fragment assembly problem, with 4.8 errors per read on average, into an almost error-free problem with 0.11 errors per read on average.

Two l-tuples are called neighbors if they are one mutation apart. For an l-tuple a, define its multiplicity m(a) as the number of reads in S containing this l-tuple. An l-tuple a is called an orphan if (i) it has small multiplicity, i.e., m(a) ≤ M, where M is a threshold, (ii) it has only one neighbor b, and (iii) m(b) > m(a). The position where an orphan and its neighbor differ is called an orphan position. A sequencing read is orphan-free if it contains no orphans. An important observation is that an erroneous l-tuple created by a sequencing error usually does not appear in other reads and is usually one mutation apart from a real l-tuple (for an appropriately chosen l); therefore, a mutation in a read usually creates 2l orphans. This observation leads to an approach that corrects errors in orphan positions within the sequencing reads, provided the overall number of error corrections needed to make a given read orphan-free is at most ∆. The greedy orphan elimination approach to the Error Correction Problem starts with error corrections at the orphan positions that reduce the size of Sl by 2l (or 2d for positions at distance d < l from the endpoints of the reads). After all such errors are corrected, the "2l condition" gradually transforms into a weaker "2l − δ" condition.

16.1.4 Error Correction or Data Corruption?

A word of caution is in order. Our error-correction procedure is not perfect when deciding which nucleotide, say A or T, is correct in a given l-tuple within a read. If the correct nucleotide is A, but T is also present in some reads covering the same region, the error-correction procedure may assign T instead of A to all reads, i.e., introduce an error rather than correct one (particularly in low-coverage regions).
Since our algorithm sometimes introduces errors, "data corruption" is probably a more appropriate name for this approach! Introducing an error in a read is not such a bad thing as long as the errors from overlapping reads covering the same position are consistent (i.e., they correspond to a single mutation in a genome). An important insight is that, at this stage of the algorithm, we do not care much whether we correct or introduce errors in the sequencing reads. From an algorithmic perspective, introducing an error, which simply corresponds to changing a nucleotide in the final assembly, is not a big deal. It is much more important to make sure that we eliminate the competition between A and T at this stage, thus reducing the complexity of the de Bruijn graph. In this way we eliminate false edges in our graph and deal with the problem later: the correct nucleotide can easily be reconstructed at the final consensus stage of the algorithm. For the N. meningitidis sequencing project, orphan elimination corrects 234,410 errors and introduces 1,452 errors, a tenfold reduction in the number of sequencing errors (0.44 errors per read).

The orphan elimination procedure is usually run with M = 2, since orphan elimination with M = 1 leaves some errors uncorrected. For a sequencing project with coverage 10 and error rate 1%, every solid 20-tuple has on average 2 orphans o1 and o2, each with multiplicity 1 (i.e., the expected multiplicity of this 20-tuple is 8 rather than 10, as it would be for error-free reads). With some probability, the same errors in (different) reads correspond to the same position in the genome, thus "merging" o1 and o2 into a single l-tuple o with m(o) = 2. Although the probability of such an event is relatively small, the overall number of such cases is large for large genomes. In our studies of bacterial genomes, setting M = 2 and simultaneously correcting up to M multiple errors worked well in practice. With M = 2, we eliminated an additional 705 errors and created 131 errors (21,837 errors, or 0.41 errors per read, remain).

Orphan elimination is a more conservative procedure than spectral alignment. Orphans were defined as l-tuples of low multiplicity that have only one neighbor. The latter condition (which is not captured by spectral alignment) is important, since in the case of multiple neighbors it is not clear how to correct an error in an orphan. For the N. meningitidis genome there were 1,862 weak 20-mers (with M = 2) that had multiple neighbors. Our approach to this problem is to increase l in the hope that there is only one "competing" neighbor for longer l. After increasing l from 20 to 100, the number of orphans with multiple neighbors was reduced from 1,862 to 17. Orphan elimination should be done with caution, since errors in reads are sometimes hard to distinguish from differences between repeats. If we treated the differences between repeats (particularly repeats with low coverage) as errors, then orphan elimination would correct the differences between repeats instead of correcting errors, which may lead to an inability to resolve the repeats at later stages. It is also important to realize that error corrections in orphan positions often create new orphans. Imagine a read containing an imperfect low-coverage (≤ M) copy of a repeat that differs from a high-coverage (> M) copy of this repeat by a substitution of a block of t consecutive nucleotides.
Without knowing that we are dealing with a repeat, the orphan elimination procedure would first detect two orphans, one ending in the first position of the block and the other starting in the last position of the block. If the orphans are eliminated without checking the "at most ∆ corrections per read" condition, these two error corrections will shrink the block to size t − 2 and create two new orphans at the beginning and end of the shrunken block. At the next step, the procedure would correct the first and last nucleotides of the shrunken block and, in just t/2 steps, erase the differences between the two copies of the repeat.

Of course, for long bacterial genomes many "bad" events that may look improbable do happen, and there are two types of errors that resist greedy orphan elimination. They require a few coordinated error corrections, since single error corrections do not lead to a significant reduction in the size of Sl and thus may be missed by the greedy procedure. These errors are: (i) consecutive or closely spaced errors in the same read, and (ii) the same error with high multiplicity (> M) at the same genome position in different reads.

The first type of error is best addressed by solving the Spectral Alignment Problem to identify reads that require fewer than ∆ error corrections. We found that some reads from the N. meningitidis project have very poor spectral alignments. These reads are likely to represent contamination, vector, isolated reads, or an error in the sequencing pipeline. All such reads are of limited interest and should be discarded; in fact, it is common practice in sequencing centers to discard such "poor-quality" reads, and we adopt this approach. Although deleting poor-quality reads may slightly reduce the amount of available sequencing information, it greatly simplifies the assembly process. Another important advantage of spectral alignment is the ability to identify chimeric reads. Such reads are characterized by good spectral alignments of the prefix and suffix parts that, however, cannot be extended to a good spectral alignment of the entire read. EULER breaks the chimeric reads into two or more pieces and preserves the pieces.

The second type of error reflects the situation with M identical errors in different reads corresponding to the same genome position and generating an erroneous l-tuple with high multiplicity. For example, if both the correct and the erroneous l-tuple have multiplicity 3 (with the default threshold M = 2), it is hard to decide whether we are dealing with a unique region (with coverage 6) or with two copies of an imperfect repeat (each with coverage 3). In the N. meningitidis project there were 1,610 errors with multiplicity 3 or larger. The algorithm to correct such high-multiplicity errors is described elsewhere.

Check your progress:

1. What is an orphan position?

Notes: ee) Write your answer in the space given below. ff) Check your answer with the one given at the end of this lesson.
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………

16.1.5 Eulerian Superpath Problem

As we have discussed, the idea of the Eulerian path approach to SBH is to construct a graph whose edges correspond to l-tuples and to find a path visiting every edge of this graph exactly once. Given a set of reads S = {s1, . . . , sn}, define the de Bruijn graph G(Sl) with vertex set Sl−1 (the set of all (l−1)-tuples from S) as follows. An (l−1)-tuple v ∈ Sl−1 is joined by a directed edge to an (l−1)-tuple w ∈ Sl−1 if Sl contains an l-tuple whose first l−1 nucleotides coincide with v and whose last l−1 nucleotides coincide with w. Each l-tuple from Sl thus corresponds to an edge in G. If S contains only one sequence s1, then this sequence corresponds to a path visiting each edge of G exactly once, an Eulerian path. Finding Eulerian paths is a well-known problem that can be solved efficiently. The reduction from SBH to the Eulerian path problem described above assumes unit multiplicities of edges (no repeated l-tuples) in the de Bruijn graph. We usually assume that S contains the reverse complement of every read; in this case G(Sl) includes the reverse complement of every l-tuple, and the de Bruijn graph can be partitioned into two subgraphs, one corresponding to a "canonical" sequence and the other to its reverse complement.

With real data, the errors hide the correct path among many erroneous edges. The overall number of vertices in the graph corresponding to the error-free data from the NM project is 4,039,248 (roughly twice the length of the genome), while the overall number of vertices in the graph corresponding to the real sequencing reads is 9,474,411 (for 20-mers). After the error-correction procedure, this number is reduced to 4,081,857. A vertex v is called a source if indegree(v) = 0, a sink if outdegree(v) = 0, and a branching vertex if indegree(v) · outdegree(v) > 1. For the N. meningitidis genome, the de Bruijn graph has 502,843 branching vertices for the original reads (for l-tuple size 20). Error correction simplifies this graph, leading to a graph with 382 sources and sinks and 12,175 branching vertices; the error-free reads lead to a graph with 11,173 branching vertices. Since the de Bruijn graph gets very complicated even in the error-free case, taking into account the information about which l-tuples belong to the same reads (information that was lost in the construction of the de Bruijn graph) helps us to untangle this graph.

A path v1 . . . vn in the de Bruijn graph is called a repeat if indegree(v1) > 1, outdegree(vn) > 1, and outdegree(vi) = 1 for 1 ≤ i ≤ n − 1. Edges entering the vertex v1 are called entrances into the repeat, while edges leaving the vertex vn are called exits from the repeat. An Eulerian path visits a repeat a few times, and every such visit defines a pairing between an entrance and an exit. Repeats may create problems in fragment assembly, since there are a few entrances into a repeat and a few exits from a repeat, but it is not clear which exit is visited after which entrance in the Eulerian path. However, most repeats can be resolved by read-paths (i.e., paths in the de Bruijn graph that correspond to sequencing reads) covering these repeats.
A read-path covers a repeat if it contains an entrance into this repeat and an exit from this repeat. Every covering read-path reveals some information about the correct pairings between entrances and exits. However, some parts of the de Bruijn graph are impossible to untangle, due to long perfect repeats that are not covered by any read-path. A repeat is called a tangle if there is no read-path containing it. Tangles create problems in fragment assembly, since the pairings of entrances and exits in a tangle cannot be resolved via the analysis of read-paths. To address this issue we formulate the following generalization of the Eulerian Path Problem.

Eulerian Superpath Problem: Given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

The classical Eulerian Path Problem is the particular case of the Eulerian Superpath Problem in which every path is a single edge. To solve the Eulerian Superpath Problem, we transform both the graph G and the system of paths P in this graph into a new graph G1 with a new system of paths P1. Such a transformation is called equivalent if there exists a one-to-one correspondence between Eulerian superpaths in (G, P) and (G1, P1). Our goal is to make a series of equivalent transformations (G, P) → (G1, P1) → . . . → (Gk, Pk) that lead to a system of paths Pk in which every path is a single edge. Since all transformations on the way from (G, P) to (Gk, Pk) are equivalent, every solution of the Eulerian Path Problem in (Gk, Pk) provides a solution of the Eulerian Superpath Problem in (G, P).

Below we describe a simple equivalent transformation that solves the Eulerian Superpath Problem in the case when the graph G has no multiple edges. Let x = (vin, vmid) and y = (vmid, vout) be two consecutive edges in the graph G, and let Px,y be the collection of all paths from P that include both these edges as a subpath. Define P→x as the collection of paths from P that end with x, and Py→ as the collection of paths from P that start with y. The x,y-detachment is a transformation that adds a new edge z = (vin, vout) and deletes the edges x and y from G. This detachment alters the system of paths P as follows: (i) substitute z for x, y in all paths from Px,y; (ii) substitute z for x in all paths from P→x; and (iii) substitute z for y in all paths from Py→. Informally, the detachment bypasses the edges x and y via the new edge z and directs all paths in P→x, Py→, and Px,y through z. Since every detachment reduces the number of edges in G, the detachments will eventually shorten all paths from P to single edges and will reduce the Eulerian Superpath Problem to the Eulerian Path Problem.

However, in the case of graphs with multiple edges, the detachment procedure described above may lead to non-equivalent transformations. In this case the edge x may be visited many times in an Eulerian path, and it may or may not be followed by the edge y on some of these visits. That is why, in the case of multiple edges, "directing" all paths from the set P→x through a new edge z may not be an equivalent transformation. However, if the vertex vmid has no incoming edges other than x and no outgoing edges other than y, then the x,y-detachment is an equivalent transformation even if x and y are multiple edges. In particular, detachments can be used to reduce every repeat to a single edge.
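A minimal sketch of the x,y-detachment on a toy edge-list representation, assuming vmid has x and y as its only incident edges (the case in which the transformation is equivalent); this illustrates the definition rather than EULER's actual data structures.

def detach(edges, paths, x, y, z):
    """Apply an x,y-detachment: replace the consecutive edges x and y by a
    new edge z, and reroute the affected paths through z.
    edges: list of edge names; paths: lists of edge names."""
    edges = [e for e in edges if e not in (x, y)] + [z]
    new_paths = []
    for p in paths:
        q, i = [], 0
        while i < len(p):
            if p[i] == x and i + 1 < len(p) and p[i + 1] == y:
                q.append(z); i += 2      # path in P_{x,y}: x,y -> z
            elif p[i] == x and i + 1 == len(p):
                q.append(z); i += 1      # path ending with x (P->x) -> z
            elif p[i] == y and i == 0:
                q.append(z); i += 1      # path starting with y (Py->) -> z
            else:
                q.append(p[i]); i += 1
        new_paths.append(q)
    return edges, new_paths

edges = ["a", "x", "y", "b"]  # x = (vin, vmid), y = (vmid, vout)
paths = [["a", "x", "y"], ["x", "y", "b"], ["a", "x"], ["y", "b"]]
print(detach(edges, paths, "x", "y", "z"))
# (['a', 'b', 'z'], [['a', 'z'], ['z', 'b'], ['a', 'z'], ['z', 'b']])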
It is important to realize that even when the graph G has no multiple edges, the detachments may create multiple edges in the graphs G1, . . . , Gk (for example, if the edge (vin, vout) was present in the graph prior to the detachment). However, such multiple edges do not pose problems, since in this case it is clear which instance of the multiple edge is used in every path. For illustration, consider the simple case in which the vertex vmid has only one incoming edge x = (vin, vmid), with multiplicity 2, and two outgoing edges y1 = (vmid, vout1) and y2 = (vmid, vout2), each with multiplicity 1. In this case the Eulerian path visits the edge x twice: in one case it is followed by y1 and in the other by y2. Consider an x,y1-detachment that adds a new edge z = (vin, vout1) after deleting the edge y1 and one of the two copies of the edge x. This detachment (i) shortens all paths in Px,y1 by substituting the single edge z for x, y1 and (ii) substitutes z for y1 in every path from Py1→. This detachment is an equivalent transformation if the set P→x is empty. However, if P→x is not empty, it is not clear whether the last edge of a path P ∈ P→x should be assigned to the edge z or to the (remaining copy of) edge x. To resolve this dilemma, one has to analyze every path P ∈ P→x and decide whether it "relates" to Px,y1 (in which case it should be directed through z) or to Px,y2 (in which case it should be directed through x). By "relates" to Px,y1 (Px,y2) we mean that every Eulerian superpath visits y1 (y2) right after visiting P. Two paths are called consistent if their union is a path again (there are no branching vertices in their union). A path P is consistent with a set of paths P if it is consistent with all paths in P, and inconsistent otherwise (i.e., if it is inconsistent with at least one path in P). There are three possibilities:

• P is consistent with exactly one of the sets Px,y1 and Px,y2.
• P is inconsistent with both Px,y1 and Px,y2.
• P is consistent with both Px,y1 and Px,y2.

In the first case, the path P is called resolvable, since it can be unambiguously related to either Px,y1 or Px,y2. If P is consistent with Px,y1 and inconsistent with Px,y2, then P should be assigned to the edge z after the x,y1-detachment (substitute z for x in P). If P is inconsistent with Px,y1 and consistent with Px,y2, then P should be assigned to the edge x (no action taken). An edge x is called resolvable if all paths in P→x are resolvable. If the edge x is resolvable, then the described x,y-detachment is an equivalent transformation after the correct assignment of the last edge in every path from P→x. In our analysis of the NM genome we found that 18,026 of the 18,962 edges in the de Bruijn graph are resolvable. Although we defined the notion of a resolvable path for the simple case in which the edge x has multiplicity 2, it can be generalized to edges with arbitrary multiplicities. The second condition (P is inconsistent with both Px,y1 and Px,y2) implies that the Eulerian Superpath Problem has no solution, i.e., the sequencing data are inconsistent: informally, in this case P, Px,y1 and Px,y2 impose three different scenarios for just two visits of the edge x.
After discarding the poor-quality and chimeric reads, we did not encounter this condition in our analysis of the NM genome. The last condition (P is consistent with both Px,y1 and Px,y2) corresponds to the most difficult situation and deserves special discussion. If this condition holds for at least one path in P→x, the edge x is called unresolvable, and we postpone the analysis of this edge until all resolvable edges have been analyzed. It may turn out that equivalent transformations of other resolvable edges will make the edge x resolvable; that is, equivalent transformations may resolve previously unresolvable edges. However, some edges cannot be resolved even after the detachments of all resolvable edges are completed. Such situations usually correspond to tangles, and they have to be addressed by another equivalent transformation called a cut.

Consider a fragment of the graph G with five edges and four paths y3-x, y4-x, x-y1 and x-y2, each path consisting of two edges. In this case P→x consists of the two paths y3-x and y4-x, and each of those paths is consistent with both Px,y1 and Px,y2. In fact, in this symmetric situation x is a tangle, and there is no information available to relate either of the paths y3-x and y4-x to either of the paths x-y1 and x-y2. Therefore, it may happen that no detachment is an equivalent transformation in this case. To address this problem, we introduce another equivalent transformation that affects the system of paths P but does not affect the graph G itself. An edge x = (v, w) is removable if (i) it is the only outgoing edge for v and the only incoming edge for w, and (ii) x is either the initial or the terminal edge of every path P ∈ P containing x. An x-cut transforms P into a new system of paths by simply removing x from all paths in P→x and Px→. In the case above, the x-cut shortens the paths x-y1, x-y2, y3-x, and y4-x to the single-edge paths y1, y2, y3, and y4. It is easy to check that an x-cut of a removable edge is an equivalent transformation, i.e., every Eulerian superpath in (G, P) corresponds to an Eulerian superpath in (G, P1) and vice versa.

Cuts proved to be a powerful technique for analyzing tangles that are not amenable to detachments. Detachments reduce such tangles to single unresolvable edges, which turned out to be removable in our analysis of bacterial genomes. This allowed us to reduce the Eulerian Superpath Problem to the Eulerian Path Problem for all studied bacterial genomes. Although detachments and cuts are sufficient for the bacterial genomes studied, there is still a gap in the theoretical analysis of the Eulerian Superpath Problem in the case when the system of paths is amenable to neither detachments nor cuts. The idea of equivalent graph transformations for fragment assembly is conceptually similar to the idea of equivalent graph transformations for genome rearrangements. We also emphasize that our equivalent transformation approach is very different from previously suggested graph reduction techniques for fragment assembly.

16.2 Let us Sum up

This lesson gives a detailed description and explanation of fragment assembly: the introduction, new ideas, error correction, error correction or data corruption, and the Eulerian superpath problem.

16.3 Lesson end activities

1. Could the Eulerian path approach be applied to fragment assembly?
2. Given two similar reads, how can we decide whether they correspond to the same region or to two copies of a repeat located in different parts of the genome?

16.4 Check your progress: Model answers

Your answer must include these points: The position where an orphan and its neighbor differ is called an orphan position. A sequencing read is orphan-free if it contains no orphans.

16.5 Points for Discussion

1. "Fragment assembly is a superior task in gene production" - Discuss.

16.6 References

1. Saeys Y, Rouzé P, Van de Peer Y (2007). "In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists". Bioinformatics 23 (4): 414-420. doi:10.1093/bioinformatics/btl639.
2. Hiller M, Pudimat R, Busch A, Backofen R (2006). "Using RNA secondary structures to guide sequence motif finding towards single-stranded regions". Nucleic Acids Res 34 (17): e117. PubMed 16987907.
3. Patterson DJ, Yasuhara K, Ruzzo WL (2002). "Pre-mRNA secondary structure prediction aids splice site prediction". Pac Symp Biocomput: 223-234. PubMed 11928478.
4. Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H (2006). "Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks". Comput Biol Chem 30 (1): 50-57. PubMed 16386465.

LESSON – 17 GENOME SEQUENCE ASSEMBLY

17.0 Aims and objectives
17.1 Genome sequence assembly
17.1.1 Whole genome shotgun sequencing
17.1.2 Clones and coverage
17.1.3 Assembly
17.1.4 Finishing
17.1.5 Assembly algorithms
17.1.6 Overlap-layout-consensus
17.1.7 Eulerian path
17.1.8 Handling repeats
17.1.9 Assessing assembly quality
17.2 Let us Sum up
17.3 Lesson end activities
17.4 Check your progress
17.5 Points for Discussion
17.6 References

17.0 Aims and Objectives:

This unit describes genome sequence assembly: algorithms and issues, whole genome shotgun sequencing, clones and coverage, assembly, finishing, assembly algorithms, overlap-layout-consensus, the Eulerian path, handling repeats, detecting repeats, unresolved repeats, scaffolding software, and assessing assembly quality.

17.1 Genome Sequence Assembly

Each cell of a living organism contains chromosomes composed of a sequence of DNA base pairs. This sequence, the genome, represents a set of instructions that controls the replication and function of each organism. The automated DNA sequencer gave birth to genomics, the analytic and comparative study of genomes, by allowing scientists to decode entire genomes. Although genomes vary in size from millions of nucleotides in bacteria to billions of nucleotides in humans and most animals and plants, the chemical reactions researchers use to decode the DNA base pairs are accurate for only about 600 to 700 nucleotides at a time. The process of sequencing begins by physically breaking the DNA into millions of random fragments, which are then "read" by a DNA sequencing machine. Next, a computer program called an assembler pieces together the many overlapping reads and reconstructs the original sequence. This general technique, called shotgun sequencing, was introduced by Fred Sanger in 1982.
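The shotgun idea can be illustrated by simulating reads from a toy "genome"; the read length, read count and the absence of sequencing errors are simplifying assumptions of this sketch.

import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Simulate shotgun sequencing: sample n_reads random substrings of
    length read_len from the genome (errors and strands ignored here)."""
    rng = random.Random(seed)
    return [genome[p:p + read_len]
            for p in (rng.randrange(len(genome) - read_len + 1)
                      for _ in range(n_reads))]

genome = "ATGCGTACGTTAGCGGATCCGTTACGATCGGA"
reads = shotgun_reads(genome, read_len=10, n_reads=12)
print(reads)
# An assembler must now reconstruct `genome` from these overlapping reads.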
The technique took a quantum leap forward in 1995, when a team led by Craig Venter and Robert Fleischmann of The Institute for Genomic Research (TIGR) and Hamilton Smith of Johns Hopkins University used it on a large scale to sequence the 1.83 million base pair (Mbp) genome of the bacterium Haemophilus influenzae. Much like a large jigsaw puzzle, the DNA reads that shotgun sequencing produces must be assembled into a complete picture of the genome. This seemingly simple process is not without technical challenges. For one thing, the data contains errors: some from limitations in sequencing technology and others from human mistakes during laboratory work. Even in the absence of errors, DNA sequences have features that complicate the assembly process, most notably the repetitive sections called repeats. The human genome, for example, includes some repeats that occur in more than 100,000 copies each. Similar to pieces of sky in jigsaw puzzles, reads belonging to repeats are difficult to position correctly. Further complicating assembly, some DNA fragments from each genome are impossible to sequence, resulting in gaps in coverage. The resolution of these problems entails an additional finishing phase involving a large amount of human intervention. Finishing is very costly, as it requires specialized laboratory techniques and highly trained personnel. Assembly programs can dramatically reduce this cost by taking into account additional information obtained during finishing, yet most current assemblers disregard this information and generate the best possible assembly solely from the initial shotgun reads. Advances in assembly algorithms must include features that help finishing efforts.

17.1.1 Whole Genome Shotgun Sequencing

While shotgun sequencing remains the basic strategy for all genome sequencing projects, its applicability to large genomes has been controversial. Until recently it was applied only at the end of a hierarchical process in which the genome is first broken into large segments cloned into bacterial artificial chromosomes (BACs). The BACs are then mapped to the genome to obtain a tiling path, after which the shotgun method is used to sequence each BAC in the tiling path separately. In contrast, whole-genome shotgun sequencing (WGSS) assembles the genome from the initial fragments without using a BAC map, which requires enormous computational resources. The sheer size of the data argued against WGSS for large projects, as did the presence of repeats: if positioning the long repeat stretches correctly is difficult, automating the process is even harder. In 2000, however, Eugene Myers and colleagues put most doubts to rest when they published a whole-genome assembly of the fruit fly Drosophila melanogaster. Using a new assembler built specifically for very large genomes, the Myers team successfully sequenced and assembled the 135-Mbp genome. The project was 25 times larger than any previous WGSS project, and the team went on to apply the WGSS strategy to sequence and assemble the draft human genome in 2001.

17.1.2 Clones and coverage

A WGSS project begins in the laboratory, where ultrasound or a high-pressure air stream randomly shatters the DNA into pieces that researchers then insert into cloning vectors, or clones. The clone in this case is a circular piece of DNA called a plasmid; it has a known sequence of base pairs and can accept a clone insert of foreign DNA. The bacterium Escherichia coli is then used to multiply the plasmid, thus amplifying the clone insert.
In most projects, researchers sequence both ends of each clone insert, yielding a set of sequencing reads that defines the clone-pairing data for that insert. This process links each read from a clone insert to its clone mate from the opposite end of the insert. The resulting clone-pairing data is extremely valuable, not only in guiding the assembly process but also in correctly ordering the contiguous sequences, or contigs, resulting from assembly. The ultimate goal of sequencing is to determine all the base pairs contained in the DNA. In practice, however, we try to achieve the goal of having more than 99 percent of the genome covered by reads after the initial shotgun phase. To achieve this we need to sequence clones until the reads (averaging 600 to 700 base pairs) provide an eightfold (8X) oversampling of the genome. For example, a 2-Mbp bacterial genome sequenced to 8X coverage requires 16 Mbp, or approximately 27,000 reads. Researchers choose the inserts from among several "libraries" of clone collections generated in the laboratory. The insert size specifies the average distance separating each pair of clone mates, and sizes vary from one library to the next. Typical projects contain at least two insert libraries of sizes 2 to 3 kbp and 8 to 10 kbp, respectively, and may include others, such as BAC libraries of 100 to 150 kbp. The sequenced portion of each insert averages 1,200 bp out of 3,000 bp total, so the clone inserts of a 3-kbp library sequenced to 8X cover 2.5 times as much distance as the sequences themselves. These libraries provide a "clone coverage" of more than 20-fold, meaning that, on average, 20 clones span each of the genome's bases, thus offering the theoretical guarantee that each base is contained in at least one of the clones. This guarantee assumes uniformly random sampling of clones from the genome; in practice this requirement is seldom perfectly satisfied, and cloning biases lead to a nonrandom clone distribution, causing areas of the genome to remain unsequenced regardless of the amount of sequencing performed.

17.1.3 Assembly

A WGSS assembler's task is to combine all the reads into contigs based on sequence similarity between the individual reads. The basic principle is that two overlapping reads, that is, reads where a suffix of one is a prefix of the other, presumably originate from the same region of the genome and can be assembled together. This assumption is invalid, however, for repetitive sequences, where it is impossible to distinguish reads coming from two or more distinct places in the genome. An assembler can thus incorrectly combine the reads from two copies (rpt1A, rpt1B) of a repeat, producing a misassembled contig and throwing out the unique region between the two copies. Repeats represent a major challenge to assembly software, and an assembler's utility depends in large part on detecting and correctly resolving repeat regions; resolving misassemblies in the finishing phases can be costly. Information about clone mates, combined with knowledge about the distribution of clone sizes, may help assembly programs to put some classes of repeats together correctly.
If a repeat is shorter than the length of a clone insert, mate-pair information is enough to separate the individual repeat copies, because each read within the repeat has an anchoring clone mate in the nearby nonrepetitive region.

17.1.4 Finishing
In practice, imperfect coverage, repeats, and sequencing errors cause the assembler to produce not one but hundreds or even thousands of contigs. The task of closing the gaps between contigs and obtaining a complete molecule is called finishing. First, a program called a scaffolder uses clone-mate information to order and orient the contigs with respect to each other into larger structures called scaffolds. Within a scaffold, pairs of reads spanning the gaps between contigs determine the order and orientation of the contigs. Note that the physical DNA molecule has an easily determined direction, even though the textual representation of DNA as a string of A, C, T, or G characters appears to be directionless. The gaps between contigs belonging to the same scaffold are called sequence gaps. Although they represent genuine gaps in the sequence, researchers can retrieve the original clone inserts spanning each gap and use a straightforward "walking" technique to fill in the sequence. Determining the order and orientation of the scaffolds with respect to each other is more difficult. The gaps between scaffolds are called physical gaps because the physical DNA that would span them is either not present in the clone inserts or indeterminable due to misassemblies. Filling these gaps involves a large amount of manual labor and complex laboratory techniques.

Check your progress:
1. Give details about metagenomics.
Notes: gg) Write your answer in the space given below. hh) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

17.1.5 Assembly Algorithms
Researchers first approximated the shotgun sequence assembly problem as one of finding the shortest common superstring of a set of sequences: given a set of input strings {s1, s2, ...}, find the shortest string T such that every si is a substring of T. While this problem has been shown to be NP-hard, there is an efficient approximation algorithm. This greedy algorithm starts by computing all possible overlaps between the strings and assigning a score to each potential overlap. The algorithm then merges strings iteratively, combining the pair whose overlap has the highest score, and continues until no more strings can be merged. While it can be argued that the shortest superstring problem does not correctly model the assembly problem, the first successful assembly algorithms applied this greedy merging heuristic in their design; TIGR Assembler, Phrap, and CAP3, for example, followed this paradigm. Greedy algorithms are relatively easy to implement, but they are inherently local in nature and ignore long-range relationships between reads, which could be useful in detecting and resolving repeats. In addition, all current implementations of the greedy method require up to one gigabyte of RAM for each megabase of assembled sequence, assuming the genome was sequenced at 8X coverage. This limits their applicability on currently available hardware to organisms with genomes of 32 Mbp or less.
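To make the greedy merging heuristic concrete, here is a toy sketch. It is an illustration only—real assemblers such as TIGR Assembler or Phrap score overlaps with alignment quality and tolerate sequencing errors—and here the overlap score is simply the length of an exact suffix-prefix match:

    # Toy greedy assembler: repeatedly merge the pair of strings with the
    # highest-scoring overlap (here, the longest exact suffix-prefix match).

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a equal to a prefix of b."""
        best = 0
        for k in range(min_len, min(len(a), len(b)) + 1):
            if a[-k:] == b[:k]:
                best = k
        return best

    def greedy_assemble(reads, min_len=3):
        reads = list(reads)
        while len(reads) > 1:
            best = (0, None, None)
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        k = overlap(a, b, min_len)
                        if k > best[0]:
                            best = (k, i, j)
            k, i, j = best
            if k == 0:          # no remaining overlaps: stop merging
                break
            merged = reads[i] + reads[j][k:]
            reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
            reads.append(merged)
        return reads            # one or more "contigs"

    print(greedy_assemble(["ACGTTG", "TTGCAA", "CAATGG"]))  # ['ACGTTGCAATGG']

Even at this toy scale, the repeated search over all pairs of reads hints at why real implementations of the method are so memory- and time-hungry.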
Organisms in that size range include bacteria and a few single-celled eukaryotes, but not plants, mammals, or other multicellular organisms. These limitations spurred the development of new algorithms. Two approaches exploit techniques developed in the field of graph theory: one represents the sequence reads as graph nodes, the other as graph edges.

17.1.6 Overlap-layout-consensus
The first approach, overlap-layout-consensus, constructs a graph in which nodes represent reads and edges indicate that the corresponding reads overlap. Each contig is represented as a simple path—that is, a path through the graph that contains each node at most once. An assembler following this paradigm must first build the graph by computing all possible alignments between the reads. A second stage cleans up the graph by removing transitive edges and resolving ambiguities. The output of this stage comprises a set of nonintersecting simple paths in the refined graph, each such path corresponding to a contig. A final step generates a consensus sequence for each contig by constructing the multiple alignment of the reads that is consistent with the chosen path. Full information about each read in the input is only necessary for the overlap and consensus stages. The graph-refinement stage stores only a limited amount of information about each overlap, such as its coordinates and length, which allows a memory-efficient implementation. Not surprisingly, recent WGSS assemblers use this approach. The overlap-layout-consensus technique has the additional value of encoding other relationships between reads, such as clone-mate information, which an assembler can use in correctly assembling repetitive areas.

17.1.7 Eulerian path
The second graph-theoretical approach to shotgun sequence assembly uses a sequencing-by-hybridization (SBH) technique. The idea is to create a virtual SBH problem by breaking the reads into overlapping n-mers, where an n-mer is a substring of length n from the original sequence. Next, the assembler builds a directed de Bruijn graph in which each edge corresponds to an n-mer from one of the original sequence reads. The source and destination nodes correspond, respectively, to the (n-1)-base prefix and (n-1)-base suffix of the corresponding n-mer. For example, an edge connecting the nodes ACTTA and CTTAG represents the 6-mer ACTTAG. Under this formulation, the problem of reconstructing the original DNA molecule corresponds to finding a path that uses all the edges—that is, an Eulerian path. In theory, the Eulerian path approach is computationally far more efficient than overlap-layout-consensus, because the assembler can find Eulerian paths in linear time, while the problems associated with the overlap-layout-consensus paradigm are NP-complete. Despite this dramatic theoretical difference, the actual performance of existing algorithms indicates that overlap-layout-consensus is just as fast as the SBH-based approach.

17.1.8 Handling Repeats
If genomic data included no repeats, an assembler could use any assembly algorithm to put all the pieces together correctly, even in the presence of sequencing errors. The repeats found in real genomes can, however, prohibit correct automated assembly, at least solely from information contained in the original reads.
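Returning to the Eulerian-path formulation of section 17.1.7, building the de Bruijn graph is simple enough to sketch directly. The n-mer size and read below are made-up illustrations:

    # Build a de Bruijn graph: each n-mer in a read becomes an edge from its
    # (n-1)-base prefix to its (n-1)-base suffix. An Eulerian path through
    # the edges spells out a reconstruction of the original sequence.
    from collections import defaultdict

    def de_bruijn(reads, n):
        graph = defaultdict(list)   # (n-1)-mer -> list of successor (n-1)-mers
        for read in reads:
            for i in range(len(read) - n + 1):
                nmer = read[i:i + n]
                graph[nmer[:-1]].append(nmer[1:])
        return graph

    # The text's example: the 6-mer ACTTAG becomes an edge ACTTA -> CTTAG.
    g = de_bruijn(["GACTTAGC"], 6)
    for src, dsts in g.items():
        for dst in dsts:
            print(src, "->", dst)

Repeats are precisely what tangle such graphs in practice, as the following example illustrates.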
A large tandem repeat found in the bacterium Streptococcus pneumoniae, for example, consists of a 24-bp unit repeated in identical and nearly identical copies, in tandem, for a stretch covering approximately 14,000 bp. Given that individual reads average 600 bp in length and that all reads obtained from this region are essentially identical, no assembler can determine a unique tiling across this repeat. The resulting assembly is likely to contain a reconstruction of only a 600-bp section of the repeat, in which all the reads have collapsed on top of one another—much like the rpt1A/rpt1B misassembly described earlier. In this particular case, clone mates also fail to resolve the problem, because the largest clone insert for this project covered only 10 kbp. This example highlights several issues that assembly programs must address. First, they must identify repeats, preferably during the assembly process, to avoid mistakes caused by overcollapsing repeat copies. Detecting such misassemblies is much more difficult after assembly is completed, and the misassemblies can lead to incorrect genome reconstructions. Second, assemblers must attempt to correctly assemble as many repeats as possible, to reduce the amount of human labor involved in completing the genome. For short repeats, this step can be as simple as using anchored reads, meaning those having mates in the unique areas surrounding the repeat. For more complex repeats, the assembler must be able to use additional information obtained through laboratory experiments.

Detecting repeats
A simple solution to the repeat-detection problem identifies the pileup caused by a misassembly. Because the reads come from a random sampling of the genomic DNA, typically with 8X coverage of the genome, areas covered by a significantly larger number of reads indicate an overcollapsed repeat. Most assemblers use variations on this simple idea. Although the idea is useful, it assumes that the reads are sampled uniformly at random from the genome. In reality, certain areas tend to be poorly represented or absent from the sample—for example, when the insert is toxic to the laboratory organism used to amplify it. In addition, low-copy repeats, which appear only two to three times in a genome, may escape detection because they do not appear to be statistically oversampled. While statistical methods can provide a rough filter, assembly programs must use other techniques to accurately separate out the repeats. As an example, the recently developed assembly program Euler detects repeats by finding complex areas, or tangles, in the graph constructed during assembly. Researchers can use the information contained in the tangle to guide experiments to resolve the repeat. Assemblers that simply mask out repeats—another common strategy—lose this information and must obtain it by other means. Because the cloning process generates reads in pairs from opposite ends of clone inserts, assemblers can use information about clone mates to help detect areas that have been incorrectly assembled due to repeats. Such areas usually contain many instances of clone mates that were assembled either too close together or too far apart, or whose relative orientation is incorrect. This information must be used with care, however, since clone length estimates are usually imprecise, especially for larger clones.
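The read-depth heuristic described under "Detecting repeats" can be sketched in a few lines. The window size and threshold factor below are illustrative assumptions, not values from the text:

    # Flag windows whose read depth is suspiciously high relative to the
    # expected coverage -- a crude signal of an overcollapsed repeat.

    def flag_repeat_windows(depths, expected, factor=2.0, window=100):
        """depths: per-base read depth along a contig. Returns start
        positions of windows whose mean depth exceeds factor * expected."""
        flagged = []
        for start in range(0, len(depths) - window + 1, window):
            chunk = depths[start:start + window]
            if sum(chunk) / window > factor * expected:
                flagged.append(start)
        return flagged

    # Toy contig at 8X expected coverage, with a collapsed three-copy
    # repeat spanning positions 200-299.
    depths = [8] * 200 + [24] * 100 + [8] * 200
    print(flag_repeat_windows(depths, expected=8))   # [200]

Deciding what counts as "too close" or "too far" for clone mates is itself a subtle question, as discussed next.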
The difficult problem here is finding outliers in a data set whose distribution is unknown. The most reliable information comes from the relative orientation of the sequencing reads, which can nearly always be tracked correctly. When repeats are widely separated in the genome, clone-pairing data can resolve them effectively for reads whose mates are anchored in the neighboring nonrepetitive areas. Although some repeat copies are identical, it is more common to find some differences between them. These differences sometimes provide enough information for the assembler to distinguish the copies from one another. In the absence of sequencing errors, a single-nucleotide difference between two copies of a repeat is enough to distinguish them. Researchers have developed several techniques to correct sequencing errors during repeat resolution. All current techniques are based on finding statistically significant clusters of reads, where the clusters are defined by shared differences among the reads. This approach assumes that sequencing errors are independent and, therefore, that an identical difference appearing at the same position in multiple reads is likely to be a real difference typical of that copy of the repeat. For example, if four reads contain an A in position 200 and four other reads contain a G in that position, the assembler can infer with high confidence that the first four reads come from one copy of the repeat, while the second four represent a different copy. One drawback of this approach is the need for relatively deep coverage to detect true differences between repeat copies: if a repeat region is difficult to clone—a common phenomenon—the coverage of that repeat will be low. Moreover, true polymorphisms—such as those between different copies of nearly identical chromosomes (each human chromosome, for example, occurs in two copies) or from nonclonal source DNA—further complicate the problem.

Unresolved repeats
Even using all these information sources, an assembler cannot resolve every repeat, and humans must intervene to finish some complex areas. The basic technique for this task is to separate out the reads coming from distinct repeat copies. In directed sequencing experiments, researchers amplify stretches of DNA anchored in unique areas around the repeat. By considering each copy of the repeat in isolation from the others, an assembly program can then put the genome together while holding these repeat contigs intact. Assembling a mixture of contigs and reads, while guaranteeing that the contigs will not break up in the process, is known as a jumpstart assembly. Only TIGR Assembler currently supports this capability.

Scaffolding Software
The scaffolding process groups contigs together into subsets with a known order and orientation. Researchers generally infer relationships between contigs from clone-mate information, and most recent assemblers include a scaffolding step. Moreover, the Human Genome Project BAC collections were ordered and oriented through scaffolding. To reformulate the scaffolding problem in graph-theoretic terms, we can construct a graph in which the nodes correspond to contigs, and a directed edge links two nodes when mate pairs bridge the gap between them. In this case, each pair of reads implies a particular orientation and spacing of the contigs needed to form a correct pair.
The scaffolding program must now solve three problems:
• Find all connected components in the defined graph.
• Find a consistent orientation for all nodes in the graph, where the edges are of two types: those requiring the two nodes to have the same orientation and those forcing the two nodes to have different orientations. We call the latter reversal edges. A consistent orientation of all the nodes is possible only if every undirected cycle contains an even number of reversal edges. Because errors in the pairing data or misassemblies can invalidate this condition, we must solve an optimization problem: find the smallest number of edges that must be removed so that no cycle has an odd number of reversal edges. This optimization problem is NP-complete.
• Given the length estimates of the edges, embed the graph on a line or—for some bacterial and archaeal genomes—on a circle, such that the fewest constraints are invalidated. This problem is a special case of the optimal linear arrangement problem, which is also NP-complete.
While the last two problems are difficult from a theoretical standpoint, simple heuristics can easily handle the instances encountered in practice. Moreover, in practice we can relax the optimality criteria. During the finishing phase, for example, a linear embedding of the contigs is not necessary. In fact, ambiguities in the graph can highlight possible misassemblies, and finishing teams can use this information in designing experiments to confirm a particular embedding of the graph. The complexity of scaffolding stems specifically from the presence of errors in the data. Again, simple heuristics can reduce the effect of such errors. For example, we can reduce errors caused by data-tracking problems or misassemblies by requiring at least two sources of linking information between contigs, or by ignoring links anchored in repeat areas. For any but the smallest genomes, it is unlikely that a single scaffold will hold all the contigs, so additional information is needed to order and orient the scaffolds themselves. Two common sources of such information are physical maps and comparisons to related organisms. Physical mapping encompasses a variety of laboratory techniques for characterizing a set of markers along a DNA strand. Markers include known genes and short, unique sequences of a few hundred nucleotides, called tags, that researchers have fluorescently labeled and mapped to an approximate point on a chromosome. Determining the layout of these markers before sequencing provides an independent information source for scaffolding software. Using contigs created by an assembler, researchers can simulate the mapping experiment computationally by searching for the tag locations in the contig sequence. The comparison between this electronic map and the physical map provides ordering information that the scaffolding program can use. The sequence of a closely related organism is another source of scaffolding information. For example, by aligning the scaffolds from a preliminary assembly of the mouse genome to the human genome, we can obtain the likely order of the mouse contigs. Of course, this information will be incorrect where major genome rearrangements have occurred in the evolutionary divergence of the two species. This technique therefore works best with a very closely related genome that has been sequenced to completion.
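Returning to the orientation subproblem listed above: in the error-free case it has a clean sketch—propagate orientations across the contig graph, flipping at reversal edges, and report a conflict when some cycle has an odd number of reversal edges. The graph encoding below is an illustrative assumption:

    # Assign consistent orientations to contigs linked by mate-pair edges.
    # Each edge is (u, v, reversal): reversal=True means u and v must have
    # opposite orientations. Returning None means some cycle has an odd
    # number of reversal edges, so no consistent assignment exists.
    from collections import deque

    def orient_contigs(nodes, edges):
        adj = {n: [] for n in nodes}
        for u, v, rev in edges:
            adj[u].append((v, rev))
            adj[v].append((u, rev))
        orient = {}
        for start in nodes:                  # handle each connected component
            if start in orient:
                continue
            orient[start] = +1
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v, rev in adj[u]:
                    want = -orient[u] if rev else orient[u]
                    if v not in orient:
                        orient[v] = want
                        queue.append(v)
                    elif orient[v] != want:
                        return None          # odd cycle of reversal edges
        return orient

    edges = [("A", "B", False), ("B", "C", True)]
    print(orient_contigs(["A", "B", "C"], edges))  # {'A': 1, 'B': 1, 'C': -1}

Real scaffolders must instead remove a minimum set of conflicting edges, which is where the NP-completeness noted above enters.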
The sources of linking information used to construct scaffolds vary in quality. In particular, the error in determining the length of inserts, and thus the distance between clone mates, increases with the insert size. Physical map data is inherently error prone. Finally, large-scale genome rearrangements can affect the homology data. TIGR has developed Bambus, a scaffolder that factors the confidence in the linking information into hierarchically constructed scaffolds. The algorithm first builds a set of scaffolds based on the highest-confidence links, then incorporates the lower-confidence information to combine the scaffolds into larger structures. This hierarchical method reduces the effect of incorrect linking data while still using all the information sources.

17.1.9 Assessing Assembly Quality
Correcting misassemblies is expensive, especially if they go undetected until the late stages of a sequencing project. Assemblers highlight problematic areas by outputting the confidence level of each base in the consensus. Because this simple quality-control method is an inherently local measure, it fails to capture larger scale phenomena, such as whole DNA sections that are incorrectly spliced together. The assembly pipeline must therefore contain a validation module that uses additional information to determine contig quality. Finding errors in assemblies is easy when the complete sequence is already known, and we can use known benchmark data sets to fine-tune assembly software. These data sets, either artificially generated or representing real sequencing reads from completed projects, provide both the correct consensus sequence and the exact location of all reads in the true DNA sequence. With such benchmarks, we can detect both local errors in the consensus base calls and large-scale rearrangements, such as reversals and insertions, in the genome assembly. Some of these ideas carry over to the real challenge in practice: finding assembly errors when the true layout is unknown. For example, physical maps provide markers that we can use to validate large contigs. Similarly, we can use the sequence of a closely related organism to confirm areas that we do not expect to have significantly diverged. In the absence of any other types of information, clone mates have been used to detect assembly errors: areas of the genome that violate the orientation and distance constraints imposed by the clone mates indicate potential misassemblies. Most reported measures of assembly quality are aggregate measures, such as the number and sizes of contigs. They assume that an assembly consisting of a few large contigs is better than one composed of many small contigs. This assumption is partly true, in that the number of contigs indicates the number of gaps, which in turn correlates with the amount of work needed to finish the genome. Aggregate size measures do not, however, account for the possibility of misassemblies, and they are therefore only marginally useful; if anything, an assembler can generate large contigs at the expense of misassemblies. To demonstrate these concepts, we performed a series of tests on the genome of Wolbachia, an endosymbiotic bacterium found in Drosophila (the fruit fly) and other insects.
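Before turning to those tests, the clone-mate validation just described reduces to checking orientation and separation constraints against assembled read positions. A minimal sketch follows; the tolerance and data layout are illustrative assumptions:

    # Check assembled mate pairs against expected insert size and
    # orientation. Each pair: (pos1, strand1, pos2, strand2), positions in
    # one contig and strands '+'/'-'. A correctly assembled pair faces
    # inward and is separated by roughly the library insert size.

    def validate_mates(pairs, insert_size, tolerance=0.2):
        violations = []
        lo = insert_size * (1 - tolerance)
        hi = insert_size * (1 + tolerance)
        for i, (p1, s1, p2, s2) in enumerate(pairs):
            (left, ls), (right, rs) = sorted([(p1, s1), (p2, s2)])
            span = right - left
            if (ls, rs) != ('+', '-'):          # mates must face each other
                violations.append((i, 'orientation'))
            elif not (lo <= span <= hi):        # spacing far from insert size
                violations.append((i, 'distance'))
        return violations

    pairs = [(100, '+', 3050, '-'),   # fine for a 3-kbp library
             (100, '+', 9000, '-'),   # too far apart: possible misassembly
             (100, '+', 3050, '+')]   # wrong orientation
    print(validate_mates(pairs, insert_size=3000))
    # [(1, 'distance'), (2, 'orientation')]

The Wolbachia comparison below shows what such checks reveal in practice.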
We recently completed this genome at TIGR, so we had a "true" DNA sequence to which we could compare assembly results. We assembled the genome from the original shotgun reads using Phrap, TIGR Assembler, and Celera Assembler, running all three with their default settings, and verified the assemblies by aligning the resulting contigs to the finished sequence. According to the number and length of contigs, Phrap appears to produce the best output, followed closely by TIGR Assembler. The Phrap assembly contains about one-fourth as many contigs as Celera Assembler's, and its contigs are about four times larger on average. In addition, the total size of these contigs (1.26 Mbp) matches the actual size of the Wolbachia genome (1.26 Mbp). In contrast, if we look at the proportion of the sequence covered by correct assemblies, the Celera Assembler's output spans more than 99 percent of all bases, while the TIGR Assembler contigs cover just over 93 percent, and Phrap covers barely 36 percent. These results lead to a couple of conclusions. First, Phrap and TIGR Assembler appear to have misassembled some repeats, which explains the lack of coverage. At the same time, the small size of the Celera Assembler's contigs, combined with their large total size, leads us to believe that it failed to combine many contigs that should have been assembled together. Closer examination—and similar experience with many other genomes—indicates that this usually results from poor-quality data at the ends of sequences. To get Celera Assembler to combine more contigs, we performed an additional step of more aggressively trimming poor-quality data from the ends of the input sequences. The results of this additional run indicate that the technique closed a number of gaps, yielding larger contigs overall. At the same time, the coverage of the genome decreased, indicating a potential drawback of the technique. These observations correlate with our understanding of the assembly algorithms used by the three programs. TIGR Assembler and Phrap are more tolerant of incorrect data at the sequence ends, which allows them to create bigger contigs. At the same time, their handling of clone-mate information is less sophisticated than Celera Assembler's. In particular, TIGR Assembler uses a greedy approach that lets it walk through a repeat, occasionally violating clone-link constraints. The Phrap program simply does not take these constraints into consideration, leading to the overcollapse of repeat regions. Closer analysis verified the hypothesis that all the observed misassemblies were correlated with repeats in the Wolbachia genome. The ultimate goal of genome sequencing is the complete DNA sequence of an organism, and a good assembler can greatly aid the human effort involved in the finishing phase. An assembler designed for finishing should use multiple sources of information, such as data from finishing experiments in the laboratory; it should also let human experts place restrictions on the assembly, such as regions that need to be held together or repeats that should be kept separate. Better quality-control tools are essential, and defining quality measures that make it possible to evaluate assembly algorithms is a first step toward their improvement.
This issue is particularly critical for incomplete genomes, such as the various and constantly changing versions of the draft human genome sequence. Assemblers that can report "weak" areas in the assembly and highlight potential misassembly sites are essential not only for the subsequent analysis of assembly data but also for guiding the efforts of finishing experts. Moreover, well-defined objective quality measures will provide an additional level of validation even in the case of completely finished genomes.

17.2 Let us sum up
Although the assembly of bacterial genomes has become a routine task at major sequencing centers, the assembly problem is far from solved. Many new challenges are uncovered as scientists tackle diverse new organisms, and new sequencing technologies will change the assumptions currently made about the characteristics of the data being assembled. Current sequencing technologies only allow us to "read" up to 1,000-2,000 bases of DNA at a time. To overcome this limitation, sequencing of entire organisms is performed through a process called shotgun sequencing, wherein the DNA is sheared into smaller fragments whose ends are then sequenced. The reconstruction of the original DNA sequence is handled by specialized computer programs called assemblers. The output of assembly programs consists of a collection of contiguous pieces (contigs)—rarely are entire chromosomes reconstructed in a single piece. An additional computer program, the scaffolder, uses the information linking sequencing reads from the ends of fragments to order and orient the contigs with respect to each other along a chromosome.

17.3 Lesson end activities
1. Find out details about
· Automatic finishing techniques
· Automatic sequencing error correction
· Handling of polymorphic data
· Repeat resolution
· Representation of assembly data in public databases

17.4 Check your progress: Model answers
1. Your answer may include these points: Metagenomics is a new field of research in which scientists analyze the genomes of organisms recovered directly from the environment. Most naturally occurring bacteria cannot be cultured and therefore cannot be analyzed by traditional means. Metagenomic studies, however, overcome this limitation, provide a mechanism for analyzing previously unknown organisms, and have a wide range of applications, from environmental studies to human health.

17.5 Points for Discussion
1. How do you rate genome sequence assembly's contribution to Bioinformatics?

17.6 References
1. Schatz, M.C., Phillippy, A.M., Shneiderman, B., Salzberg, S.L. Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34. 2007.
2. M. Roberts, B.R. Hunt, J.A. Yorke, R.A. Bolanos and A.L. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology 11(4):734-752. 2004.
3. M. Roberts, W. Hayes, B.R. Hunt, S.M. Mount and J.A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18):3363-3369. 2004.
4. M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics 5(3):237-248. 2004.
5. M. Pop. Shotgun sequence assembly. Advances in Computers vol. 60, M. Zelkowitz ed. June 2004.
6. M. Pop, D. Kosack. Using the TIGR Assembler in shotgun-sequencing projects. In Bacterial Artificial Chromosomes vol. 1, S. Zhao and M.
Stodolsky, eds. Humana Press, pp. 279-294, March 2004.
7. M. Pop, D.S. Kosack, S.L. Salzberg. Hierarchical scaffolding with Bambus. Genome Research 14(1):149-159. 2004.
8. P. Gajer, M. Schatz, S.L. Salzberg. Automated correction of genome sequence errors. Nucleic Acids Research 32(2):562-569. 2004.
9. M. Pop, S.L. Salzberg, M. Shumway. Genome Sequence Assembly: Algorithms and Issues. IEEE Computer 35(7):47-54. 2002. Copyright 2002 IEEE. Reproduced with permission from IEEE.

LESSON – 18 GENE PREDICTION PROGRAMS
18.0 Aims and objectives
18.1 Gene prediction programs
18.1.1 Introduction
18.1.2 GLIMMER (Gene Locator and Interpolated Markov Modeler)
18.1.3 Interpolated Markov models (IMMs)
18.1.4 The GLIMMER system
18.1.5 Fungal gene predictions by GLIMMER, GLIMMERM and GenScan
18.2 Let us Sum up
18.3 Lesson end activities
18.4 Check your progress
18.5 Points for Discussion
18.6 References

18.0 Aim and Objective: This unit describes gene prediction programs: an introduction, GLIMMER (Gene Locator and Interpolated Markov Modeler), interpolated Markov models (IMMs), the GLIMMER system, and fungal gene predictions by GLIMMER, GLIMMERM, and GenScan.

18.1 Gene prediction programs
18.1.1 Introduction
A major goal of genome projects is to identify all genes in a given organism. Consequently, the development of automated gene-finding procedures has become one of the most active areas of research in Bioinformatics. Protein-coding DNA sequences exhibit characteristics that distinguish them from non-coding sequences. For prokaryotic organisms, the task of gene identification is relatively easy, as prokaryotic genomes are rather small and genes are not interrupted by introns. Here, all open reading frames (ORFs) exceeding some threshold length are likely to code for proteins. The gene-finding problem is much more complicated for eukaryotic organisms, where the density of genes in the genome is about two orders of magnitude lower than in bacterial genomes and genes typically consist of multiple exons separated by introns of varying length. The commonly used approach for gene prediction is to train computer programs to recognize sequences that are characteristic of known exons in genomic DNA sequences. The patterns used to predict genes include intron-exon boundaries and upstream promoter sequences. However, in eukaryotes these signals are poorly defined and therefore cannot be found by the simple pattern-matching techniques used for prokaryotes. During the past few years, various prediction methods have been developed to identify genes in eukaryotic genome sequences. Recent studies show, however, that the reliability of these methods is limited for large genomic sequences, as they cannot locate all possible exons encoded in the sequence. Moreover, many gene-prediction programs were originally tested on genomic sequences of only a few kilobases (kb) in length, where each sequence contained only a single gene. The performance of standard gene-prediction methods drops significantly when they are tested under more realistic conditions, usually involving multiple genes. Practically all existing gene-prediction programs rely on information derived from known genes.
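Before surveying those methods, the prokaryotic baseline mentioned above—report every sufficiently long ORF—is simple enough to sketch. The following is a toy illustration (the length threshold is an assumption, and real gene finders such as GLIMMER score ORFs with trained models rather than accepting them by length alone):

    # Naive prokaryotic gene calling: report every ORF (ATG ... in-frame
    # stop) longer than a threshold, scanning the three forward frames.

    STOPS = {"TAA", "TAG", "TGA"}

    def find_orfs(seq, min_len=300):
        seq = seq.upper()
        orfs = []
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    if i + 3 - start >= min_len:
                        orfs.append((start, i + 3))
                    start = None
        return orfs

    # Toy example with a low threshold so the single short ORF is reported.
    print(find_orfs("CCATGAAATTTGGGTAACC", min_len=15))  # [(2, 17)]

Eukaryotic gene finders must go far beyond this, and the methods below differ mainly in how they do so.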
Major differences between existing methods lie in how they assess whether a stretch of genomic DNA looks like known genes. Two approaches are used. Ab-initio (intrinsic) methods use content statistics such as ORF length or codon usage, together with sequence signals like splice junctions, to distinguish coding from non-coding regions. GLIMMER, GRAIL, GENEID, GenScan, and GeneMark are among the most popular ab-initio programs. By contrast, extrinsic methods work by comparing the genomic sequence to known ESTs or proteins in databases, checking whether a piece of the genomic sequence is similar to any known genes or proteins. This idea has been implemented in GENEWISE and PROCRUSTES. Neither ab-initio nor extrinsic methods can perfectly elucidate the complex and variable genomic structure of higher eukaryotic organisms. Their genes contain a large number of small exons separated by long intervening sequences (introns). Furthermore, in actual genomes some non-coding sequences can exhibit features of typical coding sequences (e.g., pseudogenes) and vice versa. Moreover, a large fraction of higher eukaryotic coding exons are very short and cannot be effectively detected by commonly used gene-prediction programs. The following sections describe three gene-prediction methods—GLIMMER, GLIMMERM, and GenScan—used in this study for detecting mainly small exons in the Neurospora crassa genome. They were chosen from among several others based on their likely ability to detect small coding regions and on their sensitivity and specificity values when used on other genomes.

18.1.2 GLIMMER (Gene Locator and Interpolated Markov Modeler)
GLIMMER is a computational gene finder that was initially developed to predict genes in prokaryotic genomes. Gene finders for prokaryotes have an advantage in that the genomes tend to be gene-rich, containing about 90% coding sequence. One major problem is to correctly identify the genes when two or more open reading frames (ORFs) overlap. GLIMMER uses a technique called the interpolated Markov model (IMM), a generalization of Markov chain methods, to identify coding regions in microbial sequences. GLIMMER 1.0 was used as the gene finder for several bacterial genomes (Borrelia burgdorferi, Treponema pallidum, Chlamydia trachomatis, and Thermotoga maritima). GLIMMER 2.0 makes several technical improvements to the GLIMMER 1.0 algorithm and works better at resolving overlapping ORFs. GLIMMER uses an approach based on the frequency of occurrence of nucleotides in DNA to determine the relative weights of oligomers of lengths from 1 to 9 bp. First, IMMs are created for the six reading frames (three frames for each of the two strands, forward and reverse), and these are then used to score entire ORFs. When there is an overlap between two high-scoring ORFs, the overlapped region is scored separately to determine the more likely gene.

18.1.3 Interpolated Markov models (IMMs)
A Markov chain is a sequence of random variables Xi (where i is the position in the sequence) in which the probability distribution of each variable depends only on the preceding k variables Xi-1, ..., Xi-k, for some constant k. In the case of DNA sequences, each random variable Xi takes a value from the set of four nucleotide bases (a, c, g, and t). Depending on the order of the Markov chain used, the constant k takes values from 0 to 8.
For example, a fixed first-order Markov chain is specified completely by a matrix of 16 conditional probabilities: p(a|a), p(a|c), p(a|g), ..., p(t|t), where each term represents the probability of the current base given the previous base. A second-order Markov chain predicts a base by looking at the two previous bases. In general, a kth-order Markov model requires 4^(k+1) probabilities for each reading frame. In a 0th-order model, the matrix contains only the individual probabilities of the four nucleotides (a, c, g, and t). In a first-order model, the 16 dinucleotide probabilities (aa, ac, ag, at, ..., tg, tt) are calculated by looking at the previous base. A second-order model gives the probabilities of the 64 trinucleotides (aaa, aag, aac, aat, ..., ttg, ttt). In principle, using longer oligomers is always preferable to using shorter ones, but only if sufficient data is available to produce reliable probability estimates. Currently most gene finders use fixed 5th-order Markov chains (that is, hexamer nucleotide or di-codon frequencies), as these have proven effective for gene prediction. IMMs are a generalization of fixed-order Markov chains. The main difference between IMMs and fixed Markov models is that IMMs use a varying number of context bases for each prediction rather than deciding in advance how many bases to consider. This allows IMMs to adapt to the frequencies of particular oligomers in a genome. For example, if some 5-mers (oligomers of five bases) occur too infrequently, their probabilities cannot be estimated reliably and they will not be used in the model. On the other hand, if some 8-mers occur sufficiently frequently, the IMM uses this longer context to make better predictions. The model thus exploits whatever context length the data can support. From the training data sets, GLIMMER computes the probability of each nucleotide base (a, c, g, or t) following every k-mer (0 ≤ k ≤ 8). For each k-mer, weights are computed for use in the different models. These weights and Markov models are interpolated to produce a score for each base in any potential coding region, and the logs of the scores are summed to score the coding region. The (log-)probability that the model M generates the sequence S is computed as

log P(S|M) = Σ (x = 1 to n) log IMM_8(S_x)

where S_x is the oligomer ending at position x and n is the length of the sequence. IMM_8(S_x) is the 8th-order interpolated Markov model score, computed recursively as

IMM_k(S_x) = λ_k(S_{x-1}) · P_k(S_x) + [1 − λ_k(S_{x-1})] · IMM_{k-1}(S_x)

where λ_k(S_{x-1}) is the numeric weight associated with the k-mer ending at position x−1 in the sequence S, and P_k(S_x) is the estimate, obtained from the training data, of the probability of the base at position x given its k preceding bases:

P_k(S_x) = P(s_x | S_{x,k}) = f(S_{x,k} · s_x) / Σ (b ∈ {a,c,g,t}) f(S_{x,k} · b)

where f(·) denotes the number of occurrences of a string in the training data and S_{x,k} is the k-base context preceding position x. GLIMMER uses two criteria to determine λ_k(S_x). The first criterion is simply frequency of occurrence: the current default threshold is 400 occurrences, which gives 95% confidence that the sample probabilities are within 5% of the true probabilities from which the sample was taken. When there are insufficient sample occurrences of a context string (oligomer), an additional criterion is employed to assign a λ value.
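The interpolation recursion above can be sketched directly. In this Python illustration, the probability tables and weights are toy stand-ins for values that GLIMMER would estimate from training data:

    # Interpolated Markov model score for one position, following the
    # recursion IMM_k = lambda_k * P_k + (1 - lambda_k) * IMM_{k-1}.
    # prob[k] maps a k-base context to {base: probability}; weight[k] maps
    # a k-base context to its lambda. All values here are illustrative.

    def imm_score(seq, x, k, prob, weight):
        """Probability of base seq[x] using contexts of order k down to 0."""
        if k == 0:
            return prob[0][""].get(seq[x], 0.25)
        context = seq[x - k:x]
        lam = weight[k].get(context, 0.0)     # unseen context: fall back
        p_k = prob[k].get(context, {}).get(seq[x], 0.25)
        return lam * p_k + (1 - lam) * imm_score(seq, x, k - 1, prob, weight)

    # Toy model of orders 0-2 for scoring the base after the context "ac".
    prob = {0: {"": {"a": 0.3, "c": 0.2, "g": 0.3, "t": 0.2}},
            1: {"c": {"g": 0.5, "a": 0.2, "c": 0.2, "t": 0.1}},
            2: {"ac": {"g": 0.7, "a": 0.1, "c": 0.1, "t": 0.1}}}
    weight = {1: {"c": 0.8}, 2: {"ac": 0.6}}

    # Score 'g' at position 2 of "acg" with a 2nd-order interpolated model.
    print(imm_score("acg", 2, 2, prob, weight))
    # 0.6*0.7 + 0.4*(0.8*0.5 + 0.2*0.3) = 0.604

The second criterion for assigning λ, described next, compares the observed context frequencies with the next-shorter model.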
For a given context string S_{x,i} of length i, the observed frequencies of the next base, f(S_{x,i}, a), f(S_{x,i}, c), f(S_{x,i}, g), and f(S_{x,i}, t), are compared with the previously calculated IMM probabilities using the next-shorter context, IMM_{i-1}(S_{x,i-1}, a), IMM_{i-1}(S_{x,i-1}, c), IMM_{i-1}(S_{x,i-1}, g), and IMM_{i-1}(S_{x,i-1}, t). The two sets of values are compared using a χ² test. If the values differ significantly, the observed values are used; if they are consistent with the IMM values, a lower weight is given, since the longer context then offers little additional predictive value. The value of λ_k(S_x) associated with P_k(S_x) can thus be regarded as a measure of our confidence in estimating the true probability. The number of parameters to estimate grows exponentially with the order of the model, and the higher the order, the less reliable the parameter estimates can become.

18.1.4 The GLIMMER system
The GLIMMER system consists of two programs: build-imm and glimmer (or glimmer2). The program build-imm takes an input set of sequences, which can be complete genes or partial ORFs, and builds and outputs the interpolated Markov model. The second program, glimmer, uses this IMM to identify genes in a genomic sequence. GLIMMER does not use sliding windows to score coding regions. Instead, it identifies all ORFs longer than a threshold value and scores them in the six possible reading frames. An ORF is assumed to contain no stop codon between its start codon and its terminal stop codon. Glimmer selects the frame that scores highest for further examination of overlaps. If there is an overlap between reading frames, it selects the overlapped region and scores it separately. Overlapping ORFs are resolved based on their lengths and the separate scores computed for their overlapped regions. Suppose that A and B are two ORFs that overlap. If the overlap scores higher in A's reading frame and A is longer than B, we reject B. If the overlap scores higher in B's reading frame and B is longer than A, we reject A. Otherwise, both A and B are marked as "suspect". GLIMMER 2.0 has resolved some of the prediction problems of GLIMMER 1.0, which occasionally discarded a gene because the placement of its start codon in the 5' direction created an overlap with another gene. GLIMMER 2.0 resolves overlap problems by incorporating extra rules. The scoring of potential overlapping genes is similar to that of GLIMMER 1.0, but the system attempts to move the locations of the start codons much more aggressively. When ORFs A and B overlap, there are four different orientations to consider, and the process of evaluating overlaps is performed iteratively to prevent unnecessary rejection of genes. The current version also helps to find genes that were missed earlier because of the high probability threshold score. GLIMMER is the primary microbial gene finder at The Institute for Genomic Research (TIGR) and has been used to annotate the complete genomes of Borrelia burgdorferi, Treponema pallidum, Thermotoga maritima, Deinococcus radiodurans, and Mycobacterium tuberculosis, as well as non-TIGR projects including Chlamydia trachomatis and Chlamydophila pneumoniae. The accuracy of its gene predictions for bacterial and archaebacterial genomes (including Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Bacillus subtilis, and Mycoplasma genitalium) has been close to 98%. It has not generally been used for eukaryotic genomes.
Even though GLIMMER was developed for prokaryotic genomes, it can still be useful for finding eukaryotic genes. Many small eukaryotes, for example, have relatively high gene density and contain short genes uninterrupted by introns, and such genes are often missed by commonly used prediction programs. The GLIMMER system, including the source code, was downloaded from The Institute for Genomic Research website (http://www.tigr.org). Example output of the program is listed in Appendix D.

The GLIMMERM system
The GLIMMERM algorithm uses the same IMM scoring method used in GLIMMER 2.0 and was developed specifically for eukaryotes having a gene density of less than 20%. The splice-site predictor algorithm in GLIMMERM captures dependencies among neighboring bases in a small window around each splice junction (16 and 29 bp around the 5' donor and 3' acceptor sites, respectively). The algorithm takes advantage of the fact that coding and non-coding sequence switch at the splice junction and detects this switch with two second-order Markov chains, one modeling coding sequence and the other non-coding sequence. The length of each of these coding or non-coding context windows is currently fixed at 80 bp. Potential coding regions are evaluated by a scoring function based on decision trees that estimate the probability that a DNA subsequence is coding. Subsequences are evaluated according to their putative type: intron, initial exon, internal exon, final exon, or single-exon gene. Each such subsequence is run through ten different decision trees built with the OC1 induction system, which can take multiple numeric feature values. The probabilities obtained from the decision trees are averaged to produce a smoothed estimate of the probability that the given subsequence is of a certain type. A putative gene model is then accepted only if the IMM score for the coding sequence in the correct reading frame exceeds a fixed threshold. The main assumptions of this program are:
• the coding region of every gene begins with the start codon ATG,
• the gene has no in-frame stop codons except the very last codon, and
• each exon is in a consistent reading frame with the previous exon.
These constraints significantly enhance the efficiency of computing the optimal gene models by restricting the search space of the dynamic programming algorithm. The dynamic programming algorithm processes sequences from left to right, searching for stop codons. At each stop codon, it searches back in the 5' direction (right to left), finding all possible genes using this stop codon, and chooses the highest-scoring gene. The only positions considered as possible intron donor and acceptor sites are those that score above the threshold determined by the Markov chains. The algorithm is run separately on the direct and complementary strands of the input. GLIMMERM rejects overlapping genes by going through the list of putative genes. An overlap occurs when two models share a common stop codon but have different exon locations. If two genes overlap by less than 30 bp (the default value), the overlap is ignored and both are considered possible genes. If the overlap is more than 30 bp, they are rescored using the IMM and the gene with the best score is retained.
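The overlap rule just described is a small decision procedure. Here is a minimal sketch in which a gene model is reduced to a coordinate interval and a score—a simplification of what GLIMMERM actually stores, with the scores standing in for IMM rescoring:

    # GLIMMERM-style resolution of two overlapping gene models: overlaps of
    # at most 30 bp (the default) are tolerated; longer overlaps keep only
    # the higher-scoring model.

    def resolve_overlap(gene_a, gene_b, max_overlap=30):
        """Each gene is (start, end, score). Returns the surviving genes."""
        a_start, a_end, a_score = gene_a
        b_start, b_end, b_score = gene_b
        overlap = min(a_end, b_end) - max(a_start, b_start)
        if overlap <= max_overlap:        # small overlap: keep both
            return [gene_a, gene_b]
        return [gene_a] if a_score >= b_score else [gene_b]

    print(resolve_overlap((0, 900, 40.0), (880, 1700, 55.0)))  # keeps both
    print(resolve_overlap((0, 900, 40.0), (700, 1700, 55.0)))  # keeps second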
GLIMMERM was used for the genome of a malaria parasite (Plasmodium falciparum) and showed sensitivity and specificity rates for nucleotide-level recognition above 94% and 97%, respectively. GLIMMERM's accuracy of 93% on a plant genome, Arabidopsis thaliana, was comparable to the accuracies of 95% and 94% for GeneMark.hmm and GenScan, respectively.

GenScan
GenScan is a general-purpose gene identification program used to analyze genomic DNA sequences from a variety of organisms including human, other vertebrates, invertebrates, and plants. For each genomic sequence, the program determines the most likely gene structure under a probabilistic model of the gene structural and compositional properties of the genomic DNA for the given organism. The probabilistic model used by GenScan accounts for many of the essential structural properties of genomic sequences—typical gene density, typical number of exons per gene, and the distribution of exon sizes for different types of exons—and also for many of the important compositional properties of genes, such as the reading-frame-specific hexamer composition of coding regions versus the reading-frame-independent hexamer composition of introns and intergenic regions, and the position-specific composition of the translation initiation and termination signals, TATA box, cap site, and polyadenylation signals. Importantly, novel models of the donor and acceptor splice sites are used, which capture potentially important dependencies between positions in these signals. For human and other vertebrate sequences, separate sets of model parameters are used, which account for the many differences in gene density and structure observed in genomic regions of distinct nucleotide composition (G+C%). GenScan has an additional feature that draws a representation of the resulting prediction, showing all putative exons in their respective positions on both strands, whether they are leading, internal, or terminal, and a simplified scoring scheme. GenScan uses a homogeneous 5th-order Markov model of non-coding regions and three periodic 5th-order Markov models of coding regions. The parameters are typically estimated by the maximum likelihood method, that is, by using the observed conditional frequencies obtained from an appropriate training set of known genes to estimate the corresponding conditional probabilities. Nucleotides are generated according to probabilistic rules derived from an underlying hidden Markov process, and the model is parameterized for G+C content. The training set of exons and introns is divided into four categories depending on the G+C content of the sequence: I (< 43% G+C), II (43-51% G+C), III (51-57% G+C), and IV (> 57% G+C). For each of these categories, separate initial state probabilities are computed by estimating the relative frequencies of the various functional units in that category. Unlike programs that analyze one strand at a time and assume the input sequence contains a single complete gene, GenScan uses double-stranded models to allow for the occurrence of multiple genes on either or both DNA strands.
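As an aside, the G+C parameterization above amounts to a simple lookup. A minimal sketch, using the four category boundaries given in the text (how the boundary values 51% and 57% are rounded is our assumption):

    # Choose a GenScan-style parameter category from a sequence's G+C
    # content, using the four ranges given in the text.

    def gc_category(seq):
        seq = seq.upper()
        gc = 100.0 * sum(base in "GC" for base in seq) / len(seq)
        if gc < 43:
            return "I", gc
        elif gc <= 51:
            return "II", gc
        elif gc <= 57:
            return "III", gc
        return "IV", gc

    print(gc_category("ATATGCGCAT"))   # ('I', 40.0)
    print(gc_category("GCGCGCATAT"))   # ('IV', 60.0)

Beyond this parameter choice, the heart of GenScan is the probabilistic model itself.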
The essential idea is that a precise probabilistic model of what a gene/genomic sequence looks like is specified in advance; then, given a sequence, one determines which of the vast number of possible gene structures (involving any valid combination of states and lengths) has the highest likelihood given the sequence. GenScan cannot handle overlapping transcription units and does not address alternative splicing. The program was designed primarily to predict genes in human/vertebrate genomic sequences, so its accuracy may be lower for other organisms. However, the vertebrate version of the program performed fairly well on sequences of an invertebrate (Drosophila melanogaster), with a per-exon accuracy of 68%. The maize and Arabidopsis versions (both plants) also performed fairly well on their respective organisms, with per-exon accuracies of 78% and 67%, respectively. GenScan differs from the majority of existing gene-finding algorithms in that it allows partial genes as well as complete genes, and the occurrence of multiple genes in a single sequence, on either or both DNA strands. For prokaryotic or yeast sequences, the programs GLIMMER and/or GeneMark are better choices than GenScan.

Check your progress:
1. Mention the 2 programs available for the GLIMMER system.
Notes: ii) Write your answer in the space given below. jj) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

18.1.5 Fungal gene predictions by GLIMMER, GLIMMERM, and GenScan
The gene-prediction methods were chosen based on their ability to identify small genes. GLIMMER has been widely used for prokaryotes and is highly accurate for their gene detection; the program allows models to be built on any dataset. GLIMMERM, a modified version of GLIMMER, was used because it was developed mainly for eukaryotes with small genome sizes and can identify short genes in gene-dense genomes. GenScan was chosen for comparison purposes, since it is a widely used prediction program.
• GLIMMER was trained using cDNA datasets of N. crassa, S. cerevisiae, and S. pombe obtained from NCBI. The trained model was then used to extract putative genes from the N. crassa genomic sequences. The programs get-putative, extract, build-icm, and glimmer2 are all part of the GLIMMER package.
USAGE: build-icm < tmp.train > tmp.model
This builds the model using the training dataset in FASTA format (tmp.train) and stores it in tmp.model.
USAGE: glimmer2 Sequence tmp.model -g n | get-putative > g2.coord
Using the trained model (tmp.model) and a genomic sequence (Sequence), glimmer2 predicts all possible gene locations. The most likely gene coordinates are extracted by the program get-putative, included in the GLIMMER package, and stored in g2.coord. The -g option denotes the minimum gene length; 30 bp is used in this study.
USAGE: extract Sequence g2.coord > Nucleotide_Output
Using the stored coordinates (g2.coord), the ORFs are extracted from the genomic sequence (Sequence).
• GLIMMERM was run using pre-trained models of one filamentous fungus species (Aspergillus fumigatus) and two plant species (Arabidopsis thaliana and Oryza sativa) available from the GLIMMERM software package.
USAGE: glimmerm Sequence -d <directory of trained model> > Output
• GenScan was run using pre-trained models of human and two plant species (A. thaliana and maize) available from the downloaded GenScan software package.
USAGE: genscan Parameter_file_of_organism Sequence -cds > Output
The program takes a parameter file of the trained model and a genomic sequence file (Sequence), and outputs the predicted ORFs. The -cds option prints the predicted nucleotide sequences.

18.2 Let us sum up
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

18.3 Lesson end activities
Step 1. Getting the DNA sequence
For the practical we have chosen a human genomic sequence that may perhaps encode an adenylate kinase...
· Open a new navigator window and load this page into it.
· Click here to get the ~12 kb sequence (press the middle mouse button to open a new browser window).
Step 2. Looking for a promoter candidate
Promoter prediction is heavily dependent on finding good matches to the TATAAAA motif. Further clues may be provided by the CCAAT box, CpG islands, and other transcription factor binding sites.
· Open a new navigator window and load this page into it.
· Load the LBL Promoter query submission page.
· It is worth familiarising yourself with the layout; note the on-line help.
· Cut and paste the query sequence into the sequence box and submit the job.
· When the result arrives, look at the set of predicted promoters.
o Can you see matches to the TATA-box consensus tATA(A/T)A(A/T)?
o Which promoter has the silliest TATA box?
· Open a new navigator window and load this page into it.
· Load the TSSG query submission page.
· Toggle on the TSSG button.
· Don't use the sequence pastebox (it can't strip numbers out of the sequence).
· Type (or paste) /home/seqanal/public_html/courses/spring99/seq.fasta in the load box and then perform the search.
· When the result arrives, look at the set of predicted promoters.
o How many TATA boxes were found?
o Are the listed transcription factor binding sites informative?
o How well do the two searches agree on candidate promoters?
o How many candidates do they both find?
o Is there a single best candidate from combining the searches?
Step 3. Poly(A) site prediction
In mammalian genes, polyadenylation sites are usually preceded by AATAAA or ATTAAA ~20 bases before the cleavage site and followed by a more weakly conserved GT-based motif. While these motifs are trivial to find, they only function in the right context—which is harder to define and includes regulation by upstream splicing factors. An important rule to remember is that there must not be an in-frame stop codon in an internal exon, i.e., the true translation termination will be in the last exon. (Violations of this rule suppress mRNA production, to the cost of many experimentalists, and are occasionally used for differential mRNA regulation, e.g. for certain Ig splice variants.)
· (As needed, open a new navigator window and load this page into it.)
· Load the POLYAH query submission page.
· Toggle on the POLYAH button.
· Look at the POLYAH help and note the quoted prediction accuracy.
· Don't use the sequence pastebox (it can't strip numbers out of the sequence).
· Type (or paste) /home/seqanal/public_html/courses/spring99/seq.fasta in the load box and then perform the search.
· When the result arrives, look at the predicted poly(A) sites.
o How many candidate sites were found?
o If one or more of these sites are false, is the prediction accuracy as good as claimed?
o How might overprediction of poly(A) sites be avoided?
Step 4. Predicting splice sites and coding exons
There are a number of servers that separately predict splice sites and coding-sequence bias, but this information needs to be analysed together. We found that the CBS site in Denmark could provide all the information, though from two different servers. The NetGene2 server provides a graphical postscript output that we can print out and mark our predictions on. From the same group, the HMMgene server (using different algorithms) provides list output including potential start and stop codons. Both servers overpredict splice site candidates. In case you need reminding, classical splice sites look something like:
o Donor consensus: (c/a)AG^GT(a/g)agt
o Acceptor consensus: (T>C)nN(C>T)AG^gt
· (As needed, open a new navigator window and load this page into it.)
· Load the NetGene2 query submission page.
· Paste in the sequence and submit the job, which takes a few minutes to run.
· The output provides a list of candidate splice sites (on both strands) and a graphical coding/splicing prediction. However, it is not clear which translation frame is supposed to be coding. It is worth printing this figure out and using it to summarise our prediction attempts!
o Click on "Direct strand" and save the compressed postscript output (it has a .Z suffix).
o Open a UNIX X-window (terminal from the desktop).
o Uncompress the file by typing the UNIX command gunzip filename.ps.Z
o Print the file to the printer outside room V111 by typing lpr -Plw-v111 filename.ps
· Now load the HMMgene query submission page.
· Paste /home/seqanal/public_html/courses/spring99/seq.fasta in the local file box.
· Select the 5 best predictions and toggle on "predict signals".
· Submit the job.
· Click on the Explanation link to understand the output format.
· We can now begin to assemble a complete gene prediction.
Step 5. Combining the server outputs into an overall prediction
We now have predictions for all the components needed to assemble the gene, rather inconveniently spread over many separate web outputs. We have to assemble all this manually into one prediction. This can be done on the NetGene2 and DNA sequence outputs using a biro and a fluorescent marker. The following guidelines may help.
· Start from a strong point, such as a well-predicted internal coding exon with good splice borders.
· Work forwards and backwards toward the promoter and poly(A) boundary signals.
· Reported splice site quality is not a completely robust guide to usage.
o Context dependence is also important.
o Splice sites tend to be overpredicted.
o Some (true) splice sites might be better predicted by the HMMgene algorithm than by NetGene2...
· The terminal exon should be partially coding, including the stop codon and the poly(A) signature.
· The initiation codon should obey the Kozak rules:
§ It is normally the first methionine from the 5' end of the mRNA.
§ At least one of the key residues (the purine at -3 or the G at +4) should be present in the consensus A(Pu)XXAUGG.
· Once the prediction is completed, we can check it in the next exercise.
· Good luck!

Step 6. Gene prediction by homology using GeneWise
Usually nowadays, related sequences are already present in the databases. When available, these may be the fastest way to get a good gene prediction. Often this prediction will be more reliable than the coding bias predictions, though one should be aware of the possibility of sequence error, differential splicing, etc., and of course finding the coding exons is not a complete gene prediction. The GeneWise program has an exhaustive (slow) algorithm to align a protein to a DNA sequence, allowing for splice site recognition. (In a real situation, BLAST programs would be useful for first picking up the matches in a DB search.)
· Open an X-window (or terminal on Tau's desktop).
· Type prepare wise2 in the window.
· We've prepared files with the human DNA and a homologous chicken protein to compare.
· Now you can type or cut and paste the following command into the UNIX window:
genewise /home/seqanal/public_html/courses/spring99/kad1_chick /home/seqanal/public_html/courses/spring99/hsak1.dna
· The program will run with default parameters and after a couple of minutes will print out the matched exons.
· Now compare the results to the predictions so far.
o How many exons are found?
o Are the splice sites between or within codons?
o Did you find all these coding regions earlier?
o Have we now found all the coding exons (the chicken homologue has 194 AA)?

18.4 Check your progress: Model answers
Your answer must include these points:
· build-imm and glimmer (or glimmer2)
· build-imm takes an input set of sequences
· glimmer uses this IMM to identify genes in a genomic sequence

18.5 Points for Discussion
1. Do a comparative study of the various gene prediction programs.

18.6 References:
1. Benfey, P.; Protopapas, A.D. (2004). Essentials of Genomics. Prentice Hall.
2. Brown, Terence A. (2002). Genomes 2. Oxford: Bios Scientific Publishers. ISBN 9781859960295.
3. Gibson, Greg; Muse, Spencer V. (2004). A Primer of Genome Science, Second Edition. Sunderland, Mass.: Sinauer Assoc. ISBN 0-87893-234-8.
4. Gregory, T. Ryan (ed.) (2005). The Evolution of the Genome. Elsevier. ISBN 0-12-301463-8.
5. Reece, Richard J. (2004). Analysis of Genes and Genomes. Chichester: John Wiley & Sons. ISBN 0-470-84379-9.
6. Saccone, Cecilia; Pesole, Graziano (2003). Handbook of Comparative Genomics. Chichester: John Wiley & Sons. ISBN 0-471-39128-X.
7. Werner, E. (2003). "In silico multicellular systems biology and minimal genomes". Drug Discov Today 8 (24): 1121-1127. PMID 14678738.
8. Witzany, G. (2006). "Natural Genome Editing Competences of Viruses". Acta Biotheoretica 54 (4): 235-253. PMID 17347785.
UNIT V
LESSON – 19 SECONDARY STRUCTURE PREDICTION
19.0 Aims and Objectives
19.1 Secondary structure prediction
19.1.1 Prediction of protein secondary structure from the amino acid sequence
19.1.2 Accuracy of secondary structure prediction
19.1.3 Methods for secondary structure prediction
19.2 Let us Sum up
19.3 Lesson end activities
19.4 Check your progress
19.5 Points for Discussion
19.6 References

19.0 Aims and Objectives:
This unit discusses secondary structure prediction, prediction of protein secondary structure from the amino acid sequence, the accuracy of secondary structure prediction, and methods for secondary structure prediction.

19.1 Secondary Structure Prediction
Secondary structure prediction is a set of techniques in Bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure - amino acid or nucleotide sequence, respectively. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins, or canonical microRNA structures in RNA.

The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions in fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA. The problems of predicting RNA secondary structure are broadly related but depend mainly on base pairing and base stacking interactions; many RNA molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of RNA molecules, such as transfer RNA or microRNA, is observed. Many RNA secondary structure prediction methods rely on variations of dynamic programming and therefore are unable to efficiently identify pseudoknots.

19.1.1 Prediction of Protein Secondary Structure from the Amino Acid Sequence
Accurate prediction of where alpha helices, beta strands, and other secondary structures will form along the amino acid chain of proteins is one of the greatest challenges in sequence analysis. At present, it is not possible to predict these features with very high reliability. As methods have improved, prediction has reached an average accuracy of 64–75%, with a higher accuracy for alpha helices, depending on the method used. These predictive methods can be made especially useful when combined with other types of analyses discussed in this lesson.
For example, a search of a sequence database or a protein motif database for matches to a candidate sequence may discover a family or superfamily relationship with a protein of known structure. If significant matches are found in regions of known secondary or three-dimensional structure, the candidate protein may share the three-dimensional structural features of the matched protein. Several Web sites provide such an enhanced analysis of secondary structure or other secondary structure analyses of a query protein. The main methods of analysis used at these sites are described below.

Methods of structure prediction from amino acid sequence begin with an analysis of a database of known structures. These databases are examined for possible relationships between sequence and structure. When secondary structure predictions were first being made in the 1970s and 1980s, only a few dozen structures were available. This situation has now changed, with present databases including approximately 500 independent structural folds. The combination of more structural and sequence information presents a new challenge to investigators who wish to develop more powerful predictive methods.

The ability to predict secondary structure also depends on identifying types of secondary structural elements in known structures and determining the location and extent of these elements. The main types of secondary structures that are examined for sequence variation are alpha helices and beta strands. Early efforts focused on more types of structures, including other types of helices, turns, and coils. To simplify secondary structure prediction, these additional structures that are not an alpha helix or beta strand were subsequently classified as coils. Assignment of secondary structure to particular amino acids is sometimes included in the PDB file by the investigator who has solved the three-dimensional structure. In other cases, secondary structure must be assigned to amino acids by examination of the structural coordinates of the atoms in the PDB file. Methods for comparing three-dimensional structures, described above, frequently assign these features automatically, but not always
in the same manner. Hence, some variation is possible, and deciding which is the best method can be difficult. The DSSP database of secondary structures and solvent accessibilities is a useful and widely used resource for this purpose (Kabsch and Sander 1983; http://www.sander.ebi.ac.uk/dssp/). This database, which is based on recognition of hydrogen-bonding patterns in known structures, distinguishes eight secondary structural classes that can be grouped into alpha helices, beta strands, and coils (Rost and Sander 1993). A more recently described automatic method makes predictions in accord with published assignments (Frishman and Argos 1995).

The assumption on which all the secondary structure prediction methods are based is that there should be a correlation between amino acid sequence and secondary structure. The usual assumption is that a given short stretch of sequence may be more likely to form one kind of secondary structure than another. Thus, many methods examine a sequence window of 13–17 residues and assume that the central amino acid in the window will adopt a conformation that is determined by the side groups of all the amino acids in the window. This window size is within the range of lengths of alpha helices (5–40 residues) and beta strands (5–10 residues).

There is evidence that more distant interactions within the primary amino acid chain may influence local secondary structure. The same amino acid sequence up to 5 (Kabsch and Sander 1984) and 8 (Sudarsanam 1998) residues in length can be found in different secondary structures. An 11-residue-long amino acid "chameleon" sequence has been found to form an alpha helix when inserted into one part of a primary protein sequence and a beta sheet when inserted into another part of the sequence (Minor and Kim 1996). More distant interactions may account for the observation that beta strands are predicted more poorly by analysis of local regions (Garnier et al. 1996). However, methods that use amino acids more distant than those in the small window of sequence all perform less well at predicting the secondary structure of an amino acid residue.
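All of the window-based schemes described above start by cutting the sequence into overlapping windows centered on each residue. A minimal Python sketch of this common first step follows; the window size, padding character, and function name are illustrative assumptions, not taken from any particular program.

def sequence_windows(seq, size=17, pad='X'):
    """Yield (center_index, window) pairs for a sliding window.

    The ends of the sequence are padded so that every residue,
    including the first and last, gets a full-length window
    centered on it.
    """
    half = size // 2
    padded = pad * half + seq + pad * half
    for i in range(len(seq)):
        yield i, padded[i:i + size]

for center, window in sequence_windows("MKVLAANGG", size=5):
    print(center, window)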
The number of possible amino acid combinations in a sequence window of 17 amino acids is very large (20^17, or about 10^22). If many combinations influence one type of secondary structure, examination of a large number of protein structures is required to discover the significant patterns and correlations within this window. Earlier methods for predicting secondary structure assumed that each amino acid within the sequence window of 13–17 residues influences the local secondary structure independently of other nearby amino acids; i.e., there is no interaction between amino acids in influencing local secondary structure. Later methods assumed that interactions between amino acids within the window could play a role. Neural network models described below have the ability to detect interactions between amino acids in a sequence window, including conditional interactions. A hypothetical example of the interactions that might be discovered illustrates the possibilities. If the central amino acid in the sequence window is Leu and if the second upstream amino acid toward the amino terminus is Asn, the Leu is in an alpha helix; however, if the neighboring amino acid is not Asn, the Leu is in a beta strand. In another method of secondary structure prediction, the nearest-neighbor method, sequence windows in known structures that are most like the query sequence are identified. This method bypasses the need to discover complex amino acid patterns associated with secondary structure. Protein secondary structure has also been modeled by hidden Markov models, also described as discrete state-space models, which are described below (Stultz et al. 1993; White et al. 1994).

19.1.2 Accuracy of Secondary Structure Prediction
One method of assessing the accuracy of secondary structure prediction is to give the percentage of correctly predicted residues in sequences of known structure, called Q3. This measure, however, is not very effective by itself, because even a random assignment of structure can achieve a high score by this test (Holley and Karplus 1991). Another measure is to report the fraction of each type of predicted structure that is correct. A third method is to calculate a correlation coefficient for each type of predicted secondary structure (Matthews 1975). The coefficient indicating success in predicting residues in the α-helical configuration, Cα, is given by

Cα = (pα nα − oα uα) / √[(pα + oα)(pα + uα)(nα + oα)(nα + uα)]

where pα is the number of correct positive predictions, nα is the number of correct negative predictions, oα is the number of overpredicted positive predictions (false positives), and uα is the number of underpredicted residues (misses). The closer this coefficient is to a value of 1, the more successful the method at predicting a helical residue. An overall level of prediction accuracy does not provide information on the accuracy of the number of predicted secondary structures, or of their lengths and locations in the sequence. One simple index of success is to compare the average of the predicted lengths with the known average (Rost and Sander 1993). Another factor to consider in prediction accuracy is that some protein structures are more readily predictable than others, such that the spectrum of test proteins chosen will influence the frequency of success. A representative set of proteins that have limited similarity will provide the most objective test.
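As a worked illustration of the coefficient just defined, a minimal Python sketch follows; the counts in the example are invented for illustration.

import math

def correlation_coefficient(p, n, o, u):
    """Matthews (1975) correlation coefficient for one structural state.

    p = correct positive predictions, n = correct negative predictions,
    o = overpredictions (false positives), u = underpredictions (misses).
    Returns a value near 1 for good predictions and near 0 for random ones.
    """
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return (p * n - o * u) / denominator if denominator else 0.0

# Invented counts for helix prediction on a small test protein.
print(correlation_coefficient(p=40, n=120, o=10, u=8))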
Rost and Sander (1993) have chosen a set of 126 globular and 4 membrane proteins that have less than 25% pair-wise similarity and have used this set for training and testing neural network models. A newer set of 540 structurally distinct fold types in the FSSP database provides an even larger set of training and test structures of unique structure and sequence (Holm and Sander 1998). In the often-used jackknife test, one protein in a set of known structure is left out of a calibration or training step of the program being tested. The rest of the proteins are used to predict the structure of the left-out one, and the procedure is cycled through all of the sequences. The overall frequency of success in predicting the secondary structural features of the left-out sequence is used as an indicator of success. An even more comprehensive approach to the problem of accuracy is to examine the predictions for different structural classes of proteins. Because some classes are much more difficult to predict, the overall success rate with respect to protein class is an important index of success. Prediction accuracy is discussed further below.

A valuable addition to secondary structure prediction is giving the degree of reliability of the prediction at each position. Some prediction methods produce a score for each of the three types of structures (helix, strand, coil or loop) at each residue position. If one of these scores is much higher than the other two, the score is considered to be more reliable, and a high reliability index may be assigned that reflects high confidence in the prediction. If the scores are more similar, the index is lower. By examining predictions for known structures, as in a jackknife experiment, the accuracy of these reliability indices may be determined. What has been found is that a prediction with a high index score is much more accurate (Yi and Lander 1993; and see the PHD server below), thus increasing confidence in the prediction of these residues.

19.1.3 Methods for Secondary Structure Prediction
Three widely used methods of protein secondary structure prediction are (1) the Chou-Fasman and GOR methods, (2) neural network models, and (3) nearest-neighbor methods. An additional method that models structural families by hidden Markov models is then described. These methods can be further enhanced by examining the distribution of hydrophobic, charged, and polar amino acids in protein sequences.

Check your progress:
1. List the widely used secondary structure prediction methods.
Notes:
kk) Write your answer in the space given below.
ll) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

19.2 Let us Sum up
Secondary structure prediction is a set of techniques in Bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure - amino acid or nucleotide sequence. Accurate prediction of where alpha helices, beta strands, and other secondary structures will form along the amino acid chain of proteins is one of the greatest challenges in sequence analysis.
19.3 Lesson end activities
(i) Download any protein sequence from PDB.
(ii) Do secondary structure prediction using all the methods.

19.4 Check your progress: Model answers
1. Your answer must include these points: (1) the Chou-Fasman and GOR methods, (2) neural network models, and (3) nearest-neighbor methods.

19.5 Points for Discussion
"Secondary structure prediction of proteins requires a thorough knowledge of primary structure" – Substantiate.

19.6 References
1. C. Branden and J. Tooze (1999). Introduction to Protein Structure, 2nd ed. Garland Publishing: New York, NY.
2. W. Kabsch and C. Sander. Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen Bonded and Geometrical Features. Biopolymers 22: 2577-2637 (1983). PMID 6667333.
3. M. Zuker. "Computer prediction of RNA structure", Methods in Enzymology, 180: 262-88 (1989). (The classic paper on dynamic programming algorithms to predict RNA secondary structure.)
4. L. Pauling and R.B. Corey. Configurations of polypeptide chains with favored orientations of the polypeptide around single bonds: Two pleated sheets. Proc. Natl. Acad. Sci. Wash., 37: 729-740 (1951). (The original beta-sheet conformation article.)
5. L. Pauling, R.B. Corey and H.R. Branson. Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. Wash., 37: 205-211 (1951). (The alpha- and pi-helix conformations; the authors predicted that 3₁₀ helices would not be possible.)

LESSON – 20 METHODS FOR SECONDARY STRUCTURE PREDICTION
20.0 Aims and Objectives
20.1 Chou-Fasman/GOR method
20.2 Patterns of hydrophobic amino acids can aid structure prediction
20.3 Secondary structure prediction by neural network models
20.4 Let us Sum up
20.5 Lesson end activities
20.6 Check your progress
20.7 Points for Discussion
20.8 References

20.0 Aims and Objectives
This unit discusses the secondary structure prediction methods, namely the Chou-Fasman/GOR method, how patterns of hydrophobic amino acids can aid structure prediction, and secondary structure prediction by neural network models.

20.1 Chou-Fasman/GOR Method
The Chou-Fasman method (Chou and Fasman 1978) was based on analyzing the frequency of each of the 20 amino acids in alpha helices, beta sheets, and turns of the then-known, relatively small number of protein structures. It was found, for example, that amino acids Ala (A), Glu (E), Leu (L), and Met (M) are strong predictors of alpha helices, but that Pro (P) and Gly (G) are predictors of a break in a helix. A table of predictive values for each type of secondary structure was made for each of the alpha helices, beta strands, and turns. To produce these values, the frequency of amino acid i in structure s is divided by the frequency of all residues in structure s. The resulting three structural parameters (Pα, Pβ, and Pt) vary roughly from 0.5 to 1.5 for the 20 amino acids. To predict a secondary structure, the following set of rules is used. The sequence is first scanned to find a short stretch of amino acids that has a high probability of starting a nucleation event that could form one type of structure.
For alpha helices, a prediction is made when four of six amino acids have a high probability (≥1.03) of being in an alpha helix. For beta strands, the presence in a sequence of three of five amino acids with a probability of ≥1.00 of being in a beta strand predicts a nucleation event for a beta strand. These nucleated regions are extended along the sequence in each direction until the prediction values for four amino acids drop below 1. If both α-helical and β-strand regions are predicted, the higher-probability prediction is used.

Turns are predicted somewhat differently. Turns are modeled as a tetrapeptide, and two probabilities are calculated. First, the average of the probabilities for each of the four amino acids being in a turn is calculated, as for alpha helix and beta strand predictions. Second, the probabilities of amino acid combinations being present at each position in the turn tetrapeptide (i.e., the probability that a particular amino acid such as Pro is at position 1, 2, 3, or 4 in the tetrapeptide) are determined. These probabilities for the four amino acids in the candidate sequence are multiplied to calculate the probability that the particular tetrapeptide is a turn. A turn is predicted when the first probability value is greater than the probabilities for an alpha helix and a beta strand in the region and when the second probability value is greater than 7.5 × 10⁻⁵. In practice, the Chou-Fasman method is only about 50–60% accurate in predicting secondary structural domains.

Garnier et al. (1978) developed a somewhat more involved method for protein secondary structure prediction that is based on a more sophisticated analysis. The method is called the GOR (Garnier, Osguthorpe, and Robson) method. Whereas the Chou-Fasman method is based on the assumption that each amino acid individually influences secondary structure within a window of sequence, the GOR method is based on the assumption that amino acids flanking the central amino acid residue influence the secondary structure that the central residue is likely to adopt. In addition, the GOR method uses principles of information theory to derive predictions (Garnier et al. 1996). As in the Chou-Fasman method, known secondary structures are scanned for the occurrence of amino acids in each type of structure. However, the frequency of each type of amino acid at the next 8 amino-terminal and carboxy-terminal positions is also determined, making the total number of positions examined equal to 17, including the central one. In the original GOR method, four scoring matrices, containing in each column the probability of finding each amino acid at one of the 17 positions, are prepared. One matrix corresponds to the central (ninth) amino acid being found in an alpha helix, the second to the amino acid being in a beta strand, the third a coil, and the fourth a turn. Later versions omitted the turn calculation because turns were the most variable features and were consequently the most difficult to predict. A candidate sequence is analyzed by each of the three to four matrices using a sliding window of 17 residues. Each matrix is positioned along the candidate sequence, and the matrix giving the highest score predicts the structural state of the central amino acid. At least 4 residues in a row have to be predicted as an alpha helix and 2 in a row as a beta strand for a prediction to be validated.
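Before turning to the GOR calculation in more detail, the Chou-Fasman helix-nucleation rule described above can be sketched in Python. The propensity table below covers only a few residues (a real implementation needs all 20), and the values, default for unlisted residues, and function name are given for illustration only.

# Illustrative subset of Chou-Fasman helix propensities P(alpha).
P_ALPHA = {'A': 1.42, 'E': 1.51, 'L': 1.21, 'M': 1.45,
           'K': 1.16, 'S': 0.77, 'P': 0.57, 'G': 0.57}

def helix_nucleation_sites(seq, threshold=1.03):
    """Start positions where four of six residues exceed the threshold,
    the Chou-Fasman condition for nucleating an alpha helix."""
    sites = []
    for i in range(len(seq) - 5):
        strong = sum(1 for aa in seq[i:i + 6]
                     if P_ALPHA.get(aa, 1.0) >= threshold)
        if strong >= 4:
            sites.append(i)
    return sites

# Nucleation predicted at positions 0-2 and 9 of this toy sequence.
print(helix_nucleation_sites("AELMAKGPSGSAEML"))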
Matrix values are calculated in somewhat the same manner as amino acid substitution matrices, in that matrix values are calculated as log odds units representing units of information. The information available as to the joint occurrence of secondary structural conformation S and amino acid a is given by (Garnier et al. 1996)

I(S; a) = log [ P(S | a) / P(S) ]

where P(S | a) is the conditional probability of conformation S given residue a, and P(S) is the probability of conformation S. By Bayes' rule, the probability of conformation S given amino acid a is

P(S | a) = P(S, a) / P(a)

where P(S, a) is the joint probability of S and a, and P(a) is the probability of a. These probabilities can be estimated from the frequency of each amino acid found in each structure and the frequency of each amino acid in the structural database. Given these frequencies,

I(S; a) = log ( fS,a / fS )

where fS,a is the fraction of occurrences of amino acid a found in conformation S and fS is the fraction of all amino acid residues found to be in conformation S. The GOR method maximizes the information available in the values of fS,a and avoids data size and sampling variations by calculating the information difference between the competing hypotheses that residue a is in structure S, I(S; a), or that a is in a different conformation (not S), I(not S; a). This difference, I(ΔS; a) = I(S; a) − I(not S; a), is derived from the observed amino acid data as

I(ΔS; a) = log [ fS,a / (1 − fS,a) ] + log [ (1 − fS) / fS ]

where the frequency of finding amino acid a not in conformation S is 1 − fS,a and the frequency of not finding any amino acid in conformation S is 1 − fS. These values are used to calculate the information difference I(ΔSm; a1, ..., ax) for a series of x consecutive positions flanking sequence position m, from which the ratio of the joint probability of conformation Sm given a1, ..., ax to the joint probability of any other conformation may be calculated.

Searching for all possible patterns in the structural database would require an enormous number of proteins. Hence, three simplifying approaches have been taken. First, it was assumed in earlier versions of GOR that there is no correlation between amino acids in any of the 17 positions (both the flanking 8 positions on each side and the central amino acid position), i.e., that each amino acid position has a separate and independent influence on the structural conformation of the central amino acid. The steps are then: (1) values of I(ΔS; a) are calculated for each of the 17 positions; (2) these values are summed to approximate the value of I(ΔSm; a1, ..., ax); and (3) the probability ratios are calculated. The second assumption, used in later versions of GOR, was that certain pair-wise combinations of an amino acid in the flanking region and the central amino acid influence the conformation of the central amino acid. This model requires a determination of the frequency of amino acid pairs between each of the 16 flanking positions and the central one, both when the central residue is in conformation S and when it is not. Finally, in the most recent version of GOR, the assumption is made that certain pair-wise combinations of amino acids in the flanking region, or of a flanking amino acid and the central one, influence the conformation of the central one. Thus, there are 17 × 16 / 2 = 136 possible pairs to use for frequency measurements and to examine for correlation with the conformation of the central residue.
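A minimal Python sketch of the single-position information difference defined above; the frequencies in the example are invented for illustration.

import math

def information_difference(f_sa, f_s):
    """I(dS; a) = log[f_sa / (1 - f_sa)] + log[(1 - f_s) / f_s].

    f_sa is the fraction of occurrences of residue a that fall in
    conformation S; f_s is the fraction of all residues in S.
    Positive values favor conformation S for the central residue.
    """
    return math.log(f_sa / (1.0 - f_sa)) + math.log((1.0 - f_s) / f_s)

# Invented frequencies: half of all Glu residues sit in helices,
# while helices hold only 30% of residues overall.
print(information_difference(f_sa=0.5, f_s=0.3))

Summing such terms over the 17 window positions gives the additive approximation used in the earliest GOR version described above.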
With the advent of a large number of protein structures, it has become possible to assess the frequencies of amino acid combinations and to use this information for secondary structure predictions. The GOR method predicts 64% of the residue conformations in known structures but quite drastically (by 36.5%) underpredicts the number of residues in beta strands. Use of the Chou-Fasman and GOR methods for predicting the secondary structure of the beta subunit of Salmonella typhimurium tryptophan synthase is illustrative. In this particular case, the positions of the secondary structures predicted by either of these methods are very similar to those in the solved crystal structure (Branden and Tooze 1991). However, tests of the accuracy of these methods using sequences of other proteins whose structures are known have shown that the Chou-Fasman method is only about 50–60% accurate in predicting structural domains. The methods are most useful in the hands of a knowledgeable structural biologist, and have been used most successfully in polypeptide design and in analysis of motifs for organelle transport (Branden and Tooze 1991). A useful approach is to analyze each of a series of aligned amino acid sequences and then to derive a consensus structural prediction.

20.2 Patterns of Hydrophobic Amino Acids Can Aid Structure Prediction
Prediction of secondary structure can be aided by examining the periodicity of amino acids with hydrophobic side chains in the protein chain. This type of analysis was discussed above in the prediction of transmembrane α-helical domains in proteins. Hydrophobicity tables that give hydrophobicity values for each amino acid are used to locate the most hydrophobic regions of the protein. As for secondary structure prediction, a sliding window is moved across the sequence and the average hydrophobicity value of the amino acids within the window is plotted. These methods use the chemical properties of amino acid side chains to predict whether these amino acids are located on the surface or buried within the core structure.

The location of hydrophobic amino acids within a predicted secondary structure can also be used to predict the location of the structure. One type of display of this distribution is the helical wheel or spiral display of the amino acids in an alpha helix. The use of this display was described above as a way to visualize the location of leucine residues on one face of the helix in a leucine zipper structure. There is also a tendency for hydrophobic residues located in alpha helices on the surface of protein structures to face the core of the protein and for polar and charged amino acids to face the aqueous environment on the outside of the alpha helix. This arrangement is also revealed by the helical wheel display. The contours in a hydrophobic moment plot show positions in the amino acid sequence where hydrophobic amino acids tend to segregate to opposite sides of a structure, plotted against various angles of rotation from one residue to the next along the protein chain. For alpha helices, the angle of rotation is 100 degrees, and for beta strands, 160 degrees. The analysis predicts, for example, an alpha helix at approximately sequence position 165 that has segregated hydrophobic amino acids on one helix face. Helix α5 runs from positions 160 to 168 in the crystal structure of this protein.
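The sliding-window hydrophobicity average described above is straightforward to sketch in Python. The hydropathy values below are a small subset of the Kyte-Doolittle table; the window size, function name, and example sequence are illustrative assumptions.

# Illustrative subset of Kyte-Doolittle hydropathy values.
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'A': 1.8,
      'G': -0.4, 'S': -0.8, 'D': -3.5, 'K': -3.9, 'R': -4.5}

def hydropathy_profile(seq, size=7):
    """Average hydropathy over a sliding window; sustained peaks
    suggest buried or membrane-associated segments."""
    return [sum(KD.get(aa, 0.0) for aa in seq[i:i + size]) / size
            for i in range(len(seq) - size + 1)]

for value in hydropathy_profile("KDSGAVILLFVAGSKDR"):
    print(round(value, 2))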
Check your progress:
1. Mention the main criteria of the Chou-Fasman method.
Notes:
mm) Write your answer in the space given below.
nn) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

20.3 Secondary Structure Prediction by Neural Network Models
The most sophisticated methods that have been devised to make secondary structure predictions for proteins use artificial intelligence, or so-called neural net algorithms. An earlier method of this type examined patterns that represent secondary structural features, like the Chou-Fasman method. However, this method went farther and tried to locate these patterns in a particular order that coincides with a known domain structure. Patterns typical of proteins (Cohen et al. 1983), turns in globular proteins (Cohen et al. 1986), or helices in helical proteins (Presnell et al. 1992) may be located and used to predict secondary structure with increased confidence. The program MACMATCH, which combines these methods with a neural network approach to predict the secondary structure of globular proteins on a Macintosh computer, has been described (Presnell et al. 1993).

In the neural network approach, computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in these structures. There are many examples of the use of this method to predict protein structures, which have been reviewed (Holley and Karplus 1991; Hirst and Sternberg 1992). The early methods are reported to be up to 63–64% accurate. These methods have been improved to a level of over 70% for globular proteins by the use of information from multiple sequence alignments (Rost and Sander 1993, 1994). Two Web sites that perform a neural network analysis for protein secondary structure prediction are PHD (Rost and Sander 1993; Rost 1996; http://www.embl-heidelberg.de/predictprotein/predictprotein.html) and NNPREDICT (Kneller et al. 1990; http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html). These neural network models are theoretically able to extract more information from sequences than the information theory method described above (Qian and Sejnowski 1988). Neural networks have also been used to model translational initiation sites and promoter sites in E. coli, splice junctions, and specific structural features in proteins, such as α-helical transmembrane domains. These applications are discussed elsewhere in this text.

Neural network models are meant to simulate the operation of the brain. The complex patterns of synaptic connections among a large number of neurons are presumed to underlie the functions of the brain. Some groups of neurons are involved in collecting data as environmental signals, others in processing data, and yet others in providing a response to the signals. Neural networks are an attempt to build a similar kind of learning machine where the input is a 13–17-amino-acid length of sequence and the output is the predicted secondary structure of the central amino acid residue. The object is to train the neural network to respond correctly to a set of such flanking sequence fragments when the secondary structural features of the centrally located amino acid are known.
The training is designed to achieve recognition of amino acid patterns associated with secondary structure. If the neural network has sufficient capacity for learning, these patterns may potentially include complex interactions among the flanking amino acids in determining secondary structures. However, two studies with neural networks described below have so far not found evidence for such interactions.

A sliding window of 13–17 amino acid residues is moved along a sequence. The sequence within each window is read and used as input to a neural network model previously trained to recognize the secondary structure most likely to be associated with that pattern. The model then predicts the secondary structural configuration of the central amino acid as alpha helix, beta strand, or other. Rules, or another trained network, are then applied to make the prediction for a series of residues reasonable. For example, at least 4 amino acids in a row should be predicted as being in an alpha helix if the prediction is to make structural sense.

The model comprises three layers of processing units: the input layer, the output layer, and the so-called hidden layer between them. Signals are sent from the input layer to the hidden layer and from the hidden layer to the output layer through junctions between the units. This configuration is referred to as a feed-forward multilayer network. The input layer of units reads the sequence, one unit per amino acid residue, and transmits information on the amino acid at that location. A small window of sequence is read at a time, and information is sent as signals through junctions to a number of sequential units in the hidden layer by all of the input units within the window. These signals are each individually modified by a weighting factor and then added together to give a total input signal into each hidden unit. Sometimes a bias is added to this sum to influence the response of the unit. The resulting signal is then transformed by the hidden unit into a number that is very close either to a 1 or to a 0 (or sometimes to a −1). A mathematical function known as a sigmoid trigger function, simulating the firing or nonfiring states of a neuron, is used for this transformation. Signals from the hidden units are then sent to three individual output units, each output unit representing one type of secondary structure (helix, strand, or other). Each signal is again weighted, the input signals are summed, and each of the three output units then converts the combined signal into a number that is approximately a 1 or a 0. An output signal that is close to 1 represents a prediction of the secondary structural feature represented by that output unit, and a signal near 0 means that the structure is not predicted.

When hidden layers are included, a neural network model is capable of detecting higher levels of interaction among amino acids that influence secondary structure. For example, particular combinations of amino acids may produce a particular type of secondary structure. To resolve these patterns, a sufficient number of hidden units is needed (Holley and Karplus 1991); the number used varies from 2 to a range of 10–40.
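A minimal Python sketch of the feed-forward computation just described: a window of inputs is weighted and summed into hidden units, passed through the sigmoid trigger function, and combined again into three output units. The one-number-per-residue encoding and the random (untrained) weights are assumptions made to keep the sketch short.

import math
import random

def sigmoid(x):
    # The trigger function that pushes a unit's output toward 0 or 1.
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(inputs, hidden_weights, output_weights):
    """One pass through the input -> hidden -> output layers.

    Real networks encode each residue as ~21 binary inputs; here
    each window position is a single number for brevity.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden)))
            for ws in output_weights]

random.seed(1)
window = [random.random() for _ in range(13)]  # toy encoded 13-residue window
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(13)] for _ in range(3)]
w_output = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
print(feed_forward(window, w_hidden, w_output))  # helix, strand, other scores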
An interesting side effect of adding more hidden units is that the neural network memorizes the training set but at the same time becomes less accurate with test sequences. This effect is revealed by using the trained network to predict the same structures used for training: the number correct increases by over 20% as the number of hidden units increases from 0 to 10. In contrast, the accuracy of prediction of test sequences not used for training decreased by 3% (Holley and Karplus 1991). Without hidden layers, the neural network model is known as a perceptron and has a more limited capacity to detect such combinations. In two studies, networks with no hidden units were as successful in predicting secondary structure as those with hidden units. In addition, the number of hidden units was increased to as many as 60 in one study (Qian and Sejnowski 1988) and 20 in another (Holley and Karplus 1991) without significantly changing the level of success. These observations imply that the influence of local sequence on secondary structure is the additive influence of individual residues and that there is no higher level of interaction among these residues. To detect such interactions, however, requires a training set large enough to provide a significant number of examples, and these conditions may not have been met. These same studies examined the effect of input window size and found that the maximum information for secondary structure prediction seems to be located within a window of 13–17 amino acids, as larger windows do not increase accuracy. However, small windows were less effective, suggesting that they carry insufficient information, and below a window size of 5, success at predicting beta strands decreased.

Training the neural network model is the process of adjusting the values of the weights used to modify the signals from the input layer to the hidden layer and from the hidden layer to the output layer. The object is to have these weights balance the input signals so that the model output correctly identifies the known secondary structure of the central amino acid in a sequence window of a protein of known structure. Because there may be thousands of connections between the various units in the network, a systematic method is needed to adjust these values. Initially, the weights are assigned a constant or random value (typical range −0.1 to 0.1). The sliding window is then positioned along one of the training sequences. The predicted output for a given sequence window is then compared to the known structure of the central amino acid residue. The model is adjusted to increase the chance of predicting the correct residue. The adjustment involves changing the weighting of propagated signals by a method called the back-propagation algorithm. This procedure is repeated for all windows in all of the training sequences. The better the model, the more predicted structures will be correct. Conversely, the worse the model, the more predictions will be incorrect. The object then becomes to minimize the number of incorrect predictions. The error E is expressed as the square of the total number of incorrect predictions by the output units. When the back-propagation algorithm is applied, the weights are adjusted by a small amount to decrease the errors. A window of a training sequence is used as input to the network, and the predicted and expected (known) structures of the central residue are compared.
A set of small corrections is then made to the weights to improve an incorrect prediction, or the weights are left relatively unchanged for a correct prediction. This procedure is repeated using another training sequence until the number of errors cannot be reduced further. A large number of training cycles, representing a slow training rate, is an important factor in training the network to produce the smallest number of incorrect predictions. Not all of the training sequences need be used: a random input of training patterns may be used, and sometimes these may be chosen from subsets of sequences that represent one type of secondary structure, to balance the training for each type of structure. The back-propagation algorithm examines the contribution of each connection in the network to the subsequent levels and adjusts the weight of this connection, if needed, to improve the predictions.

The PHDsec program in the PHD system, described above in the section on prediction of transmembrane-spanning proteins, is an example of a neural network program for protein secondary structure prediction (Rost and Sander 1993; Rost 1996). The Web address of this resource is http://www.embl-heidelberg.de/predictprotein/predictprotein.html. PHDsec uses a procedure similar to that used by PHD. A BLAST search of the input sequence is conducted to identify similar but not closely identical sequences, and a multiple alignment of the sequences is transformed into a sequence profile. This profile is then used as input to a neural network trained to recognize correlations between a window of 13 amino acids and the secondary structure of the central amino acid in the window. Program output includes a reliability index for each estimate on a scale of 1 (low reliability) to 9 (high reliability). These reliabilities are obtained as normalized scores derived from the output values of the three units in the output layer of the network. The highest output value is compared to the next lowest value and the difference is normalized to give the reliability index. These indices are a useful way to examine the predictions in closer detail.

20.4 Let us Sum up
The Chou-Fasman method (Chou and Fasman 1978) was based on analyzing the frequency of each of the 20 amino acids in alpha helices, beta sheets, and turns of the then-known, relatively small number of protein structures. This unit discussed the secondary structure prediction methods, namely the Chou-Fasman/GOR method, the use of patterns of hydrophobic amino acids to aid structure prediction, and secondary structure prediction by neural network models.

20.5 Lesson end activities
(i) What are the other methods used?
(ii) Find out the major differences between these methods.

20.6 Check your progress: Model answers
1. Your answer must include these points: based on analyzing the frequency of each of the 20 amino acids in alpha helices, beta sheets, and turns of the then-known relatively small number of protein structures.

20.7 Points for Discussion
Make a sensitivity analysis of the Chou-Fasman method of secondary structure prediction.

20.8 References
1. Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77 (2), p. 257-286, February 1989.
2.
Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999. ISBN 0-521-62971-3.
3. Lior Pachter and Bernd Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press, 2005. ISBN 0-521-85700-7.
4. Olivier Cappé, Eric Moulines, Tobias Rydén. Inference in Hidden Markov Models. Springer, 2005. ISBN 0-387-40264-0.
5. Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning Hidden Markov Model Structure for Information Extraction. AAAI 99 Workshop on Machine Learning for Information Extraction, 1999.

LESSON – 21 NEAREST-NEIGHBOR METHOD
21.0 Aims and Objectives
21.1 Nearest-neighbor Methods of Secondary Structure Prediction
21.2 Hidden Markov model
21.2.1 Architecture of a hidden Markov model
21.2.2 Probability of an observed sequence
21.2.3 Using hidden Markov models
21.2.4 History
21.3 Probabilistic Hidden Markov Model (Discrete-Space Model)
21.4 Prediction of Three-dimensional Protein Structure
21.5 Let us Sum up
21.6 Lesson end activities
21.7 Check your progress
21.8 Points for Discussion
21.9 References

21.0 Aims and Objectives:
This unit discusses nearest-neighbor methods of secondary structure prediction; hidden Markov models, including their architecture, the probability of an observed sequence, their uses, and their history; the probabilistic hidden Markov model (discrete-space model); and the prediction of three-dimensional protein structure.

21.1 Nearest-neighbor Methods of Secondary Structure Prediction
Like neural networks, nearest-neighbor methods are also a type of machine learning method. They predict the secondary structural conformation of an amino acid in the query sequence by identifying sequences of known structures that are similar to the query sequence (Levin et al. 1986; Salzberg and Cost 1992; Zhang et al. 1992; Yi and Lander 1993; Salamov and Solovyev 1995, 1997; Frishman and Argos 1996). A large list of short sequence fragments is made by sliding a window of length n (e.g., n = 16) along a set of approximately 100–400 training sequences of known structure but minimal sequence similarity to each other, and the secondary structure of the central amino acid in each window is recorded. A window of the same size is then selected from the query sequence and compared to each of the above sequence fragments, and the 50 best-matching fragments are identified. The frequencies of the known secondary structure of the middle amino acid in each of these matching fragments (fα, fβ, and fcoils) are then used to predict the secondary structure of the middle amino acid in the query window. As with other secondary structure prediction programs, the predicted secondary structure of a series of residues in the query sequence is subjected to a set of rules, or used as input to a neural network, to make a final prediction for each amino acid position. Although not implemented in most available programs, a true estimate of the probability of the above set of frequencies may be obtained by identifying sets of training sequences that give the same value of (fα fβ fcoils)^(1/2).
The frequencies of the secondary structures predicted by this group then give true estimates of pα, pβ, and pcoils for the targeted amino acid in the query sequence (Yi and Lander 1993). Predictions based on the highest probabilities have been shown to be the most accurate, with the top 28% of the predictions being 86% accurate and the top 43% being 81% accurate. In addition, this method of calculating probability provides more information than single-state predictions. Using this method, therefore, a substantial proportion of protein secondary structures can be predicted with high accuracy (Yi and Lander 1993, 1996).

The several nearest-neighbor programs that have been developed for secondary structure prediction differ largely in the method used to identify related sequences in the training set. Originally, an amino acid scoring matrix such as a BLOSUM scoring matrix was used (Zhang et al. 1992). Distances between sequences based on a statistical analysis of the training sequences have also been proposed (Salzberg and Cost 1992). Use of a scoring matrix (Bowie et al. 1991, 1996) based on a categorization of amino acids into local structural environments, discussed below, in conjunction with a standard amino acid scoring matrix increased the success of the predictions (Yi and Lander 1993; Salamov and Solovyev 1995, 1997). Yet further increases in success have been achieved by aligning the query sequence with the training sequences to obtain a set of nonintersecting alignments with windows of the query sequence (as described earlier), and by using a multiple sequence alignment as input, with amino-terminal and carboxy-terminal positions of alpha helices and beta strands and β turns treated as distinctive types of secondary structure (Salamov and Solovyev 1997). The program PREDATOR is based on an analysis of amino acid patterns in structures that form H-bond interactions between adjacent beta strands (β bridges) and between amino acids n and n + 4 in alpha helices (Frishman and Argos 1995, 1996). The H-bond pattern between parallel and antiparallel beta strands is different, and two types of antiparallel patterns have been recognized. By utilizing such information combined with substitutions found in sequence alignments, the prediction success of PREDATOR has been increased to 75% (Frishman and Argos 1997). NNSSP (Salamov and Solovyev 1997) and PREDATOR (Frishman and Argos 1997) are examples of such nearest-neighbor programs.

21.2 Hidden Markov model
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. An HMM can be considered the simplest dynamic Bayesian network.

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
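To make the distinction between hidden states and observations concrete, the total probability of an observation sequence can be computed with the forward algorithm introduced in 21.2.2 below. The following self-contained Python sketch uses the two-state weather model from the concrete example later in this section; the function and parameter names are illustrative, and the code is a sketch rather than part of any particular HMM package.

def forward(obs, states, start_p, trans_p, emit_p):
    """P(obs): the probability of the observations summed over all
    hidden state paths, computed stepwise (the forward algorithm)
    instead of by brute-force enumeration of paths."""
    # Probability of each state after the first observation.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] *
                    sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

p = forward(obs=('walk', 'shop', 'clean'),
            states=('Rainy', 'Sunny'),
            start_p={'Rainy': 0.6, 'Sunny': 0.4},
            trans_p={'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                     'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}},
            emit_p={'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}})
print(p)  # total probability of observing walk, shop, clean over three days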
Hidden Markov models are especially known for their applications in temporal pattern recognition such as speech, handwriting, and gesture recognition, musical score following, partial discharges, and Bioinformatics.

21.2.1 Architecture of a hidden Markov model
In the usual architecture diagram of an instantiated HMM (often called a trellis diagram), each oval shape represents a random variable that can adopt a number of values. The random variable x(t) is the hidden state at time t. The random variable y(t) is the observation at time t. The arrows in the diagram denote conditional dependencies. The value of the hidden variable x(t) (at time t) depends only on the value of the hidden variable x(t − 1) (at time t − 1). This is called the Markov property. Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t).

21.2.2 Probability of an observed sequence
The probability of observing a sequence Y = y(1), ..., y(L) of length L is given by

P(Y) = Σ_X P(Y | X) P(X)

where the sum runs over all possible hidden state sequences X = x(1), ..., x(L). Brute-force calculation of P(Y) is intractable for most real-life problems, as the number of possible hidden state sequences is typically extremely high. The calculation can, however, be sped up enormously using an algorithm called the forward algorithm.

21.2.3 Using hidden Markov models
There are three canonical problems associated with HMMs:
· Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence. This problem is solved by the forward-backward algorithm.
· Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.
· Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities; in other words, discover the parameters of the HMM given a dataset of sequences. This problem is solved by the Baum-Welch algorithm.

A concrete example
Assume you have a friend who lives far away and to whom you talk daily over the telephone about what he did that day. Your friend is interested in only three activities: walking in the park, shopping, and cleaning his apartment. The choice of what to do is determined exclusively by the weather on a given day. You have no definite information about the weather where your friend lives, but you know general trends. Based on what he tells you he did each day, you try to guess what the weather must have been like.

You believe that the weather operates as a discrete Markov chain. There are two states, "Rainy" and "Sunny", but you cannot observe them directly; that is, they are hidden from you. On each day, there is a certain chance that your friend will perform one of the following activities, depending on the weather: "walk", "shop", or "clean". Since your friend tells you about his activities, those are the observations. The entire system is that of a hidden Markov model (HMM). You know the general weather trends in the area, and what your friend likes to do on average. In other words, the parameters of the HMM are known.
You can write them down in the Python programming language:

states = ('Rainy', 'Sunny')

observations = ('walk', 'shop', 'clean')

start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}

In this piece of code, start_probability represents your belief about which state the HMM is in when your friend first calls you (all you know is that it tends to be rainy on average). The particular probability distribution used here is not the equilibrium one, which (given the transition probabilities) is approximately {'Rainy': 0.571, 'Sunny': 0.429}. The transition_probability represents the change of the weather in the underlying Markov chain. In this example, there is only a 30% chance that tomorrow will be sunny if today is rainy. The emission_probability represents how likely your friend is to perform a certain activity on each day. If it is rainy, there is a 50% chance that he is cleaning his apartment; if it is sunny, there is a 60% chance that he is outside for a walk.
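Given these parameters, the evaluation and decoding problems listed in 21.2.3 can be solved directly. Below is a minimal, illustrative sketch of the forward and Viterbi algorithms for this weather model; the function names are ours, and the code assumes the four dictionaries defined above.

def forward(obs, states, start_p, trans_p, emit_p):
    # Total probability of the observations, summed over all state paths.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Probability and identity of the single most likely state path.
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((best[r][0] * trans_p[r][s] * emit_p[s][o], best[r][1] + [s])
                        for r in states), key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

observed = ('walk', 'shop', 'clean')
print(forward(observed, states, start_probability,
              transition_probability, emission_probability))
print(viterbi(observed, states, start_probability,
              transition_probability, emission_probability))

For this three-day observation sequence, the sketch reports a total probability of about 0.0336, and Sunny, Rainy, Rainy as the single most likely weather sequence (probability about 0.0134).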
21.2.4 History

Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. One of the first applications of HMMs was speech recognition, starting in the mid-1970s. In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. Since then, they have become ubiquitous in the field of Bioinformatics.

21.3 Probabilistic Hidden Markov Model (Discrete-Space Model)

HMMs have been used to model alignments of three-dimensional structure in proteins (Stultz et al. 1993; Hubbard and Park 1995; Di Francesco et al. 1997, 1999; FORREST Web server at http://absalpha.dcrt.nih.gov:8008/). In one example of this approach, the models are trained on patterns of alpha helices, beta strands, tight turns, and loops in specific structural classes (Stultz et al. 1993, 1997; White et al. 1994), and may then be used to provide the most probable secondary structure and structural class of a protein. Protein three-dimensional domains can be modeled in a similar manner.

21.4 Prediction of Three-dimensional Protein Structure

Because the number of ways that proteins can fold appears to be limited, there is considerable optimism that ways will be found to predict the fold of any protein, given just its amino acid sequence. Structural alignment studies have revealed that there are more than 500 common structural folds found in the domains of the more than 12,500 three-dimensional structures in the Brookhaven Protein Data Bank. These studies have also revealed that many different sequences will adopt the same fold. Thus, there are many combinations of amino acids that can fit together into the same three-dimensional conformation, filling the available space and making suitable contacts with neighboring amino acids to adopt a common three-dimensional structure. There is also a reasonable probability that a new sequence will possess an already identified fold.

The object of fold recognition is to discover which fold is best matched. Considerable headway toward this goal has been made. Sequence alignment can be used to identify a family of homologous proteins that share sequence similarity, and presumably a similar three-dimensional structure. As discussed above, there are many databases that link sequence families to the known three-dimensional structure of a family member. The structure of even a remote family or superfamily member can be predicted through such sequence alignment methods.

When the sequence of a protein of unknown structure has no detectable similarity to other proteins, other methods of three-dimensional structure prediction may be employed. One such method is sequence threading. In threading, the amino acid sequence of a query protein is examined for compatibility with the structural core of a known protein structure. Recall that the protein core is made up of alpha helices, beta strands, and other structural elements folded into a compact structure. The environment of the core is strongly hydrophobic, with little room for water molecules, extra amino acids, or amino acid side chains that cannot fit into the available space. Side chains must also make contact with neighboring amino acid side chains in the structure, and these contacts are needed for folding and stability. Threading methods examine the sequence of a protein for compatibility of the side groups with a known protein core. The sequence is "threaded" into a database of protein cores to look for matches. If a reasonable degree of compatibility is found with a given structural core, the protein is predicted to fold into a similar three-dimensional configuration. Threading methods are undergoing a considerable degree of evolution at the present time. An excellent description of algorithms for threading is found in Lathrop et al. (1998). Presently available methods require considerable expertise with protein structure and with programming; however, there are some sites where the analysis may be performed on a Web server.

Check your progress:
1. Mention the three common probabilities.
Notes: a) Write your answer in the space given below.
b) Check your answer with the one given at the end of this lesson.
……………………………………………………………………………………………………
……………………………………………………………………………………………………
……………………………………………………………………………………………………

HMM

It turns out that sequence profiles are a special case of a more general mathematical approach called hidden Markov models (HMMs). These methods were originally used in speech recognition before they were applied to biological sequence analysis. A well-defined formalism exists, which helps with the theoretical understanding of what can be expected when applying HMMs to sequence analysis. This is an important advantage of using HMMs instead of sequence profiles: the underlying theoretical basis is much more solid. Also, Bayesian statistics is used in several aspects of the method.

A Markov process is a physical process of a special, but common, kind. The basic idea is that we have a physical system that goes stepwise through some kind of change. For example, it may be a die that we throw time and again; the change is the transition from one value to the next. An essential characteristic of a Markov process is that the change depends only on the current state. The history of the system does not matter: the states that the system has been in before are not relevant, since only the current state determines what will happen next. The system has no memory.
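As a minimal illustration of this memoryless behavior, the following sketch simulates a two-state Markov chain. The state names and probabilities are invented for illustration; note that the next state is drawn using only the current state, never the earlier history.

import random

random.seed(7)  # illustrative seed

# A hypothetical two-state Markov chain; the probabilities are invented.
transitions = {'GC-rich': {'GC-rich': 0.8, 'AT-rich': 0.2},
               'AT-rich': {'GC-rich': 0.3, 'AT-rich': 0.7}}

state = 'GC-rich'
for step in range(8):
    print(step, state)
    # The distribution for the next state depends only on the current
    # state; nothing about earlier steps is consulted.
    nxt = transitions[state]
    state = 'GC-rich' if random.random() < nxt['GC-rich'] else 'AT-rich'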
One may view a protein (or DNA) sequence as the record of such a process. There is some hidden process that generates a sequence of amino acid residues, in which chance (based on specific probabilities) plays an essential role in determining the exact sequence being produced. This is one (very crude) way of describing an HMM.

This approach can be applied to sequence motif searches. Given a multiple sequence alignment of a particular domain family, one uses statistical methods to build a specific HMM for that domain family. The probabilities that are required are estimated from the frequencies in the alignment, together with other data. This HMM can then be used to test whether other sequences match the domain family or not.

HMMs can be set up so that insertions, deletions, and substitutions are handled in sensible ways and their probabilities estimated properly. The plan (or topology) of an HMM determines which probabilities need to be estimated and what kinds of matches are allowed. For instance, it is perfectly possible to design an HMM plan that strictly forbids insertions and deletions. This means that it is very important for the HMM designer (i.e., the software programmer, usually not the user) to decide which type of topology should be implemented. This determines which kinds of sequence profiles can be matched by the HMM.

In an HMM plan designed for matching a sequence, each state corresponds to a residue in the sequence. The transitions between the states are assigned probabilities that are determined from the multiple sequence alignment used as the training set. In order to test whether a new sequence contains a segment that matches the HMM profile, an algorithm that works essentially like a dynamic programming algorithm is used to find the best match between the HMM profile and the sequence. The best match is the one that maximizes the transition probabilities given those particular residues.

Here is an example of what an HMM plan may look like. This is the plan used in the popular HMMER software, and the image was taken from its documentation.

Fig 21.1: A typical HMM model

The abbreviations for the states are as follows:
· [Mx] Match state x. Has K emission probabilities.
· [Dx] Delete state x. Non-emitter.
· [Ix] Insert state x. Has K emission probabilities.
· [S] Start state. Non-emitter.
· [N] N-terminal unaligned sequence state. Emits on transition with K emission probabilities.
· [B] Begin state (for entering the main model). Non-emitter.
· [E] End state (for exiting the main model). Non-emitter.
· [C] C-terminal unaligned sequence state. Emits on transition with K emission probabilities.
· [J] Joining segment unaligned sequence state. Emits on transition with K emission probabilities.

Compared with the HMM plan shown in the course book, this one is slightly more complicated. The reason is that the creator of HMMER (Sean Eddy) wanted a method that could locate a domain in a sequence even when the true domain is flanked by possibly very large regions of nonmatching sequence. Therefore the states N and C were added, which are used to match such completely irrelevant parts of a sequence.
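To make the estimation step mentioned above concrete, the sketch below turns the residue frequencies in the columns of a toy ungapped alignment block into match-state emission probabilities, using a simple pseudocount so that no residue gets probability zero. The alignment, the pseudocount, and the function name are illustrative assumptions; real tools such as HMMER use more sophisticated priors and also estimate insert- and delete-state probabilities from the gap patterns in the alignment.

from collections import Counter

def match_emissions(alignment, alphabet, pseudocount=1.0):
    # One emission distribution per alignment column (one per match state),
    # estimated from observed residue counts plus a uniform pseudocount.
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Toy four-sequence ungapped alignment block over a reduced alphabet.
block = ['ACDE',
         'ACDE',
         'SCDE',
         'ACEE']
for i, dist in enumerate(match_emissions(block, alphabet='ACDES'), start=1):
    print('match state', i, dist)

The same frequency-counting principle, with better priors, underlies how profile HMMs are trained from curated alignments in practice.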
21.5 Let us sum up

Nearest-neighbor methods are also a type of machine learning method. They predict the secondary structural conformation of an amino acid in the query sequence by identifying sequences of known structures that are similar to the query sequence. This unit discussed nearest-neighbor methods of secondary structure prediction, the hidden Markov model, the architecture of a hidden Markov model, the probability of an observed sequence, the use and history of hidden Markov models, the probabilistic hidden Markov model (discrete-space model), and the prediction of three-dimensional protein structure.

21.6 Lesson end activities

(i) Find out how the nearest-neighbor method differs from the other methods discussed.
(ii) Which do you consider the best method among all those discussed above? Why?

21.7 Check your progress: Model answers

1. Your answer must include these points:
(i) Start probability
(ii) Transition probability
(iii) Emission probability

21.9 Points for Discussion

1. Comment on the applications of HMM.

21.10 References

1. Tutorial from the University of Leeds.
2. J. Li, A. Najmi, and R. M. Gray, "Image classification by a two-dimensional hidden Markov model," IEEE Transactions on Signal Processing, 48(2):517-533, February 2000.
3. Y. Ephraim and N. Merhav, "Hidden Markov processes," IEEE Transactions on Information Theory, 48:1518-1569, June 2002.
4. B. Pardo and W. Birmingham, "Modeling Form for On-line Following of Musical Performances," AAAI-05 Proceedings, July 2005.
5. http://citeseer.ist.psu.edu/starner95visual.html
6. L. Satish and B. I. Gururaj, "Use of hidden Markov models for partial discharge pattern classification," IEEE Transactions on Dielectrics and Electrical Insulation, April 1993.