Current Protocols in H - Medical Life Sciences

UNIT 7.16
PolyPhred Analysis Software for Mutation Detection from Fluorescence-Based Sequence Data
Kate T. Montgomery,1 Oleg Iartchouck,1 Li Li,2 Stephanie Loomis,1 Vanessa Obourn,1 and Raju Kucherlapati1
1 Harvard Medical School - Partners Healthcare Center for Genetics and Genomics, Boston, Massachusetts
2 Albany Medical College, Albany, New York
ABSTRACT
The ability to search for genetic variants that may be related to human disease is one
of the most exciting consequences of the availability of the sequence of the human
genome. Large cohorts of individuals exhibiting certain phenotypes can be studied and
candidate genes resequenced. However, the challenge of analyzing sequence data from
many individuals with accuracy, speed, and economy is great. This unit describes one
set of software tools: Phred, Phrap, PolyPhred, and Consed. Coverage includes the
advantages and disadvantages of these analysis tools, details for obtaining and using the
software, and the results one may expect. The software is being continually updated to
permit further automation of mutation analysis. Currently, however, at least some manual
review is required if one wishes to identify 100% of the variants in a sample set. Curr. Protoc. Hum. Genet. 59:7.16.1-7.16.21. © 2008 by John Wiley & Sons, Inc.
Keywords: DNA sequencing • mutation identification • SNPs • indels • sequence traces • Consed • Phred • Phrap • PolyPhred
INTRODUCTION
The purpose of this unit is to describe, in detail, the use of one software application,
PolyPhred (Nickerson et al., 1997; Stephens et al., 2006), to identify mutations or variations among individuals as exhibited in fluorescence-based DNA sequence data. UNIT 7.9
describes in detail the protocols for obtaining DNA sequences representing particular
exons or regions of the genome via PCR amplification and sequencing of the desired
regions. The success of a mutation detection project of any size is critically dependent
upon the ability of the investigators to analyze the data obtained in an accurate and
timely fashion. Even a modest project (e.g., resequencing one gene with 20 exons in 10
individuals, with each amplicon read in both directions) will require a minimum of 400 sequence traces (20 exons × 10 individuals × 2 directions) that have to be examined
and compared to the normal sequence. Chromatograms generated for a large project
may number in the thousands. It is impractical to review this quantity of data manually,
examining and comparing individual chromatograms. Such an approach would be both
time-consuming and subject to human error. Similarly, aligning and comparing the simple text files of the sequences is inappropriate, because one must know the quality of the
sequence data before relying upon the called bases.
In response to the enormous increase in the use of sequencing to identify significant
pathogenic mutations in both the research and clinical environments, several academic
groups and companies have developed software applications to facilitate data review.
The challenge is to create a program into which one can simply import sequence data,
and the software will identify, list, and export all variants from a standard reference
sequence. The program must be able to identify the dual peaks of heterozygotes and
Current Protocols in Human Genetics 7.16.1-7.16.21, October 2008
Published online October 2008 in Wiley Interscience (www.interscience.wiley.com).
DOI: 10.1002/0471142905.hg0716s59
© 2008 John Wiley & Sons, Inc.
Searching Candidate Genes for Sequence Variation: Mutations and Polymorphisms, Supplement 59
differentiate them from background noise, and also recognize homozygous variants as
they differ from a reference sequence. It must also be able to display and enable a
reviewer to characterize more complicated variants, including insertions and deletions
(indels). Happily, significant progress has been made in the development of automated
sequence analysis software in recent years. This unit describes, in considerable detail,
one of the available options, a suite of programs including Phred (Ewing et al., 1998;
Ewing and Green, 1998), Phrap, PolyPhred (Nickerson et al., 1997; Stephens et al., 2006),
and Consed (Gordon et al., 1998; Gordon, 2004). The unit describes how to obtain the
programs and how to use them for resequencing projects on a UNIX server.
STRATEGIC PLANNING
Selection of Software
The selection of software to assist in sequence analysis is one of the critical decisions
to be made during the strategic planning of a project whose goal is mutation detection
via fluorescence-based sequencing. The factors to consider are (1) performance of the
program, (2) cost and availability of the program, (3) difficulty of program setup and use,
and (4) suitability to the size and complexity of the expected work. Brief descriptions of
a few of the available programs are found in the Commentary section, and the use of one
program (Mutation Surveyor) is described in UNIT 10.8. Further evaluations and comments
are available online. Several programs fulfill parts of the challenge of automated analysis
as outlined above quite effectively, but none are yet able to fully analyze chromatograms
with complete accuracy and report findings without some level of manual intervention.
However, the need for manual review is greatly reduced by the applications described,
and by others available at the present time.
The purpose of this unit is to describe the use of Phred, Phrap, PolyPhred, and Consed
at Harvard Medical School – Partners Center for Genetics and Genomics (HPCGG;
http://www.hpcgg.org/) for the analysis of all fluorescence-based sequencing projects
in our research laboratory. This suite of programs was originally developed during
the Human Genome Project by Phil Green, Brent Ewing, David Gordon, LaDeanna
Hillier, Deborah Nickerson, and others. Total automation of very large survey projects
in the research environment using the newest version of PolyPhred (v.5.04) has been
demonstrated to be highly efficient (Stephens et al., 2006), if some low level of error is
acceptable. For clinical applications, where no level of error is acceptable, at least one
round of manual review is still necessary.
The performance of Phred, Phrap, PolyPhred, and Consed is exceptionally reliable. The
programs run quickly, require no user interactions while running, offer many options
for customization, and can handle virtually unlimited quantities of data. The graphical
user interface, Consed, provides many ways to review the data and export information.
In addition, the background information extracted by the programs from the data files or
chromatograms is all available in text files, and information can be obtained by queries
addressed to these files by external programs in the UNIX environment.
The documentation and learning tools are very good, though they may seem complex and
time-consuming in the beginning. The programs are available without cost to academic
laboratories, from the groups that developed them. The primary disadvantages to selecting
this software option are that (1) the programs require a UNIX or Linux server, Mac
OS X.X, or one of the other platforms described in the documentation, and (2) the
installation and original setup require more sophisticated computer knowledge than most
other candidate programs.
Notwithstanding the technical challenges of setting up and becoming familiar with Phred,
Phrap, PolyPhred, and Consed, they offer many advantages, especially for large projects.
It takes only a few minutes to set up a project, dump data into the correct folder, run the
program, and open the assembly in Consed. One can then use a navigation tool to jump
from one variant to the next, or quickly scan the sequence data representing one amplicon
of up to 100 individuals (or more), and determine whether there are any variants. If many
genes with many amplicons are to be reviewed, all data still go into one folder, where
the user can move through each of the amplicons one by one.
Overview of PolyPhred
PolyPhred is designed to compare fluorescence-based sequence chromatograms (UNIT 7.9)
from different individuals and to identify differences among sequences (Nickerson et al.,
1997; Stephens et al., 2006). The algorithm upon which it is based has recently been
updated (v.5.04) to facilitate the automation of single-nucleotide polymorphism (SNP)
identification, and under certain ideal conditions it is highly accurate. The ideal conditions include clean sequence with reads available in both directions from four or more
individuals. The program determines a consensus (or takes a reference sequence as the
consensus), applies a tag at positions where polymorphisms appear, and applies a quality score to each tag. These tags can be exported, easily reviewed, and modified. The
genotype of each individual at each polymorphic site is recorded in a text file, the
polyphred.out file.
The discussion below provides details for using PolyPhred version 5.04 in a UNIX
environment to review data quickly and accurately. In our experience, v.5.04 identifies
almost all of the SNPs found by manual review, and exhibits a very low level of false
positives. When v.5.04 was tested by the Nickerson group, 93% of all SNPs were found
in 47 to 90 patients, and 100% of the high-frequency SNPs were found in the same group.
There were no false positives, and the overall accuracy was 99.9% (Stephens et al., 2006).
Once familiar with the application, an advanced user might wish to modify certain
parameters to fit the particular scope and demands of a project, and to take
advantage of the automation to save time. The Nickerson group continues to develop
this program, with goals of making it even more reliable in the automated format and
expanding it to include the automated identification of indels, a function now available
in a Beta version, PolyPhred v.6.11 (Bhangale et al., 2006).
PolyPhred is not a stand-alone program, but is integrated with the sequence analysis
programs mentioned above, originally developed by Phil Green’s group for the Human
Genome Project. These programs are: Phred (Ewing and Green, 1998; Ewing et al., 1998),
Phrap, Cross-match, Swat, and Consed (Gordon et al., 1998; Gordon, 2004). A full introduction and description of Phred, Phrap, and Consed is available at
http://www.phrap.org/phredphrap/general.html, while a description of PolyPhred v.5.04 is available at
http://droog.mbt.washington.edu/PolyPhred.html. Special instructions for other platforms are in the documentation, and a complete Tour of Consed, the graphical user
interface, is provided.
Each of the elements of the integrated set of programs has a particular function. Phred
provides base calls, peak information, and quality scores for sequence data from a
fluorescence-based sequencing machine. The quality scores are a representation of the
likelihood of error for any given base. A Phred score of 20 means that the base will
be incorrect 1 of 100 times (99% accuracy), while a Phred score of 40 means that the
base will be incorrect 1 of 10,000 times (99.99% accuracy). Phred can also be used
independently for base calling and to generate quality scores for sequence data. In this
capacity only, it can be run in the Microsoft Windows environment.
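The relationship between a Phred score and the chance of a miscalled base can be checked with a few lines of Python. This is only an illustrative sketch based on the standard Phred definition, Q = -10 log10(P); the function names are ours and are not part of the Phred software.

```python
import math


def phred_to_error_probability(q):
    """Return the probability that a base with Phred score q is miscalled."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse mapping: error probability p (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)


# Reproduce the two examples given in the text (scores of 20 and 40).
for q in (20, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Each increase of 10 in the score corresponds to a tenfold drop in the expected error rate, which is why scores of 20 and 40 differ by a factor of 100.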
Phrap, Cross-match, and Swat are used to screen out vector sequences and to provide
sequence alignments of similar sequences. PolyPhred uses the information from Phred
and Phrap to identify and tag putative heterozygous and homozygous variants from the
consensus.
Consed, a graphical user interface, is then used to view and edit the resulting assembly.
Consed is described at http://www.genome.washington.edu/consed/consed.html (Gordon
et al., 1998; Gordon, 2004). It is strongly recommended that new users take the “Quick
Tour” provided with Consed, which will introduce all of its features. At a minimum,
the documentation should be carefully reviewed to identify the tools that will be useful
for the intended project. Much of Consed’s complexity is related to shotgun sequencing,
whole-genome assemblies, and autofinish. The tools for resequencing and mutation
identification projects are less difficult to master. Once the project is set up and Consed
is running, there is a help guide that can be accessed from the program.
One of the major benefits of using PolyPhred for the detection of sequence variation,
especially when many samples are to be examined, is the ease with which data files
can be managed in the UNIX environment, either manually or using perl scripts. Once
chromatograms are deposited in the appropriate directory, the full suite of programs can
be triggered to run on the server completely in the background. The user merely awaits
completion of the assembly so the results can be reviewed. It is possible (though not
necessary) to view the progress of the program while it is running; when it is complete,
it will tell the user that the data can now be viewed in Consed.
Another benefit is that the text output of the programs is readily available, and can
be queried by external scripts to extract any information not available directly through
Consed. Thus, it is possible to customize the information derived from these programs
when projects demand it. For example, when new data are entered into a project at
HPCGG, we extract the Phred quality scores for each base in the region of interest, for
each read. These scores are used to determine if a read “passes” or “fails” our criteria
for quality, and this pass/fail information is incorporated into several e-mails sent to
the members of the sequencing group. The e-mails provide a summary report of the
success rate of the entire run, as well as whether each read passes. They provide real-time
assessment of the quality of the output from any computer with e-mail access, so that
problems can be identified in a timely fashion.
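As a rough illustration of this kind of quality check, the sketch below reads per-base quality scores from the text of a phd file and applies a pass/fail rule. It is not the HPCGG script: the thresholds (at least 90% of bases in the region with a score of 20 or more) and the function names are hypothetical, and the code assumes the usual phd layout in which each line between BEGIN_DNA and END_DNA holds a base, its quality score, and its peak position.

```python
def read_phd_qualities(lines):
    """Collect the per-base quality scores between BEGIN_DNA and END_DNA."""
    qualities = []
    in_dna = False
    for line in lines:
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            break
        elif in_dna and line:
            fields = line.split()  # base, quality score, peak position
            qualities.append(int(fields[1]))
    return qualities


def read_passes(qualities, start, end, min_quality=20, min_fraction=0.9):
    """Hypothetical criterion: enough bases in [start, end) reach min_quality."""
    window = qualities[start:end]
    if not window:
        return False
    good = sum(1 for q in window if q >= min_quality)
    return good / len(window) >= min_fraction


# A tiny made-up phd fragment, used only to exercise the functions.
EXAMPLE_PHD = """\
BEGIN_SEQUENCE example_read
BEGIN_COMMENT
END_COMMENT
BEGIN_DNA
a 8 5
c 35 17
g 40 29
t 42 41
END_DNA
END_SEQUENCE
"""

qualities = read_phd_qualities(EXAMPLE_PHD.splitlines())
print(qualities)
print("pass" if read_passes(qualities, 1, 4) else "fail")
```

In practice the region of interest would be defined per amplicon, and the resulting pass/fail values could be collected into whatever summary report a laboratory prefers.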
Finally, data from 100 or more individuals can be reviewed by scrolling through a single
computer screen. This allows the user to recognize the presence of a variant very quickly.
Alternatively, Consed will list the variants in the single contig or in the entire assembly
if multiple exons or amplicons have been sequenced. This list can be used to navigate
through the assembly to view all variant positions. With the highly accurate SNP detection
algorithm now in place, nearly full automation is possible.
At HPCGG, an automatic data processing pipeline is initiated when data collection is
completed on the ABI 3730xl. A series of perl scripts is triggered to direct the necessary
operations, such as transfer of chromatograms to the UNIX system, name trimming
(.ab1 files have long unwieldy names), and subsequent sorting and transfer to the
correct project folder where analysis by Phred, Phrap and PolyPhred is initiated. E-mail
reports of the Phred quality scores are sent to the sequencing group, and their arrival
indicates that analysis is complete and results can be reviewed using Consed. This kind of
pipeline is ideal for high-volume sequencing projects and when bioinformatics support is
available. However, since every laboratory has a slightly different computer environment
and not everyone has adequate support for automating these processes, the explanation
below assumes manual project setup, file transfer, and program initiation.
Obtaining and Installing the Programs
PolyPhred is available for a number of different operating platforms, such as UNIX,
Linux, MacOSX, and others listed in the documentation. It is not available for Microsoft
Windows, and therefore, it is important to have IT support for one of these other platforms.
However, the system is very stable once it is downloaded and installed in a UNIX or
Linux environment, so the need for support is not extensive.
Phred and Phrap and their associated components can be obtained via e-mail by
following instructions at http://www.phrap.org/phredphrapconsed.html. An academic
user agreement from David Gordon ([email protected]) must be signed
in order to receive permission to download Consed from
http://bozeman.genome.washington.edu/consed/consed.html. PolyPhred is obtained by following instructions
at http://droog.mbt.washington.edu/PolyPhred.html. The programs are free to academic
laboratories, while commercial users must obtain a license ([email protected])
prior to obtaining them. Usually a System Administrator will download and perform the
actual installation, following the online instructions. The various elements will be placed
in particular directories in the UNIX environment as the instructions specify.
PC users must also install an X Window server such as X-Win32 (http://www.starnet.com/products/xwin32) or similar software to access the UNIX/Linux environment (see Basic Protocol, Materials).
Alternatively, the user can access the programs directly through a UNIX terminal.
A perl script called phredPhrap, which runs phred-swat-crossmatch-phrap in the
proper sequence, is provided with the program suite. This script must be run from the
edit_dir, and it must be able to find the other directories as described above. Certain modifications to this script are required to direct it to include PolyPhred, and other changes
may be implemented to customize certain details for Consed. These modifications are
described in the program support files. Most significantly, one line of the phredPhrap
script must be changed from $bUsingPolyPhred = 0 to $bUsingPolyPhred = 1. The
modified perl script can be saved as phredPhrap.poly. In order to analyze data, the
user must go to the edit_dir of the project and type the command phredPhrap.poly in the command window. All chromatogram data that have been placed in the chromat_dir will be analyzed.
Phd files (text files that contain information extracted from the chromatograms) will be
created for each read and will subsequently appear in the phd_dir. If chromatograms have
been added since the last phredPhrap.poly command, they will not have corresponding
phd files or poly files; the program will recognize this, analyze the new data, create phd
files and poly files for the new data, and update the assembly. If no new data have been
added, there is no need to run phredPhrap.poly, and the command consed will display
the existing assembly.
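The one-line change to the phredPhrap script can be made in any text editor. For laboratories that prefer to script it, the following is a minimal sketch in Python; the helper names are ours, and the exact spelling of the line in a given phredPhrap release (for example, whether it is declared with "my") is an assumption, so the code refuses to write anything unless it finds exactly one matching line.

```python
import re
import shutil
from pathlib import Path

# Matches a line such as "$bUsingPolyPhred = 0;" (optionally declared with "my").
FLAG = re.compile(
    r"^([ \t]*(?:my[ \t]+)?\$bUsingPolyPhred[ \t]*=[ \t]*)0\b", re.MULTILINE
)


def enable_polyphred(script_text):
    """Return the script text with $bUsingPolyPhred switched from 0 to 1."""
    patched, count = FLAG.subn(lambda m: m.group(1) + "1", script_text)
    if count != 1:
        raise ValueError(
            f"expected exactly one '$bUsingPolyPhred = 0' line, found {count}"
        )
    return patched


def write_poly_copy(source="phredPhrap", target="phredPhrap.poly"):
    """Save a PolyPhred-enabled copy of the script, leaving the original intact."""
    text = Path(source).read_text()
    Path(target).write_text(enable_polyphred(text))
    shutil.copymode(source, target)  # keep the executable permission bits


# Example call (not run here): write_poly_copy("phredPhrap", "phredPhrap.poly")
```

Writing to a new file rather than editing in place keeps the unmodified phredPhrap script available for projects that do not use PolyPhred. Any further customizations described in the program support files still have to be made by hand.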
When the programs are installed, a sample set of data is available for a “Quick Tour” of
Consed. This tour provides a very detailed exploration of the capabilities of the program,
and it is recommended that users view it. However, it includes a lot of information
that will not be needed for simple resequencing projects, so a basic guide for mutation
identification is provided here.
USING PolyPhred FOR SEQUENCE ANALYSIS AND MUTATION
DETECTION
BASIC
PROTOCOL
When sequencing is performed in your own laboratory, transfer data files from
the data collection computer to a Windows desktop computer when the run is completed
for temporary storage and review. The files may be ABI chromatogram data (.ab1)
files or the equivalent from other sequence analyzers. Working with data directly on
the collection machines is not recommended, and data should be removed from them
frequently for storage elsewhere. The array pictures and raw data files should be reviewed
briefly to check for any obvious problems. The length of read and quality of the array
and sequencing controls should be checked to be sure they meet standards.
When sequencing is performed by an outside core facility, the facility will provide
chromatograms so you can import them into whatever sequence analysis program you
have selected. They may also provide text files of the sequence, but be aware that those
files are not reliable for mutation detection! The user must review the actual data in order
to determine its quality and whether there are artifacts or miscalled bases.
PolyPhred runs under UNIX. A basic understanding of UNIX commands is required so that you
can navigate through directories, make directories, and move, remove, and copy files. A
short list of commands is below. A novice should spend a little time learning navigation
in the UNIX environment.
ls = list contents of current directory
pwd = print the present working directory
cd = change directory (must give pathway or subdirectory of pwd, or it will take you
back to your home directory)
mkdir <dir name> = make a new directory in this directory
cp = copy (cp <file> <full pathway to desired location>)
mv = move (mv <file> <full pathway to desired location>)
rm = remove (rm <file>)
Materials
High-speed access to the Internet for Web-based tools and information, especially
access to genome databases
A UNIX server for running the mutation identification software, or other type as
described below. The programs can be run on Mac OSX.X or LINUX, but they
do not run under Microsoft Windows.
A directory on the UNIX computer where data analysis will be performed, referred
to here as ANALYSIS DIRECTORY. The user will have full privileges to read,
write, and execute here, and the sequence analysis programs will be available.
Within this directory, all projects will have their own folders. A Project may
consist of one gene with multiple exons, or many genes. The program will align
like sequences into “contigs.”
A Windows PC terminal with an X-terminal emulator installed and running, to
interface with the UNIX System. One option is X-Win32, which may be
downloaded from http://www.starnet.com/products/xwin32/. There is a free trial
version, and various types of licenses may be purchased at reasonable cost.
Other options are Exceed, Reflection X, and OpenNT.X.
A three-button mouse or a mouse capable of emulating a three-button mouse. A
two-button mouse with a scroll button is easily adapted to Consed.
Current versions of Phred, Phrap (and its associated programs), Consed, and
PolyPhred, installed in the user’s pathway of the UNIX machine. A sample set
of data is provided by the software developers, and if this is downloaded, there is
a tutorial that is very useful in developing a complete understanding of how to
use the programs.
Experimental sequence data equivalent to .ab1 files or scf files generated by
ABI 3730xl or other fluorescence-based automated sequencer (see also
Beckman Instruments, LI-COR Life Sciences, or Amersham Biosciences
analyzers). Such data are generally known as “chromatograms.”
NOTE: In the following directions, all commands to be typed in the UNIX command
window are in italics, while directories are bold. All UNIX commands are case sensitive,
and spaces are not allowed within file or directory names. Underscore is frequently used
in place of a space.
Set up the project
A basic file structure is required for data analysis, and this is created by the user before
depositing the data or running the programs. First, a project directory is created in a user
directory that has access to the programs. Four directories are then required to be created
within the project directory. The project directory name can be anything, but should be
unique and specific to the particular data to be analyzed. The four directories within
PROJECT_A must be named edit_dir, phd_dir, chromat_dir, and poly_dir.
1. Log in to the UNIX server and go to (cd) the ANALYSIS DIRECTORY.
2. Set up a project folder or directory called PROJECT_A, as follows:
a. mkdir PROJECT_A (Type command mkdir, followed by a space and the name
of the directory. A space always follows a command in UNIX).
b. cd PROJECT_A and make four directories required for all Phred-Phrap-PolyPhred-Consed analysis, using the mkdir command: mkdir chromat_dir
phd_dir edit_dir poly_dir (space between directory names).
The same thing would be accomplished by running mkdir four times, once for each new
directory name.
c. Transfer your dataset (.ab1 or equivalent chromatogram files) into the
chromat_dir. Use File Transfer Protocol (ftp) from the collection computer or
other Windows-based data storage location. Or, cp or mv files to the desired location
from a UNIX environment.
Warning! If you use the mv command incorrectly, forgetting to give the full path of the
destination, the files may be renamed or moved to an unintended location and can be hard to find again. cp is safer.
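For laboratories that set up many projects, steps 1 and 2 can also be scripted. The sketch below is a minimal example in Python, assuming the four directory names given above; the function names and the list of accepted file extensions are illustrative choices, not part of the Phred/Phrap/Consed distribution. Files are copied rather than moved, in keeping with the warning above.

```python
import shutil
from pathlib import Path

# The four directories required within every project directory.
SUBDIRS = ("chromat_dir", "phd_dir", "edit_dir", "poly_dir")


def create_project(analysis_dir, project_name):
    """Create the project directory and its four required subdirectories."""
    if any(ch.isspace() for ch in project_name):
        raise ValueError("spaces are not allowed in directory names")
    project = Path(analysis_dir) / project_name
    for sub in SUBDIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project


def copy_chromatograms(source_dir, project, suffixes=(".ab1", ".scf")):
    """Copy chromatogram files into chromat_dir; the originals stay in place."""
    destination = Path(project) / "chromat_dir"
    copied = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            shutil.copy2(path, destination / path.name)
            copied.append(path.name)
    return copied


# Example calls (not run here; the paths are placeholders):
# project = create_project("/path/to/analysis_directory", "PROJECT_A")
# copy_chromatograms("/path/to/run_folder", project)
```

Because copy_chromatograms returns the list of file names it copied, the result can be compared against the sample sheet to confirm that every expected read arrived.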
3. Set up the project-specific control chromatogram files. Name these files as
gene_exonX.ref or gene_exonX.pseudo so you will recognize them.
a. You may use high-quality chromatograms representing the normal sequence of
the PCR product of each amplicon to be sequenced, forward and reverse.
b. Alternatively, you can follow the directions in the Consed documentation for creating a pseudo-chromatogram representing the template sequence used to design
your primers.
The command is “sudophred,” and it is run from the edit directory. The input is a text file
of the template sequences, in fasta format with names following the > symbol on the line
above the sequence. The files created will go automatically to the appropriate directories
(phd_dir and chromat_dir). Example: sudophred genename.fasta.
c. If desired, use external programs to create user-defined tags in the control files
when setting up a project.
At HPCGG, we add coding sequence tags, PCR primer tags, and SNP tags from dbSNP
(http://www.ncbi.nlm.nih.gov/SNP/). The Consed documentation shows how to do this in
ADDING TAGS FROM OTHER PROGRAMS. Tags are added to the phd files based upon
their numerical position in the template sequence. This will require some perl script facility.
Tags can also be added by the user, while working in Consed, to individual reads, reference
or pseudo sequences, or the consensus. The application of these tags is described below
(Working with Consed, 7). This requires no programming expertise but is somewhat time-consuming.
d. Your project is set up. The data files you have added to the chromat_dir are
ready to be analyzed. See UNIT 7.9 for protocols to generate data files, including
recommended naming conventions for chromatograms:
PROJECT gene-exon#.sample.direction
Remember that a new project must have four directories, but all may be empty except for
the chromat_dir.
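For step 3b, the template sequences must be supplied as a FASTA-format text file. The following Python sketch shows one way to produce such text from sequence strings; the function name and the example record name are hypothetical, and any naming requirements beyond standard FASTA should be checked against the program documentation.

```python
def format_fasta(records, width=60):
    """Build FASTA text from (name, sequence) pairs.

    Each record starts with a '>' line carrying the name, followed by the
    sequence wrapped to `width` characters per line.
    """
    lines = []
    for name, sequence in records:
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid sequence name: {name!r}")
        sequence = "".join(sequence.split()).upper()  # drop spaces and line breaks
        lines.append(">" + name)
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical record name and sequence, for illustration only.
print(format_fasta([("GENE_exon1.ref", "acgtacgt acgtacgt")], width=8), end="")
```

The returned text can be written to a file such as genename.fasta in the edit directory before the pseudo-chromatogram is created.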
Working with PolyPhred and Consed: data analysis and review
The full documentation for Consed and the “Quick Tour” of Consed are available at
http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt. Once you
have installed the programs on your system, the Consed documentation is available
from the Help menu.
4. Go to (cd) the edit_dir of the project you wish to analyze. Type the command
phredPhrap.poly into the UNIX command window.
This perl script is included with the program package, slightly modified as described above
for PolyPhred. It directs the sequential running of the component programs. It must always
be typed from within the edit_dir of the project to be analyzed.
5. Wait while the full suite of programs, including phred, phrap, crossmatch, swat, and
polyphred, runs. The programs will generate a number of text files in the various directories
created (edit_dir, phd_dir, and poly_dir). These files contain information extracted
from your chromatograms. The steps are chronicled on the screen. When complete,
a message in the command window will tell you that you may now run Consed on a
particular .ace file.
a. The phredPhrap.poly command results in full analysis of all the data files you have
put in the chromat_dir. Each of the components of the suite of programs performs
its function as described above: base calling and base quality score assignment,
sequence comparison and alignment of similar sequences, assembly (yielding
multiple contigs), contig consensus determination, and polymorphism tagging and
scoring.
b. The programs all run independently in the background. There is no interaction
with the user. At HPCGG, our automated pipeline is triggered as soon as the
analyzing machines complete their runs. The data are copied to appropriate UNIX
directories (according to project name) and processed. Quality control scripts are
run and the results are distributed by email to the sequencing group.
c. Since the programs are run on a UNIX server, data for many projects can be
processed at the same time, each from a different edit_dir within a different
Project folder.
d. The output is simple text files, which will appear in the edit_dir, phd_dir, and
poly_dir. Consed will use these files to provide a graphical interface to the
user to review the data. Some of the files are very long, but they can be used
by external scripts to obtain information about the results. For example, the
polyphred.out file in the edit_dir contains the specific positions where polymorphisms are seen in each contig, and the genotypes of all individuals at each
position. The same information is displayed graphically in Consed.
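Steps 4 and 5 can be launched from a script when several projects are processed. The sketch below is one possible wrapper in Python, not the HPCGG pipeline: the function name is ours, and it assumes that phredPhrap.poly is executable and on the user's path. Its only job is to guarantee that the command is started from within the project's edit_dir.

```python
import subprocess
from pathlib import Path


def run_in_edit_dir(project_dir, command=("phredPhrap.poly",)):
    """Run a command with the project's edit_dir as the working directory."""
    edit_dir = Path(project_dir) / "edit_dir"
    if not edit_dir.is_dir():
        raise FileNotFoundError(f"{edit_dir} does not exist")
    # Output is captured so it can be logged or mailed by the caller.
    return subprocess.run(
        list(command), cwd=edit_dir, capture_output=True, text=True, check=False
    )


# Example call (not run here; the path is a placeholder):
# result = run_in_edit_dir("/path/to/analysis_directory/PROJECT_A")
# print(result.returncode)
# print(result.stdout[-500:])  # the final lines of the run's screen output
```

Calling the function once per project folder processes each project from its own edit_dir, as described in step 5c.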
6. View the results. From the edit_dir, type the command consed. A warning may appear
in the command line window: <no ./.consedrc file so no project-specific resources–
that’s ok>. Ignore this. Two windows will appear on your X-terminal screen. One
will be called Ace Files and the other Consed Main Window.
7. Select an Ace file by double-clicking on it. The newest file is on top. An Ace file in
the edit_dir is absolutely required for Consed. The empty boxes in the Consed Main
Window will then be populated with the selected Ace file name, a list of Contigs
(overlapping reads), and a Read List. A large warning box may also appear referring
to templates; ignore it. The Consed Main Window is shown in Figure 7.16.1.
a. The contig list is arranged by number of reads in the contig, lowest to highest.
b. The read list is arranged alpha-numerically.
c. Clicking on either a contig or a read will bring up an Aligned Reads (AR) window
display (Figure 7.16.2, a sample AR window).
Figure 7.16.1 Consed Main Window. Drop-down menus applying to the whole project are on
top, then a box listing the .ace file being reviewed. Below the .ace file are several action buttons
and then the Contig List and Read List. Two boxes are available for performing searches based
on read name, and two action buttons that will either Show a contig or read highlighted in one of
the lists or Close All Windows. The Quit Consed button (on the upper right) will exit the program
after prompting for saving your changes. The Help button (on the top right) will give you full
documentation.
8. Basic Consed tools and navigation:
a. The Consed Main Window. Figure 7.16.1 shows the critical elements and actions
available in the Consed Main Window. This is the primary window from which
you select the parts of the assembly you want to view.
i. Drop-down menus applying to the whole project are on top: File, Navigate,
Info, and Options. Left-click on the topic to see the choices. Choices are fully
described in the Consed Documentation, available through the Help option, to
the right. See below for use of the Options menu.
ii. The name of the assembly .ace file being displayed is in the box below.
Figure 7.16.2 Aligned Reads Window, PolyPhred v.5.04. Read names are on the left, and arrows indicate the direction
of each read. Here, forward is on the top and reverse on the bottom. The quality of bases is indicated by the shading,
from dark gray to white. The colored red bar on the pseudo sequence is a coding sequence tag added by the user, while
the column of blue tags is applied by PolyPhred to indicate a position where a polymorphism is identified. The red tag on
the consensus means this polymorphism is a polyPhredRank1 tag, with a quality score of 99, and the pink tags indicate
heterozygous bases. For color version of this figure see http://www.currentprotocols.com.
iii. Several action buttons appear next, notably for this discussion, the String Search
button for finding particular sequences within the assembly and the Quit Consed
button. The others are explained fully in the Help documentation.
iv. A list of the contigs in the assembly appears in the large Contig List box, in
ascending order by number of reads in each contig.
v. All the reads in the assembly are shown in the Read List. Failed sequences do
not appear.
vi. The Find Reads box allows you to search for all reads with a particular string
in their name—i.e., an exon or a gene. The result will be a new box with a list
of all reads that contain the string in their name.
vii. To Quit Consed, use the button on the upper right. You will be prompted to save
changes if you wish. Unless you made an editing mistake, save the assembly.
Each save will generate a new .ace file, with a new number. When Consed is
run again, you will usually select the newest .ace file and the others may be
removed as they are very big.
b. View a contig in an Aligned Reads (AR) window (Figure 7.16.2):
i. Read names on left side indicate Project gene exon.samplename.primer (.F or
.R).
ii. Arrows indicate direction of read—reads are complemented and aligned for
viewing data in Consed. The default view has forward and reverse grouped,
above or below the line. (See below, we usually change this option from the
Options menu).
iii. Shades of gray to white indicate quality (Phred score) of bases; whitest is best.
The numerical Phred score of a base is shown in the bottom box if you left-click
on the base of interest. Scores are displayed as follows: 0 to 39, shades of gray;
40 to 97, bright white; 98, edited (unsure); 99, edited (sure).
iv. Upper-case bases are high quality, lower-case lower quality (cruder scale than
the numerical Phred scores or the color shades).
v. Consensus sequence is top line, determined by program from data.
vi. Red bases differ from Consensus.
vii. If user-defined tags were entered, they are seen on the pseudo sequence. Example: red bar is a coding sequence tag.
viii. When PolyPhred recognizes a polymorphism, a column of tags is applied (see
position 558), blue for homozygotes, pink for heterozygotes. Mouse over a pink
tag to see what the alleles are, in the box at the bottom of the window.
ix. PolyPhred also tags the consensus with a color-coded rank tag: red = rank1,
score of 99, highest quality. Range is Rank1 to Rank6.
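The score ranges in item iii can be written as a small lookup, which may help when interpreting the shading. The following is a minimal sketch only: the function names are hypothetical and are not part of Consed, and the error-probability relation is the standard Phred definition (Q = -10 log10 P) rather than anything stated in this unit. The values 98 and 99 are edit markers, so the probability formula is not meaningful for them.

```python
def describe_quality(score):
    """Map a base quality value to the display category listed in item iii."""
    if not 0 <= score <= 99:
        raise ValueError("quality values are expected to lie between 0 and 99")
    if score == 99:
        return "edited, sure"
    if score == 98:
        return "edited, unsure"
    if score >= 40:
        return "bright white"
    return "shade of gray"


def error_probability(score):
    """Standard Phred relation: Q = -10 * log10(P_error)."""
    return 10 ** (-score / 10)
```

For example, a base with a score of 20 has an estimated error probability of about 1 in 100 and is displayed in gray.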
c. Revise the view format options. Go to the Consed Main Window, left-click on
Options at the top. Select General Preferences from the menu. A General
Preferences Box appears, and you can modify default settings. Change the “Display
reads alphabetically or by strand . . .” option from “strand” to “alpha,” then
left-click the box Apply and Dismiss (bottom left). This is the only change we
routinely make in the default view. The effect is shown in Figure 7.16.3 where
forward and reverse of each individual are next to each other.
This makes it easier to identify artifacts not present in both directions. Changes made from
the options menu are temporary, and will not be applied next time Consed is run.
d. Show protein translation. Click on the Misc box at the top of AR window.
Multiple options appear—select Show Top Strand Protein Translation. All three
reading frames for the entire sequence are shown, with potential start and stop
codons highlighted.
Use this display format. It facilitates the characterization of variants. The user may tag
actual start and stop codons and coding positions.
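To make clear what the three-frame display contains, here is a minimal sketch of a top-strand translation in all three reading frames. It is an illustration under stated assumptions, not Consed's implementation: the function name is hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G, or T is reported as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_three_frames(sequence):
    """Translate the top strand in frames 1, 2, and 3; '*' marks a stop codon."""
    seq = sequence.upper()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames
```

In the output, M marks a potential start codon (ATG) and * a stop codon, which correspond to the positions highlighted in the display.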
e. View different parts of the contig. Scroll through the contig in the AR window
by moving the scroll buttons on the bottom or side of the window. Other
methods are described in the Quick Tour or Consed documentation.
f. Viewing Traces. Using the mouse middle button, click on a base of one of your
sequences, a Trace Window will appear. You can scroll through the sequence,
using the scroll button on the bottom. You can change the magnitude or breadth
of the peaks using the two slide buttons on the left. The top line of sequence is
the consensus, the second line is the edit line, where you can edit the calls for
this chromatogram, or tag a base or region using the middle button. The third
line of sequence is the Phred base call. You can view four chromatograms at one
time, to compare data. In the options window you can modify this number.
g. Edit a base. If you are sure Phred has called an incorrect base, you can change it.
Middle-click on the read in the AR window. The trace will appear. Left-click on
the base you wish to change. Type the correct base. If you are sure, you can
make it high-quality by making it upper case. The Phred score will reflect your
selection. When you save the assembly (see below), this change will be recorded
in a new version of the phd file for the read (phd.2). However, the previous phd
file (phd.1) will also be retained, and the chromatogram will never be altered.
h. Find a sequence. String Search allows you to find a sequence: left-click on a
base in the AR window and slide the cursor over 10 to 15 bases while holding
the mouse button down. The bases will turn yellow. Release. Left-click the
Search for String box, upper left. A Search for String window appears. Use the
middle mouse button to paste the highlighted string into the Query string field. A
box will appear that shows every place in the assembly where the string appears.
Figure 7.16.3 (A) Aligned reads window after changing view from “strand” to “alpha” in the
Options list. This view facilitates review by placing forward and reverse reads next to one another. The red tag at the top indicates the polymorphism is Rank1, with a quality score of 99.
Two individuals are shown with heterozygous tags (pink) and two with homozygous tags (blue).
(B) The chromatograms showing heterozygous and homozygous individuals in both directions.
See fuchsia arrows in panel A for these data (021.F at the top, then 021.R, 019.F, and 019.R). For
color version of this figure see http://www.currentprotocols.com.
i. Complement a contig. It is convenient to review the sequence with the forward
reads going from left to right, or in a sense direction. If the assembly does not do
this, you can click on the Compl Cont button above the sequence in the AR
window.
j. Join contigs. If two contigs overlap but are not joined, highlight and string search
part of the overlapping sequence. A new box appears, listing two contigs.
Double click on each and the AR window for each appears, with the cursor over
the first base in the string. In each window, click on the Compare Cont button
above the reads. The sequences will appear in a Compare Contigs window. Click
on Align, and scroll to review the alignment. If it looks valid, click on Join
Contigs. Your assembly will be modified to place these two contigs into one.
Phrap assemblies are sometimes incorrect, and may fail to put all overlapping reads in
one contig. You will need to review each amplicon quickly to be sure no errors have
occurred. Example: Two individuals have a 3 bp deletion relative to the others and to the
normal sequence (pseudo)—they may be in a different contig, but you can combine them.
Or, two exon templates may overlap but the overlap may be very small. You can join them
to facilitate review of the intronic sequence.
k. Tear Contigs. Right-click on a base near the position where you want to tear a
contig into two separate contigs. Choose Tear contig at this consensus position.
A box will appear where you select the contig you want each read to fall into,
after the separation. Select Do Tear at the bottom of box.
Occasionally, contigs are misassembled, sometimes for no apparent reason. This tool
allows you to fix them.
l. Remove reads. Reads may be removed from a contig. The most common reason
to do this is because poor quality interferes with the analysis. Right-click on a
base in the offending read, and choose Put read xxxxxx.xx.F into its own contig.
This read will now appear as a single-read contig on the Consed Main Window.
See the Quick Tour documentation for more complicated read removal and
addition.
m. Save Assembly. You must periodically save the changes you have made, if you
want to have them available next time you open the assembly. If the program
crashes without saving, you may lose your changes. The assembly can be saved
from the file menu in either the Consed Main Window or the Aligned Reads
window, upper-left in both cases. Left-click on the menu and choose save
assembly.
This creates a new .ace file, which you will be able to access next time you open the
Project. Or, you can access an older .ace file to see the data before edits. Ace files are big,
and you should not accumulate too many in a project. Most old .ace files are not useful,
so use the UNIX command rm <filename> to delete them from the edit dir. You will probably be
prompted to confirm removal.
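If many saved versions have accumulated, it can be helpful to list the candidates for removal before deleting anything. The sketch below assumes that saved assemblies are named with a trailing version number (for example, a name ending in .ace.3); check this against the files in your own edit dir. The function name and the default of keeping the two newest versions are illustrative choices. The function only lists files and deletes nothing.

```python
import re
from pathlib import Path

# Assumed naming pattern: <anything>.ace.<version number>
ACE_VERSION = re.compile(r"^(?P<stem>.+\.ace)\.(?P<version>\d+)$")


def old_ace_files(edit_dir, keep=2):
    """Return numbered .ace files in edit_dir, excluding the `keep` newest of each assembly."""
    if keep < 0:
        raise ValueError("keep must be zero or greater")
    versions = {}
    for path in Path(edit_dir).iterdir():
        match = ACE_VERSION.match(path.name)
        if match and path.is_file():
            versions.setdefault(match["stem"], []).append((int(match["version"]), path))
    stale = []
    for entries in versions.values():
        entries.sort(key=lambda item: item[0])  # oldest version first
        cutoff = max(len(entries) - keep, 0)
        stale.extend(path for _, path in entries[:cutoff])
    return sorted(stale)
```

Review the returned list first. Each file can then be removed with rm as described above, or from Python by calling unlink() on each path.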
9. Add tags manually to control or pseudo files. Type the designation you have given
to identify your control or pseudo sequences, in the Find reads box of the Consed
Main Window (Example: pseudo). Press enter. A new window will appear with a list
of all the pseudo sequences in the assembly, in alpha-numeric order. Double-click
on the first. An Aligned Reads window will appear. Apply coding sequence tags to
your pseudo or control sequences manually, if you have not used an external script
to do so.
a. Coding sequence tags may be added to indicate most important regions for review.
i. Locate the first base of the coding sequence using string search (refer to the
UCSC template files used in primer design, UNIT 7.9). Middle-click on this base
in the pseudo sequence. A Trace Window will appear.
ii. Middle-click on the first base of the coding sequence in the “edt” line, and
holding the button, slide it along for a few bases. They will turn yellow. Release,
and a window appears, “What to Do with Selection.”
iii. Choose add tag. A new window appears, Select Tag Type. Note the variety of
options available. Choose Coding Sequence, and select “ok” at the bottom of
the window. A new red tag covering the highlighted bases will be seen on the
pseudo sequence.
iv. To locate the last base of the coding sequence, string search for the last 15 bases
of the exon as seen in the UCSC reference. Note the position of the last base in
the box that appears. Then right-click on the tag you created for the beginning
of the coding sequence (see iii), and choose “Tag: coding sequence show more
info?” in the drop-down menu. The coding sequence box will appear, where
you can change the End Unpadded Consensus Position to the number noted
for the last coding base. You may also type a note in the comment box (e.g.,
Exon 3). Then choose Save Changes. Your tag will now cover the entire coding
sequence for this exon. If you added a note, it will appear in the box at the
bottom of the AR window when your mouse is over the tag.
v. Tag primers in the pseudo sequence. String search for a forward primer sequence. Middle-click on the relevant pseudo sequence to bring up the Trace
window and highlight the forward primer sequence holding the middle button
down. Choose add a tag, and select Forward Primer from the list that appears.
Click OK on the bottom of the tag list box. Many types of tags are described in
the list, among them forward and reverse primer. Different tags are displayed
in different colors.
b. Manual addition of tags is time-consuming, but it is very useful to mark the
coding sequence, and the forward and reverse primers.
Tags are saved in the phd files, when you save the assembly, and will be permanently
available to you. You can also add other relevant tags, for start codons, known mutations,
stop codons, etc. These phd files with tags can be copied with the same templates to other
projects where you are sequencing the same genes.
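The string searches in step 9a (finding the first and last bases of the coding sequence) can be checked outside Consed against a plain-text copy of the pseudo sequence. This is a minimal sketch under several assumptions: the sequence contains no pad characters, positions are counted from 1, and each search string occurs only once. The function name is hypothetical.

```python
def coding_tag_bounds(reference, first_bases, last_bases):
    """Return 1-based (start, end) positions of the coding region in `reference`.

    first_bases: the opening bases of the coding sequence
    last_bases:  the final bases of the coding sequence (e.g., the last 15)
    """
    ref = reference.upper()
    start = ref.find(first_bases.upper())
    if start == -1:
        raise ValueError("start of coding sequence not found")
    end_hit = ref.find(last_bases.upper(), start)
    if end_hit == -1:
        raise ValueError("end of coding sequence not found")
    # Convert 0-based string indices to 1-based positions.
    return start + 1, end_hit + len(last_bases)
```

The second number returned is the value that would be entered as the End Unpadded Consensus Position in step 9a(iv), provided the assumptions above hold for your sequence.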
10. Recognizing mutations: common patterns and problems.
a. Clean data: Figure 7.16.2 shows good clean data as represented in an AR window
in Consed. Most of the bases are bright white, and one can be confident that there
are no hidden variants in these traces, except those marked.
b. SNP: If a SNP is present in your data, it will look like the one at position 558,
shaded gray, in Figure 7.16.2. Note that it appears in both the forward and reverse
sequence of the same sample. In one case, Phred has called a C, in the other a T.
Both are lower case (low quality) due to the secondary peak. The pink tag means
heterozygote, and if you put your mouse over this tag (no click), it will tell you it
is heterozygote TC tag data 99 (highest quality).
c. Indel: Heterozygous small insertions and deletions have a very clear pattern in the
AR window in PolyPhred and in the corresponding trace windows. The sequence
in each direction is clean until the deletion or insertion, then it is double, as both
alleles are read (Fig. 7.16.4). The traces for a normal individual (forward only,
top) and for an individual with a heterozygous deletion (forward and reverse) are
shown in Figure 7.16.5. To determine the exact nature of the indel:
i. Write the wild-type sequence, from 10 bases to the left of the position where the
sequence begins to fail to ∼20 bases after this point.
CCTTGCCACGCTAGCTTTCTGACATC.....
ii. From the forward read, write the wild-type and then the variant peak after the
mixed sequence begins (the base that appears in addition to the wild-type base)
immediately below the wild-type sequence.
CCTTGCCACGCTAGCTTTCTGACATC.....
CCTTGCCACGCAGCTTTCTGACATCC.....
From this you can see that the secondary sequence is like the wild-type with one
T deleted.
iii. To confirm, read the sequence in the reverse direction, right to left, and line it
up with the others
CCTTGCCACGCTAGCTTTCTGACATC.....
CCTTGCCACGCAGCTTTCTGACATCC.....
...CCTTGCCACGCAGCTTTCTGACATC
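The manual comparison in steps i to iii can also be sketched in code once the wild-type and secondary sequences have been written out. This is an illustration only, not part of PolyPhred or Consed: the function name, the maximum deletion length, and the minimum overlap required for a match are assumptions.

```python
def characterize_deletion(wild_type, secondary, max_len=200, min_overlap=8):
    """Return (0-based position, deleted bases) if `secondary` matches
    `wild_type` with one contiguous deletion; otherwise return None."""
    wt, sec = wild_type.upper(), secondary.upper()
    shared = min(len(wt), len(sec))
    # First position at which the two sequences diverge.
    start = next((i for i in range(shared) if wt[i] != sec[i]), None)
    if start is None:
        return None
    for length in range(1, max_len + 1):
        overlap = min(len(wt) - start - length, len(sec) - start)
        if overlap < min_overlap:
            break  # too little sequence left to confirm a match
        # Skip `length` wild-type bases and test whether the rest lines up.
        if wt[start + length:start + length + overlap] == sec[start:start + overlap]:
            return start, wt[start:start + length]
    return None
```

Applied to the two sequences written out in step ii, the function reports a single deleted T. To test for an insertion, swap the two arguments. Within a run of identical bases the exact base that was lost cannot be determined, so the reported position is simply the first point of divergence.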
Figure 7.16.4 Aligned Reads window showing four individuals with heterozygous single-base
deletions. The characteristic pattern for indels is clear sequence to the point of the deletion or
insertion, then an immediate loss of good sequence. The arrows indicate the direction of the reads.
PolyPhred v.5.04 does not tag the indels, so manual review is required to identify them. For color
version of this figure see http://www.currentprotocols.com.
d. Longer deletions that fall within the amplicon: These are harder to see,
especially if they are very long so both ends may not be in one window. They
may be characterized exactly as shown above. To avoid missing them, always
check the beginning and end of sequence that appears to be double when the
poor sequence extends for a long distance in both directions. The hallmark is
that each direction will start as a clean sequence, then become double. If, for
example, the deletion is 20 bp long, the double sequence in both reads will
overlap for ∼20 bases. But if the indel is 150 bases, the double sequence may
appear to be a failed sequence, and you may miss a critical variant.
e. Contaminated sequence: In contrast to the situation described above for indels,
double sequence can also appear when two or more PCR products or two
primers are present in a sequencing reaction. In either case, the traces will be
double or messy in the beginning, often in both directions, but may clear up
somewhere before the end. This may be explained by
i. Contaminated universal sequencing primers that anneal to PCR product and
each extend different products that incorporate dye terminators.
ii. Two or more PCR products that are both extended by a single primer and
incorporate dye terminators.
iii. Clean sequence appears later on because one of the templates is shorter than
the other and the longer product is clean in the end.
11. Review your data. Open Consed, review all contigs, and record variants and failures.
a. Each contig should contain the control or pseudo sequence from one exon and all
the reads you have deposited in the chromat dir for that exon.
Figure 7.16.5 Traces showing a single-base deletion. The top panel is the forward trace of a normal
individual, the middle panel shows the forward sequence of an individual with a deletion, and the
bottom panel shows the reverse sequence of the same individual. For color version of this figure see
http://www.currentprotocols.com.
b. Manual Review.
i. Review the contigs for faulty assembly; tear apart those that are incorrect, join
those that belong together.
If two exons are close to each other and their templates overlap, they may be in the
same contig.
For manual review, we review each gene, exon by exon, beginning with the first, characterize each mutation, and record the results in an Excel sheet.
ii. Failed reads may not appear, so check to see that all individuals are present for
each exon.
iii. Check to be sure every amplicon is present. If all reads fail for one exon, the
pseudo sequence may not appear, and it is easy to miss the fact that the exon is
not present.
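The completeness checks in items ii and iii amount to comparing the reads that are present with the reads that were expected. The sketch below assumes read names of the simplified form exon.sample.direction (for example, ex1.021.F). This form and the function name are hypothetical, so adapt the name template to your own convention.

```python
from itertools import product


def missing_reads(read_names, exons, samples, directions=("F", "R")):
    """List expected reads (exon.sample.direction) that are absent from the assembly."""
    present = set(read_names)
    expected = (
        f"{exon}.{sample}.{direction}"
        for exon, sample, direction in product(exons, samples, directions)
    )
    return [name for name in expected if name not in present]
```

An exon for which every expected read is reported missing is likely to be one that failed completely and, as noted in item iii, may not appear in the assembly at all.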
c. Automated review:
i. Using the Navigation menu from the Consed Main Window, Navigate tags in
all contigs, for the polyPhredRank1 to 6 tags.
ii. Click on one of the polyPhredRank tags. A box will appear with a list you can
click through, jumping right to the tag, in whatever contig it may be.
You must do one Rank at a time.
iii. Click through all Ranks to determine if they are real or not. Record.
iv. This gives a quick view of all the mutations found by PolyPhred, and depending
upon your dataset and the size of your project, may be very informative. However, you may also need to scan each contig for variants missed by PolyPhred,
and also for indels, which currently are not tagged in v.5.04.
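For record keeping, the ranked tags can also be listed outside Consed by reading the saved .ace file as text. The sketch below rests on an assumption that should be checked against your own files: that consensus tags are stored in CT{ } blocks whose first line gives the contig name, tag type, originating program, start position, and end position, as described in the Consed documentation for the ace format. The function name is hypothetical.

```python
import re

# A consensus tag block: "CT{" on its own line, content, then "}" on its own line.
CT_BLOCK = re.compile(r"^CT\{\n(.*?)\n\}", flags=re.MULTILINE | re.DOTALL)


def rank_tags(ace_text):
    """Collect (contig, tag type, start, end) for consensus tags of type polyPhredRank*."""
    tags = []
    for block in CT_BLOCK.findall(ace_text):
        lines = block.splitlines()
        if not lines:
            continue
        fields = lines[0].split()
        if len(fields) >= 5 and fields[1].startswith("polyPhredRank"):
            tags.append((fields[0], fields[1], int(fields[3]), int(fields[4])))
    # Group by rank, then by contig and position, mirroring a rank-by-rank review.
    return sorted(tags, key=lambda tag: (tag[1], tag[0], tag[2]))
```

The resulting list can be pasted into the spreadsheet used to record which tagged positions were judged real. It does not replace the visual review, and it will not include variants that PolyPhred failed to tag.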
12. Editing bases in Consed: If you are sure Phred has called an incorrect base, you can
change it (see step 8g above). Middle-click on the base to be changed, which will
bring up a trace window. Edits are made in the trace window. Left-click on the base
in the “edt” line of the sequence and type in the changed base.
Changes are saved in a new version of the phd file that contains the information extracted
from each chromatogram. The changes are written to a new version of the Ace file when
the project is saved, and will be seen if the new Ace file is opened at a later date. The
chromatograms themselves are never modified.
If your purpose is primarily to identify variants, it is not necessary to edit most bases—
this is time-consuming, and unless changes impact a variant, don’t bother.
13. Save Assembly before closing to save any changes.
COMMENTARY
Background Information
Progress in sequencing technology and in
knowledge of the genome has made it technically and economically feasible to screen
large numbers of samples for mutations in
many different genes. UNIT 7.9 presents detailed protocols for generating sequence data
by automated fluorescence-based sequencing.
This unit describes sequence data review and
mutation identification using one of several
available programs, PolyPhred v.5.04. Our
group has had great success with using this
particular program. Development of the program continues and a new version, as yet
untested in our hands, is available for indel
detection (PolyPhred v.6.11, http://droog.gs.washington.edu/get poly6.html), described in
Bhangale et al. (2006).
Several additional computer programs that
aid in the analysis of electronic chromatogram
files (also known as electropherograms, .abi
or scf files) from an ABI (or other) automated sequencer are briefly noted below. Some
are commercially available, while others are
free to academic laboratories.
Trace analysis software from other sources
For most mutation identification projects,
it is clearly advantageous to use some form of
sequence analysis software for the task of reviewing the quality of the data and identifying
sequence variations, and there are many options available. Below is a discussion of four
of these options from the perspective of the
end user; details of programming algorithms
are not discussed. The first three (Sequencher,
DNASTAR, and Mutation Surveyor) are commercially
available and run on either a Macintosh or PC/Windows platform. These programs seem to be well-supported and widely
used for sequence analysis and mutation identification. They each have graphical reporting
capacities that are attractive. The Staden package for sequence analysis, described in Staden
(1994), includes Gap4 and (like PolyPhred)
has recently become available without cost
(http://staden.sourceforge.net/). The new version can be run on UNIX or PC platforms,
but (like PolyPhred) it seems complex if you
have no prior experience with the package.
The choice of software for mutation detection
will depend on personal preference, local experience, and the availability of computer resources, but it is a choice that should be made
early in the planning stages because it is a critical element for success.
Sequencher 4.7
Sequencher is a sequence assembly
software package available from Gene
Codes (http://www.genecodes.com/; info@genecodes.com). It is easy to install and runs
on either Macintosh or PC/Windows computers. We have no current experience with Sequencher, but it is one of the programs most
commonly mentioned as being used by the laboratories for whom we perform sequencing. A
review of its features online and through the
Demo program, downloaded from Gene Codes
for testing purposes, shows that it performs
many of the functions required for mutation
analysis.
Sequence data are imported in the form of
chromatograms from automated sequencers,
such as ABI, LI-COR, or Pharmacia/ALF systems. A user-defined Reference Sequence can
also be imported and used to align experimental data. The data are assembled with the “assemble contigs” command and an overview
of the assembly is available. Reads can be
trimmed based on program-defined quality
scores and also to match the Reference Sequence.
The user can observe the sequence and
the aligned chromatograms and inspect them
for mutations. The simultaneous evaluation of
bidirectional traces of a control sample and
one with a suspected mutation will allow the
identification of most mutations. It is possible to navigate quickly through the ambiguous
bases or places where the data do not agree
with the Reference Sequence, so base-calling
errors can be edited, and heterozygous or homozygous mutations identified and annotated.
There is a translation function as well, so each
reading frame can be viewed.
Sequencher also has an automated feature for identifying heterozygote variations in
which “secondary peaks” are called. The user
can specify the minimum lower peak height
as a percentage of the upper, then quickly
scan and edit base pairs with secondary peaks
that are suggestive of a heterozygote variation.
However, the program notes say this function
calls some positions that are not true heterozygotes, and also fails to capture some, largely
due to background noise or sequence data artifacts. Once the contig is edited, a Variance
Table can be generated, indicating where differences exist between the consensus sequence
(representing your edited experimental data)
and the Reference Sequence. A graphical summary report can be produced of features you
have identified, showing base or amino acid
changes and their locations.
The major advantage of Sequencher is that
it is easy to install and use on Macintosh
or PC/Windows computers, after consulting
the user’s manual with its short tutorial. It
is clearly useful for sequence assembly, especially for relatively small projects, and is
superior to visual inspection of printed chromatograms for detecting heterozygote mutations. However, compared to PolyPhred, it is
relatively labor-intensive to import a set of
data, assemble it, and edit it. Another consideration is the current cost of Sequencher, approximately $3000 for a single academic user,
with support and upgrades for 1 year. In contrast, PolyPhred is free to academic laboratories that have the ability to implement it. A free
demo version of Sequencher can be obtained
upon request from Gene Codes for those interested in evaluating it. More information on
Sequencher can be found on the Gene Codes
Web site (http://www.genecodes.com/).
Lasergene from DNASTAR
Lasergene (http://www.dnastar.com/products/lasergene.php) is a Windows/Mac
program package including seven modules
for the analysis and management of DNA and
protein sequences. Individual modules for
particular functions may be purchased, most
notably SeqMan Pro, designed for sequence
assembly and SNP discovery. Site licenses
are also available for multiple users. The Web
site describes the various elements and the
details of SeqMan Pro. We have not used
this program, but the overview suggests that
it includes many of the same features of
Sequencher and Mutation Surveyor. Data can
be in multiple formats, including ABI files
and SCF files, and the program can handle
large numbers of reads, so it is not limited
(as Mutation Surveyor is currently) to 400.
In addition, it can manage phrap assemblies,
which might be useful if you were using
phredPhrap as a primary sequence analysis
tool but wanted to use some of the attractive
graphical reporting facility of SeqMan Pro.
Mutation Surveyor and Mutation Explorer
These programs for sequence analysis, available from SoftGenetics, are described in detail at http://www.softgenetics.com/download/Mutation Surveyor Sheet.pdf.
Mutation Surveyor is recommended for
discovery, while Mutation Explorer is for
clinical applications, though the difference
between them is not obvious. Both run in
the PC/Windows environment. Mutation
Surveyor is available in different size capacities; the smallest can handle up to 48 lanes
of sequence, while the largest handles 400,
and there is a network version, as well as
stand-alone versions. A detailed description
of the implementation of Mutation Surveyor
is in UNIT 10.8. A comprehensive external
review of Mutation Surveyor by the National
Genetics Reference Laboratory, UK, is
available at http://www.softgenetics.com/download/NGRL TechnologyAssessment.pdf.
This review describes the program’s characteristics and gives a very informative picture
of its capacity. Version 3.01 of Mutation
Surveyor was released in December 2006,
adding two new functions. It now has the
ability to mask vector sequences if cloned
fragments are being sequenced, and it has
an improved capacity to identify known
SNPs from the GenBank database or from the
laboratory. The user can activate a Negative
SNP function so these SNPs are shown in the
screen with experimental data, even if they
are not seen in the sample.
Mutation Surveyor is used in the Laboratory for Molecular Medicine at HPCGG. This
group finds its approach to mutation identification to be extremely effective, and its reporting capacity makes it very helpful. However,
as noted in the external review, it cannot yet
be used in a clinical setting without additional
manual review. This program offers quite a
few more features than Sequencher, but is correspondingly more costly. Like Sequencher, it
is rather labor-intensive to upload the appropriate files for analysis, and does not run well
in a truly automated fashion.
Critical Parameters and
Troubleshooting
Because heterozygous point mutations and
small insertions or deletions are the most difficult to identify, this unit illustrates how such
mutations can be located using PolyPhred
v.5.04. Homozygous mutations are also observed when a known reference sequence is
incorporated in your assembly. There are several caveats that must be recognized when
performing sequence analysis of the type described here, and there are certain hallmarks
that the reviewer may observe in sequence data
that indicate potential problems in data production that should be addressed. The most
important caveats and hallmarks are discussed
below.
Sequencing alone is not useful for identifying heterozygous individuals carrying large
deletions or rearrangements in one chromosome, because in such cases only the normal
or existing allele will be amplified. If one observes a homozygous variant, it is not possible
to exclude a deletion unless the individual also
exhibits a heterozygous variant in the same
amplicon. It is useful to analyze the entire amplicon rather than just the coding sequence, because such confirmatory heterozygous SNPs
may occur outside the exon. This is standard
practice at HPCGG.
In a broader sense, sequencing never proves
the presence of two alleles unless heterozygous bases are seen. However, loss of heterozygosity may be especially worth investigating if the sequence data show multiple
exons without any evidence of heterozygosity.
In the case of tumors that may be contaminated with normal tissue, variants in tumor
cells may be masked by the amplification of
small quantities of DNA from normal cells.
UNIT 10.9 describes the sequencing of the EGF
receptor in tumor cell populations, where the
presence of normal cells may be a problem.
It is possible that variant peaks may be much
smaller than the normal heterozygous peaks
(ratio should be 1:1) when tumor cells are
mixed with normal cells, and so background
must be very low for proper interpretation.
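The effect of normal-cell contamination on peak ratio can be estimated with simple arithmetic. The sketch below is an idealized calculation, not a description of real trace behavior. It assumes a heterozygous variant present only in the tumor cells, two copies of the locus in every cell, equal amplification of both alleles, and peak heights proportional to allele abundance. The function name is hypothetical.

```python
def expected_peak_ratio(tumor_fraction):
    """Expected variant:normal peak ratio for a heterozygous tumor-only variant.

    tumor_fraction: proportion of cells in the sample that are tumor cells (0 to 1).
    """
    if not 0 <= tumor_fraction <= 1:
        raise ValueError("tumor_fraction must be between 0 and 1")
    variant_share = tumor_fraction / 2  # one of two alleles in each tumor cell
    return variant_share / (1 - variant_share)
```

With pure tumor the ratio is 1:1, as for any heterozygote. With 40% tumor cells, the variant allele is 20% of the template and the expected ratio falls to 0.25, which is why low background is essential.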
As described in UNIT 7.9, the primers used to
amplify regions of the genome must not span
a SNP or indel, or amplification may fail or be
biased. While dbSNP continues to add newly
reported SNPs that can inform the placement
of primers, many novel SNPs occur in every
individual, so all data must be viewed with that
caveat. We have seen instances where a primer
fails to amplify one allele and thus a mutation
has been missed. We are only aware of this
when there are overlapping amplicons and the
SNP under the primer is seen in the product
from the flanking primer pair.
It is important to view the data with a critical eye, and to note any unusual characteristics.
For instance, if a heterozygous SNP occurs in
every sample, there is a possibility that DNAs
may have been cross-contaminated, very likely
with one or more that are actually homozygous for the variant. This is most likely when
many DNAs are stored in a plate, and they are
accessed many times. Such results should be
confirmed with clean samples.
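If genotype calls have been exported to a table, positions that are heterozygous in every sample can be flagged for this kind of review. The sketch below assumes a hypothetical layout (a dictionary of position labels mapping to per-sample genotype strings such as "A/G"); it is not a PolyPhred output format.

```python
def is_heterozygous(genotype):
    """True when the two alleles of an 'X/Y' genotype string differ."""
    first, second = genotype.split("/")
    return first != second


def positions_heterozygous_in_all(genotypes):
    """Return positions at which every sample is called heterozygous.

    `genotypes` maps a position label to a dict of sample -> genotype string,
    e.g. {"exon2:115": {"S1": "A/G", "S2": "A/G"}}. The layout is hypothetical.
    """
    flagged = []
    for position, calls in genotypes.items():
        # Skip positions with no calls so that an empty dict is never flagged.
        if calls and all(is_heterozygous(g) for g in calls.values()):
            flagged.append(position)
    return flagged
```

A flagged position is only a prompt to inspect the traces and sample handling; a true common polymorphism can also be heterozygous in a small sample set.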
Another problem that is often first noted
during sequence analysis is that primers
or whole amplicons are not unique in the
genome. The data may look very “dirty,”
with many background peaks and potential
heterozygous positions throughout the reads of all
individuals. The color in Consed is not bright
white (see Basic Protocol, step 10a). We have
seen this to be true for several exons of a gene,
while all the rest of the amplicons/exons are
high quality, and have determined that fragments of some genes are duplicated in other
places. The amplicons in question, as well
as the primers, should be checked to confirm
that they are unique, using the BLAT tool at
UCSC (http://genome.cse.ucsc.edu/cgi-bin/hgBlat?command=start&org=Human&db=hg18&hgsid=105162860).
BLAT is described in UNIT 7.9.
Anticipated Results and Time Considerations
The most critical factor for efficient and accurate mutation identification using PolyPhred
or any other software application is to obtain high-quality data. The protocols for doing
this are in UNIT 7.9. It is also important that
PolyPhred be set up on a platform with sufficient power and space to run fairly intensive
computational programs. Once this is accomplished, and a pipeline is established to move
the data files to the appropriate folders, one
can anticipate that data processing will move
smoothly and will be very rapid.
Users will require a few sessions with a
UNIX expert to learn the basics of the UNIX
command window, and then a few sessions
with the “Quick Tour” of Consed or with an
experienced user to become proficient. Fortunately, even a very inexperienced user cannot
create any irrecoverable errors within Consed
or in a properly managed UNIX infrastructure. The user will rapidly learn the few tools
and commands that are repeated often, and
will understand the structure and appreciate
the scope of the software.
Data review and reporting may be more or
less time-consuming, depending upon several
factors including (1) data quality, (2) the number of amplicons in a project, (3) the number
of individuals to be sequenced for each amplicon, (4) the number of variants present, and
(5) the method of reporting the results.
If the data are all clean, the PolyPhred assembly is likely to be completely correct, and
each amplicon can be scanned quickly, individual samples accounted for, and variants
seen and characterized with very little difficulty. For poor-quality data, the time required
can increase dramatically. Thus, it is very important to optimize primer design, PCR reactions, sequencing reactions, and analyzer run
parameters ahead of time.
The HPCGG has had projects where >100
amplicons are sequenced in a small number
of samples (two to ten), as well as projects
where a much smaller number of amplicons
are sequenced in hundreds of individuals.
Projects with many amplicons are more time-consuming than those with few, even though
the actual number of reads may be equal, because each amplicon will form one contig,
which must be individually opened and reviewed. If all the reads fall into one or a few
amplicons/contigs, the work of reviewing is
accomplished more quickly.
The actual number of variants that may appear in a project may also vary, depending
on the experimental design. In a candidate
gene search, most (or all) of the amplicons reviewed may have no significant or novel variants. Alternatively, in a project where one has
a selected group of patients with one disease
frequently caused by mutations in one gene,
every sample may have one or more significant variants. The most time-consuming part
of data analysis is the characterization of novel
mutations.
One of the most complicated issues is
determining how to report results efficiently and
with complete clarity. PolyPhred does not have
a graphical or spreadsheet-based reporting tool
that can be readily used for clinical or even
research reporting. HPCGG uses an Excel format, reporting across the top every variant observed, in the order in which they occur
within the gene. It is essential to provide the
exon or intron within which the variant occurs,
and a 9-base string for each SNP, or 4 bases
flanking each indel or duplication. This should
be unique, and allow the unequivocal identification of the variant. In addition, one may
wish to provide the genotype for each individual at this position; for coding sequence variants, it is useful to determine the nucleotide
and codon numbers and changes. We report
every variant within the amplicon, even those
that are far from the coding sequence. Finally,
for publication or presentation of results, the
Human Genome Variation Society’s recommendations (http://www.hgvs.org/mutnomen/)
should be consulted to determine the correct
nomenclature for the variant.
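The context string for a SNP can be generated and checked for uniqueness with a few lines of code. The sketch below assumes that the 9-base string consists of the SNP position with 4 reference bases on each side, and that positions are 1-based within the amplicon sequence; both conventions, and the function names, are assumptions made for illustration rather than part of the HPCGG reporting format.

```python
def snp_context(reference, position, flank=4):
    """Return the SNP base with `flank` bases on each side (9 bases by default).

    `reference` is the amplicon sequence and `position` the 1-based SNP
    coordinate within it; both conventions are assumptions of this sketch.
    """
    index = position - 1
    if index - flank < 0 or index + flank >= len(reference):
        raise ValueError("SNP is too close to the end of the sequence")
    return reference[index - flank:index + flank + 1]


def is_unique(context, reference):
    """True when the context string occurs exactly once in the reference."""
    count = 0
    start = reference.find(context)
    while start != -1:
        count += 1
        start = reference.find(context, start + 1)  # allow overlapping hits
    return count == 1
```

If the string is not unique within the sequence being reported, a longer flank can be requested through the `flank` argument.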
In summary, the time required for a project
is extremely variable depending upon all the
factors discussed, most notably the quality of
data, the number of amplicons, the number of
individuals, and the detail of the reports to be
generated.
Literature Cited
Bhangale, T.R., Stephens, M., and Nickerson, D.A.
2006. Automating resequencing-based detection of insertion-deletion polymorphisms. Nat.
Genet. 38:1457-1462.
Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error
probabilities. Genome Res. 8:186-194.
Ewing, B., Hillier, L., Wendl, M., and Green, P. 1998.
Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res.
8:175-185.
Gordon, D. 2004. Viewing and editing assembled
sequences using consed. Curr. Protoc. Bioinformatics 11.2.1-11.2.43.
Gordon, D., Abajian, C., and Green, P. 1998.
Consed: A graphical tool for sequence finishing. Genome Res. 8:195-202.
Nickerson, D.A., Tobe, V.O., and Taylor, S.L. 1997.
PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using
fluorescence-based resequencing. Nucleic Acids
Res. 25:2745-2751.
Staden, R. 1994. Staden: Comparing sequences.
Methods Mol. Biol. 25:155-170.
Stephens, M., Sloan, J.S., Robertson, P.D., Scheet,
P., and Nickerson, D.A. 2006. Automating
sequence-based detection and genotyping of
SNPs from diploid samples. Nat. Genet. 38:375-381.