Primers4clades, A Web Server to Design Lineage

de Bruijn Vol. I
Chapter
c51.tex
V1 - 02/08/2011
5:16 P.M. Page 441
51
Primers4clades, A Web Server to
Design Lineage-Specific PCR Primers
for Gene-Targeted Metagenomics
Bernardo Sachman-Ruiz, Bruno Contreras-Moreira, Enrique
Zozaya, Cristina Martínez-Garza, and Pablo Vinuesa
51.1 INTRODUCTION
This chapter presents a practical overview and case
study of the usage and capabilities of the primers4clades
version 1.0 web server [Contreras-Moreira et al., 2009],
a software tool useful to design degenerate PCR primers
for gene-targeted analysis of metagenomic DNA (see
also Chapter 28, Vol. I) and multilocus sequence analysis.
Although direct shotgun sequencing of community DNA
is now feasible [Venter et al., 2004], polymerase chain
reaction (PCR) remains the most widely used technology
to gain molecular markers for studies in molecular
ecology and systematics. With the ongoing accumulation
of fully sequenced genomes in public sequence databases,
the focus of microbial ecology studies is increasingly
shifting to the analysis of protein coding genes and
sequences (CDSs) to understand ecological, metabolic,
and evolutionary processes in nature [Falkowski et al.,
2008; Frias-Lopez et al., 2008; Hunt et al., 2008].
This trend is reflected in the strong interest in studying
the diversity and expression patterns of “functional
genes” in the environment, such as those encoding
antibiotic resistance and virulence [Castiglioni et al.,
2008; Manning et al., 2008], photosynthesis [Yutin and
Beja, 2005], or enzymes involved in the nitrogen cycle
[Smith et al., 2007; Zehr et al., 2003], to mention a
few. Furthermore, multilocus sequence analysis (MLSA)
and typing (MLST) of protein-coding genes are the
new standards in molecular systematics [Edwards, 2009;
Gevers et al., 2005; Vinuesa, 2010] and molecular epidemiology [Maiden, 2006; Smith et al., 2009]. However,
it still remains a major challenge to design optimal PCR
primers to specifically amplify CDSs from target lineages
directly from environmental DNA samples or from
novel organisms. Primers4clades is a publicly available
web server that uses phylogenetic trees to aid in the
design of lineage-specific degenerate PCR primers for the
above-mentioned purposes, taking into account both the
protein and the corresponding codon multiple sequence
alignments [Contreras-Moreira et al., 2009].
The major advantages of using a lineage-specific
gene-targeted approach to metagenomics lie in the high
sampling coverage and long sequence reads that can be
achieved by sequencing a relatively low number of clones
using classic Sanger sequencing technology. Furthermore,
amplification biases derived from sequence composition
heterogeneity [Polz and Cavanaugh, 1998] are reduced
due to the relatively homogeneous composition of the
sequences targeted with lineage-specific PCR primers.
Their lower degeneracy also contributes to a reduction in
amplification biases.
Here we present an empirical validation study
that demonstrates the highly specific amplification of
Mycobacterium spp. rpoB sequences from complex
metagenomic DNA samples gained from two tropical
Mexican forest soils: (i) the humid evergreen forest
reserve of Los Tuxtlas, Veracruz, and (ii) a conserved
patch of seasonally dry deciduous forest in the Biosphere
Handbook of Molecular Microbial Ecology, Volume I: Metagenomics and Complementary Approaches, First Edition. Edited by Frans J. de Bruijn.
© 2011 Wiley-Blackwell. Published 2011 by John Wiley & Sons, Inc.
Reserve of Sierra de Huautla, Morelos. Rarefaction curves
for a random subsample of 50 clones of each library are
compared with rarefaction curves for a 16S rRNA-based
library of 98 clones obtained from Amazonian pasture
and forest soils using universal primers [Borneman
and Triplett, 1997], which illustrates the usefulness of
the lineage-specific metagenomics analysis approach to
improve sampling coverage and statistical power.
51.2 METHODS
51.2.1 Implementation, Input Data
Processing, and Overview of the
Two Run Modes Using the Server’s
Actinobacteriales Demonstration
Dataset
Primers4clades (primers for clades) is an easy-to-use
web server developed for researchers interested in the
design of PCR primers for cross-species amplification
of novel sequences from metagenomic DNA or from
uncharacterized organisms belonging to user-specified
phylogenetic clades or taxa. Degenerate PCR primers
are derived from conserved motifs in protein multiple
sequence alignments using the CODEHOP primer design
strategy described below, but considering also the
underlying codon alignments to obtain what we call
“corrected CODEHOP” primer formulations. This is one
of the diverse components of the extended CODEHOP
algorithm that the server implements [Contreras-Moreira
et al., 2009].
Primers4clades was mainly written in Perl and uses
several BioPerl modules [Stajich et al., 2002] along with
the open source software cited below to perform the different computations involved in the extended CODEHOP
algorithm. The input for the server is a set of homologous protein-coding genes in FASTA format and in frame
+1, which may be aligned or not, with or without introns
(Fig. 51.1).
We will demonstrate the usage of the software by
choosing the server’s Actinobacteriales dataset, one of the
three provided as tutorial material, which can be selected
by clicking on the “demo” button. Although no registration is required, we recommend that users provide an
email address, which allows access for one week to
the analyses stored on the server and also avoids the potential
loss of results caused by browser timeouts during
the often long computations (see the FAQ section in the
online tutorial for details). Note that the taxon names
(Latin binomials) should be placed within brackets, which
is how they appear in the headers of FASTA-formatted sequences
downloaded from GenBank, as shown in Figure 51.1.
This is required for the server to parse the taxon
names in order to select the corresponding codon usage
tables, as explained below. The server excises introns if
their coordinates are indicated in the FASTA header (see
the server’s documentation and the fungal alpha-tubulins
demo dataset), collapses redundant sequences to haplotypes, translates the CDSs with user-selected translation
tables, and aligns them using Muscle [Edgar, 2004]. The
alignment step is skipped if the server detects that the
uploaded DNA sequences are already aligned. The protein alignment is projected on the underlying DNA sequences to
compute the corresponding codon alignment, along with
maximum likelihood (ML) distance matrices from the protein (WAG+G) or the codon (HKY85+G) alignments,
using Tree-Puzzle [Schmidt et al., 2002]. If the server is
run in the “advanced” and interactive “cluster sequences”
mode, the above-mentioned distance matrices are used to
compute and display a neighbor-joining (NJ) tree with
“neighbor” from the PHYLIP package [Felsenstein, 2004],
based either on the protein or codon alignments. The
user can then select a clade from the displayed NJ tree
to target the primer design toward its sequence members
(Fig. 51.2). Note that the first sequence in the alignment
will be used to root the tree. In the default, noninteractive
“get primers” run mode, all the uploaded sequences will
be considered to compute the primer formulations.
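The input-processing steps described above can be illustrated with a minimal Python sketch. It is not the server's Perl code: the function names and the example header are hypothetical, and only three of the steps are shown, namely reading the bracketed taxon name from a FASTA header, collapsing identical sequences to haplotypes, and projecting a gapped protein alignment row onto its in-frame coding sequence to obtain the codon alignment.

```python
import re


def taxon_from_header(header):
    """Return the text inside the last [...] of a FASTA header, or None."""
    hits = re.findall(r"\[([^\[\]]+)\]", header)
    return hits[-1].strip() if hits else None


def collapse_haplotypes(records):
    """Map each distinct sequence to the identifiers that share it."""
    haplotypes = {}
    for name, seq in records:
        haplotypes.setdefault(seq.upper(), []).append(name)
    return haplotypes


def project_codon_alignment(aligned_protein, cds):
    """Thread an unaligned, in-frame CDS onto its gapped protein row."""
    codons, pos = [], 0
    for residue in aligned_protein:
        if residue == "-":
            codons.append("---")          # one protein gap = three DNA gaps
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    # Hypothetical header and toy sequences, for illustration only.
    print(taxon_from_header(">seq1 rpoB [Mycobacterium tuberculosis]"))
    print(collapse_haplotypes([("a", "ATGAAA"), ("b", "atgaaa"), ("c", "ATGAAG")]))
    print(project_codon_alignment("M-K", "ATGAAA"))
```

Working from the protein alignment and threading the codons back onto it keeps every gap a multiple of three nucleotides, so the reading frame is preserved in the codon alignment.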
51.2.2 The Standard and Extended
CODEHOP Primer Design Strategies
Primers4clades implements an extended and fully
automated CODEHOP (Consensus Degenerate Hybrid
Oligonucleotide Primer) design strategy, which is based
on both DNA and protein multiple sequence alignments
of protein-coding sequences [Contreras-Moreira et al.,
2009]. The standard CODEHOP design strategy was first
published by Rose et al. [1998, 2003]. It represents a
novel PCR primer design method devised to amplify
distantly related members of viral protein families. The
primers are derived from conserved amino acid BLOCKS
within a protein multiple sequence alignment that do not
contain gaps. Each hybrid primer consists of (a) a short
(11–12 nucleotides long) 3′ degenerate core region that
binds to the codons encoding 3–4 highly conserved
amino acids and (b) a longer 5′ consensus clamp region
that contains the most abundant codon for each amino
acid as predicted on the basis of a user-selected codon
usage table (CUT). The basic structure of a CODEHOP
can be seen from link #1 provided in the Internet
Resources section. Amplification initiates by annealing
and extension of primers in the pool with the highest
similarity in the 3′ core region, which confers the
specificity of the amplification, and their stabilization by
the consensus clamp, which only partially matches the
target template. Because all primers are identical at their
5′ consensus clamp region, they will all anneal at high
Figure 51.1 Data input and analysis customizations
page of the primers4clades server. Three demo
datasets are provided for rapid testing of the server’s
interface and capabilities. The tutorial presented in this
chapter will focus on the actinobacterial dataset and
the advanced, interactive “cluster sequences” run
mode. The selected “cluster distance metric” option
indicates to the server that the reference NJ tree should
be computed from the protein sequence alignment
generated from the uploaded coding sequences using
NCBI’s translation table number 11. The user can also
change the preferred size range for the amplicons, Tm
of the consensus clamp, and whether the phylogenetic
information content of the resulting amplicon sets
should be evaluated at the DNA or protein levels
under a variety of substitution models.
Figure 51.2 Reference NJ tree displayed on the
first output page generated by primers4clades run in
the interactive “cluster sequences” mode using the
actinobacterial demonstration dataset. (A) Labeled NJ
tree computed from the protein alignment of the input
dataset. (B) Cluster selection panel based on the labels
displayed on the NJ tree. Hitting the re-cluster button
parses the alignment to use only the selected
sequences. Hitting the get primers button starts the
primer design and evaluation steps.
stringency during subsequent rounds of amplification.
This increases the efficiency of the PCR amplification,
making it in principle more efficient than PCR with
standard consensus or degenerate oligonucleotides and
allowing the amplification of divergent sequences of
protein families. If no oligonucleotide formulations are
returned by the server, or these are very long, it may be
necessary to decrease the Tm of the consensus clamp (set
to 55 °C), as explained below. Primers4clades automatically uses the CUTs for all species identified in the input
dataset for which a CUT is available at the Codon Usage
Database [Nakamura et al., 2000]. It also computes an
alignment-specific CUT on the fly. Taking into account
several nonredundant CUTs notably increases the number
and quality of the primers found for a given dataset,
as shown by our genome-scale benchmark analysis
results reported in the original primers4clades publication
[Contreras-Moreira et al., 2009]. This represents a notable
performance improvement of our extended CODEHOP
algorithm with respect to that implemented in the original
CODEHOP and recent iCODEHOP servers (links #2 and
#3 of Internet Resources).
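The hybrid structure of a CODEHOP can be made concrete with a minimal Python sketch. This is an illustration under stated assumptions, not the CODEHOP or primers4clades implementation: the function names are hypothetical, the codon usage table is simply counted from the supplied coding sequences (in the spirit of the alignment-specific CUT), the core is fixed at four residues, and the degenerate core is built position by position from all synonymous codons, which can over-represent six-codon amino acids. The example peptide ENLFFKEKR is the translation of the forward primer sequence listed in Table 51.1; the toy coding sequence is invented.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in NCBI order (first base varies slowest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
    "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
    "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}.items()}


def synonymous_codons(aa):
    return [codon for codon, residue in CODON_TO_AA.items() if residue == aa]


def codon_usage(cds_list):
    """Count codons in in-frame coding sequences (a dataset-specific CUT)."""
    usage = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("-", "")
        usage.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return usage


def consensus_clamp(peptide, usage):
    """Most frequently used codon per residue, uppercase (ties: first codon)."""
    return "".join(
        max(synonymous_codons(aa), key=lambda c: usage.get(c, 0))
        for aa in peptide
    )


def degenerate_core(peptide):
    """Position-wise IUPAC codes covering all codons of each residue, lowercase."""
    core = []
    for aa in peptide:
        codons = synonymous_codons(aa)
        for pos in range(3):
            core.append(IUPAC[frozenset(c[pos] for c in codons)])
    return "".join(core).lower()


def codehop(block, usage, core_residues=4):
    """5' consensus clamp followed by a 3' degenerate core."""
    clamp_part, core_part = block[:-core_residues], block[-core_residues:]
    return consensus_clamp(clamp_part, usage) + degenerate_core(core_part)


if __name__ == "__main__":
    usage = codon_usage(["GAGAACCTGTTCTTCAAGGAGAAGCGC"])  # toy CDS
    print(codehop("ENLFFKEKR", usage))
```

The uppercase clamp depends on the codon usage table supplied, which is why evaluating several nonredundant CUTs can yield different candidate primers for the same protein block.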
51.2.3 Customization of the
Primers4clades Server Behavior
As mentioned above, the server can be run either in
the basic “get primers” or in the advanced “cluster
sequences” modes. However, further customization of
the server’s behavior may be required for optimal primer
design in either mode. For demonstration purposes, we
will use the advanced and interactive “cluster sequences”
run mode. Clicking on the “customize settings” button
opens a window with self-explanatory options (Fig. 51.1).
First of all, the input DNA protein-coding sequence can
be translated by any of the current NCBI translation
tables, with translation table 11 (the bacterial, archaeal, and plant plastid code)
being the correct choice for our actinobacterial demonstration data. Next, the user can select whether to work
at the DNA or protein levels to estimate a first reference
NJ tree using the “cluster metric” option. As shown
in Figure 51.1, we selected the default protein choice.
Next, the user can choose a desired size range for the
amplicons, as well as the Tm for the consensus clamp.
Lastly, we choose the HKY85 + G DNA substitution
model for the evaluation of the phylogenetic information
content of the different amplicon sets, as explained
below. Please read the server’s FAQ section of the online
documentation if you need some background information
on selecting DNA substitution models.
A unique feature of the primers4clades server is its
ability to compute a robust phylogenetic information content parameter for each theoretical amplicon set, which
is based on a recently developed Shimodaira–Hasegawa
(SH)-like test for the significance of branches in a
maximum likelihood tree, originally implemented in
PhyML v2.4.5 [Anisimova and Gascuel, 2006]. In
brief, the test assesses whether the branch being studied
provides a significant likelihood gain, in comparison
with the null hypothesis that involves collapsing that
branch, but leaving the rest of the tree topology identical.
The resulting SH-like branch support values therefore
indicate the probability that the corresponding split is
significant. The phylogenetic information content of
each amplicon set is calculated as the mean and median
SH-like branch support values for the corresponding
maximum likelihood tree inferred under the user-specified
substitution model or matrix. As a rule of thumb, the
phylogenetic information content of the targeted sequence
region increases as branch support values approach 1.
This strategy of calculating the phylogenetic information
content of different sequence alignments was recently
developed by Vinuesa et al. [2008].
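The summary statistic described above is simple to reproduce once a tree with branch support values is available. The following Python sketch assumes a Newick string in which support values are written as internal node labels directly after each closing parenthesis; the function names and the example tree are hypothetical, and the sketch does not compute the SH-like test itself.

```python
import re
from statistics import mean, median


def branch_supports(newick):
    """Collect numeric labels that directly follow a closing parenthesis."""
    return [float(x) for x in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]


def information_content(newick):
    """Return (mean, median) of the branch support values in the tree."""
    supports = branch_supports(newick)
    if not supports:
        raise ValueError("no branch support values found in tree")
    return mean(supports), median(supports)


if __name__ == "__main__":
    # Toy tree with three supported internal branches (illustration only).
    tree = "((A:0.10,B:0.20)0.95:0.05,(C:0.10,D:0.10)0.80:0.02,(E:0.3,F:0.2)1.00:0.1);"
    print(information_content(tree))
```

Reporting both the mean and the median is useful because a few poorly supported branches can pull the mean down while leaving the median nearly unchanged.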
Once all customization choices have been properly
set, we are ready to hit the submit button to let the server
start its work. After a few seconds, we get a new page displaying the NJ phylogeny (see Fig. 51.2A) based on the
translation products of the actinobacterial rpoB sequences
with some summary information of the alignment and distance matrix written on top of it (not shown). Notice
that the two Corynebacterium sequences form an outgroup clade to the target Mycobacterium spp. ingroup
clade on which we will now instruct the server to focus the
search for PCR primers. To do so, we provide the cluster
boundary identification numbers of the target clade in the
corresponding box, as shown in Figure 51.2B. This can
be simply done by copying and pasting the numbers from
the displayed tree that demarcate the target clade, separating them with a comma without spaces. In our example,
this is 002_0,013_0. After hitting the re-cluster button a
new page will be presented, displaying the same tree but
only with the target sequences selected. We could still at
this point exclude further sequences or define a different
subcluster for primer design, but in our example we are
now ready to hit the “get primers” button.
51.2.4 Results Returned to the
User by the Primers4clades Server
A very useful feature of primers4clades is that it returns
a nonredundant set of primer pair formulations, ranked
according to their thermodynamic properties (the theoretically best primers are displayed first). The server checks
that the resulting amplicon sets for the primer pairs do not
overlap more than 80%, ensuring a high coverage of the
target locus, but filtering excessive redundancy, as shown
on the amplicon distribution maps generated by the server
after some seconds or minutes (Fig. 51.3), depending on
Figure 51.3 Amplicon distribution map for the actinobacterial demonstration dataset generated using the parameters shown in Figure 51.1 for
the selected strains as shown in Fig. 51.2. The positions of the different amplicons are mapped on the first sequence of the input alignment at
the protein level (horizontal axis from 0 to 1.1k; 41 amplicon bars, each labeled with its rank and a quality value between 50% and 100%).
A quality value is assigned to each amplicon, with 100% being best. This value is computed based on thermodynamic
properties of the primer pair (see text for the details). The amplicon distribution map and all primer pair formulations along with their
thermodynamic properties can be downloaded from the server.
the size of the alignment and number of primer pairs
found. During the process of primer pair calculation, the
server displays information about the redundancy among
the codon usage tables (CUTs) for each taxon and the
number of primer pairs found for each nonredundant CUT
(see the section on “Technical Information” of the server’s
online documentation page for a detailed explanation of
how it computes the nonredundant CUTs).
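The redundancy filter mentioned above (amplicons may not overlap by more than 80%) can be sketched as a greedy selection over quality-ranked amplicons. The Python code below is an assumption-laden illustration, not the server's algorithm: the function names are hypothetical, and overlap is measured here relative to the shorter of the two amplicons, which is only one possible definition.

```python
def overlap_fraction(a, b):
    """Shared length divided by the length of the shorter interval.

    a and b are (start, end) coordinates with end > start.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0])
    shortest = min(a[1] - a[0], b[1] - b[0])
    return max(shared, 0) / shortest


def nonredundant(amplicons, max_overlap=0.8):
    """Keep amplicons in order of decreasing quality, skipping redundant ones.

    amplicons: list of (name, start, end, quality) tuples.
    """
    kept = []
    for amp in sorted(amplicons, key=lambda x: -x[3]):
        if all(overlap_fraction(amp[1:3], k[1:3]) <= max_overlap for k in kept):
            kept.append(amp)
    return kept


if __name__ == "__main__":
    # Toy coordinates and quality scores (illustration only).
    candidates = [("p1", 0, 500, 100), ("p2", 50, 520, 90), ("p3", 400, 900, 80)]
    print([name for name, *_ in nonredundant(candidates)])
```

In this toy example p2 is dropped because it overlaps the better-scoring p1 by more than 80%, while p3 is kept because it covers a largely different part of the locus.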
The amplicon distribution map can be downloaded
from the server as well as a TAB-delimited text file containing all the primer pair formulations, amplicon size, and
a comprehensive set of their thermodynamic properties
such as degeneracy, Tm , hairpin-formation potential, and
cross-hybridization potential computed with subroutines
borrowed from the Amplicon software [Jarman, 2004].
The quality % value shown on top of each bar representing an amplicon indicates how “good” the corresponding
primer pairs are, based on the above-mentioned thermodynamic properties. A 100% score indicates that none of
the property values is worse than a predetermined cutoff
value for each parameter that we have empirically chosen
based on the evaluation of dozens of well-working primer
pairs. Note also that the bars representing amplicons are
color-coded according to their corresponding primer pair
quality scores.
In addition to the standard CODEHOP, the
primers4clades server computes three additional primer
formulations: the corrected CODEHOP, relaxed corrected
CODEHOP, and fully degenerate oligonucleotides.
The first two are based on the original CODEHOP
formulation, but adjust its degeneracy
by taking into account the sequences of the underlying
codon alignment, as shown in Figure 51.4. The relaxed
corrected CODEHOP extends the 3′ core region of the
corrected CODEHOP toward the 5′ end until it reaches
a degeneracy level of 24. If the corrected CODEHOP
already has a degeneracy equal to or greater than 24, then
these two formulations will be identical. The fully
degenerate oligonucleotide formulation is computed
to reflect the full nucleotide sequence variation of the
targeted binding site. Table 1 of the server’s online
documentation page summarizes the recommended uses
of the four oligonucleotide formulations computed by
the server. We recommend the use of the corrected
CODEHOP formulation for most purposes.
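To illustrate how a fully degenerate formulation reflects the nucleotide variation of a binding site, the following Python sketch derives an IUPAC consensus from aligned target sites and computes its degeneracy as the product of the number of bases allowed at each position. The function names and the three toy target sites are hypothetical, and the product convention for degeneracy is an assumption about how such values are usually defined.

```python
from math import prod

IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
    "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
    "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SET_TO_CODE = {frozenset(v): k for k, v in IUPAC_SETS.items()}


def fully_degenerate(sites):
    """IUPAC consensus covering every base seen in each alignment column."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("target sites must be aligned to the same length")
    columns = zip(*(s.upper() for s in sites))
    return "".join(SET_TO_CODE[frozenset(col)] for col in columns)


def degeneracy(oligo):
    """Number of distinct sequences represented by an IUPAC oligonucleotide."""
    return prod(len(IUPAC_SETS[base]) for base in oligo.upper())


if __name__ == "__main__":
    # Toy gap-free target sites (illustration only).
    sites = ["TGACCCTCCCA", "TGGCCTTCCCA", "TGACCTTCCCA"]
    oligo = fully_degenerate(sites)
    print(oligo, degeneracy(oligo))
```

Applied to the reverse primer listed in Table 51.1 (CGTCCTCGTAGTTGtgrccytccca), `degeneracy` returns 4, in agreement with the table. For the nondegenerate forward primer the product convention gives 1, whereas the table lists 0, so the server appears to use a different convention for oligonucleotides without degenerate positions.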
After displaying the amplicon distribution map,
the server enters its second computationally intensive phase.
Maximum likelihood phylogenies (in our case study using
the HKY85+G model) will be estimated from each of
the theoretical amplicon sets displayed on the amplicon
distribution map. It is from these phylogenies that the
phylogenetic information content of the amplicons is
calculated based on Shimodaira–Hasegawa-like branch
support values [Anisimova and Gascuel, 2006; Guindon
et al., 2009], as we have described recently [Vinuesa,
2010; Vinuesa et al., 2008]. The best scoring amplicons
(based on the quality parameter explained above) are
evaluated first.
Figure 51.4 Example primer formulation output panel (only the reverse formulation of the first primer pair returned by the server is shown)
for the actinobacterial example dataset. The server returns the standard CODEHOP (bold) and three codon-based primer formulations, aligned
with the underlying codons. An ‘!’ sign denotes positions corrected by the system based on the codon alignment. Also shown are the summary
outputs for key primer pair and corresponding amplicon characteristics, as well as the associated phylogenetic information content of the
theoretical amplicon sets (see the text for the details).
After the server is done with the ML search for the
first amplicon set, the corresponding primer pairs are
shown on screen, aligned with the underlying codons
(Fig. 51.4). A header line indicates the codon usage
table used to calculate the corresponding primer pair,
in this case a data-specific one computed from the
input alignment (not shown on Fig. 51.4). The first line
after the header shows the sequence of the CODEHOP
formulation, as calculated by the original CODEHOP
program, showing also the coordinates of the forward
(fw) and reverse (rev) oligos on the original (full) protein
alignment. The lowercase letters on the 3′ end of the oligo
correspond to the degenerate portion of the CODEHOP,
while the uppercase residues toward the 5′ end correspond
to the consensus clamp. The lines below the CODEHOP
formulation are the corresponding target sites at the DNA
level. This codon alignment is used to calculate the
corrected CODEHOP formulation. The ! and ? characters just above the corrected CODEHOP formulation
line indicate the discrepancies between the CODEHOP
formulation and the codon alignment that are considered
to formulate the corrected CODEHOP sequence, the !
indicating nondegenerate sites, and the ? highlighting
degenerate sites (Fig. 51.4). The . symbol indicates
matches between the original CODEHOP formulation and
the codon alignment. The relaxed corrected CODEHOP
formulation and the fully degenerated oligonucleotide
formulations are also displayed (Fig. 51.4).
A primer pair quality summary report is displayed
on screen after each primer pair, based on computations
performed on the corrected CODEHOP formulations. The
first line provides a summary score of the “primer quality” in thermodynamic terms for each primer pair (range
100% to 0%, best to worst). The next three lines indicate
the expected amplicon length in bp and the computed Tm
ranges for the forward and reverse corrected CODEHOP formulations
(Fig. 51.4). If the quality score is <100%, this means that
at least one oligo has a thermodynamic parameter value
worse than the cutoff values used by the server (these
values are presented in Table 2 of the server’s online documentation page).
Lastly, we get a summary of the phylogenetic evaluation of the corresponding amplicon (Fig. 51.4), which
indicates the substitution matrix/model used for the evaluation, and the value of the corresponding shape parameter
(alpha). The next line is very useful for users interested in
using the new molecular markers for phylogenetics, since
it provides the mean and median Shimodaira–Hasegawa-like bipartition support values computed by the PhyML
maximum likelihood algorithm. The closer to 1, the higher
the resolution level of the tree inferred from the theoretical amplicon set and therefore the higher its phylogenetic
information content. A FASTA-formatted file containing
the aligned amplicon sets can be downloaded (in this case
at the protein level). The ML trees for each amplicon can
be either directly visualized or downloaded.
51.2.5 Purification of Soil DNA,
rpoB Amplification, Clone Library
Construction, and Analysis
Ten soil samples of approximately 100 g were randomly
sampled within a 10-m × 10-m area in both the tropical
humid forest Reserve of Los Tuxtlas, Estación Biológica
de la UNAM, Veracruz, Mexico (18°34′23.24″ N,
95°9′40.10″ W), and a conserved patch of seasonally
dry deciduous tropical forest near “El Limón” in the
Biosphere Reserve of Sierra de Huautla, Morelos,
Mexico (18°32′33.20″ N, 98°56′11.19″ W). The 10 soil
samples from each site were pooled, mixed, and
sieved (2 mm) to obtain two composite samples. A 2-g
aliquot from each of the composite samples was used
for DNA extraction using the MoBio PowerMax Soil
DNA Extraction Kit. DNA integrity was checked on an
agarose gel, and the lack of strong PCR inhibitors was
confirmed by amplification of 16S rRNA genes using the
universal fD1 and rD1 primers [Weisburg et al., 1991].
The rpoB amplicons were generated using a hot-start
protocol and Taq-platinum (Invitrogen), with annealing
at 60 °C, and cloned using the PCR Topo cloning system
2.1 (Invitrogen). Plasmids were purified using a 96-well
plate format and commercially sequenced at LANGEBIO,
Mexico, using the Sanger method at 2× coverage with
the M13f and M13rev primers that bind at vector positions flanking its cloning site. The two sequence reads for
each plasmid were automatically assembled and validated
by using a custom Perl script that runs Phred/Phrap
[Ewing and Green, 1998; Ewing et al., 1998] followed
by a BlastX [Altschul et al., 1997] search of each contig
against a locally formatted rpoB sequence database of
Actinobacteria assembled from fully sequenced genomes
of these bacteria. Only sequences aligning globally with
entries from the database were further processed by
subjecting them to a sequence-to-profile alignment using
ClustalW 2.1 [Larkin et al., 2007]. The reference profile
consists of a codon alignment of 25 nonredundant rpoB
sequences of reference Mycobacterium strains, trimmed
to the segment corresponding to the expected amplicon.
The resulting multiple sequence alignment was further
corrected manually.
Phylogenetic analyses were performed with PhyML
v. 3.0 [Guindon et al., 2009], as previously described
[Vinuesa, 2010; Vinuesa et al., 2008]. All community
ecology analyses of the environmental rpoB-amplicon
libraries were performed with Mothur version 1.7
[Schloss et al., 2009].
51.3 RESULTS
51.3.1 Design and Evaluation of
the Specificity of Mycobacterium
spp. PCR Primers Targeting the RNA
Polymerase Beta Subunit (rpoB)
Gene Using Soil Metagenomic DNA
Using the primers4clades server and selected reference
sequences (a few more than those provided at the server’s
actinobacterial demonstration sequence set), we designed
the primer pair shown in Table 51.1. These primers were
used to generate PCR amplicons from metagenomic DNA
purified from the two contrasting tropical forest soils in
Mexico (humid evergreen forest at Los Tuxtlas, Veracruz;
deciduous seasonally dry forest from Sierra de Huautla,
Morelos), as described in Section 51.2.5.
Specific amplification products could be obtained
for both soils, as shown in Figure 10 of the server’s
online documentation page. We sequenced 50 clones
from each of the two amplicon libraries to evaluate
Table 51.1 Basic Thermodynamic Parameters and Sequence Formulation of the Corrected Codehop Primers Used in
This Study to Amplify Mycobacterium-Specific Sequences from Two Contrasting Soil Samples

Name        L(a)  Sequence                    Tm    Corrected      Full           h_pot(d)  c_pot(e)
                                                    Degeneracy(b)  Degeneracy(c)
rpoB2_N326  26    GAGAACCTGTTCTTCaaggagaagcg  63.3  0              0              0.52      0.50
rpoB2_C757  25    CGTCCTCGTAGTTGtgrccytccca   66.6  4              4              0.00      0.50

(a) Oligonucleotide length.
(b) Degeneracy of the corrected codehop oligonucleotide (see technical details in the server’s tutorial page).
(c) Full degeneracy of the target alignment site (see technical details in the server’s tutorial page).
(d) Hairpin formation potential.
(e) Cross-hybridization potential.
Figure 51.5 Maximum likelihood phylogeny under the GTR+G+I substitution model for 32 reference sequences [Corynebacterium (7),
Mycobacterium (17), Nocardia (1), Rhodococcus (3), Saccharopolyspora (1), Salinispora (2), and Tsukamurella (1)] and 100 soil clones (50
from the Reserve of Los Tuxtlas, labeled as TUX; 50 from the Sierra de Huautla Reserve, labeled as REB). For the sake of clarity, the
non-Mycobacterium sequence clades are collapsed. Eighty-nine percent of the soil clones clustered within the Mycobacterium spp. clade,
demonstrating the high selectivity of the rpoB primers shown in Table 51.1 to target this group of organisms in complex metagenomic DNA
samples.
the specificity of the oligonucleotides in a phylogenetic
context. Figure 51.5 shows a maximum likelihood phylogeny (GTR+G) of the 100 environmental sequences
along with 15 reference outgroup strains in the genera
Corynebacterium (7), Nocardia (1), Rhodococcus (3),
Saccharopolyspora (1), Salinispora (2), and Tsukamurella
(1), which are shown as collapsed clades in Figure 51.5,
and 17 Mycobacterium spp. ingroup reference strains.
From this phylogeny we could unambiguously define the
perfectly supported Mycobacterium spp. clade, which
clustered 89 of the environmental clones. The remaining
11 environmental sequences clustered with the closely
related actinobacterial genera used as outgroups.
In summary, 89% of the environmental rpoB amplicons could be unambiguously assigned to the genus
Mycobacterium, indicating that the primers have a high
specificity for the target genus. Only ∼5% of the environmental Mycobacterium sequences had suspiciously
long branches, which might reflect sequence artifacts
such as chimeras, heteroduplexes, or Taq errors.
51.3.2 Statistical Evaluation of the
Sampling Effort Achieved by the
Lineage-Targeted Approach
A recent review on the use of rpoB sequences in
Mycobacterium molecular systematics suggests that
97.7–98.2% similarity cutoff values for rpoB sequences
could be used to delimit species within the genus
[Adekambi et al., 2009]. However, in this study we use a
slightly more conservative cutoff value (3% divergence)
to define operational taxonomic units (OTUs; approximately equivalent to species) for our Mycobacterium spp.
diversity study.
Figures 51.6A and 51.6B show the rarefaction
curves [Hughes et al., 2001] generated for the 50
Mycobacterium spp. and closely related actinobacterial
sequences obtained from both rpoB soil clone libraries
at different cutoff values. Notice that for both soils
the rarefaction curves are close to reaching an asymptote
at the 3% divergence OTU definition, indicating that
the sequencing effort has been reasonably good. As
a reference point, Figure 51.6C shows the rarefaction
curves at the same cutoff values for a classical 16S rRNA
gene clone library made from a Brazilian tropical soil
using universal primers [Borneman and Triplett, 1997].
These curves indicate that the 98 clones sequenced in that
study are far from being representative of the diversity
at the species level, which is classically defined by a
cutoff level of 97–98% of sequence similarity (2–3%
divergence). Figures 51.6D and 51.6E show collector
curves for the two libraries using the nonparametric
Chao1 richness estimator [Hughes et al., 2001], which
indicate that the Mycobacterium community in the seasonally
dry deciduous forest of El Limón is more diverse (19
predicted species, with a 95% credibility interval 14–46)
than that found in the humid tropical forest of Los Tuxtlas
(12 predicted species, with a 95% credibility interval
12–32). Interestingly, not a single OTU is shared between
the communities at this cutoff level, indicating a strongly
differentiated species composition of the two communities,
which together comprise 31 OTUs. The Chao1 collector
curves shown in Figures 51.6D and 51.6E indicate that
the sampling effort has not been sufficient to evaluate
the diversity at the two lower cutoff values (unique and 0.01), since these
curves have not stabilized [Hughes et al., 2001], whereas those for
the 0.03 and 0.05 cutoff values seem to have stabilized.
Therefore, inferences at the 0.03 and 0.05 cutoff levels can be considered robust.
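The two statistics discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis pipeline used in this study. Rarefaction is computed analytically with Hurlbert's formula, and Chao1 uses the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and twice. The function names and the OTU abundance vector are hypothetical.

```python
from collections import Counter
from math import comb


def rarefy(otu_counts, n):
    """Expected number of OTUs in a random subsample of n sequences (Hurlbert's formula)."""
    total = sum(otu_counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the number of sequences")
    # math.comb returns 0 when total - c < n, i.e. the OTU is certain to be sampled.
    return sum(1 - comb(total - c, n) / comb(total, n) for c in otu_counts)


def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from OTU abundances."""
    s_obs = sum(1 for c in otu_counts if c > 0)
    f1 = sum(1 for c in otu_counts if c == 1)  # singletons
    f2 = sum(1 for c in otu_counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def collector_curve(assignments, step=10):
    """Chao1 estimate after the first k sequences, for k = step, 2*step, ...

    `assignments` lists the OTU label of each sequence in sampling order."""
    points = []
    for k in range(step, len(assignments) + 1, step):
        counts = list(Counter(assignments[:k]).values())
        points.append((k, chao1(counts)))
    return points


# Hypothetical abundances for 9 OTUs in a 50-clone library; NOT data from this study.
toy_counts = [20, 10, 8, 5, 2, 2, 1, 1, 1]
toy_assignments = [otu for otu, c in enumerate(toy_counts) for _ in range(c)]

print("rarefied richness at n = 25:", round(rarefy(toy_counts, 25), 2))
print("Chao1 estimate:", chao1(toy_counts))
print("collector curve:", collector_curve(toy_assignments))
```

A rarefaction curve that flattens as n approaches the library size, together with a collector curve whose Chao1 estimate stops changing as clones are added, is what the text above reads as an adequate sampling effort. In practice the sampling order would be randomized and the collector curve averaged over many random orderings.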
51.4 DISCUSSION
Primers4clades is currently the only publicly available
server that integrates alternative primer-design strategies
with phylogenetic trees to (a) interactively target the
search for oligonucleotide formulations to specific
sequence clusters and (b) evaluate the phylogenetic
information content of the new molecular markers.
Together, these features are very useful to make an
informed choice among alternative, nonredundant primer
pair formulations, taking into account both the thermodynamic properties of the primers and the phylogenetic
information content of the expected amplicon sets. These
attributes make primers4clades a novel and powerful
tool for the targeted design and informed selection of
PCR primers for lineage-targeted metagenomic studies
of microbial communities, as demonstrated herein for
environmental mycobacteria.
The genus Mycobacterium currently has 147 validly
described species (see link #4 in Internet resources),
most of which are poorly characterized ubiquitous
environmental mycobacteria (EM) [Bland et al., 2005;
Jacobs et al., 2009]. These are known as “nontuberculous” or “atypical” mycobacteria among clinicians [van
Ingen et al., 2009]. The most “famous” mycobacterial
species are M. tuberculosis and M. leprae, obligate
parasites and the causative agents of tuberculosis (TB)
and leprosy, respectively. M. avium, M. abscessus, M.
chelonae, M. fortuitum, M. intracellulare, M. kansasii,
M. mucogenicum, M. phocaicum, M. scrofulaceum, M.
ulcerans, and M. vanbaalenii are among the most common nontuberculous or atypical mycobacterial species
causing opportunistic infections in immunocompromised
humans and have a broad environmental distribution
[Falkinham, 2009; Primm et al., 2004]. Understanding
the diversity, abundance, and distribution of EM is of
prime interest from a biomedical standpoint because
immune responses following exposure to these bacteria
may influence susceptibility to tuberculosis and leprosy
and may severely interfere with the proliferation and
protective effect of the attenuated BCG vaccine used
against TB [Black et al., 2001; Weir et al., 2006]. Their
functional role in ecosystems remains largely unknown
despite their environmental ubiquity.
The results presented in this study demonstrate that
a good sampling effort can be achieved for the genus
Mycobacterium with the primers described in Table 51.1
by sequencing as few as 50 clones per site using the
lineage-targeted approach to gene-based metagenomic
analyses of microbial communities. Further sequencing
and statistical analysis of these and additional libraries
(>200 clones each) generated with the mentioned primers
from soil [Sachman-Ruiz and Vinuesa, 2011] and water
samples [Sachman-Ruiz et al., 2009] demonstrated that
sequencing around 200 clones provides very strong
statistical power, making it possible for the first time to draw
fine-grained inferences about the richness and structure
of mycobacterial communities in diverse natural environments. Furthermore, the latter studies have revealed the
magnitude of the bias in diversity estimation introduced
by classical Mycobacterium isolation procedures and
have also revealed that we have a strongly distorted
notion of the diversity of this species-rich bacterial genus
of great environmental and clinical relevance.
Figure 51.6 Rarefaction curves (A–C) and collector curves for the nonparametric Chao1 richness estimator (D, E) at four clustering levels
(0.0–0.05 P distance; curves labeled unique, 0.01, 0.03, and 0.05) for OTU definition. (A, D) Sierra de Huautla dataset (n = 50). (B, E) Reserva de los
Tuxtlas dataset (n = 50). (C) Amazonian soils dataset (n = 98), shown for comparative purposes: rarefaction curves for 98 SSU rRNA Amazonian soil
clones generated with universal primers from the classical study by Borneman and Triplett [1997]. In all panels the x-axis is the number of
sequences sampled and the y-axis is the number of OTUs.
These results demonstrate the utility and strong statistical power of the lineage-targeted approach to microbial community ecology and show that primers4clades
[Contreras-Moreira et al., 2009] (link #5 in the Internet Resources) is a useful tool to develop primers for
such gene-targeted metagenomic analyses of microbial
diversity. The development of the tool is now coupled
to its recent implementation in a phylogenomics analysis pipeline to construct an interactive primer database
for phylogenetic clades at different taxonomic and phylogenetic depths. The graphical interface, analysis options,
and parameter evaluation procedures will be improved,
extended, and refined in future versions, which will also
include faster ML algorithms.
INTERNET RESOURCES
Link 1: CODEHOP structure (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC168931/figure/gkg524f1/)
Link 2: CODEHOP server (http://blocks.fhcrc.org/codehop.html)
Link 3: iCODEHOP server (https://icodehop.cphi.washington.edu/i-codehop-context/iCODEHOP/view/PrimerAnalysis)
Link 4: Mycobacterium taxonomy (http://www.bacterio.cict.fr/m/mycobacterium.html)
Link 5: Primers4clades web server (http://maya.ccg.unam.mx/primers4clades)
Acknowledgments
Romualdo Zayas-Laguna and Víctor del Moral from the
Information Technology Administration Unit at CCG-UNAM are acknowledged for their technical support. This
work was supported by DGAPA/PAPIIT-UNAM [grant
number IN201806-2], CONACyT-Mexico [grant number
P1-60071], and by Consejo Superior de Investigaciones
Científicas [grant number 200720I038]. UNAM’s Ph.D.
Program in Biomedical Sciences and CONACyT-Mexico
are acknowledged for the financial support offered to BSR
for his Ph.D. studies.
REFERENCES
Adekambi T, Drancourt M, Raoult D. 2009. The rpoB gene as a
tool for clinical microbiologists. Trends Microbiol. 17(1):37–45.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al.
1997. Gapped BLAST and PSI-BLAST: A new generation of protein
database search programs. Nucleic Acids Res. 25:3389–3402.
Anisimova M, Gascuel O. 2006. Approximate likelihood-ratio test
for branches: A fast, accurate, and powerful alternative. Syst. Biol.
55(4):539–552.
Black GF, Dockrell HM, Crampin AC, Floyd S, Weir RE,
et al. 2001. Patterns and implications of naturally acquired immune
responses to environmental and tuberculous mycobacterial antigens
in northern Malawi. J. Infect. Dis. 184(3):322–329.
Bland CS, Ireland JM, Lozano E, Alvarez ME, Primm TP. 2005.
Mycobacterial ecology of the Rio Grande. Appl. Environ. Microbiol.
71(10):5719–5727.
Borneman J, Triplett EW. 1997. Molecular microbial diversity in
soils from eastern Amazonia: Evidence for unusual microorganisms
and microbial population shifts associated with deforestation. Appl.
Environ. Microbiol. 63(7):2647–2653.
Castiglioni S, Pomati F, Miller K, Burns BP, Zuccato E,
et al. 2008. Novel homologs of the multiple resistance regulator marA in antibiotic-contaminated environments. Water Res.
42(16):4271–4280.
Contreras-Moreira B, Sachman-Ruiz B, Figueroa-Palacios I,
Vinuesa P. 2009. Primers4clades: A web server that uses phylogenetic trees to design lineage-specific PCR primers for metagenomic and diversity studies. Nucleic Acids Res. 37(Web Server
issue):W95–W100.
Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.
Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63(1):1–19.
Ewing B, Green P. 1998. Base-calling of automated sequencer traces
using phred. II. Error probabilities. Genome Res. 8(3):186–194.
Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome
Res. 8(3):175–185.
Falkinham JO, 3rd. 2009. Surrounded by mycobacteria: Nontuberculous mycobacteria in the human environment. J. Appl. Microbiol.
107(2):356–367.
Falkowski PG, Fenchel T, Delong EF. 2008. The microbial engines that drive Earth’s biogeochemical cycles. Science
320(5879):1034–1039.
Felsenstein J. 2004. PHYLIP (Phylogeny Inference Package), 3.6 ed.
Distributed by the author. Seattle: Department of Genetics, University
of Washington.
Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, et al.
2008. Microbial community gene expression in ocean surface waters.
Proc. Natl. Acad. Sci. USA 105(10):3805–3810.
Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, et al.
2005. Opinion: Reevaluating prokaryotic species. Nat. Rev. Microbiol. 3(9):733–739.
Guindon S, Delsuc F, Dufayard JF, Gascuel O. 2009. Estimating
maximum likelihood phylogenies with PhyML. Methods Mol. Biol.
537:113–137.
Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJ. 2001. Counting the uncountable: Statistical approaches to estimating microbial
diversity. Appl. Environ. Microbiol. 67(10):4399–4406.
Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. 2008.
Resource partitioning and sympatric differentiation among closely
related bacterioplankton. Science 320(5879):1081–1085.
Jacobs J, Rhodes M, Sturgis B, Wood B. 2009. Influence of environmental gradients on the abundance and distribution of Mycobacterium spp. in a coastal lagoon estuary. Appl. Environ. Microbiol.
75(23):7378–7384.
Jarman SN. 2004. Amplicon: Software for designing PCR primers on
aligned DNA sequences. Bioinformatics 20(10):1644–1645.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan
PA, et al. 2007. Clustal W and Clustal X version 2.0. Bioinformatics
23(21):2947–2948.
Maiden MC. 2006. Multilocus sequence typing of bacteria. Annu. Rev.
Microbiol. 60:561–588.
Manning SD, Motiwala AS, Springman AC, Qi W, Lacher DW,
et al. 2008. Variation in virulence among clades of Escherichia coli
O157:H7 associated with disease outbreaks. Proc. Natl. Acad. Sci.
USA 105(12):4868–4873.
Nakamura Y, Gojobori T, Ikemura T. 2000. Codon usage tabulated
from international DNA sequence databases: Status for the year 2000.
Nucleic Acids Res. 28(1):292.
Polz MF, Cavanaugh CM. 1998. Bias in template-to-product ratios
in multitemplate PCR. Appl. Environ. Microbiol. 64(10):3724–3730.
Primm TP, Lucero CA, Falkinham JO, 3rd. 2004. Health impacts of
environmental mycobacteria. Clin. Microbiol. Rev. 17(1):98–106.
Rose TM, Schultz ER, Henikoff JG, Pietrokovski S, McCallum CM, et al. 1998. Consensus-degenerate hybrid oligonucleotide
primers for amplification of distantly related sequences. Nucleic Acids
Res. 26(7):1628–1635.
Rose TM, Henikoff JG, Henikoff S. 2003. CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primer) PCR primer design.
Nucleic Acids Res. 31(13):3763–3766.
Sachman-Ruiz B, Vinuesa P. 2011. In preparation.
Sachman-Ruiz B, Castillo-Rodal AI, López-Vidal Y, Martínez-Romero E, Vinuesa P. 2009. Diversity of environmental mycobacteria in Mexican rivers assessed by cultivation and metagenomics
approaches. Poster abstract U-007. 109th General Meeting of the
American Society for Microbiology. May 17–21, 2009. Philadelphia.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M,
et al. 2009. Introducing mothur: Open-source, platform-independent,
community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75(23):7537–7541.
Schmidt HA, Strimmer K, Vingron M, von Haeseler A. 2002.
TREE-PUZZLE: Maximum likelihood phylogenetic analysis using
quartets and parallel computing. Bioinformatics 18(3):502–504.
Smith CJ, Nedwell DB, Dong LF, Osborn AM. 2007. Diversity
and abundance of nitrate reductase genes (narG and napA), nitrite
reductase genes (nirS and nrfA), and their transcripts in estuarine
sediments. Appl. Environ. Microbiol. 73(11):3612–3622.
Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al.
2009. Origins and evolutionary genomics of the 2009 swine-origin
H1N1 influenza A epidemic. Nature 459(7250):1122–1125.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, et al.
2002. The Bioperl toolkit: Perl modules for the life sciences. Genome
Res. 12(10):1611–1618.
van Ingen J, Boeree MJ, Dekhuijzen PN, van Soolingen D. 2009.
Environmental sources of rapid growing nontuberculous mycobacteria
causing disease in humans. Clin. Microbiol. Infect. 15(10):888–893.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D,
et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304(5667):66–74.
Vinuesa P. 2010. Multilocus sequence analysis and bacterial species
phylogeny estimation. In Oren A, Papke RT, eds. Molecular Phylogeny of Microorganisms. Norwich, UK: Caister Academic Press,
pp. 000–000.
Vinuesa P, Rojas-Jimenez K, Contreras-Moreira B, Mahna SK,
Prasad BN, et al. 2008. Multilocus sequence analysis for assessment
of the biogeography and evolutionary genetics of four Bradyrhizobium
species that nodulate soybeans on the Asiatic continent. Appl. Environ.
Microbiol. 74(22):6987–6996.
Weir RE, Black GF, Nazareth B, Floyd S, Stenson S, et al. 2006.
The influence of previous exposure to environmental mycobacteria on
the interferon-gamma response to bacille Calmette–Guerin vaccination in southern England and northern Malawi. Clin. Exp. Immunol.
146(3):390–399.
Weisburg WG, Barns SM, Pelletier DA, Lane DJ. 1991. 16S
ribosomal DNA amplification for phylogenetic study. J. Bacteriol.
173:697–703.
Yutin N, Beja O. 2005. Putative novel photosynthetic reaction centre
organizations in marine aerobic anoxygenic photosynthetic bacteria:
Insights from metagenomics and environmental genomics. Environ.
Microbiol. 7(12):2027–2033.
Zehr JP, Jenkins BD, Short SM, Steward GF. 2003. Nitrogenase
gene diversity and microbial community structure: A cross-system
comparison. Environ. Microbiol. 5(7):539–554.