Primers4clades, A Web Server to Design Lineage
Transcription
Primers4clades, A Web Server to Design Lineage
de Bruijn Vol. I Chapter c51.tex V1 - 02/08/2011 5:16 P.M. Page 441 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers for Gene-Targeted Metagenomics Bernardo Sachman-Ruiz, Bruno Contreras-Moreira, Enrique Zozaya, Cristina Martı́nez-Garza, and Pablo Vinuesa 51.1 INTRODUCTION This chapter presents a practical overview and case study of the usage and capabilities of the primers4clades version 1.0 web server [Contreras-Moreira et al., 2009], a software tool useful to design degenerate PCR primers for gene-targeted analysis of metagenomic DNA (see also Chapter 28, Vol. I) and multilocus sequence analysis. Although direct shotgun sequencing of community DNA is now feasible [Venter et al., 2004], polymerase chain reaction (PCR) remains the most widely used technology to gain molecular markers for studies in molecular ecology and systematics. With the ongoing accumulation of fully sequenced genomes in public sequence databases, the focus of microbial ecology studies is increasingly shifting to the analysis of protein coding genes and sequences (CDSs) to understand ecological, metabolic, and evolutionary processes in nature [Falkowski et al., 2008; Frias-Lopez et al., 2008; Hunt et al., 2008]. This trend is reflected in the huge interest of studying the diversity and expression patterns of “functional genes” in the environment, such as those encoding for antibiotic resistance and virulence [Castiglioni et al., 2008; Manning et al., 2008], photosynthesis [Yutin and Beja, 2005], or enzymes involved in the nitrogen cycle [Smith et al., 2007; Zehr et al., 2003], to mention a few. Furthermore, multilocus sequence analysis (MLSA) and typing (MLST) of protein-coding genes are the new standards in molecular systematics [Edwards, 2009; Gevers et al., 2005; Vinuesa, 2010] and molecular epidemiology [Maiden, 2006; Smith et al., 2009]. However, it still remains a major challenge to design optimal PCR primers to specifically amplify CDSs from target lineages directly from environmental DNA samples or from novel organisms. Primers4clades is a publicly available web server that uses phylogenetic trees to aid in the design of lineage-specific degenerate PCR primers for the above-mentioned purposes, which takes into account both protein and the corresponding codon multiple sequence alignments [Contreras-Moreira et al., 2009]. The major advantages of using a lineage-specific gene-targeted approach to metagenomics lie in the high sampling coverage and long sequence reads that can be achieved by sequencing a relatively low number of clones using classic Sanger sequencing technology. Furthermore, amplification biases derived from sequence composition heterogeneity [Polz and Cavanaugh, 1998] are reduced due to the relatively homogeneous composition of the sequences targeted with lineage-specific PCR primers. Their lower degeneracy also contributes to a reduction in amplification biases. Here we present an empirical validation study that demonstrates the highly specific amplification of Mycobacterium spp. rpoB sequences from complex metagenomic DNA samples gained from two tropical Mexican forest soils: (i) the humid evergreen forest reserve of Los Tuxtlas, Veracruz, and (ii) a conserved patch of seasonally dry deciduous forest in the Biosphere Handbook of Molecular Microbial Ecology, Volume I: Metagenomics and Complementary Approaches, First Edition. Edited by Frans J. de Bruijn. © 2011 Wiley-Blackwell. Published 2011 by John Wiley & Sons, Inc. 441 de Bruijn Vol. I Q1 442 V1 - 02/08/2011 5:16 P.M. Page 442 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers Reserve of Sierra de Huautla, Morelos. Rarefaction curves for a random subsample of 50 clones of each library are compared with rarefaction curves for a 16S rRNA-based library of 98 clones obtained from Amazonian pasture and forest soils using universal primers [Borneman and Triplett, 1997], which illustrates the usefulness of the lineage-specific metagenomics analysis approach to improve sampling coverage and statistical power. 51.2 c51.tex METHODS 51.2.1 Implementation, Input Data Processing, and Overview of the Two Run Modes Using the Server’s Actinobacteriales Demonstration Dataset Primers4clades (primers for clades) is an easy-to-use web server developed for researchers interested in the design of PCR primers for cross-species amplification of novel sequences from metagenomic DNA or from uncharacterized organisms belonging to user-specified phylogenetic clades or taxa. Degenerate PCR primers are derived from conserved motifs in protein multiple sequence alignments using the CODEHOP primer design strategy described below, but considering also the underlying codon alignments to obtain what we call “corrected CODEHOP” primer formulations. This is one of the diverse components of the extended CODEHOP algorithm that the server implements [Contreras-Moreira et al., 2009]. Primers4clades was mainly written in Perl and uses several BioPerl modules [Stajich et al., 2002] along with the open source software cited below to perform the different computations involved in the extended CODEHOP algorithm. The input for the server is a set of homologous protein-coding genes in FASTA format and in frame +1, which may be aligned or not, with or without introns (Fig. 51.1). We will demonstrate the usage of the software by choosing the server’s Actinobacteriales dataset, one of the three provided as tutorial material, which can be selected by clicking on the “demo” button. Although no registration is required, we recommend that the user provide his email address, which will allow access for one week to the analyses stored on the server and also avoids potential loss of results due to eventual browser timeout events due to often long computation times (see FAQ section in the online tutorial for the details). Note that the taxon names (Latin binomials) should be placed within brackets, which is how the header from FASTA-formatted sequences are downloaded from GenBank, as shown in Figure 51.1. This is required for the server to parse the taxon names in order to select the corresponding codon usage tables, as explained below. The server excises introns if their coordinates are indicated in the FASTA header (see the server’s documentation and the fungal alpha-tubulins demo dataset), collapses redundant sequences to haplotypes, translates the CDSs with user-selected translation tables, and aligns them using Muscle [Edgar, 2004]. The alignment step is skipped, if the server detects that the uploaded DNA sequences are aligned. The protein alignment is projected on the underlying DNA sequences to compute the corresponding codon alignment, along with maximum likelihood (ML) distance matrices from the protein (WAG+G) or the codon (HKY85+G) alignments, using Tree-Puzzle [Schmidt et al., 2002]. If the server is run in the “advanced” and interactive “cluster sequences” mode, the above-mentioned distance matrices are used to compute and display a neighbor-joining (NJ) tree with “neighbor” from the PHYLIP package [Felsenstein, 2004], based either on the protein or codon alignments. The user can then select a clade from the displayed NJ tree to target the primer design toward its sequence members (Fig. 51.2). Note that the first sequence in the alignment will be used to root the tree. In the default, noninteractive “get primers” run mode, all the uploaded sequences will be considered to compute the primer formulations. 51.2.2 The Standard and Extended CODEHOP Primer Design Strategies Primers4clades implements an extended and fully automated CODEHOP (Consensus Degenerate Hybrid Oligonucleotide Primer) design strategy, which is based on both DNA and protein multiple sequence alignments of protein-coding sequences [Contreras-Moreira et al., 2009]. The standard CODEHOP design strategy was first published by Rose et al. [1998, 2003]. It represents a novel PCR primer design method devised to amplify distantly related members of viral protein families. The primers are derived from conserved amino acid BLOCKS within a protein multiple sequence alignment that don’t contain gaps. Each hybrid primer consists of (a) a short (11–12 nucleotides long) 3 degenerate core region that binds to the codons encoding 3–4 highly conserved amino acids and (b) a longer 5 consensus clamp region that contains the most abundant codon for each amino acid as predicted on the basis of a user-selected codon usage table (CUT). The basic structure of a CODEHOP can be seen from the link # 1 provided in the Internet Resources section. Amplification initiates by annealing and extension of primers in the pool with the highest similarity in the 3 core region, which confers the specificity of the amplification, and their stabilization by the consensus clamp, which only partially matches the target template. Because all primers are identical at their 5 consensus clamp region, they will all anneal at high de Bruijn Vol. I 51.2 Methods c51.tex V1 - 02/08/2011 5:16 P.M. Page 443 443 Figure 51.1 Data input and analysis customizations page of the primers4clades server. Three demo datasets are provided for rapid testing of the server’s interface and capabilities. The tutorial presented in this chapter will focus on the actinobacterial dataset and the advanced, interactive “cluster sequences” run mode. The selected “cluster distance metric” option indicates to the server that the reference NJ tree should be computed from the protein sequence alignment generated from the uploaded coding sequences using NCBI’s translation table number 11. The user can also change the preferred size range for the amplicons, Tm of the consensus clamp, and whether the phylogenetic information content of the resulting amplicon sets should be evaluated at the DNA or protein levels under a variety of substitution models. (A) Figure 51.2 Reference NJ tree displayed on the (B) first output page generated by primers4clades run in the interactive “cluster sequences” mode using the actinobacterial demonstration dataset. (A) Labeled NJ tree computed from the protein alignment of the input dataset. (B) Cluster selection panel based on the labels displayed on the NJ tree. Hitting the re-cluster button parses the alignment to use only the selected sequences. Hitting the get primers button starts the primer design and evaluation steps. de Bruijn Vol. I 444 c51.tex V1 - 02/08/2011 5:16 P.M. Page 444 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers stringency during subsequent rounds of amplification. This increases the efficiency of the PCR amplification, making it in principle more efficient than PCR with standard consensus or degenerate oligonucleotides and allowing the amplification of divergent sequences of protein families. If no oligonucleotide formulations are returned by the server, or these are very long, it may be necessary to decrease the Tm of the consensus clamp (set to 55◦ C), as explained below. Primers4clades automatically uses the CUTs for all species identified in the input dataset for which a CUT is available at the Codon Usage Database [Nakamura et al., 2000]. It also computes an alignment-specific CUT on the fly. Taking into account several nonredundant CUTs notably increases the amount and quality of the primers found for a given dataset, as shown by our genome-scale benchmark analysis results reported in the original primers4clades publication [Contreras-Moreira et al., 2009]. This represents a notable performance improvement of our extended CODEHOP algorithm with respect to that implemented in the original CODEHOP and recent iCODEHOP servers (links #2 and #3 of Internet Resources). 51.2.3 Customization of the Primers4clades Server Behavior As mentioned above, the server can be run either in the basic “get primers” or in the advanced “cluster sequences” modes. However, further customization of the server’s behavior may be required for optimal primer design in either mode. For demonstration purposes, we will use the advanced and interactive “cluster sequences” run mode. Clicking on the “customize settings” button opens a window with self-explanatory options (Fig. 51.1). First of all, the input DNA protein-coding sequence can be translated by any of the current NCBI translation tables, with the “bacterial plastid” translation table 11 being the correct choice for our actinobacterial demonstration data. Next, the user can select whether to work at the DNA or protein levels to estimate a first reference NJ tree using the “cluster metric” option. As shown in Figure 51.1, we selected the default protein choice. Next, the user can choose a desired size range for the amplicons, as well as the Tm for the consensus clamp. Lastly, we choose the HKY85 + G DNA substitution model for the evaluation of the phylogenetic information content of the different amplicon sets, as explained below. Please read the servers FAQ section of the online documentation if you need some background information on selecting DNA substitution models. A unique feature of the primers4clades server is its ability to compute a robust phylogenetic information content parameter for each theoretical amplicon set, which is based on a recently developed Shimodaira–Hasegawa (SH)-like test for the significance of branches in a maximum likelihood tree, originally implemented in PhyML v2.4.5 [Anisimova and Gascuel, 2006]. In brief, the test assesses whether the branch being studied provides a significant likelihood gain, in comparison with the null hypothesis that involves collapsing that branch, but leaving the rest of the tree topology identical. The resulting SH-like branch support values therefore indicate the probability that the corresponding split is significant. The phylogenetic information content of each amplicon set is calculated as the mean and median SH-like branch support values for the corresponding maximum likelihood tree inferred under the user-specified substitution model or matrix. As a rule of thumb, the phylogenetic information content of the targeted sequence region increases as branch support values approach to 1. This strategy of calculating the phylogenetic information content of different sequence alignments was recently developed by Vinuesa et al. [2008]. Once all customization choices have been properly set, we are ready to hit the submit button to let the server start its work. After a few seconds, we get a new page displaying the NJ phylogeny (see Fig. 51.2A) based on the translation products of the actinobacterial rpoB sequences with some summary information of the alignment and distance matrix written on top of it (not shown). Notice that the two Corynebacterium sequences form an outgroup clade to the target Mycobacterium spp. ingroup clade on which we will now instruct the server to focus the search for PCR primers. To do so, we provide the cluster boundary identification numbers of the target clade in the corresponding box, as shown in Figure 51.2B. This can be simply done by copying and pasting the numbers from the displayed tree that demarcate the target clade, separating them with a coma without spaces. In our example, this is 002_0,013_0. After hitting the re-cluster button a new page will be presented, displaying the same tree but only with the target sequences selected. We could still at this point exclude further sequences or define a different subcluster for primer design, but in our example we are now ready to hit the “get-primers button.” 51.2.4 Results Returned to the User by the Primers4clades Server A very useful feature of primers4clades is that it returns a nonredundant set of primer pair formulations, ranked according to their thermodynamic properties (the theoretically best primers are displayed first). The server checks that the resulting amplicon sets for the primer pairs do not overlap more than 80%, ensuring a high coverage of the target locus, but filtering excessive redundancy, as shown on the amplicon distribution maps generated by the server after some seconds or minutes (Fig. 51.3), depending on de Bruijn Vol. I c51.tex V1 - 02/08/2011 5:16 P.M. Page 445 445 51.2 Methods 0k 0.2k 0.1k 39 (quality = 60%) 0.3k 0.4k 0.5k 19 (quality = 90%) 33 (quality = 70%) 40 (quality = 50%) 0.6k 0.7k 0.8k 0.9k 12 (quality = 100%) 22 (quality = 90%) 4 (quality = 100%) 29 (quality = 80%) 41 (quality = 50%) 37 (quality = 70%) 26 (quality = 90%) 1 (quality = 100%) 38 (quality = 70%) 17 (quality = 100%) 25 (quality = 90%) 35 (quality = 70%) 8 (quality = 100%) 7 (quality = 100%) 20 (quality = 90%) 36 (quality = 70%) 5 (quality = 100%) 3 (quality = 100%) 1.1k 6 (quality = 100%) 21 (quality = 90%) 32 (quality = 80%) 1k 31 (quality = 80%) 14 (quality = 100%) 30 (quality = 80%) 11 (quality = 100%) 10 (quality = 100%) 18 (quality = 100%) 16 (quality = 100%) 13 (quality = 100%) 9 (quality = 100%) 15 (quality = 100%) 27 (quality = 90%) 23 (quality = 90%) 2 (quality = 100%) 28 (quality = 80%) 34 (quality = 70%) 24 (quality = 90%) Figure 51.3 Amplicon distribution map for the actinobacterial demonstration dataset generated using the parameters shown in Figure 51.1 for the selected strains as shown in Fig. 51.2. The positions of the different amplicons are mapped on the first sequence of the input alignment at the protein level. A quality value is assigned to each amplicon, with 100% being best. This value is computed based on thermodynamic properties of the primer pair (see text for the details). The amplicon distribution map and all primer pair formulations along with their thermodynamic properties can be downloaded from the server. the size of the alignment and number of primer pairs found. During the process of primer pair calculation, the server displays information about the redundancy among the codon usage tables (CUTs) for each taxon and the number of primer pairs found for each nonredundant CUT (see the section on “Technical Information” of the server’s online documentation page for a detailed explanation of how it computes the nonredundant CUTs). The amplicon distribution map can be downloaded from the server as well as a TAB-delimited text file containing all the primer pair formulations, amplicon size, and a comprehensive set of their thermodynamic properties such as degeneracy, Tm , hairpin-formation potential, and cross-hybridization potential computed with subroutines borrowed from the Amplicon software [Jarman, 2004]. The quality % value shown on top of each bar representing an amplicon indicates how “good” the corresponding primer pairs are, based on the above-mentioned thermodynamic properties. A 100% score indicates that none of the property values is worse than a predetermined cutoff value for each parameter that we have empirically chosen based on the evaluation of dozens of well-working primer pairs. Note also that the bars representing amplicons are color-coded according to their corresponding primer pair quality scores. In addition to the standard CODEHOP, the primers4clades server computes three additional primer formulations: the corrected CODEHOP, relaxed corrected CODEHOP, and fully degenerate oligonucleotides. The former two are based on the original CODEHOP formulation, but adjusting the degeneracy of the latter by taking into account the sequences of the underlying codon alignment, as shown in Figure 51.4. The relaxed corrected CODEHOP extends the 3 core region of the corrected CODEHOP toward the 5 end until it reaches a degeneracy level of 24. If the corrected CODEHOP already has a degeneracy equal or greater than 24, then the latter two formulations will be identical. The fully degenerated oligonucleotide formulation is computed to reflect the full nucleotide sequence variation of the targeted binding site. Table 1 of the server’s online documentation page summarizes the recommended uses of the four oligonucleotide formulations computed by the server. We recommend the use of the corrected CODEHOP formulation for most purposes. After displaying the amplicon distribution map, the server enters the second computing intensive phase. Maximum likelihood phylogenies (in our case study using the HKY85+G model) will be estimated from each of the theoretical amplicon sets displayed on the amplicon distribution map. It is from these phylogenies that the phylogenetic information content of the amplicons is calculated based on Shimodaira–Hasegawa-like branch support values [Anisimova and Gascuel, 2006; Guindon et al., 2009], as we have described recently [Vinuesa, 2010; Vinuesa et al., 2008]. The best scoring amplicons (based on the quality parameter explained above) are evaluated first. de Bruijn Vol. I 446 c51.tex V1 - 02/08/2011 5:16 P.M. Page 446 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers Figure 51.4 Example primer formulation output panel (only the reverse formulation of the first primer pair returned by the server is shown) for the actinobacterial example dataset. The server returns the standard CODEHOP (bold) and three codon-based primer formulations, aligned with the underlying codons. An ‘!’ sign denotes positions corrected by the system based on the codon alignment. Also shown are the summary outputs for key primer pair and corresponding amplicon characteristics, as well as the associated phylogenetic information content of the theoretical amplicon sets (see the text for the details). After the server is done with the ML search for the first amplicon set, the corresponding primer pairs are shown on screen, aligned with the underlying codons (Fig. 51.4). A header line indicates the codon usage table used to calculate the corresponding primer pair, in this case a data-specific one computed from the input alignment (not shown on Fig. 51.4). The first line after the header shows the sequence of the CODEHOP formulation, as calculated by the original CODEHOP program, showing also the coordinates of the forward (fw) and reverse (rev) oligos on the original (full) protein alignment. The lowercase letters on the 3 end of the oligo correspond to the degenerated portion of the CODEHOP, while the uppercase residues toward the 5 end correspond to the consensus clamp. The lines below the CODEHOP formulation are the corresponding target sites at the DNA level. This codon alignment is used to calculate the corrected CODEHOP formulation. The ! and ? characters just above the corrected CODEHOP formulation line indicate the discrepancies between the CODEHOP formulation and the codon alignment that are considered to formulate the corrected CODEHOP sequence, the ! indicating nondegenerate sites, and the ? highlighting degenerate sites (Fig. 51.4). The . symbol indicates matches between the original CODEHOP formulation and the codon alignment. The relaxed corrected CODEHOP formulation and the fully degenerated oligonucleotide formulations are also displayed (Fig. 51.4). A primer pair quality summary report is displayed on screen after each primer pair, based on computations performed on the corrected CODEHOP formulations. The first line provides a summary score of the “primer quality” in thermodynamic terms for each primer pair (range 100% to 0%, best to worst). The next three lines indicate the expected amplicon length in bps and the computed Tm ranges for the fw- and rev-corrected codehop formulations (Fig. 51.4). If the quality score is <100%, this means that at least one oligo has a thermodynamic parameter value worse than the cutoff values used by the server (these values are presented in Table 2 of the server’s online documentation page). Lastly, we get a summary of the phylogenetic evaluation of the corresponding amplicon (Fig. 51.4), which indicates the substitution matrix/model used for the evaluation, and the value of the corresponding shape parameter (alpha). The next line is very useful for users interested in de Bruijn Vol. I c51.tex V1 - 02/08/2011 5:16 P.M. Page 447 447 51.3 Results using the new molecular markers for phylogenetics, since it provides the mean and median Shimodaira–Hasegawalike bipartition support values computed by the PhyML maximum likelihood algorithm. The closer to 1, the higher the resolution level of the tree inferred from the theoretical amplicon set and therefore the higher its phylogenetic information content. A FASTA-formatted file containing the aligned amplicon sets can be downloaded (in this case at the protein level). The ML trees for each amplicon can be either directly visualized or downloaded. 51.2.5 Purification of Soil DNA, rpoB Amplification, Clone Library Construction, and Analysis Ten soil samples of approximately 100 g were randomly sampled within a 10-m × 10-m area in both the tropical humid forest Reserve of Los Tuxtlas, Estación Biológica de la UNAM, Veracruz, Mexico (18◦ 34 23.24 N, 95◦ 9 40.10 O), and a conserved patch of seasonally dry deciduous tropical forest near “El Limón” in the Biosphere Reserve of Sierra de Huautla, Morelos, Mexico (18◦ 32 33.20 N, 98◦ 56 11.19 O). The 10 soil samples were from each site were pooled, mixed, and sieved (2 mm) to obtain two composite samples. A 2-g aliquot from each of the composite samples was used for DNA extraction using the MoBio PowerMax Soil DNA Extraction Kit. DNA integrity was checked on an agarose gel, and the lack of strong PCR inhibitors was confirmed by amplification of 16S rRNA genes using the universal fD1 and rD1 primers [Weisburg et al., 1991]. The rpoB amplicons were generated using a hot-start protocol and Taq-platinum (Invitrogen), with annealing at 60◦ C, and cloned using the PCR Topo cloning system 2.1 (Invitrogen). Plasmids were purified using a 96-well plate format and commercially sequenced at LANGEBIO, Mexico, using the Sanger method at 2 × coverage with the M13f and M13rev primers that bind at vector positions flanking its cloning site. The two sequence reads for each plasmid were automatically assembled and validated by using a custom Perl script that runs Phred/Phrap [Ewing and Green, 1998; Ewing et al., 1998] followed by a BlastX [Altschul et al., 1997] search of each contig against a locally formatted rpoB sequence database of Actinobacteria assembled from fully sequenced genomes of these bacteria. Only sequences aligning globally with entries from the database were further processed by subjecting them to a sequence-to-profile alignment using Clustalw 2.1 [Larkin et al., 2007]. The reference profile consists of a codon alignment of 25 nonredundant rpoB sequences of reference Mycobacterium strains, trimmed to the segment corresponding to the expected amplicon. The resulting multiple sequence alignment was further corrected manually. Phylogenetic analyses were performed with PhyML v. 3.0 [Guindon et al., 2009], as previously described [Vinuesa, 2010; Vinuesa et al., 2008]. All community ecology analyses of the environmental rpoB -amplicon libraries were performed with Mothur version 1.7 [Schloss et al., 2009]. 51.3 RESULTS 51.3.1 Design and Evaluation of the Specificity of Mycobacterium spp. PCR Primers Targeting the RNA Polymerase Beta Subunit (rpoB ) Gene Using Soil Metagenomic DNA Using the primers4clades server and selected reference sequences (a few more than those provided at the server’s actinobacterial demonstration sequence set), we designed the primer pair shown in Table 51.1. These primers were used to generate PCR amplicons from metagenomic DNA purified from the two contrasting tropical forest soils in Mexico (humid evergreen forest at Los Tuxtlas, Veracruz; deciduous seasonally dry forest from Sierra de Huautla, Morelos), as described in Materials and Methods. Specific amplification products could be obtained for both soils, as shown in Figure 10 of the server’s online documentation page. We sequenced 50 clones from each of the two amplicon libraries to evaluate Table 51.1 Basic Thermodynamic Parameters and Sequence Formulation of the Corrected Codehop Primers Used in This Study to Amplify Mycobacterium-Specific Sequences from Two Contrasting Soil Samples Name La Sequence Tm Corrected Degeneracyb Full Degeneracyc h_potd c_pote rpoB2_N326 rpoB2_C757 26 25 GAGAACCTGTTCTTCaaggagaagcg CGTCCTCGTAGTTGtgrccytccca 63.3 66.6 0 4 0 4 0.52 0.00 0.50 0.50 a Oligonucleotide length. Degeneracy of the corrected codehop oligonucleotide (see technical details in the server’s tutorial page). c Full degeneracy of the target alignment site (see technical details in the server’s tutorial page). d Hairpin formation potential. e Cross-hybridization potential. b de Bruijn Vol. I 448 c51.tex V1 - 02/08/2011 5:16 P.M. Page 448 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers Figure 51.5 Maximum likelihood phylogeny under the GTR+G+I substitution model for 32 reference sequences [Corynebacterium (7), Mycobacterium (17), Nocardia (1), Rhodococcus (3), Saccharopolyspora (1), Salinispora (2), and Tsukamurella (1)] and 100 soil clones (50 from the Reserve of Los Tuxtlas, labeled as TUX; 50 from the Sierra de Huautla Reserve, labeled as REB). For the sake of clarity, the non-Mycobacterium sequence clades are collapsed. Eighty-nine percent of the soil clones clustered within the Mycobacterium spp. clade, demonstrating the high selectivity of the rpoB primers shown in Table 51.1 to target this group of organisms in complex metagenomic DNA samples. the specificity of the oligonucleotides in a phylogenetic context. Figure 51.5 shows a maximum likelihood phylogeny (GTR+G) of the 100 environmental sequences along with 15 reference outgroup strains in the genera Corynebacterium (7), Nocardia (1), Rhodococcus (3), Saccharopolyspora (1), Salinispora (2), and Tsukamurella (1), which are shown as collapsed clades in Figure 51.5, and 17 Mycobacterium spp. ingroup reference strains. From this phylogeny we could unambiguously define the perfectly supported Mycobacterium spp. clade, which clustered 89 of the environmental clones. The remaining 11 environmental sequences clustered with the closely related actinobacterial genera used as outgroups. In summary, 89% of the environmental rpoB amplicons could be unambiguously assigned to the genus Mycobacterium, indicating that the primers have a high specificity for the target genus. Only ∼5% of the environmental Mycobacterium sequences had suspiciously long branches, which might reflect sequence artifacts such as chimeras, heteroduplexes, or Taq errors. de Bruijn Vol. I 51.4 Discussion 51.3.2 Statistical Evaluation of the Sampling Effort Achieved by the Lineage-Targeted Approach A recent review on the use of rpoB sequences in Mycobacterium molecular systematics suggests that 97.7–98.2% similarity cutoff values for rpoB sequences could be used to delimit species within the genus [Adekambi et al., 2009]. However, in this study we use a slightly more conservative cutoff value (3% divergence) to define operational taxonomic units (OTUs; approximately equivalent to species) for our Mycobacterium spp. diversity study. Figures 51.6A and 51.6B show the rarefaction curves [Hughes et al., 2001] generated for the 50 Mycobacterium spp. and closely related actinobacterial sequences obtained from both rpoB soil clone libraries at different cutoff values. Notice that for both soils the rarefaction curves are close to reach an asymptote at the 3% divergence OTU definition, indicating that the sequencing effort has been reasonably good. As a reference point, Figure 51.6C shows the rarefaction curves at the same cutoff values for a classical 16S rRNA gene clone library made from a Brazilian tropical soil using universal primers [Borneman and Triplett, 1997]. These curves indicate that the 98 clones sequenced in that study are far from being representative of the diversity at the species level, which is classically defined by a cutoff level of 97–98% of sequence similarity (2–3% divergence). Figures 51.6D and 51.6E show collector curves for the two libraries using the nonparametric Chao1 richness estimator [Hughes et al., 2001], which indicate that the Mycobacterium community at seasonally dry deciduous forest of El Limón is more diverse (19 predicted species, with a 95% credibility interval 14–46) than that found in the humid tropical forest of Los Tuxtlas (12 predicted species, with a 95% credibility interval 12–32). Interestingly, not a single OTU is shared among the communities at this cutoff level, indicating a strongly differentiated species composition of both communities, which together bring about 31 OTUs. The Chao1 collector curves shown in Figures 51.6D and 51.6E indicate that the sampling effort has not been sufficient to evaluate the diversity at the two lower cutoff values since they have not stabilized [Hughes et al., 2001], but those for the 0.03 and 0.05 cutoff values seem to have stabilized. Therefore, inferences at these cutoff levels are robust. 51.4 DISCUSSION Primers4clades is currently the only publicly available server that integrates alternative primer-design strategies c51.tex V1 - 02/08/2011 5:16 P.M. Page 449 449 with phylogenetic trees to (a) interactively target the search for oligonucleotide formulations to specific sequence clusters and (b) evaluate the phylogenetic information content of the new molecular markers. Together, these features are very useful to make an informed choice among alternative, nonredundant primer pair formulations, taking into account both the thermodynamic properties of the primers and the phylogenetic information content of the expected amplicon sets. These attributes make of primers4clades a novel and powerful tool for the targeted design and informed selection of PCR primers for lineage-targeted metagenomic studies of microbial communities, as demonstrated herein for environmental mycobacteria. The genus Mycobacterium currently has 147 validly described species (see link #4 in Internet resources), most of which are poorly characterized ubiquitous environmental mycobacteria (EM) [Bland et al., 2005; Jacobs et al., 2009]. These are known as “nontuberculous” or “atypical” mycobacteria among clinicians [van Ingen et al., 2009]. The most “famous” mycobacterial species are M. tuberculosis and M. leprae, obligate parasites and the causative agents of tuberculosis (TB) and leprosy, respectively. M. avium, M. abscessus, M. chelonae, M. fortuitum, M. intracellulare, M. kansasii, M. mucogenicum, M. phocaicum, M. scrophulaceum, M. ulcerans, and M. vanbalenii are among the most common nontuberculous or atypical mycobacterial species causing opportunistic infections in immunocompromised humans and have a broad environmental distribution [Falkinham, 2009; Primm et al., 2004]. Understanding the diversity, abundance, and distribution of EM is of prime interest from a biomedical standpoint because immune responses following exposure to these bacteria may influence susceptibility to tuberculosis and leprosy and may severely interfere with the proliferation and protective effect of the attenuated BCG vaccine used against TB [Black et al., 2001; Weir et al., 2006]. Their functional role in ecosystems remains largely unknown despite their environmental ubiquity. The results presented in this study demonstrate that a good sampling effort can be achieved for the genus Mycobacterium with the primers described in Table 51.1 by sequencing as few as 50 clones per site using the lineage-targeted approach to gene-based metagenomic analyses of microbial communities. Further sequencing and statistical analysis of these and additional libraries (>200 clones each) generated with the mentioned primers from soil [Sachman-Ruiz and Vinuesa, 2011] and water samples [Sachman-Ruiz et al., 2009] demonstrated that sequencing around 200 clones provides very strong statistical power, allowing for the first time to make Q2 de Bruijn Vol. I 450 V1 - 02/08/2011 5:16 P.M. Page 450 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers (A) rarefaction curves for the Sierra de Huautla dataset (n = 50) at the indicated clustering cutoffs rarefaction curves for the (B) Reserva de los Tuxtlas dataset (n = 50) (C) at the indicated clustering cutoffs unique 0.01 0.03 0.05 40 unique 0.01 0.03 0.05 25 rarefaction curves for the Amazonian Soils dataset (n = 98) at the indicated clustering cutoffs unique 0.01 0.03 0.05 80 OTUs 20 OTUs 20 30 OTUs c51.tex 15 60 40 10 10 20 5 0 0 0 0 10 20 30 40 50 Number of sequences sampled (D) 300 0 30 40 50 10 20 Number of sequences sampled Chao’s richness estimator plot Sierra de Huautla dataset (n = 50) at the indicated clustering cutoffs (E) unique 0.01 0.03 0.05 250 0 20 40 60 80 100 Number of sequences sampled Chao’s richness estimator plot Reserva de los Tuxtlas (n = 50) at the indicated clustering cutoffs unique 0.01 0.03 0.05 30 25 OTUs OTUs 200 150 20 15 100 10 50 5 0 0 0 10 20 30 40 Number of sequence sampled 50 0 10 20 30 40 Number of sequence sampled 50 Figure 51.6 Rarefaction (A–C) and collector curves for the nonparametric Chao1 richness estimator (C, D) at four clustering levels (0.0–0.05 P distance) for OTU definition. Panel C shows for comparative purposes the rarefaction curves for 98 SSU rRNA Amazonian soil clones generated with universal primers from the classical study by Borneman and Triplett [1997]. fine-grained inferences about the richness and structure of mycobacteria communities in diverse natural environments. Furthermore, the latter studies have revealed the magnitude of the bias in diversity estimation introduced by classical Mycobacterium isolation procedures and have also revealed that we have a strongly distorted notion of the diversity of this species-rich bacterial genus of great environmental and clinical relevance. These results demonstrate the utility and strong statistical power of the lineage-targeted approach to microbial community ecology and show that primers4clades [Contreras-Moreira et al., 2009] (link #5 in the Internet Resources) is a useful tool to develop primers for such gene-targeted metagenomic analyses of microbial diversity. The development of the tool is now coupled to its recent implementation in a phylogenomics analysis pipeline to construct an interactive primer database for phylogenetic clades at different taxonomic and phylogenetic depths. The graphical interface, analysis options, and parameter evaluation procedures will be improved, extended, and refined in future versions, which will also include faster ML algorithms. INTERNET RESOURCES Link 1: CODEHOP structure (http://www.ncbi.nlm.nih. gov/pmc/articles/PMC168931/figure/gkg524f1/) Link 2: CODEHOP server (http://blocks.fhcrc.org/ codehop.html) Link 3: iCODEHOP server (https://icodehop.cphi. washington.edu/i-codehop-context/iCODEHOP/view/ PrimerAnalysis) Link 4: Mycobacterium taxonomy (http://www. bacterio.cict.fr/m/mycobacterium.html) Link 5: Primers4clades web server (http://maya.ccg. unam.mx/primers4clades) de Bruijn Vol. I References Acknowledgments Romualdo Zayas-Laguna and Vı́ctor del Moral from the Information Technology Administration UNIT at CCGUNAM are acknowledged for their technical support. This work was supported by DGAPA/PAPIIT-UNAM [grant number IN201806-2], CONACyT-Mexico [grant number P1-60071], and by Consejo Superior de Investigaciones Cientı́ficas [grant number 200720I038]. UNAM’s Ph.D. Program in Biomedical Sciences and CONACyT-Mexico are acknowledged for the financial support offered to BSR for his Ph.D. studies. REFERENCES Adekambi T, Drancourt M, Raoult D. 2009. The rpoB gene as a tool for clinical microbiologists. Trends Microbiol . 17(1):37– 45. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389– 3402. Anisimova M, Gascuel O. 2006. Approximate likelihood-ratio test for branches: A fast, accurate, and powerful alternative. Syst. Biol . 55(4):539– 352. Black GF, Dockrell HM, Crampin AC, Floyd S, Weir RE, et al. 2001. Patterns and implications of naturally acquired immune responses to environmental and tuberculous mycobacterial antigens in northern Malawi. J. Infect. Dis. 184(3):322– 329. Bland CS, Ireland JM, Lozano E, Alvarez ME, Primm TP. 2005. Mycobacterial ecology of the Rio Grande. Appl. Environ. Microbiol . 71(10):5719– 5727. Borneman J, Triplett EW. 1997. Molecular microbial diversity in soils from eastern Amazonia: Evidence for unusual microorganisms and microbial population shifts associated with deforestation. Appl. Environ. Microbiol . 63(7):2647– 2653. Castiglioni S, Pomati F, Miller K, Burns BP, Zuccato E, et al. 2008. Novel homologs of the multiple resistance regulator marA in antibiotic-contaminated environments. Water Res. 42(16):4271– 4280. Contreras-Moreira B, Sachman-Ruiz B, Figueroa-Palacios I, Vinuesa P. 2009. Primers4clades: A web server that uses phylogenetic trees to design lineage-specific PCR primers for metagenomic and diversity studies. Nucleic Acids Res. 37(Web Server issue):W95– W100. Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792– 1797. Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63(1):1–19. Ewing B, Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3):186– 194. Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3):175– 185. Falkinham JO, 3rd. 2009. Surrounded by mycobacteria: nontuberculous mycobacteria in the human environment. J. Appl. Microbiol . 107(2):356– 367. Falkowski PG, Fenchel T, Delong EF. 2008. The microbial engines that drive Earth’s biogeochemical cycles. Science 320(5879):1034– 1039. Felsenstein J. 2004. PHYLIP (Phylogeny Inference Package), 3.6 ed. Distributed by the author. Seattle: Department of Genetics, University of Washington. c51.tex V1 - 02/08/2011 5:16 P.M. Page 451 451 Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, et al. 2008. Microbial community gene expression in ocean surface waters. Proc. Natl. Acad. Sci. USA 105(10):3805– 3810. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, et al. 2005. Opinion: Reevaluating prokaryotic species. Nat. Rev. Microbiol . 3(9):733– 739. Guindon S, Delsuc F, Dufayard JF, Gascuel O. 2009. Estimating maximum likelihood phylogenies with PhyML. Methods Mol. Biol . 537:113– 137. Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJ. 2001. Counting the uncountable: Statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol . 67(10):4399– 4406. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. 2008. Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320(5879):1081– 1085. Jacobs J, Rhodes M, Sturgis B, Wood B. 2009. Influence of environmental gradients on the abundance and distribution of Mycobacterium spp. in a coastal lagoon estuary. Appl. Environ. Microbiol . 75(23):7378– 7384. Jarman SN. 2004. Amplicon: Software for designing PCR primers on aligned DNA sequences. Bioinformatics 20(10):1644– 1645. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947– 2948. Maiden MC. 2006. Multilocus sequence typing of bacteria. Annu. Rev. Microbiol . 60:561– 588. Manning SD, Motiwala AS, Springman AC, Qi W, Lacher DW, et al. 2008. Variation in virulence among clades of Escherichia coli O157:H7 associated with disease outbreaks. Proc. Natl. Acad. Sci. USA 105(12):4868– 4873. Nakamura Y, Gojobori T, Ikemura T. 2000. Codon usage tabulated from international DNA sequence databases: Status for the year 2000. Nucleic Acids Res. 28(1):292. Polz MF, Cavanaugh CM. 1998. Bias in template-to-product ratios in multitemplate PCR. Appl. Environ. Microbiol . 64(10):3724– 3730. Primm TP, Lucero CA, Falkinham JO, 3rd. 2004. Health impacts of environmental mycobacteria. Clin. Microbiol. Rev . 17(1):98– 106. Rose TM, Schultz ER, Henikoff JG, Pietrokovski S, McCallum CM, et al. 1998. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucleic Acids Res. 26(7):1628– 1635. Rose TM, Henikoff JG, Henikoff S. 2003. CODEHOP (COnsensusDEgenerate Hybrid Oligonucleotide Primer) PCR primer design. Nucleic Acids Res. 31(13):3763– 3766. Sachman-Ruiz B, Vinuesa P. 2011. In preparation. Sachman-Ruiz B, Castillo-Rodal AI, López-Vidal Y, Martı́nezRomero E, Vinuesa P. 2009. Diversity of environmental mycobacteria in Mexican rivers assessed by cultivation and metagenomics approaches. Poster abstract U-007. 109th General Meeting of the American Society for Microbiology. May 17–21 2009. Philadelphia. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol . 75(23):7537– 7541. Schmidt HA, Strimmer K, Vingron M, von Haeseler A. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18(3):502– 504. Smith CJ, Nedwell DB, Dong LF, Osborn AM. 2007. Diversity and abundance of nitrate reductase genes (narG and napA), nitrite reductase genes (nirS and nrfA), and their transcripts in estuarine sediments. Appl. Environ. Microbiol . 73(11):3612– 3622. Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. 2009. Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459(7250):1122– 1125. Q3 de Bruijn Vol. I 452 Q4 c51.tex V1 - 02/08/2011 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, et al. 2002. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12(10):1611– 1618. van Ingen J, Boeree MJ, Dekhuijzen PN, van Soolingen D. 2009. Environmental sources of rapid growing nontuberculous mycobacteria causing disease in humans. Clin. Microbiol. Infect. 15(10):888– 893. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304(5667):66– 74. Vinuesa P. 2010. Multilocus sequence analysis and bacterial species phylogeny estimation. In Oren A, Papke RT, eds. Molecular Phylogeny of Microorganisms. Norwich, UK: Caister Academic Press, pp. 000– 000. Vinuesa P, Rojas-Jimenez K, Contreras-Moreira B, Mahna SK, Prasad BN, et al. 2008. Multilocus sequence analysis for assessment of the biogeography and evolutionary genetics of four Bradyrhizobium species that nodulate soybeans on the Asiatic continent. Appl. Environ. Microbiol . 74(22):6987– 6996. Weir RE, Black GF, Nazareth B, Floyd S, Stenson S, et al. 2006. The influence of previous exposure to environmental mycobacteria on the interferon-gamma response to bacille Calmette–Guerin vaccination in southern England and northern Malawi. Clin. Exp. Immunol . 146(3):390– 399. Weisburg WG, Barns SM, Pelletie DA, Lane DJ. 1991. 16S ribosomal DNA amplification for phylogenetic study. J. Bacteriol. 173:697– 703. Yutin N, Beja O. 2005. Putative novel photosynthetic reaction centre organizations in marine aerobic anoxygenic photosynthetic bacteria: Insights from metagenomics and environmental genomics. Environ. Microbiol . 7(12):2027– 2033. Zehr JP, Jenkins BD, Short SM, Steward GF. 2003. Nitrogenase gene diversity and microbial community structure: A cross-system comparison. Environ. Microbiol . 5(7):539– 554. 5:16 P.M. de Bruijn Vol. I Q1. Q2. Q3. Q4. Queries in Chapter 51 We have shortened the running head, since it exceeds the textwidth. Please confirm Update in ref list Update Supply page range c51.tex V1 - 02/08/2011 5:16 P.M.