A Cross-Sample Statistical Model for SNP Detection in Short-Read Sequencing Data

Supplementary Material

Omkar Muralidharan, Georges Natsoulis, John Bell, Daniel Newburger, Hua Xu, Itai Kela, Hanlee Ji, and Nancy Zhang

August 6, 2011

1 Example Dataset Details

1.1 Capture Assay Design

Capture oligonucleotides specific for 53 genes involved in B-cell regulation were selected from an exome capture resource, oligoexome.stanford.edu. All oligonucleotides were synthesized at the Stanford Genome Technology Center and pooled according to the restriction enzyme (MseI, BfaI, or Sau3AI) for which they were designed, yielding three separate pools per capture assay. For the Sau3AI-specific capture oligonucleotides, the restriction digests were actually performed with DpnII, a Sau3AI isoschizomer that recognizes the same palindromic sequence.

1.2 Sample Preparation

Briefly, 250 ng of human genomic DNA was digested to completion with 3 to 5 units of MseI, BfaI, or DpnII restriction enzyme (New England BioLabs). Subsequently, one third of each digestion was combined with 2.5 units each of Ampligase (Epicentre Biotechnologies) and Taq polymerase, plus 50 pM of the capture oligonucleotide pool and the vector oligonucleotide at equimolar concentration with the capture oligonucleotide pool. The reactions were first denatured at 95°C for 5 minutes and then subjected to 10-15 cycles of 95°C for 1 minute, 60°C for 45 minutes, and 72°C for 15 minutes. Under these conditions, the captured genomic regions formed partially double-stranded circles via oligonucleotide-mediated nick ligation. Uracil excision enzymes (Epicentre Biotechnologies), at 1 unit per reaction, were used to linearize the circles and degrade excess targeting and vector oligonucleotides. After a brief purification using Spin-20 columns (Princeton Separations), the captured DNA pool was amplified by PCR (98°C for 30 seconds, followed by 36-37 cycles of 98°C for 10 seconds, 65°C for 30 seconds, and 73°C for 30 seconds) using Phusion Hot Start High-Fidelity DNA polymerase (New England BioLabs) and non-target-specific common primers homologous to the vector oligonucleotide. After purification with the Fermentas PCR Purification kit, 0.5-1 µg of PCR products per sequencing library were ligated to each other using T4 DNA ligase (New England BioLabs). The concatenated amplicon DNA was fragmented using the Bioruptor (Diagenode), a probe-free sonication device. Subsequent sequencing library preparation was essentially as described in [4], with minor modifications. For the "A"-tailing step prior to adapter ligation, we used Taq polymerase for improved efficiency and a shorter reaction time [1]. Size selection of the sequencing libraries in the 200-300 bp range was accomplished using a 2% SizeSelect E-Gel (Invitrogen).

1.3 Sequencing

Libraries were sequenced on a GAIIx instrument following the standard paired-end protocol for 2x40 bases. Forward and reverse reads were combined for analysis, and the mate-pair information was not used: because the samples had been subjected to concatenation followed by fragmentation, the reads of most mate pairs are expected to derive from different amplicons.

1.4 Alignment

The samples were sequenced according to the manufacturer's specifications. Images were collected, and after the run, image analysis, base calling, and error estimation were performed using Illumina sequencing software (version 2.6.26). The samples were sequenced in 42 paired-end cycles and analyzed using Illumina RTA 1.6.32 and Eland v2, pipeline version 1.6. A PhiX control lane was used for all image analysis. Alignments used default parameters, with hg18 as the reference.

1.5 GATK

We followed the Broad Institute's best-practices guidelines for SNP calling with GATK, performing indel realignment and base quality score recalibration for each sample [3]. We skipped the duplicate-marking step because of the very high depth of our single-read data, and we adjusted the maxReadsForRealignment parameter of the IndelRealigner to accommodate this depth. Single-sample and multi-sample SNP calling were performed with the UnifiedGenotyper using standard hard filters [2].

1.6 Quality Score Filtered Depth Charts

A Perl script was used for filtering. For each line of an Illumina export file, it reads the aligned probe sequence and its basewise quality score listing, and converts every base whose quality score is less than 20 (here, "S" or lower) to N. No other modifications are made, so the number of mismatches determined by the Eland aligner is unchanged, regardless of whether low-quality bases contributed to that number. These samples were run on the v1.5 pipeline or later, so they use real Phred scores rather than the quasi-Phred scores of earlier Illumina pipelines. The depth matrices used for variant calling include sequences with at most two mismatches, so exactly the same sequences enter the quality-filtered depth matrices; the difference is that no base with an individual quality score below 20 contributes to the total A, C, G, or T counts.
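For illustration, the masking rule can be sketched in a few lines of Python. This is a minimal re-expression of the rule described above, not the original Perl script; it assumes the Phred+64 ("Illumina 1.3+") quality encoding used by pipeline v1.3-v1.5 export files, and the function and variable names are ours.

    # A minimal sketch of the quality-masking rule described above, assuming
    # Phred+64 encoded quality strings (so 'S' = Q19, below the cutoff).
    def mask_low_quality(seq, qual, min_q=20, offset=64):
        """Replace bases whose Phred quality is below min_q with 'N'."""
        return "".join(
            base if ord(q) - offset >= min_q else "N"
            for base, q in zip(seq, qual)
        )

    # 'S' encodes Q19 (< 20), so the base beneath it is masked;
    # 'h' encodes Q40, so those bases are kept.
    print(mask_low_quality("ACGT", "hhSh"))  # -> ACNT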
1.7 Coverage Reproducibility

Figure 1 shows that coverage was reproducible across samples in our data.

[Figure 1: Coverage rates for sample 1 (x axis) and sample 2 (y axis), plotted on a log-log scale. Points with zero coverage are not shown.]

2 Model Fitting Algorithm

We fit the mixture model using a regularized ECM algorithm. ECM algorithms replace the M-step with a series of partial maximizations; they are often computationally simpler than the usual EM algorithm, yet enjoy many of the same convergence properties [?]. Our algorithm is quite fast on current data and easily parallelizable, so it will be able to handle the larger datasets to come.

We first define mixture component indicators

\[
\xi^g_{i,\mathrm{null}} = I(p_i \text{ drawn from component } g), \qquad
\xi^g_{i,\mathrm{alt}} = I(q_i \text{ drawn from component } g).
\]

We treat the mixture component indicators ξ = {ξ^g_{i,null}, ξ^g_{i,alt} : i = 1, ..., N; g = 1, ..., G} and the heterozygosity indicators δ = {δ_ij : i = 1, ..., N; j = 1, ..., M} as missing data.

2.0.1 E-Step

In the E-step, we compute E(ξ|X) and E(δ|X) at the current values of the other parameters p, q, π, θ, α. The X_ij are conditionally independent given p and q, so

\[
E(\delta_{ij} \mid X, p, q, \pi, \alpha)
= \frac{\pi_i \prod_{l=1}^{4} q_{il}^{X_{ijl}}}
       {\pi_i \prod_{l=1}^{4} q_{il}^{X_{ijl}} + (1 - \pi_i) \prod_{l=1}^{4} p_{il}^{X_{ijl}}}.
\]

Similarly, the E(ξ|X) are ratios of Dirichlet densities f_α:

\[
E\big(\xi^g_{i,\mathrm{null}} \mid X, p, q, \pi, \alpha\big)
= \frac{\theta_{\mathrm{null},g}\, f_{\alpha_g}(p_i)}{\sum_{g'} \theta_{\mathrm{null},g'}\, f_{\alpha_{g'}}(p_i)}, \qquad
E\big(\xi^g_{i,\mathrm{alt}} \mid X, p, q, \pi, \alpha\big)
= \frac{\theta_{\mathrm{alt},g}\, f_{\alpha_g}(q_i)}{\sum_{g'} \theta_{\mathrm{alt},g'}\, f_{\alpha_{g'}}(q_i)}.
\]
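To make the E-step concrete, here is a minimal NumPy sketch of the δ update at a single position. The function and array names are ours, not the paper's; X_i stacks the M samples' base counts at position i.

    import numpy as np

    def e_step_delta(X_i, p_i, q_i, pi_i):
        """E(delta_ij | X) for one position i, as in the formula above.

        X_i  : (M, 4) array of base counts across the M samples
        p_i  : length-4 null (homozygous) base-frequency vector
        q_i  : length-4 alternative (heterozygous) base-frequency vector
        pi_i : prior heterozygosity probability for this position
        """
        # Work on the log scale so high-depth positions do not underflow.
        log_alt = X_i @ np.log(q_i)   # log prod_l q_il^{X_ijl}, per sample j
        log_null = X_i @ np.log(p_i)  # log prod_l p_il^{X_ijl}, per sample j
        log_num = np.log(pi_i) + log_alt
        log_den = np.logaddexp(log_num, np.log1p(-pi_i) + log_null)
        return np.exp(log_num - log_den)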
2.0.2 CM-Step

In the CM-step, we estimate p, q, π, θ, α using the expected values of the indicators. For simplicity, we write ξ for E(ξ|X), and so on. We optimize sequentially over π, θ, α, p, q.

To estimate π, we do not use the MLE π̂_i = M^{-1} Σ_j δ_ij. This estimator behaves poorly: if π_i = 0, as it does for the majority of our data, the MLE is trivially biased upward, since δ_ij ∈ (0, 1). This creates a feedback cycle, since a higher π increases the δ estimates, which in turn increases the next estimate of π still further. The bias of the MLE is worst for low-depth positions, but it falls exponentially as the depth increases, since δ_ij → 0 exponentially fast.

Instead, we use a weighted shrinkage estimator for π_i that downweights low-depth positions and shrinks all the π_i toward the overall mean of the δs. Our estimator is

\[
\hat{\pi}_i = \frac{\sum_j w_{ij} \delta_{ij} + a \bar{\delta}}{\sum_j w_{ij} + a},
\qquad w_{ij} = \min\big((N_{ij} - 3)_+,\, 20\big).
\]

We give samples with depth less than 3 no weight, since these positions are particularly noisy in our data. The weights increase until N_ij = 23 and then remain constant. We bound the weights because the bias of the MLE is negligible when the depth is high (the specific choice of 23 does not make much difference). We took a = 10 for our data, but the exact choice did not seem to make much difference either.

To estimate θ, we simply use the MLE. For example,

\[
\hat{\theta}_{\mathrm{null},g} = \frac{1}{N} \sum_i \xi^g_{i,\mathrm{null}}.
\]

That leaves p, q, and α. Let A be the common precision of the αs, so A = Σ_{l=1}^{4} α_{lg}, and let α̃ = α/A. We first estimate α̃ unbiasedly. For each null group g, our estimate is

\[
\tilde{\alpha}_g = \frac{\sum_{i,j} X_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}}
                        {\sum_{i,j} N_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}}.
\]

For the non-null groups, we set α̃ to the average of the appropriate null groups.

Next, using the fitted α̃s, we estimate A by maximum likelihood, marginalizing over p and q. In our model, Σ_j X_ij (1 − δ_ij) ξ^g_{i,null} has a Dirichlet-multinomial distribution for each i, g:

\[
\sum_j X_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}
\sim f_{A \tilde{\alpha}_g,\; \sum_j N_{ij} (1 - \delta_{ij}) \xi^g_{i,\mathrm{null}}},
\]

where f_{α,N} is the Dirichlet-multinomial distribution with parameters α and N. Similarly, Σ_j X_ij δ_ij ξ^g_{i,alt} has a Dirichlet-multinomial distribution for each i, g. We plug in the estimated δ, ξ, and α̃, and estimate A by maximum likelihood, using Newton's method.

Finally, given α̃ and A, we estimate p and q by their posterior means:

\[
p_i = \sum_g \xi^g_{i,\mathrm{null}}
      \left( \frac{\alpha_g + \sum_j (1 - \delta_{ij}) X_{ij}}
                  {A + \sum_j (1 - \delta_{ij}) N_{ij}} \right), \qquad
q_i = \sum_g \xi^g_{i,\mathrm{alt}}
      \left( \frac{\alpha_g + \sum_j \delta_{ij} X_{ij}}
                  {A + \sum_j \delta_{ij} N_{ij}} \right).
\]
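As an illustration of two of these conditional maximizations, here is a minimal NumPy sketch of the π update and of the posterior-mean update for p_i. Names are ours; alpha denotes the G x 4 matrix of Dirichlet parameters Aα̃.

    import numpy as np

    def cm_step_pi(delta, N, a=10.0):
        """Weighted shrinkage estimator for pi, one value per position.

        delta, N : (positions, samples) arrays of E(delta|X) and read depths.
        """
        w = np.minimum(np.maximum(N - 3.0, 0.0), 20.0)  # w_ij = min((N_ij - 3)_+, 20)
        delta_bar = delta.mean()                        # overall mean of the deltas
        return (np.sum(w * delta, axis=1) + a * delta_bar) / (np.sum(w, axis=1) + a)

    def cm_step_p(X_i, N_i, delta_i, xi_null_i, alpha, A):
        """Posterior mean of p_i for one position i.

        X_i       : (samples, 4) base counts; N_i : (samples,) depths
        delta_i   : (samples,) E(delta|X) for this position
        xi_null_i : (groups,) E(xi_null|X) for this position
        alpha     : (groups, 4) Dirichlet parameters; A : common precision
        """
        w = 1.0 - delta_i
        num = alpha + (w[:, None] * X_i).sum(axis=0)  # alpha_g + sum_j (1 - d_ij) X_ij
        den = A + (w * N_i).sum()                     # A + sum_j (1 - d_ij) N_ij
        return xi_null_i @ (num / den)                # mix the group posteriors over g

The q_i update is identical with delta_i in place of 1 - delta_i and the alternative-group indicators in place of the null ones.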
2.0.3 Starting Points

We use the starting points to regularize this procedure and to incorporate our intuition about what parameter values are reasonable. By starting with parameters of the form we desire and allowing the EM iterations to fine-tune them, we hope to obtain reasonable final parameter estimates.

We start with π_i = 10^-5. For α, we initialize A and α̃ = α/A separately. We start with A = 20. We initialize the clean null α̃s to (0.95, 0.0033, 0.0033, 0.0033) (changing the location of the maximum as appropriate). For noisy nulls, we use (0.85, 0.05, 0.05, 0.05). This difference in starting points is the only explicit difference between clean and noisy mixture components; afterward, the algorithm ignores the clean/noisy distinction. The alternative α̃s are initialized to the averages of the corresponding null α̃s. For θ, we initialize θ_null to put probability 0.245 on each clean null and 0.005 on each noisy null, and θ_alt to put probability 1/12 on each clean and noisy alternative.

Finally, we initialize p and q as follows. Let Z ∈ R^{N×4} be the matrix obtained by summing the counts X over all samples for each position. For each i, let l_i be the index of the reference base (1 to 4), and let k_i be the index of the highest-frequency nonreference base in Z_i. We initialize p_i to be roughly a null group at base l_i, with some error in the base-k_i direction, depending on how much error Z_i shows. More precisely, we set each entry of p_i to 1, then set p_{i,l_i} = max(Z_i) and p_{i,k_i} = Z_{i,k_i}, and finally scale so that p_i sums to 1 and puts probability between 0.85 and 0.99 on the reference base. To initialize q_i, we simply put probability 0.4 on bases l_i and k_i and 0.1 on the other two bases.

2.1 Model-Based Simulation

We generated a synthetic dataset from our model, with parameters based on the fit to real data. Given the true priors and parameters, calculating E(δ|X) exactly is still intractable, so we approximated E(δ|X) as in our EM fitting procedure, and used these values to calculate the true weighted fdr,

\[
\mathrm{fdr} = \exp\left( \frac{\sum_i w_i \log\big(1 - E(\delta_i \mid X)\big)}{\sum_i w_i} \right),
\qquad w_i = \min\big((N_i - 3)_+,\, 20\big).
\]
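For concreteness, a minimal NumPy sketch of this weighted geometric mean follows; the variable names are ours, with delta holding the pooled E(δ|X) values and N the corresponding read depths.

    import numpy as np

    def weighted_fdr(delta, N):
        """Weighted fdr: the w-weighted geometric mean of 1 - E(delta|X)."""
        w = np.minimum(np.maximum(N - 3.0, 0.0), 20.0)  # w_i = min((N_i - 3)_+, 20)
        return np.exp(np.sum(w * np.log1p(-delta)) / np.sum(w))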
We then refit our model on the synthetic dataset and compared the fitted fdr estimates to the true weighted fdrs. Figure 2 shows that the estimated fdr tracks the true fdr well for positions with low true fdr, with a slight upward bias. For positions with high true fdr, our estimator is quite upwardly biased, but this is unlikely to mislead, since the exact fdr rarely matters for high-fdr positions. Table 1 shows that if we bin the estimated and true fdrs into extremely low, low, moderate, large, and very large, the estimated fdr nearly always agrees with the true fdr.

[Figure 2: Estimated fdr vs. true weighted fdr, plotted on the logit scale. Points plotted at y = 50 have estimated fdr numerically equal to 1.]

True \ Estimated   [0, 10^-5]   (10^-5, 0.001]   (0.001, 0.1]   (0.1, 0.5]   (0.5, 1]    Total
[0, 10^-5]               2134                6              0            0          2     2142
(10^-5, 0.001]              8                8              4            0          0       20
(0.001, 0.1]                1                6             19            5          4       35
(0.1, 0.5]                  0                0              7            7          7       21
(0.5, 1]                    5                1             13           22     278397   278438
Total                    2148               21             43           34     278410   280656

Table 1: Binned true fdr and estimated fdr.

These simulations show that our method can conservatively estimate the true fdr when the data are generated from the model we are fitting. This test validates our fitting method: estimating the fdr is nontrivial even when fitting the true model, since the likelihood is nonconvex with many local optima. Our method's good fdr estimation performance on this parametric simulation suggests that our estimated fdrs are likely to be at least roughly accurate on real data.

3 Spike-In Experiment Positions

Tables 2 and 3 give the clean and noisy positions we used, respectively.

Sample      A    C      G    T
     1      0    0    400    0
     2      0    1    500    1
     3      0    2    594    2
     4      0    0    471    4
     5      0    0    615    0
     6      0    0     11    0
     7      0    5   4991    6
     8     13    7   3907    2
     9      3    2   2036    4
    10      0    1    602    0
    11      0    0    583    0
    12      2    0    663    0
    13      4    3   1009    1
    14      0    1   1126    0
    15      0    0    984    1
    16      0    1   1053    0
    17      1    1   1123    1
    18      0    1   2103    0
    19      1    0    279    0
    20      0    1    227    1
    21      0    0    424    0
    22      0    0    154    0
    23      0    0    411    0
    24      2    1    582    0
    25      0    2   1148    2
    26      3    3   2346    2
    27      1    3   1503    3
    28      0    2   2639    1
    29      1    0   1899    2
    30      0    0    255    1

Table 2: Clean null position, reference base G, spiked alternative base A.

Sample   A   C   G     T
     1   0   0   1     7
     2   0   0   0     8
     3   0   0   0     9
     4   0   0   0     1
     5   0   0   0     7
     6   0   0   0     0
     7   0   1   2    92
     8   0   3   3   102
     9   1   0   0    32
    10   1   0   0     3
    11   0   0   0     4
    12   0   1   1    37
    13   0   0   0    11
    14   0   0   1    17
    15   0   0   0     8
    16   0   0   1    20
    17   0   1   1    19
    18   0   0   0    24
    19   0   0   0     4
    20   0   0   0    12
    21   0   0   0     0
    22   0   0   0    11
    23   0   0   0    13
    24   0   0   0     0
    25   0   1   1    13
    26   0   0   0     5
    27   0   0   0     6
    28   0   0   0     2
    29   0   0   0    15
    30   0   0   0     6

Table 3: Noisy null position, reference base T, spiked alternative base G.

References

[1] James M. Clark. Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. Nucleic Acids Research, 16(20):9677-9686, 1988.

[2] M. DePristo, E. Banks, R. Poplin, K. Garimella, J. Maguire, C. Hartl, A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. Fennell, A. Kernytsky, A. Sivachenko, K. Cibulskis, S. Gabriel, D. Altshuler, and M. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5):491-498, 2011.

[3] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297-1303, 2010.

[4] Michael A. Quail, Iwanka Kozarewa, Frances Smith, Aylwyn Scally, Philip J. Stephens, Richard Durbin, Harold Swerdlow, and Daniel J. Turner. A large genome center's improvements to the Illumina sequencing system. Nature Methods, 5(12):1005-1010, 2008.