A Cross-Sample Statistical Model for SNP Detection in Short-Read Sequencing Data

Supplementary Material
Omkar Muralidharan, Georges Natsoulis, John Bell, Daniel Newburger,
Hua Xu, Itai Kela, Hanlee Ji, and Nancy Zhang
August 6, 2011
1 Example Dataset Details

1.1 Capture Assay Design
Capture oligonucleotides specific for 53 genes involved in B-cell regulation were selected from an exome capture resource, oligoexome.stanford.edu. All oligonucleotides were synthesized at the Stanford Genome Technology Center and pooled according to the restriction enzyme each oligonucleotide targets (MseI, BfaI, Sau3AI), resulting in three separate pools per capture assay. For the capture oligonucleotides specific for Sau3AI, the actual enzyme used for the restriction digests was DpnII, a Sau3AI isoschizomer that recognizes the same palindromic sequence.
1.2 Sample Preparation
Briefly, 250 ng of human genomic DNA was digested to completion with 3 to 5 units of MseI, BfaI, or DpnII restriction enzyme (New England BioLabs). Subsequently, one third of each digestion was combined with 2.5 units each of Ampligase (Epicentre Biotechnologies) and Taq polymerase, plus 50 pM each of the capture oligonucleotide pool and the vector oligonucleotide (at equimolar concentration with the capture oligonucleotide pool). The reactions were first denatured at 95°C for 5 minutes and then subjected to 10-15 cycles of 95°C for 1 minute, 60°C for 45 minutes, and 72°C for 15 minutes. Under these conditions, the captured genomic regions formed partially double-stranded circles via oligonucleotide-mediated nick ligation. Uracil excision enzymes (Epicentre Biotechnologies), at 1 unit per reaction, were used to linearize the circles and degrade excess targeting and vector oligonucleotides. After a brief purification using Spin-20 columns (Princeton Separations), the captured DNA pool was amplified by PCR (98°C for 30 seconds, followed by 36-37 cycles of 98°C for 10 seconds, 65°C for 30 seconds, and 73°C for 30 seconds) using Phusion Hot Start High-Fidelity DNA polymerase (New England BioLabs) and non-target-specific common primers homologous to the vector oligonucleotide. After purification using the Fermentas PCR Purification kit, 0.5-1 µg of PCR product per sequencing library was ligated together using T4 DNA ligase (New England BioLabs). The concatenated amplicon DNA was fragmented using the Bioruptor (Diagenode), a probe-free sonication device. Subsequent sequencing library preparation was essentially as described in [4], with minor modifications. For the “A” tailing step prior to adapter ligation, we used Taq polymerase for improved efficiency and a shorter reaction time [1]. Size selection of the sequencing libraries in the range of 200-300 bp was accomplished using the 2% SizeSelect E-Gel (Invitrogen).
1.3 Sequencing
Libraries were sequenced on a GAIIx instrument following the standard paired-end protocol for 2x40 bases. Forward and reverse reads were combined for analysis, and the mate-pair information was not used, because the samples had been subjected to concatenation followed by fragmentation. Under these conditions, the reads of most mate pairs are expected to derive from different amplicons.
1.4 Alignment
The samples were sequenced according to the manufacturer’s specifications. Images were collected during the run; afterward, image analysis, base calling, and error estimation were performed using Illumina sequencing software (version 2.6.26). The samples were sequenced in 42 paired-end cycles and analyzed using Illumina RTA 1.6.32 and Eland v2, pipeline version 1.6. A PhiX control lane was used for all image analysis. Alignments used default parameters. We used hg18 as the reference.
1.5 GATK
We followed the Broad’s best practices guidelines for SNP calling with GATK by performing indel realignment and base quality score recalibration for each sample [3]. We skipped the mark-duplicates step because of the very high depth of our single-read data, and we adjusted the maxReadsForRealignment parameter of the IndelRealigner to accommodate this depth. Single-sample and multi-sample SNP calling were performed with the Unified Genotyper using standard hard filters [2].
1.6 Quality Score Filtered Depth Charts
A Perl script was used for filtering. For each line in an Illumina export file, it reads the aligned read sequence and the corresponding basewise quality score string. Every base whose quality score is less than 20 (here, "S" or lower) is converted to N. No other modifications are made; in particular, the number of mismatches determined by the Eland aligner is not modified, regardless of whether low-quality bases contributed to that number. These samples were run on the v1.5 pipeline and later, so they use real Phred scores rather than the quasi-Phred scores used by earlier Illumina pipelines.

The depth matrices used for variant calling include sequences with no more than two mismatches, so exactly the same set of sequences enters the quality-filtered depth matrices. The difference is that no bases with individual quality scores below 20 contribute to the total A, C, G, or T counts.
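For concreteness, here is a minimal Python sketch of that filtering step (the original Perl script is not reproduced here; the function name is ours, and the Phred+64 character encoding is an assumption based on the v1.3-1.7 Illumina pipeline convention, under which "S" encodes Q19):

```python
# Minimal sketch of the base-quality filter described above. Bases below
# Q20 become N; nothing else about the read or its mismatch count changes.
MIN_Q = 20
OFFSET = 64  # Phred+64 quality encoding (assumed for these pipeline versions)

def mask_low_quality(seq: str, qual: str) -> str:
    """Replace every base whose Phred score is below MIN_Q with 'N'."""
    return "".join(
        base if ord(q) - OFFSET >= MIN_Q else "N"
        for base, q in zip(seq, qual)
    )

# 'S' encodes Q19 (< 20), so the third base is masked:
print(mask_low_quality("ACGT", "TTSa"))  # -> ACNT
```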
1.7 Coverage Reproducibility
Figure 1 shows that the coverage was reproducible across samples in our data.
2 Model Fitting Algorithm
We fit the mixture model using a regularized ECM algorithm. ECM algorithms replace the M-step with a series of partial maximizations; they are often computationally simpler than the usual EM algorithm, yet enjoy many of the same convergence properties [?]. Our algorithm is quite fast on current data and easily parallelizable, so it should be able to handle the larger datasets to come.
We first define mixture component indicators

\[
\xi^g_{i,\mathrm{null}} = I(p_i \text{ drawn from component } g),
\qquad
\xi^g_{i,\mathrm{alt}} = I(q_i \text{ drawn from component } g).
\]

We treat the mixture component indicators

\[
\xi = \{\xi^g_{i,\mathrm{null}},\ \xi^g_{i,\mathrm{alt}} : i = 1, \ldots, N;\ g = 1, \ldots, G\}
\]

and the heterozygosity indicators $\delta = \{\delta_{ij} : i = 1, \ldots, N;\ j = 1, \ldots, M\}$ as missing data.
2.0.1 E-Step
In the E-step, we compute E(ξ|X) and E(δ|X) at the current values of the other parameters p, q, π, θ, α. The X_ij are conditionally independent given p and q, so

\[
E(\delta_{ij} \mid X, p, q, \pi, \alpha)
= \frac{\pi_i \prod_{l=1}^4 q_{il}^{X_{ijl}}}
       {\pi_i \prod_{l=1}^4 q_{il}^{X_{ijl}} + (1 - \pi_i) \prod_{l=1}^4 p_{il}^{X_{ijl}}},
\]

since δ_ij = 1 indicates that sample j is heterozygous at position i, so that X_ij is drawn from q_i rather than p_i.
Figure 1: Coverage rates for sample 1 (x axis) and sample 2 (y axis), plotted on a log-log scale. Points with zero coverage are not shown.
Similarly, the E(ξ|X) are ratios of Dirichlet densities f_α:

\[
E\bigl(\xi^g_{i,\mathrm{null}} \mid X, p, q, \pi, \alpha\bigr)
= \frac{\theta_{\mathrm{null},g}\, f_{\alpha_g}(p_i)}{\sum_{g'} \theta_{\mathrm{null},g'}\, f_{\alpha_{g'}}(p_i)},
\qquad
E\bigl(\xi^g_{i,\mathrm{alt}} \mid X, p, q, \pi, \alpha\bigr)
= \frac{\theta_{\mathrm{alt},g}\, f_{\alpha_g}(q_i)}{\sum_{g'} \theta_{\mathrm{alt},g'}\, f_{\alpha_{g'}}(q_i)}.
\]
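As a concrete illustration, the following NumPy sketch computes both E-step quantities. This is our own illustrative code, not the authors' implementation: X is an N x M x 4 count array, p and q are N x 4 proportion matrices (assumed strictly positive), pi is the length-N vector of π_i, alpha is a G x 4 matrix of Dirichlet parameters, and theta is the matching length-G weight vector.

```python
import numpy as np
from scipy.special import expit, gammaln

def e_step_delta(X, p, q, pi):
    """E(delta_ij | X): posterior probability that sample j is heterozygous
    at position i. Computed in log space; the multinomial coefficients
    cancel in the ratio. Assumes p and q are strictly positive."""
    log_alt = np.einsum("ijl,il->ij", X, np.log(q))   # log prod_l q_il^X_ijl
    log_null = np.einsum("ijl,il->ij", X, np.log(p))  # log prod_l p_il^X_ijl
    a = np.log(pi)[:, None] + log_alt                 # log pi_i + het term
    b = np.log1p(-pi)[:, None] + log_null             # log(1 - pi_i) + null term
    return expit(a - b)                               # exp(a) / (exp(a) + exp(b))

def dirichlet_logpdf(p, alpha):
    """log f_{alpha_g}(p_i) for every position i (rows) and component g (cols)."""
    const = gammaln(alpha.sum(1)) - gammaln(alpha).sum(1)   # (G,) normalizers
    return const[None, :] + np.log(p) @ (alpha - 1).T       # (N, G)

def e_step_xi(p, alpha, theta):
    """E(xi^g_i | X) for one arm of the mixture (null with p, or alt with q)."""
    logw = np.log(theta)[None, :] + dirichlet_logpdf(p, alpha)
    logw -= logw.max(axis=1, keepdims=True)                 # stabilize
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)
```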
2.0.2 CM-Step
In the CM-step, we estimate p, q, π, θ, α using the expected values of the indicators. For simplicity, we will write ξ for E(ξ|X), and so on. We sequentially optimize over π, θ, α, p, q.

To estimate π, we do not use the MLE $\hat{\pi}_i = M^{-1} \sum_j \delta_{ij}$. This estimator behaves poorly. If π_i = 0, as it does for the majority of our data, the MLE is trivially biased upward, since δ_ij ∈ (0, 1). This creates a feedback cycle: a higher π increases the δ estimates, which in turn increases the next estimate of π. The bias of the MLE is worst for low-depth positions, but it falls exponentially as the depth increases, since δ_ij → 0 exponentially fast.
Instead, we use a weighted shrinkage estimator for π_i that downweights low-depth positions and shrinks all the π_i toward the overall mean of the δs. Our estimator is

\[
\hat{\pi}_i = \frac{\sum_j w_{ij} \delta_{ij} + a \bar{\delta}}{\sum_j w_{ij} + a},
\]

where the weights are $w_{ij} = \min\bigl((N_{ij} - 3)_+,\, 20\bigr)$. We give samples with depth less than 3 no weight, since these positions are particularly noisy in our data. The weights increase until N_ij = 23 and then remain constant. We bound the weights because the bias of the MLE is negligible when the depth is high (the specific choice of 23 does not make much difference). We took a = 10 for our data, but the exact choice did not seem to make much difference.
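In code, the estimator is a few lines; a sketch in the same NumPy notation (delta holds E(δ_ij|X), Ndep the depths N_ij; variable names are ours):

```python
import numpy as np

def estimate_pi(delta, Ndep, a=10.0):
    """Weighted shrinkage estimator of pi_i: downweight low-depth samples
    and shrink toward the overall mean of the deltas."""
    w = np.minimum(np.clip(Ndep - 3, 0, None), 20)  # w_ij = min((N_ij - 3)+, 20)
    delta_bar = delta.mean()                        # overall mean of the deltas
    return ((w * delta).sum(axis=1) + a * delta_bar) / (w.sum(axis=1) + a)
```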
To estimate θ, we simply use the MLE. For example,

\[
\hat{\theta}_{\mathrm{null},g} = \frac{1}{N} \sum_i \xi^g_{i,\mathrm{null}}.
\]
That leaves p, q, and α. Let A be the common precision of the α's, so $A = \sum_{l=1}^4 \alpha_{lg}$, and let $\tilde{\alpha} = \alpha / A$. We first estimate α̃ unbiasedly. For each null group g, our estimate is

\[
\tilde{\alpha}_g
= \frac{\sum_{i,j} X_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}}
       {\sum_{i,j} N_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}}.
\]

For the non-null groups, we set α̃ to be the average of the appropriate null groups.
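A sketch of this estimator in the same notation (xi_null holds E(ξ^g_{i,null}|X)):

```python
import numpy as np

def estimate_alpha_tilde(X, Ndep, delta, xi_null):
    """Unbiased estimate of alpha_tilde_g for each null group g.
    X: (N, M, 4) counts; Ndep: (N, M) depths; xi_null: (N, G)."""
    num = np.einsum("ijl,ij,ig->gl", X, 1.0 - delta, xi_null)
    den = np.einsum("ij,ij,ig->g", Ndep, 1.0 - delta, xi_null)
    return num / den[:, None]  # rows sum to 1 because sum_l X_ijl = N_ij
```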
Next, using the fitted α̃'s, we estimate A by maximum likelihood, marginalizing over p and q. In our model, $\sum_j X_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}$ has a Dirichlet-multinomial distribution for each i, g:

\[
\sum_j X_{ij} (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}
\sim f_{A \tilde{\alpha}_g,\ \sum_j N_{ij} (1 - \delta_{ij}) \xi^g_{i,\mathrm{null}}},
\]

where $f_{\alpha, N}$ is the Dirichlet-multinomial distribution with parameters α and N. Similarly, $\sum_j X_{ij} \delta_{ij}\, \xi^g_{i,\mathrm{alt}}$ has a Dirichlet-multinomial distribution for each i, g. We plug in the estimated δ, ξ, and α̃, and estimate A by maximum likelihood (using Newton's method).
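The Newton iteration is standard; below is an illustrative version in which x stacks the pooled count vectors for all (i, g) pairs, n holds their pooled depths, and at the corresponding α̃_g rows. The derivatives are those of the Dirichlet-multinomial log-likelihood in A, with the multinomial coefficient dropped since it does not involve A; the starting value and positivity safeguard are our choices.

```python
from scipy.special import digamma, polygamma

def estimate_A(x, n, at, A0=20.0, iters=20):
    """Maximize the Dirichlet-multinomial log-likelihood over the common
    precision A by Newton's method. x: (K, 4) pooled counts, one row per
    (i, g) pair; n: (K,) pooled depths; at: (K, 4) matching alpha_tilde rows."""
    A = A0
    for _ in range(iters):
        alpha = A * at                                # current Dirichlet params
        # First and second derivatives of the log-likelihood in A.
        grad = (digamma(A) - digamma(n + A)
                + (at * (digamma(x + alpha) - digamma(alpha))).sum(1)).sum()
        hess = (polygamma(1, A) - polygamma(1, n + A)
                + (at**2 * (polygamma(1, x + alpha)
                            - polygamma(1, alpha))).sum(1)).sum()
        A = max(A - grad / hess, 1e-6)                # Newton step, kept positive
    return A
```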
Finally, given α̃ and A, we estimate p and q by their posterior means:

\[
p_i = \sum_g \xi^g_{i,\mathrm{null}}
\left( \frac{\alpha_g + \sum_j (1 - \delta_{ij}) X_{ij}}{A + \sum_j (1 - \delta_{ij}) N_{ij}} \right),
\qquad
q_i = \sum_g \xi^g_{i,\mathrm{alt}}
\left( \frac{\alpha_g + \sum_j \delta_{ij} X_{ij}}{A + \sum_j \delta_{ij} N_{ij}} \right).
\]
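A sketch of the posterior-mean updates, with alpha holding the G x 4 matrix whose rows are Aα̃_g:

```python
import numpy as np

def posterior_means(X, Ndep, delta, xi_null, xi_alt, alpha, A):
    """Posterior means of p_i and q_i; alpha is (G, 4) with rows A*alpha_tilde_g."""
    def mixture_mean(weight, xi):
        sx = np.einsum("ijl,ij->il", X, weight)       # sum_j weight_ij X_ij
        sn = np.einsum("ij,ij->i", Ndep, weight)      # sum_j weight_ij N_ij
        comp = (alpha[None, :, :] + sx[:, None, :]) / (A + sn)[:, None, None]
        return np.einsum("ig,igl->il", xi, comp)      # mix components over g
    p = mixture_mean(1.0 - delta, xi_null)            # null: weights 1 - delta_ij
    q = mixture_mean(delta, xi_alt)                   # alt: weights delta_ij
    return p, q
```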
2.0.3 Starting Points
We use the starting points to regularize this procedure and incorporate our intuition about what parameter values are reasonable. By starting out with parameters of the form we desire and allowing the EM iterations to fine-tune them, we hope to get reasonable final parameter estimates.

We start with π_i = 10^{-5}. For α, we initialize the precision $A = \sum_l \alpha_{lg}$ and the mean $\tilde{\alpha} = \alpha / A$ separately. We start with A = 20. We initialize the clean null α̃s to (0.95, 0.0033, 0.0033, 0.0033) (changing the location of the maximum as appropriate); for noisy nulls, we use (0.85, 0.05, 0.05, 0.05). This difference in starting points is the only explicit difference between clean and noisy mixture components; afterward, the algorithm ignores the clean/noisy distinction. The alternative α̃s are initialized to the averages of the corresponding null α̃s. For θ, we initialize θ_null to put probability 0.245 on each clean null and 0.005 on each noisy null, and θ_alt to put probability 1/12 on each clean and noisy alternative.
Finally, we initialize p and q as follows (a code sketch appears below). Let Z ∈ R^{N×4} be the matrix obtained by summing the counts X over all samples for each position. For each i, let l_i be the index of the reference base (1 to 4), and let k_i be the index of the highest-frequency nonreference base in Z_i. We initialize p_i to be roughly a null group at base l_i, with some error in the base k_i direction, depending on how much error Z_i shows. More precisely, we set every entry of p_i to 1, then set $p_{il_i} = \max(Z_i)$ and $p_{ik_i} = Z_{ik_i}$, and then scale so that p_i sums to 1 and puts probability between 0.85 and 0.99 on the reference base. To initialize q_i, we simply put probability 0.4 on bases l_i and k_i and 0.1 on the other two bases.
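Here is an illustrative version of this initialization; the way we redistribute mass when clamping the reference probability into [0.85, 0.99] is our reading of the (terse) description above.

```python
import numpy as np

def init_p_q(Z, ref):
    """Initialize p and q from pooled counts Z (N, 4); ref[i] indexes the
    reference base."""
    n = Z.shape[0]
    p = np.ones((n, 4))
    q = np.full((n, 4), 0.1)
    for i in range(n):
        li = int(ref[i])
        nonref = Z[i].astype(float)
        nonref[li] = -np.inf
        ki = int(nonref.argmax())       # highest-frequency nonreference base
        p[i, li] = Z[i].max()
        p[i, ki] = Z[i, ki]
        p[i] /= p[i].sum()              # scale to sum to 1
        r = float(np.clip(p[i, li], 0.85, 0.99))
        p[i] *= (1.0 - r) / (1.0 - p[i, li])  # rescale nonreference mass
        p[i, li] = r                    # reference probability in [0.85, 0.99]
        q[i, li] = 0.4                  # roughly heterozygous start:
        q[i, ki] = 0.4                  # 0.4 on l_i and k_i, 0.1 elsewhere
    return p, q
```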
2.1 Model-Based Simulation
We generated a synthetic dataset from our model, with parameters based on the fit to real data. Given the true priors and parameters, calculating E(δ|X) is still intractable, so we approximated E(δ|X) as in our EM fitting procedure, and used these values to calculate the true weighted fdr

\[
\mathrm{fdr}_i = \exp\left( \sum_j w_{ij} \log\bigl(1 - E(\delta_{ij} \mid X)\bigr) \Big/ \sum_j w_{ij} \right),
\]

where $w_{ij} = \min\bigl((N_{ij} - 3)_+,\, 20\bigr)$, as in the CM-step. We then refit our model on the synthetic dataset and compared the fitted fdr estimates to the true weighted fdrs.

Figure 2: Estimated fdr vs. true weighted fdr, plotted on the logit scale. Points plotted at y = 50 have estimated fdr numerically equal to 1.
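Following the reconstruction above, the true weighted fdr is a depth-weighted geometric mean of the per-sample null probabilities 1 − E(δ_ij|X); a sketch:

```python
import numpy as np

def true_weighted_fdr(delta_post, Ndep):
    """Depth-weighted geometric mean of 1 - E(delta_ij|X) across samples j,
    per position i, using the same weights as in the CM-step."""
    w = np.minimum(np.clip(Ndep - 3, 0, None), 20)
    return np.exp((w * np.log1p(-delta_post)).sum(axis=1) / w.sum(axis=1))
```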
Figure 2 shows that the estimated fdr tracks the true fdr well for positions with low true fdr, with a slight upward bias. For positions with high true fdrs, our estimator is quite upwardly biased, but it is unlikely to mislead, since the exact fdr for high-fdr positions does not usually matter. Table 1 shows that if we bin the estimated and true fdrs into extremely low, low, moderate, large, and very large, our estimated fdr nearly always agrees with the true fdr.
These simulations show that our method can conservatively estimate the true fdr when the data are generated from the model we are fitting. This test validates our fitting method: estimating the fdr is nontrivial, even when fitting the true model, since the likelihood is nonconvex with many local optima. Our method's good fdr estimation performance on this parametric simulation suggests that our estimated fdrs are likely to be at least roughly accurate on real data.
True \ Estimated    [0, 10^-5]   (10^-5, 0.001]   (0.001, 0.1]   (0.1, 0.5]   (0.5, 1]     Total
[0, 10^-5]                2134                8              1            0          5      2148
(10^-5, 0.001]               6                8              6            0          1        21
(0.001, 0.1]                 0                4             19            7         13        43
(0.1, 0.5]                   0                0              5            7         22        34
(0.5, 1]                     2                0              4            7     278397    278410
Total                     2142               20             35           21     278438    280656

Table 1: Binned true fdr (rows) and estimated fdr (columns).
3 Spike-In Experiment Positions
Tables 2 and 3 give the clean and noisy positions we used, respectively.
References

[1] James M. Clark. Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. Nucleic Acids Research, 16(20):9677–9686, 1988.

[2] M. DePristo, E. Banks, R. Poplin, K. Garimella, J. Maguire, C. Hartl, A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. Fennell, A. Kernytsky, A. Sivachenko, K. Cibulskis, S. Gabriel, D. Altshuler, and M. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5):491–498, 2011.

[3] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297–1303, 2010.

[4] Michael A. Quail, Iwanka Kozarewa, Frances Smith, Aylwyn Scally, Philip J. Stephens, Richard Durbin, Harold Swerdlow, and Daniel J. Turner. A large genome center's improvements to the Illumina sequencing system. Nature Methods, 5(12):1005–1010, 2008.
Sample     A    C      G    T
1          0    0    400    0
2          0    1    500    1
3          0    2    594    2
4          0    0    471    4
5          0    0    615    0
6          0    0     11    0
7          0    5   4991    6
8         13    7   3907    2
9          3    2   2036    4
10         0    1    602    0
11         0    0    583    0
12         2    0    663    0
13         4    3   1009    1
14         0    1   1126    0
15         0    0    984    1
16         0    1   1053    0
17         1    1   1123    1
18         0    1   2103    0
19         1    0    279    0
20         0    1    227    1
21         0    0    424    0
22         0    0    154    0
23         0    0    411    0
24         2    1    582    0
25         0    2   1148    2
26         3    3   2346    2
27         1    3   1503    3
28         0    2   2639    1
29         1    0   1899    2
30         0    0    255    1

Table 2: Clean null position, reference base G, spiked alternative base A.
Sample     A    C    G     T
1          0    0    1     7
2          0    0    0     8
3          0    0    0     9
4          0    0    0     1
5          0    0    0     7
6          0    0    0     0
7          0    1    2    92
8          0    3    3   102
9          1    0    0    32
10         1    0    0     3
11         0    0    0     4
12         0    1    1    37
13         0    0    0    11
14         0    0    1    17
15         0    0    0     8
16         0    0    1    20
17         0    1    1    19
18         0    0    0    24
19         0    0    0     4
20         0    0    0    12
21         0    0    0     0
22         0    0    0    11
23         0    0    0    13
24         0    0    0     0
25         0    1    1    13
26         0    0    0     5
27         0    0    0     6
28         0    0    0     2
29         0    0    0    15
30         0    0    0     6

Table 3: Noisy null position, reference base T, spiked alternative base G.