MPG NGS workshop - Broad Institute

MPG NGS workshop:
SNP calling and error modeling
February 2011
Ryan Poplin
Genome Sequencing and Analysis
Medical and Population Genetics
The paradigm today
Phase 1: NGS data processing (typically by lane)
  Input: raw reads
  Steps: mapping -> local realignment -> duplicate marking -> base quality recalibration
  Output: analysis-ready reads
Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
  Input: reads for Sample 1 ... Sample N
  Steps: SNPs, indels, structural variation (SV)
  Output: raw variants
Phase 3: Integrative analysis
  Input: raw SNPs, raw indels, raw SVs
  External data: pedigrees, known variation, population structure, known genotypes
  Steps: variant quality recalibration, genotype refinement
  Output: analysis-ready variants
Step 2: SNP discovery
Analysis-ready BAMs -> Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) -> Variant Quality Recalibration -> Beagle
•  We note that we no longer use any hard-filters (proximity
to indel calls, clustered SNPs, etc.) at any point in the
process.
•  Unified Genotyper math and command lines discussed in
previous meetings.
(see Appendix for full details)
Step 3: SNP discovery
Analysis-ready BAMs -> Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) -> Variant Quality Recalibration -> Beagle
•  The variant quality recalibration process has gone through a
major overhaul recently. Most notably, we have removed any
dependency on Ti/Tv in the calculation. This and further
changes are highlighted in the following slides.
•  Outline:
•  Quick Variant Recalibration overview
•  Contrastive clustering walkthrough
•  Ti/Tv-free quality thresholding or commitment-free probabilistic callsets
Variant annotations provide signal with which to remove artifacts!
VCF record for an A/G SNP at 22:49582364

#CHROM  POS       ID  REF  ALT  QUAL    FILTER  FORMAT    (three samples)
22      49582364  .   A    G    198.96  .       GT:DP:GQ  0/1:12:99.00  0/1:11:89.43  0/1:28:37.78

INFO field: AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76

AC    No. chromosomes carrying alt allele
AN    Total no. of chromosomes
AF    Allele frequency
AB    Allele balance of ref/alt in hets
DP    Depth of coverage
HRun  Length of longest contiguous homopolymer
MQ    RMS MAPQ of all reads
MQ0   No. of MAPQ 0 reads at locus
QD    QUAL score over depth
SB    Estimated SB score
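A minimal sketch of how the INFO field above can be turned into numeric annotations for downstream modeling. The helper name `parse_info` is hypothetical and is not part of the GATK; it only splits the `key=value` pairs shown on the slide.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string such as 'AB=0.67;AC=3' into a dict."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            # Flag-style entries carry no value.
            parsed[key] = True
            continue
        try:
            parsed[key] = int(value)
        except ValueError:
            try:
                parsed[key] = float(value)
            except ValueError:
                parsed[key] = value
    return parsed


# INFO string copied from the slide.
INFO = ("AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;"
        "MQ=71.31;MQ0=22;QD=2.29;SB=-31.76")
info = parse_info(INFO)

# For this record the allele frequency equals AC / AN.
af_from_counts = info["AC"] / info["AN"]
print(info["QD"], info["SB"], af_from_counts)
```

Here AC=3 over AN=6 reproduces the reported AF of 0.50, consistent with three heterozygous samples.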
Variant Quality Score Recalibration Model
Gaussian Mixture Model trained on annotated variants; find the MAP estimate using VBEM:

$$p(c) = \sum_{z} p(z)\, p(c \mid z) = \sum_{k=1}^{K} \pi_k\, p(\pi_k)\, \mathcal{N}(c \mid \mu_k, \Sigma_k)\, p(\mu_k, \Sigma_k)$$

p(\pi_k): Dirichlet distribution. Prior expectation is a sparse set.
p(\mu_k, \Sigma_k): Normal-inverse-Wishart distribution. Prior expectation is the empirical mean and empirical covariance of the data.
Bias away from singularities.
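To make the mixture term concrete, here is a small sketch that evaluates log p(c) for an already-fitted model. It is not the VBEM fit itself and it ignores the prior terms. Two assumptions are made for brevity: covariances are diagonal (the slide uses full Σ_k), and the weights, means and variances are made-up illustration values.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_density(x, weights, means, variances):
    """log sum_k pi_k N(x | mu_k, Sigma_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical two-component model over two annotations (say QD and MQ).
weights = [0.7, 0.3]
means = [[20.0, 60.0], [5.0, 40.0]]
variances = [[16.0, 25.0], [9.0, 100.0]]

near = log_mixture_density([19.0, 58.0], weights, means, variances)
far = log_mixture_density([0.0, 10.0], weights, means, variances)
print(near, far)
```

A variant whose annotations sit close to a cluster center gets a higher log density than one far from every cluster, which is the quantity the recalibrator turns into a per-variant score.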
Variant Quality Score Recalibration: training on highly confident
known sites to determine the probability that other sites are true
[Figure, panels A to E. Recoverable labels: "HiSeq: training on HapMap" (heterozygous variants, homozygous variants, likely dbSNP errors); "Gaussian mixture model fits"; "Evaluating novel variants" / "HiSeq: evaluating novel variants"; a More Bias / Less Bias scale. Per-tranche panels for the HiSeq, Exome and Low-pass call sets show Number of Novel Variants (1000s) split into cumulative TPs, tranche-specific TPs, tranche-specific FPs and cumulative FPs, with each analysis tranche labelled by Ti/Tv and FDR, and NR sensitivity (%) against HM3 and the 1KG Trio (low-pass: NGS only vs. with imputation). Numeric values are not recoverable from the extraction.]
Running the Variant Quality Score Recalibrator
•  Wiki page has full list of command lines broken out by the various steps in the process
•  Wiki page also has links to all the data sets we recommend using as training data
•  In a few weeks this whole process will be condensed into two much easier-to-use steps
See http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration
Contrastive VQSR Clustering Walkthrough
First, build a training set from the sites which overlap with HapMap 3.3 and the Omni chip.
Contrastive VQSR Clustering Walkthrough
Using the Variational Bayes EM algorithm, learn a probability distribution over the training set.
Contrastive VQSR Clustering Walkthrough
Assign a probability to each variant based on how well it clusters with the training set.
Unfortunately, a sizeable number of seemingly good variants fall outside of the main clusters.
Furthermore, all clusters are essentially two-sided tests, but most annotations are really only one-sided.
Contrastive VQSR Clustering Walkthrough
Solution: train a second set of clusters on the 10% of variants with the worst LOD scores.
This model for the bad variants allows for contrastive evaluation.
The new LOD score becomes the difference between the good model and the bad model.
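The contrastive step can be sketched as follows. This is an illustration only: each model is a single diagonal Gaussian fitted by moments rather than a VBEM-trained mixture, and the function names and toy annotation values are assumptions, not GATK code.

```python
import math


def fit_diag_gaussian(points):
    """Fit per-dimension mean and variance (with a small floor) to points."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    variances = [
        max(sum((p[d] - means[d]) ** 2 for p in points) / n, 1e-3)
        for d in range(dims)
    ]
    return means, variances


def log_density(x, model):
    means, variances = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def contrastive_lod(training, variants, worst_fraction=0.10):
    """Return one LOD per variant: log p_good(x) - log p_bad(x)."""
    good = fit_diag_gaussian(training)
    # Rank all variants under the good model; the lowest scores come first.
    ranked = sorted(variants, key=lambda v: log_density(v, good))
    n_bad = max(2, int(len(variants) * worst_fraction))
    bad = fit_diag_gaussian(ranked[:n_bad])
    return [log_density(v, good) - log_density(v, bad) for v in variants]


# Toy data: two annotations per variant; the last two look like artifacts.
training = [[20.0, 60.0], [22.0, 58.0], [18.0, 62.0], [21.0, 61.0], [19.0, 59.0]]
variants = [[20.0 + (i % 3 - 1), 60.0 + (i % 5 - 2)] for i in range(18)]
variants += [[2.0, 30.0], [3.0, 28.0]]

lods = contrastive_lod(training, variants)
print(lods[0], lods[-1])
```

Variants that resemble the training set get a positive LOD, while the two artifact-like sites get a strongly negative one because the bad model explains them better.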
Contrastive VQSR Clustering Walkthrough
Contrastive VQSR clustering allows us to rescue the variants which fall outside of the main clusters but which also don’t fit the model for bad variants.
[Figure: number of SNPs (y-axis, 0 to 6000) by AC (x-axis, 0 to 25)]
Sensitivity vs. specificity plots with the new Ti/Tv-less approach look good
[Figure, two panels: NA12878 HiSeq WGS and 1000G low-pass August N=629. x-axis: tranche truth sensitivity (0.5 to 1.0); y-axis: specificity (novel Ti/Tv ratio).]
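A rough sketch of the two axes in these plots, under stated assumptions: a tranche is defined by the lowest LOD cutoff that still retains a target fraction of truth sites, and specificity is summarized by the Ti/Tv ratio of the novel calls passing that cutoff. The function names and the toy LOD values are hypothetical.

```python
import math

# Transitions are A<->G and C<->T; every other substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def lod_threshold_for_sensitivity(truth_lods, target):
    """Lowest LOD cutoff that still keeps `target` of the truth sites."""
    ranked = sorted(truth_lods, reverse=True)
    # The small epsilon guards against floating-point overshoot in target * n.
    keep = max(1, math.ceil(target * len(ranked) - 1e-9))
    return ranked[keep - 1]


def titv_ratio(snps):
    """Transition/transversion ratio over (ref, alt) pairs."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


# Toy LOD scores at truth sites, and toy novel calls as (ref, alt, lod).
truth_lods = [9.0, 7.5, 6.0, 4.0, 3.5, 2.0, 1.0, 0.5, -1.0, -4.0]
novel = [("A", "G", 5.0), ("C", "T", 3.0), ("A", "C", 2.0),
         ("G", "A", 0.0), ("G", "T", -3.0), ("C", "G", -5.0)]

threshold = lod_threshold_for_sensitivity(truth_lods, 0.9)
passing = [(ref, alt) for ref, alt, lod in novel if lod >= threshold]
novel_titv = titv_ratio(passing)
print(threshold, novel_titv)
```

Sweeping the target sensitivity from 0.5 to 1.0 and recording the novel Ti/Tv at each cutoff would trace out one curve of the kind shown in the figure.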
61-sample CEU from 1000G
[Figure. x-axis: Alternate Allele Count in CEU Population. y-axes: Thousands of SNPs; Variant Rediscovery Rate; Transition to Transversion Ratio. Legend: 0.10%, 1.00%, 5.00% and 10.00% FDR tranches; Known Ti/Tv Ratio; Novel Ti/Tv Ratio. Annotation: 24,456 novel variants in aggregate with 1.83 Ti/Tv ratio.]
The low-confidence tranches consist of the low-frequency events (most likely FPs)
Broad discovered the most variants at very high
quality levels in 1000G chr20 bake-off exercise
Center  # samples  Total # variants  dbSNP % (129)  # Known  Known ti/tv  # Novels  Novel ti/tv  Includes genotype refinement?
Broad   1004       765,365           24.82          190,000  2.36         575,365   2.37         No
BC      1004       733,155           25.34          185,787  2.37         547,368   2.32         No
Sanger  1004       728,374           25.31          184,341  2.36         544,033   2.36         No
UMich   1004       721,250           26.46          190,871  2.33         530,379   2.35         Yes
Oxford  1004       660,024           27.44          181,095  2.38         478,929   2.38         Yes
BCM     1004       605,274           29.98          181,444  2.33         423,830   2.29         Yes
NCBI    1004       601,907           29.26          176,150  2.39         425,757   2.57         No
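As a sanity check on the table, the columns are internally consistent: known plus novel equals the total, and the dbSNP percentage is known over total. A short sketch of that arithmetic, with the values copied from the table (the 0.01 tolerance allows for rounding or truncation of the reported percentage):

```python
# center: (total variants, reported dbSNP %, known, novel)
ROWS = {
    "Broad": (765365, 24.82, 190000, 575365),
    "BC": (733155, 25.34, 185787, 547368),
    "Sanger": (728374, 25.31, 184341, 544033),
    "UMich": (721250, 26.46, 190871, 530379),
    "Oxford": (660024, 27.44, 181095, 478929),
    "BCM": (605274, 29.98, 181444, 423830),
    "NCBI": (601907, 29.26, 176150, 425757),
}


def row_is_consistent(total, reported_pct, known, novel, tolerance=0.01):
    """Check known + novel == total and known / total against the dbSNP %."""
    pct = 100.0 * known / total
    return known + novel == total and abs(pct - reported_pct) < tolerance


checks = {center: row_is_consistent(*row) for center, row in ROWS.items()}
for center, ok in checks.items():
    print(center, ok)
```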
Final Thoughts
•  Our data processing pipeline produces really good SNP calls. The
same pipeline is used for whole exome and WGS, both deep and
low-pass sequencing. Short indel calls too!
•  Anything can be used as truth data. Validation assays, several
1000G callsets, or auto-generate your own by subsetting to the
highest quality SNPs
•  There is no reason to decide between high sensitivity or high
specificity. Just use a probabilistic callset.
•  The tools are available to all: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Appendix
Step 2a: SNP discovery
Analysis-ready BAMs -> Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) -> Variant Quality Recalibration -> Beagle
•  The genotype likelihoods calculation now takes overlapping
read pairs (where bases are not independent observations) into
account, which we term “fragment-based calling”.
GATK single sample genotype likelihoods
Bayesian model:

$$L(G \mid D) = P(G)\, P(D \mid G) = \prod_{b \in \{\text{good\_bases}\}} P(b \mid G)$$

L(G | D): likelihood for the genotype
P(G): prior for the genotype
P(D | G): likelihood of the data given the genotype
Product over good bases: independent base model

•  Priors applied during multi-sample calculation; P(G) = 1
•  Likelihood of data computed using pileup of bases and associated quality scores at given locus
•  Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
•  P(b | G) uses calibrated base quality score
•  L(G|D) computed for all 10 genotypes
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
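A sketch of the independent-base model over the 10 diploid genotypes. The slide does not spell out P(b | G), so this assumes a common textbook form: each read base comes from either allele with probability 1/2, matches the allele with probability 1 - e, and is any one of the other three bases with probability e/3, where e is the error rate implied by the phred quality. Filtering to "good bases" and fragment-based handling of overlapping pairs are not shown.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def base_given_allele(base, allele, err):
    """P(observed base | true allele) under a uniform miscall model."""
    return 1.0 - err if base == allele else err / 3.0


def genotype_log_likelihoods(pileup):
    """pileup: list of (base, phred_quality). Returns {genotype: log10 L}."""
    result = {}
    for gt in GENOTYPES:
        total = 0.0
        for base, qual in pileup:
            err = 10.0 ** (-qual / 10.0)
            p = (0.5 * base_given_allele(base, gt[0], err)
                 + 0.5 * base_given_allele(base, gt[1], err))
            total += math.log10(p)
        result[gt] = total
    return result


# Toy pileup: six A and five G at Q30, plus one low-quality T.
pileup = [("A", 30)] * 6 + [("G", 30)] * 5 + [("T", 10)]
likelihoods = genotype_log_likelihoods(pileup)
best = max(likelihoods, key=likelihoods.get)
print(best, round(likelihoods[best], 2))
```

With a roughly balanced mix of A and G reads, the heterozygous AG genotype has the highest likelihood, and the single low-quality T is absorbed as a sequencing error.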
Step 2b: SNP discovery
Analysis-ready BAMs -> Unified Genotyper (Genotype Likelihoods Calculation, Allele Frequency Calculation) -> Variant Quality Recalibration -> Beagle
•  We now use Heng Li’s Exact model to calculate P(AF>0)
instead of our previous heuristic grid search model.
We apply a generalization of the single sample SNP caller for multi sample data
[Diagram: sample-associated reads (Individual 1, Individual 2, ..., Individual N) -> genotype likelihoods -> joint estimate across samples (allele frequency, genotype frequencies) -> SNPs]
•  This approach allows us to combine weak single sample calls to discover variation among several samples with high confidence
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
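The effect of pooling weak evidence can be illustrated with a small calculation in the spirit of an exact allele-count model. This is a simplified sketch, not the GATK implementation: it assumes biallelic sites, diploid samples, per-sample likelihoods given as (hom-ref, het, hom-alt), and a prior of theta / k on a non-reference allele count k. All function names and likelihood values are made up for illustration.

```python
from math import comb


def allele_count_likelihoods(sample_likelihoods):
    """P(D | AC = k) for k = 0..2n, built up one sample at a time."""
    z = [1.0]
    for lk in sample_likelihoods:
        nxt = [0.0] * (len(z) + 2)
        for k, zk in enumerate(z):
            for g in range(3):
                # comb(2, g) counts the ways to place g alt alleles in a diploid.
                nxt[k + g] += zk * comb(2, g) * lk[g]
        z = nxt
    chroms = len(z) - 1
    return [zk / comb(chroms, k) for k, zk in enumerate(z)]


def prob_variant(sample_likelihoods, theta=1e-3):
    """Posterior P(AF > 0) under a theta / k prior on the allele count."""
    lik = allele_count_likelihoods(sample_likelihoods)
    prior = [0.0] + [theta / k for k in range(1, len(lik))]
    prior[0] = 1.0 - sum(prior)
    post = [l * p for l, p in zip(lik, prior)]
    return 1.0 - post[0] / sum(post)


# One sample whose data favor het over hom-ref by only 30 to 1.
weak = (0.01, 0.3, 0.001)
single = prob_variant([weak])
joint = prob_variant([weak] * 3)
print(round(single, 3), round(joint, 3))
```

Under these toy numbers a single sample leaves the site very likely reference, while three samples with the same weak evidence push the posterior probability of a variant to roughly 0.8.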
Running the Unified Genotyper
java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b37_both.fasta \
  -T UnifiedGenotyper \
  -B:dbsnp,VCF dbsnp_132_b37.vcf \
  -o NA19240.raw.vcf \
  -stand_call_conf 30 \
  --heterozygosity 1.000000e-03 \
  -I NA19240.SLX.bam

-stand_call_conf 30: minimum phred-scaled confidence required to emit a SNP
--heterozygosity 1.000000e-03: 1 het per 1000 reference bases on average for a Yoruban
-I NA19240.SLX.bam: BAM file containing NA19240 SLX reads
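For readers less used to phred scaling, a quick worked conversion of the two numeric settings above. The helper names are illustrative only; the relationship Q = -10 log10(p) is the standard phred definition.

```python
import math


def phred_to_prob(q):
    """Error probability implied by a phred-scaled score."""
    return 10.0 ** (-q / 10.0)


def prob_to_phred(p):
    """Phred-scaled score for an error probability."""
    return -10.0 * math.log10(p)


# A calling confidence of 30 corresponds to a 1-in-1000 chance of error.
call_error = phred_to_prob(30)

# A heterozygosity of 1e-3 corresponds to one het per 1000 reference bases.
bases_per_het = 1.0 / 1.0e-3

print(call_error, prob_to_phred(call_error), bases_per_het)
```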
Raw VCF calls (NA19240.raw.vcf)
#CHROM  POS    ID          REF  ALT  QUAL    FILTER  INFO          FORMAT    NA19240
1       36496  .           T    A    53.13   .       <ATTRIBUTES>  GT:DP:GQ  1/0:6:84.70
1       45162  rs10399749  C    T    331.37  .       <ATTRIBUTES>  GT:DP:GQ  0/1:27:99.00
1       48677  .           G    A    399.86  .       <ATTRIBUTES>  GT:DP:GQ  1/0:25:99.00
<ATTRIBUTES>: long string of variant annotations (more info in a few slides)
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
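A small sketch of reading one of the raw calls above, pairing the FORMAT keys with the per-sample values. The function name is hypothetical; real VCF files are tab-delimited, and splitting on any whitespace is a simplification that works here because no field contains spaces.

```python
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line, sample_names):
    """Split one VCF data line into fixed columns plus per-sample fields."""
    fields = line.split()
    record = dict(zip(COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = float(record["QUAL"])
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


# Second record from the slide; <ATTRIBUTES> stands in for the INFO string.
line = "1 45162 rs10399749 C T 331.37 . <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00"
rec = parse_vcf_line(line, ["NA19240"])
print(rec["ID"], rec["QUAL"], rec["samples"]["NA19240"])
```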
Variants with bad Haplotype Scores often exhibit good Ti/Tv ratios and are included in other centers’ callsets, but are likely FPs
Bad sites being called by other centers but correctly filtered by the Broad.
Higher score means more evidence for error.
These sites are potentially bad in other SNP annotation dimensions.
