MPG NGS workshop - Broad Institute
MPG NGS workshop: SNP calling and error modeling
February 2011
Ryan Poplin
Genome Sequencing and Analysis, Medical and Population Genetics

The paradigm today
• Phase 1: NGS data processing (typically by lane). Input: raw reads → mapping → local realignment → duplicate marking → base quality recalibration → Output: analysis-ready reads.
• Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone). Input: Sample 1 reads ... Sample N reads → SNPs, indels, structural variation (SV) → Output: raw variants.
• Phase 3: Integrative analysis. Input: raw indels, raw SNPs, raw SVs, plus external data (pedigrees, known variation, population structure, known genotypes) → variant quality recalibration → genotype refinement → Output: analysis-ready variants.

Step 2: SNP discovery
Pipeline: Analysis-ready BAMs → Unified Genotyper (genotype likelihoods calculation, allele frequency calculation) → Variant Quality Recalibration → Beagle
• We no longer use any hard filters (proximity to indel calls, clustered SNPs, etc.) at any point in the process.
• Unified Genotyper math and command lines were discussed in previous meetings (see Appendix for full details).

Step 3: SNP discovery
Pipeline: Analysis-ready BAMs → Unified Genotyper (genotype likelihoods calculation, allele frequency calculation) → Variant Quality Recalibration → Beagle
• The variant quality recalibration process has recently gone through a major overhaul. Most notably, we have removed any dependency on Ti/Tv in the calculation. This and further changes are highlighted in the following slides.
• Outline:
  • Quick Variant Recalibration overview
  • Contrastive clustering walkthrough
  • Ti/Tv-free quality thresholding or commitment-free probabilistic callsets

Variant annotations provide signal with which to remove artifacts!
VCF record for an A/G SNP at 22:49582364:
  CHROM   22
  POS     49582364
  ID      .
  REF     A
  ALT     G
  QUAL    198.96
  FILTER  .
  INFO    AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76
  FORMAT  GT:DP:GQ
  Samples 0/1:12:99.00   0/1:11:89.43   0/1:28:37.78

INFO field annotations:
  AC    No. of chromosomes carrying the alt allele
  AN    Total no. of chromosomes
  AF    Allele frequency
  DP    Depth of coverage
  QD    QUAL score over depth
  AB    Allele balance of ref/alt in hets
  HRun  Length of longest contiguous homopolymer
  MQ    RMS MAPQ of all reads
  MQ0   No. of MAPQ 0 reads at locus
  SB    Estimated strand bias (SB) score

Variant Quality Score Recalibration Model
Gaussian mixture model trained on annotated variants; find the MAP estimate using VBEM:

$$p(c) = \sum_{z} p(z)\, p(c \mid z) = \sum_{k=1}^{K} \pi_k\, p(\pi_k)\, \mathcal{N}(c \mid \mu_k, \Sigma_k)\, p(\mu_k, \Sigma_k)$$

• $p(\pi_k)$: Dirichlet distribution. The prior expectation is a sparse set of clusters.
• $p(\mu_k, \Sigma_k)$: Normal-inverse-Wishart distribution. The prior expectation is the empirical mean and empirical covariance of the data, which biases the fit away from singularities.

Variant Quality Score Recalibration: training on highly confident known sites to determine the probability that other sites are true
[Figure, panels A to E: "HiSeq: training on HapMap" (heterozygous variants, homozygous variants, likely dbSNP errors; axis from less bias to more bias), "Gaussian mixture model fits", and "HiSeq: evaluating novel variants". Per-tranche tables for HiSeq, Exome, and Low-pass data report Ti/Tv, FDR, and NR sensitivity (%) against HM3 and the 1KG Trio, NGS only and with imputation. Bar charts show cumulative and tranche-specific TPs and FPs per analysis tranche; x-axis: Number of Novel Variants (1000s).]

Running the Variant Quality Score Recalibrator
• The wiki page has the full list of command lines, broken out by the various steps in the process.
• The wiki page also links to all the data sets we recommend using as training data.
• In a few weeks this whole process will be condensed into two much easier-to-use steps.
See http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration

Contrastive VQSR Clustering Walkthrough
• First, partition the data into a training set by taking the sites that overlap with HapMap3.3 and the Omni chip.
• Using the Variational Bayes EM algorithm, learn a probability distribution over the training set.
• Assign a probability to each variant based on how well it clusters with the training set. Unfortunately, a sizeable number of seemingly good variants fall outside of the main clusters. Furthermore, all clusters are essentially two-sided tests, but most annotations are really only one-sided.
• Solution: train a second set of clusters on the 10% of variants with the worst LOD. This model for the bad variants allows for contrastive evaluation. The new LOD score becomes the difference between the good model and the bad model.
• Contrastive VQSR clustering allows us to rescue the variants which fall outside of the main clusters but which also don't fit the model for bad variants.
[Figure: histogram of number of SNPs by AC.]
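The contrastive evaluation in the walkthrough can be sketched in a few lines. This is only a toy illustration under loud assumptions: it uses a single one-dimensional Gaussian per model in place of the VBEM-fitted Gaussian mixture, and the function names (`fit_gaussian`, `log_density`, `contrastive_lod`) and the QD-like numbers are hypothetical, not taken from the GATK.

```python
import math

def fit_gaussian(values):
    """Return (mean, variance) of the values; the variance is floored to stay positive."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, 1e-6)

def log_density(x, mean, var):
    """Natural-log density of a one-dimensional normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def contrastive_lod(call_values, training_values, worst_fraction=0.10):
    """Score each call as log p(good model) - log p(bad model).

    Assumption: one Gaussian per model stands in for the mixture model.
    """
    # Good model: learned from the training sites only.
    good_model = fit_gaussian(training_values)
    good_ll = [log_density(x, *good_model) for x in call_values]
    # Bad model: learned from the calls that fit the good model worst.
    n_bad = max(2, int(len(call_values) * worst_fraction))
    worst = sorted(range(len(call_values)), key=good_ll.__getitem__)[:n_bad]
    bad_model = fit_gaussian([call_values[i] for i in worst])
    # New LOD: difference between the good model and the bad model.
    return [g - log_density(x, *bad_model) for x, g in zip(call_values, good_ll)]

# Hypothetical QD-like annotation: training sites cluster near 20,
# likely artifacts sit near 2 to 4, and 14 and 26 are equally far from the cluster.
training = [18, 19, 20, 21, 22, 20, 19, 21]
calls = training * 2 + [2, 3, 4, 14, 26]
lod = dict(zip(calls, contrastive_lod(calls, training)))
for value in (4, 14, 26):
    print(value, round(lod[value], 1))
```

Under the good model alone, the values 14 and 26 score identically because both sit six units from the training mean. The contrast with the bad model separates them: the high value is far from the artifacts and is rescued with a larger LOD, while the value 4 ends up with a negative LOD.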
Sensitivity vs. specificity plots with the new Ti/Tv-less approach look good
[Figure, two panels: "NA12878 HiSeq WGS" and "1000G low-pass August N=629". x-axis: tranche truth sensitivity (0.5 to 1.0); y-axis: specificity (novel Ti/Tv ratio).]

61-sample CEU from 1000G
[Figure: x-axis: alternate allele count in CEU population; y-axes: thousands of SNPs, variant rediscovery rate, and transition to transversion ratio. Series: 0.10%, 1.00%, 5.00%, and 10.00% FDR tranches; known Ti/Tv ratio; novel Ti/Tv ratio. Annotation: 24,456 novel variants in aggregate with 1.83 Ti/Tv ratio.]
• The low confidence tranches are comprised of the low frequency events (most likely FPs).

Broad discovered the most variants at very high quality levels in 1000G chr20 bake-off exercise

  # samples  Center  Total # variants  dbSNP % known (129)  # Known  Known ti/tv  # Novels  Novel ti/tv  Includes genotype refinement?
  1004       Broad   765,365           24.82                190,000  2.36         575,365   2.37         No
  1004       BC      733,155           25.34                185,787  2.37         547,368   2.32         No
  1004       Sanger  728,374           25.31                184,341  2.36         544,033   2.36         No
  1004       UMich   721,250           26.46                190,871  2.33         530,379   2.35         Yes
  1004       Oxford  660,024           27.44                181,095  2.38         478,929   2.38         Yes
  1004       BCM     605,274           29.98                181,444  2.33         423,830   2.29         Yes
  1004       NCBI    601,907           29.26                176,150  2.39         425,757   2.57         No

Final Thoughts
• Our data processing pipeline produces really good SNP calls. The same pipeline is used for whole exome and WGS, both deep and low-pass sequencing. Short indel calls too!
• Anything can be used as truth data: validation assays, several 1000G callsets, or auto-generate your own by subsetting to the highest quality SNPs.
• There is no reason to decide between high sensitivity or high specificity. Just use a probabilistic callset.
• The tools are available to all: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

Appendix

Step 2a: SNP discovery
Pipeline: Analysis-ready BAMs → Unified Genotyper (genotype likelihoods calculation, allele frequency calculation) → Variant Quality Recalibration → Beagle
• The genotype likelihoods calculation now takes overlapping read pairs (where bases are not independent observations) into account, which we term “fragment-based calling”.
GATK single sample genotype likelihoods
Bayesian model:

$$L(G \mid D) = P(G)\, P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$

Here $L(G \mid D)$ is the likelihood for the genotype, $P(G)$ is the prior for the genotype, $P(D \mid G)$ is the likelihood of the data given the genotype, and the product is the independent base model.
• Priors are applied during the multi-sample calculation; here P(G) = 1.
• The likelihood of the data is computed using the pileup of bases and associated quality scores at a given locus.
• Only “good bases” are included: those satisfying minimum base quality, read mapping quality, pair mapping quality, and NQS.
• P(b | G) uses the calibrated base quality score.
• L(G | D) is computed for all 10 genotypes.
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

Step 2b: SNP discovery
Pipeline: Analysis-ready BAMs → Unified Genotyper (genotype likelihoods calculation, allele frequency calculation) → Variant Quality Recalibration → Beagle
• We now use Heng Li’s Exact model to calculate P(AF>0) instead of our previous heuristic grid search model.

We apply a generalization of the single sample SNP caller for multi sample data
Diagram: sample-associated reads (Individual 1, Individual 2, ..., Individual N) → genotype likelihoods → joint estimate across samples → allele frequency, genotype frequencies, SNPs
• This approach allows us to combine weak single sample calls to discover variation among several samples with high confidence.
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

Running the Unified Genotyper

  java -Xmx2048m -jar GenomeAnalysisTK.jar \
    -R /broad/1KG/reference/human_b37_both.fasta \
    -T UnifiedGenotyper \
    -B:dbsnp,VCF dbsnp_132_b37.vcf \
    -o NA19240.raw.vcf \
    -stand_call_conf 30 \
    --heterozygosity 1.000000e-03 \
    -I NA19240.SLX.bam

• -stand_call_conf 30: minimum phred-scaled confidence required to emit a SNP.
• --heterozygosity 1.000000e-03: 1 het per 1000 reference bases on average for a Yoruban.
• -I NA19240.SLX.bam: BAM file containing NA19240 SLX reads.
• -o NA19240.raw.vcf: raw VCF calls.

Raw VCF calls (NA19240.raw.vcf); <ATTRIBUTES> stands for a long string of variant annotations (more info in a few slides):

  #CHROM  POS    ID          REF  ALT  QUAL    FILTER  INFO          FORMAT    NA19240
  1       36496  .           T    A    53.13   .       <ATTRIBUTES>  GT:DP:GQ  1/0:6:84.70
  1       45162  rs10399749  C    T    331.37  .       <ATTRIBUTES>  GT:DP:GQ  0/1:27:99.00
  1       48677  .           G    A    399.86  .       <ATTRIBUTES>  GT:DP:GQ  1/0:25:99.00

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

Variants with bad Haplotype Scores often exhibit good Ti/Tv ratios and are included in other centers’ callsets, but are likely FPs
• Bad sites are being called by other centers but correctly filtered by the Broad.
• A higher score means more evidence for error.
• These sites are potentially bad in other SNP annotation dimensions.
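As a rough illustration of the single sample genotype likelihood earlier in this appendix, the sketch below computes L(G | D) as a product over good bases for all 10 diploid genotypes. The slides do not spell out P(b | G), so the per-base model here is an assumption: each allele of the genotype is equally likely to have produced the read, and an error with probability 10^(-Q/10) yields one of the three other bases uniformly. The function names and the quality thresholds are hypothetical, and only base quality and mapping quality are filtered.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = [a + b for a, b in combinations_with_replacement(BASES, 2)]

def base_given_allele(base, allele, error):
    """P(observed base | true allele), assuming errors are uniform over the other 3 bases."""
    return 1.0 - error if base == allele else error / 3.0

def genotype_log10_likelihoods(pileup, min_base_qual=10, min_map_qual=10):
    """Return {genotype: log10 L(G | D)} with P(G) = 1.

    pileup is a list of (base, base_quality, mapping_quality) tuples.
    The thresholds are hypothetical placeholders for the "good bases" filter.
    """
    log_l = {g: 0.0 for g in GENOTYPES}
    for base, base_qual, map_qual in pileup:
        if base_qual < min_base_qual or map_qual < min_map_qual:
            continue  # not a "good base"
        error = 10 ** (-base_qual / 10)
        for g in GENOTYPES:
            # Independent base model: each allele contributes half of the probability.
            p = 0.5 * base_given_allele(base, g[0], error) \
                + 0.5 * base_given_allele(base, g[1], error)
            log_l[g] += math.log10(p)
    return log_l

# Hypothetical pileup: 6 reads with A and 5 reads with G, all Q30 with MAPQ 60.
pileup = [("A", 30, 60)] * 6 + [("G", 30, 60)] * 5
likelihoods = genotype_log10_likelihoods(pileup)
best = max(likelihoods, key=likelihoods.get)
print(best, round(likelihoods[best], 2))
```

Working in log10 space avoids underflow at deep coverage and matches the phred-scaled conventions used in the VCF output.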