Alignment - UC Davis Bioinformatics Core
Transcription
Alignment - UC Davis Bioinformatics Core
Alignment UCD Genome Center Bioinformatics Core Wednesday 17 June 2015 From reads to molecules UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG CAATAAAAGTGCCGTATCATGCTGGTGTTACAATCGCCGCA CGTATCATGCTGGTGTTACAATCGCCGCATGACATGATCAATGG TGTCTGCTCAATAAAAGTGCCGTATCATGCTGGTGTTACAATC ATCGTCGGGTGTCTGCTCAATAAAAGTGCCGTATCATG--GGTGTTATAA CTCAATAAGAGTGCCGTATCATG--GGTGTTATAATCGCCGCA GTTATAATCGCCGCATGACATGATCAATGG To measure variation. UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Why align? UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Why align? UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Short Read Aligners (DNA, not RNA) Fall '12 - Apr '13: ... now 150-180 Gbp / day!* * http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.ilmn UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Burrows-Wheeler Aligners Burrows-Wheeler Transform used in bzip2 file compression tool; FM-index (Ferragina & Manzini) allow efficient finding of substring matches within compressed text – algorithm is sub-linear with respect to time and storage space required for a certain set of input data (reference 'ome, essentially). Reduced memory footprint, faster execution. UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 BWA BWA is fast, and can do gapped alignments. When run without seeding, it will find all hits within a given edit distance. Long read aligner is also fast, and can perform well for 454, Ion Torrent, Sanger, and PacBio reads. BWA is actively developed and has a strong user / developer community. bio-bwa.sourceforge.net Short reads – under 200 bp Li H. and Durbin R. (2009) “Fast and accurate short read alignment with Burrows-Wheeler Transform.” Bioinformatics, 25:1754-60. [PMID: 19451168] Long reads – over 200 bp … chimeric alignments built-in Li H. and Durbin R. (2010) “Fast and accurate long read alignment with Burrows-Wheeler Transform.” Bioinformatics, 26:589-95. [PMID: 20080505] … don't forget to join the mailing groups! UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Bowtie Bowtie (now Bowtie 2) is probably faster than BWA for some types of alignment, but it may not find the best alignments (see discussions on sensitivity, accuracy on SeqAnswers.com). Bowtie is part of a suite of tools (Bowtie, Tophat, Cufflinks, CummeRbund) that address RNAseq experiments. http://bowtie-bio.sourceforge.net Langmead B., Trapnell C., Pop M., and Salzberg S.L. (2009) “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome” Genome Biology 10:R25 [PMID: 19261174] … don't forget to join the mailing groups! UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Alignment concepts / parameters Paired-End reads UC Davis Genome Center | Bioinformatics Core | J Fass Mate-Paired reads Alignment 2015-06-17 Alignment concepts / parameters 454 "Paired-End" reads Single End Construct UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Alignment concepts / parameters UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Alignment concepts / parameters UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Alignment concepts / parameters UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Alignment concepts / parameters UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 File Format: SAM / BAM / CRAM ! NEW http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and up Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943] ● ● ● ● SAM specification (currently v1, renumbered … 1 is after old v1.4) samtools man page example workflow(s) mailing list! UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 File Format: SAM UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 File Format: SAM SAM Format Specification v1.4-r985 7,8 - formerly MRNM, MPOS (mate reference name, mate position) 9 - formerly ISIZE ("insert" size) UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 File Format: SAM google "Heng Li slides" - Challenges and Solutions in the Analysis of Next Generation Sequencing Data (2010) UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 File Format: BAM BAMs are compressed SAMs (so, binary, not human-readable text … don't look directly at them!). They can be indexed to allow rapid extraction of information, so alignment viewers do not need to uncompress the whole BAM file in order to look at information for a particular read or coordinate range, somewhere in the file. Indexing your BAM file, myCoolBamFile.bam, will create an index file, myCoolBamFile.bam.bai, which is needed (in addition to the BAM file) by viewers and other downstream tools. An occasional downstream tool will require an index called myCoolBamFile.bai (notice that the “.bai” replaces the “.bam”, instead of being appended after it). UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 File Format: CRAM Available as of SAMtools 1.0, and is a binary format like BAM. Uses data-specific compression tools (i.e. compressing letters is different than compressing numbers), specifically reference-based compression (e.g. for aligned reads, only mis-matching bases need to be stored). Also can employ lossy compression of base qualities, which appears to have a negligible effect on, say, variant calling (see Illumina white paper). Indexing your CRAM file, myCoolBamFile.cram, will create an index file, myCoolBamFile.cram.crai, which is needed (in addition to the CRAM file) by viewers and other downstream tools. This is a very recent development, so it may be a while before tools are CRAM-capable. UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Alignment Viewers ● ● ● IGV (Integrated Genomics Viewer) ○ www.broadinstitute.org/igv/ BAMview, tview (in SAMtools), IGB, GenomeView, SAMscope ... UCSC Genome Browser, GBrowse UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 IGV red box indicates region of reference in view below coverage track: read coverage depth plot read alignments: (various view styles - “squished” shown here) … read positions, orientations, pairing, sequence that disagrees with reference highlighted, improper pairs highlighted, etc. annotation tracks (GTF, BED, etc.) UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 IGV colored bases where they disagree with reference (substitution, indel, etc.) improper pairs (mate aligns far away, in wrong orientation, or on another chromosome) reference sequence, reading frames, etc. UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Calling - VCF format One main application of read alignment. A.k.a. "resequencing", SNP / indel discovery. VCF (variant call format) is now the standard format for variant reporting. http://vcftools.sourceforge.net/specs.html ... VCF poster UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format ##fileformat=VCFv4.1 ##fileDate=20130825 ##source=freeBayes v9.9.2-9-gfbf46fc-dirty ##reference=../results/8/8.fa ##phasing=none ##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction 0.03 --minmapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --usemapping-quality --min-alternate-fraction 0.04 --min-alternate-count 1 ../results/8/8.bam" ##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally"> ##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally"> ##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex."> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype"> ##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count"> ##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations"> ##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count"> ##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8 8_PB1 26 . TGTTACGCG GCTTTTGC,TGTTTCTAC 27.2619 . AO=1,2;RO=0;TYPE=complex, complex GT:DP:RO:QR:AO:QA:GL 2:3:0:0:1,2:31,70:-4.46,-1.65,0 8_PB1 38 . TCA ACG,TA,AGA 0.0495692 . AO=1,1,1;RO=3;TYPE=complex,del,mnp GT:DP:RO:QR:AO:QA:GL 2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,-4.28 8_PB1 42 . G A 3.94171e-14 . AO=8;RO=128;TYPE=snp GT:DP:RO:QR:AO:QA: GL UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format ##fileformat=VCFv4.1 ##fileDate=20130825 ##source=freeBayes v9.9.2-9-gfbf46fc-dirty ##reference=../results/8/8.fa ##phasing=none ##commandline="../tools/freebayes/bin/freebayes -f ../results/8/8.fa --min-alternate-fraction 0.03 --minmapping-quality 20 --min-base-quality 20 --ploidy 1 --pooled-continuous --use-best-n-alleles 4 --usemapping-quality --min-alternate-fraction 0.04 --min-alternate-count 1 ../results/8/8.bam" ##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count, with partial observations recorded fractionally"> ##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, with partial observations recorded fractionally"> ##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex."> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype"> ##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count"> ##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations"> ##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count"> ##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8 8_PB1 26 . TGTTACGCG GCTTTTGC,TGTTTCTAC 27.2619 . AO=1,2;RO=0;TYPE=complex, complex GT:DP:RO:QR:AO:QA:GL 2:3:0:0:1,2:31,70:-4.46,-1.65,0 8_PB1 38 . TCA ACG,TA,AGA 0.0495692 . AO=1,1,1;RO=3;TYPE=complex,del,mnp GT:DP:RO:QR:AO:QA:GL 2:6:3:101:1,1,1:31,37,34:0,-4.556,-4.004,-4.28 8_PB1 42 . G A 3.94171e-14 . AO=8;RO=128;TYPE=snp GT:DP:RO:QR:AO:QA: GL UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format #CHROM POS ID REF 8_PB2 407 . A 170:21:788:149:5579:-5,0 CHROM = 8_PB2 POS = 407 ID = . REF = A ALT = G QUAL = 3935.83 ALT G QUAL FILTER 3935.83 . INFO FORMAT 8 AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1: FILTER = . INFO = AO=149;RO=21;TYPE=snp FORMAT = GT:DP:RO:QR:AO:QA:GL 8 = 1:170:21:788:149:5579:-5,0 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format #CHROM POS ID REF 8_PB2 407 . A 170:21:788:149:5579:-5,0 CHROM = 8_PB2 POS = 407 ID = . REF = A ALT = G QUAL = 3935.83 ALT G QUAL FILTER 3935.83 . INFO FORMAT 8 AO=149;RO=21;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1: ##FORMAT=<ID=DP,Number=1,Type=Integer, Description="Read Depth"> FILTER = . INFO = AO=149;RO=21;TYPE=snp FORMAT = GT:DP:RO:QR:AO:QA:GL 8 = 1:170:21:788:149:5579:-5,0 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format ##INFO=<ID=RO,Number=1,Type=Integer,Description=" Reference allele observation count, with partial observations recorded fractionally"> ##INFO=<ID=AO,Number=A,Type=Integer,Description=" Alternate allele observations, with partial observations recorded fractionally"> ##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex."> UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format ##FORMAT=<ID=GT,Number=1,Type=String,Description=" Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Float,Description=" Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype"> ##FORMAT=<ID=GL,Number=G,Type=Float,Description=" Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description=" Read Depth"> UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Call Format ##FORMAT=<ID=RO,Number=1,Type=Integer,Description=" Reference allele observation count"> ##FORMAT=<ID=QR,Number=1,Type=Integer,Description=" Sum of quality of the reference observations"> ##FORMAT=<ID=AO,Number=A,Type=Integer,Description=" Alternate allele observation count"> ##FORMAT=<ID=QA,Number=A,Type=Integer,Description=" Sum of quality of the alternate observations"> UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 Variant Effect Prediction ● snpEff ● Variant Effect Predictor (EMBL) ● SIFT UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 VCF after Effect Prediction #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 8 8_PB2 407 . A G 3935.83 . AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING (LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1) GT:DP:RO:QR:AO:QA:GL 1:170:21:788:149:5579:-5,0 CHROM = 8_PB2 POS = 407 ID = . REF = A ALT = G QUAL = 3935.83 FILTER = . INFO = AO=149;RO=21;TYPE=snp;EFF=SYNONYMOUS_CODING (LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1) FORMAT = GT:DP:RO:QR:AO:QA:GL 8 = 1:170:21:788:149:5579:-5,0 UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17 VCF after Effect Prediction ##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex."> ##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change | Amino_Acid_length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )' "> INFO = AO=149;RO=21;TYPE=snp; EFF=SYNONYMOUS_CODING (LOW|SILENT|gaA/gaG|E123|759|PB2||CODING|Tr_PB2|1|1) UC Davis Genome Center | Bioinformatics Core | J Fass Alignment 2015-06-17