Comparative analysis of programs for mapping Bisulfite-seq reads
Govindarajan Kunde Ramamoorthy M.Sc., Robert A Waterland Ph.D.

Outline
• Introduction of Bisulfite-seq techniques
• Bisulfite-seq mapping algorithms/tools
• Benchmark data (Lister et al., Nature 2009)
• Analysis workflow
• Quality/mapping statistics
• Genomic methylation profiling
• Results & Conclusions

Bisulfite-seq techniques
A. methylC-seq: Lister et al.
B. BS-seq: Cokus et al.
C. RRBS: Meissner et al.
Lister and Ecker, Genome Research 2010

Cokus & Lister protocol (summary)
Chen et al. BMC Bioinformatics 2010, 11:203

Methylation Data Analysis Software
Software: Features
BISMARK: Supports both single-end and paired-end reads. Uses the Bowtie aligner.
PASH 3.0: Methylation & SNPs. Low memory use and high-speed alignment.
BSMAP: Maps both single-end and paired-end reads. Uses the SOAP aligner.
Methylcoder: Maps both single-end and paired-end reads. Also handles color space reads (SOLiD).
BS-Seq: Uses a Gaussian mixture model (GMM) to identify the probability of A vs G vs C vs T. GMM available only for the Arabidopsis genome.
BRAT: Maps both single-end and paired-end reads. Trims low-quality bases. Improves unique mapping for paired-end reads.
Kismeth: Web-based tool. Designed for plant methylation data.

BISMARK algorithm
• Bismark uses the Bowtie mapper for alignment.
• Post-processing scripts parse aligned reads to identify methylated and unmethylated Cs.
• Handles both single-end and paired-end libraries.
• Handles data generated from both the Cokus and Lister protocols.

BISMARK algorithm
[Figure: Bismark output]
Felix Krueger & Simon R. Andrews, Bioinformatics 2011

PASH algorithm
• Uses a multi-positional hashing data structure for alignment.
• Performs gapped alignment at the k-mer level.
• Explores all possible read k-mers.
• Handles only single-end library reads.

PASH algorithm
• Creates a k-mer multi-positional hash.
• Performs gapped alignment at the k-mer level.
• Scores k-mers at a given genomic window.
Coarfa et al. BMC Bioinformatics 2010

BSMAP algorithm
• Reference genome: create a seed table with both original and bisulfite variants as keys and values.
• Map reads to reference variants.
• Mask reference with 01=>C and 11=>A,T,G.
• Comparison using bitwise AND and XOR operations on both reference and masked reads.
• Uses the SOAP aligner.
Yuanxin Xi and Wei Li, BMC Bioinformatics 2009

Lister Dataset (Benchmark)
• Whole-genome Bisulfite-seq data of the H1 (human embryonic stem cell) cell line.
• 205 lanes of Illumina sequencing data.
• 1.97 billion reads (76 bp length).
• Sequenced ~164 billion bases in total.
• Human genome: hg18 assembly was used.
• Mapping results and % methylation were compared with other mapping algorithms.

Analysis workflow
[Flowchart, order reconstructed from the slide labels]
NCBI SRA DB
-> SRA to FASTQ format
-> Trim bad-quality reads
-> Bisulfite-seq mapping algorithms (BISMARK, BSMAP, PASH) against UCSC hg19
-> Results summary (% coverage & % methylation)

Read conversion & trimming
SRA to FASTQ:
• The fastq-dump utility from NCBI was used.
Read trimming & filtering:
• Adaptor sequences were removed.
• Bases with quality < 14 were removed.
• Reads shorter than 25 bp were removed.

Read Quality Statistics
[Figure: read quality per lane, lanes 1 to 196]
Average good reads: ~67 %

BISMARK: mapping efficiency
[Figure: % uniquely mapped reads (0 to 90 %) by lane number, lanes 1 to 196]
Average mapping: 75 %

BSMAP: mapping efficiency
[Figure: BSMAP mapping efficiency (unique reads), % uniquely mapped reads (0 to 90 %) by lane number, lanes 1 to 196]
Average mapping efficiency: 78.4 %

Benchmark results
• We used ~7 million reads (1 lane) of length 40 to 70 bp from H1 cells to compute the CPU time and memory usage.
• Server configuration: X5690 @ 3.47 GHz / 2 CPUs / 12 cores / 96 GB RAM

Aligner    CPU time      Memory usage
BSMAP      27 min        8.1 GB
BISMARK    2 h 48 min    12 GB
PASH       14 h          12 GB

Genomic regions of interest (ROI)
• We profiled two classes of genomic regions for % methylation and % coverage of CpG sites:
  - Transcription start site (TSS)
  - CpG island (CGI)

CpG statistics at TSS
• 2 kb flanking sequences around the TSS of ~27,500 known RefSeq genes were downloaded from UCSC hg19.
• 2.89 million CpG sites lie within the 2 kb TSS flanking regions.
• Sequences are divided into 20 bins.
• CpG sites that have a read depth of at least 4 reads are included in the analysis.

% CpG coverage at TSS (2 kb up/downstream)
[Figure: % CpG coverage (0 to 80 %) by position relative to the TSS, -2 kb to +2 kb in 0.2 kb steps; series: PASH, BISMARK, Lister analyzed, BSMAP]

% CpG methylation at TSS (2 kb up/downstream)
[Figure: % mCG/CG (0 to 100 %) by position relative to the TSS, -2 kb to +2 kb in 0.2 kb steps; series: PASH, BISMARK, Lister analyzed, BSMAP]

CpG statistics at CGI
• 28,691 CpG island sequences with 2 kb flanking regions were downloaded from UCSC (hg19 build).
• Sequences 2 kb up/downstream of the midpoint of each CGI were extracted.
• 4.4 million CpG sites lie within 2 kb up/downstream of the CGI centres.
• Sequences are divided into 20 bins.
• CpG sites that have a read depth of at least 4 reads are included in the analysis.
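
The binning and read-depth filter described for the TSS and CGI regions can be illustrated with a minimal Python sketch. The slides do not define the formulas, so this is an assumption-laden illustration, not the authors' pipeline. It assumes % coverage is the share of CpG sites in a bin with depth of at least 4, and % methylation is methylated reads over total reads at those covered sites. The `CpG` record, `bin_index`, and `profile` are hypothetical names introduced here.

```python
from collections import namedtuple

# Hypothetical record: offset is the CpG position relative to the TSS or CGI
# midpoint (bp), methylated is the number of reads calling the C methylated,
# depth is the total number of reads covering the site.
CpG = namedtuple("CpG", ["offset", "methylated", "depth"])

FLANK = 2000                        # 2 kb up/downstream, as in the slides
N_BINS = 20                         # 20 bins, as in the slides
MIN_DEPTH = 4                       # read depth threshold, as in the slides
BIN_WIDTH = 2 * FLANK // N_BINS     # 200 bp per bin


def bin_index(offset):
    """Return the bin (0..N_BINS-1) for an offset in [-FLANK, FLANK), else None.

    The half-open interval is an assumption; the slides do not say how
    boundary positions are assigned.
    """
    if not -FLANK <= offset < FLANK:
        return None
    return (offset + FLANK) // BIN_WIDTH


def profile(sites):
    """Compute (bin, % CpG coverage, % methylation) for each of the 20 bins.

    Assumed definitions:
      % coverage    = sites with depth >= MIN_DEPTH / all sites in the bin
      % methylation = methylated reads / total reads over the covered sites
    Empty bins yield None for the corresponding value.
    """
    total = [0] * N_BINS
    covered = [0] * N_BINS
    meth_reads = [0] * N_BINS
    all_reads = [0] * N_BINS
    for site in sites:
        b = bin_index(site.offset)
        if b is None:
            continue
        total[b] += 1
        if site.depth >= MIN_DEPTH:
            covered[b] += 1
            meth_reads[b] += site.methylated
            all_reads[b] += site.depth
    rows = []
    for b in range(N_BINS):
        cov = 100.0 * covered[b] / total[b] if total[b] else None
        meth = 100.0 * meth_reads[b] / all_reads[b] if all_reads[b] else None
        rows.append((b, cov, meth))
    return rows


if __name__ == "__main__":
    demo = [CpG(-2000, 3, 4), CpG(-1900, 0, 2), CpG(0, 1, 10)]
    for b, cov, meth in profile(demo):
        if cov is not None:
            print(b, cov, meth)
```

Running the same function once per aligner output would give the per-bin series plotted in the coverage and methylation figures, provided each aligner's calls are first converted to the same per-site record.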

% CpG coverage at CGI
[Figure: % CpG coverage (0 to 80 %) by position relative to the CGI midpoint, -2 kb to +2 kb in 0.2 kb steps; series: BISMARK, PASH, Lister analyzed, BSMAP]

% CpG methylation at CGI
[Figure: % CpG methylation (0 to 100 %) by position relative to the CGI midpoint, -2 kb to +2 kb in 0.2 kb steps; series: BISMARK, PASH, Lister analyzed, BSMAP]

Validation on in-house data
• Bisulfite-seq data of human samples.
• ~50 million 100 bp paired-end reads.

BISMARK mapping efficiency
[Figure: % of reads (0 to 90 %) per sample; categories: uniquely mapped, unmapped, multiple hits]

% CpG coverage at TSS (in-house data), BISMARK algorithm
[Figure: % CpG coverage at TSS (0 to 30 %) by position relative to the TSS, -2 kb to +2 kb in 0.2 kb steps]

Conclusions
• PASH appears to provide greater mapping coverage at both TSS and CGI. However, with longer reads (100 bp), BISMARK coverage is comparable.
• % methylation patterns are similar at both TSS and CGI.
• BSMAP alignment is much faster than BISMARK and PASH.
• Validation studies by bisulfite pyrosequencing are underway to determine the accuracy of methylation estimates obtained in genomic regions covered by PASH but not by other methods.

Acknowledgements
• Cristian Coarfa
• R. Alan Harris
• Aleksandar Milosavljevic