Comparative analysis of programs for mapping bisulfite

Comparative analysis of programs
for mapping Bisulfite-seq reads
Govindarajan Kunde Ramamoorthy M.Sc.,
Robert A Waterland Ph.D.,
Outline
• Introduction of Bisulfite-seq techniques
• Bisulfite-seq mapping algorithms/tools
• Benchmark data (Lister et al., Nature 2009)
• Analysis workflow
• Quality/mapping statistics
• Genomic methylation profiling
• Results & conclusions
Bisulfite-Seq techniques
A. MethylC-seq: Lister et al.
B. BS-seq: Cokus et al.
C. RRBS: Meissner et al.
Lister and Ecker, Genome Research 2010
Cokus & Lister protocol (summary)
Chen et al. BMC Bioinformatics 2010, 11:203
Methylation Data Analysis Software
Software      Features
BISMARK       Supports both single-end and paired-end reads. Uses the Bowtie aligner.
PASH 3.0      Methylation and SNPs. Low-memory, high-speed alignment.
BSMAP         Maps both single-end and paired-end reads. Uses the SOAP aligner.
Methylcoder   Also handles color-space reads (SOLiD).
BS-Seq        Uses a Gaussian mixture model (GMM) to identify the probability of A vs G vs C vs T. GMM available only for the Arabidopsis genome.
BRAT          Trims low-quality bases. Improves unique mapping for paired-end reads.
Kismeth       Web-based tool. Designed for plant methylation data.
BISMARK algorithm
• Bismark uses the Bowtie mapper for alignment.
• Post-processing scripts parse the aligned reads to identify methylated and unmethylated Cs.
• Handles both single-end and paired-end libraries.
• Handles data generated with both the Cokus and Lister protocols.
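The methylation-calling step can be illustrated with a minimal sketch. It is not Bismark's own code: it assumes an ungapped read already aligned to the forward strand of the reference, and the function name `call_methylation` is hypothetical. It relies only on the standard bisulfite logic that unmethylated C is read as T while methylated C stays C.

```python
def call_methylation(ref, read):
    """Classify each reference C covered by an aligned bisulfite read.

    Assumption: `ref` and `read` have equal length, are ungapped, and the
    read comes from the forward strand (hypothetical simplification).
    """
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(ref, read)):
        if ref_base != "C":
            continue  # only cytosines carry methylation information here
        if read_base == "C":
            calls.append((pos, "methylated"))    # protected from conversion
        elif read_base == "T":
            calls.append((pos, "unmethylated"))  # converted by bisulfite
        # any other base is a mismatch or sequencing error; no call is made
    return calls


calls = call_methylation("ACGTCA", "ACGTTA")
```

In this example the C at position 1 is called methylated and the C at position 4 unmethylated.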
BISMARK algorithm
[Figure: Bismark algorithm and Bismark output. Felix Krueger & Simon R. Andrews, Bioinformatics 2011]
PASH algorithm
• Uses a multi-positional hashing data structure for alignment.
• Performs gapped alignment at the k-mer level.
• Explores all possible read k-mers.
• Handles only single-end library reads.
PASH algorithm
• Creates a k-mer multi-positional hash.
• Performs gapped alignment at the k-mer level.
• Scores k-mers in a given genomic window.
Coarfa et al. BMC Bioinformatics 2010
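The idea of hashing k-mers by position and scoring candidate locations can be sketched generically as below. This is not PASH's actual data structure or scoring scheme: it is a plain k-mer seed index that ignores bisulfite conversion and gaps, and the names `build_kmer_index` and `candidate_offsets` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def candidate_offsets(read, index, k):
    """Count, for each implied read start position, how many read k-mers agree."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1  # k-mer at read offset j implies start pos - j
    return dict(votes)


index = build_kmer_index("ACGTACGTTT", 4)
votes = candidate_offsets("TACGTT", index, 4)
```

Here every k-mer of the read is looked up, and the start position supported by the most k-mers (position 3 in this toy example) is the strongest candidate.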
BSMAP algorithm
• Reference genome: create a seed table with both the original and bisulfite variants as keys and values.
• Map reads to the reference variants.
• Mask the reference with 01 => C and 11 => A, T, G.
• Compare using bitwise AND and XOR operations on both the reference and the masked reads.
• Uses the SOAP aligner.
Yuanxin Xi and Wei Li BMC Bioinformatics 2009
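The masking and comparison steps can be sketched per base as follows. This is an illustration under assumptions, not BSMAP's implementation: it assumes the 2-bit encoding A=00, C=01, G=10, T=11, builds the mask from the reference (01 at C, 11 elsewhere), and works one base at a time instead of on packed machine words. The function name `bs_mismatches` is hypothetical.

```python
# Assumed 2-bit nucleotide encoding (not stated on the slide).
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def bs_mismatches(ref, read):
    """Count mismatches while tolerating C->T conversion at reference C sites."""
    if len(ref) != len(read):
        raise ValueError("ref and read must have the same length")
    mismatches = 0
    for ref_base, read_base in zip(ref, read):
        mask = 0b01 if ref_base == "C" else 0b11
        masked_read = ENC[read_base] & mask   # a T (11) at a C site becomes C (01)
        if masked_read ^ ENC[ref_base]:       # non-zero XOR means the bases differ
            mismatches += 1
    return mismatches


n_allowed = bs_mismatches("ACGT", "ATGT")   # bisulfite-converted C, no mismatch
n_real = bs_mismatches("ATGT", "ACGT")      # T in reference, C in read: mismatch
```

With this mask, a read T opposite a reference C is accepted, while a read C opposite a reference T still counts as a mismatch, which is the asymmetry bisulfite mapping needs.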
Lister Dataset (Benchmark)
• Whole-genome Bisulfite-seq data from the H1 human embryonic stem cell line.
• 205 lanes of Illumina sequencing data.
• 1.97 billion reads (76 bp length).
• ~164 billion bases sequenced in total.
• The human genome hg18 assembly was used.
• Mapping results and % methylation were compared with those of the other mapping algorithms.
Analysis Work flow
NCBI SRA DB
-> SRA -> FASTQ format
-> Trim bad-quality reads
-> Bisulfite-seq mapping algorithms: BISMARK, BSMAP, PASH (reference genome: UCSC hg19)
-> Results summary (% coverage & % methylation)
Read conversion & Trimming
SRA to FASTQ:
• Fastq-dump utility from NCBI used.
Read trimming & filtering:
• Adaptor sequences were removed.
• Bases with quality < 14 were removed.
• Reads shorter than 25 bp were removed.
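The quality and length filters can be sketched as below. The slide does not say how low-quality bases were removed, so this sketch assumes 3'-end trimming and Phred+33 quality encoding; the function name `trim_read` and its parameters are hypothetical, and adaptor removal is not shown.

```python
def trim_read(seq, qual, min_q=14, min_len=25, offset=33):
    """Trim low-quality bases from the 3' end, then apply the length filter.

    Assumptions: Phred+33 qualities and 3'-end trimming only.
    Returns (seq, qual) for a kept read, or None if the read is discarded.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1  # drop trailing bases below the quality threshold
    if end < min_len:
        return None  # too short after trimming
    return seq[:end], qual[:end]


kept = trim_read("A" * 30, "I" * 26 + "#" * 4)       # 4 low-quality bases trimmed
dropped = trim_read("A" * 30, "I" * 20 + "#" * 10)   # only 20 bp left, discarded
```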
Read Quality Statistics
Average good reads ~ 67 %
BISMARK – mapping efficiency
[Figure: % uniquely mapped reads (y-axis, 0 to 90 %) by lane number (x-axis, lanes 1 to 196)]
Average mapping: 75 %
BSMAP – mapping efficiency (unique reads)
[Figure: % uniquely mapped reads (y-axis, 0 to 90) by lane number (x-axis, lanes 1 to 196)]
Average mapping efficiency: 78.4 %
Benchmark results
• We used ~7 million reads (1 lane, read length 40 to 70 bp) from H1 cells to measure CPU time and memory usage.
• Server configuration: X5690 @ 3.47 GHz / 2 CPUs / 12 cores / 96 GB RAM
Aligner    CPU time        Memory usage
BSMAP      27 mins         8.1 GB
BISMARK    2 hrs 48 mins   12 GB
PASH       14 hrs          12 GB
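To put the CPU times on a common scale, the reported values can be converted to minutes and expressed relative to BSMAP. This is only arithmetic on the numbers above; the helper name `relative_time` is hypothetical.

```python
# CPU times from the benchmark, converted to minutes.
cpu_minutes = {"BSMAP": 27, "BISMARK": 2 * 60 + 48, "PASH": 14 * 60}


def relative_time(times, baseline):
    """Return each aligner's CPU time as a multiple of the baseline aligner's."""
    return {name: minutes / times[baseline] for name, minutes in times.items()}


ratios = relative_time(cpu_minutes, "BSMAP")
# BISMARK takes roughly 6 times and PASH roughly 31 times as long as BSMAP.
```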
Genomic regions of interest (ROI)
• We profiled two classes of genomic regions for % methylation and % coverage of CpG sites:
– Transcription start sites (TSS)
– CpG islands (CGI)
CpG statistics at TSS
• TSS-flanking 2 kb sequences of ~27,500 known RefSeq genes were downloaded from UCSC hg19.
• 2.89 million CpG sites lie in the TSS-flanking 2 kb regions.
• Sequences are divided into 20 bins.
• Only CpG sites with a read depth of at least 4 reads are included in the analysis.
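The binning and depth filter can be sketched as follows. The slides do not define the two percentages precisely, so this sketch assumes that % coverage is the share of CpG sites in a bin with depth of at least 4, and that % methylation is methylated reads over total reads at those covered sites. The function name `profile_bins` and the input tuple layout are hypothetical.

```python
def profile_bins(sites, flank=2000, n_bins=20, min_depth=4):
    """Summarise CpG sites around a TSS into equal-width bins.

    sites: iterable of (offset_from_tss, methylated_reads, total_reads).
    Returns a list of (pct_coverage, pct_methylation) per bin; None where undefined.
    """
    bin_size = (2 * flank) // n_bins  # 200 bp for a 4 kb window and 20 bins
    total = [0] * n_bins
    covered = [0] * n_bins
    meth = [0] * n_bins
    depth = [0] * n_bins
    for offset, meth_reads, total_reads in sites:
        if not -flank <= offset < flank:
            continue  # outside the profiled window
        b = (offset + flank) // bin_size
        total[b] += 1
        if total_reads >= min_depth:
            covered[b] += 1
            meth[b] += meth_reads
            depth[b] += total_reads
    profile = []
    for b in range(n_bins):
        pct_cov = 100.0 * covered[b] / total[b] if total[b] else None
        pct_meth = 100.0 * meth[b] / depth[b] if depth[b] else None
        profile.append((pct_cov, pct_meth))
    return profile


profile = profile_bins([(-2000, 4, 4), (-1900, 0, 2), (1999, 1, 4)])
```

In this toy input the first bin holds two CpG sites, of which one passes the depth filter, and the last bin holds one covered site.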
% CpG coverage at TSS (2 kb up/downstream)
[Figure: % CpG coverage (y-axis, 0 to 80) by position relative to the TSS (x-axis, -2 kb to +2 kb in 0.2 kb steps). Series: PASH, BISMARK, Lister analyzed, BSMAP.]
% CpG methylation at TSS (2 kb up/downstream)
[Figure: % mCG/CG (y-axis, 0 to 100) by position relative to the TSS (x-axis, -2 kb to +2 kb in 0.2 kb steps). Series: PASH, BISMARK, Lister analyzed, BSMAP.]
CpG statistics at CGI
• 28,691 CpG island sequences with 2 kb flanking regions were downloaded from the UCSC hg19 build.
• Sequences 2 kb upstream and downstream of the midpoint of each CGI were extracted.
• 4.4 million CpG sites lie within 2 kb upstream or downstream of the CGI centre.
• Sequences are divided into 20 bins.
• Only CpG sites with a read depth of at least 4 reads are included in the analysis.
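Extracting the window around the CGI midpoint amounts to a small coordinate calculation. The sketch below assumes 0-based, half-open start/end coordinates as in UCSC BED files; the function name `cgi_window` is hypothetical.

```python
def cgi_window(start, end, flank=2000):
    """Return the (start, end) of a window of +/- flank bp around the CGI midpoint.

    Assumption: 0-based, half-open coordinates; the window start is clipped at 0.
    """
    mid = (start + end) // 2
    return max(0, mid - flank), mid + flank


window = cgi_window(10000, 11000)  # midpoint 10500 gives the window (8500, 12500)
```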
% CpG coverage at CGI
[Figure: % CpG coverage (y-axis, 0 to 80) by position relative to the CGI midpoint (x-axis, -2 kb to +2 kb in 0.2 kb steps). Series: BISMARK, PASH, Lister analyzed, BSMAP.]
% CpG methylation at CGI
[Figure: % CpG methylation (y-axis, 0 to 100) by position relative to the CGI midpoint (x-axis, -2 kb to +2 kb in 0.2 kb steps). Series: BISMARK, PASH, Lister analyzed, BSMAP.]
Validation on In-house data
• Bisulfite-seq data from human samples.
• ~50 million 100 bp paired-end reads.
BISMARK mapping efficiency
[Figure: % of reads (y-axis, 0 to 90) per sample (x-axis). Series: uniquely mapped, unmapped, multiple hits.]
% CpG coverage at TSS (in-house data)
BISMARK algorithm
[Figure: % CpG coverage at TSS (y-axis, 0 to 30) by position relative to the TSS (x-axis, -2 kb to +2 kb in 0.2 kb steps).]
Conclusions
• PASH appears to provide greater mapping coverage at both TSS and CGI. However, with longer reads (100 bp), BISMARK coverage is comparable.
• % methylation patterns are similar at both TSS and CGI.
• BSMAP aligns much faster than BISMARK and PASH (27 mins vs. 2 hrs 48 mins and 14 hrs on one lane).
• Validation studies by bisulfite pyrosequencing are
underway to determine the accuracy of methylation
estimates obtained in genomic regions covered by
PASH but not by other methods.
Acknowledgements
• Cristian Coarfa
• R. Alan Harris
• Aleksandar Milosavljevic
