Slides - San Diego Supercomputer Center

Transcription

Compute- and Data-Intensive Analyses in Bioinformatics
Wayne Pfeiffer
SDSC/UCSD
August 8, 2012

SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Questions for today
•  How big is the flood of data from high-throughput DNA sequencers?
•  What bioinformatics codes are installed at SDSC?
•  What are typical compute- and data-intensive analyses in bioinformatics?
•  What are their computational requirements?
Size matters: how much data are we talking about?
•  3.1 GB for human genome
•  Fits on flash drive; assumes FASTA format (1 B per base)
•  >100 GB/day from a single Illumina HiSeq 2000
•  50 Gbases/day of reads in FASTQ format (2.5 B per base)
•  300 GB to >1 TB of reads needed as input for analysis of whole human genome, depending upon coverage
•  300 GB for 40x coverage
•  1 TB for 130x coverage
•  Multiple TB needed for subsequent analysis
•  45 TB on disk at SDSC for W115 project (~10,000x single genome)
•  Multiple genomes per person!
•  May only be looking for kB or MB in the end
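The sizes above follow from simple arithmetic: bytes ≈ coverage × genome size × bytes per base. A minimal sketch using the slide's rules of thumb (1 B/base for FASTA, 2.5 B/base for FASTQ):

```python
# Back-of-the-envelope storage estimates using the slide's rules of thumb:
# FASTA ~1 byte per base; FASTQ ~2.5 bytes per base (sequence + quality + headers).

GENOME_BASES = 3.1e9        # haploid human genome, ~3.1 Gbases
FASTQ_B_PER_BASE = 2.5

def fastq_bytes(coverage, genome_bases=GENOME_BASES):
    """Approximate size of raw reads at a given coverage depth."""
    return coverage * genome_bases * FASTQ_B_PER_BASE

for cov in (40, 130):
    print(f"{cov}x coverage: ~{fastq_bytes(cov) / 1e12:.2f} TB of FASTQ reads")
# → 40x: ~0.31 TB; 130x: ~1.01 TB, matching the 300 GB and 1 TB figures above
```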
Market-leading DNA sequencers come from
Illumina & Life Technologies (both SD County companies)
•  Illumina HiSeq 2000
•  Big; $690,000 list price
•  High throughput
•  Low error rate
•  100-bp paired-end reads
[diagram: a pair of reads sequenced from opposite ends of a DNA fragment]
•  Life Technologies Ion PGM
•  Small; $50,000 list price
•  Low throughput
•  Modest error rate
•  ≤250-bp reads
Cost of DNA sequencing is dropping much faster (1/10 in 2 y)
than cost of computing (1/2 in 2 y);
this is producing the flood of data
What does this mean?
•  Growth of read data is roughly inversely proportional to drop in sequencing cost
•  >100 GB/day of reads from a single Illumina HiSeq 2000 now
•  1 TB/day of reads from a sequencer likely by 2014
•  Analysis & quality control will dominate the cost
•  <$10,000 for sequencing human genome now
•  $1,000 for sequencing human genome in 2013 or 2014
•  ≥$10,000 for analysis & quality control of human genome sequence now and decreasing relatively slowly
•  Analysis improvements are needed to take advantage of new sequencing technology
Many widely-used bioinformatics codes
are installed on Triton, Trestles, & Gordon
•  Pairwise sequence alignment
•  ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
•  Multiple sequence alignment (via CIPRES gateway)
•  ClustalW, MAFFT
•  RNA-Seq analysis
•  TopHat, Cufflinks
•  De novo assembly
•  ABySS, SOAPdenovo, Velvet
•  Phylogenetic tree inference (via CIPRES gateway)
•  BEAST, GARLI, MrBayes, RAxML
•  Tool kits
•  BEDTools, GATK, SAMtools
Computational requirements for some codes & data sets
can be substantial

Code & data set                      Input  Output  Memory  Time  Cores /
                                      (GB)    (GB)    (GB)   (h)  computer
BFAST 0.6.4c,                           26      19      17     8  8 / Dash
  52M 100-bp reads
SOAPdenovo 1.05,                       424      77     387    26  16 / Triton P+C
  1.7B 100-bp reads
Velvet 1.1.06,                          35     617     539     9  16 / Triton PDAF
  562M ≤50-bp reads
MrBayes 3.2.1, DNA data,                <1      27      12   155  8 / Gordon
  40 taxa, 16k patterns
RAxML 7.2.7, amino acid data,           <1      <1      47   106  160 / Trestles
  1.6k taxa, 8.8k patterns
Benchmark tests were run on various computers,
some with large shared memory
•  Gordon from Appro at SDSC
•  16-core nodes with 2.6-GHz Intel Sandy Bridge processors
•  64 GB of memory per node + vSMP
•  Trestles from Appro at SDSC
•  32-core nodes with 2.4-GHz AMD Magny-Cours processors
•  64 GB of memory per node
•  Triton CC & Dash from Appro at SDSC
•  8-core nodes with 2.4-GHz Intel Nehalem processors
•  24 & 48 GB of memory per node + vSMP on Dash
•  Triton PDAF from Sun at SDSC
•  32-core nodes with 2.5-GHz AMD Shanghai processors
•  256 & 512 GB of memory per node
•  Blacklight from SGI at PSC
•  2,048-core NUMA nodes with 2.27-GHz Intel Nehalem processors
•  16 TB of memory per NUMA node
Typical projects involve multiple codes,
some with multiple steps, combined in workflows
•  HuTS: Human Tumor Study
•  Search for genome variants between blood and tumor tissue
•  Start from Illumina 100-bp paired-end reads
•  Use BWA & GATK on Triton to find SNPs & short indels
•  Use SOAPdenovo, ATAC, & custom scripts on Triton to find long indels
•  W115: Study of 115-year-old woman’s genomes (!)
•  Search for genome variants between blood and brain tissue
•  Start from SOLiD 50-bp reads
•  Use BioScope, SAMtools, & GATK elsewhere to find SNVs & short indels
•  Use SAMtools, ABySS, Velvet, ATAC, BFAST, & custom scripts on Triton to find long indels
[photo: Hendrikje van Andel-Schipper]
Computational workflows for
common bioinformatics analyses

[workflow diagram, flattened:]
DNA reads in FASTQ format
  → Read mapping, i.e., pairwise alignment (BFAST, BWA, …), using reference genome in FASTA format
    → Alignment info in BAM format
    → Variant calling (GATK, …) → Variants: SNPs, indels, others
  → De novo assembly (SOAPdenovo, Velvet, …)
    → Contigs & scaffolds in FASTA format
      → Pairwise alignment (ATAC, BLAST, …) vs reference genome in FASTA format
        → Alignment info in various formats → Variant calling (GATK, …) → Variants
      → Multiple sequence alignment (ClustalW, MAFFT, …)
        → Aligned sequences in various formats
        → Phylogenetic tree inference (MrBayes, RAxML, …) → Tree in various formats
Computational workflow for
read mapping & variant calling

DNA reads in FASTQ format + reference genome in FASTA format
  → Read mapping, i.e., pairwise alignment (BFAST, BWA, …)
  → Alignment info in BAM format
  → Variant calling (GATK, …)
  → Variants: SNPs, indels, others

Goal: identify simple variants, e.g.,
•  single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)

   CACCGGCGCAGTCATTCTCATAAT
   ||||||||||| ||||||||||||
   CACCGGCGCAGACATTCTCATAAT

•  short insertions & deletions (indels)

   CACCGGCGCAGTCATTCTCATAAT
   ||||||||||   |||||||||||
   CACCGGCGCA---ATTCTCATAAT
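The SNP panel above amounts to a column-by-column comparison of an aligned read against the reference. A toy sketch of that idea (real variant callers such as GATK weigh base qualities and many reads per position; this handles only the gapless-alignment case):

```python
def find_snps(reference, aligned_read):
    """Report positions where an aligned read differs from the
    reference by a single-base substitution (a candidate SNP)."""
    assert len(reference) == len(aligned_read)
    return [(i, ref, alt)
            for i, (ref, alt) in enumerate(zip(reference, aligned_read))
            if ref != alt]

# The slide's SNP example: T→A substitution at position 11 (0-based)
ref  = "CACCGGCGCAGTCATTCTCATAAT"
read = "CACCGGCGCAGACATTCTCATAAT"
print(find_snps(ref, read))  # → [(11, 'T', 'A')]
```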
Pileup diagram shows mapping of reads to reference;
example from HuTS shows a SNP in KRAS gene; this
means that cetuximab is not effective for chemotherapy

BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI
BFAST took about 8 hours & 17 GB of memory
to map a small set of reads; speedup was 3.7 on 8 cores
•  Parallelization is typically done by
•  Separate runs for each lane of reads
•  Threads within a run

                 1-thread   8-thread            8-thread
Step             time (h)   time (h)   Speedup  memory (GB)
Match                 8.4        3.1       2.7         16.9
Align                19.3        4.2       4.6          2.0
Postprocess           0.4        0.4       1.0          2.2
Total                28.2        7.7       3.7

•  Tabulated results are for
•  One lane of Illumina 100-bp paired-end reads: 52 million reads
•  One index with k=22 on reference human genome (done previously)
•  One 8-core node of Dash with 2.4-GHz Intel Nehalems & 48 GB of memory
•  26 GB input, half for reads & half for index; 19 GB output
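The speedup quoted in the title is the ratio of the 1-thread and 8-thread totals in the table; dividing by the core count gives parallel efficiency:

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    """Speedup per core; 1.0 would be perfect scaling."""
    return speedup(t_serial, t_parallel) / cores

# BFAST totals from the table: 28.2 h on 1 thread, 7.7 h on 8 threads
s = speedup(28.2, 7.7)
e = efficiency(28.2, 7.7, 8)
print(f"speedup {s:.1f}, efficiency {e:.2f}")  # → speedup 3.7, efficiency 0.46
```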
Computational workflow for
de novo assembly & variant calling

DNA reads in FASTQ format
  → De novo assembly (SOAPdenovo, Velvet, …)
  → Contigs & scaffolds in FASTA format
  → Pairwise alignment (ATAC, BLAST, …) vs reference genome in FASTA format
  → Alignment info in various formats
  → Variant calling (GATK, …)
  → Variants: SNPs, indels, others

Goal: identify more complex variants, e.g.,
•  large indels
•  duplications
•  inversions
•  translocations
Key conceptual steps in de novo assembly
1.  Find reads that overlap by a specified number of bases (the k-mer size), typically by building a graph in memory
2.  Merge overlapping, “good” reads into longer contigs, typically by simplifying the graph
3.  Link contigs to form scaffolds using paired-end information

Diagrams from Serafim Batzoglou, Stanford
de Bruijn graph has
k-mers as nodes connected by reads;
assembly involves finding Eulerian path through graph
[diagram: de Bruijn graph with k-mer nodes such as AGAC]
Diagram from Michael Schatz, Cold Spring Harbor
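The construction can be sketched in a few lines. The slide's diagram draws k-mers at the nodes; the equivalent common formulation below (a toy example, not assembler code) uses (k-1)-mers as nodes, with each k-mer in a read contributing one edge:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer
    in a read contributes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Two overlapping reads drawn from the sequence AGACT
g = de_bruijn_graph(["AGAC", "GACT"], k=3)
print(dict(g))  # → {'AG': ['GA'], 'GA': ['AC', 'AC'], 'AC': ['CT']}
```

An Eulerian path through this graph (AG → GA → AC → CT) spells out the assembled sequence; repeated edges such as GA → AC are where real assemblers must untangle repeats.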
SOAPdenovo & Velvet are two leading assemblers
that use de Bruijn graph algorithm
•  SOAPdenovo is from BGI
•  Code has four steps: pregraph, contig, map, & scaffold
•  pregraph & map are parallelized with Pthreads, but not reproducibly
•  pregraph uses the most time & memory
•  http://soap.genomics.org.cn/soapdenovo.html
•  Velvet is from EMBL-EBI
•  Code has two steps: hash & graph
•  Both are parallelized with OpenMP, but not reproducibly
•  Either step can use more time or memory depending upon problem & computer
•  http://www.ebi.ac.uk/~zerbino/velvet
•  k-mer size is adjustable parameter
•  Typically it is adjusted to maximize N50 length of scaffolds or contigs
•  N50 length is central measure of distribution weighted by lengths
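N50 can be stated operationally: sort the contigs longest first and take the length at which the running total first reaches half of all assembled bases. A minimal sketch:

```python
def n50(lengths):
    """N50: the contig length at which the running total, taken over
    contigs sorted longest-first, first reaches half of all bases."""
    half = sum(lengths) / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length

print(n50([100, 80, 50, 30, 20, 10]))  # → 80 (100 + 80 ≥ 145, half of 290)
```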
SOAPdenovo & Velvet each have their strengths
•  Quality of assembly
•  Both give similar assemblies
•  Speed
•  SOAPdenovo is faster
•  Memory
•  SOAPdenovo uses much less memory
•  vSMP
•  Velvet often runs well with vSMP, whereas SOAPdenovo does not
•  Reads
•  Both work with Illumina reads, but only Velvet works with SOLiD reads
Graph step of Velvet works well on Gordon with vSMP;
Gordon, Blacklight, & Triton PDAF have similar speeds
when memory for hash step is small
Hash step of Velvet runs much slower on Gordon with vSMP &
somewhat slower on Blacklight when memory for hash step is large;
graph step still works well on Gordon with vSMP
What is going on?
•  Memory access for graph step of Velvet is fairly regular
•  This is efficient with vSMP
•  Performance improved significantly last year through tuning of vSMP by ScaleMP
•  Memory access for hash step of Velvet is nearly random
•  This is inefficient with vSMP
•  Memory access for pregraph step of SOAPdenovo (not shown) is also nearly random
•  Since pregraph step uses most memory, large-memory SOAPdenovo runs are slow with vSMP
•  vSMP allows analyses that are otherwise possible on only a few computers
Computational workflow for
de novo assembly followed by phylogenetic analyses

DNA reads in FASTQ format
  → De novo assembly (SOAPdenovo, Velvet, …)
  → Contigs & scaffolds in FASTA format
  → Multiple sequence alignment (ClustalW, MAFFT, …)
  → Aligned sequences in various formats
  → Phylogenetic tree inference (MrBayes, RAxML, …)
  → Tree in various formats

Multiple sequence alignment is matrix of taxa vs characters

Human       AAGCTTCACCGGCGCAGTCATTCTCATAAT...
Chimpanzee  AAGCTTCACCGGCGCAATTATCCTCATAAT...
Gorilla     AAGCTTCACCGGCGCAGTTGTTCTTATAAT...
Orangutan   AAGCTTCACCGGCGCAACCACCCTCATGAT...
Gibbon      AAGCTTTACAGGTGCAACCGTCCTCATAAT...

Final output is phylogeny or tree with taxa at its tips

/-------- Human
|
|---------- Chimpanzee
+
|   /---------- Gorilla
\---+
    |             /-------------------------------- Orangutan
    \-------------+
                  \----------------------------------------------- Gibbon
Scalability of RAxML & MrBayes was improved
during past three years by Stamatakis, Goll, & Pfeiffer
•  Hybrid MPI/Pthreads version of RAxML was developed
•  MPI code was added to previous Pthreads-only code
•  Parallelization is multi-grained as well as hybrid
•  Change in algorithm often leads to better solution
•  Hybrid MPI/OpenMP version of MrBayes was developed
•  OpenMP code was added to previous MPI-only code
•  Parallelization is multi-grained as well as hybrid
•  Memory-efficient code called RAxML-Light was developed
•  This allows very large trees to be analyzed together with RAxML
•  Single-node runs are more efficient than before
•  Multi-node runs with more cores are possible
•  Scalability before was limited to about 8 cores for typical analyses
•  Hybrid codes now scale well to 10s of cores for typical analyses
•  Scripted version of RAxML-Light scales even further
RAxML parallel efficiency is >0.5 up to 60 cores for >1,000 patterns*;
speedup is superlinear for comprehensive analysis at some core counts;
scalability improves with number of patterns

* Number of patterns = number of unique columns in multiple sequence alignment
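Counting patterns is a direct computation: transpose the alignment and count distinct columns. A small sketch using the first ten sites of the primate alignment shown earlier:

```python
def count_patterns(sequences):
    """Number of site patterns = number of distinct columns in a
    multiple sequence alignment (taxa as rows, sites as columns)."""
    columns = zip(*sequences)   # transpose: iterate over columns
    return len(set(columns))

# First 10 sites of the Human/Chimpanzee/Gorilla/Orangutan/Gibbon alignment
alignment = [
    "AAGCTTCACC",
    "AAGCTTCACC",
    "AAGCTTCACC",
    "AAGCTTCACC",
    "AAGCTTTACA",
]
print(count_patterns(alignment))  # → 6 distinct column patterns
```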
RAxML run time for a DNA analysis
went from >3 days on 1 core to ~1.3 hours on 60 cores;
large amino acid analysis was solved in 4.4 days on 160 cores

       Charac-  Pat-    Boot-     Data  Time (h)  Time (h)  Speed-
Taxa   ters     terns   straps    type  & cores   & cores   up
150    1,269    1,130   400       RNA   2.1, 1    0.06, 60  33
218    2,294    1,846   450, 500  DNA   8.7, 1    0.20, 60  43
404    13,158   7,429   450, 400  DNA   74.8, 1   1.27, 60  59
1,596  10,301   8,807   160       AA              106, 160

•  Tabulated results are for
•  Comprehensive analysis with number of bootstrap searches determined automatically followed by 10 or 20 thorough searches
•  32-core nodes of Trestles with 2.4-GHz AMD Magny-Cours processors
•  10 MPI processes & 6 threads/process using 60 cores (which gives better performance than using 64 cores)
•  20 MPI processes & 8 threads/process using 160 cores
MrBayes runs 1.6x to 3.3x faster on Gordon than Trestles
depending upon the size of the data set;
speedup is greater for larger data sets that are not partitioned
The CIPRES gateway lets biologists run parallel versions
of tree inference codes via a browser interface
on the Trestles & Gordon supercomputers at SDSC
Questions & answers about analyzing DNA sequence data
•  How big is the flood of data from high-throughput DNA sequencers?
•  >100 GB per day from a single Illumina sequencer now
•  1 TB/day from a sequencer likely by 2014
•  What are three compute- and data-intensive analyses of DNA sequence data?
•  Mapping of short reads against a reference genome
•  De novo assembly of short reads
•  Phylogenetic tree inference
So how compute- and data-intensive
are the three bioinformatics analyses we considered?
Here is a qualitative summary

                            Compute-   Memory-     I/O-
Analysis                    intensive  intensive*  intensive
Read mapping                    x                      x
De novo assembly                x          x           x
Tree inference (usually)        x
Tree inference (sometimes)      x          x

* I.e., large memory per node is needed for shared-memory implementations