Slides - San Diego Supercomputer Center

Transcription

Compute- and Data-Intensive Analyses in Bioinformatics
Wayne Pfeiffer
SDSC/UCSD
August 8, 2012

SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Questions for today
•  How big is the flood of data from high-throughput DNA sequencers?
•  What bioinformatics codes are installed at SDSC?
•  What are typical compute- and data-intensive analyses in bioinformatics?
•  What are their computational requirements?
Size matters: how much data are we talking about?
•  3.1 GB for human genome
•  Fits on flash drive; assumes FASTA format (1 B per base)
•  >100 GB/day from a single Illumina HiSeq 2000
•  50 Gbases/day of reads in FASTQ format (2.5 B per base)
•  300 GB to >1 TB of reads needed as input for analysis of whole human genome, depending upon coverage
•  300 GB for 40x coverage
•  1 TB for 130x coverage
•  Multiple TB needed for subsequent analysis
•  45 TB on disk at SDSC for W115 project (~10,000x single genome)
•  Multiple genomes per person!
•  May only be looking for kB or MB in the end
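The sizes above follow from simple arithmetic: bytes ≈ coverage × genome size × bytes per base. A minimal sketch using the slide's rules of thumb (1 B/base for FASTA, 2.5 B/base for FASTQ):

```python
# Back-of-the-envelope storage estimates using the slide's rules of thumb:
# FASTA ~1 byte per base; FASTQ ~2.5 bytes per base (sequence + quality + headers).

GENOME_BASES = 3.1e9        # haploid human genome, ~3.1 Gbases
FASTQ_B_PER_BASE = 2.5

def fastq_bytes(coverage, genome_bases=GENOME_BASES):
    """Approximate size of raw reads at a given coverage depth."""
    return coverage * genome_bases * FASTQ_B_PER_BASE

for cov in (40, 130):
    print(f"{cov}x coverage: ~{fastq_bytes(cov) / 1e12:.2f} TB of FASTQ reads")
# → 40x: ~0.31 TB; 130x: ~1.01 TB, matching the 300 GB and 1 TB figures above
```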
Market-leading DNA sequencers come from
Illumina & Life Technologies (both SD County companies)
•  Illumina HiSeq 2000
•  Big; $690,000 list price
•  High throughput
•  Low error rate
•  100-bp paired-end reads
[diagram: a pair of reads sequenced from opposite ends of a DNA fragment]
•  Life Technologies Ion PGM
•  Small; $50,000 list price
•  Low throughput
•  Modest error rate
•  ≤250-bp reads
Cost of DNA sequencing is dropping much faster (1/10 in 2 y)
than cost of computing (1/2 in 2 y);
this is producing the flood of data
What does this mean?
•  Growth of read data is roughly inversely proportional to drop in sequencing cost
•  >100 GB/day of reads from a single Illumina HiSeq 2000 now
•  1 TB/day of reads from a sequencer likely by 2014
•  Analysis & quality control will dominate the cost
•  <$10,000 for sequencing human genome now
•  $1,000 for sequencing human genome in 2013 or 2014
•  ≥$10,000 for analysis & quality control of human genome sequence now and decreasing relatively slowly
•  Analysis improvements are needed to take advantage of new sequencing technology
Many widely-used bioinformatics codes
are installed on Triton, Trestles, & Gordon
•  Pairwise sequence alignment
•  ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
•  Multiple sequence alignment (via CIPRES gateway)
•  ClustalW, MAFFT
•  RNA-Seq analysis
•  TopHat, Cufflinks
•  De novo assembly
•  ABySS, SOAPdenovo, Velvet
•  Phylogenetic tree inference (via CIPRES gateway)
•  BEAST, GARLI, MrBayes, RAxML
•  Tool kits
•  BEDTools, GATK, SAMtools
Computational requirements for some codes & data sets
can be substantial

Code & data set                      Input  Output  Memory  Time  Cores /
                                      (GB)    (GB)    (GB)   (h)  computer
BFAST 0.6.4c,                           26      19      17     8  8 / Dash
  52M 100-bp reads
SOAPdenovo 1.05,                       424      77     387    26  16 / Triton P+C
  1.7B 100-bp reads
Velvet 1.1.06,                          35     617     539     9  16 / Triton PDAF
  562M ≤50-bp reads
MrBayes 3.2.1, DNA data,                <1      27      12   155  8 / Gordon
  40 taxa, 16k patterns
RAxML 7.2.7, amino acid data,           <1      <1      47   106  160 / Trestles
  1.6k taxa, 8.8k patterns
Benchmark tests were run on various computers,
some with large shared memory
•  Gordon from Appro at SDSC
•  16-core nodes with 2.6-GHz Intel Sandy Bridge processors
•  64 GB of memory per node + vSMP
•  Trestles from Appro at SDSC
•  32-core nodes with 2.4-GHz AMD Magny-Cours processors
•  64 GB of memory per node
•  Triton CC & Dash from Appro at SDSC
•  8-core nodes with 2.4-GHz Intel Nehalem processors
•  24 & 48 GB of memory per node + vSMP on Dash
•  Triton PDAF from Sun at SDSC
•  32-core nodes with 2.5-GHz AMD Shanghai processors
•  256 & 512 GB of memory per node
•  Blacklight from SGI at PSC
•  2,048-core NUMA nodes with 2.27-GHz Intel Nehalem processors
•  16 TB of memory per NUMA node
Typical projects involve multiple codes,
some with multiple steps, combined in workflows
•  HuTS: Human Tumor Study
•  Search for genome variants between blood and tumor tissue
•  Start from Illumina 100-bp paired-end reads
•  Use BWA & GATK on Triton to find SNPs & short indels
•  Use SOAPdenovo, ATAC, & custom scripts on Triton to find long indels
•  W115: Study of 115-year-old woman’s genomes (!)
•  Search for genome variants between blood and brain tissue
•  Start from SOLiD 50-bp reads
•  Use BioScope, SAMtools, & GATK elsewhere to find SNVs & short indels
•  Use SAMtools, ABySS, Velvet, ATAC, BFAST, & custom scripts on Triton to find long indels
[photo: Hendrikje van Andel-Schipper]
Computational workflows for
common bioinformatics analyses

[workflow diagram, flattened:]
DNA reads in FASTQ format
  → Read mapping, i.e., pairwise alignment (BFAST, BWA, …), using reference genome in FASTA format
    → Alignment info in BAM format
    → Variant calling (GATK, …) → Variants: SNPs, indels, others
  → De novo assembly (SOAPdenovo, Velvet, …)
    → Contigs & scaffolds in FASTA format
      → Pairwise alignment (ATAC, BLAST, …) vs reference genome in FASTA format
        → Alignment info in various formats → Variant calling (GATK, …) → Variants
      → Multiple sequence alignment (ClustalW, MAFFT, …)
        → Aligned sequences in various formats
        → Phylogenetic tree inference (MrBayes, RAxML, …) → Tree in various formats
Computational workflow for
read mapping & variant calling

DNA reads in FASTQ format + reference genome in FASTA format
  → Read mapping, i.e., pairwise alignment (BFAST, BWA, …)
  → Alignment info in BAM format
  → Variant calling (GATK, …)
  → Variants: SNPs, indels, others

Goal: identify simple variants, e.g.,
•  single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs)

   CACCGGCGCAGTCATTCTCATAAT
   ||||||||||| ||||||||||||
   CACCGGCGCAGACATTCTCATAAT

•  short insertions & deletions (indels)

   CACCGGCGCAGTCATTCTCATAAT
   ||||||||||   |||||||||||
   CACCGGCGCA---ATTCTCATAAT
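The SNP panel above amounts to a column-by-column comparison of an aligned read against the reference. A toy sketch of that idea (real variant callers such as GATK weigh base qualities and many reads per position; this handles only the gapless-alignment case):

```python
def find_snps(reference, aligned_read):
    """Report positions where an aligned read differs from the
    reference by a single-base substitution (a candidate SNP)."""
    assert len(reference) == len(aligned_read)
    return [(i, ref, alt)
            for i, (ref, alt) in enumerate(zip(reference, aligned_read))
            if ref != alt]

# The slide's SNP example: T→A substitution at position 11 (0-based)
ref  = "CACCGGCGCAGTCATTCTCATAAT"
read = "CACCGGCGCAGACATTCTCATAAT"
print(find_snps(ref, read))  # → [(11, 'T', 'A')]
```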
Pileup diagram shows mapping of reads to reference;
example from HuTS shows a SNP in KRAS gene; this
means that cetuximab is not effective for chemotherapy

BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI
BFAST took about 8 hours & 17 GB of memory
to map a small set of reads; speedup was 3.7 on 8 cores
•  Parallelization is typically done by
•  Separate runs for each lane of reads
•  Threads within a run

                 1-thread   8-thread            8-thread
Step             time (h)   time (h)   Speedup  memory (GB)
Match                 8.4        3.1       2.7         16.9
Align                19.3        4.2       4.6          2.0
Postprocess           0.4        0.4       1.0          2.2
Total                28.2        7.7       3.7

•  Tabulated results are for
•  One lane of Illumina 100-bp paired-end reads: 52 million reads
•  One index with k=22 on reference human genome (done previously)
•  One 8-core node of Dash with 2.4-GHz Intel Nehalems & 48 GB of memory
•  26 GB input, half for reads & half for index; 19 GB output
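The speedup quoted in the title is the ratio of the 1-thread and 8-thread totals in the table; dividing by the core count gives parallel efficiency:

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, cores):
    """Speedup per core; 1.0 would be perfect scaling."""
    return speedup(t_serial, t_parallel) / cores

# BFAST totals from the table: 28.2 h on 1 thread, 7.7 h on 8 threads
s = speedup(28.2, 7.7)
e = efficiency(28.2, 7.7, 8)
print(f"speedup {s:.1f}, efficiency {e:.2f}")  # → speedup 3.7, efficiency 0.46
```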
Computational workflow for
de novo assembly & variant calling

DNA reads in FASTQ format
  → De novo assembly (SOAPdenovo, Velvet, …)
  → Contigs & scaffolds in FASTA format
  → Pairwise alignment (ATAC, BLAST, …) vs reference genome in FASTA format
  → Alignment info in various formats
  → Variant calling (GATK, …)
  → Variants: SNPs, indels, others

Goal: identify more complex variants, e.g.,
•  large indels
•  duplications
•  inversions
•  translocations
Key conceptual steps in de novo assembly
1.  Find reads that overlap by a specified number of bases (the k-mer size), typically by building a graph in memory
2.  Merge overlapping, “good” reads into longer contigs, typically by simplifying the graph
3.  Link contigs to form scaffolds using paired-end information

Diagrams from Serafim Batzoglou, Stanford
de Bruijn graph has
k-mers as nodes connected by reads;
assembly involves finding Eulerian path through graph
[diagram: de Bruijn graph with k-mer nodes such as AGAC]
Diagram from Michael Schatz, Cold Spring Harbor
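The construction can be sketched in a few lines. The slide's diagram draws k-mers at the nodes; the equivalent common formulation below (a toy example, not assembler code) uses (k-1)-mers as nodes, with each k-mer in a read contributing one edge:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer
    in a read contributes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Two overlapping reads drawn from the sequence AGACT
g = de_bruijn_graph(["AGAC", "GACT"], k=3)
print(dict(g))  # → {'AG': ['GA'], 'GA': ['AC', 'AC'], 'AC': ['CT']}
```

An Eulerian path through this graph (AG → GA → AC → CT) spells out the assembled sequence; repeated edges such as GA → AC are where real assemblers must untangle repeats.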
SOAPdenovo & Velvet are two leading assemblers
that use de Bruijn graph algorithm
•  SOAPdenovo is from BGI
•  Code has four steps: pregraph, contig, map, & scaffold
•  pregraph & map are parallelized with Pthreads, but not reproducibly
•  pregraph uses the most time & memory
•  http://soap.genomics.org.cn/soapdenovo.html
•  Velvet is from EMBL-EBI
•  Code has two steps: hash & graph
•  Both are parallelized with OpenMP, but not reproducibly
•  Either step can use more time or memory depending upon problem & computer
•  http://www.ebi.ac.uk/~zerbino/velvet
•  k-mer size is adjustable parameter
•  Typically it is adjusted to maximize N50 length of scaffolds or contigs
•  N50 length is central measure of distribution weighted by lengths
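N50 can be stated operationally: sort the contigs longest first and take the length at which the running total first reaches half of all assembled bases. A minimal sketch:

```python
def n50(lengths):
    """N50: the contig length at which the running total, taken over
    contigs sorted longest-first, first reaches half of all bases."""
    half = sum(lengths) / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length

print(n50([100, 80, 50, 30, 20, 10]))  # → 80 (100 + 80 ≥ 145, half of 290)
```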
SOAPdenovo & Velvet each have their strengths
•  Quality of assembly
•  Both give similar assemblies
•  Speed
•  SOAPdenovo is faster
•  Memory
•  SOAPdenovo uses much less memory
•  vSMP
•  Velvet often runs well with vSMP, whereas SOAPdenovo does not
•  Reads
•  Both work with Illumina reads, but only Velvet works with SOLiD reads
Graph step of Velvet works well on Gordon with vSMP;
Gordon, Blacklight, & Triton PDAF have similar speeds
when memory for hash step is small
Hash step of Velvet runs much slower on Gordon with vSMP &
somewhat slower on Blacklight when memory for hash step is large;
graph step still works well on Gordon with vSMP
What is going on?
•  Memory access for graph step of Velvet is fairly regular
•  This is efficient with vSMP
•  Performance improved significantly last year through tuning of vSMP by ScaleMP
•  Memory access for hash step of Velvet is nearly random
•  This is inefficient with vSMP
•  Memory access for pregraph step of SOAPdenovo (not shown) is also nearly random
•  Since pregraph step uses most memory, large-memory SOAPdenovo runs are slow with vSMP
•  vSMP allows analyses that are otherwise possible on only a few computers
Computational workflow for
de novo assembly followed by phylogenetic analyses

DNA reads in FASTQ format
  → De novo assembly (SOAPdenovo, Velvet, …)
  → Contigs & scaffolds in FASTA format
  → Multiple sequence alignment (ClustalW, MAFFT, …)
  → Aligned sequences in various formats
  → Phylogenetic tree inference (MrBayes, RAxML, …)
  → Tree in various formats

Multiple sequence alignment is matrix of taxa vs characters

Human       AAGCTTCACCGGCGCAGTCATTCTCATAAT...
Chimpanzee  AAGCTTCACCGGCGCAATTATCCTCATAAT...
Gorilla     AAGCTTCACCGGCGCAGTTGTTCTTATAAT...
Orangutan   AAGCTTCACCGGCGCAACCACCCTCATGAT...
Gibbon      AAGCTTTACAGGTGCAACCGTCCTCATAAT...

Final output is phylogeny or tree with taxa at its tips

/-------- Human
|
|---------- Chimpanzee
+
|   /---------- Gorilla
\---+
    |             /-------------------------------- Orangutan
    \-------------+
                  \----------------------------------------------- Gibbon
Scalability of RAxML & MrBayes was improved
during past three years by Stamatakis, Goll, & Pfeiffer
•  Hybrid MPI/Pthreads version of RAxML was developed
•  MPI code was added to previous Pthreads-only code
•  Parallelization is multi-grained as well as hybrid
•  Change in algorithm often leads to better solution
•  Hybrid MPI/OpenMP version of MrBayes was developed
•  OpenMP code was added to previous MPI-only code
•  Parallelization is multi-grained as well as hybrid
•  Memory-efficient code called RAxML-Light was developed
•  This allows very large trees to be analyzed together with RAxML
•  Single-node runs are more efficient than before
•  Multi-node runs with more cores are possible
•  Scalability before was limited to about 8 cores for typical analyses
•  Hybrid codes now scale well to 10s of cores for typical analyses
•  Scripted version of RAxML-Light scales even further
RAxML parallel efficiency is >0.5 up to 60 cores for >1,000 patterns*;
speedup is superlinear for comprehensive analysis at some core counts;
scalability improves with number of patterns

* Number of patterns = number of unique columns in multiple sequence alignment
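Counting patterns is a direct computation: transpose the alignment and count distinct columns. A small sketch using the first ten sites of the primate alignment shown earlier:

```python
def count_patterns(sequences):
    """Number of site patterns = number of distinct columns in a
    multiple sequence alignment (taxa as rows, sites as columns)."""
    columns = zip(*sequences)   # transpose: iterate over columns
    return len(set(columns))

# First 10 sites of the Human/Chimpanzee/Gorilla/Orangutan/Gibbon alignment
alignment = [
    "AAGCTTCACC",
    "AAGCTTCACC",
    "AAGCTTCACC",
    "AAGCTTCACC",
    "AAGCTTTACA",
]
print(count_patterns(alignment))  # → 6 distinct column patterns
```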
RAxML run time for a DNA analysis
went from >3 days on 1 core to ~1.3 hours on 60 cores;
large amino acid analysis was solved in 4.4 days on 160 cores

       Charac-  Pat-    Boot-     Data  Time (h)  Time (h)  Speed-
Taxa   ters     terns   straps    type  & cores   & cores   up
150    1,269    1,130   400       RNA   2.1, 1    0.06, 60  33
218    2,294    1,846   450, 500  DNA   8.7, 1    0.20, 60  43
404    13,158   7,429   450, 400  DNA   74.8, 1   1.27, 60  59
1,596  10,301   8,807   160       AA              106, 160

•  Tabulated results are for
•  Comprehensive analysis with number of bootstrap searches determined automatically followed by 10 or 20 thorough searches
•  32-core nodes of Trestles with 2.4-GHz AMD Magny-Cours processors
•  10 MPI processes & 6 threads/process using 60 cores (which gives better performance than using 64 cores)
•  20 MPI processes & 8 threads/process using 160 cores
MrBayes runs 1.6x to 3.3x faster on Gordon than Trestles
depending upon the size of the data set;
speedup is greater for larger data sets that are not partitioned
The CIPRES gateway lets biologists run parallel versions
of tree inference codes via a browser interface
on the Trestles & Gordon supercomputers at SDSC
Questions & answers about analyzing DNA sequence data
•  How big is the flood of data from high-throughput DNA sequencers?
•  >100 GB per day from a single Illumina sequencer now
•  1 TB/day from a sequencer likely by 2014
•  What are three compute- and data-intensive analyses of DNA sequence data?
•  Mapping of short reads against a reference genome
•  De novo assembly of short reads
•  Phylogenetic tree inference
So how compute- and data-intensive
are the three bioinformatics analyses we considered?
Here is a qualitative summary

                            Compute-   Memory-     I/O-
Analysis                    intensive  intensive*  intensive
Read mapping                    x                      x
De novo assembly                x          x           x
Tree inference (usually)        x
Tree inference (sometimes)      x          x

* I.e., large memory per node is needed for shared-memory implementations