Slides - San Diego Supercomputer Center
Transcription
"Compute- and Data-Intensive Analyses in Bioinformatics"
Wayne Pfeiffer, SDSC/UCSD
August 8, 2012

Questions for today
• How big is the flood of data from high-throughput DNA sequencers?
• What bioinformatics codes are installed at SDSC?
• What are typical compute- and data-intensive analyses in bioinformatics?
• What are their computational requirements?

Size matters: how much data are we talking about?
• 3.1 GB for the human genome
  • Fits on a flash drive; assumes FASTA format (1 B per base)
• >100 GB/day from a single Illumina HiSeq 2000
  • 50 Gbases/day of reads in FASTQ format (2.5 B per base)
• 300 GB to >1 TB of reads needed as input for analysis of a whole human genome, depending upon coverage (see the sketch after this slide)
  • 300 GB for 40x coverage
  • 1 TB for 130x coverage
• Multiple TB needed for subsequent analysis
  • 45 TB on disk at SDSC for the W115 project (~10,000x a single genome)
• Multiple genomes per person!
• May only be looking for kB or MB in the end
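The back-of-envelope arithmetic behind these sizes is simple. Here is a minimal Python sketch (not part of the slides; the function name read_data_gb is just illustrative) using the figures quoted above: roughly 3.1 Gbases per human genome, 1 B per base in FASTA, and about 2.5 B per base in FASTQ.

```python
GENOME_BASES = 3.1e9          # haploid human genome, ~3.1 Gbases (so ~3.1 GB at 1 B/base in FASTA)
FASTQ_BYTES_PER_BASE = 2.5    # per the slide: ~2.5 B per base in FASTQ (sequence + qualities + headers)

def read_data_gb(coverage, genome_bases=GENOME_BASES,
                 bytes_per_base=FASTQ_BYTES_PER_BASE):
    """Approximate size in GB of the reads needed for a given coverage."""
    return genome_bases * coverage * bytes_per_base / 1e9

for cov in (40, 130):
    print(f"{cov}x coverage: ~{read_data_gb(cov):,.0f} GB of FASTQ reads")
# Prints ~310 GB at 40x and ~1,000 GB (about 1 TB) at 130x,
# matching the 300 GB and 1 TB figures on the slide.
```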
Market-leading DNA sequencers come from Illumina & Life Technologies (both SD County companies)
• Illumina HiSeq 2000
  • Big; $690,000 list price
  • High throughput
  • Low error rate
  • 100-bp paired-end reads
• Life Technologies Ion PGM
  • Small; $50,000 list price
  • Low throughput
  • Modest error rate
  • ≤250-bp reads

Cost of DNA sequencing is dropping much faster (1/10 in 2 y) than cost of computing (1/2 in 2 y); this is producing the flood of data

What does this mean?
• Growth of read data is roughly inversely proportional to the drop in sequencing cost
  • >100 GB/day of reads from a single Illumina HiSeq 2000 now
  • 1 TB/day of reads from a sequencer likely by 2014
• Analysis & quality control will dominate the cost
  • <$10,000 for sequencing a human genome now
  • $1,000 for sequencing a human genome in 2013 or 2014
  • ≥$10,000 for analysis & quality control of a human genome sequence now, and decreasing relatively slowly
• Analysis improvements are needed to take advantage of new sequencing technology

Many widely used bioinformatics codes are installed on Triton, Trestles, & Gordon
• Pairwise sequence alignment: ATAC, BFAST, BLAST, BLAT, Bowtie, BWA
• Multiple sequence alignment (via CIPRES gateway): ClustalW, MAFFT
• RNA-Seq analysis: TopHat, Cufflinks
• De novo assembly: ABySS, SOAPdenovo, Velvet
• Phylogenetic tree inference (via CIPRES gateway): BEAST, GARLI, MrBayes, RAxML
• Tool kits: BEDTools, GATK, SAMtools

Computational requirements for some codes & data sets can be substantial

Code & data set                                         Input (GB)  Output (GB)  Memory (GB)  Time (h)  Cores / computer
BFAST 0.6.4c, 52M 100-bp reads                          26          19           17           8         8 / Dash
SOAPdenovo 1.05, 1.7B 100-bp reads                      424         77           387          26        16 / Triton P+C
Velvet 1.1.06, 562M ≤50-bp reads                        35          617          539          9         16 / Triton PDAF
MrBayes 3.2.1, DNA data, 40 taxa, 16k patterns          <1          27           12           155       8 / Gordon
RAxML 7.2.7, amino acid data, 1.6k taxa, 8.8k patterns  <1          <1           47           106       160 / Trestles

Benchmark tests were run on various computers, some with large shared memory
• Gordon from Appro at SDSC
  • 16-core nodes with 2.6-GHz Intel Sandy Bridge processors
  • 64 GB of memory per node + vSMP
• Trestles from Appro at SDSC
  • 32-core nodes with 2.4-GHz AMD Magny-Cours processors
  • 64 GB of memory per node
• Triton CC & Dash from Appro at SDSC
  • 8-core nodes with 2.4-GHz Intel Nehalem processors
  • 24 & 48 GB of memory per node + vSMP on Dash
• Triton PDAF from Sun at SDSC
  • 32-core nodes with 2.5-GHz AMD Shanghai processors
  • 256 & 512 GB of memory per node
• Blacklight from SGI at PSC
  • 2,048-core NUMA nodes with 2.27-GHz Intel Nehalem processors
  • 16 TB of memory per NUMA node

Typical projects involve multiple codes, some with multiple steps, combined in workflows
• HuTS: Human Tumor Study
  • Search for genome variants between blood and tumor tissue
  • Start from Illumina 100-bp paired-end reads
  • Use BWA & GATK on Triton to find SNPs & short indels
  • Use SOAPdenovo, ATAC, & custom scripts on Triton to find long indels
• W115: Study of a 115-year-old woman's genomes (Hendrikje van Andel-Schipper)
  • Search for genome variants between blood and brain tissue
  • Start from SOLiD 50-bp reads
  • Use BioScope, SAMtools, & GATK elsewhere to find SNVs & short indels
  • Use SAMtools, ABySS, Velvet, ATAC, BFAST, & custom scripts on Triton to find long indels

Computational workflows for common bioinformatics analyses
• Read mapping & variant calling: DNA reads in FASTQ format → read mapping, i.e., pairwise alignment (BFAST, BWA, ...) against a reference genome in FASTA format → alignment info in BAM format → variant calling (GATK, ...) → variants: SNPs, indels, others
• De novo assembly & variant calling: DNA reads in FASTQ format → de novo assembly (SOAPdenovo, Velvet, ...) → contigs & scaffolds in FASTA format → pairwise alignment (ATAC, BLAST, ...) against the reference genome → alignment info in various formats → variants
• De novo assembly & phylogenetics: contigs & scaffolds in FASTA format → multiple sequence alignment (ClustalW, MAFFT, ...) → aligned sequences in various formats → phylogenetic tree inference (MrBayes, RAxML, ...) → tree in various formats

Computational workflow for read mapping & variant calling
• DNA reads in FASTQ format → read mapping, i.e., pairwise alignment (BFAST, BWA, ...) against a reference genome in FASTA format → alignment info in BAM format → variant calling (GATK, ...) → variants: SNPs, indels, others
• Goal: identify simple variants, e.g.,
  • single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs):

      CACCGGCGCAGTCATTCTCATAAT
      ||||||||||| ||||||||||||
      CACCGGCGCAGACATTCTCATAAT

  • short insertions & deletions (indels):

      CACCGGCGCAGTCATTCTCATAAT
      ||||||||||   |||||||||||
      CACCGGCGCA---ATTCTCATAAT
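To make the variant-calling step concrete, here is a minimal Python sketch (not part of the talk, and far simpler than a real caller such as GATK; the function name call_variants is illustrative) that scans a gapped pairwise alignment like the ones above and reports SNPs and short indels.

```python
def call_variants(ref_aln, sample_aln):
    """Report SNPs and indels from two gap-aligned sequences of equal length.

    This toy scan ignores read depth, base quality, and realignment, which
    real variant callers take into account; adjacent indel columns are
    reported per base rather than merged into one event.
    """
    assert len(ref_aln) == len(sample_aln)
    variants = []
    ref_pos = 0                       # 1-based position on the ungapped reference
    for r, s in zip(ref_aln, sample_aln):
        if r != '-':
            ref_pos += 1
        if r == s:
            continue
        if s == '-':
            variants.append((ref_pos, f"deletion of {r}"))
        elif r == '-':
            variants.append((ref_pos, f"insertion of {s}"))
        else:
            variants.append((ref_pos, f"SNP {r}->{s}"))
    return variants

# The SNP and 3-base deletion examples from the slide
print(call_variants("CACCGGCGCAGTCATTCTCATAAT",
                    "CACCGGCGCAGACATTCTCATAAT"))
print(call_variants("CACCGGCGCAGTCATTCTCATAAT",
                    "CACCGGCGCA---ATTCTCATAAT"))
```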
Pileup diagram shows mapping of reads to the reference; example from HuTS shows a SNP in the KRAS gene; this means that cetuximab is not effective for chemotherapy
• BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

BFAST took about 8 hours & 17 GB of memory to map a small set of reads; speedup was 3.7 on 8 cores
• Parallelization is typically done by
  • Separate runs for each lane of reads
  • Threads within a run

  Step         1-thread time (h)  8-thread time (h)  Speedup  8-thread memory (GB)
  Match        8.4                3.1                2.7      16.9
  Align        19.3               4.2                4.6      2.0
  Postprocess  0.4                0.4                1.0      2.2
  Total        28.2               7.7                3.7

• Tabulated results are for
  • One lane of Illumina 100-bp paired-end reads: 52 million reads
  • One index with k=22 on the reference human genome (done previously)
  • One 8-core node of Dash with 2.4-GHz Intel Nehalems & 48 GB of memory
  • 26 GB input, half for reads & half for index; 19 GB output

Computational workflow for de novo assembly & variant calling
• DNA reads in FASTQ format → de novo assembly (SOAPdenovo, Velvet, ...) → contigs & scaffolds in FASTA format → pairwise alignment (ATAC, BLAST, ...) against the reference genome in FASTA format → alignment info in various formats → variant calling (GATK, ...) → variants: SNPs, indels, others
• Goal: identify more complex variants, e.g.,
  • large indels
  • duplications
  • inversions
  • translocations

Key conceptual steps in de novo assembly
1. Find reads that overlap by a specified number of bases (the k-mer size), typically by building a graph in memory
2. Merge overlapping, "good" reads into longer contigs, typically by simplifying the graph
3. Link contigs to form scaffolds using paired-end information
• Diagrams from Serafim Batzoglou, Stanford

A de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding an Eulerian path through the graph
• Diagram from Michael Schatz, Cold Spring Harbor
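The de Bruijn idea above fits in a few lines. Here is a minimal Python sketch (not from the talk, and nothing like a production assembler such as SOAPdenovo or Velvet; the function name de_bruijn_graph is illustrative) that links consecutive, overlapping k-mers drawn from a handful of toy reads.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph with k-mers as nodes, as on the slide:
    consecutive k-mers within a read (overlapping by k-1 bases) are linked by an edge."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return edges

# Toy example: three overlapping "reads" from the sequence AGACCTT
reads = ["AGACC", "GACCT", "ACCTT"]
for node, successors in sorted(de_bruijn_graph(reads, k=4).items()):
    print(node, "->", sorted(successors))
# An assembler then looks for an Eulerian-style path through the graph;
# here AGAC -> GACC -> ACCT -> CCTT spells out AGACCTT.
```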
SOAPdenovo & Velvet are two leading assemblers that use the de Bruijn graph algorithm
• SOAPdenovo is from BGI
  • Code has four steps: pregraph, contig, map, & scaffold
  • pregraph & map are parallelized with Pthreads, but not reproducibly
  • pregraph uses the most time & memory
  • http://soap.genomics.org.cn/soapdenovo.html
• Velvet is from EMBL-EBI
  • Code has two steps: hash & graph
  • Both are parallelized with OpenMP, but not reproducibly
  • Either step can use more time or memory depending upon problem & computer
  • http://www.ebi.ac.uk/~zerbino/velvet
• k-mer size is an adjustable parameter
  • Typically it is adjusted to maximize the N50 length of scaffolds or contigs
  • N50 length is the central measure of the distribution weighted by lengths (see the sketch below)
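Here is a minimal Python sketch of the N50 calculation mentioned above (not from the talk; the helper name n50 is illustrative): sort the contig lengths from longest to shortest and return the length at which the running total first reaches half of the assembly length.

```python
def n50(lengths):
    """N50: the contig/scaffold length at which the cumulative length of contigs,
    taken from longest to shortest, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy example: total length is 29, so half is 14.5
print(n50([10, 8, 5, 3, 2, 1]))   # 10 + 8 = 18 >= 14.5, so N50 = 8
```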
SOAPdenovo & Velvet each have their strengths
• Quality of assembly: both give similar assemblies
• Speed: SOAPdenovo is faster
• Memory: SOAPdenovo uses much less memory
• vSMP: Velvet often runs well with vSMP, whereas SOAPdenovo does not
• Reads: both work with Illumina reads, but only Velvet works with SOLiD reads

Graph step of Velvet works well on Gordon with vSMP; Gordon, Blacklight, & Triton PDAF have similar speeds when memory for the hash step is small

Hash step of Velvet runs much slower on Gordon with vSMP & somewhat slower on Blacklight when memory for the hash step is large; the graph step still works well on Gordon with vSMP

What is going on?
• Memory access for the graph step of Velvet is fairly regular
  • This is efficient with vSMP
  • Performance improved significantly last year through tuning of vSMP by ScaleMP
• Memory access for the hash step of Velvet is nearly random
  • This is inefficient with vSMP
• Memory access for the pregraph step of SOAPdenovo (not shown) is also nearly random
  • Since the pregraph step uses the most memory, large-memory SOAPdenovo runs are slow with vSMP
• vSMP allows analyses otherwise possible on only a few computers

Computational workflow for de novo assembly followed by phylogenetic analyses
• DNA reads in FASTQ format → de novo assembly (SOAPdenovo, Velvet, ...) → contigs & scaffolds in FASTA format → multiple sequence alignment (ClustalW, MAFFT, ...) → aligned sequences in various formats → phylogenetic tree inference (MrBayes, RAxML, ...)
• Multiple sequence alignment is a matrix of taxa vs. characters:

  Human       AAGCTTCACCGGCGCAGTCATTCTCATAAT...
  Chimpanzee  AAGCTTCACCGGCGCAATTATCCTCATAAT...
  Gorilla     AAGCTTCACCGGCGCAGTTGTTCTTATAAT...
  Orangutan   AAGCTTCACCGGCGCAACCACCCTCATGAT...
  Gibbon      AAGCTTTACAGGTGCAACCGTCCTCATAAT...

• Final output is a phylogeny or tree with taxa at its tips, e.g., ((((Human, Chimpanzee), Gorilla), Orangutan), Gibbon)

Scalability of RAxML & MrBayes was improved during the past three years by Stamatakis, Goll, & Pfeiffer
• Hybrid MPI/Pthreads version of RAxML was developed
  • MPI code was added to the previous Pthreads-only code
  • Parallelization is multi-grained as well as hybrid
  • Change in algorithm often leads to a better solution
• Hybrid MPI/OpenMP version of MrBayes was developed
  • OpenMP code was added to the previous MPI-only code
  • Parallelization is multi-grained as well as hybrid
• Memory-efficient code called RAxML-Light was developed
  • This allows very large trees to be analyzed together with RAxML
• Single-node runs are more efficient than before
• Multi-node runs with more cores are possible
  • Scalability before was limited to about 8 cores for typical analyses
  • Hybrid codes now scale well to 10s of cores for typical analyses
  • Scripted version of RAxML-Light scales even further

RAxML parallel efficiency is >0.5 up to 60 cores for >1,000 patterns*; speedup is superlinear for comprehensive analysis at some core counts; scalability improves with the number of patterns
* Number of patterns = number of unique columns in the multiple sequence alignment

RAxML run time for a DNA analysis went from >3 days on 1 core to ~1.3 hours on 60 cores; a large amino acid analysis was solved in 4.4 days on 160 cores

  Taxa   Characters  Patterns  Bootstraps  Data type  Time (h), cores  Time (h), cores  Speedup
  150    1,269       1,130     400         RNA        2.1, 1           0.06, 60         33
  218    2,294       1,846     450, 500    DNA        8.7, 1           0.20, 60         43
  404    13,158      7,429     450, 400    DNA        74.8, 1          1.27, 60         59
  1,596  10,301      8,807     160         AA         -                106, 160         -

• Tabulated results are for
  • Comprehensive analysis with the number of bootstrap searches determined automatically, followed by 10 or 20 thorough searches
  • 32-core nodes of Trestles with 2.4-GHz AMD Magny-Cours processors
  • 10 MPI processes & 6 threads/process using 60 cores (which gives better performance than using 64 cores)
  • 20 MPI processes & 8 threads/process using 160 cores

MrBayes runs 1.6x to 3.3x faster on Gordon than on Trestles, depending upon the size of the data set; speedup is greater for larger data sets that are not partitioned

The CIPRES gateway lets biologists run parallel versions of tree inference codes via a browser interface on the Trestles & Gordon supercomputers at SDSC

Questions & answers about analyzing DNA sequence data
• How big is the flood of data from high-throughput DNA sequencers?
  • >100 GB per day from a single Illumina sequencer now
  • 1 TB/day from a sequencer likely by 2014
• What are three compute- and data-intensive analyses of DNA sequence data?
  • Mapping of short reads against a reference genome
  • De novo assembly of short reads
  • Phylogenetic tree inference

So how compute- and data-intensive are the three bioinformatics analyses we considered? Here is a qualitative summary:

  Analysis                    Compute-intensive  Memory-intensive*  I/O-intensive
  Read mapping                x                                     x
  De novo assembly            x                  x                  x
  Tree inference (usually)    x
  Tree inference (sometimes)  x                  x

  * I.e., large memory per node is needed for shared-memory implementations
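As a closing aside (not part of the slides), the speedup and parallel-efficiency arithmetic behind numbers quoted above, such as BFAST's 3.7x on 8 cores and RAxML's 59x on 60 cores, is just a ratio. A minimal Python sketch with illustrative helper names:

```python
def speedup(serial_time, parallel_time):
    """Speedup = serial run time / parallel run time."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, cores):
    """Parallel efficiency = speedup / number of cores."""
    return speedup(serial_time, parallel_time) / cores

# BFAST on one 8-core Dash node: 28.2 h with 1 thread, 7.7 h with 8 threads
print(f"BFAST: {speedup(28.2, 7.7):.1f}x, efficiency {efficiency(28.2, 7.7, 8):.2f}")
# RAxML DNA analysis: 74.8 h on 1 core, 1.27 h on 60 cores
print(f"RAxML: {speedup(74.8, 1.27):.0f}x, efficiency {efficiency(74.8, 1.27, 60):.2f}")
```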