PDF file - Edwards @ SDSU
Assembler for SOLiD data: Improving memory management of the Velvet assembler
Sajia Akhter, Robert A Edwards
Computational Science Research Center, San Diego State University
2009

The SOLiD™ System is a highly accurate next-generation sequencing technology that supports a wide range of applications. The platform delivers a high volume of very short sequences, which differ significantly from traditional sequencing data in both read length and volume. This makes it challenging to use these data in many applications, including de novo assembly. Several assemblers, such as VELVET, SHARCGS, and EULER-SR, can assemble Solexa data. SOLiD and Solexa reads have the same length, but the volume of data from SOLiD technology is much higher than from Solexa, so these assemblers might not perform well on SOLiD data. The algorithms in the Velvet assembler are very fast, but memory is a major issue for this assembler. Here we report improvements to the Velvet assembler so that it can assemble SOLiD data with moderate memory requirements.

[Figure: Steps of Velvet assembler]

Velvet: Sequencing
• Convert each sequence to a base-4 integer, that is, ACGT -> (0123)_4 = 27
• Advantages
  • Less memory: a read of length 40 requires 10 bytes instead of 40 bytes
  • Faster computation: makes hashing efficient

Memory requirement: Sequencing

Velvet assembler
struct TightString{
    char *seq
    i32 length
    i32 arraylength
}
Total size = (8+10) + 8 = 26 bytes

Modified Velvet assembler
struct TightString{
    char seq[10]
    i8 length
    i8 arraylength
}
Total size = 12 bytes

• Saves 54% for each read
• For 337 million reads:
  • Velvet: 337 million * 26 bytes ~ 8.8 GB (reported as ~15 GB)
  • Modified Velvet: 337 million * 12 bytes ~ 4 GB

Modified Hashing Algorithm 2:
• Does not use a splay tree for hashing
• Building a splay tree takes O(n log n) time, and a standard comparison sort also takes O(n log n) time, so we use quicksort to sort all k-mers

Algorithm
1. For all sequences, store the sequence ID and all possible k-mer start positions
2. Sort the list from step 1 by k-mer value
   1. For each comparison during sorting, the k-mer value is calculated dynamically
3. Use binary search to identify repeated sequences
   1. Store the relative position for non-repeated sequences

Memory requirement:
• #sequences * #kmers in each sequence * sizeof(node)
• For 337 million reads, length = 35 and kmer = 31
• (337 million * 5) * 8 bytes ~ 13.5 GB
• If a sequence contains more than 8 k-mers, more memory is required

struct node{
    i32 seqID
    i8 position
    i8 relative_position[3]
}
Total size = 8 bytes

Time Complexity
• The asymptotic time complexity is the same as in the original Velvet. However, because the k-mer value is calculated dynamically, this algorithm is 15 times slower than the original Velvet.

Velvet: Hashing
• Each read is split into strings of length k (k-mers)
• For example, for read length 35 and kmer = 31, a total of 5 strings of length k are possible for that read
• Finding repeated k-mers
  • Keeps track of the previous sequence that the current sequence matches
• Uses a splay tree
  • Self-adjusting binary search tree
  • Makes insertion and searching faster

Memory Requirement: Hashing

Velvet assembler
struct splayNode{
    i64 kmer
    i32 seqID
    i32 position
    i64 *lc
    i64 *rc
}
Total size = 32 bytes

Reference: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. D.R. Zerbino and E. Birney. Genome Research 18:821-829.
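The base-4 conversion described under Velvet: Sequencing can be illustrated with a minimal C sketch. It is not taken from the Velvet source; the function names `base_code`, `kmer_value`, and `pack_read`, the A=0, C=1, G=2, T=3 mapping (implied by ACGT -> (0123)_4), and the choice to store the first base in the low bits of each byte are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a nucleotide to its 2-bit code (A=0, C=1, G=2, T=3); -1 if invalid. */
static int base_code(char base) {
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Interpret a k-mer (1 <= k <= 32) as a base-4 integer, first base most
 * significant. Returns 0 on success, -1 on bad input. */
int kmer_value(const char *seq, size_t k, uint64_t *out) {
    if (k == 0 || k > 32) return -1;
    uint64_t value = 0;
    for (size_t i = 0; i < k; i++) {
        int code = base_code(seq[i]);
        if (code < 0) return -1;
        value = (value << 2) | (uint64_t)code;
    }
    *out = value;
    return 0;
}

/* Pack a read at 2 bits per base, four bases per byte (assumed layout:
 * first base in the low bits). Returns the number of bytes used, or 0 if
 * the buffer is too small or the read contains an invalid base. */
size_t pack_read(const char *seq, size_t length, uint8_t *packed, size_t capacity) {
    size_t needed = (length + 3) / 4;
    if (needed > capacity) return 0;
    memset(packed, 0, needed);
    for (size_t i = 0; i < length; i++) {
        int code = base_code(seq[i]);
        if (code < 0) return 0;
        packed[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
    return needed;
}
```

With this layout, a read of length 40 fits in the 10-byte `seq[10]` field of the modified TightString, and `kmer_value("ACGT", 4, &v)` sets `v` to 27.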
Modified Hashing Algorithm 1

struct splayNode{
    i32 seqID
    i8 position
    i32 lc
    i32 rc
}
Total size = 13 bytes

• Memory requirement:
  • #sequences * #kmers in each sequence * sizeof(splay node)
  • For 337 million reads, length = 35 and kmer = 31
  • Worst case for the original Velvet, if there are no repeats: (337 million * 5) * 32 bytes ~ 54 GB
• For Modified Hashing Algorithm 1:
  • Saves 19 bytes per node (compared with the 32-byte Velvet splay node)
  • Worst-case memory ~ 22 GB [(337 million * 5) * 13 bytes]
  • Time complexity: increased, because the k-mer value is computed dynamically
  • Does not work if the number of nodes exceeds 4 billion, since lc and rc are 32-bit values

Modified Hashing Algorithm 3:
• To improve the running time, we store the first k-mer value of each read and calculate the subsequent k-mer values from it. For example, for length = n and kmer = k, we store the k-mer value for positions 1 to k and then dynamically calculate the k-mer values for positions 2 to k+1, 3 to k+2, and so on.

Memory requirement:
• #sequences * #kmers in each sequence * sizeof(node) + #sequences * 8 [to store the k-mer value]
• For 337 million reads, length = 35 and kmer = 31
• (337 million * 5) * 8 bytes ~ 13.5 GB, plus 337 million * 8 bytes ~ 2.7 GB

Time Complexity
• The asymptotic time complexity is the same as in the original Velvet. Because the first k-mer value is stored, this algorithm is faster than Modified Hashing Algorithm 2, but it is still about 6 times slower than the original Velvet.
• Running time can be improved by implementing an iterative quicksort.

Real data statistics (up to the hashing step)

Data size: 526,996; kmer = 31, read length = 36
• Velvet: Time 6 sec, Memory 105 MB
• Modification in Sequencing: Time 6 sec, Memory 98 MB
• Modification Algorithm 3: Time 30 sec, Memory 36 MB

Data size: 49,423,731; kmer = 31, read length = 36
• Velvet: Time 12 min 29 sec, Memory 8.7 GB
• Modification in Sequencing: Time 12 min 29 sec, Memory 7.8 GB
• Modification Algorithm 3: Time 85 min 10 sec, Memory 3.9 GB

Data size: 337,075,448; kmer = 31, read length = 35
• Velvet: could not run within the 24 GB memory limit
• Modification in Sequencing: could not run within the 24 GB memory limit
• Modification Algorithm 3: Time 694 min 15 sec, Memory 21.5 GB
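The idea behind Modified Hashing Algorithm 3, deriving each k-mer value from the previous one instead of recomputing it from scratch, can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `rolling_kmers`, the A=0, C=1, G=2, T=3 encoding, and the treatment of any other character as A are all hypothetical choices.

```c
#include <stddef.h>
#include <stdint.h>

/* 2-bit code for a base; characters other than C, G, T are treated as A
 * (a simplifying assumption for this sketch). */
static uint64_t base_code(char base) {
    switch (base) {
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return 0;
    }
}

/* Write the base-4 value of every k-mer of a read of length n into
 * values[0 .. n-k]. Only the first k-mer is computed in full; each later
 * value is obtained by shifting in one new base and masking off the base
 * that left the window. Returns the number of k-mers written, or 0 if
 * k is outside 1..32 or the read is shorter than k. */
size_t rolling_kmers(const char *seq, size_t n, size_t k, uint64_t *values) {
    if (k == 0 || k > 32 || n < k) return 0;

    /* Mask keeping the low 2*k bits; k == 32 would overflow the shift. */
    uint64_t mask = (k == 32) ? UINT64_MAX : ((UINT64_C(1) << (2 * k)) - 1);

    uint64_t value = 0;
    for (size_t i = 0; i < k; i++) {
        value = (value << 2) | base_code(seq[i]);
    }
    values[0] = value;

    for (size_t i = k; i < n; i++) {
        value = ((value << 2) | base_code(seq[i])) & mask;
        values[i - k + 1] = value;
    }
    return n - k + 1;
}
```

For a read of length 35 and kmer = 31 this yields 5 values, matching the count used in the memory estimates above; each update costs a shift, an OR, and a mask rather than a pass over all k bases.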