PDF file - Edwards @ SDSU

An Assembler for SOLiD Data: Improving the Memory Management of the
Velvet Assembler
Sajia Akhter, Robert A Edwards
Computational Science Research Center, San Diego State University, 2009

The SOLiD™ System is a highly accurate next-generation sequencing technology that supports a wide range of
applications. The platform delivers a high volume of very short sequences, which differ significantly
from traditional sequencing data in both read length and volume, so it is challenging to adapt these data to
many applications, including de novo assembly. Several assemblers, such as VELVET, SHARCGS, and
EULER-SR, can assemble Solexa data. SOLiD and Solexa data have the same read length, but the volume of
data from SOLiD technology is much higher than from Solexa, so these assemblers might not perform well
on SOLiD data. The algorithms of the Velvet assembler run very fast, but memory is a major issue for this
assembler. Here we report improvements to the Velvet assembler that allow it to assemble SOLiD data with a
moderate memory requirement.
Velvet: Sequencing
•  Convert the sequence to a base-4 integer, that is, ACGT -> (0123) in base 4 = 27
•  Advantages
   •  Less memory: a 40-base read requires 10 bytes instead of 40 bytes
   •  Faster computation: makes hashing efficient
Modified Hashing Algorithm 2:
•  Does not use a splay tree for hashing
•  As the time complexity of a splay tree is n log(n), and the time complexity of any standard sorting
algorithm is also n log(n), we use quick sort to sort all kmers
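The base-4 conversion and the 2-bits-per-base storage mentioned in the Velvet: Sequencing bullets can be sketched as follows. This is only an illustration, not Velvet's actual code: the function names (base_code, encode_base4, packed_size, pack_read) and the bit order inside each byte are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a nucleotide to its base-4 digit: A=0, C=1, G=2, T=3.
 * Returns -1 for any other character. */
int base_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Interpret a sequence of at most 32 bases as a base-4 integer
 * (2 bits per base, so 32 bases fill a 64-bit word). */
uint64_t encode_base4(const char *seq)
{
    size_t len = strlen(seq);
    uint64_t value = 0;
    assert(len <= 32);
    for (size_t i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        assert(code >= 0);
        value = value * 4 + (uint64_t)code;
    }
    return value;
}

/* Number of bytes needed to store n_bases at 2 bits per base. */
size_t packed_size(size_t n_bases)
{
    return (n_bases + 3) / 4;
}

/* Pack a read into 2 bits per base, four bases per byte.
 * out must hold packed_size(strlen(seq)) bytes.
 * Assumed bit order: base i goes to bits 2*(i%4) of byte i/4. */
void pack_read(const char *seq, uint8_t *out)
{
    size_t len = strlen(seq);
    memset(out, 0, packed_size(len));
    for (size_t i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        assert(code >= 0);
        out[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
}
```

With these helpers, encode_base4("ACGT") evaluates to 27 and packed_size(40) to 10, matching the figures in the bullets.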
Memory requirement: Sequencing
Velvet assembler
[Figure: Steps of the Velvet assembler]
struct TightString {
    char *seq           // 8-byte pointer to a 10-byte packed sequence
    i32 length          // 4 bytes
    i32 arraylength     // 4 bytes
}
Total size = (8 + 10) + 8 = 26 bytes
Modified Velvet assembler
struct TightString {
    char seq[10]        // 10 bytes, packed sequence stored inline
    i8 length           // 1 byte
    i8 arraylength      // 1 byte
}
Total size = 12 bytes
Saving per read: 54%
For 337 million reads:
•  Memory requirement for Velvet: (337 million * 26) ~ 15 GB
•  Memory requirement for modified Velvet: (337 million * 12) ~ 4 GB
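A minimal C sketch of the two layouts, assuming a 64-bit platform and fixed-width types from stdint.h. The struct and function names are hypothetical, and the totals ignore any allocator overhead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original layout as described above: a pointer to a separately
 * allocated packed buffer plus two 32-bit length fields. */
struct TightStringOriginal {
    char   *seq;
    int32_t length;
    int32_t arraylength;
};

/* Modified layout: the packed bases are stored inline in a fixed
 * 10-byte array (room for 40 bases) and each length fits in one byte. */
struct TightStringModified {
    char   seq[10];
    int8_t length;
    int8_t arraylength;
};

/* All members have 1-byte alignment, so no padding is added. */
static_assert(sizeof(struct TightStringModified) == 12,
              "modified TightString should occupy 12 bytes");

/* Bytes per read for the original layout: the struct itself plus the
 * 10-byte buffer it points to (allocator overhead not counted). */
size_t original_bytes_per_read(void)
{
    return sizeof(struct TightStringOriginal) + 10;
}

/* Total bytes for n_reads reads at bytes_per_read bytes each. */
uint64_t total_bytes(uint64_t n_reads, uint64_t bytes_per_read)
{
    return n_reads * bytes_per_read;
}
```

For 337 million reads, total_bytes gives about 4.0 GB for the 12-byte layout. The raw product for the 26-byte layout is about 8.8 GB, lower than the ~15 GB quoted above; the difference may reflect per-allocation overhead of the separately allocated buffer, which is an assumption and is not itemized here.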
Algorithm (Modified Hashing Algorithm 2)
1.  For all sequences, store the sequence id and all possible kmer start positions
2.  Sort the list from step 1 by kmer value
    a.  For each comparison during sorting, the kmer value is calculated dynamically
3.  Use binary search to identify repeated sequences
    a.  Store the relative position for non-repeated sequences
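The three steps above can be sketched in C as follows. This is a simplified illustration under stated assumptions, not the modified Velvet code: the names KmerRef, build_sorted_refs and find_first are hypothetical, the standard library qsort stands in for the authors' quick sort, and kmers are compared directly as text with strncmp, which gives the same order as their base-4 values because 'A' < 'C' < 'G' < 'T' in ASCII.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One entry per kmer start position; the kmer itself is never stored. */
struct KmerRef {
    int32_t seq_id;
    int8_t  position;
};

/* Comparator context: ISO C qsort has no user-data argument. */
static const char *const *g_reads;
static int g_k;

/* Step 2a: the kmer is looked up in the read on every comparison. */
static int kmer_ref_cmp(const void *a, const void *b)
{
    const struct KmerRef *x = a, *y = b;
    return strncmp(g_reads[x->seq_id] + x->position,
                   g_reads[y->seq_id] + y->position, (size_t)g_k);
}

/* Steps 1 and 2: list every kmer start position, then sort by kmer.
 * refs must have room for all kmers; returns the number of entries. */
size_t build_sorted_refs(const char *const *reads, int32_t n_reads, int k,
                         struct KmerRef *refs)
{
    size_t n = 0;
    for (int32_t s = 0; s < n_reads; s++) {
        int len = (int)strlen(reads[s]);
        assert(len <= INT8_MAX);
        for (int p = 0; p + k <= len; p++) {
            refs[n].seq_id = s;
            refs[n].position = (int8_t)p;
            n++;
        }
    }
    g_reads = reads;
    g_k = k;
    qsort(refs, n, sizeof refs[0], kmer_ref_cmp);
    return n;
}

/* Step 3: binary search for the first entry whose kmer equals `kmer`.
 * Uses the reads and k registered by build_sorted_refs.
 * Returns the index of that entry, or -1 if the kmer does not occur. */
long find_first(const struct KmerRef *refs, size_t n, const char *kmer)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const char *cand = g_reads[refs[mid].seq_id] + refs[mid].position;
        if (strncmp(cand, kmer, (size_t)g_k) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && strncmp(g_reads[refs[lo].seq_id] + refs[lo].position,
                          kmer, (size_t)g_k) == 0)
        return (long)lo;
    return -1;
}
```

Because equal kmers end up adjacent after sorting, a repeated kmer can be recognised by checking whether the entry after the one returned by find_first holds the same kmer.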
Memory requirement:
•  #sequences * #kmers in each sequence * sizeof(node)
•  For 337 million reads, length = 35 and kmer = 31 (35 - 31 + 1 = 5 kmers per read)
•  (337 million * 5) * 8 ~ 13.5 GB
•  If the number of kmers in each sequence is > 8, more memory is required
struct node {
    i32 seqID                   // 4 bytes
    i8 position                 // 1 byte
    i8 relative_position[3]     // 3 bytes
}
Total size = 8 bytes
Time Complexity
•  The time complexity is the same as for the original Velvet. However, because the kmer value is
calculated dynamically, this algorithm is 15 times slower than the original Velvet.
Velvet: Hashing
•  Each read is split into strings of length k (kmers)
   •  For example, for read length 35 and kmer = 31, a total of 5 strings of length k
      are possible for that read.
•  Finding repeated kmers
   •  Keeps track of the previous sequence if the current sequence matches
•  Uses a splay tree
   •  Self-adjusting binary search tree
   •  Makes insertion and searching faster
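The repeated-kmer lookup described above can be illustrated with a small tree keyed by kmer value. Note the substitution: a plain, unbalanced binary search tree is used here in place of the splay tree, so no splaying (moving the accessed node to the root) takes place. The names KmerNode, find_or_insert and free_tree are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Tree node holding a kmer value and where it was first seen. */
struct KmerNode {
    uint64_t kmer;
    int32_t  seq_id;
    int32_t  position;
    struct KmerNode *lc;   /* left child: smaller kmer values  */
    struct KmerNode *rc;   /* right child: larger kmer values  */
};

/* Look up `kmer`. If it is already in the tree, return the node that
 * records its earlier occurrence. Otherwise insert a new node for
 * (seq_id, position) and return NULL. */
struct KmerNode *find_or_insert(struct KmerNode **root, uint64_t kmer,
                                int32_t seq_id, int32_t position)
{
    struct KmerNode **link = root;
    while (*link != NULL) {
        if (kmer == (*link)->kmer)
            return *link;
        link = (kmer < (*link)->kmer) ? &(*link)->lc : &(*link)->rc;
    }
    struct KmerNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->kmer = kmer;
    node->seq_id = seq_id;
    node->position = position;
    node->lc = NULL;
    node->rc = NULL;
    *link = node;
    return NULL;
}

/* Release every node of the tree (recursive, fine for small examples). */
void free_tree(struct KmerNode *node)
{
    if (node == NULL)
        return;
    free_tree(node->lc);
    free_tree(node->rc);
    free(node);
}
```

A non-NULL return value identifies the previous sequence and position that share the current kmer, which is the information the hashing step keeps track of.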
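Memory Requirement: Hashing
Velvet Assembler
struct splayNode {
    i64 kmer        // 8 bytes
    i32 seqID       // 4 bytes
    i32 position    // 4 bytes
    i64 *lc         // 8-byte pointer to left child
    i64 *rc         // 8-byte pointer to right child
}
Total size = 32 bytes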
Reference: D.R. Zerbino and E. Birney (2008). Velvet: algorithms for de novo short read assembly using
de Bruijn graphs. Genome Research 18:821-829.
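Modified Hashing Algorithm 1
struct splayNode {
    i32 seqID       // 4 bytes
    i8 position     // 1 byte
    i32 lc          // 4 bytes, replaces the 8-byte pointer
    i32 rc          // 4 bytes, replaces the 8-byte pointer
}
Total size = 13 bytes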
•  Memory requirement for hashing in the original Velvet:
   •  #sequences * #kmers in each sequence * sizeof(splayNode)
   •  For 337 million reads, length = 35 and kmer = 31
   •  Worst case, if there are no repeats: (337 million * 5) * 32 ~ 54 GB
•  For Modified Hashing Algorithm 1:
   •  Saves 19 bytes per node
   •  Worst-case memory ~ 22 GB [(337 million * 5) * 13]
   •  Time complexity: increased, as the kmer value is computed dynamically
   •  Does not work if #nodes > 4 billion, since lc and rc are 32-bit values
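A hedged C sketch of the 13-byte node of Modified Hashing Algorithm 1. Two assumptions are made that the text does not state: lc and rc are treated as unsigned 32-bit indices into an array of nodes, and the struct is packed with #pragma pack, since a C compiler would otherwise pad it to 16 bytes. The names SplayNodeCompact, NO_CHILD and worst_case_bytes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel meaning "no child" (assumption: indices are unsigned). */
#define NO_CHILD UINT32_MAX

/* Pack to 1-byte alignment so the compiler adds no padding. */
#pragma pack(push, 1)
struct SplayNodeCompact {
    int32_t  seq_id;     /* read that contains the kmer            */
    int8_t   position;   /* start of the kmer within that read     */
    uint32_t lc;         /* index of the left child, or NO_CHILD   */
    uint32_t rc;         /* index of the right child, or NO_CHILD  */
};
#pragma pack(pop)

static_assert(sizeof(struct SplayNodeCompact) == 13,
              "compact node should occupy 13 bytes");

/* Worst-case bytes when no kmer repeats: one node per kmer. */
uint64_t worst_case_bytes(uint64_t n_reads, uint64_t kmers_per_read,
                          uint64_t node_size)
{
    return n_reads * kmers_per_read * node_size;
}
```

The kmer value is absent from the node, so it has to be recomputed from (seq_id, position) at every comparison, which is the source of the increased running time. With 32-bit child indices the node array cannot address more than about 4 billion nodes.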
Modified Hashing Algorithm 3:
•  To improve the running time, we store the 1st kmer value of each read and calculate the following kmer
values from it. For example, for read length n and kmer = k, we store the kmer value for
positions 1 to k and then dynamically calculate the kmer values for positions 2 to k+1, 3 to k+2, and so
on.
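Deriving each following kmer value from the stored 1st value can be sketched as a rolling update: shift the previous value by one base (2 bits), append the code of the new base, and mask off the base that left the window. This is a sketch under assumptions, not the authors' code: the function names are hypothetical, positions are 0-based, and k is limited to 31 so that a kmer fits in 62 bits of a 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Map a nucleotide to its base-4 digit (A=0, C=1, G=2, T=3), -1 otherwise. */
int base_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Value of the first kmer of a read (positions 0..k-1). */
uint64_t first_kmer(const char *read, int k)
{
    uint64_t value = 0;
    assert(k >= 1 && k <= 31);
    for (int i = 0; i < k; i++) {
        int code = base_code(read[i]);
        assert(code >= 0);
        value = (value << 2) | (uint64_t)code;
    }
    return value;
}

/* Slide the window by one base: drop the leading base, append `next`. */
uint64_t next_kmer(uint64_t prev, char next, int k)
{
    uint64_t mask = (UINT64_C(1) << (2 * k)) - 1;
    int code = base_code(next);
    assert(k >= 1 && k <= 31);
    assert(code >= 0);
    return ((prev << 2) | (uint64_t)code) & mask;
}

/* kmer value starting at `position`, derived from the stored first kmer. */
uint64_t kmer_at(uint64_t first, const char *read, int position, int k)
{
    uint64_t value = first;
    for (int p = 1; p <= position; p++)
        value = next_kmer(value, read[p + k - 1], k);
    return value;
}
```

For the read ACGTAC with k = 4, first_kmer gives 27 (ACGT), and kmer_at gives 108 for position 1 (CGTA) and 177 for position 2 (GTAC). Each step costs a shift, an OR and a mask instead of a full recomputation over k bases.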
Memory requirement:
•  #sequences * #kmers in each sequence * sizeof(node) + #sequences * 8 [to store the 1st kmer value]
•  For 337 million reads, length = 35 and kmer = 31
•  (337 million * 5) * 8 + (337 million * 8) ~ 13.5 GB + 2.7 GB
Time Complexity
•  The time complexity is the same as for the original Velvet. Because the 1st kmer value is stored, this
algorithm is faster than Modified Hashing Algorithm 2 but still about 6 times slower than the original
Velvet.
•  Running time could be improved by implementing an iterative quick sort.
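One possible shape of an iterative quick sort is sketched below on plain 64-bit kmer values. It is an illustration only, with assumed names and an assumed pivot rule (last element, Lomuto partition); the sort used in the modified assembler operates on node entries and is not shown in the source. An explicit stack of index ranges replaces the recursion, and the larger side is always deferred so the stack stays within about log2(n) entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void swap_u64(uint64_t *a, uint64_t *b)
{
    uint64_t tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Lomuto partition of v[lo..hi] around v[hi]; returns the pivot's index. */
static size_t partition(uint64_t *v, size_t lo, size_t hi)
{
    uint64_t pivot = v[hi];
    size_t i = lo;
    for (size_t j = lo; j < hi; j++) {
        if (v[j] < pivot) {
            swap_u64(&v[i], &v[j]);
            i++;
        }
    }
    swap_u64(&v[i], &v[hi]);
    return i;
}

/* Sort v[0..n-1] in ascending order without recursion. */
void quicksort_iterative(uint64_t *v, size_t n)
{
    size_t stack_lo[64], stack_hi[64];   /* pending [lo, hi] ranges */
    int top = 0;

    if (n < 2)
        return;
    stack_lo[top] = 0;
    stack_hi[top] = n - 1;
    top++;

    while (top > 0) {
        top--;
        size_t lo = stack_lo[top], hi = stack_hi[top];
        while (lo < hi) {
            size_t p = partition(v, lo, hi);
            size_t left_n = p - lo, right_n = hi - p;
            if (left_n < right_n) {
                /* Defer the larger right side, continue with the left. */
                if (right_n > 1) {
                    assert(top < 64);
                    stack_lo[top] = p + 1;
                    stack_hi[top] = hi;
                    top++;
                }
                if (left_n == 0)
                    break;               /* nothing left of the pivot */
                hi = p - 1;
            } else {
                /* Defer the larger left side, continue with the right. */
                if (left_n > 1) {
                    assert(top < 64);
                    stack_lo[top] = lo;
                    stack_hi[top] = p - 1;
                    top++;
                }
                lo = p + 1;
            }
        }
    }
}
```

The last-element pivot keeps the sketch short but degrades to quadratic time on already sorted input; a median-of-three or random pivot would be a common refinement.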
Real data statistics (up to the hashing step); each cell gives time / memory

Data size (reads) | kmer | Read length | Velvet | Modification in Sequencing | Modification Algorithm 3
526,996 | 31 | 36 | 6 sec / 105 MB | 6 sec / 98 MB | 30 sec / 36 MB
49,423,731 | 31 | 36 | 12 min 29 sec / 8.7 GB | 12 min 29 sec / 7.8 GB | 85 min 10 sec / 3.9 GB
337,075,448 | 31 | 35 | Could not run within the 24 GB memory limit | Could not run within the 24 GB memory limit | 694 min 15 sec / 21.5 GB