De novo assembly

Transcription

De novo assembly

De novo assembly Sandra Smit WUR bioinforma4cs De novo assembly. Why? 454 Illumina SOLiD De novo assembly. Why? 454 Illumina SOLiD Read lengths are much shorter than even the smallest genome De novo assembly. Why? 454 Illumina SOLiD No reference genome available Not all reads map to the reference quickb!
Assemble this! ckbrow!
thelaz!
sove!
the!
helaz!
ydo!
azy!
ownfo!
umps!
ckbro!
heq!
azyd!
lazydo!
quic!
fox!
thel!
nfo!
uickb!
qui!
rthela!
ydog!
umpso!
thequ!
bro!
nfox!
kbr!
oxjum!
erthel!
ver!
Find overlap quickb!
ckbrow!
thelaz!
sove!
the!
helaz!
ydo!
azy!
ownfo!
umps!
ckbro!
heq!
azyd!
lazydo!
quic!
fox!
thel!
nfo!
uickb!
qui!
rthela!
ydog!
umpso!
thequ!
bro!
nfox!
kbr!
oxjum!
erthel!
ver!
Layout of the reads the!
oxjum!
thequ!
umps!
heq!
umpso!
qui!
sove!
quic!
ver!
quickb!
erthel!
uickb!
rthela!
ckbro!
thel!
ckbrow!
thelaz!
kbr!
helaz!
bro!
lazydo!
ownfo!
azy!
nfo!
azyd!
nfox!
ydo!
fox!
ydog Layout of the reads the!
oxjum!
thequ!
umps!
heq!
umpso!
qui!
sove!
quic!
ver!
quickb!
erthel!
uickb!
rthela!
ckbro!
thel!
ckbrow!
thelaz!
kbr!
helaz!
bro!
lazydo!
ownfo!
azy!
nfo!
azyd!
nfox!
ydo!
fox!
ydog thequickbrownfoxjumpsoverthelazydog!
Overlap-‐layout-‐consensus approach ACGTCGTGTAGTGTGTATATATAAAGGCGTATAACGGA TATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATTG ACGTCGTGTATCGTATATCGCGATCGGCATCGATCTT GATGCATCGATGTGTGTATATATAAAGGCCC AGCTCATCTATTCAGCGACGAGACTCGTTATATCATCATATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA Overlap-‐layout-‐consensus approach AGCTCATCTATTCAGCGACGAGACTCGTTATATCATCATATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA All-vs-all sequence similarity comparison
Alignment of sequences with high sequence identity
Overlap-‐layout-‐consensus approach AGCTCATCTATTCAGCGACGAGACTCGTTATATCATCATATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA Determination of consensus sequence
The de Bruijn graph approach •  Solving the assembly problem with De Bruijn graphs –  Nodes are k-‐mers –  Edges are overlaps between k-‐mers on k-‐1 posi4ons •  A k-‐mer is a (short) substring of a read –  A read of n bp consists of (n – k + 1) k-‐mers –  Example: a read of 75 bp consists of 45 (overlapping) 31-‐
mers TGACCAGTG!
TGAC!
GACC!
ACCA!
CCAG!
.... Genera4ng k-‐mers from reads A C A T T G C T A T G C T A C A A A T G A Genera4ng k-‐mers from reads A C A T T G C T A T G C T A C A A A T G A A C A T T G C T A T G C A T T G C T A T G C A T T G C T A T G C T Genera4ng k-‐mers from reads A C A T T G C T A T G C T A C A A A T G A A C A T T G C T A T G C A T T G C T A T G C A T T G C T A T G C T T T G C T A T G C T A T G C T A T G C T A C G C T A T G C T A C A C T A T G C T A C A A T A T G C T A C A A A A T G C T A C A A A T T G C T A C A A A T G G C T A C A A A T G A Graph-‐based assembly of k-‐mers k-‐mer table A C A T T G C T A T G A T T G C T A T G C T A T T G C T A T G C G C A T T G C T A T G C G A C A T T G C T A T Graph-‐based assembly of k-‐mers G A C A T T G C T A T ? G A C A T T G C T A T G ? G A C A T T G C T A T G C ? G A C A T T G C T A T G C T G A C A T T G C T A T G C G Find Eulerian path K-‐mer extrac4on and graph crea4on happen simultaneously Error correc4on is an important step Assembly approaches Overlap-‐layout-‐consensus •  Designed for Sanger and (adjusted for) 454 data •  Not suitable for short reads –  All-‐versus-‐all comparison becomes imprac4cal –  Many mutually inconsistent overlaps De Bruijn graph •  Designed for short reads •  Widely applied to short reads from Illumina and SOLiD plaZorms •  Avoid overlap step –  K-‐mer extrac4on and graph construc4on at the same 4me Assemblers Overlap-‐layout-‐consensus •  CABOG •  Newbler •  Shorty De Bruijn graph •  EULER-‐SR •  AllPaths(-‐LG) •  Velvet •  ABySS •  SOAP Just push the buôn? Assemble Complica4ng factor: DNA structure DNA is double stranded DNA contains palindromes etc. Complica4ng factor: repeats Tomato has a large, repe44ve, genome -‐  200-‐250 Mb gene-‐rich euchroma4n -‐ 600-‐700 Mb highly repe44ve heterochroma4n -‐ Mul4ple overlaps in overlap-‐layout-‐consensus approach -‐ Mul4ply connected nodes in graph-‐based approach Complica4ng factor: repeats Complica4ng factor: sequencing errors Technologies have different proper4es and error profiles Sequence-‐dependent coverage bias Nonuniform error rates Kircher & Kelso, 2010 Filter your data! Data filtering is essen4al to remove: -‐  Sequencing errors -‐  Homopolymer stretches -‐  Low-‐quality bases Use for example a k-‐mer table -‐ assump4on: k-‐mer table represents the complete genome A C A T T G C T A T G A T T G C T A T G C T A T T G C T A T G C G A T T G C T T T G C G C A T T G C T A T G C G A C A T T G C T A T Filter your data! Data filtering is essen4al to remove: -‐  Sequencing errors -‐  Homopolymer stretches -‐  Low-‐quality bases Use for example a k-‐mer table -‐ assump4on: k-‐mer table represents the complete genome n A C A T T G C T A T G 12 A T T G C T A T G C T 8 A T T G C T A T G C G 118 A T T G C T T T G C G 1 C A T T G C T A T G C 23 G A C A T T G C T A T 6 Filter your data! Data filtering is essen4al to remove: -‐  Sequencing errors -‐  Homopolymer stretches -‐  Low-‐quality bases Use for example a k-‐mer table -‐ assump4on: k-‐mer table represents the complete genome n A C A T T G C T A T G 12 A T T G C T A T G C T 8 A T T G C T A T G C G 118 C A T T G C T A T G C 23 G A C A T T G C T A T 6 Complica4ng factor: heterozygosity k-‐mers – homozygous potato (k = 31) 0 1 2 3 k-‐mers – heterozygous potato (k = 31) k = 63 heterozygous k-‐mers homozygous k-‐mers 0 1 2 3 4 Complica4ng factor: heterozygosity •  Heterozygous k-‐mers create “bubbles” in the graph –  Every SNP and in/del creates a bubble •  Solu4ons –  Pinching bubbles: loss of diversity –  Breaking graph: loss of con4guity Assembly is not so easy Assembly result DNA molecule ACGTCGTGTATCGTATATCGCGATCGGCATCGATCTATTCAGCGACGAGACTCGTATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA shotgun reads con4gs Assembly result DNA molecule ACGTCGTGTATCGTATATCGCGATCGGCATCGATCTATTCAGCGACGAGACTCGTATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA shotgun reads paired-‐end reads ACGTCGTGTATCGTATATCGCGATCGGCA -‐-‐-‐-‐-‐-‐TCTATTCAGCGACGAGACTCGTATCGATCGATGCATCGATGTGTGTATATATAAAGGCGTATAACGGCTAGAGCTACCCCACACATA scaffolds Assembly quality •  Number and length of con4gs/scaffolds –  N50 • 
• 
• 
• 
Number of gaps/Ns GC content Coverage and completeness Structural consistency 50% of assembly N50 length: 18 Kb N50 index: 4 –  Paired-‐end data –  Physical and gene4c maps •  Error rate –  Known sequences (BAC clones, ESTs, etc.) Sequencing strategy Sequencing strategy: which sequencing data is necessary for my project? How can we balance the benefits and the costs? Technologies have different proper4es and error profiles What is the contribu4on of different input data to an assembly? Kircher & Kelso, 2010 Sequencing strategy Sequencing strategy Recommenda4ons •  Longer reads (Sanger, 454) improve assembly quality •  Deep coverage by reads with lengths longer than common repeats •  Mixture of short and long insert sizes for paired-‐end data Example •  De novo assembly: overlap-‐layout-‐extend –  454 & Sanger shotgun sequences –  454 & Sanger matepair sequences •  Base error correc4on with SOLiD reads •  Base error correc4on with Illumina reads •  Gap filling with Sanger BAC sequences Summary •  Two basic approaches for de novo assembly –  Overlap-‐layout-‐consensus –  De Bruijn graphs •  Assembly challenges –  Repeats –  Heterozygosity •  Assembly is not trivial –  Think about your sequencing and assembly strategy Literature Assembly of large genomes using second-‐genera4on sequencing Schatz et al. Genome Res. 2010 Assembly algorithms for next-‐genera4on sequencing data Miller et al. Genomics 2010 High-‐throughput DNA sequencing -‐-‐ concepts and limita4ons Kircher and Kelso Bioessays 2010 Acknowledgements • 
• 
• 
• 
Jack Leunissen Roeland van Ham Erwin Datema Jan van Haarst –  The Interna4onal Tomato Genome Sequencing Consor4um –  Potato Genome Sequencing Consor4um

De novo assembly

Transcription

Similar documents

DNA, Genes and their Regula+on

an introduction to the relationship between ict and sustainable