GEP Hybrid Assembly Walkthrough

Last Update: 5/19/2016
Developed by Christopher Shaffer, with input from GEP members Don Paetkau, Michael Rubin,
and Laura Reed
Prerequisites
Consed version 25 or higher
Familiarity with Consed (for example, prior training with “Using Consed Graphically” and
the “Drosophila Finishing Problem Set”)

Files for this Exercise
Project scf7180000301495_190000_290000
Introduction
This walkthrough will illustrate the techniques for consensus error correction as well as
closing gaps using the Drosophila biarmipes project scf7180000301495_190000_290000.
Note that this walkthrough assumes the reader is already familiar with Consed, and the
exact details on how to accomplish many of the tasks are not given. Many of the figures will
not match exactly the images obtained by the user (even if they follow the protocol
exactly). Users of this walkthrough are expected to have sufficient experience to interpret
any differences and determine if they are significant or trivial. As such, users of this
walkthrough should be very familiar with the techniques covered in the “Using Consed
Graphically” walkthrough and the “Drosophila Finishing Problem Set” exercise
(available on the GEP website).
Setup
Launch X11 and open a new xterm; navigate to the edit_dir of the D. biarmipes project
scf7180000301495_190000_290000 (e.g. cd scf7180000301495_190000_290000/edit_dir).
Enter consed & at the xterm prompt. The “&” will keep your terminal active in case you need
to use it later. Open scf7180000301495_190000_290000.ace.3. Select “No” if a prompt appears
that asks if you would like to apply edits from the edit history (.wrk) file.

When improving hybrid assemblies, we will use custom settings that differ from the default
Consed settings. You will need to verify these settings (and change them as necessary)
each time you launch Consed.

First, in the Main Window, select “Options -> General Preferences”. Check that the
“Threshold for Low Consensus Quality (highest low)” is set to 25, and the “Threshold for
High Quality Discrepancy (lowest high)” is set to 30.
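The two thresholds above are quality values. Assuming they are standard phred scores (the
walkthrough later refers to a “phred (quality) score”), a score Q corresponds to an
estimated base-call error probability of 10^(-Q/10). The minimal Python sketch below is not
part of Consed; the function name is illustrative and it only shows what the thresholds
mean in those terms.

```python
def phred_to_error_probability(q):
    """Convert a phred quality score to the estimated probability of a base-call error."""
    return 10 ** (-q / 10)


# Quality values mentioned in this walkthrough (assumed to be phred scores).
for q in (20, 25, 30):
    print(q, phred_to_error_probability(q))
```

Under this assumption, a base that counts toward a High Quality Discrepancy (30 or higher)
has an estimated error rate of about 1 in 1000.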
Now double-click on any contig (e.g. scf7180000301495:190000-290000 in this project) to
open an Aligned Reads window. In the “Dim” menu, verify that the Dim option is set to
“>Dim Nothing”. By default Consed will dim the unaligned regions at the ends of reads. This
makes a lot of sense when working with Sanger reads, which often have vector sequence at
the end, but it does not make sense when working with Illumina reads. In Illumina reads ALL
data is relevant and should not be given a black background, hence the “Dim Nothing”
setting.

In the Sort menu, click on the item “Sort Options and Help”. In the dialog box that appears,
change the “Display reads sorted alphabetically or by strand/left read end?” field to
“Strand/Left End”. Change the “When you click on the consensus, how do you want reads
sorted at cursor position” field to “by base”. Be aware that this setting means that the
screenshots in this walkthrough may not exactly match the image on your computer
screen. However, it does make it possible to work more quickly and efficiently when doing
actual finishing, so these settings were used throughout the walkthrough.
Assembly view
This project has a single contig of 100 kb. Click on the “Assembly View” button on the
Consed Main Window to open assembly view. Run cross_match to detect sequence matches
within this contig and identify regions with a high density of discrepant read pairs.

In this project, there is a cluster of 7 inconsistent mate pairs where one member of the
mate pair is placed at around 13 kb while the other member is placed at around 51 kb. The
number of discrepant reads of this type will vary with each project. While it is not
necessary to remove these inconsistent reads, given the typically very high levels of
coverage, it is recommended because it will likely reduce the size of the HQD (high quality
discrepancy) list that must be analyzed.

As an example of how to remove reads of this type, click on the cluster of red lines that
spans from 13 kb to 51 kb of the contig. This will bring up a list of all the discrepant reads in
that group. Note how most pairs are mapping in an inconsistent orientation (i.e. the paired
end reads are pointing away from each other); hence, both distance and orientation are
evidence that these reads are improperly mapped.
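The two lines of evidence described above (orientation and distance) can be sketched in
Python. This is only an illustration of the reasoning, not how Consed's Assembly View is
implemented. The Read class, the classify_pair function and the max_span value are
hypothetical; the walkthrough does not state the expected insert size of the library, so
the 1000 used in the example is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    start: int     # leftmost aligned consensus position
    end: int       # rightmost aligned consensus position
    forward: bool  # True if the read aligns to the top strand


def classify_pair(a: Read, b: Read, max_span: int) -> List[str]:
    """Return the reasons a mate pair looks inconsistent (empty list if none)."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    reasons = []
    # A consistent pair points inward: left read forward, right read reverse.
    if not (left.forward and not right.forward):
        reasons.append("orientation")
    # The pair should not span more than the expected insert size.
    if right.end - left.start + 1 > max_span:
        reasons.append("distance")
    return reasons


# A pair like those in the 13 kb / 51 kb cluster: far apart and pointing outward.
print(classify_pair(Read(13000, 13100, False), Read(51000, 51100, True), max_span=1000))
```

A pair that fails both checks, like the example, is the strongest candidate for removal.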
Click the “Pull Out Reads” button. In the next window select all the reads and click the
“Remove Highlighted Reads” button. This will put each read into its own contig
(Contig290001 through Contig290014). As with any procedure that changes the assembly
(i.e. moves any reads into or out of a contig or makes any tears or joins), you must save the
assembly before making any additional changes to the assembly. The project was saved as
scf7180000301495_190000_290000.ace.4, so if you think you have made a mistake you
may quit Consed and load this ace file. Throughout the walkthrough other ace file names
will be given; they can be loaded to set the state of the project before continuing.

Navigate to Assembly View and notice another set of inconsistent mate pairs that spans
from the beginning of the main contig to the region at around 53 kb. These are also
inconsistent because they are in the incorrect relative orientation and because they are too
far apart from each other. Pull out these inconsistent reads from the main contig using the
protocol described above (i.e. click on red lines, click to pull out reads, select all reads,
remove highlighted reads, save assembly).

The set of reads that extends from 36k to 42k (red arrow in figure above) contains only
two reads and for the purposes of this walkthrough will not be removed. Finishers are free
to develop their own policy in regard to the number of reads in a cluster that would justify
their removal from the main contig. The last set of inconsistent reads is a small cluster
around 36k (green arrow). You may need to “Zoom In” to see these. There are sufficient
reads here that if this were a real project their removal would be recommended. However, for
the purposes of this walkthrough they were left in the project.

Before continuing with the rest of this walkthrough, readers who wish to have an exact
match to the screenshots shown in this walkthrough may wish to quit Consed, restart and
open scf7180000301495_190000_290000.ace.5. (Remember to change the Consed
settings described in the Setup section after you open the ace file.)
Resolving base errors at mononucleotide runs
The primary goal of the GEP sequence improvement project is to correct consensus errors
within mononucleotide runs (MNR). The secondary goal of the sequence improvement
project is to see if there is sufficient Illumina data to close gaps. Optionally, finishers may
design primers to carry out PCR/Sanger to add new data to the project. This can be used to
resolve gaps and regions with low consensus quality (discussed below). You should talk to
your mentor about primer design and Sanger sequencing; this will only be done in special
cases if you are doing a wet lab component to your finishing project.

This walkthrough will address the primary goal of resolving the MNR regions before
addressing gaps and low consensus quality regions. However, because of the amount of
time required to produce additional sequencing data, if you are planning on doing
PCR/Sanger we recommend that you begin with primer design. You can work on correcting
errors in MNR regions while waiting for the results from your Sanger reactions.

In the “Contig List” section of the Consed Main Window, you will see the project now has a
long list of the individual reads (we removed them from the main contig above). The main
contig we are working on is at the bottom of the list (scf7180000301495:190000-290000).
Scroll down to find the main contig in the “Contig List” and double-click on the main contig
to open it in an Aligned Reads window. Alternatively, open the Aligned Reads window directly
from the assembly view.

Depending on your settings you may see that the base numbering system starts with
189,660. This is a new feature available in Consed v25 that allows for proper numbering
when a project is actually a subsection of a larger scaffold. This project is derived from a
portion of the genomic scaffold scf7180000301495 in the D. biarmipes whole genome
assembly and the numbering here indicates the location within that scaffold.

You can change the numbering system so that base 1 is the first base in the project. In the
Aligned Reads window, click on the Misc menu and select “Turn On/Off User-Defined
Consensus Scale Numbers”. This will temporarily change the numbering scheme for as long
as Consed is running. If you quit and restart Consed the numbering system will change
back. If you wish to permanently change the numbering scheme, you can delete the
“startNumberingConsensus” consensus tag at the beginning of the contig. Click on the “<<<”
button to navigate to the start of the sequence in this project.
To screen for base errors, return to the Consed Main Window and select Navigate ->
“Search for Highly Discrepant Positions”. Given the high read depth and the large number
of reads that are improperly placed in these assemblies, sites with one or two HQD’s are
quite common and seldom indicate a genuine problem with the consensus. As an initial
screen for problem areas in the main contig, we will set the “minimum # of discrepant
reads” field to 3. This focuses the list on discrepant regions where a consensus error might
actually exist. To focus on regions where the Illumina data does not support the consensus,
set the “ignore bases below this quality” field to 30. Click “Search”.

The resulting list will have many regions where the consensus is correct even though there
are 3 high quality reads that disagree with the consensus. In many cases, the discrepancies
can be attributed to errors in the 454 reads that show a different number of bases compared
to both the Illumina reads and the consensus. (You can use the read name to distinguish
454 reads from Illumina reads; Illumina reads have the prefix “USI-”, 454 reads will start
with a “G”.) Some of the other discrepant regions on the list are caused by reads that have
been incorrectly placed in the assembly by the mapping program (e.g. because of large
transposons or other repetitive regions in the genome).

The basic strategy is to navigate through each item in the Highly Discrepant Regions list
and look for discrepant regions that are associated with a MNR. When you find a region
associated with a MNR (within 5 bases), you should inspect the region carefully and either
confirm or edit the consensus based on the available evidence. Note that because of the
known weaknesses of the 454 sequencing technology in resolving the correct number of
bases in long MNR’s, finishers should rely on the Illumina data when determining length.

Since adjacent GEP sequence improvement projects overlap with each other, the finishers
can ignore the discrepancies that are found within 2500 bases of the ends of the project
(because these problem areas will be resolved by the finishers working on the adjacent
projects). Consequently, we will ignore the initial highly discrepant regions and examine
the region at base position 2719. Inspection of that region in the Aligned Reads window
shows that the discrepant position (i.e. 2719) is not near a MNR. Because we are sorting
“by base”, the four reads with a discrepant C (3 of which are high quality) are listed above
all the reads with the T.
This discrepancy may be due to a base calling error or mis-mapping of the read, or it is
possible that this site is polymorphic. Regardless of why this discrepancy exists, there is
insufficient evidence to support the hypothesis that the consensus is incorrect: there are 70
reads here (with quality score 30 or above) that agree with the consensus and show a T
and only 3 high quality reads that show a C.

In fact, given that the discrepant position is NOT part of the nearby MNR, we could simply
move on to the next region with 3+ HQD’s. However, if you examine these reads carefully,
you will find additional discrepancies further downstream (e.g. at 2799 of the consensus).
This makes it very likely that these reads actually belong somewhere else in the genome
and it strengthens the argument that the consensus should remain a T. Given the high
frequency of mis-mapped reads, discrepancies of this type will not be examined carefully.
Instead finishers should focus their efforts exclusively on discrepant regions associated
with MNR’s. Click on the “Next” button on the Aligned Reads window to navigate to the next
highly discrepant region. Continue to click “Next” until you navigate to a discrepant region
that is associated with a MNR. The first discrepant position within a MNR is at 5800.

The sorting “by base” option has placed all the reads with the pad (*) at the top of the
Aligned Reads window; below these are all the reads with a T at this position. There are
quite a few 454 reads in this region that show fewer T’s than are in the consensus
(i.e. 16 of the 454 reads show a pad). Use the scroll bar on the right to scroll down and
examine the Illumina reads. All the high quality Illumina reads that aligned to this region
agree with the consensus and show 7 T’s. (Illumina reads start with the prefix USI-; 454
read names are shorter and start with G.) Because all the Illumina reads agree with the
consensus, there is insufficient evidence to change the consensus and we will keep the
consensus at 7 T’s and move on.
Click “Next” in the Aligned Reads window to navigate to the next discrepant region. The
next location on the HQD list that shows an interesting discrepancy is located at 6055.
Examination of the region shows five 454 reads that have a 7 bp insertion (TCTCATT)
compared to the consensus (if you can only find 3 discrepant reads, make sure “Dim
Nothing” is selected and look again). Again the preponderance of evidence suggests that the
consensus is correct (i.e. there are ~90 reads covering this region, 85 of which are high
quality reads that agree with the consensus while only a few disagree with the consensus).
The skew in the distribution of reads that disagree with the consensus (3%) compared to the
percentage of reads that agree with the consensus (97%) makes it extremely unlikely for
the discrepancy to be caused by a polymorphic site.

The next discrepant MNR site with 3 or more HQD’s is at 7557. It is associated with a MNR
of 4 T’s. However, there are 52 (high quality) reads with a T at this position and only 3 (high
quality) reads with a C, so again there is insufficient evidence to change the consensus.
Click on the “Next” button in the Aligned Reads window to navigate to the next discrepant
region. There are several other regions that are associated with MNR’s downstream, but in
each of these cases the consensus is fine; no changes are justified by the discrepant reads.

The next interesting region is located at 9651. This region is an example of a problematic
alignment which necessitates careful examination. Here the consensus sequence (ignoring
pads) is AAATGAGAAAAAAACATAT. When you scroll down in the Aligned Reads window,
you will find many base discrepancies in this region. However, no one position is discrepant
across all the high quality Illumina reads. Note how some of the high quality Illumina reads
show a discrepant A which differs from the pad just to the right of base 9651. Other high
quality Illumina reads have a (consistent) pad at that position
(e.g. USI-EAS376_0023:6:17:19462:9201_1).
In this region, careful examination reveals that all of the properly mapped high quality
Illumina reads have 8 A’s irrespective of how they were aligned to the consensus. Hence
the available evidence suggests we should change the consensus from 7 A’s to 8 A’s.

Inconsistent alignments of this type are a common issue with Consed. This is particularly
true for regions with long MNR’s or many smaller MNR’s. For this reason it is very
important to inspect the discrepant regions carefully, counting the number of bases in each
read if necessary, to determine the consensus sequence.

If they prefer, finishers can use the standard technique of opening a trace window for one
of the Illumina reads and using change consensus to make the proper edit. However, unlike
previous versions of Consed, Consed 25 allows users to edit the consensus sequence directly.
To add an extra A to the consensus using this technique, click on one of the pads (*) adjacent
to the 7 A’s in the consensus. Hit the A key to change the consensus from a pad to an a. Note
that all edits to the consensus are kept as lowercase (i.e. low quality edit) by Consed.
After editing the consensus save the assembly (scf7180000301495_190000_290000.ace.6).
For practice, you can continue to assess regions on the HQD list that are associated with
MNR’s and correct the consensus sequence where a correction is supported by the Illumina
data. There are several additional regions in this project that require corrections to the
consensus sequence but they will not be discussed in this walkthrough. The next region
discussed will be the region around the long A MNR at around 51,425. This region is a good
example of a very complex region where the reads must be analyzed one by one in order to
resolve the consensus.

This region has 2 MNR’s of A’s that are separated by 2 T’s. Regions with multiple MNR’s are
particularly error prone for 454 sequencing and are also more likely to have been
misaligned by Consed. If some of the reads have a completely black background at the end
of the read be sure to use the Dim menu to select “Dim Nothing”.
A careful examination of this region shows that many of the red discrepant bases in this
region can be attributed to misalignments and are not genuine discrepancies in the lengths
of the MNR’s. The consensus has a sequence of GGGAAAAATTAAAAAAAAAAATTAT, with
the critical issue being how many A’s are in the two MNR’s. Click on the middle A at position
51,425; this should bring all the Illumina reads to the top. Irrespective of where they are
sorted or how they were aligned, careful examination shows that 7 of the Illumina reads
(highlighted in the figure below) have the sequence 3 G’s followed by 6 A’s followed by 2
T’s then 11 A’s followed by TTAT (i.e. GGGAAAAAATTAAAAAAAAAAATTAT).

Of the reads that do not match, the read whose name ends with :3596_2 (top read in the
figure above) does not cover the entire region being discussed and is therefore
uninformative. The read whose name ends with :6277_1 only disagrees with this sequence at
positions that have very low quality. In contrast, the reads whose names end with :12171_1
and :5958_1 contain many HQD’s compared to the consensus throughout their entire
length. Given that this region has a blue repeat tag, it is very likely that these reads have
been improperly mapped to this region and the differences indicated by these reads are
unreliable. The read whose name ends with :9617_1 is very low quality for most of this region
and the evidence it provides would not be sufficient to overrule the 7 high quality reads
that are all consistent with each other.

Collectively, our analysis of the high quality Illumina reads that aligned to this region
suggests that the correct consensus sequence is GGGAAAAAATTAAAAAAAAAAATTAT.
Edit the consensus, adding an additional A to each mononucleotide run, and save the
assembly (scf7180000301495_190000_290000.ace.7).
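Counting bases run by run, as done above, is independent of where the aligner placed the
pads. The Python sketch below illustrates that idea; it is not a Consed feature and the
function names are illustrative. The read is built from the run lengths stated in the text
(3 G’s, 6 A’s, 2 T’s, 11 A’s, TTAT). The consensus run lengths (5 and 10 A’s) are an
assumption inferred from the instruction to add one A to each run.

```python
from itertools import groupby


def run_lengths(seq):
    """Collapse a sequence into (base, run length) pairs, ignoring pad characters (*)."""
    cleaned = seq.replace("*", "").upper()
    return [(base, len(list(group))) for base, group in groupby(cleaned)]


def differing_runs(consensus, read):
    """List (base, consensus length, read length) for runs whose lengths differ."""
    cons_runs, read_runs = run_lengths(consensus), run_lengths(read)
    if [b for b, _ in cons_runs] != [b for b, _ in read_runs]:
        raise ValueError("sequences differ by more than run lengths")
    return [(base, c_len, r_len)
            for (base, c_len), (_, r_len) in zip(cons_runs, read_runs)
            if c_len != r_len]


# Consensus run lengths are inferred (see the note above); read run lengths are from the text.
consensus = "GGG" + "A" * 5 + "TT" + "A" * 10 + "TTAT"
read = "GGG" + "A" * 6 + "TT" + "A" * 11 + "TTAT"
print(differing_runs(consensus, read))  # [('A', 5, 6), ('A', 10, 11)]
```

Because pads are stripped before counting, two reads that were aligned differently but
contain the same bases give the same result.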
There are other regions further downstream in this contig with at least 3 HQD’s and where
most of the Illumina reads disagree with the consensus. Analysis of these regions is left as
an exercise for the reader.

Checking areas with fewer than 40 reads
The protocol for resolving MNR’s described above assumes that the discrepant region
contains many reads. Setting the minimum of at least 3 HQD’s allows us to filter out many
locations with mis-mapped reads. However, we will need to use an alternate approach to
check for errors in MNR’s found in regions with low coverage. A finisher must navigate to
these regions and carefully inspect them for any potential errors in the consensus
sequence.
To search for regions with low read coverage, use the “Main Window -> Navigate -> Search
for High (low) Depth of Coverage” menu. In the “Navigate by High (or Low) Depth of
Coverage” window, uncheck the “show high depth (not low depth)” field to search for
regions with low read coverage. Set the “ignore read bases below this quality” field to 10
and then set the “max (for low depth regions) depth of coverage” field to 40. Click Search.

The “Low Depth of Coverage Regions” navigator window will appear and you can use this
to quickly navigate to the areas with low read coverage. The entries at the beginning of the
list correspond to the single reads that we have previously removed from the main contig.
Scroll down until you reach the entries that correspond to your main contig (i.e.
scf7180000301495:190000-290000); remember that the first and the last 2.5 kb do not
need to be finished.
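As a rough illustration of what a low depth search reports, the Python sketch below scans
a list of per-position depths and returns the intervals at or below a cutoff. It is not
Consed's implementation. The depths are assumed to already count only read bases of
quality 10 or higher, and treating the cutoff of 40 as inclusive is an assumption; the
function name is illustrative.

```python
def low_depth_regions(depths, max_depth=40):
    """Return 1-based inclusive (start, end) intervals where depth <= max_depth."""
    regions = []
    start = None
    for pos, depth in enumerate(depths, start=1):
        if depth <= max_depth:
            if start is None:
                start = pos  # open a new low depth interval
        elif start is not None:
            regions.append((start, pos - 1))  # close the current interval
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


# Toy depths for an 8-base consensus.
print(low_depth_regions([50, 50, 30, 20, 45, 40, 60, 10]))  # [(3, 4), (6, 6), (8, 8)]
```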
The first low coverage region we need to inspect spans from 8465 to 8800. Search for any
sequence within this region with a MNR of 5 or more. This can be accomplished by either
manual inspection (easily done if the region is small) or by using “Search for String”. Be
sure to search with both GGGGG (which will find both GGGGG and CCCCC) and AAAAA
(which will find both AAAAA and TTTTT). Searching with GGGGG shows no MNR of 5 or
longer within this low coverage region. Also, remember that when searching for a string,
regions that match the complement of the query sequence (i.e. locations with CCCCC) will
be listed AFTER all the locations that matched the uncomplemented sequence (i.e. GGGGG).
Hence you will need to scroll down to ascertain if any CCCCC MNR’s are found between
8465-8800. (In this case, there are no MNR’s of CCCCC in this low coverage region.) To
avoid this scrolling issue finishers may wish to run 4 different MNR searches and
cross-reference all 4 lists when looking for overlapping regions.

Searching with AAAAA shows 3 regions between 8465-8800 that should be carefully
inspected: 8593-8597, 8653-8657, and 8741-8745. Examination of the three regions shows
no discrepancies or inconsistencies that would require changes to the consensus sequence.
Scroll to the “complemented” section of the “Searching Contigs” list to check for any TTTTT
MNR’s in this region (there are no TTTTT MNR’s that overlap with the low coverage region
at 8465-8800).
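The search and cross-reference described above can be sketched in Python. This is not
Consed's “Search for String”; it is a small regular expression search that reports each
maximal run of a base or its complement and keeps the runs that overlap a region of
interest. The function names are illustrative, and the input is assumed to be an unpadded
consensus string with position 1 as its first base.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def find_runs(seq, base, min_len=5):
    """Return 1-based inclusive (start, end, base) runs of `base` or its complement."""
    bases = base.upper() + COMPLEMENT[base.upper()]
    # One captured base followed by at least (min_len - 1) copies of the same base.
    pattern = re.compile("([%s])\\1{%d,}" % (bases, min_len - 1))
    return [(m.start() + 1, m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def runs_in_region(seq, region_start, region_end, min_len=5):
    """All A/T and G/C runs that overlap the region, sorted by position."""
    hits = find_runs(seq, "A", min_len) + find_runs(seq, "G", min_len)
    return sorted(h for h in hits if h[0] <= region_end and h[1] >= region_start)


# Toy sequence: A run at 5-10, T run at 13-17, C run at 19-24.
toy = "ACGT" + "A" * 6 + "CG" + "T" * 5 + "G" + "C" * 6 + "AG"
print(runs_in_region(toy, 12, 20))  # [(13, 17, 'T'), (19, 24, 'C')]
```

Sorting all hits by position avoids the scrolling issue, since runs on both strands appear
in a single list.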
According to the GEP sequence improvement standard for the hybrid assemblies, each
consensus position must be supported by a minimum of two reads that agree with each
other and are each of sufficient quality. These reads can be either 454 or Illumina reads (or
a combination of both).

“Sufficient” quality in this context means that the read position has a phred (quality) score
of at least 20 in each read and that the finisher is confident that the read has been mapped
to the correct region. To have high confidence that the read is properly mapped, it should
have no more than one HQD anywhere in the read compared to the final consensus. If there
is only a single high quality read that supports the consensus, then the finisher will need to
add a “data needed” tag to the region and highlight the presence of an unresolved low
quality region in the final finishing report form. If you are implementing a wet lab
PCR/Sanger pipeline you may be asked to cover the region to resolve the issue; check with
your mentor. If you are not implementing your own PCR/Sanger experiments, you should
NOT design primers for these low consensus quality “data needed” regions.
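The standard above can be written as a small check. The Python sketch below is only an
illustration of the rule as stated (two agreeing reads, phred score of at least 20, no more
than one HQD per read); the ReadBase class and the function name are hypothetical, and
the finisher's judgment about mapping is reduced here to the HQD count alone.

```python
from dataclasses import dataclass


@dataclass
class ReadBase:
    base: str       # base the read shows at the consensus position
    quality: int    # phred score of that base
    hqd_count: int  # HQD's anywhere in the read compared to the final consensus


def position_is_supported(consensus_base, read_bases,
                          min_quality=20, max_hqd=1, min_reads=2):
    """True if enough well-mapped, sufficient-quality reads agree with the consensus."""
    supporting = [r for r in read_bases
                  if r.base.upper() == consensus_base.upper()
                  and r.quality >= min_quality
                  and r.hqd_count <= max_hqd]
    return len(supporting) >= min_reads


reads = [ReadBase("T", 35, 0), ReadBase("T", 22, 1), ReadBase("T", 12, 0)]
print(position_is_supported("T", reads))      # True: two reads meet both criteria
print(position_is_supported("T", reads[1:]))  # False: only one read qualifies
```

A position that fails this check is where the “data needed” tag would apply.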
For practice you may continue to work through the regions in the low coverage list by
manually inspecting these low coverage regions and cross-referencing them with the MNR
“Search for String” lists. Examine each location to confirm the consensus in these regions.

The next low coverage region discussed here is located at 94415-94757. Within this region
is a MNR of 10 T’s in the consensus starting at base 94619. Yet again there are alignment
problems that make this area difficult to analyze.

While the consensus has 10 T’s, careful inspection shows that the 3 high quality Illumina
reads (highlighted in the figure above) all have 11 T’s. Note how, because of the vagaries of
the alignment algorithm used by Consed, no consensus position has 3 HQD’s. Two of the reads
are discrepant with the consensus A at 94618. The other read is discrepant with the pad
just after consensus position 94628. We will need to verify that these 3 reads contain no
more than one HQD outside of this MNR to be confident that they have been mapped
correctly to this region. Given that Illumina reads are all relatively short this can be done
simply by manual inspection. To check this computationally, pull up the list of all HQD’s
(Navigate -> High quality (>=30) discrepancies, > 5 bp from unaligned region). Scroll down
to the region of interest at ~94,000 and note which of the reads contain HQD’s.

The HQD list shows that each of these three reads (that have an extra T compared to the
consensus) only has a single HQD. In each case, the only HQD listed for each of these reads
is the HQD that is associated with the incorrect length of the MNR in the consensus.
Consequently, we can be confident that these three discrepant reads have been mapped to
the correct region and we can correct the consensus using these reads. Correct the
consensus (11 T’s) and save the assembly (scf7180000301495_190000_290000.ace.8).
Resolving gaps and low consensus quality regions
Because resolving gaps and low consensus quality regions sometimes requires additional
data from PCR/Sanger sequencing reactions, we recommend that finishers doing the
optional PCR/Sanger pipeline begin their sequence improvement project by resolving gaps.
The primary goal of resolving base errors in MNR’s can be worked on while waiting for the
PCR/Sanger results. To determine if your project has any gaps, simply use “Search for
String” and search for “nnnnn” in the consensus.

These results show n’s in the region around 20445 plus or minus about 13 bases.
Double-click on the first match (at 20432-20436) to navigate to this region. Notice that the
gap region actually spans from 20432-20458.
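The “nnnnn” search reports 5-base matches, while the gap itself is the whole run of n’s.
The Python sketch below, which is not part of Consed, shows one way to report each gap as
a single interval; the function name is illustrative and the toy sequence is made up, with
a 27-base run of n’s chosen to match the length of the span 20432-20458.

```python
import re


def find_gaps(consensus, min_len=5):
    """Return 1-based inclusive (start, end) for each run of n/N at least min_len long."""
    pattern = re.compile("[nN]{%d,}" % min_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(consensus)]


# Toy consensus: 4 bases, a 27-base gap, 4 bases.
toy = "ACGT" + "n" * 27 + "ACGT"
print(find_gaps(toy))  # [(5, 31)]
```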
Because of how the projects are constructed (i.e. mapping 454 and Illumina reads against
the published consensus sequence), many of the projects contain small gaps that can be
resolved without additional data. In this example, there are multiple high-quality Illumina
reads that span the entire gap. In addition, note how the bases that are aligning to the n’s in
the consensus actually match the sequence adjacent to the gap. This suggests that there is
no missing data. Finishers may be able to detect the overlap by visual inspection or they
can use the “Search for String” functionality in Consed to search for overlapping regions.

In this example, we will use the read USI-EAS376_0023:6:66:17825:11588_2 (which
extends into and beyond the gap) to help resolve this region. The last bases in the
consensus on the left side of the gap are TTTTTGGga; select the bases in read
USI-EAS376_0023:6:66:17825:11588_2 immediately following this sequence to the end of the
read and perform a Search for String.

The results of “Search for String” reveal an exact match that is located just on the right
side of the gap (at position 20477-20514). Visual inspection of this region reveals an
overlap of a few bases on each side of the gap: ataggaatttttggga is actually repeated on each
side of the gap. There are many Illumina reads that support the hypothesis that there is
only a single instance of this sequence. Hence we might be able to resolve this gap by
performing a force join to collapse the overlapping reads into a single region.
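The hypothesis above (one sequence duplicated on both sides of a gap that should collapse
into a single copy) can be illustrated on a toy string in Python. This is a conceptual
sketch only: Consed resolves the gap by tearing and joining contigs of reads, not by string
editing. The function names and the toy layout (flanking bases, a 27-base gap) are
hypothetical; only the repeated sequence is taken from the text.

```python
def find_all(seq, query):
    """Return 1-based inclusive (start, end) for every occurrence of query in seq."""
    seq, query = seq.lower(), query.lower()
    hits = []
    i = seq.find(query)
    while i != -1:
        hits.append((i + 1, i + len(query)))
        i = seq.find(query, i + 1)
    return hits


def collapse_duplicate(seq, first_hit, second_hit):
    """Keep everything through the first copy, then resume after the second copy."""
    return seq[:first_hit[1]] + seq[second_hit[1]:]


repeat = "ataggaatttttggga"  # sequence reported on each side of the gap
toy = "ccgt" + repeat + "n" * 27 + repeat + "acgtta"  # hypothetical layout
hits = find_all(toy, repeat)
print(hits)                                       # [(5, 20), (48, 63)]
print(collapse_duplicate(toy, hits[0], hits[1]))  # ccgtataggaatttttgggaacgtta
```

Two hits flanking a run of n’s are the pattern to look for; a single hit would suggest the
gap is genuine.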
If our hypothesis is correct, then the assembly piece (scf7180000301495:190000-290000)
for this project is wrong and contains a misassembly. If the assembly piece read
remains in the contig, Consed will prohibit us from closing the gap. Thus, we must first pull
out the assembly piece that was used to construct the initial assembly before attempting to
reassemble this region. This fake read has the same name as the project:
scf7180000301495:190000-290000.

Remove the assembly piece read by right-clicking on the read and selecting “Remove read
scf7180000301495:190000-290000 from this contig”. In the “Remove Reads” dialog box
that appears make sure that only the “scf7180000301495:190000-290000” read is listed in
the “Reads to be removed” box on the right. The default settings are appropriate and do not
need to be changed; click “Do it”. A new “Reads Removed” dialog will appear with a note
that indicates we must save the assembly. Go back to the Consed Main Window and save
the assembly (scf7180000301495_190000_290000.ace.9).

Depending on your project, removing the assembly piece from the main contig could cause
the contig to break into multiple smaller contigs. However, in this case, the assembly
remained in a single contig. When the assembly piece was removed Consed attempted to
calculate a new consensus, replacing the N’s with bases from the reads below.

Perform a Search for String using the overlapping sequence we have identified previously
(i.e. ataggaatttttggga) to navigate to the gap region that we are trying to resolve. Because
the reads below the consensus are not properly aligned, there are many discrepant red
bases in this region.
However, careful visual inspection of the region surrounding the location of the previous
n’s (bases 20432-20456) shows that the region remains the same: there are still two copies
of the sequence ataggaatttttggga. To resolve the gap and create an assembly with no HQD’s,
we will need to tear and re-join the overlapping regions with the correct overlap.
Right-click at base 20440 of the consensus and select “Tear contig at this consensus position”.

Because our hypothesis is that the two repeated regions should be collapsed into a single
copy, it does not matter which reads end up in the left (highlighted) contig versus the right
(unhighlighted) contig as long as some reads go in each direction. Hence we will accept the
default selection from Consed. Click on “Do Tear” to create two contigs, a left hand one
(~20 kb) and a right hand one (~80 kb).

Save the assembly (scf7180000301495_190000_290000.ace.10). Use the standard
“Compare Contig” technique to join the two contigs together (this is described in the Consed
exercise “A Complex Drosophila Fosmid”). A brief outline of the procedure is as follows:
Navigate to the far right end of the 20 kb contig and use the sequence found there to perform
a “Search for String” to identify the matching site in the 80 kb contig. Using the “Searching
Contigs” results window, navigate to the match at the end of the 20 kb contig and click on
“Compare Cont”. Return to the “Searching Contigs” results window, navigate to the
match at the beginning of the 80 kb contig and click on “Compare Cont”. Click “Align” on the
Compare Contigs window and examine the alignment. If there are no high quality
discrepancies (as is the case here), click on “Join Contigs”.
Save the assembly (scf7180000301495_190000_290000.ace.11). Inspection of the region
after the above tear and join shows that the gap originally at 20432-20456 is no longer
present. The resulting assembly is much better, with no HQD’s.

The PCR/Sanger pipeline
Some of the gaps in the D. biarmipes projects are genuine and require additional
sequencing data. For gaps of this second type new sequence data must be generated to fill
in the missing bases. Because there is no template available for sequencing, new data will
need to be generated by first generating template with PCR prior to sequencing with
Sanger. To practice this technique, quit and then restart Consed and open the ace file
scf7180000301495_190000_290000.ace.8 (which still has the gap in the consensus
sequence). While finishers should not design primers to cover low consensus quality
regions, they should design PCR primers that span any unresolvable gaps. These primers
can be used in an attempt to generate the necessary data to close the gap. Consed has a PCR
primer picking protocol that will generate a list of possible PCR primer pairs: go to the gap
(consensus position 20432), right-click on the consensus position just to the left of the gap
(at position 20426) and select “Pick --> (Top Strand) First PCR Primer”; dismiss the
information dialog box.
Right-click on the consensus just to the right of the gap (at position 20462) and select
“Pick <-- (Bottom Strand) Second PCR Primer”.

The criteria for picking among the suggested PCR primers are discussed in detail in the
“PCR Primer Selection Guide” document. Finishers working slowly and carefully should
select a single pair to attempt PCR/Sanger. Finishers who are constrained for time should
attempt to pick two different sets of PCR primers for each problem area such that both
forward primers are compatible with both reverse primers. In general, the size of the PCR
amplicons should be kept less than 1000 bases if possible, and should be kept as small as
practical in all cases. If all the possible sizes are larger than 1250 bases, then at least one of
the two primers MUST BE within 350 bp of the region where new data is needed. If you
cannot find primer pairs that satisfy either the “1000 bp size limit” or the “one end within
350 bases of end” rule then you will also need to create a third sequencing primer that is
close enough to the problem area to use to prime the Sanger sequencing reaction.

In this practice case, we will carefully examine the list of PCR primers suggested by Consed
and attempt to find 4 primers (2 left and 2 right) in which all 4 combinations are on the list.
In addition, we will attempt to have at least two of the primer combinations have an
estimated product size of less than 1000 bases. Finishers may find it helpful to draw the
region and the relative positions of the suggested primers to assist in picking the best
possible sets of primer pairs. If you cannot find two sets of primer pairs that will produce a
product size of less than 1000 bases, then continue to screen the primers on the list and
attempt to keep the PCR products as small as possible.

In this instance there are only 8 pairs in which the distance between the primers is less
than 1000 bases. Furthermore, the 8 primer pairs consist of only two unique forward
primers: 20034-20055 and 19588-19617.
Similarly, there are three unique reverse primers among the first 8 primer pairs suggested
by Consed: 20543-20568, 20871-20890 and 20950-20969. Note that primers that have
substantial overlap with each other (e.g. 20871-20890 versus 20871-20891) are not
considered to be “unique” in this context.

Because there are only two unique left hand primers and we want to design two sets of
primer pairs, we will pick one pair with the left hand primer at 20034-20055 (i.e. pairs 1-6)
and the other pair with the left hand primer at 19588-19617 (i.e. pairs 7-8). Inspection of
the primer coordinates shows that pairs 7 and 8 are only trivially different on the right
primer; to choose between pairs 7 and 8 we will pick the pair with the smallest difference
in melting temperature (Tm) between the left and right primers (i.e. pair 8).

Based on our search criteria, the second left hand PCR primer must come from one of the
first 6 primer pairs suggested by Consed. We can eliminate the first two primer pairs because
they have the “same” right hand primer as primer pair 8. This leaves pairs 3-6 which have 2
possible right hand primers at 20871-20890 and 20950-20969, respectively.

To determine which right hand primer we should use, we can check to see if either of these
two primers is listed in combination with our left hand primer in pair 8. Note that pairs 13
and 14 have the left hand primer at 19588-19617 and the right hand primer at
20871-20890; this would indicate that primer pairs 3 and 4 are better than primer pairs 5 and 6.
Because the right hand primers for primer pairs 3 and 4 are essentially the same, we will
again pick the primer pair based on the closest Tm, which would lead us to pick primer
pair 4 over primer pair 3.

Collectively, we have selected two left hand primers (20034-20055 and 19588-19617) and
two right hand primers (20871-20891 and 20544-20568) in which all four pairwise
combinations are found on the list (pairs 1, 4, 8 and 13). In addition, 3 of the 4
combinations produce PCR products that are less than 1000 bp in size.
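The selection logic above (two forward and two reverse primers whose four pairings all
appear on Consed's list) can be sketched in Python. This is an illustration, not a Consed
feature, and the function names are hypothetical. The example list holds only the primer
coordinates quoted in the text, with trivially different primers treated as the same; the
amplicon size is assumed to run from the first base of the forward primer to the last base
of the reverse primer, and Tm is not modeled.

```python
from itertools import combinations, product


def amplicon_size(forward, reverse):
    """Bases from the start of the forward primer to the end of the reverse primer."""
    return reverse[1] - forward[0] + 1


def pick_primer_sets(pairs, max_size=1000):
    """Find 2 forward x 2 reverse primers whose four pairings are all in `pairs`.

    Returns (number of pairings under max_size, forwards, reverses), best first.
    """
    listed = set(pairs)
    forwards = sorted({f for f, _ in pairs})
    reverses = sorted({r for _, r in pairs})
    candidates = []
    for f_pair in combinations(forwards, 2):
        for r_pair in combinations(reverses, 2):
            combos = list(product(f_pair, r_pair))
            if all(c in listed for c in combos):
                small = sum(amplicon_size(f, r) < max_size for f, r in combos)
                candidates.append((small, f_pair, r_pair))
    return sorted(candidates, reverse=True)


# Pairings quoted in the walkthrough (start, end coordinates of each primer).
pairs = [
    ((20034, 20055), (20544, 20568)),
    ((20034, 20055), (20871, 20891)),
    ((19588, 19617), (20544, 20568)),
    ((19588, 19617), (20871, 20891)),
    ((20034, 20055), (20950, 20969)),  # listed only with this forward primer
]
for small, fwd, rev in pick_primer_sets(pairs):
    print(small, fwd, rev)
```

With these coordinates the only complete set uses the reverse primers at 20544-20568 and
20871-20891, and three of its four products are under 1000 bases, in agreement with the
selection made above.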
Select one of the primer pairs and click “Accept Pair”. Repeat the primer design procedure
to regenerate the primer pair list and select the second primer pair. After accepting both
primer pairs, save the assembly (scf7180000301495_190000_290000.ace.12).

During the primer selection process, if you find multiple primers that you cannot decide
between, you can perform a BLASTN search against the D. biarmipes whole genome
assembly to screen for off-target priming. Using those BLASTN results, you can select the
primer that minimizes the probability of off-target priming. See the “PCR Primer Selection
Guide” for a detailed description of the search protocol.

PCR/Sanger for Low Consensus Quality
For the D. biarmipes projects, almost all the low consensus quality (LCQ) regions will
either be associated with the ends of the project or be located adjacent to a gap. As described
above, the finisher can ignore LCQ regions within the first and last 2.5 kb of the project. In
addition, improving a gap will also improve any surrounding LCQ regions. Because of the
high read coverage from 454 and Illumina reads, we expect that genuine LCQ regions will
be extremely rare. Furthermore, many of the LCQ regions may be of acceptable quality and
could be resolved by manual inspection and editing. If you find any true LCQ regions, you
should seek assistance from GEP staff members to determine if PCR/Sanger is necessary. If
the region is approved for PCR/Sanger, you can apply the same rapid protocol as above for
ordering primers for gaps (i.e. order 4 primers such that both forward primers are
compatible with both reverse primers).
(Optional analysis) Identifying putative polymorphisms
An optional objective for these projects is to identify and tag regions with putative
polymorphisms. Preliminary analysis suggests that, at least for D. biarmipes, the frequency
of polymorphisms is extremely low. In general, we expect each of the two polymorphic
sequences to be represented in at least 40% of all the reads. You can determine the
frequency of each allele using the percentages shown in the Highly Discrepant Positions
navigator or by using the Misc -> Depth of coverage at Cursor menu item.

There is an alternate technique to count reads that may be more useful when looking for
reads with sequences longer than a single base. We will use this technique to consider the
reads that align to the consensus position 35,947. This position consists of a mix of G’s and
T’s. Determining the number of reads that have a T at this position requires a two-step
procedure. First, unhighlight all the reads in the project by selecting the “Highlight ->
Unhighlight All Reads in All Contigs” option.
Once all highlights have been removed, select Highlight -> Highlight Reads with String at
Cursor. This will open a dialog box where we can enter the allele we would like to search for.
In this example, enter a T in the search box and click “Ok”. A sequence of any length can be
entered into this box. Any read that matches the sequence starting at the position of the
cursor will be highlighted.

In this case all the read names with a T at that position will be highlighted and you will get
an info box stating the number of reads (in this case 25) that matched the T.

Repeat the two-step technique described above to count the number of G’s at this position.

Our analysis shows that, of the 100 reads at this position, 25 reads (25%) have a T while 75
reads (75%) have a G. Given this result, this location does not satisfy the 40% minimum and
we would not add a “polymorphism” tag to this location.
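The decision above is simple arithmetic on the two read counts. The Python sketch below
restates it; the function names are illustrative, and reading the 40% guideline as “each
of two alleles must reach 40% of the reads” is an interpretation of the text.

```python
def allele_fractions(counts):
    """Convert per-allele read counts into fractions of all reads at the position."""
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_putative_polymorphism(counts, min_fraction=0.40):
    """True if at least two alleles are each present in min_fraction of the reads."""
    fractions = allele_fractions(counts)
    return sum(f >= min_fraction for f in fractions.values()) >= 2


counts = {"T": 25, "G": 75}  # read counts at consensus position 35,947
print(allele_fractions(counts))          # {'T': 0.25, 'G': 0.75}
print(is_putative_polymorphism(counts))  # False: the T allele is below 40%
```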