GEP Hybrid Assembly Walkthrough
Last Update: 5/19/2016

Developed by Christopher Shaffer, with input from GEP members Don Paetkau, Michael Rubin, Laura Reed

Prerequisites
Consed version 25 or higher
Familiarity with Consed (for example, prior training with “Using Consed Graphically” and the “Drosophila Finishing Problem Set”)

Files for this Exercise
Project scf7180000301495_190000_290000

Introduction
This walkthrough will illustrate the techniques for consensus error correction as well as closing gaps using the Drosophila biarmipes project scf7180000301495_190000_290000. Note that this walkthrough assumes the reader is already familiar with Consed, and the exact details on how to accomplish many of the tasks are not given. Many of the figures will not match exactly the images obtained by the user (even if they follow the protocol exactly). Users of this walkthrough are expected to have sufficient experience to interpret any differences and determine whether they are significant or trivial. As such, users of this walkthrough should be very familiar with the techniques covered in the “Using Consed Graphically” walkthrough and the “Drosophila Finishing Problem Set” exercise (available on the GEP website).

Setup
Launch X11 and open a new xterm; navigate to the edit_dir of the D. biarmipes project scf7180000301495_190000_290000 (e.g. cd scf7180000301495_190000_290000/edit_dir). Enter consed & at the xterm prompt. The “&” will keep your terminal active in case you need to use it later. Open scf7180000301495_190000_290000.ace.3. Select “No” if a prompt appears that asks if you would like to apply edits from the edit history (.wrk) file.

When improving hybrid assemblies, we will use custom settings that differ from the default Consed settings. You will need to verify these settings (and change them as necessary) each time you launch Consed.

First, in the Main Window, select “Options -> General Preferences”. Check that the “Threshold for Low Consensus Quality (highest low)” is set to 25, and the “Threshold for High Quality Discrepancy (lowest high)” is set to 30.
Now double click on any contig (e.g. scf7180000301495:190000-290000 in this project) to open an Aligned Reads window. In the “Dim” menu, verify that the Dim option is set to “Dim Nothing”. By default Consed will dim the unaligned regions at the ends of reads. This makes a lot of sense when working with Sanger reads that often have vector sequence at the end, but it does not make sense when working with Illumina reads. In Illumina reads ALL data is relevant and should not be given a black background, hence the “Dim Nothing” setting.

In the Sort menu, click on the item “Sort Options and Help”. In the dialog box that appears, change the “Display reads sorted alphabetically or by strand/left read end?” field to “Strand/Left End”. Change the “When you click on the consensus, how do you want reads sorted at cursor position” field to “by base”. Be aware that this setting means that the screenshots in this walkthrough may not exactly match the image on your computer screen. However, it does make it possible to work more quickly and efficiently when doing actual finishing, so these settings were used throughout the walkthrough.

Assembly view
This project has a single contig of 100 kb. Click on the “Assembly View” button on the Consed Main Window to open Assembly View. Run cross_match to detect sequence matches within this contig and identify regions with a high density of discrepant read pairs.

In this project, there is a cluster of 7 inconsistent mate pairs where one member of the mate pair is placed at around 13 kb while the other member is placed at around 51 kb. The number of discrepant reads of this type will vary with each project. Given the typically very high levels of coverage, it is not necessary to remove these inconsistent reads, but removal is recommended as it will likely reduce the size of the list of high quality discrepancies (HQD’s) that must be analyzed.

As an example of how to remove reads of this type, click on the cluster of red lines that spans from 13 kb to 51 kb of the contig. This will bring up a list of all the discrepant reads in that group. Note how most pairs are mapping in an inconsistent orientation (i.e. the paired end reads are pointing away from each other); hence, both distance and orientation are evidence that these reads are improperly mapped.
Click the “Pull Out Reads” button. In the next window select all the reads and click the “Remove Highlighted Reads” button. This will put each read into its own contig (Contig290001 through Contig290014). As with any procedure that changes the assembly (i.e. moves any reads into or out of a contig or makes any tears or joins), you must save the assembly before making any additional changes to the assembly. The project was saved as scf7180000301495_190000_290000.ace.4, so if you think you have made a mistake you may quit Consed and load this ace file. Throughout the walkthrough other ace file names will be given; they can be loaded to set the state of the project before continuing.

Navigate to Assembly View and notice another set of inconsistent mate pairs that spans from the beginning of the main contig to the region at around 53 kb. These are also inconsistent because they are in the incorrect relative orientation and because they are too far apart from each other. Pull out these inconsistent reads from the main contig using the protocol described above (i.e. click on red lines, click to pull out reads, select all reads, remove highlighted reads, save assembly).

The set of reads that extends from 36k to 42k (red arrow in figure above) contains only two reads and for the purposes of this walkthrough will not be removed. Finishers are free to develop their own policy in regard to the number of reads in a cluster that would justify their removal from the main contig. The last set of inconsistent reads is a small cluster around 36k (green arrow). You may need to “Zoom In” to see these. There are sufficient reads here that, if this were a real project, their removal would be recommended. However, for the purposes of this walkthrough they were left in the project.

Before continuing with the rest of this walkthrough, readers who wish to have an exact match to the screenshots shown in this walkthrough may wish to quit Consed, restart and open scf7180000301495_190000_290000.ace.5. (Remember to change the Consed settings described in the Setup section after you open the ace file.)
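The two criteria used above to call a mate pair inconsistent (relative orientation and distance) can be sketched in Python. This is only an illustration, not how Consed evaluates pairs: the `MappedRead` layout and the `MAX_INSERT` cutoff are hypothetical, and the sketch assumes a library in which properly placed mates point toward each other.

```python
from typing import NamedTuple

MAX_INSERT = 5000  # hypothetical cutoff in bases; not a Consed or GEP value


class MappedRead(NamedTuple):
    start: int      # leftmost consensus position of the read
    end: int        # rightmost consensus position of the read
    forward: bool   # True if the read maps to the top strand


def pair_problems(a: MappedRead, b: MappedRead, max_insert: int = MAX_INSERT) -> list[str]:
    """Return the reasons (if any) that a mate pair looks inconsistent."""
    left, right = sorted((a, b), key=lambda r: r.start)
    problems = []
    # Assumed convention: the left read is on the top strand and the right
    # read on the bottom strand, so the two mates point toward each other.
    if not (left.forward and not right.forward):
        problems.append("orientation")
    if right.end - left.start > max_insert:
        problems.append("distance")
    return problems
```

A pair placed at about 13 kb and 51 kb with the mates pointing away from each other would be reported with both reasons, matching the reasoning in the text.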
Resolving base errors at mononucleotide runs
The primary goal of the GEP sequence improvement project is to correct consensus errors within mononucleotide runs (MNR). The secondary goal of the sequence improvement project is to see if there is sufficient Illumina data to close gaps. Optionally, finishers may design primers to carry out PCR/Sanger to add new data to the project. This can be used to resolve gaps and regions with low consensus quality (discussed below). You should talk to your mentor about primer design and Sanger sequencing; this will only be done in special cases if you are doing a wet lab component to your finishing project.

This walkthrough will address the primary goal of resolving the MNR regions before addressing gaps and low consensus quality regions. However, because of the amount of time required to produce additional sequencing data, if you are planning on doing PCR/Sanger we recommend that you begin with primer design. You can work on correcting errors in MNR regions while waiting for the results from your Sanger reactions.

In the “Contig List” section of the Consed Main Window, you will see the project now has a long list of the individual reads (we removed them from the main contig above). The main contig we are working on is at the bottom of the list (scf7180000301495:190000-290000). Scroll down to find the main contig in the “Contig List” and double click on the main contig to open it in an Aligned Reads window. Alternatively, open the Aligned Reads window directly from the Assembly View.

Depending on your settings you may see that the base numbering system starts with 189,660; this is a new feature available in Consed v25 that allows for proper numbering when a project is actually a subsection of a larger scaffold. This project is derived from a portion of the genomic scaffold scf7180000301495 in the D. biarmipes whole genome assembly and the numbering here indicates the location within that scaffold.
You can change the numbering system so that base 1 is the first base in the project. In the Aligned Reads window, click on the Misc menu and select “Turn On/Off User-Defined Consensus Scale Numbers”. This will temporarily change the numbering scheme for as long as Consed is running; if you quit and restart Consed the numbering system will change back. If you wish to permanently change the numbering scheme, you can delete the “startNumberingConsensus” consensus tag at the beginning of the contig. Click on the “<<<” button to navigate to the start of the sequence in this project.

To screen for base errors, return to the Consed Main Window and select Navigate -> “Search for Highly Discrepant Positions”. Given the high read depth and the large number of reads that are improperly placed in these assemblies, sites with one or two HQD’s are quite common and seldom indicate a genuine problem with the consensus. As an initial screen for problem areas in the main contig, we will set the “minimum # of discrepant reads” field to 3. This focuses the list on discrepant regions where a consensus error might actually exist. To focus on regions where the Illumina data does not support the consensus, set the “ignore bases below this quality” field to 30. Click “Search”.

The resulting list will have many regions where the consensus is correct even though there are 3 high quality reads that disagree with the consensus. In many cases, the discrepancies can be attributed to errors in the 454 reads that show a different number of bases compared to both the Illumina reads and the consensus. (You can use the read name to distinguish 454 reads from Illumina reads; Illumina reads have the prefix “USI-”, 454 reads will start with a “G”.) Some of the other discrepant regions on the list are caused by reads that have been incorrectly placed in the assembly by the mapping program (e.g. because of large transposons or other repetitive regions in the genome).
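The screen described above (at least 3 discrepant reads, ignoring bases below quality 30) can be mimicked for a single alignment column. The following is a minimal sketch, not Consed's implementation; it assumes each read contributes a hypothetical `(read_name, base, quality)` tuple, and the platform guess relies only on the read-name convention stated in the text.

```python
def count_hqd(column, consensus_base, min_quality=30):
    """Count reads in one alignment column that disagree with the consensus
    at or above the quality cutoff (high quality discrepancies)."""
    return sum(
        1
        for _name, base, quality in column
        if quality >= min_quality and base.upper() != consensus_base.upper()
    )


def is_flagged(column, consensus_base, min_discrepant=3, min_quality=30):
    """True if the column would pass the 'minimum # of discrepant reads' screen."""
    return count_hqd(column, consensus_base, min_quality) >= min_discrepant


def platform(read_name):
    """Guess the sequencing platform from the read-name convention of this project."""
    if read_name.startswith("USI-"):
        return "Illumina"
    if read_name.startswith("G"):
        return "454"
    return "unknown"
```

Grouping the discrepant reads by `platform` is one way to see whether a flagged column is driven only by 454 reads, which the text identifies as a common source of spurious entries.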
The basic strategy is to navigate through each item in the Highly Discrepant Regions list and look for discrepant regions that are associated with a MNR. When you find a region associated with a MNR (within 5 bases), you should inspect the region carefully and either confirm or edit the consensus based on the available evidence. Note that because of the known weaknesses of the 454 sequencing technology in resolving the correct number of bases in long MNR’s, finishers should rely on the Illumina data when determining length.

Since adjacent GEP sequence improvement projects overlap with each other, the finishers can ignore the discrepancies that are found within 2500 bases of the ends of the project (because these problem areas will be resolved by the finishers working on the adjacent projects). Consequently, we will ignore the initial highly discrepant regions and examine the region at base position 2719. Inspection of that region in the Aligned Reads window shows that the discrepant position (i.e. 2719) is not near a MNR. Because we are sorting “by base”, the four reads with a discrepant C (3 of which are high quality) are listed above all the reads with the T.

This discrepancy may be due to a base calling error or mis-mapping of the read, or it is possible that this site is polymorphic. Regardless of why this discrepancy exists, there is insufficient evidence to support the hypothesis that the consensus is incorrect: there are 70 reads here (with quality score 30 or above) that agree with the consensus and show a T and only 3 high quality reads that show a C.

In fact, given that the discrepant position is NOT part of the nearby MNR, we could simply move on to the next region with 3+ HQD’s. However, if you examine these reads carefully, you will find additional discrepancies further downstream (e.g. at 2799 of the consensus). This makes it very likely that these reads actually belong somewhere else in the genome and it strengthens the argument that the consensus should remain a T. Given the high frequency of mis-mapped reads, discrepancies of this type will not be examined carefully.
Instead finishers should focus their efforts exclusively on discrepant regions associated with MNR’s. Click on the “Next” button on the Aligned Reads window to navigate to the next highly discrepant region. Continue to click “Next” until you navigate to a discrepant region that is associated with a MNR. The first discrepant position within a MNR is at 5800.

The sorting “by base” option has placed all the reads with the pad (*) at the top of the Aligned Reads window; below these are all the reads with a T at this position. There are quite a few 454 reads in this region that show fewer T’s than are in the consensus (i.e. 16 of the 454 reads show a pad). Use the scroll bar on the right to scroll down and examine the Illumina reads. All the high quality Illumina reads that aligned to this region agree with the consensus and show 7 T’s. (Illumina reads start with the prefix USI-; 454 read names are shorter and start with G.) Because all the Illumina reads agree with the consensus, there is insufficient evidence to change the consensus and we will keep the consensus at 7 T’s and move on.

Click “Next” in the Aligned Reads window to navigate to the next discrepant region. The next location on the HQD list that shows an interesting discrepancy is located at 6055. Examination of the region shows five 454 reads that have a 7 bp insertion (TCTCATT) compared to the consensus (if you can only find 3 discrepant reads, make sure “Dim Nothing” is selected and look again). Again the preponderance of evidence suggests that the consensus is correct (i.e. there are ~90 reads covering this region, 85 of which are high quality reads that agree with the consensus, while only a few disagree with the consensus). The skew in the distribution of reads that disagree with the consensus (3%) compared to the percentage of reads that agree with the consensus (97%) makes it extremely unlikely for the discrepancy to be caused by a polymorphic site.

The next discrepant MNR site with 3 or more HQD’s is at 7557. It is associated with a MNR of 4 T’s. However, there are 52 (high quality) reads with a T at this position and only 3 (high quality) reads with a C, so again there is insufficient evidence to change the consensus.
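The triage rules used so far (skip discrepancies within 2500 bases of either end of the project, and inspect only those within 5 bases of a MNR) can be written down as a small filter. This is a sketch under assumptions: the consensus is taken as an unpadded string with 1-based positions, and the minimum run length of 4 is a guess based on the 4-T example at 7557, since the walkthrough does not state a minimum for this step.

```python
import re

END_MARGIN = 2500   # bases at each project end left to the adjacent projects
MNR_WINDOW = 5      # a discrepancy within this many bases of a MNR is inspected


def mnr_intervals(consensus, min_run=4):
    """Return 1-based inclusive (start, end) intervals of single-base runs.
    min_run=4 is an assumption, not a value fixed by the walkthrough."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1), re.IGNORECASE)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(consensus)]


def should_inspect(pos, consensus, min_run=4):
    """Apply both triage rules to a 1-based discrepant consensus position."""
    if pos <= END_MARGIN or pos > len(consensus) - END_MARGIN:
        return False
    return any(
        start - MNR_WINDOW <= pos <= end + MNR_WINDOW
        for start, end in mnr_intervals(consensus, min_run)
    )
```

A filter like this only decides where to look; whether the consensus should change still depends on inspecting the Illumina reads as described above.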
Click on the “Next” button in the Aligned Reads window to navigate to the next discrepant region. There are several other regions that are associated with MNR’s downstream, but in each of these cases the consensus is fine; no changes are justified by discrepant reads.

The next interesting region is located at 9651. This region is an example of a problematic alignment which necessitates careful examination. Here the consensus sequence (ignoring pads) is AAATGAGAAAAAAACATAT. When you scroll down in the Aligned Reads window, you will find many base discrepancies in this region. However, no one position is discrepant across all the high quality Illumina reads. Note how some of the high quality Illumina reads show a discrepant A which differs from the pad just to the right of base 9651. Other high quality Illumina reads have a (consistent) pad at that position (e.g. USI-EAS376_0023:6:17:19462:9201_1).

In this region, careful examination reveals that all of the properly mapped high quality Illumina reads have 8 A’s irrespective of how they were aligned to the consensus. Hence the available evidence suggests we should change the consensus from 7 A’s to 8 A’s.

Inconsistent alignments of this type are a common issue with Consed. This is particularly true for regions with long MNR’s or many smaller MNR’s. For this reason it is very important to inspect the discrepant regions carefully, counting the number of bases in each read if necessary, to determine the consensus sequence.

If they prefer, finishers can use the standard technique of opening a trace window for one of the Illumina reads and using change consensus to make the proper edit. However, unlike previous versions of Consed, Consed 25 allows users to edit the consensus sequence directly. To add an extra A to the consensus using this technique, click on one of the pads (*) adjacent to the 7 A’s in the consensus. Hit the A key to change the consensus from a pad to an a. Note that all edits to the consensus are kept as lower case (i.e. low quality edit) by Consed.

After editing the consensus save the assembly (scf7180000301495_190000_290000.ace.6).
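Counting the bases in each read, as recommended above, is easy to get wrong by eye when pads fall in different places. A minimal Python sketch of the count is shown below; it assumes you have copied the padded read or consensus text for the region into a string, with `*` as the pad character, and the function names are illustrative.

```python
from itertools import groupby


def run_lengths(padded_seq):
    """Run-length encode a read or consensus string after dropping pads (*)."""
    bases = padded_seq.replace("*", "").upper()
    return [(base, len(list(group))) for base, group in groupby(bases)]


def longest_run(padded_seq, base):
    """Length of the longest run of `base`, ignoring pads and letter case."""
    return max((n for b, n in run_lengths(padded_seq) if b == base.upper()), default=0)


# Consensus at 9651 as described in the text: 7 A's in the long run.
consensus = "AAATGAG" + "A" * 7 + "CATAT"
# Two hypothetical reads with the same 8 A's but different pad placement.
read_a = "AAATGAG" + "AAAA*AAAA" + "CATAT"
read_b = "AAATGAG*" + "A" * 8 + "CATAT"
print(longest_run(consensus, "A"), longest_run(read_a, "A"), longest_run(read_b, "A"))
```

Because pads are removed before counting, both reads report the same run length even though they would show discrepancies at different consensus positions.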
For practice, you can continue to assess regions on the HQD list that are associated with MNR’s and correct the consensus sequence where a correction is supported by the Illumina data. There are several additional regions in this project that require corrections to the consensus sequence but they will not be discussed in this walkthrough. The next region discussed will be the region around the long A MNR at 51,425. This region is a good example of a very complex region where the reads must be analyzed one by one in order to resolve the consensus.

This region has 2 MNR’s of A’s that are separated by 2 T’s. Regions with multiple MNR’s are particularly error prone for 454 sequencing and are also more likely to have been misaligned by Consed. If some of the reads have a completely black background at the end of the read, be sure to use the Dim menu to select “Dim Nothing”.

A careful examination of this region shows that many of the red discrepant bases in this region can be attributed to misalignments and are not genuine discrepancies in the lengths of the MNR’s. The consensus has a sequence of GGGAAAAATTAAAAAAAAAAATTAT, with the critical issue being how many A’s are in the two MNR’s. Click on the middle A at position 51,425; this should bring all the Illumina reads to the top. Irrespective of where they are sorted or how they were aligned, careful examination shows that 7 of the Illumina reads (highlighted in the figure below) have the sequence 3 G’s followed by 6 A’s followed by 2 T’s then 11 A’s followed by TTAT (i.e. GGGAAAAAATTAAAAAAAAAAATTAT).
Of the reads that do not match, the read whose name ends with :3596_2 (top read in the figure above) does not cover the entire region being discussed and is therefore uninformative. The read whose name ends with :6277_1 only disagrees with this sequence at positions that have very low quality. In contrast, the reads whose names end with :12171_1 and :5958_1 contain many HQD’s compared to the consensus throughout their entire length. Given that this region has a blue repeat tag, it is very likely that these reads have been improperly mapped to this region and the differences indicated by these reads are unreliable. The read whose name ends with :9617_1 is very low quality for most of this region and the evidence it provides would not be sufficient to overrule the 7 high quality reads that are all consistent with each other.

Collectively, our analysis of the high quality Illumina reads that aligned to this region suggests that the correct consensus sequence is GGGAAAAAATTAAAAAAAAAAATTAT. Edit the consensus, adding an additional A to each mononucleotide run, and save the assembly (scf7180000301495_190000_290000.ace.7).

There are other regions further downstream in this contig with at least 3 HQD’s and where most of the Illumina reads disagree with the consensus. Analysis of these regions is left as an exercise for the reader.

Checking areas with fewer than 40 reads
The protocol for resolving MNR’s described above assumes that the discrepant region contains many reads. Setting the minimum of at least 3 HQD’s allows us to filter out many locations with mis-mapped reads. However, we will need to use an alternate approach to check for errors in MNR’s found in regions with low coverage. A finisher must navigate to these regions and carefully inspect them for any potential errors in the consensus sequence.

To search for regions with low read coverage, use the “Main Window -> Navigate -> Search for High (low) Depth of Coverage” menu. In the “Navigate by High (or Low) Depth of Coverage” window, uncheck the “show high depth (not low depth)” field to search for regions with low read coverage. Set the “ignore read bases below this quality” field to 10 and then set the “max (for low depth regions) depth of coverage” field to 40. Click Search.
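The low depth search configured above amounts to scanning per-base coverage for stretches at or below a cutoff. The sketch below illustrates that idea in Python; it is not Consed's implementation. It assumes a hypothetical list `depths` in which each entry already counts only read bases of quality 10 or higher, and it treats the cutoff of 40 as inclusive, which is an assumption about how the “max” field behaves.

```python
def low_depth_regions(depths, max_depth=40):
    """Return 1-based inclusive (start, end) regions where depth <= max_depth.

    depths[i] is the number of reads covering consensus position i + 1.
    """
    regions, start = [], None
    for pos, depth in enumerate(depths, start=1):
        if depth <= max_depth:
            if start is None:
                start = pos          # a new low depth region begins here
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:            # region runs to the end of the contig
        regions.append((start, len(depths)))
    return regions
```

Each reported region can then be checked for MNR’s, as the following paragraphs describe for the region at 8465 to 8800.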
The “Low Depth of Coverage Regions” navigator window will appear and you can use this to quickly navigate to the areas with low read coverage. The entries at the beginning of the list correspond to the single reads that we have previously removed from the main contig. Scroll down until you reach the entries that correspond to your main contig (i.e. scf7180000301495:190000-290000); remember that the first and the last 2.5 kb do not need to be finished.

The first low coverage region we need to inspect spans from 8465 to 8800. Search for any sequence within this region with a MNR of 5 or more. This can be accomplished by either manual inspection (easily done if the region is small) or by using “Search for String”. Be sure to search with both GGGGG (which will find both GGGGG and CCCCC) and AAAAA (which will find both AAAAA and TTTTT). Searching with GGGGG shows no MNR of 5 or longer within this low coverage region. Also, remember that when searching for a string, regions that match the complement of the query sequence (i.e. locations with CCCCC) will be listed AFTER all the locations that matched the uncomplemented sequence (i.e. GGGGG). Hence you will need to scroll down to ascertain if any CCCCC MNR’s are found between 8465-8800. (In this case, there are no MNR’s of CCCCC in this low coverage region.) To avoid this scrolling issue finishers may wish to run 4 different MNR searches and cross-reference all 4 lists when looking for overlapping regions.

Searching with AAAAA shows 3 regions between 8465-8800 that should be carefully inspected: 8593-8597, 8653-8657, and 8741-8745. Examination of the three regions shows no discrepancies or inconsistencies that would require changes to the consensus sequence. Scroll to the “complemented” section of the “Searching Contigs” list to check for any TTTTT MNR’s in this region (there are no TTTTT MNR’s that overlap with the low coverage region at 8465-8800).

According to the GEP sequence improvement standard for the hybrid assemblies, each consensus position must be supported by a minimum of two reads that agree with each other and are each of sufficient quality. These reads can be either 454 or Illumina reads (or a combination of both).
“Sufficient” quality in this context means that the read position has a phred (quality) score of at least 20 in each read and that the finisher is confident that the read has been mapped to the correct region. To have high confidence that the read is properly mapped, it should have no more than one HQD anywhere in the read compared to the final consensus. If there is only a single high quality read that supports the consensus, then the finisher will need to add a “data needed” tag to the region and highlight the presence of an unresolved low quality region in the final finishing report form. If you are implementing a wet lab PCR/Sanger pipeline you may be asked to cover the region to resolve the issue; check with your mentor. If you are not implementing your own PCR/Sanger experiments, you should NOT design primers for these low consensus quality “data needed” regions.

For practice you may continue to work through the regions in the low coverage list by manually inspecting these low coverage regions and cross-referencing them with the MNR “Search for String” lists. Examine each location to confirm the consensus in these regions.

The next low coverage region discussed here is located at 94415-94757. Within this region is a MNR of 10 T’s in the consensus starting at base 94619. Yet again there are alignment problems that make this area difficult to analyze.

While the consensus has 10 T’s, careful inspection shows that the 3 high quality Illumina reads (highlighted in the figure above) all have 11 T’s. Note how, because of the vagaries of the alignment algorithm used by Consed, no consensus position has 3 HQD’s. Two of the reads are discrepant with the consensus A at 94618. The other read is discrepant with the pad just after consensus position 94628. We will need to verify that these 3 reads contain no more than one HQD outside of this MNR to be confident that they have been mapped correctly to this region. Given that Illumina reads are all relatively short, this can be done simply by manual inspection. To check this computationally, pull up the list of all HQD’s (Navigate -> High quality (>= 30) discrepancies, > 5 bp from unaligned region). Scroll down to the region of interest at ~94,000 and note which of the reads contain HQD’s.
The HQD list shows that each of these three reads (that have an extra T compared to the consensus) only has a single HQD. In each case, the only HQD listed for the read is the HQD that is associated with the incorrect length of the MNR in the consensus. Consequently, we can be confident that these three discrepant reads have been mapped to the correct region and we can correct the consensus using these reads. Correct the consensus (11 T’s) and save the assembly (scf7180000301495_190000_290000.ace.8).

Resolving gaps and low consensus quality regions
Because resolving gaps and low consensus quality regions sometimes requires additional data from PCR/Sanger sequencing reactions, we recommend that finishers doing the optional PCR/Sanger pipeline begin their sequence improvement project by resolving gaps. The primary goal of resolving base errors in MNR’s can be worked on while waiting for the PCR/Sanger results. To determine if your project has any gaps, simply use “Search for String” and search for “nnnnn” in the consensus.

These results show n’s in the region around 20445 plus or minus about 13 bases. Double click on the first match (at 20432-20436) to navigate to this region. Notice that the gap region actually spans from 20432-20458.

Because of how the projects are constructed (i.e. mapping 454 and Illumina reads against the published consensus sequence), many of the projects contain small gaps that can be resolved without additional data. In this example, there are multiple high-quality Illumina reads that span the entire gap. In addition, note how the bases that are aligning to the n’s in the consensus actually match the sequence adjacent to the gap. This suggests that there is no missing data. Finishers may be able to detect the overlap by visual inspection or they can use the “Search for String” functionality in Consed to search for overlapping regions.

In this example, we will use the read USI-EAS376_0023:6:66:17825:11588_2 (which extends into and beyond the gap) to help resolve this region. The last bases in the consensus on the left side of the gap are TTTTTGGga; select the bases in read USI-EAS376_0023:6:66:17825:11588_2 immediately following this sequence to the end of the read and perform a Search for String.
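The gap search described earlier in this section (looking for “nnnnn” in the consensus) returns several short matches inside one gap, which is why the text has to point out the full extent separately. As an illustration only, a few lines of Python can report each gap as a single span. The sketch assumes the unpadded consensus has been exported to a string; `find_gaps` is a hypothetical helper, not a Consed function.

```python
import re


def find_gaps(consensus, min_n=5):
    """Return 1-based inclusive (start, end) spans of runs of at least min_n n's."""
    pattern = re.compile(r"[nN]{%d,}" % min_n)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(consensus)]


# Toy consensus with one 27-base gap, the same length as the gap in the text.
toy = "ACGT" + "n" * 27 + "ACGT"
print(find_gaps(toy))
```

Runs shorter than `min_n` are ignored, mirroring the five-base query used in the walkthrough.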
The results of “Search for String” revealed an exact match that is located just on the right side of the gap (at position 20477-20514). Visual inspection of this region reveals an overlap of a few bases on each side of the gap: ataggaatttttggga is actually repeated on each side of the gap. There are many Illumina reads that support the hypothesis that there is only a single instance of this sequence. Hence we might be able to resolve this gap by performing a force join to collapse the overlapping reads into a single region.

If our hypothesis were correct, then the assembly piece (scf7180000301495:190000-290000) for this project is wrong and contains a misassembly. If the assembly piece read remains in the contig, Consed will prohibit us from closing the gap. Thus, we must first pull out the assembly piece that was used to construct the initial assembly before attempting to reassemble this region. This fake read has the same name as the project: scf7180000301495:190000-290000.

Remove the assembly piece read by right-clicking on the read and selecting “Remove read scf7180000301495:190000-290000 from this contig”. In the “Remove Reads” dialog box that appears make sure that only the “scf7180000301495:190000-290000” read is listed in the “Reads to be removed” box on the right. The default settings are appropriate and do not need to be changed; click “Do it”. A new “Reads Removed” dialog will appear with a note that indicates we must save the assembly. Go back to the Consed Main Window and save the assembly (scf7180000301495_190000_290000.ace.9).

Depending on your project, removing the assembly piece from the main contig could cause the contig to break into multiple smaller contigs. However, in this case, the assembly remained in a single contig. When the assembly piece was removed Consed attempted to calculate a new consensus, replacing the N’s with bases from the reads below.

Perform a Search for String using the overlapping sequence we have identified previously (i.e. ataggaatttttggga) to navigate to the gap region that we are trying to resolve. Because the reads below the consensus are not properly aligned, there are many discrepant red bases in the region.
However, careful visual inspection of the region surrounding the location of the previous n’s (bases 20432-20456) shows that the region remains the same: there are still two copies of the sequence ataggaatttttggga. To resolve the gap and create an assembly with no HQD’s, we will need to tear and re-join the overlapping regions with the correct overlap. Right-click at base 20440 of the consensus and select “Tear contig at this consensus position”. Because our hypothesis is that the two repeated regions should be collapsed into a single copy, it does not matter which reads end up in the left (highlighted) contig versus the right (unhighlighted) contig as long as some reads go in each direction. Hence we will accept the default selection from Consed. Click on “Do Tear” to create two contigs, a left hand one (~20 kb) and a right hand one (~80 kb).

Save the assembly (scf7180000301495_190000_290000.ace.10). Use the standard “Compare Contig” technique to join the two contigs together (this is described in the Consed exercise “A Complex Drosophila Fosmid”). A brief outline of the procedure is as follows: Navigate to the far right end of the 20 kb contig and use the sequence found there to perform a “Search for String” to identify the matching site in the 80 kb contig. Using the “Searching Contigs” results window, navigate to the match at the end of the 20 kb contig and click on “Compare Cont”. Return to the “Searching Contigs” results window, navigate to the match at the beginning of the 80 kb contig and click on “Compare Cont”. Click “Align” on the Compare Contigs window and examine the alignment. If there are no high quality discrepancies (as is the case here), click on “Join Contigs”.

Save the assembly (scf7180000301495_190000_290000.ace.11). Inspection of the region after the above tear and join shows that the gap originally at 20432-20456 is no longer present. The resulting assembly is much better with no HQD’s.
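The idea behind the tear and join above is that a sequence present at the end of the left contig and near the start of the right contig should appear only once after joining. The Python sketch below illustrates that idea on plain strings. It is a simplification and not what “Compare Contigs” does: it requires an exact match over the overlap, whereas Consed aligns the contigs and lets you review discrepancies. The function name and the toy flanking bases are hypothetical.

```python
def join_on_overlap(left, right, probe_len=16):
    """Join two contig sequences so the sequence shared by the end of `left`
    and the start of `right` appears only once.

    Returns None if the probe taken from the end of `left` is not found in
    `right`, or if the bases of `right` before the match disagree with `left`.
    """
    probe = left[-probe_len:]
    hit = right.find(probe)
    if hit == -1:
        return None
    overlap = right[: hit + len(probe)]
    if not left.endswith(overlap):
        return None  # overlapping ends disagree; inspect by hand instead
    return left + right[len(overlap):]


motif = "ataggaatttttggga"                 # repeated sequence from the text
left_contig = "CCGTAC" + motif             # toy flanks, not real project data
right_contig = motif + "GTTACA"
print(join_on_overlap(left_contig, right_contig, probe_len=len(motif)))
```

In the joined toy sequence the motif occurs once, which corresponds to collapsing the two copies that flanked the gap.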
The PCR/Sanger pipeline
Some of the gaps in the D. biarmipes projects are genuine and require additional sequencing data. For gaps of this second type new sequence data must be generated to fill in the missing bases. Because there is no template available for sequencing, new data will need to be generated by first generating template with PCR prior to sequencing with Sanger. To practice this technique, quit and then restart Consed and open the ace file scf7180000301495_190000_290000.ace.8 (which still has the gap in the consensus sequence). While finishers should not design primers to cover low consensus quality regions, they should design PCR primers that span any unresolvable gaps. These primers can be used in an attempt to generate the necessary data to close the gap. Consed has a PCR primer picking protocol that will generate a list of possible PCR primer pairs: go to the gap (consensus position 20432), right-click on the consensus position just to the left of the gap (at position 20426) and select “Pick --> (Top Strand) First PCR Primer”; dismiss the information dialog box.

Right-click on the consensus just to the right of the gap (at position 20462) and select “Pick <-- (Bottom Strand) Second PCR Primer”.

The criteria for picking among the suggested PCR primers are discussed in detail in the “PCR Primer Selection Guide” document. Finishers working slowly and carefully should select a single pair to attempt PCR/Sanger. Finishers who are constrained for time should attempt to pick two different sets of PCR primers for each problem area such that both forward primers are compatible with both reverse primers. In general, the size of the PCR amplicons should be kept less than 1000 bases if possible, and should be kept as small as practical in all cases. If all the possible sizes are larger than 1250 bases, then at least one of the two primers MUST BE within 350 bp of the region where new data is needed. If you cannot find primer pairs that satisfy either the “1000 bp size limit” or the “one end within 350 bases of end” rule then you will also need to create a third sequencing primer that is close enough to the problem area to use to prime the Sanger sequencing reaction.
In this practice case, we will carefully examine the list of PCR primers suggested by Consed and attempt to find 4 primers (2 left and 2 right) in which all 4 combinations are on the list. In addition, we will attempt to have at least two of the primer combinations have an estimated product size of less than 1000 bases. Finishers may find it helpful to draw the region and the relative positions of the suggested primers to assist in picking the best possible sets of primer pairs. If you cannot find two sets of primer pairs that will produce a product size of less than 1000 bases, then continue to screen the primers on the list and attempt to keep the PCR products as small as possible.

In this instance there are only 8 pairs in which the distance between the primers is less than 1000 bases. Furthermore, the 8 primer pairs consist of only two unique forward primers: 20034-20055 and 19588-19617.

Similarly, there are three unique reverse primers among the first 8 primer pairs suggested by Consed: 20543-20568, 20871-20890 and 20950-20969. Note that primers that have substantial overlap with each other (e.g. 20871-20890 versus 20871-20891) are not considered to be “unique” in this context.

Because there are only two unique left hand primers and we want to design two sets of primer pairs, we will pick one pair with the left hand primer at 20034-20055 (i.e. pairs 1-6) and the other pair with the left hand primer at 19588-19617 (i.e. pairs 7-8). Inspection of the primer coordinates shows that pairs 7 and 8 are only trivially different on the right primer; to choose between pairs 7 and 8 we will pick the pair with the smallest difference in melting temperature (Tm) between the left and right primers (i.e. pair 8).

Based on our search criteria, the second left hand PCR primer must be one of the first 6 primer pairs suggested by Consed. We can eliminate the first two primer pairs because they have the “same” right hand primer as primer pair 8. This leaves pairs 3-6 which have 2 possible right hand primers at 20871-20890 and 20950-20969, respectively.
To determine which right hand primer we should use, we can check to see if either of these two primers is listed in combination with our left hand primer in pair 8. Note that pairs 13 and 14 have the left hand primer at 19588-19617 and the right hand primer at 20871-20890; this would indicate that primer pairs 3 and 4 are better than primer pairs 5 and 6. Because the right hand primers for primer pairs 3 and 4 are essentially the same, we will again pick the primer pair based on the closest Tm, which would lead us to pick primer pair 4 over primer pair 3.

Collectively, we have selected two left hand primers (20034-20055 and 19588-19617) and two right hand primers (20871-20891 and 20544-20568) in which all four pairwise combinations are found on the list (pairs 1, 4, 8 and 13). In addition, 3 of the 4 combinations produce PCR products that are less than 1000 bp in size.

Select one of the primer pairs and click “Accept Pair”. Repeat the primer design procedure to regenerate the primer pair list and select the second primer pair. After accepting both primer pairs, save the assembly (scf7180000301495_190000_290000.ace.12).

During the primer selection process, if you find multiple primers that you cannot decide between, you can perform a BLASTN search against the D. biarmipes whole genome assembly to screen for off-target priming. Using those BLASTN results, you can select the primer that minimizes the probability of off-target priming. See the “PCR Primer Selection Guide” for a detailed description of the search protocol.
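The manual search above for 2 left and 2 right primers in which all 4 combinations are on the list can also be expressed as a small search over the suggested pairs. This is a sketch under assumptions: the pair list is entered by hand as hypothetical `(forward, reverse)` labels, near-identical primers are assumed to have been merged into one label beforehand, and product size and Tm are not considered.

```python
from itertools import combinations


def compatible_quads(pairs):
    """Yield (f1, f2, r1, r2) such that all four forward/reverse
    combinations appear in the list of suggested primer pairs."""
    suggested = set(pairs)
    forwards = sorted({f for f, _ in suggested})
    reverses = sorted({r for _, r in suggested})
    for f1, f2 in combinations(forwards, 2):
        for r1, r2 in combinations(reverses, 2):
            if all((f, r) in suggested for f in (f1, f2) for r in (r1, r2)):
                yield f1, f2, r1, r2


# Hypothetical labels standing in for primer coordinates from the Consed list.
suggested_pairs = [("F1", "R1"), ("F1", "R2"), ("F2", "R1"), ("F2", "R2"), ("F1", "R3")]
print(list(compatible_quads(suggested_pairs)))
```

Any quad found this way would still need the size and Tm checks described in the text before primers are ordered.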
PCR/Sanger for Low Consensus Quality
For the D. biarmipes projects, almost all the low consensus quality (LCQ) regions will be either associated with the ends of the project or located adjacent to a gap. As described above, the finisher can ignore LCQ regions within the first and last 2.5 kb of the project. In addition, improving a gap will also improve any surrounding LCQ regions. Because of the high read coverage from 454 and Illumina reads, we expect that genuine LCQ regions will be extremely rare. Furthermore, many of the LCQ regions may be of acceptable quality and could be resolved by manual inspection and editing. If you find any true LCQ regions, you should seek assistance from GEP staff members to determine if PCR/Sanger is necessary. If the region is approved for PCR/Sanger, you can apply the same rapid protocol as above for ordering primers for gaps (i.e. order 4 primers such that both forward primers are compatible with both reverse primers).

(Optional analysis) Identifying putative polymorphisms
An optional objective for these projects is to identify and tag regions with putative polymorphisms. Preliminary analysis suggests that, at least for D. biarmipes, the frequency of polymorphisms is extremely low. In general, we expect the two polymorphic sequences to each be represented in at least 40% of all the reads. You can determine the frequency of each allele using the percentages shown in the Highly Discrepant Positions navigator or by using the Misc -> Depth of coverage at Cursor menu item.

There is an alternate technique to count reads that may be more useful when looking for reads with sequences longer than a single base. We will use this technique to consider the reads that align to the consensus position 35,947. This position consists of a mix of G’s and T’s. Determining the number of reads that have a T at this position requires a two-step procedure. First, unhighlight all the reads in the project by selecting the “Highlight -> Unhighlight All Reads in All Contigs” option.

Once all highlights have been removed, select Highlight -> Highlight Reads with String at Cursor. This will open a dialog box where we can enter the allele we would like to search.
In this example, enter a T in the search box and click “Ok”. A sequence of any length can be entered into this box. Any read that matches the sequence starting at the position of the cursor will be highlighted.

In this case all the read names with a T at that position will be highlighted and you will get an info box stating the number of reads (in this case 25) that matched the T.

Repeat the two-step technique described above to count the number of G’s at this position.

Our analysis shows that, of the 100 reads at this position, 25 reads (25%) have a T while 75 reads (75%) have a G. Given this result, this location does not satisfy the minimum 40% and we would not add a “polymorphism” tag to this location.
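The arithmetic behind this decision can be written out as a short check. The sketch below assumes the 40% guideline applies to each of the two most common alleles, which is an interpretation of the text, and that the read counts were obtained with the highlighting technique described above; the function names are illustrative.

```python
def allele_fractions(counts):
    """Convert {allele: read_count} into {allele: fraction of all reads}."""
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_putative_polymorphism(counts, min_fraction=0.40):
    """True if the two most common alleles each reach min_fraction of the reads."""
    fractions = sorted(allele_fractions(counts).values(), reverse=True)
    return len(fractions) >= 2 and fractions[1] >= min_fraction


# Counts from the walkthrough for consensus position 35,947.
counts = {"T": 25, "G": 75}
print(allele_fractions(counts), is_putative_polymorphism(counts))
```

With 25 T reads and 75 G reads the minor allele fraction is 0.25, below the 0.40 cutoff, so the position would not be tagged.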