Genotyping and quality control of UK Biobank, a large-scale, extensively phenotyped prospective resource

Information for researchers
v1.2 Oct 2015
Interim Data Release 2015

Contents

1 Introduction
  1.1 UK Biobank
  1.2 Purpose of this document
  1.3 Data releases
  1.4 The UK Biobank Axiom genotyping array
  1.5 Overview of DNA extraction and genotyping
2 Additional quality control
  2.1 Our approach
  2.2 SNP QC
  2.3 Sample QC
  2.4 Summary
3 Properties of the UK Biobank genotype data in the interim release
  3.1 Properties of samples
  3.2 Properties of SNPs
References
Appendices

1 Introduction

1.1 UK Biobank

UK Biobank is a prospective cohort study of over 500,000 individuals from across the United Kingdom. Participants, aged between 40 and 69, were invited to one of 22 centres across the UK between 2006 and 2010. Blood, urine and saliva samples were collected, physical measurements were taken, and each individual answered an extensive questionnaire focused on questions of health and lifestyle.

The resource will provide a picture of how the health of the UK population develops over many years and it will enable researchers to improve the diagnosis and treatment of common diseases [1].

A key goal of UK Biobank is to collect genetic data on every participant. This data, combined with the extensive information about medical history and lifestyle choices, will present an unparalleled opportunity to investigate how genetics and other factors impact the onset and development of disease.

The UK Biobank resource is open to the research community and it will grow and develop over time. Findings that use UK Biobank data must be fed back to UK Biobank and made available to other researchers.

1.2 Purpose of this document

Here we describe the quality control (QC) procedures applied to the genotype data in the interim UK Biobank data release, which contains ~150,000 samples genotyped at ~800,000 SNPs. We also describe characteristics of the released genotype data, both in terms of content and quality. This document is relevant to researchers accessing and using the genotype data available in the interim release. However, largely the same procedures will be applied in future releases. We also briefly describe the UK Biobank resource, the genotyping array, the sample storage and genotyping procedures, although these are described in more detail in the references.
1.3 Data releases

The interim release of genotype data for UK Biobank comprises ~150,000 samples. Work is ongoing on aspects of genotype calling that can utilise the scale of the project to further improve the comprehensiveness of the genetic data. This means that a small number of genotype calls in the interim release may change in subsequent releases. If this occurs, information will be made available about which genotype calls have changed, as a complement to the new genotype data.

Information about the likely timing and extent of future data releases is available from the UK Biobank website, http://biobank.ctsu.ox.ac.uk.

1.4 The UK Biobank Axiom genotyping array

The UK Biobank Axiom array from Affymetrix was specifically designed by an expert group for the purpose of genotyping the UK Biobank participants. Many researchers contributed markers and data during the array design process. There are ~800,000 markers on the array (see [2] for more details).

Briefly, the array design philosophy was to:

• Add markers that are of particular interest because of known associations or possible roles in phenotypic variation.
• Add coding variants across a range of minor allele frequencies (MAFs), principally missense and protein truncating variants.
• Choose the remaining content to provide good genome-wide imputation coverage in European populations in the common (>5%) and low frequency (1-5%) MAF ranges.

The UK Biobank Axiom array is being used to genotype ~450,000 of the ~500,000 UK Biobank participants. The other ~50,000 samples were genotyped on the closely related UK BiLEVE array. The UK BiLEVE project, for which the UK BiLEVE array was designed, aims to study the genetics of lung health and disease, and so those ~50,000 individuals were selected based on lung function and smoking behaviour from participants with self-declared European ancestry. Otherwise, the UK BiLEVE cohort and the rest of UK Biobank differ only in small details of the DNA processing stage (e.g., UK BiLEVE samples were manually transferred from storage to plates for DNA extraction).
The two SNP arrays are very similar, with over 95% common marker content. The UK Biobank Axiom array is an updated version of the UK BiLEVE Axiom array, and it includes additional novel markers (such as cancer-related markers), which replaced a small fraction of the markers used for genome-wide coverage. The marker lists for both the UK BiLEVE and the UK Biobank Axiom arrays are available as part of the UK Biobank resource, and further details of the array design are available in the UK Biobank Axiom Array content summary [2].

The ~50,000 samples genotyped on the UK BiLEVE Axiom array are included in the interim release. Since the UK BiLEVE sampling scheme and array design are reported in detail elsewhere [3], in the following sections we describe the DNA extraction and genotyping of the other ~450,000 samples processed on the UK Biobank Axiom array.

A small number of variants (7,104) assayed on the array were known, or suspected, to have more than two segregating alleles. Multi-allelic markers require special treatment in array design and genotype calling. A number of these variants (3,690) are particularly complicated and are not currently supported by the Affymetrix analysis pipeline; they have been set to missing in all batches. The remaining (3,414) multi-allelic variants are supported by Affymetrix but care must be taken in the interpretation of the calls provided, as a pair of calls (for the same individual) must be considered together to reconstruct the actual genotype at the marker. The list of all multi-allelic markers, both supported and unsupported by Affymetrix, is available to download. Furthermore, researchers interested in multi-allelic markers can download either the array intensity files (.cel files) or the processed intensity values, and undertake their own calling, QC and analyses.

The custom-designed UK Biobank Axiom array attempts to assay a large number of SNPs that have not been previously genotyped. As expected, a small number of markers (~38,000, i.e., less than 5% of all markers present on the UK Biobank Axiom array) exhibited sub-optimal and/or complex clustering patterns and hence were excluded from all subsequent QC metrics and statistics, and their corresponding calls were set to missing in the interim data release.
1.5 Overview of DNA extraction and genotyping

1.5.1 Sample storage and DNA extraction

The samples collected from participants are held at the UK Biobank facility in Stockport, UK. Storage protocols for all samples require 850 µl stored in racks of 96 x 1.2 ml microtubes, at either -80°C or -196°C (depending on sample type). Generally the racks are populated with samples grouped by sample type, collection centre and collection time. DNA is extracted from buffy coat samples, which (generally) make up 24 of every 96 tubes on the racks in storage. Samples are picked by robot to a 96-position destination rack (a plate) ready for DNA extraction (94 samples per plate, leaving two spaces for the addition of controls).

Given the unprecedented sample size of the cohort, special attention was given to ensure that sources of sample collection or extraction variability and other measurement errors do not systematically differ between cases and controls in any future case-control studies. Attempts were made to avoid samples submitted for analysis being grouped or submitted in a sequence which itself exhibits an underlying trend. This was achieved via a sample selection algorithm that ensures a mixture of collection centres on each destination rack [4]. During DNA extraction, the DNA concentration and purity are assessed. Samples failing to meet defined thresholds are not submitted for genotyping; where possible these samples are re-processed at a later date. Further details of the UK Biobank sampling and DNA extraction procedures can be found in [4, 5].

1.5.2 Genotyping

Samples were genotyped at the Affymetrix Research Services Laboratory in Santa Clara, California, USA. Upon receipt of a 96-well plate containing 94 UK Biobank samples, Affymetrix added two control individuals (from 1000 Genomes) to the same well positions on each plate: HG00097 to well A12 and HG00264 to well E12. See Affymetrix laboratory process documentation for further details [6].
Axiom Array plates were processed on the Affymetrix GeneTitan® Multi-Channel (MC) Instrument. Genotypes were then called from the resulting intensities in batches of ~4,700 samples (~4,800 including the controls) using the Affymetrix Power Tools software and the Affymetrix Best Practices Workflow [7]. Supplementary Table S1 shows the number of samples and plates per batch in the interim release, which includes 11 batches genotyped on the UK BiLEVE Axiom array and 22 batches genotyped on the UK Biobank Axiom array.

Individuals with the same genotype at any given SNP will cluster together in a two-dimensional intensity space (one dimension for each targeted allele). Briefly, genotype calling involved inferring properties of these clusters within each batch and assigning each sample a genotype (or leaving the call missing) based on its position in intensity space. For the interim data release, Affymetrix performed further rounds of genotype calling using algorithms customised for the UK Biobank project. These algorithms targeted very rare SNPs with 6 or fewer minor alleles in a batch, and a subset of SNPs for which the generic calling algorithm did not perform optimally [8]. After genotype calling, Affymetrix performed quality control in each batch separately, to exclude SNPs with poor cluster properties. If a SNP did not meet the Affymetrix prescribed QC thresholds in a given batch, it was set to missing in all individuals from that batch. Affymetrix also checked sample quality (such as DNA concentration) and genotype calls were provided only for samples with sufficient DNA metrics. More information about the Affymetrix calling algorithms and quality control protocols is available in [6, 7, 8].
2 Additional quality control

2.1 Our approach

We undertook QC in several stages. First we used several SNP-based metrics to flag SNPs with less reliable genotyping results, to be set to missing in the batches where they failed our filters. Then we identified poor quality samples using only high quality SNPs (defined as SNPs that passed QC filters in all 33 batches in this interim release). We also performed other sample-based inference such as principal component analysis and relatedness inference. Properties of UK Biobank (such as its large cohort size) mean that some quality control metrics commonly used in genome-wide association studies (GWAS) are not sufficient in this context. We used a variety of approaches in our QC procedures to account for the effects of population structure and batch-based genotyping, which we discuss below.

2.1.1 Diverse ancestries

UK Biobank consists of ~500,000 UK individuals. Participants were asked to choose from a set of predefined ethnic categories, or 'Other', and ~470,000 reported their ethnicity as 'White'. Other individuals come from a wide variety of ethnic groups (Table 1).

Self-reported ethnicity              Representation (%)
White                                94.06
    British                          88.07
    Irish                             2.63
    Any other white background        3.36
Asian                                 2.28
    Indian                            1.18
    Pakistani                         0.37
    Bangladeshi                       0.05
    Chinese                           0.31
    Any other Asian background        0.37
Black                                 1.61
    African                           0.68
    Caribbean                         0.90
    Any other Black background        0.03
Mixed                                 0.59
    White and Asian                   0.17
    White and Black African           0.08
    White and Black Caribbean         0.12
    Any other mixed background        0.22
Other/Unknown                         1.46

Table 1. Self-reported ethnic groups in the ~500,000 UK Biobank participants. Of these, ~150,000 were genotyped for the interim data release.

The inclusion of samples with diverse ancestry can confound standard QC metrics. For instance, individuals with unusual heterozygosity are typically excluded from a GWAS, but heterozygosity is correlated with ancestry as allele frequency distributions can vary across populations. Similarly, testing that Hardy-Weinberg Equilibrium (HWE) holds is a common approach for identifying poor quality SNPs, but departures from HWE can be expected in the context of strong population structure, again because of differences in allele frequency distributions.
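As a worked illustration of why pooled ancestries break the HWE expectation, the following is the standard population-genetics result often called the Wahlund effect. It is not derived in this document, and the symbols are introduced here only for illustration: $K$ subpopulations, each in HWE, with allele frequency $p_k$ and mixing proportion $w_k$ (with $\sum_k w_k = 1$).

```latex
\begin{align*}
\bar{p} &= \sum_{k=1}^{K} w_k\, p_k,
\qquad
\sigma^2 = \sum_{k=1}^{K} w_k\, (p_k - \bar{p})^2 \\
H_{\mathrm{obs}} &= \sum_{k=1}^{K} w_k \cdot 2 p_k (1 - p_k)
 = 2\bar{p}(1-\bar{p}) - 2\sigma^2 \\
H_{\mathrm{exp}} &= 2\bar{p}(1-\bar{p})
\end{align*}
```

For example, pooling two equally sized groups with $p_1 = 0.1$ and $p_2 = 0.5$ gives $\bar{p} = 0.3$ and $\sigma^2 = 0.04$, so the observed heterozygote frequency is $0.34$ against an HWE expectation of $0.42$. A perfectly genotyped SNP can therefore appear to violate HWE purely because of structure, which is the motivation for restricting SNP QC to a homogeneous subset.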
To account for the effects of population structure, we proceeded in two phases. For SNP-based QC metrics we used only individuals with similar ancestry (so that, for example, HWE is expected). To do this we identified a set of individuals with European ancestry by projecting individuals onto principal components computed from the 1000 Genomes project. We also characterised the population structure unique to UK Biobank by computing principal components using only UK Biobank individuals (after applying SNP QC). We used the UK Biobank-specific principal components analysis (PCA) results to account for population structure in all our sample-based QC metrics.

2.1.2 Batch-based genotype calling

In view of UK Biobank's large cohort size, Affymetrix carried out the genotyping and initial SNP QC in batches of around 4,800 samples, effectively treating each batch as an independent experiment. However, the availability of multiple batches, processed under the same strict guidelines, provides new opportunities for SNP QC: we can check the consistency of genotype calling between batches. In rare instances, the Affymetrix calling algorithm might incorrectly call a SNP in one batch but not others.

Affymetrix assays genetic markers using "probesets" which target a particular variant. A probeset is a set of probes whose signal is summarised to make the genotyping call. A small fraction of variants (mostly those that are novel to the UK Biobank Axiom array) are genotyped using multiple probesets, and in this case more than one call is made for the same marker. For these markers Affymetrix recommends a single "best" probeset in each batch separately and the interim release includes only calls from the "best" probesets. We did not use these markers in our sample QC analyses as a different probeset can be recommended for the same SNP across batches.
2.2 SNP QC

Due to the size of the UK Biobank cohort, genotyping was performed in a large number of batches (33 batches of ~4,800 individuals for the interim data release). This provides additional opportunities to study and ensure data consistency. Affymetrix routinely undertakes SNP QC [7, 8], and we adopted the Affymetrix recommendations throughout, for each given batch. In addition, we performed quality checks that are appropriate for a large-scale dataset genotyped in batches. For the reasons described above, we computed all SNP QC metrics using a homogeneous subset of individuals drawn from the largest ancestral group in the cohort (which is European in UK Biobank). To identify these individuals, we projected UK Biobank samples on the two major principal components computed by analysing the CEU, YRI, CHB and JPT populations from the HapMap3 reference panel (with genotypes provided by 1000 Genomes, phase 1, release v3). Then we selected samples that were projected in the neighbourhood of the CEU cluster, as shown in Figure 1.

The UK BiLEVE batches have a higher proportion of samples with European ancestry by design, as participants were selected in part based on self-declared ethnicity. In those 11 batches we used ~97% of samples for SNP QC. In the UK Biobank batches we used 91%-93% of samples for SNP QC, as these batches are more ethnically diverse. Appendix A1 describes the analysis we used to choose a homogeneous subset of samples for SNP QC.

In samples drawn from the same population we would not expect differences in genotype frequencies, either between batches or between plates within a batch, at the same marker. Such differences might indicate that the SNP was not genotyped as accurately as other SNPs in the batch (or plate) which exhibits unusual genotype frequencies. We refer to these cases as batch or plate effects. For example, batch effects can occur when the sample intensities in one batch shift relative to the intensities in other batches. In rare cases, such a shift can cause the Affymetrix calling algorithm to miscall a genotype cluster in a way that is not detected by the routine Affymetrix SNP QC. Similarly, plate effects can occur when the intensities in one plate shift relative to the intensities in other plates in the same batch.
To look for effects in a particular batch we tested whether we can reject the null hypothesis that the given batch has the same genotype frequencies as all other batches combined. To look for effects in a particular plate we tested whether we can reject the null hypothesis that the given plate has the same genotype frequencies as all other plates, within the same batch, combined. In both cases we used Fisher's exact test on the 2×3 table of genotypes. (Since there are several plates in a batch, we performed Fisher's exact test for each plate that is at least half-full, i.e., with 48 samples or more, and then took the smallest p-value.) See Appendix A2 for more details.

We also performed an exact test for Hardy-Weinberg equilibrium for each batch [9]. Again, selecting a homogeneous subset of samples makes the procedure more conservative, as Hardy-Weinberg equilibrium does not necessarily hold in the presence of population structure.

If a SNP failed any of these tests (with a p-value of less than $10^{-12}$), this might indicate that the genotypes have not been called correctly in the corresponding batch, and the SNP is flagged. For the current interim data release, genotypes at such flagged SNPs were set to missing in batches where the tests suggested issues with the initial calls. With the aim of improving genotype calling in subsequent data releases, SNPs that were filtered out in at least one batch are the subject of ongoing advanced analysis work by Affymetrix. Preliminary data generated by the Affymetrix advanced analysis workflow indicates that a substantial number of SNPs flagged in the interim release will be released in the final release.

Figure 1. We used 1000 Genomes data for four HapMap populations (CEU, CHB, JPT, YRI) to compute PCA loadings for ~40,000 SNPs on the UK Biobank Axiom array. In the top left panel, these HapMap samples are projected onto the 1st and 2nd principal components and are coloured by population. In the other panels, all 11 UK BiLEVE batches (labeled b1 to b11) and an arbitrarily chosen subset of 8 UK Biobank batches (labeled b001 to b008) are projected into the same principal component space. The samples are coloured according to whether they were used in SNP QC procedures or not (in black and gray, respectively). For each batch the proportion of samples used for SNP QC is also reported.
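The batch-effect test described in Section 2.2 can be sketched as follows. This is a minimal pure-Python illustration of a two-sided Fisher exact test on a 2×3 genotype table, not the implementation used for the release (Appendix A2 has the actual details). The function name and the definition of "two-sided" (summing all tables no more probable than the observed one) are assumptions made for this sketch, and the brute-force enumeration is only practical for small counts.

```python
from math import comb


def fisher_exact_2x3(row_a, row_b):
    """Two-sided Fisher exact p-value for a 2x3 table of genotype counts.

    row_a: (n_AA, n_AB, n_BB) in the batch (or plate) under test.
    row_b: (n_AA, n_AB, n_BB) in all other batches (or plates) combined.
    The p-value sums the probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    n_a = sum(row_a)                              # size of the tested batch
    cols = [a + b for a, b in zip(row_a, row_b)]  # genotype totals
    denom = comb(sum(cols), n_a)                  # number of ways to pick the batch

    def weight(x, y):
        # Unnormalised hypergeometric weight of a table whose first row is
        # (x, y, n_a - x - y); exact integers avoid rounding problems.
        z = n_a - x - y
        if z < 0 or z > cols[2]:
            return 0
        return comb(cols[0], x) * comb(cols[1], y) * comb(cols[2], z)

    observed = weight(row_a[0], row_a[1])
    tail = 0
    for x in range(min(cols[0], n_a) + 1):
        for y in range(min(cols[1], n_a - x) + 1):
            w = weight(x, y)
            if w <= observed:
                tail += w
    return tail / denom


# Hypothetical counts: the tested batch is all BB while the rest are all AA.
p = fisher_exact_2x3((0, 0, 60), (60, 0, 0))
flagged = p < 1e-12  # threshold quoted in the text
```

For the plate-level check, the same function would be applied to each plate with at least 48 samples against the other plates of its batch, keeping the smallest p-value.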
2.3 Sample QC

To carry out QC on samples, we first applied SNP QC (as described above) and selected a set of high quality autosomal SNPs. The analyses described below are based on ~600,000 autosomal SNPs which are on both the UK Biobank and UK BiLEVE arrays, and passed SNP QC in all 33 batches.

2.3.1 Population structure

To capture population structure specific to the UK Biobank cohort, we performed principal component analysis of ~150,000 UK Biobank samples using ~100,000 SNPs. These PCs can be used to identify samples with similar ancestry or to control for population structure in association studies. Metrics for sample quality control can be sensitive to population structure as well, so we used the principal components in the process of identifying poor quality samples. The four major PCs are shown in Figure 2. The next sixteen PCs (from PC5 to PC20) are shown in Figure S1 and details of the analysis are presented in Appendix A3.

Figure 2. Genetic principal components in UK Biobank, computed from 141,0670 samples and 101,284 SNPs using flashPCA [10]. (A) The 1st principal component (PC1) on the x-axis and the 2nd principal component (PC2) on the y-axis. (B) The 3rd principal component (PC3) on the x-axis and the 4th principal component (PC4) on the y-axis. In both panels, samples are coloured according to self-reported ethnicity. The legend indicates the coloured symbol used for each predefined ethnicity throughout this document.

2.3.2 Heterozygosity and missing rates

Extreme heterozygosity and/or low call rate can be indicators of poor sample quality [11]. However, heterozygosity is sensitive to population structure because allele frequency distributions (and thus heterozygosity) can differ between populations. Figure 3A shows the effect of SNP ascertainment on heterozygosity: since the UK Biobank array was designed to provide good imputation coverage in European populations, samples with non-European ethnicity tend to have lower heterozygosity. We control for this by fitting a linear regression model with heterozygosity as the outcome and the four major PCs as the predictors (see Appendix A4 for details). The corrected heterozygosity is plotted in Figure 3B.
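The regression-based correction just described can be sketched in plain Python. This is an illustration under stated assumptions rather than the procedure of Appendix A4: it fits ordinary least squares with an intercept and returns the residuals as the "corrected" values, whereas the document does not say here exactly how the corrected quantity is defined. The function name `ols_residuals` is hypothetical.

```python
def ols_residuals(y, X):
    """Regress y on the columns of X (plus an intercept) by least squares.

    y: heterozygosity per sample.
    X: one tuple of principal-component scores (e.g. PC1..PC4) per sample.
    Returns (residuals, beta), where beta[0] is the intercept.
    """
    rows = [[1.0] + list(x) for x in X]          # design matrix with intercept
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    beta = [M[i][k] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return residuals, beta


# Toy data (hypothetical): heterozygosity that depends linearly on four PCs.
pcs = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
       (1, 1, 1, 1), (2, -1, 0, 3), (0, 0, 0, 0)]
het = [0.2 + 0.01 * a - 0.02 * b + 0.005 * d for a, b, c, d in pcs]
corrected, coef = ols_residuals(het, pcs)
```

With real data the residuals would not vanish; samples with unusually large positive or negative residuals are the candidates examined in the outlier step.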
Some samples can have naturally extreme heterozygosity, even after accounting for population structure. Specifically, individuals with mixed ethnicity tend to have higher heterozygosity (which is not captured by the principal components), and individuals whose parents are closely related tend to have lower heterozygosity. Therefore, we attempted to flag as outliers samples whose extreme heterozygosity is not explained by mixed ancestry or increased levels of marriage between close relatives.

Figure 3. Heterozygosity and missingness for 152,256 samples in the interim UK Biobank data release, after removing 480 outliers. (Section 2.3.2 details the procedure to flag outliers.) Points are coloured by self-reported ethnicity, using the coloured symbols in the legend of Figure 2. (A) Heterozygosity (proportion of autosomal heterozygous calls) on the y-axis against logit-transformed missingness (proportion of genotypes not called) on the x-axis. The logit transformation, defined as logit(x) = log(x/(1-x)), is applied to normalise the missingness values. (B) Ancestry-corrected heterozygosity on the y-axis against logit-transformed missingness on the x-axis. The heterozygosity values are corrected for systematic differences due to population structure using four genetic principal components, as described in Appendix A4.

After taking into account mixed ethnicity, we identified 472 outliers (0.3% of total samples) with high missingness or high heterozygosity (plotted in red in Figure 4A), by visually inspecting the scatterplots of heterozygosity and missingness for each self-reported ethnicity (see Figure S2). To distinguish between poor quality samples and samples with naturally low heterozygosity, we looked for long runs of homozygosity (ROH). We computed the total length of long ROH using plink [12] (see Appendix A5 for details), and identified 8 samples with total ROH that is unusually short, compared to other samples with similar heterozygosity (Figure 4B).

In total, we identified 480 samples (0.3% of total samples) with high missingness or for which heterozygosity rates were explained by neither ROH analysis nor mixed ethnicity.
These samples are not excluded from the data release; instead, a list of outlier IDs for these samples is provided to researchers along with the genotype data.

Figure 4. A total of 152,736 UK Biobank samples were genotyped for the interim data release. (Intended and unintended duplicates are excluded from this count.) Of these, there are 480 outliers, shown in red; the rest of the samples are shown in gray. (A) Ancestry-corrected heterozygosity on the y-axis and logit-transformed missingness on the x-axis. This plot emphasizes that some outliers have high missingness or high heterozygosity. (Samples with mixed ancestry tend to have increased heterozygosity as well, but this is expected and such samples are not flagged as outliers based on heterozygosity alone.) (B) Ancestry-corrected heterozygosity on the y-axis and total length (in kb) of long runs of homozygosity (ROH) on the x-axis. This plot emphasizes that some outliers with low heterozygosity have unusually short total ROH.

2.4 Summary

After QC procedures were applied, the interim UK Biobank data release contains genotypes for 152,736 samples that passed sample QC (~99.9% of total samples), and 806,466 SNPs that passed SNP QC in at least one batch (>99% of the array content). As noted above, Affymetrix is pursuing ongoing development work on genotype calling in extremely large multi-batch settings. Therefore, some genotype calls may change between this interim data release and the final data release, and we anticipate that the various metrics will improve further.

3 Properties of the UK Biobank genotype data for Interim Release

The interim data release of UK Biobank genetic data consists of 152,736 samples. Of those, 102,754 were genotyped on the UK Biobank array (split into 22 batches) and 49,982 were genotyped on the UK BiLEVE array (split into 11 batches). In addition to computing principal components, we analysed several aspects of the interim release data after quality control had been applied.
3.1 Properties of samples

3.1.1 Related individuals

We identified related samples by calculating kinship coefficients for all pairs of samples using KING's robust estimator [13]. We used this estimator as it is robust to population structure and it is implemented in an algorithm efficient enough to consider all n(n-1)/2 (~11,250,000,000) pairs in a practicable amount of time. Parent-child and full sibling pairs have the same expected kinship coefficient but can be distinguished by their IBS0 fraction, defined as the proportion of SNPs at which two samples have no alleles in common (see Figure 5). We excluded some samples from the kinship calculation because KING's robust estimator is not reliable for individuals with high heterozygosity or high missingness [13]. See Appendix A6 for details.

We report only monozygotic twins and relatives up to the 3rd degree (Table 2).

Relationship           Pairs
Monozygotic twins         18
Parent-offspring         619
Full siblings          2,183
2nd degree             1,061
3rd degree             5,811

Table 2. Related pairs (3rd degree or closer) for ~150,000 UK Biobank participants genotyped in the interim UK Biobank data release. (The counts are derived from the kinship information presented in Figure 5.)

We detected 1,856 individuals that are related (to the 1st degree or as monozygotic twins) to more than one person, and thus will occur in more than one pair in Table 2. Seventy-two of these individuals are within a trio (child with two parents) in which checking of the sex and ages of both parents and age of the child was consistent with the inferred relationship. There are 6 instances of two siblings and a parent, and in one of these the siblings are monozygotic twins. The others are individuals within sets of 3 or 4 siblings.
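The classification of pairs in Section 3.1.1 can be sketched as below. The kinship cut-offs are the powers of two published with KING [13] (roughly 0.354, 0.177, 0.0884 and 0.0442). The IBS0 cut-off separating parent-offspring pairs from full siblings is a hypothetical placeholder, since this document does not state the value used, and the function names are likewise illustrative.

```python
def n_pairs(n):
    """Number of unordered sample pairs, n(n-1)/2."""
    return n * (n - 1) // 2


def classify_pair(kinship, ibs0, ibs0_cutoff=0.001):
    """Assign a relationship class from a kinship estimate and IBS0 fraction.

    Kinship boundaries follow KING's published inference criteria.
    ibs0_cutoff is an assumed value chosen only for illustration: a parent
    and child share at least one allele at every SNP, so their IBS0 is
    expected to be near zero, unlike full siblings.
    """
    if kinship > 2 ** -1.5:          # > ~0.354
        return "monozygotic twins"
    if kinship > 2 ** -2.5:          # ~0.177 to ~0.354: 1st degree
        return "parent-offspring" if ibs0 < ibs0_cutoff else "full siblings"
    if kinship > 2 ** -3.5:          # ~0.0884 to ~0.177
        return "2nd degree"
    if kinship > 2 ** -4.5:          # ~0.0442 to ~0.0884
        return "3rd degree"
    return "unrelated or more distant"


pairs_for_150k = n_pairs(150_000)    # 11,249,925,000, i.e. ~11.25 billion
```

In practice duplicates of the same sample would also fall in the top class, so a real pipeline would need sample metadata to tell duplicates from twins.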
Figure 5. Close relationships for ~150,000 UK Biobank participants genotyped in the interim release. Each point represents a pair of related individuals and the colours indicate the degree of relatedness: monozygotic twins in black (in the upper left corner), 1st, 2nd and 3rd degree relatives in red, green and blue, respectively. There are two groups of 1st degree relatives: parent-child pairs (red triangles) and full siblings (red circles). For all pairs, the y-axis shows the kinship coefficient, defined as the probability that two alleles sampled at random (one from each individual) are identical by descent. The x-axis shows the proportion of zero identity-by-state (IBS0), defined as the proportion of SNPs at which one sample carries the minor homozygote and the other sample carries the major homozygote, so that they share no alleles. The degree of relatedness is inferred from the estimated kinship coefficient using KING's criteria [13].

3.1.2 Sex mismatches

Affymetrix infers an individual's sex prior to genotype calling (but after measuring allele intensities) so that it can use an appropriate algorithm to call SNPs on the sex-linked chromosomes, X and Y. For this purpose, Affymetrix uses special probes for non-polymorphic sites on the X and Y chromosomes, which produce large differences in intensity between males and females. Self-reported sex (recorded at recruitment) and genetically inferred sex are available for all samples. Out of the ~150,000 samples in the interim release, the self-reported sex does not match the genetically inferred sex in 191 cases (0.1% of total samples).

There are three possible explanations for sex mismatches:

• Clerical error: either the DNA sample was associated with the wrong individual (mislabelling) or sex was recorded incorrectly at recruitment.
• Sex determined by chromosomal make-up does not match gender identity (and thus self-reported sex).
• Sex chromosome aneuploidy (i.e., an abnormal number of sex chromosomes, for example XXY).

Analysis of the X and Y-chromosome average intensities (which are available to download) can be used to identify instances of the third possible explanation. After the interim release, UK Biobank intends to extract DNA (where possible) and reprocess samples with unexplained gender mismatches.
Figure 6 reports two measures that can be used to infer gender. X-chromosome heterozygosity is informative because males carry a single copy of the X chromosome and thus cannot be heterozygous. The ratio of Y-chromosome to X-chromosome average intensity is informative because females carry no copy of the Y chromosome and thus their average Y intensity should be lower (not necessarily zero but at background level). The two measures are not mutually redundant and can be used to identify possible cases of sex chromosome aneuploidy. For example, samples with XXY aneuploidy are expected to have female-like heterozygosity on the X chromosome, but also have male-like intensity values for the Y chromosome. Such samples should either be excluded from downstream analysis or used with caution, especially in conjunction with their phenotypic data.

Figure 6. X-chromosome heterozygosity and ratio of Y-chromosome to X-chromosome average intensity, for 152,736 UK Biobank samples. The X-chromosome heterozygosity is computed from all X-chromosome SNPs outside the pseudoautosomal regions (PARs). The intensity values are measured at the probes used for determining sex prior to genotype calling. (A) Samples are coloured by gender: if the self-reported and genetically inferred sex agree, then females are plotted in red and males in blue; otherwise, mismatches are plotted in black. Points in the centre of the plot (separated from the blue and red clusters) are possible cases of XXY aneuploidy. (B) The same points are coloured by self-reported ethnicity, using the coloured symbols in the legend of Figure 2. X-chromosome heterozygosity exhibits ascertainment bias due to population structure, similarly to autosomal heterozygosity. (Compare the systematic offset in heterozygosity between samples with different ethnic background in this figure and in Figure 3A.)
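The way the two measures in Figure 6 combine can be sketched as a simple decision rule. This is only an illustration: it is not the Affymetrix sex-inference algorithm, and both cut-off values are hypothetical placeholders that would have to be read off the actual clusters in the data.

```python
def infer_sex(x_het, y_x_ratio, het_cutoff=0.05, ratio_cutoff=0.5):
    """Combine X heterozygosity and the Y/X intensity ratio.

    het_cutoff and ratio_cutoff are assumed values for illustration only.
    Returns 'F', 'M', 'possible XXY' or 'ambiguous'.
    """
    female_like_het = x_het > het_cutoff     # heterozygous X calls need two X copies
    male_like_y = y_x_ratio > ratio_cutoff   # Y signal above background
    if female_like_het and male_like_y:
        return "possible XXY"
    if female_like_het:
        return "F"
    if male_like_y:
        return "M"
    return "ambiguous"                       # low X heterozygosity and no Y signal


def is_mismatch(reported_sex, x_het, y_x_ratio):
    """True when the self-reported sex disagrees with the inferred label."""
    return infer_sex(x_het, y_x_ratio) != reported_sex
```

Because X-chromosome heterozygosity also varies with ancestry (Figure 6B), a fixed heterozygosity cut-off like the one assumed here would need checking against each ancestry group.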
3.2 Properties of SNPs

Figures 7, 8 and 9 illustrate various quality metrics and properties of SNPs genotyped on the UK Biobank Axiom array, across multiple batches. Affymetrix processed and genotyped the batches separately and we applied the same filters (the tests for batch or plate effects and Hardy-Weinberg equilibrium described in Section 2.2), independently, multiple times. Therefore, the number of times a SNP passed these filters is an extremely strict measure of its genotype calling quality. This and the call rate are reported in Figure 7.

Figure 7. Overall quality of the genotype data in the interim UK Biobank release, after all SNP QC steps have been applied. (A) Number of batches in which a SNP is set to missing (out of 33 batches), for common, low-frequency and rare SNPs genotyped on both the UK BiLEVE and UK Biobank Axiom arrays. The shading indicates one of three minor allele frequency (MAF) categories of SNPs: common (MAF > 5%); low frequency (5% > MAF > 1%); rare (MAF < 1%). MAFs in UK Biobank were estimated from samples with inferred European ancestry. (B) SNP call rate for common, low frequency and rare SNPs combined.

The small peaks in the call rate in Figure 7B are due to SNPs set to missing in just a few batches. For example, if a SNP did not pass a QC threshold in exactly one batch in n batches but otherwise has a high call rate in the remaining batches, its call rate is ~(n-1)/n. Since there are 33 batches in the interim release, there is a subset of SNPs with call rate ~32/33 = 0.97 and a smaller subset of SNPs with call rate ~31/33 = 0.94.
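The arithmetic behind these peaks can be written out as a short sketch. It assumes, as the approximation in the text does, that all batches are the same size; the function name and the optional within-batch call rate parameter are introduced here for illustration.

```python
def expected_call_rate(n_batches, n_failed, within_batch_call_rate=1.0):
    """Approximate overall call rate of a SNP set to missing in n_failed batches.

    Assumes equally sized batches, so each failed batch removes a fraction
    1/n_batches of all calls.
    """
    return within_batch_call_rate * (n_batches - n_failed) / n_batches


peaks = [round(expected_call_rate(33, k), 2) for k in (1, 2, 3)]
# peaks[0] and peaks[1] reproduce the 0.97 and 0.94 values quoted above
```

Because the real batches differ slightly in size (Table S1), the observed peaks are expected to sit near, not exactly at, these values.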
Another measure of genotyping quality, reproducibility of calls, was assessed in two controls from 1000 Genomes which were added to every plate (in the same well on each plate) and were genotyped multiple times. Low discordance between calls for the same individual across different plates indicates high quality genotyping. The discordance for a particular SNP is computed as:

$$1 - \frac{\max\{n_{AA},\, n_{AB},\, n_{BB}\}}{n_{AA} + n_{AB} + n_{BB}}$$

where $n_{AA}$, $n_{AB}$ and $n_{BB}$ are the number of times the genotype AA, AB and BB is called, respectively. For concreteness, suppose that $\max\{n_{AA}, n_{AB}, n_{BB}\} = n_{AA}$. That is, $n_{AA}$ is the largest of the three counts and therefore AA, the most frequent call, is the consensus call. The discordance is the proportion of calls that are not the consensus call; in the example, this is the proportion of AB or BB calls. Figure 8 shows the discordance rates for the two 1000 Genomes controls. In both cases, there is a small number of SNPs with discordance > 0.05 (282 for HG00097 and 143 for HG00264, or 417 (0.05%) in total). These SNPs are included in the interim release, and the list can be downloaded. Some might be subject to exclusion in the final release after further analysis has been performed.

Figure 8. Rates of discordance from the consensus call, for the two 1000 Genomes controls genotyped multiple times on the UK Biobank array. (A) Discordance for HG00097. (B) Discordance for HG00264.

Figure 9 shows the distributions of minor allele frequency and missingness, across SNPs that passed all SNP QC filters in all 33 batches in the interim release.

Figure 9. Distributions of minor allele frequency and missingness across a set of 626,445 SNPs genotyped on both the UK BiLEVE and UK Biobank Axiom arrays, which passed all SNP QC filters in the 33 batches of the interim release. (A) Histogram of minor allele frequencies estimated from samples with inferred European ancestry. The shading indicates one of three MAF categories: common SNPs with MAF > 5%; low frequency SNPs with 5% > MAF > 1%; rare SNPs with MAF < 1%. (B) Histogram of logit-transformed missingness for common, low frequency and rare SNPs combined. For reference, a logit value of -8 corresponds to 0.033% missingness; -6 to 0.247% missingness; -4 to 1.799% missingness.
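The discordance measure defined earlier in this section is simple enough to sketch directly. The function name and the example counts are illustrative only; the 0.05 threshold is the one quoted in the text.

```python
def discordance(n_aa, n_ab, n_bb):
    """Proportion of repeat calls that differ from the consensus call.

    n_aa, n_ab, n_bb: number of times a control sample was called AA, AB
    and BB at one SNP across plates. Missing calls are not counted.
    """
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("no genotype calls available for this SNP")
    return 1 - max(n_aa, n_ab, n_bb) / total


# Hypothetical SNP: 100 repeat genotypings of one control, 95 called AA.
d = discordance(95, 4, 1)
exceeds_threshold = d > 0.05 + 1e-9   # small tolerance for floating-point rounding
```

Note that the measure cannot distinguish a wrong but reproducible call from a correct one, since it compares calls only with each other.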
A small number of genotyped autosomal SNPs (65) have been found which show significantly different allele frequencies between the UK BiLEVE array and the UK Biobank array. These SNPs are in the interim data release but should be excluded from analyses. A number (27) of these SNPs were used in phasing and imputation. We strongly recommend conditioning on array in association tests to ameliorate the effect of these SNPs. There could still be a subtle bias in the neighbourhood of these SNPs after conditioning, but this will depend upon the phenotype being tested for association. We recommend looking carefully at any results with imputed SNPs in the regions of the affected SNPs, including confirming any GWAS hits with the genotyped-only data and looking at cluster plots of the genotype data. Additionally, there are a number of SNPs (46) on chromosome X which show a significant allele frequency difference between males and females or show differences between arrays. We recommend that these SNPs be excluded from all analyses. The full list of these markers is available to download. These SNPs were identified as those with a p-value less than $10^{-40}$ in a Fisher exact test on genotype counts.

References

[1] N. Allen, C. Sudlow, P. Downey, T. Peakman, J. Danesh, P. Elliott, J. Gallacher, J. Green, P. Matthews, J. Pell, T. Sprosen, and R. Collins, "UK Biobank: Current status and what it means for epidemiology," Health Policy and Technology, 1(3):123-126, 2012.
[2] The UK Biobank Array Design Group, "UK Biobank Axiom array: content summary", 2014. http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-Content-Summary-2014.pdf
[3] L. V. Wain et al., "Novel insights into the genetics of smoking behavior, lung function and chronic obstructive pulmonary disease in UK Biobank," Submitted, 2015.
[4] UK Biobank, "Genotyping of 500,000 participants: Description of sample processing workflow and preparation of DNA for genotyping", 20 April, 2015.
[5] UK Biobank, "DNA extraction at UK Biobank", 2014. http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/DNA-Extraction-at-UK-Biobank-October-2014.pdf
[6] Affymetrix, "UKB_WCSGAX: UK Biobank 500K Samples Processing by the Affymetrix Research Services Laboratory", April, 2015.
[7] Affymetrix, “Axiom® Genotyping Solution Data Analysis Guide”, 2014. http://media.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf
[8] Affymetrix, “UKB_WCSGAX: UK Biobank 500K Genotyping Data Generation by the Affymetrix Research Services Laboratory”, April, 2015.
[9] J. E. Wigginton, D. J. Cutler, and G. R. Abecasis, “A note on exact tests of Hardy-Weinberg equilibrium,” The American Journal of Human Genetics, 76(5):887-893, 2005.
[10] G. Abraham and M. Inouye, “Fast principal component analysis of large-scale genome-wide data,” PLoS ONE, 9(4):e93766, 2014.
[11] IMSGC and Wellcome Trust Case Control Consortium, “Genetic risk and the role of cell mediated immune mechanisms in multiple sclerosis,” Nature, 476(7539):214-219, 2011.
[12] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. Ferreira, D. Bender, J. Maller, P. Sklar, P. de Bakker, M. J. Daly, and P. C. Sham, “PLINK: A Tool Set for Whole-Genome and Population-Based Linkage Analyses,” The American Journal of Human Genetics, 81(3):559-575, 2007.
[13] A. Manichaikul, J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale, and W.-M. Chen, “Robust relationship inference in genome-wide association studies,” Bioinformatics, 26(22):2867-2873, 2010.
[14] C. Bellenguez, A. Strange, C. Freeman, Wellcome Trust Case Control Consortium, P. Donnelly, and C. C. Spencer, “A robust clustering algorithm for identifying problematic samples in genome-wide association studies,” Bioinformatics, 28(1):134-135, 2012.
[15] A. L. Price et al., “Long-range LD can confound genome scans in admixed populations,” The American Journal of Human Genetics, 83(1):132-135, 2008.

Appendices

The interim UK Biobank data release consists of 11 UK BiLEVE batches and 22 UK Biobank batches, genotyped (in this order) by Affymetrix using calling algorithms specifically adapted to the UK Biobank project [7,8].
Batch            Number of           Number of            Number of
                 genotyping plates   UK Biobank samples   control samples
UKBiLEVEb1        52                 4592                 195
UKBiLEVEb2        53                 4598                 186
UKBiLEVEb3        58                 4587                 210
UKBiLEVEb4        52                 4601                 196
UKBiLEVEb5        59                 4596                 197
UKBiLEVEb6        61                 4573                 216
UKBiLEVEb7        63                 4589                 199
UKBiLEVEb8        53                 4593                 202
UKBiLEVEb9        54                 4594                 198
UKBiLEVEb10       59                 4597                 184
UKBiLEVEb11       72                 4600                 199
UKBiobankb001     52                 4710                  90
UKBiobankb002     74                 4657                 134
UKBiobankb003     85                 4648                 141
UKBiobankb004     91                 4652                 142
UKBiobankb005     87                 4661                 141
UKBiobankb006     64                 4689                 113
UKBiobankb007     75                 4678                 118
UKBiobankb008    186                 4755                  41
UKBiobankb009     73                 4693                 104
UKBiobankb010     85                 4713                  85
UKBiobankb011     97                 4704                  95
UKBiobankb012     83                 4706                  89
UKBiobankb013     56                 4692                 106
UKBiobankb014    201                 4710                  82
UKBiobankb015    469                 4714                  71
UKBiobankb016    177                 4605                  87
UKBiobankb017    134                 4600                  91
UKBiobankb018    111                 4621                  93
UKBiobankb019    131                 4627                  72
UKBiobankb020     88                 4637                 119
UKBiobankb021    182                 4582                 175
UKBiobankb022     79                 4719                  40

Table S1 Number of genotyping plates and processed samples per batch for the interim UK Biobank data release. (These numbers exclude samples with low DNA quality but include intended/unintended duplicates and sample outliers.) The 11 UK BiLEVE batches, labelled b1 to b11, were genotyped on the UK BiLEVE Axiom array; the 22 UK Biobank batches, labelled b001 to b022, were genotyped on the UK Biobank Axiom array.

A1 Selecting samples with European ancestry for SNP QC

Here we describe the procedure to identify samples with European ancestry and thus construct the homogeneous subset used in computing SNP QC metrics. The procedure includes principal component analysis and two-way clustering.

We first downloaded 1000 Genomes data in Variant Call Format (VCF) and extracted 714,168 SNPs (no INDELs) that are also genotyped on the UK Biobank Axiom array. We selected 355 unrelated samples from the populations CEU, CHB, JPT and YRI, and then chose SNPs for principal component analysis using the following criteria:

• MAF ≥ 5% and HWE p-value > $10^{-6}$, in each of the populations CEU, CHB, JPT and YRI.
• Pairwise $r^2 \le 0.1$, to exclude SNPs in high LD. (The $r^2$ coefficient was computed using plink [12] and its ‘indep-pairwise’ function with a moving window of size 1000 bp.)
• Removed C/G and A/T SNPs to avoid unresolvable strand mismatches.
• Excluded SNPs in several regions with high PCA loadings (after an initial PCA).

With the remaining 40,538 SNPs we computed PCA loadings from the 355 1000 Genomes samples, then projected the UK Biobank samples onto the 1st and 2nd principal components. All computations were performed with Shellfish, http://www.stats.ox.ac.uk/~davison/software/shellfish/shellfish.php.

Finally, we applied an outlier detection algorithm (aberrant [14], with the lambda parameter set to 20) to isolate the largest cluster of samples from the rest, based on the two leading PCs. In UK Biobank, the largest cluster is composed of individuals with European ancestry.

A2 Testing for batch effects

The interim UK Biobank data release consists of 33 batches: there are 11 UK BiLEVE batches labelled b1, ..., b11 and 22 UK Biobank batches labelled b001, ..., b022. To perform a batch effect test, we compared the genotype counts in one batch to the genotype counts in other batches combined, using Fisher’s exact test. For concreteness and for a specific probeset, we write b1+b2 to mean $(n_{AA:b1} + n_{AA:b2},\ n_{AB:b1} + n_{AB:b2},\ n_{BB:b1} + n_{BB:b2})$, where $n_{AA:b1}$ is the number of called AA genotypes in batch b1. It is straightforward to generalise this notation to aggregate the genotype counts in multiple batches. Furthermore, after the initial, batch-specific QC by Affymetrix, all the calls in a batch might be set to missing, e.g., it might be the case that $n_{AA:b1} = n_{AB:b1} = n_{BB:b1} = 0$.

We used a two-test approach to check for calling consistency between the UK BiLEVE and UK Biobank batches. Suppose that we want to check that the genotypes in UK Biobank batch b001, for a specific probeset, are consistent with the genotypes in the other 32 batches, for the same probeset.

• Use Fisher’s exact test to compare b001 to b002+…+b022, i.e., check for batch effects within the UK Biobank batches.
• Use Fisher’s exact test to compare b001 to b1+…+b11, i.e., check for batch effects across the UK BiLEVE and UK Biobank batches.
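The two comparisons listed above can be sketched as follows. This is a minimal, brute-force illustration rather than the implementation used for the release, which the document does not specify: the helper names `aggregate` and `fisher_exact_2x3` and all genotype counts are hypothetical. The exact test enumerates every 2x3 table with the observed margins, so it is only practical for small counts.

```python
from math import comb
from typing import Sequence, Tuple

Counts = Tuple[int, int, int]  # (n_AA, n_AB, n_BB) for one probeset


def aggregate(batches: Sequence[Counts]) -> Counts:
    """Sum genotype counts over batches, i.e. the b1+b2+... notation."""
    return (sum(b[0] for b in batches),
            sum(b[1] for b in batches),
            sum(b[2] for b in batches))


def fisher_exact_2x3(batch: Counts, others: Counts) -> float:
    """Two-sided Fisher exact p-value for a 2x3 table of genotype counts.

    Sums the probabilities of all tables with the observed margins that
    are no more likely than the observed table.
    """
    cols = [batch[i] + others[i] for i in range(3)]
    n_batch = sum(batch)
    denom = comb(sum(cols), n_batch)

    def prob(a: int, b: int, c: int) -> float:
        # Multivariate hypergeometric probability of the first row (a, b, c).
        return comb(cols[0], a) * comb(cols[1], b) * comb(cols[2], c) / denom

    p_obs = prob(*batch)
    p_value = 0.0
    for a in range(min(cols[0], n_batch) + 1):
        for b in range(min(cols[1], n_batch - a) + 1):
            c = n_batch - a - b
            if c <= cols[2]:
                p = prob(a, b, c)
                if p <= p_obs * (1 + 1e-9):  # tolerance for floating-point ties
                    p_value += p
    return min(p_value, 1.0)


# Hypothetical counts for one probeset (not taken from the release).
b001 = (40, 45, 15)
other_biobank = aggregate([(38, 47, 15), (42, 44, 14)])  # stands in for b002+...+b022
bileve = aggregate([(41, 43, 16), (39, 46, 15)])         # stands in for b1+...+b11

p_within = fisher_exact_2x3(b001, other_biobank)  # within UK Biobank batches
p_across = fisher_exact_2x3(b001, bileve)         # across the two arrays
```

A batch in which all calls were set to missing has counts (0, 0, 0); in this sketch it yields a p-value of 1, since the only table compatible with the margins is the observed one.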
We performed the second test (the comparison across the two arrays) only for probesets that uniquely genotype a SNP. (There are SNPs that are genotyped using multiple probesets, for which Affymetrix recommended, separately for each batch, the best probeset to genotype the SNP.) If the p-values from the tests performed are smaller than the significance threshold used throughout, $10^{-12}$, then the calls (in batch b001 in the example above) are set to missing.

A3 Principal components analysis of UK Biobank samples

We characterised population structure unique to UK Biobank using PCA. First we selected a subset of SNPs from those that passed all QC filters in 33 out of 33 batches, using the following criteria:

• Minor allele frequency ≥ 2.5% and missingness ≤ 1.5%. (Checking that HWE holds in a subset of samples with European descent was part of the SNP QC procedures.)
• Pairwise $r^2 \le 0.1$, to exclude SNPs in high LD.
• Removed C/G and A/T SNPs to avoid unresolvable strand mismatches.
• Excluded SNPs in several regions with long-range LD [15]. (The list includes the MHC and 22 other regions.)

We also removed samples that were related to multiple other samples (to the 1st, 2nd or 3rd degree), one sample from each remaining related pair (chosen randomly), all twins, gender mismatches and samples with a high missing rate. These filters resulted in 101,284 SNPs for 141,070 samples. We used flashPCA [10] rather than Shellfish to compute loadings and principal components, because flashPCA, which uses an efficient randomised algorithm, is more scalable. Finally, in this computation, it is important to use only SNPs successfully genotyped in all batches; otherwise, differential patterns of missingness across batches mean that the major PCs will distinguish between batches, not between groups with distinct ancestry.

Figure S1 Genetic principal components in UK Biobank. This figure shows principal components PC5 to PC20, and it complements Figure 2, which shows principal components PC1 to PC4. PCs are plotted in pairs, from PC5 and PC6 in the top left panel, to PC19 and PC20 in the last panel on the 3rd row. In each panel, samples are coloured by self-reported ethnicity, using the same coloured symbols as in Figure 2.
The later principal components (PC16 to PC20) do not appear to distinguish any subsets in UK Biobank and only PC1 to PC15 are reported as part of the interim release.

A4 Accounting for the heterozygosity bias explained by population structure

Heterozygosity (computed from either autosomal or X-chromosome SNPs) is sensitive to population structure because of ascertainment bias: a majority of SNPs on the UK Biobank Axiom array were chosen to satisfy certain properties (imputation coverage, for example) in European populations. Here we describe the details of a regression model to adjust heterozygosity by accounting for the effects of population structure.

Let $h$ denote the heterozygosity and let $x$ be a set of features correlated with ancestry. We used the projections onto the four major UK Biobank principal components to characterise ancestry, writing $x = (x_1, x_2, x_3, x_4)$ for these four principal component values. Consider the following model for heterozygosity under population structure:

$$h(x) = h_0 + \beta(x)$$

where $h(x)$ is the raw heterozygosity, which depends on the features $x$, $h_0$ is the ancestry-adjusted heterozygosity and $\beta(x)$ is a bias term due to population structure. We chose a quadratic form for $\beta(x)$, which includes all linear and quadratic terms $x_i$ and $x_i^2$ as well as all cross terms $x_i x_j$, and we estimated $h_0$ with ordinary least squares. More specifically, the bias was assumed to have the following functional form:

$$\beta(x) = \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{33} x_3^2 + \beta_{44} x_4^2 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{14} x_1 x_4 + \beta_{23} x_2 x_3 + \beta_{24} x_2 x_4 + \beta_{34} x_3 x_4.$$

The fitted value $\hat{h}_0$ is the ancestry-corrected heterozygosity, plotted on the y-axis in Figure 3B (all ethnicities combined) and in Figure S2 (each predefined ethnic group separately).

A5 Detecting long runs of homozygosity

We used plink [12] to detect long ROHs (runs of homozygous genotypes), using the `homozyg-kb` command with a homozygous run required to span a distance of at least 1000 kb.

A6 Detecting familial relationships

To detect relatedness among UK Biobank individuals, we used the robust kinship coefficient estimator implemented in KING [13]. This estimator is robust to population structure and computationally practicable even on the scale of the UK Biobank cohort.
On the other hand, it is not reliable for samples with high heterozygosity or high missing rate, and a single poorly genotyped individual could lead to a cluster of inflated relationships [13]. Therefore, to minimise false positives in the detection of related samples, we excluded individuals using the following filters:

1. Individuals with self-reported ‘mixed’ ethnicity (which tends to increase heterozygosity) were excluded from the kinship inference. That is, individuals in one of the following categories of self-reported ethnic background (~700 individuals):
• Any other mixed background
• Mixed
• White and Asian
• White and Black African
• White and Black Caribbean

2. After inferring pairs that are related to 3rd degree or closer, we excluded pairs for which at least one of the pair had either of the following properties (~800 individuals):
• Heterozygosity (PC-adjusted) > 0.1951154 (equivalent to 1.28 standard deviations from the mean)
• Missing rate > 0.02

For every individual, a flag has been provided which indicates whether they have been excluded from kinship inference.

Figure S2 Ancestry-corrected heterozygosity and missingness, for each predefined ethnic group in UK Biobank. The axes are the same in every panel: heterozygosity after correcting for bias due to population structure on the y-axis, and logit-transformed missingness on the x-axis. The logit function is defined as logit(x) = log(x/(1-x)). The coloured symbols for each ethnicity are those used in the legend of Figure 3 (and throughout this document). In all panels, the black dotted line indicates the overall mean heterozygosity; in each panel, the coloured dashed line indicates the mean heterozygosity for the respective ethnicity. The individuals with mixed ancestry (particularly, those who self-identified as “White and Black African” or “White and Black Caribbean”) tend to have increased heterozygosity, even after correcting the bias due to population structure.
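The logit transform used for the missingness axis can be illustrated with a short Python sketch. It assumes that log denotes the natural logarithm, which is consistent with the reference values quoted in the caption of Figure 9; the function names are hypothetical.

```python
import math


def logit(x: float) -> float:
    """logit(x) = log(x / (1 - x)), natural logarithm, for 0 < x < 1."""
    if not 0.0 < x < 1.0:
        raise ValueError("logit is only defined for 0 < x < 1")
    return math.log(x / (1.0 - x))


def inverse_logit(y: float) -> float:
    """Map a logit-scale value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-y))


# Missingness (in percent) at the reference logit values quoted for Figure 9.
reference_percent = {y: 100.0 * inverse_logit(y) for y in (-8, -6, -4)}

# Position of the 0.02 missing-rate filter on the logit-scale x-axis.
missing_rate_cutoff_logit = logit(0.02)
```

Note that a sample or SNP with no missing calls at all has missingness 0, for which the logit is undefined, so such points need separate handling when plotting on this scale.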