Genotyping and quality control of UK Biobank, a large-scale, extensively
phenotyped prospective resource
Information for researchers
v1.2 Oct 2015
Interim Data Release 2015
1 Introduction
  1.1 UK Biobank
  1.2 Purpose of this document
  1.3 Data releases
  1.4 The UK Biobank Axiom genotyping array
  1.5 Overview of DNA extraction and genotyping
2 Additional quality control
  2.1 Our approach
  2.2 SNP QC
  2.3 Sample QC
  2.4 Summary
3 Properties of the UK Biobank genotype data in the interim release
  3.1 Properties of samples
  3.2 Properties of SNPs
References
Appendices
1 Introduction

1.1 UK Biobank

UK Biobank is a prospective cohort study of over 500,000 individuals from across the
United Kingdom. Participants, aged between 40 and 69, were invited to one of 22
centres across the UK between 2006 and 2010. Blood, urine and saliva samples were
collected, physical measurements were taken, and each individual answered an
extensive questionnaire focused on questions of health and lifestyle.

The resource will provide a picture of how the health of the UK population develops
over many years and it will enable researchers to improve the diagnosis and treatment
of common diseases [1].

A key goal of UK Biobank is to collect genetic data on every participant. This data,
combined with the extensive information about medical history and lifestyle choices,
will present an unparalleled opportunity to investigate how genetics and other factors
impact the onset and development of disease.

The UK Biobank resource is open to the research community and it will grow and
develop over time. Findings that use UK Biobank data must be fed back to UK Biobank
and made available to other researchers.
1.2 Purpose of this document

Here we describe the quality control (QC) procedures applied to the genotype data in
the interim UK Biobank data release, which contains ~150,000 samples genotyped at
~800,000 SNPs. We also describe characteristics of the released genotype data, both in
terms of content and quality. This document is relevant to researchers accessing and
using the genotype data available in the interim release. However, largely the same
procedures will be applied in future releases. We also briefly describe the UK Biobank
resource, the genotyping array, the sample storage and genotyping procedures,
although these are described in more detail in the references.
1.3 Data releases

The interim release of genotype data for UK Biobank comprises ~150,000 samples. Work
is ongoing on aspects of genotype calling that can utilise the scale of the project to
further improve the comprehensiveness of the genetic data. This means that a small
number of genotype calls in the interim release may change in subsequent releases. If
this occurs, information will be made available about which genotype calls have
changed, as a complement to the new genotype data.

Information about the likely timing and extent of future data releases is available from
the UK Biobank website, http://biobank.ctsu.ox.ac.uk.
1.4 The UK Biobank Axiom genotyping array

The UK Biobank Axiom array from Affymetrix was specifically designed by an expert
group for the purpose of genotyping the UK Biobank participants. Many researchers
contributed markers and data during the array design process. There are ~800,000
markers on the array (see [2] for more details).

Briefly, the array design philosophy was to:
• Add markers that are of particular interest because of known associations or
  possible roles in phenotypic variation.
• Add coding variants across a range of minor allele frequencies (MAFs), principally
  missense and protein truncating variants.
• Choose the remaining content to provide good genome-wide imputation coverage in
  European populations in the common (>5%) and low frequency (1-5%) MAF ranges.

The UK Biobank Axiom array is being used to genotype ~450,000 of the ~500,000 UK
Biobank participants. The other ~50,000 samples were genotyped on the closely related
UK BiLEVE array. The UK BiLEVE project, for which the UK BiLEVE array was designed,
aims to study the genetics of lung health and disease, and so those ~50,000 individuals
were selected based on lung function and smoking behaviour from participants with
self-declared European ancestry. Otherwise, the UK BiLEVE cohort and the rest of UK
Biobank differ only in small details of the DNA processing stage (e.g., UK BiLEVE samples
were manually transferred from storage to plates for DNA extraction).

The two SNP arrays are very similar, with over 95% common marker content. The UK
Biobank Axiom array is an updated version of the UK BiLEVE Axiom array, and it includes
additional novel markers (such as cancer-related markers), which replaced a small
fraction of the markers used for genome-wide coverage. The marker lists for both the
UK BiLEVE and the UK Biobank Axiom arrays are available as part of the UK Biobank
resource, and further details of the array design are available in the UK Biobank Axiom
Array content summary [2].

The ~50,000 samples genotyped on the UK BiLEVE Axiom array are included in the
interim release. Since the UK BiLEVE sampling scheme and array design are reported in
detail elsewhere [3], in the following sections we describe the DNA extraction and
genotyping of the other ~450,000 samples processed on the UK Biobank Axiom array.
A small number of variants (7,104) assayed on the array were known, or suspected, to
have more than two segregating alleles. Multi-allelic markers require special treatment
in array design and genotype calling. A number of these variants (3,690) are particularly
complicated and are not currently supported by the Affymetrix analysis pipeline; they
have been set to missing in all batches. The remaining 3,414 multi-allelic variants are
supported by Affymetrix, but care must be taken in the interpretation of the calls
provided, as a pair of calls (for the same individual) must be considered together to
reconstruct the actual genotype at the marker. The list of all multi-allelic markers, both
supported and unsupported by Affymetrix, is available to download. Furthermore,
researchers interested in multi-allelic markers can download either the array intensity
files (.cel files) or the processed intensity values, and undertake their own calling, QC
and analyses.

The custom-designed UK Biobank Axiom array attempts to assay a large number of SNPs
that have not been previously genotyped. As expected, a small number of markers
(~38,000, i.e., less than 5% of all markers present on the UK Biobank Axiom array)
exhibited sub-optimal and/or complex clustering patterns and hence were excluded
from all subsequent QC metrics and statistics, and their corresponding calls were set to
missing in the interim data release.
1.5 Overview of DNA extraction and genotyping

1.5.1 Sample storage and DNA extraction

The samples collected from participants are held at the UK Biobank facility in Stockport,
UK. Storage protocols for all samples require 850 µl stored in racks of 96 x 1.2 ml
microtubes, at either -80°C or -196°C (depending on sample type). Generally the racks
are populated with samples grouped by sample type, collection centre and collection
time. DNA is extracted from buffy coat samples, which (generally) make up 24 of every
96 tubes on the racks in storage. Samples are picked by robot to a 96-position
destination rack (a plate) ready for DNA extraction (94 samples per plate, leaving two
spaces for the addition of controls).

Given the unprecedented sample size of the cohort, special attention was given to
ensure that sources of sample collection or extraction variability and other
measurement errors do not systematically differ between cases and controls in any
future case-control studies. Attempts were made to avoid samples submitted for
analysis being grouped or submitted in a sequence which itself exhibits an underlying
trend. This was achieved via a sample selection algorithm that ensures a mixture of
collection centres on each destination rack [4]. During DNA extraction, the DNA
concentration and purity are assessed. Samples failing to meet defined thresholds are
not submitted for genotyping; where possible these samples are re-processed at a later
date. Further details of the UK Biobank sampling and DNA extraction procedures can be
found in [4,5].
1.5.2 Genotyping

Samples were genotyped at the Affymetrix Research Services Laboratory in Santa Clara,
California, USA. Upon receipt of a 96-well plate containing 94 UK Biobank samples,
Affymetrix added two control individuals (from 1000 Genomes) to the same well
positions on each plate: HG00097 to well A12 and HG00264 to well E12. See Affymetrix
laboratory process documentation for further details [6].

Axiom Array plates were processed on the Affymetrix GeneTitan® Multi-Channel (MC)
Instrument. Genotypes were then called from the resulting intensities in batches of
~4,700 samples (~4,800 including the controls) using the Affymetrix Power Tools
software and the Affymetrix Best Practices Workflow [7]. Supplementary Table S1 shows
the number of samples and plates per batch in the interim release (which includes the
11 UK BiLEVE batches and 22 UK Biobank batches, i.e. 11 batches genotyped on the UK
BiLEVE Axiom array and 22 batches genotyped on the UK Biobank Axiom array).

Individuals with the same genotype at any given SNP will cluster together in a
two-dimensional intensity space (one dimension for each targeted allele). Briefly, genotype
calling involved inferring properties of these clusters within each batch and assigning
each sample a genotype (or leaving the call missing) based on its position in intensity
space. For the interim data release, Affymetrix performed further rounds of genotype
calling using algorithms customised for the UK Biobank project. These algorithms
targeted very rare SNPs with 6 or fewer minor alleles in a batch, and a subset of SNPs
for which the generic calling algorithm did not perform optimally [8]. After genotype
calling, Affymetrix performed quality control in each batch separately, to exclude SNPs
with poor cluster properties. If a SNP did not meet the Affymetrix prescribed QC
thresholds in a given batch, it was set to missing in all individuals from that batch.
Affymetrix also checked sample quality (such as DNA concentration), and genotype calls
were provided only for samples with sufficient DNA metrics. More information about
the Affymetrix calling algorithms and quality control protocols is available in [6,7,8].
2 Additional quality control

2.1 Our approach

We undertook QC in several stages. First we used several SNP-based metrics to flag
SNPs with less reliable genotyping results, to be set to missing in the batches where they
failed our filters. Then we identified poor quality samples using only high quality SNPs
(defined as SNPs that passed QC filters in all 33 batches in this interim release). We also
performed other sample-based inference such as principal component analysis and
relatedness inference. Properties of UK Biobank (such as its large cohort size) mean that
some quality control metrics commonly used in genome-wide association studies
(GWAS) are not sufficient in this context. We used a variety of approaches in our QC
procedures to account for the effects of population structure and batch-based
genotyping, which we discuss below.
2.1.1 Diverse ancestries

UK Biobank consists of ~500,000 UK individuals. Participants were asked to choose from
a set of predefined ethnic categories, or ‘Other’, and ~470,000 reported their ethnicity
as ‘White’. Other individuals come from a wide variety of ethnic groups (Table 1).

Self-reported ethnicity              Representation (%)
White                                      94.06
  British                                  88.07
  Irish                                     2.63
  Any other white background                3.36
Asian                                       2.28
  Indian                                    1.18
  Pakistani                                 0.37
  Bangladeshi                               0.05
  Chinese                                   0.31
  Any other Asian background                0.37
Black                                       1.61
  African                                   0.68
  Caribbean                                 0.90
  Any other Black background                0.03
Mixed                                       0.59
  White and Asian                           0.17
  White and Black African                   0.08
  White and Black Caribbean                 0.12
  Any other mixed background                0.22
Other/Unknown                               1.46

Table 1. Self-reported ethnic groups in the ~500,000 UK Biobank participants. Of these, ~150,000 were
genotyped for the interim data release.
The inclusion of samples with diverse ancestry can confound standard QC metrics. For
instance, individuals with unusual heterozygosity are typically excluded from a GWAS,
but heterozygosity is correlated with ancestry as allele frequency distributions can vary
across populations. Similarly, testing that Hardy-Weinberg Equilibrium (HWE) holds is a
common approach for identifying poor quality SNPs, but departures from HWE can be
expected in the context of strong population structure, again because of differences in
allele frequency distributions.

To account for the effects of population structure, we proceeded in two phases. For
SNP-based QC metrics we used only individuals with similar ancestry (so that, for
example, HWE is expected). To do this we identified a set of individuals with European
ancestry by projecting individuals onto principal components computed from the 1000
Genomes project. We also characterised the population structure unique to UK Biobank
by computing principal components using only UK Biobank individuals (after applying
SNP QC). We used the UK Biobank-specific principal components analysis (PCA) results
to account for population structure in all our sample-based QC metrics.
2.1.2 Batch-based genotype calling

In view of UK Biobank’s large cohort size, Affymetrix carried out the genotyping and
initial SNP QC in batches of around 4,800 samples, effectively treating each batch as an
independent experiment. However, the availability of multiple batches, processed under
the same strict guidelines, provides new opportunities for SNP QC: we can check the
consistency of genotype calling between batches. In rare instances, the Affymetrix
calling algorithm might incorrectly call a SNP in one batch but not others.

Affymetrix assays genetic markers using “probesets” which target a particular variant. A
probeset is a set of probes whose signal is summarised to make the genotyping call. A
small fraction of variants (mostly those that are novel to the UK Biobank Axiom array)
are genotyped using multiple probesets, and in this case more than one call is made for
the same marker. For these markers Affymetrix recommends a single “best” probeset in
each batch separately, and the interim release includes only calls from the “best”
probesets. We did not use these markers in our sample QC analyses, as a different
probeset can be recommended for the same SNP across batches.
2.2 SNP QC

Due to the size of the UK Biobank cohort, genotyping was performed in a large number
of batches (33 batches of ~4,800 individuals for the interim data release). This provides
additional opportunities to study and ensure data consistency. Affymetrix routinely
undertakes SNP QC [7,8], and we adopted the Affymetrix recommendations throughout,
for each given batch. In addition, we performed quality checks that are appropriate for a
large-scale dataset genotyped in batches. For the reasons described above, we
computed all SNP QC metrics using a homogeneous subset of individuals drawn from
the largest ancestral group in the cohort (which is European in UK Biobank). To identify
these individuals, we projected UK Biobank samples on the two major principal
components computed by analysing the CEU, YRI, CHB and JPT populations from the
HapMap3 reference panel (with genotypes provided by 1000 Genomes, phase 1, release
v3). Then we selected samples that were projected in the neighbourhood of the CEU
cluster, as shown in Figure 1.
The UK BiLEVE batches have a higher proportion of samples with European ancestry by
design, as participants were selected in part based on self-declared ethnicity. In those
11 batches we used ~97% of samples for SNP QC. In the UK Biobank batches we used
91%-93% of samples for SNP QC, as these batches are more ethnically diverse. Appendix A1
describes the analysis we used to choose a homogeneous subset of samples for SNP QC.
In samples drawn from the same population we would not expect differences in
genotype frequencies, either between batches or between plates within a batch, at the
same marker. Such differences might indicate that the SNP was not genotyped as
accurately as other SNPs in the batch (or plate) which exhibits unusual genotype
frequencies. We refer to these cases as batch or plate effects. For example, batch
effects can occur when the sample intensities in one batch shift relative to the
intensities in other batches. In rare cases, such a shift can cause the Affymetrix calling
algorithm to miscall a genotype cluster in a way that is not detected by the routine
Affymetrix SNP QC. Similarly, plate effects can occur when the intensities in one plate
shift relative to the intensities in other plates in the same batch.
To look for effects in a particular batch we tested whether we could reject the null
hypothesis that the given batch has the same genotype frequencies as all other batches
combined. To look for effects in a particular plate we tested whether we could reject the
null hypothesis that the given plate has the same genotype frequencies as all other
plates, within the same batch, combined. In both cases we used Fisher’s exact test on
the 2×3 table of genotypes. (Since there are several plates in a batch, we performed
Fisher’s exact test for each plate that is at least half-full, i.e., with 48 samples or more,
and then took the smallest p-value.) See Appendix A2 for more details.
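
The batch test can be illustrated with a small, self-contained sketch. It assumes the 2×3 table has one row of genotype counts (AA, AB, BB) for the batch under test and one row for all other batches combined, and it uses the usual two-sided rule for exact tests on larger tables: sum the probabilities of all tables with the same margins that are no more likely than the observed one. The function name is hypothetical, and the exhaustive enumeration is only practical for small counts; it is not the implementation described in Appendix A2.

```python
from math import comb

def fisher_exact_2x3(row_test, row_rest):
    """Two-sided exact test on a 2x3 table of genotype counts.

    row_test: (n_AA, n_AB, n_BB) in the batch (or plate) under test.
    row_rest: (n_AA, n_AB, n_BB) in all other batches (or plates) combined.
    """
    col = [a + b for a, b in zip(row_test, row_rest)]  # column totals
    n_test = sum(row_test)                             # first-row total
    denom = comb(sum(col), n_test)

    def prob(x0, x1, x2):
        # Multivariate hypergeometric probability of a first row (x0, x1, x2)
        return comb(col[0], x0) * comb(col[1], x1) * comb(col[2], x2) / denom

    p_obs = prob(*row_test)
    p_value = 0.0
    for x0 in range(min(col[0], n_test) + 1):
        for x1 in range(min(col[1], n_test - x0) + 1):
            x2 = n_test - x0 - x1
            if x2 > col[2]:
                continue
            p = prob(x0, x1, x2)
            # Small relative tolerance so that ties with p_obs are counted
            if p <= p_obs * (1 + 1e-9):
                p_value += p
    return min(p_value, 1.0)

# A batch whose genotype counts differ sharply from the rest gives a small p-value
print(fisher_exact_2x3((3, 0, 0), (0, 3, 3)))
```

For plates, the same function would be applied to each sufficiently full plate against the other plates in its batch, keeping the smallest p-value as the text describes.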
We also performed an exact test for Hardy-Weinberg equilibrium for each batch [9].
Again, selecting a homogeneous subset of samples makes the procedure more
conservative, as Hardy-Weinberg equilibrium does not necessarily hold in the presence
of population structure.
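
As an illustration, one common formulation of an exact Hardy-Weinberg test conditions on the observed allele counts and sums the probabilities of all heterozygote counts that are no more likely than the observed one. The sketch below follows that formulation; whether it matches the procedure of [9] in every detail is an assumption, and the function name is hypothetical.

```python
from math import exp, lgamma, log

def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts, conditional on allele counts."""
    n = n_aa + n_ab + n_bb      # number of genotyped samples
    n_a = 2 * n_aa + n_ab       # copies of allele A
    n_b = 2 * n - n_a           # copies of allele B

    def log_prob(het):
        # log P(het | n, n_a) = log[ 2^het * n! / (hom_a! het! hom_b!)
        #                            * n_a! n_b! / (2n)! ]
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    lp_obs = log_prob(n_ab)
    p_value = 0.0
    # The heterozygote count must have the same parity as the allele counts
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(het)
        if lp <= lp_obs + 1e-9:
            p_value += exp(lp)
    return min(p_value, 1.0)

print(hwe_exact_p(25, 50, 25))  # counts at HWE proportions
print(hwe_exact_p(50, 0, 50))   # no heterozygotes at all
```

The second call shows how a complete absence of heterozygotes in 100 samples already falls far below a very strict p-value threshold.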
If a SNP failed any of these tests (that is, had a p-value of less than 10^-12), this might
indicate that the genotypes have not been called correctly in the corresponding batch,
and the SNP is flagged. For the current interim data release, genotypes at such flagged
SNPs were set to missing in batches where the tests suggested issues with the initial
calls. With the aim of improving genotype calling in subsequent data releases, SNPs that
were filtered out in at least one batch are the subject of ongoing advanced analysis
work by Affymetrix. Preliminary data generated by the Affymetrix advanced analysis
workflow indicates that a substantial number of SNPs flagged in the interim release will
be released in the final release.
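
The flagging rule can be written down compactly. This is a minimal sketch: the 10^-12 threshold comes from the text, while the function names, the dictionary layout and the use of `None` for a missing call are assumptions made for illustration.

```python
P_VALUE_THRESHOLD = 1e-12  # threshold stated in the text

def fails_snp_qc(p_batch, p_plate, p_hwe, threshold=P_VALUE_THRESHOLD):
    """True if any of the three tests has a p-value below the threshold."""
    return min(p_batch, p_plate, p_hwe) < threshold

def mask_failed_batches(calls_by_batch, pvalues_by_batch):
    """Set one SNP's calls to missing (None) in each batch where it is flagged.

    calls_by_batch:   {batch_id: list of genotype calls for this SNP}
    pvalues_by_batch: {batch_id: (p_batch, p_plate, p_hwe)}
    """
    masked = {}
    for batch_id, calls in calls_by_batch.items():
        if fails_snp_qc(*pvalues_by_batch[batch_id]):
            masked[batch_id] = [None] * len(calls)
        else:
            masked[batch_id] = list(calls)
    return masked

calls = {"b001": [0, 1, 2], "b002": [0, 0, 1]}
pvals = {"b001": (0.3, 0.2, 0.5), "b002": (1e-20, 0.4, 0.9)}
print(mask_failed_batches(calls, pvals))
```

Only the batch in which a test fails is masked, so a SNP can remain available in the other batches.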
Figure 1. We used 1000 Genomes data for four HapMap populations (CEU, CHB, JPT, YRI) to compute PCA
loadings for ~40,000 SNPs on the UK Biobank Axiom array. In the top left panel, these HapMap samples
are projected onto the 1st and 2nd principal components and are coloured by population. In the other
panels, all 11 UK BiLEVE batches (labeled b1 to b11) and an arbitrarily chosen subset of 8 UK Biobank
batches (labeled b001 to b008) are projected into the same principal component space. The samples are
coloured according to whether they were used in SNP QC procedures or not (in black and gray,
respectively). For each batch the proportion of samples used for SNP QC is also reported.
2.3 Sample QC

To carry out QC on samples, we first applied SNP QC (as described above) and selected a
set of high quality autosomal SNPs. The analyses described below are based on
~600,000 autosomal SNPs which are on both the UK Biobank and UK BiLEVE arrays, and
passed SNP QC in all 33 batches.

2.3.1 Population structure

To capture population structure specific to the UK Biobank cohort, we performed
principal component analysis of ~150,000 UK Biobank samples using ~100,000 SNPs.
These PCs can be used to identify samples with similar ancestry or to control for
population structure in association studies. Metrics for sample quality control can be
sensitive to population structure as well, so we used the principal components in the
process of identifying poor quality samples. The four major PCs are shown in Figure 2.
The next sixteen PCs (from PC5 to PC20) are shown in Figure S1, and details of the
analysis are presented in Appendix A3.
Figure 2. Genetic principal components in UK Biobank, computed from 141,0670 samples and 101,284
SNPs using flashPCA [10]. (A) The 1st principal component (PC1) on the x-axis and the 2nd principal
component (PC2) on the y-axis. (B) The 3rd principal component (PC3) on the x-axis and the 4th principal
component (PC4) on the y-axis. In both panels, samples are coloured according to self-reported ethnicity.
The legend indicates the coloured symbol used for each predefined ethnicity throughout this document.
2.3.2 Heterozygosity and missing rates

Extreme heterozygosity and/or low call rate can be indicators of poor sample quality
[11]. However, heterozygosity is sensitive to population structure because allele
frequency distributions (and thus heterozygosity) can differ between populations. Figure
3A shows the effect of SNP ascertainment on heterozygosity: since the UK Biobank array
was designed to provide good imputation coverage in European populations, samples
with non-European ethnicity tend to have lower heterozygosity. We control for this by
fitting a linear regression model with heterozygosity as the outcome and the four major
PCs as the predictors (see Appendix A4 for details). The corrected heterozygosity is
plotted in Figure 3B.
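
The regression step can be sketched in plain Python. The details of the actual correction are in Appendix A4, which is not reproduced here, so two things in this sketch are assumptions: the model is ordinary least squares with an intercept, and the corrected value is taken to be the fitted intercept plus each sample's residual. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x

def correct_heterozygosity(het, pcs):
    """Regress heterozygosity on PCs and return (corrected values, coefficients).

    het: list of per-sample heterozygosity values.
    pcs: list of per-sample tuples of principal component scores.
    """
    rows = [[1.0] + list(p) for p in pcs]  # design matrix with intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, het)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)   # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    # Assumed definition of "corrected": intercept plus residual
    corrected = [y - f + beta[0] for y, f in zip(het, fitted)]
    return corrected, beta

# Toy data in which heterozygosity depends on PC1 and PC2 only
pcs = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
       (1, 1, 0, 0), (0, 0, 0, 0), (-1, 0, 1, 0)]
het = [0.19 - 0.02 * p[0] + 0.01 * p[1] for p in pcs]
corrected, beta = correct_heterozygosity(het, pcs)
print([round(c, 4) for c in corrected])
```

In the toy data all ancestry-driven variation is removed, so every sample ends up with the same corrected value; with real data the residual spread is what the outlier detection works on.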
Some samples can have naturally extreme heterozygosity, even after accounting for
population structure. Specifically, individuals with mixed ethnicity tend to have higher
heterozygosity (which is not captured by the principal components), and individuals
whose parents are closely related tend to have lower heterozygosity. Therefore, we
attempted to flag as outliers samples whose extreme heterozygosity is not explained by
mixed ancestry or increased levels of marriage between close relatives.
Figure 3. Heterozygosity and missingness for 152,256 samples in the interim UK Biobank data release,
after removing 480 outliers. (Section 2.3.2 details the procedure to flag outliers.) Points are coloured by
self-reported ethnicity, using the coloured symbols in the legend of Figure 2. (A) Heterozygosity
(proportion of autosomal heterozygous calls) on the y-axis against logit-transformed missingness
(proportion of genotypes not called) on the x-axis. The logit transformation, defined as
logit(x) = log(x/(1-x)), is applied to normalise the missingness values. (B) Ancestry-corrected
heterozygosity on the y-axis against logit-transformed missingness on the x-axis. The heterozygosity
values are corrected for systematic differences due to population structure using four genetic principal
components, as described in Appendix A4.
After taking into account mixed ethnicity, we identified 472 outliers (0.3% of total
samples) with high missingness or high heterozygosity (plotted in red in Figure 4A), by
visually inspecting the scatterplots of heterozygosity and missingness for each
self-reported ethnicity (see Figure S2). To distinguish between poor quality samples and
samples with naturally low heterozygosity, we looked for long runs of homozygosity
(ROH). We computed the total length of long ROH using plink [12] (see Appendix A5 for
details), and identified 8 samples with total ROH that is unusually short compared to
other samples with similar heterozygosity (Figure 4B).

In total, we identified 480 samples (0.3% of total samples) with high missingness or for
which heterozygosity rates were not explained by ROH analysis or mixed ethnicity.
These samples are not excluded from the data release; instead, a list of outlier IDs for
these samples is provided to researchers along with the genotype data.
Figure 4. A total of 152,736 UK Biobank samples were genotyped for the interim data release. (Intended
and unintended duplicates are excluded from this count.) Of these, there are 480 outliers, shown in red;
the rest of the samples are shown in gray. (A) Ancestry-corrected heterozygosity on the y-axis and
logit-transformed missingness on the x-axis. This plot emphasizes that some outliers have high
missingness or high heterozygosity. (Samples with mixed ancestry tend to have increased heterozygosity
as well, but this is expected and such samples are not flagged as outliers based on heterozygosity alone.)
(B) Ancestry-corrected heterozygosity on the y-axis and total length (in kb) of long runs of homozygosity
(ROH) on the x-axis. This plot emphasizes that some outliers with low heterozygosity have unusually
short total ROH.
2.4 Summary

After QC procedures were applied, the interim UK Biobank data release contains
genotypes for 152,736 samples that passed sample QC (~99.9% of total samples), and
806,466 SNPs that passed SNP QC in at least one batch (>99% of the array content). As
noted above, Affymetrix is pursuing ongoing development work on genotype calling in
extremely large multi-batch settings. Therefore, some genotype calls may change
between this interim data release and the final data release, and we anticipate that the
various metrics will improve further.
3 Properties of the UK Biobank genotype data for Interim Release

The interim data release of UK Biobank genetic data consists of 152,736 samples. Of
those, 102,754 were genotyped on the UK Biobank array (split into 22 batches) and
49,982 were genotyped on the UK BiLEVE array (split into 11 batches). In addition to
computing principal components, we analysed several aspects of the interim release
data after quality control had been applied.
3.1 Properties of samples

3.1.1 Related individuals

We identified related samples by calculating kinship coefficients for all pairs of samples
using KING’s robust estimator [13]. We used this estimator as it is robust to population
structure and it is implemented in an algorithm efficient enough to consider all
n(n−1)/2 (~11,250,000,000) pairs in a practicable amount of time. Parent-child and full
sibling pairs have the same expected kinship coefficient but can be distinguished by
their IBS0 fraction, defined as the proportion of SNPs at which two samples have no
alleles in common (see Figure 5). We excluded some samples from the kinship
calculation because KING’s robust estimator is not reliable for individuals with high
heterozygosity or high missingness [13]. See Appendix A6 for details.
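
A short sketch of the pair count and of relationship classification from a kinship coefficient follows. The kinship cut-offs are the powers of two used by KING's published inference criteria (2^-1.5, 2^-2.5, 2^-3.5, 2^-4.5); the IBS0 cut-off used to separate parent-offspring pairs from full siblings is a hypothetical placeholder, since no value is given in this section, and the function names are illustrative only.

```python
def n_pairs(n_samples):
    """Number of unordered sample pairs, n(n-1)/2."""
    return n_samples * (n_samples - 1) // 2

def degree_from_kinship(kinship):
    """Map an estimated kinship coefficient to a relationship class."""
    if kinship > 2 ** -1.5:
        return "monozygotic twins"
    if kinship > 2 ** -2.5:
        return "1st degree"
    if kinship > 2 ** -3.5:
        return "2nd degree"
    if kinship > 2 ** -4.5:
        return "3rd degree"
    return "unrelated"

def first_degree_type(ibs0, ibs0_cutoff=0.001):
    """Split 1st degree pairs using IBS0 (cut-off is a hypothetical value).

    A parent and child share at least one allele at every SNP, so their IBS0
    fraction should be close to zero apart from genotyping errors.
    """
    return "parent-offspring" if ibs0 < ibs0_cutoff else "full siblings"

print(n_pairs(150_000))             # order of magnitude quoted in the text
print(degree_from_kinship(0.25))    # expected kinship for 1st degree relatives
print(first_degree_type(0.0002))
```

With 150,000 samples the pair count is close to the ~11.25 billion quoted above, which is why the efficiency of the estimator matters.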
We report only 1st, 2nd and 3rd degree relatives and monozygotic twins (Table 2).

Relationship   Monozygotic twins   Parent-offspring   Full siblings   2nd degree   3rd degree
Pairs          18                  619                2,183           1,061        5,811

Table 2. Related pairs (3rd degree or closer) for ~150,000 UK Biobank participants genotyped in the interim
UK Biobank data release. (The counts are derived from the kinship information presented in Figure 5.)
We detected 1,856 individuals that are related (to the 1st degree or as monozygotic
twins) to more than one person, and thus will occur in more than one pair in Table 2.
Seventy-two of these individuals are within a trio (child with two parents) in which
the sex and ages of both parents and the age of the child were consistent with
the inferred relationship. There are 6 instances of two siblings and a parent, and in one
of these the siblings are monozygotic twins. The others are individuals within sets of 3 or
4 siblings.
Figure 5. Close relationships for ~150,000 UK Biobank participants genotyped in the interim release. Each
point represents a pair of related individuals and the colours indicate the degree of relatedness:
monozygotic twins in black (in the upper left corner), 1st, 2nd and 3rd degree relatives in red, green and
blue, respectively. There are two groups of 1st degree relatives: parent-child pairs (red triangles) and full
siblings (red circles). For all pairs, the y-axis shows the kinship coefficient, defined as the probability that
two alleles sampled at random (one from each individual) are identical by descent. The x-axis shows the
proportion of zero identity-by-state (IBS0), defined as the proportion of SNPs at which one sample carries
the minor homozygote and the other sample carries the major homozygote, so that they share no alleles.
The degree of relatedness is inferred from the estimated kinship coefficient using KING’s criteria [13].
3.1.2 Sex mismatches

Affymetrix infers an individual's sex prior to genotype calling (but after measuring allele
intensities) so that it can use an appropriate algorithm to call SNPs on the sex-linked
chromosomes, X and Y. For this purpose, Affymetrix uses special probes for
non-polymorphic sites on the X and Y chromosomes, which produce large differences in
intensity between males and females. Self-reported sex (recorded at recruitment) and
genetically inferred sex are available for all samples. Out of the ~150,000 samples in the
interim release, the self-reported sex does not match the genetically inferred sex in 191
cases (0.1% of total samples).
There are three possible explanations for sex mismatches:
• Clerical error: either the DNA sample was associated with the wrong individual
  (mislabelling) or sex was recorded incorrectly at recruitment.
• Sex determined by chromosomal make-up does not match gender identity (and thus
  self-reported sex).
• Sex chromosome aneuploidy (i.e., an abnormal number of sex chromosomes, for
  example, XXY).

Analysis of the X and Y-chromosome average intensities (which are available to
download) can be used to identify instances of the third possible explanation. After the
interim release, UK Biobank intends to extract DNA (where possible) and reprocess
samples with unexplained gender mismatches.
Figure 6 reports two measures that can be used to infer gender. X-chromosome
heterozygosity is informative because males carry a single copy of the X chromosome
and thus cannot be heterozygous. The ratio of Y-chromosome to X-chromosome
average intensity is informative because females carry no copy of the Y chromosome
and thus their average Y intensity should be lower (not necessarily zero but at
background level). The two measures are not mutually redundant and can be used to
identify possible cases of sex chromosome aneuploidy. For example, samples with XXY
aneuploidy are expected to have female-like heterozygosity on the X chromosome, but
also have male-like intensity values for the Y chromosome. Such samples should either
be excluded from downstream analysis or used with caution, especially in conjunction
with their phenotypic data.
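
The way the two measures combine can be shown with a toy classifier. Both cut-off values below are hypothetical placeholders chosen only to make the example run; they are not thresholds used by Affymetrix or UK Biobank, and in practice the clusters would be read off a plot such as Figure 6.

```python
def classify_sex_signal(x_het, y_to_x_ratio, het_cutoff=0.05, ratio_cutoff=0.5):
    """Combine X heterozygosity and Y/X intensity ratio into a coarse label.

    het_cutoff and ratio_cutoff are hypothetical values for illustration.
    """
    female_like_het = x_het > het_cutoff            # two X copies can be heterozygous
    male_like_y = y_to_x_ratio > ratio_cutoff       # Y signal above background
    if female_like_het and not male_like_y:
        return "female-like"
    if not female_like_het and male_like_y:
        return "male-like"
    if female_like_het and male_like_y:
        return "possible XXY"
    return "ambiguous"

def is_mismatch(self_reported, label):
    """True if a clear genetic label disagrees with self-reported sex."""
    expected = {"female": "female-like", "male": "male-like"}
    return label != expected[self_reported]

label = classify_sex_signal(x_het=0.20, y_to_x_ratio=1.0)
print(label, is_mismatch("male", label))
```

Using both measures is what allows an XXY-like sample to be separated from an ordinary mismatch, since either measure alone would place it in one of the two main clusters.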
Figure 6. X-chromosome heterozygosity and ratio of Y-chromosome to X-chromosome average intensity,
for 152,736 UK Biobank samples. The X-chromosome heterozygosity is computed from all X-chromosome
SNPs outside the PAR regions. The intensity values are measured at the probes used for determining sex
prior to genotype calling. (A) Samples are coloured by gender: if the self-reported and genetically inferred
sex agree, then females are plotted in red and males in blue; otherwise, mismatches are plotted in black.
Points in the centre of the plot (separated from the blue and red clusters) are possible cases of XXY
aneuploidy. (B) The same points are coloured by self-reported ethnicity, using the coloured symbols in the
legend of Figure 2. X-chromosome heterozygosity exhibits ascertainment bias due to population
structure, similarly to autosomal heterozygosity. (Compare the systematic offset in heterozygosity
between samples with different ethnic background in this figure and in Figure 3A.)
3.2 Properties of SNPs

Figures 7, 8 and 9 illustrate various quality metrics and properties of SNPs genotyped on
the UK Biobank Axiom array, across multiple batches. Affymetrix processed and
genotyped the batches separately and we applied the same filters (the tests for batch or
plate effects and Hardy-Weinberg equilibrium described in Section 2.2), independently,
multiple times. Therefore, the number of times a SNP passed these filters is an
extremely strict measure of its genotype calling quality. This and the call rate are
reported in Figure 7.
Figure 7. Overall quality of the genotype data in the interim UK Biobank release, after all SNP QC steps
have been applied. (A) Number of batches in which a SNP is set to missing (out of 33 batches), for
common, low-frequency and rare SNPs genotyped on both the UK BiLEVE and UK Biobank Axiom arrays.
The shading indicates one of three minor allele frequency (MAF) categories of SNPs: common (MAF > 5%);
low frequency (5% > MAF > 1%); rare (MAF < 1%). MAFs in UK Biobank were estimated from samples with
inferred European ancestry. (B) SNP call rate for common, low frequency and rare SNPs combined.
The small peaks in the call rate in Figure 7B are due to SNPs set to missing in just a few
batches. For example, if a SNP did not pass a QC threshold in exactly one batch out of n
batches but otherwise has a high call rate in the remaining batches, its call rate is
~(n-1)/n. Since there are 33 batches in the interim release, there is a subset of SNPs with
call rate ~32/33 = 0.97 and a smaller subset of SNPs with call rate ~31/33 = 0.94.
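
The location of these peaks follows from simple arithmetic, sketched below under the simplifying assumptions that all batches are the same size and that the SNP is fully called in the batches where it passed QC. The function name is hypothetical.

```python
def approx_call_rate(n_batches, n_failed_batches):
    """Approximate call rate of a SNP set to missing in some batches.

    Assumes equal batch sizes and no missing calls in the passing batches.
    """
    return (n_batches - n_failed_batches) / n_batches

for failed in (0, 1, 2):
    print(failed, round(approx_call_rate(33, failed), 2))
```

Each additional failed batch moves a SNP down by roughly 1/33, or about three percentage points of call rate.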
Another measure of genotyping quality, reproducibility of calls, was assessed in two
controls from 1000 Genomes which were added to every plate (in the same well on
each plate) and were genotyped multiple times. Low discordance between calls for the
same individual across different plates indicates high quality genotyping. The
discordance for a particular SNP is computed as:

\[
\text{discordance} = 1 - \frac{\max\{n_{AA},\, n_{AB},\, n_{BB}\}}{n_{AA} + n_{AB} + n_{BB}}
\]

where $n_{AA}$, $n_{AB}$ and $n_{BB}$ are the numbers of times the genotypes AA, AB and BB
are called, respectively. For concreteness, suppose that $\max\{n_{AA}, n_{AB}, n_{BB}\} = n_{AA}$.
That is, AA is the most frequently called genotype and is therefore the consensus call. The
discordance is the proportion of calls that are not the consensus call; in the example, this is the
proportion of AB or BB calls. Figure 8 shows the discordance rates for the two 1000
Genomes controls. In both cases, there is a small number of SNPs with discordance >
0.05 (282 for HG00097 and 143 for HG00264, or 417 (0.05%) in total). These SNPs are
included in the interim release but the list can be downloaded. Some might be subject
to exclusion in the final release after further analysis has been performed.
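
The discordance measure is straightforward to compute from the three genotype counts of a control sample at one SNP. The sketch below is a direct transcription of the definition; the function name and the choice to raise an error when there are no calls are assumptions.

```python
def discordance(n_aa, n_ab, n_bb):
    """Proportion of repeated calls that differ from the consensus genotype."""
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("no genotype calls available for this SNP")
    return 1 - max(n_aa, n_ab, n_bb) / total

# A control genotyped on 100 plates: 95 AA calls, 4 AB calls, 1 BB call
d = discordance(95, 4, 1)
print(round(d, 4), d > 0.05)
```

A SNP with exactly 5 discordant calls out of 100 sits right at the 0.05 boundary used in the text, so in practice floating-point rounding should be kept in mind when applying a strict inequality.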
Figure 8. Rates of discordance from the consensus call, for the two 1000 Genomes controls genotyped
multiple times on the UK Biobank array. (A) Discordance for HG00097. (B) Discordance for HG00264.
Figure 9 shows the distributions of minor allele frequency and missingness, across SNPs
that passed all SNP QC filters in all 33 batches in the interim release.
Figure 9. Distributions of minor allele frequency and missingness across a set of 626,445 SNPs genotyped
on both the UK BiLEVE and UK Biobank Axiom arrays, which passed all SNP QC filters in the 33 batches of
the interim release. (A) Histogram of minor allele frequencies estimated from samples with inferred
European ancestry. The shading indicates one of three MAF categories: common SNPs with MAF > 5%; low
frequency SNPs with 5% > MAF > 1%; rare SNPs with MAF < 1%. (B) Histogram of logit-transformed
missingness for common, low frequency and rare SNPs combined. For reference, a logit value of -8
corresponds to 0.033% missingness; -6 to 0.247% missingness; -4 to 1.799% missingness.
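
The reference values in the caption can be reproduced by inverting the logit transformation, logit(x) = log(x/(1-x)). This is a small sketch for checking the conversion; the function names are not from the source.

```python
from math import exp, log

def logit(x):
    """Logit of a proportion x in (0, 1)."""
    return log(x / (1 - x))

def inverse_logit(z):
    """Proportion corresponding to a logit value z."""
    return 1 / (1 + exp(-z))

for z in (-8, -6, -4):
    print(z, f"{100 * inverse_logit(z):.4f}% missingness")
```

Reading the histogram this way, most of the mass between -8 and -4 corresponds to SNPs with missingness well below 2%.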
A small number of genotyped autosomal SNPs (65) have been found which show
significantly different allele frequencies between the UK BiLEVE array and the UK
Biobank array. These SNPs are in the interim data release but should be excluded from
analyses. A number (27) of these SNPs were used in phasing and imputation. We
strongly recommend conditioning on array in association tests to ameliorate the effect
of these SNPs. There could still be a subtle bias in the neighbourhood of these SNPs
after conditioning, but this will depend upon the phenotype being tested for
association. We recommend looking carefully at any results with imputed SNPs in the
regions of the affected SNPs, including confirming any GWAS hits with the
genotyped-only data and looking at cluster plots of the genotype data. Additionally, there
are a number of SNPs (46) on chromosome X which show a significant allele frequency
difference between males and females or show differences between arrays. We
recommend that these SNPs be excluded from all analyses. The full list of these markers
is available to download. These SNPs were identified as those with a p-value less than
10^-40 in a Fisher exact test on genotype counts.
References
[1] N. Allen, C. Sudlow, P. Downey, T. Peakman, J. Danesh, P. Elliott, J. Gallacher, J. Green, P.
Matthews, J. Pell, T. Sprosen, and R. Collins, “UK Biobank: Current status and what it means for
epidemiology,” Health Policy and Technology, 1(3):123-126, 2012.
[2] The UK Biobank Array Design Group, “UK Biobank Axiom array: content summary,” 2014.
http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-ContentSummary-2014.pdf
[3] L. V. Wain et al., “Novel insights into the genetics of smoking behavior, lung function and
chronic obstructive pulmonary disease in UK Biobank,” Submitted, 2015.
[4] UK Biobank, “Genotyping of 500,000 participants: Description of sample processing workflow
and preparation of DNA for genotyping,” 20 April 2015.
[5] UK Biobank, “DNA extraction at UK Biobank,” 2014.
http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/DNA-Extraction-at-UK-Biobank-October-2014.pdf
[6] Affymetrix, “UKB_WCSGAX: UK Biobank 500K Samples Processing by the Affymetrix Research
Services Laboratory,” April 2015.
[7] Affymetrix, “Axiom® Genotyping Solution Data Analysis Guide,” 2014.
http://media.affymetrix.com/support/downloads/manuals/axiom_genotyping_solution_analysis_guide.pdf
[8] Affymetrix, “UKB_WCSGAX: UK Biobank 500K Genotyping Data Generation by the Affymetrix
Research Services Laboratory,” April 2015.
[9] J. E. Wigginton, D. J. Cutler, and G. R. Abecasis, “A note on exact tests of Hardy-Weinberg
equilibrium,” The American Journal of Human Genetics, 76(5):887-893, 2005.
[10] G. Abraham and M. Inouye, “Fast principal component analysis of large-scale genome-wide
data,” PLoS ONE, 9(4):e93766, 2014.
[11] IMSGC and Wellcome Trust Case Control Consortium, “Genetic risk and the role of cell
mediated immune mechanisms in multiple sclerosis,” Nature, 476(7539):214-219, 2011.
[12] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. Ferreira, D. Bender, J. Maller, P. Sklar,
P. de Bakker, M. J. Daly, and P. C. Sham, “PLINK: A Tool Set for Whole-Genome and Population-Based
Linkage Analyses,” The American Journal of Human Genetics, 81(3):559-575, 2007.
[13] A. Manichaikul, J. C. Mychaleckyj, S. S. Rich, K. Daly, M. Sale, and W.-M. Chen, “Robust
relationship inference in genome-wide association studies,” Bioinformatics, 26(22):2867-2873,
2010.
[14] C. Bellenguez, A. Strange, C. Freeman, Wellcome Trust Case Control Consortium, P.
Donnelly, and C. C. Spencer, “A robust clustering algorithm for identifying problematic samples in
genome-wide association studies,” Bioinformatics, 28(1):134-135, 2012.
[15] A. L. Price et al., “Long-range LD can confound genome scans in admixed populations,” The
American Journal of Human Genetics, 83(1):132-135, 2008.
Appendices
The interim UK Biobank data release consists of 11 UK BiLEVE batches and 22 UK
Biobank batches, genotyped (in this order) by Affymetrix using calling algorithms specifically
adapted to the UK Biobank project [7,8].
Batch              Number of           Number of UK        Number of
                   genotyping plates   Biobank samples     control samples
UK BiLEVE b1        52                 4592                195
UK BiLEVE b2        53                 4598                186
UK BiLEVE b3        58                 4587                210
UK BiLEVE b4        52                 4601                196
UK BiLEVE b5        59                 4596                197
UK BiLEVE b6        61                 4573                216
UK BiLEVE b7        63                 4589                199
UK BiLEVE b8        53                 4593                202
UK BiLEVE b9        54                 4594                198
UK BiLEVE b10       59                 4597                184
UK BiLEVE b11       72                 4600                199
UK Biobank b001     52                 4710                 90
UK Biobank b002     74                 4657                134
UK Biobank b003     85                 4648                141
UK Biobank b004     91                 4652                142
UK Biobank b005     87                 4661                141
UK Biobank b006     64                 4689                113
UK Biobank b007     75                 4678                118
UK Biobank b008    186                 4755                 41
UK Biobank b009     73                 4693                104
UK Biobank b010     85                 4713                 85
UK Biobank b011     97                 4704                 95
UK Biobank b012     83                 4706                 89
UK Biobank b013     56                 4692                106
UK Biobank b014    201                 4710                 82
UK Biobank b015    469                 4714                 71
UK Biobank b016    177                 4605                 87
UK Biobank b017    134                 4600                 91
UK Biobank b018    111                 4621                 93
UK Biobank b019    131                 4627                 72
UK Biobank b020     88                 4637                119
UK Biobank b021    182                 4582                175
UK Biobank b022     79                 4719                 40
Table S1 Number of genotyping plates and processed samples per batch for the interim UK Biobank data
release. (These numbers exclude samples with low DNA quality but include intended/unintended
duplicates and sample outliers.) The 11 UK BiLEVE batches, labelled b1 to b11, were genotyped on the UK
BiLEVE Axiom array; the 22 UK Biobank batches, labelled b001 to b022, were genotyped on the UK
Biobank Axiom array.
A1 Selecting samples with European ancestry for SNP QC
Here we describe the procedure to identify samples with European ancestry and thus
construct the homogeneous subset used in computing SNP QC metrics. The procedure
includes principal component analysis and two-way clustering.
We first downloaded 1000 Genomes data in Variant Call Format (VCF) and extracted
714,168 SNPs (no INDELs) that are genotyped on the UK Biobank Axiom array as well.
We selected 355 unrelated samples from the populations CEU, CHB, JPT, YRI, and then
chose SNPs for principal component analysis using the following criteria:
• MAF ≥ 5% and HWE p-value > $10^{-6}$, in each of the populations CEU, CHB, JPT and YRI.
• Pairwise $r^2$ ≤ 0.1 to exclude SNPs in high LD. (The $r^2$ coefficient was computed using
  plink [12] and its ‘indep-pairwise’ function with a moving window of size 1000 bp.)
• Removed C/G and A/T SNPs to avoid unresolvable strand mismatches.
• Excluded SNPs in several regions with high PCA loadings (after an initial PCA).
With the remaining 40,538 SNPs we computed PCA loadings from the 355 1000
Genomes samples, then projected the UK Biobank samples onto the 1st and 2nd principal
components. All computations were performed with Shellfish,
http://www.stats.ox.ac.uk/~davison/software/shellfish/shellfish.php.
Finally, we applied an outlier detection algorithm (aberrant [14], with the lambda
parameter set to 20), to isolate the largest cluster of samples from the rest, based on
the two leading PCs. In UK Biobank, the largest cluster is composed of individuals with
European ancestry.
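One of the SNP selection criteria above, the removal of C/G and A/T SNPs, can be sketched in Python. Such SNPs are strand-ambiguous because the allele pair is unchanged by complementing (A/T reads as T/A on the opposite strand), so a strand flip between data sets cannot be detected from the alleles alone. The function name and the SNP identifiers below are hypothetical.

```python
# Allele pairs that are their own reverse complement.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele_a: str, allele_b: str) -> bool:
    """True for A/T and C/G SNPs, whose strand cannot be resolved from the alleles."""
    return frozenset((allele_a.upper(), allele_b.upper())) in AMBIGUOUS_PAIRS


# Hypothetical SNP list: (identifier, allele A, allele B).
snps = [
    ("snp_1", "A", "G"),
    ("snp_2", "A", "T"),
    ("snp_3", "G", "C"),
    ("snp_4", "C", "T"),
]

# Keep only SNPs whose strand can be resolved.
kept = [name for name, a, b in snps if not is_strand_ambiguous(a, b)]
```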
A2 Testing for batch effects
The interim UK Biobank data release consists of 33 batches: there are 11 UK BiLEVE
batches labeled b1, ..., b11 and 22 UK Biobank batches labeled b001, ..., b022. To
perform a batch effect test, we compared the genotype counts in one batch to the
genotype counts in other batches combined, using Fisher’s exact test. For concreteness
and for a specific probeset, we write b1 + b2 to mean
$(n_{AA:b1} + n_{AA:b2},\; n_{AB:b1} + n_{AB:b2},\; n_{BB:b1} + n_{BB:b2})$, where $n_{AA:b1}$ is the
number of called AA genotypes in batch b1. It is
straightforward to generalise this notation to aggregate the genotype counts in multiple
batches. Furthermore, after the initial, batch-specific QC by Affymetrix, all the calls in a
batch might be set to missing, e.g., it might be the case that $n_{AA:b1} = n_{AB:b1} = n_{BB:b1} = 0$.
We used a two-test approach to check for calling consistency between the UK BiLEVE
and UK Biobank batches. Suppose that we want to check that the genotypes in UK
Biobank batch b001, for a specific probeset, are consistent with the genotypes in the
other 32 batches, for the same probeset.
• Use Fisher’s exact test to compare b001 to b002 + … + b022, i.e., check for batch
  effects within the UK Biobank batches.
• Use Fisher’s exact test to compare b001 to b1 + … + b11, i.e., check for batch effects
  across the UK BiLEVE and UK Biobank batches.
We performed the second test (the comparison across the two arrays) only for
probesets that uniquely genotype a SNP. (There are SNPs that are genotyped using
multiple probesets for which Affymetrix recommended, separately for each batch, the
best probeset to genotype the SNP.) If the p-values from the tests performed are
smaller than the significance threshold used throughout, $10^{-12}$, then the calls (in batch
b001 in the example above) are set to missing.
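The count aggregation and the batch comparison described above can be sketched as follows. This is an illustration under stated assumptions: the software used for the actual tests is not specified in the text, the function names and counts are hypothetical, and the p-value is computed by direct enumeration of all 2×3 tables with the observed margins, which is practical only for small counts.

```python
from math import comb


def add_counts(*batches):
    """Aggregate (n_AA, n_AB, n_BB) triples across batches, e.g. b1 + b2."""
    return tuple(sum(column) for column in zip(*batches))


def fisher_exact_2x3(row1, row2):
    """Two-sided Fisher exact p-value for a 2x3 table of genotype counts.

    Sums the probabilities of all tables with the observed margins that are
    no more probable than the observed table.
    """
    cols = [a + b for a, b in zip(row1, row2)]
    n1 = sum(row1)
    denom = comb(sum(cols), n1)

    def prob(a, b):
        c = n1 - a - b
        return comb(cols[0], a) * comb(cols[1], b) * comb(cols[2], c) / denom

    p_obs = prob(row1[0], row1[1])
    p_value = 0.0
    for a in range(min(cols[0], n1) + 1):
        for b in range(min(cols[1], n1 - a) + 1):
            if n1 - a - b > cols[2]:
                continue
            p = prob(a, b)
            if p <= p_obs * (1 + 1e-7):  # tolerance for floating-point ties
                p_value += p
    return min(p_value, 1.0)


# Hypothetical genotype counts (n_AA, n_AB, n_BB) for one probeset.
b001 = (30, 0, 0)
other_batches = add_counts((0, 0, 14), (0, 0, 16))

p = fisher_exact_2x3(b001, other_batches)
set_to_missing = p < 1e-12  # significance threshold quoted in the text
```

A batch in which all calls are already missing gives counts (0, 0, 0) and a p-value of 1 in this sketch, so it is never flagged.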
A3 Principal components analysis of UK Biobank samples
We characterised population structure unique to UK Biobank using PCA. First we
selected a subset of SNPs from those that passed all QC filters in 33 out of 33 batches,
using the following criteria:
• Minor allele frequency ≥ 2.5% and missingness ≤ 1.5%. (Checking that HWE holds in
  a subset of samples with European descent was part of the SNP QC procedures.)
• Pairwise $r^2$ ≤ 0.1, to exclude SNPs in high LD.
• Removed C/G and A/T SNPs to avoid unresolvable strand mismatches.
• Excluded SNPs in several regions with long-range LD [15]. (The list includes the MHC
  and 22 other regions.)
We also removed samples who were related to multiple other samples (to the 1st, 2nd or
3rd degree), one sample from each remaining related pair (chosen randomly), as well as
removing all twins and gender mismatches and samples with a high missing rate. These
filters resulted in 101,284 SNPs for 141,070 samples. We used flashPCA [10] rather than
Shellfish to compute loadings and principal components, because flashPCA, which uses
an efficient randomised algorithm, is more scalable. Finally, in this computation, it is
important to use only SNPs successfully genotyped in all batches; otherwise, differential
patterns of missingness across batches mean that the major PCs will distinguish
between batches, not between groups with distinct ancestry.
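The relatedness-based sample removal described above can be sketched in Python. This assumes that the list of related pairs has already been inferred; the function name, sample identifiers and random seed are hypothetical, and the twin, gender-mismatch and missing-rate filters are not shown.

```python
import random
from collections import Counter


def prune_related(samples, related_pairs, seed=0):
    """Drop samples related to multiple others, then one random member of each remaining pair."""
    rng = random.Random(seed)

    # Count how many relatives each sample has.
    n_relatives = Counter()
    for a, b in related_pairs:
        n_relatives[a] += 1
        n_relatives[b] += 1

    # Step 1: remove samples related to multiple other samples.
    removed = {s for s, n in n_relatives.items() if n > 1}

    # Step 2: from each pair still intact, remove one member at random.
    for a, b in related_pairs:
        if a not in removed and b not in removed:
            removed.add(rng.choice((a, b)))

    return [s for s in samples if s not in removed]


# Hypothetical data: s1 is related to both s2 and s3; s4 and s5 form a pair.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
related_pairs = [("s1", "s2"), ("s1", "s3"), ("s4", "s5")]
kept = prune_related(samples, related_pairs)
```

Removing the highly connected samples first keeps more individuals overall: here dropping s1 alone resolves two relationships.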
Figure S1 Genetic principal components in UK Biobank. This figure shows principal components PC5 to
PC20, and it complements Figure 2, which shows principal components PC1 to PC4. PCs are plotted in
pairs, from PC5 and PC6 in the top left panel, to PC19 and PC20 in the last panel on the 3rd row. In each
panel, samples are coloured by self-reported ethnicity, using the same coloured symbols as in Figure 2.
The later principal components (PC16 to PC20) do not appear to distinguish any subsets in UK Biobank
and only PC1 to PC15 are reported as part of the interim release.
A4 Accounting for the heterozygosity bias explained by population structure
Heterozygosity (computed from either autosomal or X-chromosome SNPs) is sensitive to
population structure because of ascertainment bias: a majority of SNPs on the UK
Biobank Axiom array were chosen to satisfy certain properties (imputation coverage,
for example) in European populations. Here we describe the details of a regression
model to adjust heterozygosity by accounting for the effects of population structure.
Let $h$ denote the heterozygosity and let $x$ be a set of features correlated with ancestry.
We used the projections onto the four major UK Biobank principal components to
characterise ancestry, writing $x = (x_1, x_2, x_3, x_4)$ for these four principal component
values. Consider the following model for heterozygosity under population structure:

$$h(x) = h_0 + \beta(x)$$

where $h(x)$ is the raw heterozygosity, which depends on the features $x$, $h_0$ is the
ancestry-adjusted heterozygosity and $\beta(x)$ is a bias term due to population structure. We
chose a quadratic form for $\beta(x)$, which includes all linear and quadratic terms $x_i$ and $x_i^2$
as well as all cross terms $x_i x_j$, and we estimated $h_0$ with ordinary least squares. More
specifically, the bias was assumed to have the following functional form:

$$\beta(x) = \beta_{11}x_1^2 + \beta_{22}x_2^2 + \beta_{33}x_3^2 + \beta_{44}x_4^2 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_{12}x_1x_2 + \beta_{13}x_1x_3 + \beta_{14}x_1x_4 + \beta_{23}x_2x_3 + \beta_{24}x_2x_4 + \beta_{34}x_3x_4.$$

The fitted value $\hat{h}_0$ is the ancestry-corrected heterozygosity, plotted on the y-axis in
Figure 3B (all ethnicities combined) and in Figure S2 (each predefined ethnic group
separately).
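The regression above can be sketched with the Python standard library. This is an illustration only, with several assumptions: the adjusted heterozygosity is taken to be the raw value minus the fitted bias term, the least-squares fit uses the normal equations with a small hand-written solver, and all function names and the synthetic data are hypothetical.

```python
import random
from itertools import combinations


def quadratic_features(x):
    """Intercept, 4 linear, 4 squared and 6 cross terms for x = (x1, x2, x3, x4)."""
    squares = [v * v for v in x]
    cross = [a * b for a, b in combinations(x, 2)]
    return [1.0] + list(x) + squares + cross


def solve(a, b):
    """Solve the linear system a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, hs):
    """Ordinary least squares via the normal equations (X'X) w = X'h."""
    rows = [quadratic_features(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xth = [sum(r[i] * h for r, h in zip(rows, hs)) for i in range(k)]
    return solve(xtx, xth)


def adjusted_heterozygosity(x, h, coef):
    """Raw heterozygosity minus the fitted bias term (assumed definition; intercept kept)."""
    bias = sum(c * f for c, f in zip(coef[1:], quadratic_features(x)[1:]))
    return h - bias


# Synthetic, noise-free data: baseline 0.19 plus a bias that depends on the PCs.
rng = random.Random(1)
pcs = [tuple(rng.uniform(-1, 1) for _ in range(4)) for _ in range(200)]
raw = [0.19 + 0.02 * x[0] - 0.01 * x[1] ** 2 + 0.005 * x[0] * x[2] for x in pcs]

coef = fit_ols(pcs, raw)
adjusted = [adjusted_heterozygosity(x, h, coef) for x, h in zip(pcs, raw)]
```

With real data the adjusted values would scatter around the intercept rather than coincide with it; a numerical library would normally replace the hand-written solver.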
A5 Detecting long runs of homozygosity
We used plink [12] to detect long ROHs (runs of homozygous genotypes), using the
`homozyg-kb` command with a homozygous run required to span at least 1000 kb
distance.
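To make the length criterion concrete, the following is a simplified Python sketch of a run scan on one chromosome. It is not plink's algorithm, which applies additional criteria that are not reproduced here; the function name and data are hypothetical, and positions are assumed to be sorted base-pair coordinates.

```python
def long_homozygous_runs(positions_bp, is_homozygous, min_length_kb=1000):
    """Return (start, end) of uninterrupted homozygous runs spanning at least min_length_kb."""
    min_length_bp = min_length_kb * 1000
    runs = []
    start = None  # position of the first SNP in the current run
    last = None   # position of the most recent homozygous SNP
    for pos, hom in zip(positions_bp, is_homozygous):
        if hom:
            if start is None:
                start = pos
            last = pos
        else:
            # A heterozygous call ends the current run.
            if start is not None and last - start >= min_length_bp:
                runs.append((start, last))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and last - start >= min_length_bp:
        runs.append((start, last))
    return runs


# Hypothetical genotypes: a 1200 kb run, one heterozygous call, then a 100 kb run.
positions = [0, 400_000, 800_000, 1_200_000, 1_300_000, 1_400_000, 1_500_000]
homozygous = [True, True, True, True, False, True, True]
runs = long_homozygous_runs(positions, homozygous)
```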
A6 Detecting familial relationships
To detect relatedness among UK Biobank individuals, we used the robust kinship
coefficient estimator implemented in KING [13]. This estimator is robust to population
structure and computationally practicable even on the scale of the UK Biobank cohort.
On the other hand, it is not reliable for samples with high heterozygosity or high missing
rate, and a single poorly genotyped individual could lead to a cluster of inflated
relationships [13]. Therefore, to minimise false positives in the detection of related
samples we excluded individuals using the following filters:
1. Individuals with self-reported ‘mixed’ ethnicity (which tends to increase
heterozygosity) were excluded from the kinship inference. That is, individuals in one of
the following categories of self-reported ethnic background (~700 individuals):
   • Any other mixed background
   • Mixed
   • White and Asian
   • White and Black African
   • White and Black Caribbean
2. After inferring pairs that are related to 3rd degree or closer, we excluded pairs for
which at least one of the pair had either of the following properties (~800 individuals):
   • Heterozygosity (PC-adjusted) > 0.1951154 (equivalent to 1.28 standard
     deviations from the mean)
   • Missing rate > 0.02
For every individual a flag has been provided which indicates whether they have been
excluded from kinship inference.
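The second filter above can be sketched in Python. The two thresholds are those quoted in the text; the function names, the dictionary layout and the sample values are hypothetical.

```python
HET_THRESHOLD = 0.1951154  # PC-adjusted heterozygosity cut-off quoted in the text
MISSING_THRESHOLD = 0.02   # missing-rate cut-off quoted in the text


def is_unreliable(sample):
    """True if the sample exceeds either quality threshold."""
    return (sample["het_adjusted"] > HET_THRESHOLD
            or sample["missing_rate"] > MISSING_THRESHOLD)


def filter_pairs(pairs, samples):
    """Keep only inferred pairs in which neither member is unreliable."""
    return [(a, b) for a, b in pairs
            if not (is_unreliable(samples[a]) or is_unreliable(samples[b]))]


# Hypothetical per-sample QC metrics.
samples = {
    "id_a": {"het_adjusted": 0.1890, "missing_rate": 0.004},
    "id_b": {"het_adjusted": 0.1975, "missing_rate": 0.003},  # high heterozygosity
    "id_c": {"het_adjusted": 0.1902, "missing_rate": 0.031},  # high missing rate
    "id_d": {"het_adjusted": 0.1899, "missing_rate": 0.006},
}
inferred_pairs = [("id_a", "id_b"), ("id_a", "id_d"), ("id_c", "id_d")]
retained_pairs = filter_pairs(inferred_pairs, samples)
```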
Figure S2 Ancestry-corrected heterozygosity and missingness, for each predefined ethnic group in UK
Biobank. The axes are the same in every panel: heterozygosity after correcting for bias due to population
structure on the y-axis, and logit-transformed missingness on the x-axis. The logit function is defined as
logit(x) = log(x/(1-x)). The coloured symbols for each ethnicity are those used in the legend of Figure 3
(and throughout this document). In all panels, the black dotted line indicates the overall mean
heterozygosity; in each panel, the coloured dashed line indicates the mean heterozygosity for the
respective ethnicity. The individuals with mixed ancestry (particularly, those who self-identified as “White
and Black African” or “White and Black Caribbean”) tend to have increased heterozygosity, even after
correcting the bias due to population structure.