HISAT-genotype: fast software for analyzing human genomes on a personal computer

Daehwan Kim¹ and Steven L. Salzberg¹,²
¹ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
² Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA
Figure 1. Genotyping across the whole human genome. Highlighted are the HLA genes, BRCA1/2, and CYP2D6.
[Figure 1 panel title: HISAT-genotype: Genotyping Whole Genomes. Labeled genes: HLA genes, BRCA1, BRCA2, CYP2D6.]

•  Will provide template routines to incorporate genetic data into the HISAT-genotype platform
–  Databases for human genes (e.g. IMGT/HLA and ClinVar)
–  People with domain knowledge and expertise know how to make best use of the databases

Introduction
Advancements in sequencing technologies and computational methods have
enabled rapid and accurate identification of genetic variants in the human
population. Many large-scale projects such as the 1000 Genomes Project,
GTEx, and GEUVADIS have already yielded a large and growing amount of
information about human genetic variation, including >110 million SNPs (in
dbSNP) and >10 million structural variants (in dbVar). Although these variants
represent a valuable resource for genetic analysis, computational tools do not
adequately incorporate the variants into genetic analysis. For instance, >3,000
alleles of the HLA-A gene have been identified. Representing and searching
through the numerous alleles of even one gene can be a challenge, requiring a
large amount of compute time and memory. Most methods have therefore
focused on genotyping one or a few genes, because analyzing whole genomes
has been too formidable.

[Figure 3 panel labels: traditional approaches (above) vs. HISAT2’s graph-based approach (below). Above: Allele 1 of HLA-A, Allele 2 of HLA-A, through Allele 3,384 of HLA-A; read is mapped to many places (mapping ambiguity).]

For HLA-typing, traditional approaches use all individual HLA allelic sequences,
which introduces an enormous amount of redundancy. In contrast, we use a
graph-based approach to represent the alleles in a very space-efficient way by
incorporating only the differences among the alleles (Figure 3). Another
significant benefit of using a graph representation is that reads are typically
mapped to just one location, thereby avoiding the issue of mapping ambiguity
as is often the case with other, older approaches.
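
The contrast between storing every allele and storing one backbone plus differences can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions: the sequences, the allele names, and the helper functions `apply_variants` and `count_exact_hits` are hypothetical, and the code does not reproduce HISAT2's actual index, which the text describes only at a high level.

```python
from typing import Dict, List, Tuple

# A variant is (0-based position on the backbone, reference bases, alternate bases).
Variant = Tuple[int, str, str]


def apply_variants(backbone: str, variants: List[Variant]) -> str:
    """Rebuild a full allele sequence from the backbone and its variants."""
    seq, offset = backbone, 0
    for pos, ref, alt in sorted(variants):
        start = pos + offset
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"variant at {pos} does not match the backbone")
        seq = seq[:start] + alt + seq[start + len(ref):]
        offset += len(alt) - len(ref)
    return seq


def count_exact_hits(read: str, sequences: List[str]) -> int:
    """Count how many of the given sequences contain the read exactly."""
    return sum(read in s for s in sequences)


# Toy data (hypothetical, not real HLA sequence).
backbone = "ACGTACGTTAGCCGATAGGCTAACGT"
allele_variants: Dict[str, List[Variant]] = {
    "allele_1": [],
    "allele_2": [(8, "T", "C")],
    "allele_3": [(20, "T", "G")],
}
alleles = {name: apply_variants(backbone, v) for name, v in allele_variants.items()}

# Storage: every allele in full vs. one backbone plus only the differences.
traditional_size = sum(len(s) for s in alleles.values())
graph_size = len(backbone) + sum(
    len(ref) + len(alt) for v in allele_variants.values() for _, ref, alt in v
)

# A read without variants matches every full allele, but only one backbone.
read = "ACGTACGT"
hits_traditional = count_exact_hits(read, list(alleles.values()))
hits_backbone = count_exact_hits(read, [backbone])

print(traditional_size, graph_size, hits_traditional, hits_backbone)
```

In this toy case the redundancy grows with the number of alleles, while the backbone-plus-variants form grows only with the number of differences, which is the property the paragraph above relies on.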
Methods
To address these challenges, we have recently developed a novel indexing
scheme that captures a wide representation of genetic variants and has low
memory requirements. We have built a new alignment system, HISAT2
(http://www.ccb.jhu.edu/software/hisat2), that enables fast search through the
index. Rather than interrogating one gene at a time, HISAT2 has the ability to
genotype essentially all the genes on the human genome on a desktop within a
few hours (Figure 1). To demonstrate the capability of our initial genotyping
work, we chose one gene family, the Human Leukocyte Antigen genes (HLA-A,
HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1), which are among the
most diverse human genes. The IMGT/HLA database
(http://www.ebi.ac.uk/ipd/imgt/hla) encompasses >12,900 alleles of the HLA
gene family (Figure 2). In addition, we chose >33,000 variants (SNVs and
indels) from the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar) that are
known to be associated with genetic diseases. We incorporated these alleles
and variants into our index of the human genome, which in HISAT2 requires
only a small addition in computational resources.

Figure 3. Traditional approaches vs. HISAT2’s graph-based approach for
HLA typing. If a read (in red) has no variants, it will map to every allele of
HLA-A. In HISAT2 these are represented as a single “backbone” sequence.
[Figure 3 panel labels, below: backbone sequence; read is usually mapped to one place.]

Results
Tests on Illumina’s Platinum Genomes data showed that our method correctly
identifies all 204 alleles of the six HLA genes for these 17 genomes, at a speed
surpassing other currently available methods (Figure 4). We also analyzed
Michael Snyder’s genome (5X coverage, 160 million paired-end WGS reads,
SRR345300), which took 2 hours 45 minutes on a desktop. Tables 1 and 2
show the HLA alleles and 10 known genetic variants found in that genome.

Please note that these results should not be used for any diagnostic
assessment. Because our system works well for these highly diverse genes,
we anticipate it would be relatively straightforward to extend it to many, perhaps
all, known variants in human genes. Instead of genotyping one gene at a time,
HISAT-genotype will eventually allow for genotyping >20,000 genes within just a
few hours on a personal computer.

Table 1. Predicted HLA alleles for the Snyder genome.
HLA gene name    Two alleles
HLA-A            A*24:02:01:01 / A*24:02:01:02L
HLA-B            B*51:01:01:01 / B*35:95
HLA-C            C*14:23 / C*15:02:01:01
HLA-DQA1         DQA1*03:03:01:01 / DQA1*01:01:02
HLA-DQB1         DQB1*05:01:01:02 / DQB1*03:01:01:01
HLA-DRB1         DRB1*04:03:01 / DRB1*07:01:01:02
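
The 204 alleles reported in the Results are consistent with two alleles per gene for six HLA genes in 17 genomes (17 × 6 × 2 = 204). The sketch below checks that arithmetic and shows one possible way to score a predicted allele pair against a known pair. It is an illustration only: the function names, the two-field comparison resolution, and the truth labels in the usage lines are assumptions, not the authors' evaluation procedure or data.

```python
from collections import Counter
from typing import Iterable

GENES = ["A", "B", "C", "DQA1", "DQB1", "DRB1"]  # the six genes named in the text


def expected_allele_count(n_genomes: int, n_genes: int = len(GENES), ploidy: int = 2) -> int:
    """Total number of alleles to be typed across all genomes."""
    return n_genomes * n_genes * ploidy


def truncate(allele: str, fields: int = 2) -> str:
    """Keep only the first `fields` colon-separated fields, e.g. A*24:02:01:01 -> A*24:02."""
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])


def count_matches(predicted: Iterable[str], truth: Iterable[str], fields: int = 2) -> int:
    """Number of predicted alleles that match the truth, ignoring order within the pair."""
    p = Counter(truncate(a, fields) for a in predicted)
    t = Counter(truncate(a, fields) for a in truth)
    return sum((p & t).values())  # multiset intersection


print(expected_allele_count(17))

# Predicted pair taken from Table 1; the truth pair is made up for illustration only.
print(count_matches(["B*51:01:01:01", "B*35:95"], ["B*51:01", "B*35:01"]))
```

Accuracy over a cohort would then be the sum of matches divided by `expected_allele_count`, so 204 matches out of 204 corresponds to the 100% figure quoted for HISAT-genotype.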
Table 2. 10 selected pathogenic variants for the Snyder genome.
Gene      SNP ID         Type             Locus          # of reads (variant)   # of reads (reference)
TAT       rs3755319      SNV              2:233758935    4                      3
CLCN2     rs587777112    1-bp insertion   3:184357430    3                      5
SLC45A2   rs387906317    1-bp deletion    5:33954405     4                      3
NIPAL4    rs199422217    SNV              5:157468727    3                      5
IFNGR1    rs121912715    SNV              6:137215645    2                      3
GAA       rs796065332    11-bp deletion   7:117536652    2                      4
MBL2      rs1800450      SNV              10:52771474    4                      3
CP        rs61751507     SNV              10:100069756   3                      3
C2        rs794727302    4-bp deletion    12:101796718   2                      3
PTPN11    rs121918467    SNV              12:112486481   3                      3
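
The read counts in Table 2 can be summarized as the fraction of reads that support the variant at each site. The following is a minimal sketch: it assumes the first count in each row is the number of variant reads and the second the number of reference reads (the order of the column headers), and the function name `variant_allele_fraction` is hypothetical. It does not represent how HISAT-genotype itself calls variants.

```python
from typing import List, Tuple

# (gene, SNP ID, variant reads, reference reads), copied from Table 2.
TABLE2_COUNTS: List[Tuple[str, str, int, int]] = [
    ("TAT", "rs3755319", 4, 3),
    ("CLCN2", "rs587777112", 3, 5),
    ("SLC45A2", "rs387906317", 4, 3),
    ("NIPAL4", "rs199422217", 3, 5),
    ("IFNGR1", "rs121912715", 2, 3),
    ("GAA", "rs796065332", 2, 4),
    ("MBL2", "rs1800450", 4, 3),
    ("CP", "rs61751507", 3, 3),
    ("C2", "rs794727302", 2, 3),
    ("PTPN11", "rs121918467", 3, 3),
]


def variant_allele_fraction(variant_reads: int, reference_reads: int) -> float:
    """Fraction of reads covering the site that carry the variant."""
    total = variant_reads + reference_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return variant_reads / total


for gene, snp_id, var, ref in TABLE2_COUNTS:
    frac = variant_allele_fraction(var, ref)
    print(f"{gene:8s} {snp_id:12s} {var}/{var + ref} = {frac:.2f}")
```

With roughly 5X coverage each site is covered by only a handful of reads, so these fractions are coarse and are best read as supporting evidence rather than precise estimates.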
Figure 2. The number of HLA gene alleles has grown dramatically
over time (top). HLA nomenclature explained at bottom.
Nomenclature of HLA alleles, using HLA-DRB1*13:01:01:02 as the example:
HLA-DRB1   HLA locus
13         Allele group
01         Specific HLA protein
01         Synonymous DNA substitution within coding region
02         Differences in non-coding region
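
The nomenclature in Figure 2 maps directly onto the colon-separated fields of an allele name. As a minimal sketch, the parser below splits a name such as HLA-DRB1*13:01:01:02 into those fields. The class and function names are hypothetical, the optional trailing letter (as in A*24:02:01:02L from Table 1) is kept as an uninterpreted suffix, and names with fewer than four fields leave the missing fields as `None`.

```python
import re
from typing import NamedTuple, Optional


class HLAAllele(NamedTuple):
    locus: str                  # e.g. "DRB1"
    allele_group: str           # field 1
    protein: Optional[str]      # field 2: specific HLA protein
    synonymous: Optional[str]   # field 3: synonymous substitution in coding region
    noncoding: Optional[str]    # field 4: differences in non-coding region
    suffix: str                 # optional trailing letter, kept as-is


# Optional "HLA-" prefix, locus, "*", one to four numeric fields, optional letter.
_ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+){0,3})([A-Z]?)$")


def parse_hla_allele(name: str) -> HLAAllele:
    """Split an HLA allele name into the fields described in Figure 2."""
    match = _ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognized HLA allele name: {name!r}")
    locus, fields, suffix = match.groups()
    parts = fields.split(":") + [None] * 4  # pad so short names still unpack
    return HLAAllele(locus, parts[0], parts[1], parts[2], parts[3], suffix)


print(parse_hla_allele("HLA-DRB1*13:01:01:02"))
print(parse_hla_allele("B*35:95"))
```

Comparing alleles at a chosen resolution then reduces to comparing a prefix of these fields, for example `locus`, `allele_group`, and `protein` for two-field typing.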
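[Figure 4 plot: accuracy (%), axis from 40 to 100, shown for all genes combined (All) and for HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, and HLA-DRB1. Methods compared: bwakit, HLA-VBSeq, Salmon, Isas HLA, hisat-genotype.]
Figure 4. Results of HLA-typing on Illumina Platinum Genomes (17 individuals).
Only HISAT-genotype had 100% accuracy for all HLA types.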
Acknowledgements
This work is supported in part by the National Human Genome Research
Institute (NIH) under grants R01-HL129239 and R01-HG006677 to SLS.