HISAT-genotype: fast software for analyzing human genomes
HISAT-genotype: fast software for analyzing human genomes on a personal computer
Daehwan Kim1 and Steven L. Salzberg1,2
1 Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
2 Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA

Introduction
Advancements in sequencing technologies and computational methods have enabled rapid and accurate identification of genetic variants in the human population. Many large-scale projects such as the 1000 Genomes Project, GTEx, and GEUVADIS have already yielded a large and growing amount of information about human genetic variation, including >110 million SNPs (in dbSNP) and >10 million structural variants (in dbVar). Although these variants represent a valuable resource for genetic analysis, computational tools do not adequately incorporate the variants into genetic analysis. For instance, >3,000 alleles of the HLA-A gene have been identified. Representing and searching through the numerous alleles of even one gene can be a challenge, requiring a large amount of compute time and memory. Most methods have therefore focused on genotyping one or a few genes, because analyzing whole genomes has been too formidable.

Methods
To address these challenges, we have recently developed a novel indexing scheme that captures a wide representation of genetic variants and has low memory requirements. We have built a new alignment system, HISAT2 (http://www.ccb.jhu.edu/software/hisat2), that enables fast search through the index. Rather than interrogating one gene at a time, HISAT2 has the ability to genotype essentially all the genes on the human genome on a desktop within a few hours (Figure 1). To demonstrate the capability of our initial genotyping work, we chose one gene family, the Human Leukocyte Antigen genes (HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1), which are among the most diverse human genes. The IMGT/HLA database (http://www.ebi.ac.uk/ipd/imgt/hla) encompasses >12,900 alleles of the HLA gene family (Figure 2). In addition, we chose >33,000 variants (SNVs and indels) from the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar) that are known to be associated with genetic diseases. We incorporated these alleles and variants into our index of the human genome, which in HISAT2 requires only a small addition in computational resources.

For HLA-typing, traditional approaches use all individual HLA allelic sequences, which introduces an enormous amount of redundancy. In contrast, we use a graph-based approach to represent the alleles in a very space-efficient way by incorporating only the differences among the alleles (Figure 3). Another significant benefit of using a graph representation is that reads are typically mapped to just one location, thereby avoiding the issue of mapping ambiguity as is often the case with other, older approaches.

[Figure 1 panel title: HISAT-genotype: Genotyping Whole Genomes. Labeled loci: HLA genes, BRCA1, BRCA2, CYP2D6.]
Figure 1. Genotyping across the whole human genome. Highlighted are the HLA genes, BRCA1/2, and CYP2D6.
• Will provide template routines to incorporate genetic data into the HISAT-genotype platform
– Databases for human genes (e.g. IMGT/HLA and ClinVar)
– People with domain knowledge and expertise know how to make best use of the databases

[Figure 2 bottom panel: Nomenclature of HLA Alleles. Example HLA-DRB1*13:01:01:02: HLA-DRB1 = HLA locus; 13 = allele group; 01 = specific HLA protein; 01 = synonymous DNA substitution within coding region; 02 = differences in non-coding region.]
Figure 2. The number of HLA gene alleles has grown dramatically over time (top). HLA nomenclature explained at bottom.

[Figure 3 panel title: Traditional approaches (above) vs. HISAT2's graph-based approach (below). Above: Allele 1 of HLA-A, Allele 2 of HLA-A, ..., Allele 3,384 of HLA-A, each with its variants; read is mapped to many places (mapping ambiguity). Below: backbone sequence with variants; read is usually mapped to one place.]
Figure 3. Traditional approaches vs. HISAT2's graph-based approach for HLA typing. If a read (in red) has no variants, it will map to every allele of HLA-A. In HISAT2 these are represented as a single "backbone" sequence.

Results
Tests on Illumina's Platinum Genomes data showed that our method correctly identifies all 204 alleles of the six HLA genes for these 17 genomes, at a speed surpassing other currently available methods (Figure 4). We also analyzed Michael Snyder's genome (5X coverage, 160 million paired-end WGS reads, SRR345300), which took 2 hours 45 minutes on a desktop. Tables 1 and 2 show the HLA alleles and 10 known genetic variants found in that genome. Please note that these results should not be used for any diagnostic assessment. Because our system works well for these highly diverse genes, we anticipate it would be relatively straightforward to extend it to many, perhaps all, known variants in human genes. Instead of genotyping one gene at a time, HISAT-genotype will eventually allow for genotyping >20,000 genes within just a few hours on a personal computer.

[Figure 4 panel title: Results – Illumina Platinum Genomes (17 individuals). Y axis: Accuracy (%). X axis groups: All, A, B, C, DQA1, DQB1, DRB1. Methods compared: bwakit, HLA-VBSeq, Salmon, IsasHLA, hisat-genotype.]
Figure 4. Results of HLA-typing on Illumina Platinum Genomes (17 individuals). Only HISAT-genotype had 100% accuracy for all HLA types.

HLA gene name   Two alleles
HLA-A           A*24:02:01:01 / A*24:02:01:02L
HLA-B           B*51:01:01:01 / B*35:95
HLA-C           C*14:23 / C*15:02:01:01
HLA-DQA1        DQA1*03:03:01:01 / DQA1*01:01:02
HLA-DQB1        DQB1*05:01:01:02 / DQB1*03:01:01:01
HLA-DRB1        DRB1*04:03:01 / DRB1*07:01:01:02
Table 1. Predicted HLA alleles for the Snyder genome.

Gene      SNP ID        Type            Locus          # of reads (variant)   # of reads (reference)
TAT       rs3755319     SNV             2:233758935    4                      3
CLCN2     rs587777112   1-bp insertion  3:184357430    3                      5
SLC45A2   rs387906317   1-bp deletion   5:33954405     4                      3
NIPAL4    rs199422217   SNV             5:157468727    3                      5
IFNGR1    rs121912715   SNV             6:137215645    2                      3
GAA       rs796065332   11-bp deletion  7:117536652    2                      4
MBL2      rs1800450     SNV             10:52771474    4                      3
CP        rs61751507    SNV             10:100069756   3                      3
C2        rs794727302   4-bp deletion   12:101796718   2                      3
PTPN11    rs121918467   SNV             12:112486481   3                      3
Table 2. 10 selected pathogenic variants for the Snyder genome.

Acknowledgements
This work is supported in part by the National Human Genome Research Institute (NIH) under grants R01-HL129239 and R01-HG006677 to SLS.
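
Supplementary sketch (illustrative only). The contrast drawn in Figure 3, between storing every allele in full and storing one backbone plus only the differences, can be made concrete with a toy example. The Python below is a minimal sketch under stated assumptions: the sequences, variant positions, and all names (Variant, spell_allele, linear_hits, graph_hits) are hypothetical, it handles SNVs only, and it is not HISAT2's actual index or alignment algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int   # 0-based position on the backbone
    ref: str   # backbone base at pos
    alt: str   # alternative base (this toy handles SNVs only)


# Hypothetical toy data, not real HLA sequence.
BACKBONE = "GATTACAGGCTTAACCGTGA"
V1 = Variant(9, "C", "T")
V2 = Variant(15, "C", "G")
VARIANTS = [V1, V2]
ALLELES = {"allele1": [], "allele2": [V1], "allele3": [V1, V2]}


def spell_allele(backbone, variants):
    """Write out one full allele sequence by applying its SNVs to the backbone."""
    seq = list(backbone)
    for v in variants:
        assert backbone[v.pos] == v.ref, "variant does not match backbone"
        seq[v.pos] = v.alt
    return "".join(seq)


def linear_hits(read, allele_seqs):
    """Traditional view: exact-match the read against every allele separately."""
    hits = []
    for name, seq in allele_seqs.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            start = seq.find(read, start + 1)
    return hits


def graph_hits(read, backbone, variants):
    """Graph-like view: one backbone where each position allows ref or alt bases."""
    allowed = [{base} for base in backbone]
    for v in variants:
        allowed[v.pos].add(v.alt)
    hits = []
    for start in range(len(backbone) - len(read) + 1):
        if all(read[i] in allowed[start + i] for i in range(len(read))):
            hits.append(start)
    return hits


allele_seqs = {name: spell_allele(BACKBONE, vs) for name, vs in ALLELES.items()}

# Storage: every allele in full vs. backbone bases plus one record per variant.
linear_bases = sum(len(seq) for seq in allele_seqs.values())
graph_cost = len(BACKBONE) + len(VARIANTS)

print("stored, all alleles in full:", linear_bases)
print("stored, backbone + variants:", graph_cost)
print("variant-free read, per-allele hits:", linear_hits("GATTA", allele_seqs))
print("variant-free read, backbone hits:", graph_hits("GATTA", BACKBONE, VARIANTS))
print("variant-carrying read, backbone hits:", graph_hits("GGTTT", BACKBONE, VARIANTS))
```

In this toy, the three full alleles take 60 stored bases against 20 backbone bases plus two variant records. A read with no variants hits all three alleles in the per-allele search but only one backbone position in the graph-like search, which is the mapping ambiguity the poster describes.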