Pangenomics for eukaryotes

Sandra Smit, Siavash Sheikhizadeh Anari, Eric Schranz, Dick de Ridder

Motivation

Because of large genome sequencing projects, such as the 150 tomato genomes project (Aflitos et al. 2014), many plants and animals are no longer represented by a single reference genome, but rather by a group of related genomes. Therefore, we need approaches to efficiently compare all these genomes and to analyze novel data with respect to all of them. This requires a shift from reference-centric analyses to pangenome analyses.

Research questions

• How can we condense multiple annotated genomes for a (group of) species into a single representation that can be used to study genome variation and evolution?
• How can we mine a pangenome for information relevant to plant and animal breeding?

In large genome projects, the DNA of hundreds to thousands of members of an evolutionary clade is sequenced to study domesticated species and their wild relatives.

Approach

We are developing a pangenome representation that is:

• Complete: the pangenome representation is a graph structure that contains both the genome sequences and the annotation. The sequences are condensed in a compressed de Bruijn graph (green nodes), and annotations are added as nodes pointing to the start and end positions in the sequence. The nodes in the database are objects with properties, such as the occurrences of the sequence in the original genomes.
• Scalable and efficient: to overcome memory limitations, the constructed graph is stored in a Neo4j graph database. The graph can be built in memory or directly in the database. Genomes can be added to or removed from the graph. Our solution scales to eukaryotic genomes, as the database grows linearly. The pangenome can be queried efficiently using indices built on top of the graph database.
• Usable for comparative genomics: the inclusion of the annotation makes the graph usable for biological analyses. Several relationships, such as homology, orthology, or identity, can be used to group annotations (e.g. genes). We have implemented the retrieval of genes and genomic regions. For example, it is very fast to retrieve all FRIGIDA genes in the Arabidopsis pangenome and align them. Furthermore, we are working towards analyzing synteny and structural variation, read mapping, and visualization.

Figure: A pangenome representation of two HIV genomes. The graph contains multiple types of nodes.
A) This pangenome contains two genomes (blue), each with one sequence (purple).
B) Two vpu genes (red) are annotated and grouped by gene name (yellow).
C) A SNP between two pol genes.
D) Two annotated pol genes (red), which start and end at different k-mer nodes (green).

Performance

93 yeast genomes
• 2.5 hours
• 27 GB of memory
• 1.1 GB database

19 Arabidopsis genomes
• 5.5 hours
• 61 GB of memory
• 5.3 GB database

The database grows linearly.

Conclusions

Pangenome analyses will likely play an important role in future comparative genomics. Our proposed pangenome representation, in combination with the appropriate database indices, could facilitate many different analyses. Storing the pangenome in a graph database makes it scalable to eukaryotic genomes and allows for the inclusion of genome annotations.

References

Aflitos et al. Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing. Plant J (2014)

Contact

Sandra Smit
Bioinformatics, Plant Sciences Group
Wageningen University
[email protected]
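
As an illustration of the compressed de Bruijn graph described under Approach, the following minimal Python sketch builds k-mer nodes from a set of sequences, merges non-branching chains of k-mers into single nodes, and records for each node where it occurs in the original genomes. It is not the implementation behind the poster: the function names (`build_kmer_graph`, `compress`, `compressed_de_bruijn`), the in-memory dictionaries, and the choice to ignore reverse complements are all assumptions made for the example, and nothing here touches Neo4j.

```python
from collections import defaultdict


def build_kmer_graph(genomes, k):
    """Collect k-mer nodes, edges and occurrences (forward strand only)."""
    succ, pred, occ = defaultdict(set), defaultdict(set), defaultdict(list)
    starts, ends = set(), set()  # k-mers that begin or end some sequence
    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not kmers:
            continue  # sequence shorter than k: nothing to add
        starts.add(kmers[0])
        ends.add(kmers[-1])
        for i, kmer in enumerate(kmers):
            occ[kmer].append((name, i))  # (genome name, 0-based start)
            if i > 0:
                succ[kmers[i - 1]].add(kmer)
                pred[kmer].add(kmers[i - 1])
    return succ, pred, occ, starts, ends


def compress(succ, pred, occ, starts, ends):
    """Merge non-branching chains of k-mers into single sequence nodes."""

    def mergeable(a, b):
        # a and b collapse into one node only if a -> b is the sole way out
        # of a and the sole way into b, and no sequence ends at a or starts
        # at b (assumption: this keeps each node's occurrence list complete).
        return (len(succ.get(a, ())) == 1 and len(pred.get(b, ())) == 1
                and a not in ends and b not in starts)

    def is_head(kmer):
        parents = pred.get(kmer, ())
        return not (len(parents) == 1
                    and mergeable(next(iter(parents)), kmer))

    chains = {}  # head k-mer -> (merged sequence, last k-mer of the chain)
    for head in occ:
        if not is_head(head):
            continue
        seq, cur = head, head
        while len(succ.get(cur, ())) == 1:
            nxt = next(iter(succ[cur]))
            if not mergeable(cur, nxt):
                break
            seq += nxt[-1]  # each further k-mer adds one base
            cur = nxt
        chains[head] = (seq, cur)

    graph = {}
    for head, (seq, last) in chains.items():
        graph[seq] = {
            "occurrences": sorted(occ[head]),
            # every successor of a chain's last k-mer is itself a chain head
            "successors": sorted(chains[n][0] for n in succ.get(last, ())),
        }
    return graph


def compressed_de_bruijn(genomes, k):
    return compress(*build_kmer_graph(genomes, k))


if __name__ == "__main__":
    demo = {"g1": "GATTACAG", "g2": "GATTGCAG"}  # toy sequences with one SNP
    for node, props in compressed_de_bruijn(demo, 3).items():
        print(node, props)
```

With the two toy sequences above and k = 3, the shared prefix and suffix each collapse into one node that lists both genomes in its occurrences, while the SNP produces two alternative nodes between them. In a graph database, each dictionary entry would correspond to a node with properties and each successor to a relationship; how that mapping is actually done is not specified in the poster.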