institute of genomics
GENOMICS AND HPC: THE EXPERIENCE AND VISION OF THE INSTITUT DE GÉNOMIQUE
François Artiguenave, PhD, eMBA
DSV/IG/CNG

OUTLINE
• Presentation
• Perspectives: Personalized Bioinformatics
• Data Production
• Trends and data integration

INSTITUT DE GENOMIQUE / FRENCH GENOMICS FACILITY
DKFZ 13
Mission: To federate and stimulate genomics research in France

INSTITUTE OF GENOMICS
• Produce and analyze large amounts of sequence data from genomes of various origins (human, plants, bacteria, etc.)
• Activities include genome-wide association studies, pan-genomic expression profiling, epigenetic studies and whole genome sequencing
• Develop bioinformatics tools for genome sequence analysis, annotation and exploration
• Research on the genetics of human diseases through internal and collaborative research programs
• Exploration of prokaryote biological and biochemical diversity, leading to chemical and environmental applications
• Search for interactions between genes and environment, leading to the development of personalized medicine

CENTRE NATIONAL DE GENOTYPAGE
• Service and research center
• High-throughput technologies for human genetics
• Main European genotyping platform
• A sequencing facility
Missions
• Identify and understand the genetic and genomic causes of diseases
• Identify pertinent biomarkers for therapeutic innovation
• Manage scientific information and access to human genomics data

GENOTYPING
• GWAS: Omni 5, 4 samples / beadchip, ~5 000 000 markers / sample, 768 samples/week
• Exome genotyping: Human Exome, 12 samples / beadchip, > 240 000 markers / sample, 2304 samples/week
• GWAS + exome genotyping: HumanCoreExome, 12 samples / beadchip, > 500 000 markers / sample, 2304 samples/week
• Targeted genotyping: iSelect-12/-24, 12/24 samples / beadchip, 3 000 to 90 000 markers / sample, 2304/3456 samples/week
• Epigenetics: Methylation 450K, 12 samples / beadchip, > 450 000 methylation sites, 2304 samples/week

HIGH THROUGHPUT SEQUENCING
• 6000 Gb (raw data) in 2 weeks ≈ 400 Gb/day
• Diversity
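The sequencing throughput quoted above can be checked with a short calculation. The sketch below is illustrative only: the 6000 Gb and 2-week figures come from the slide, while the genome size (about 3.1 Gb) and the 30x target depth are assumptions introduced here to express the output as whole-genome equivalents. It also treats all raw output as usable, which real runs will not achieve.

```python
# Figures taken from the slide
RAW_OUTPUT_GB = 6000   # raw data produced per cycle, in gigabases
CYCLE_DAYS = 14        # "2 weeks"

# Assumptions introduced for illustration (not from the slides)
HUMAN_GENOME_GB = 3.1  # approximate human genome size, in gigabases
TARGET_DEPTH = 30      # 30x coverage per genome


def daily_throughput_gb(raw_gb: float, days: int) -> float:
    """Average raw output per day, in gigabases."""
    return raw_gb / days


def genomes_per_day(gb_per_day: float, genome_gb: float, depth: int) -> float:
    """Whole-genome equivalents per day at the given depth (upper bound)."""
    return gb_per_day / (genome_gb * depth)


per_day = daily_throughput_gb(RAW_OUTPUT_GB, CYCLE_DAYS)
genomes = genomes_per_day(per_day, HUMAN_GENOME_GB, TARGET_DEPTH)

print(f"{per_day:.0f} Gb/day")
print(f"{genomes:.1f} genome equivalents/day at {TARGET_DEPTH}x")
```

The average comes out at roughly 429 Gb/day, which the slide rounds to 400 Gb/day; under the stated assumptions this is at most four to five 30x human genomes per day.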
[Figure: Population studies, Functional Genomics, Translational Medicine, Mechanisms, HT Sequencing, Bioinformatics, Epigenetics, Diagnostic]

USES OF HPC IN GENOMICS
• Basic research: how genomes function ("genome biology"), epigenomics (also relevant to health).
• Biomedicine and health: molecular medicine and pathology, genetic causes of common multifactorial or rare diseases, cancer, diagnostics and predictive medicine; human microbiomes (metagenomics: digestive tract, airways, skin…), infectious diseases.
• Agronomy: livestock, plants.
• Biodiversity, environment: species inventory, evolution, variability / polymorphism, inventory of functions, gene families, metagenomics of natural environments (soils, waters…).

PERSPECTIVES OF PERSONALIZED MEDICINE
Human Genetic Variation
• Technologies: Genotyping, Genome, Haplotyping
• Data: Individual genomics (SNPs and mutations)
• Applications: Diagnosis, Pharmacogenetics, Individualised healthcare

BIOINFORMATICS & MEDICAL INFORMATICS
[Figure labels: Information; Gene Expression; DNA arrays; MS, 2D ef; Functional genomics; proteomics; Disease classification; Pharmacogenomics; Molecular causes of diseases; Molecular medicine]

NGS FOR ONCOLOGY?
Andrea Ferreira Gonzalez

BIOINFORMATICS STEPS FOR PERSONALIZED GENOME ANALYSIS
CEA | 10 April 2012
Genome Medicine 2012 4:61

SEQUENCING AND POLYMORPHISMS
• Sequencing. Software: Casava. QC metrics: reads quality
• Read mapping. Software: Bwa, Samtools, PicardTools, GATK. QC metrics: mapping report, duplicates, coverage…
• Variant calling. Software: GATK. QC metrics: variant quality (mapping, coverage), variant frequency
• Software: SNPeff, SNPsift, VAAST. Metrics: localisation, conservation, functional impact score
• Long term storage

POLYMORPHISM DETECTION PIPELINE
Steps: 1 Sequencing; 2 Reads mapping and SNP calling; 3 Crossmatch; 4 Analysis interface; 5 Results display
Reference data: dbSNP, HapMap, cross-sample controls
Intermediate results: variants, candidate SNPs
Genetic filters
• Not present in variant DB
• Not present in controls
• Segregate with disease
Functional filters
• Amino acid change, stop, frameshift
• Splicing site
• Expression data
• Evolutionary conservation
External databases: OMIM, Ensembl, UCSC, Geo
Pre-computation of results for each detected variant
CEA | November 2012

DEEP COVERAGE IMPROVES MUTATION DETECTION

SCALING UP WGS ANALYSIS
• Today: poorly automated; heterogeneous technology; no strong demand; X00 whole genomes
• Scaling up: algorithmic optimization; automation of "QC" and first processing steps; cybersecurity; personalized service
• Personalised medicine: Big Data; massive genome data; ~X0,000 genomes

NG HT-OMICS TECHNOLOGIES GIVE ACCESS TO "ALL" CELLULAR DIMENSIONS

INTEGRATIVE APPROACH
[Figure: for Samples 1 to 4 and GENE1 to GENE5, the classic approach makes a deterministic 0/1 decision per gene from the genome; the integrative approach weights and combines features (sequence, copy number, expression, shRNA) into a genome integrative score]

NETWORK-BASED CLASSIFICATION
• Integrating heterogeneous data
• "State" quantification and regulatory explanation
• Regulatory models and interaction models
• Identification of "pieces of regulatory networks" as biomarkers for predictive models
Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:40

VOLUME AND GROWTH OF GENOMICS DATA
• BGI, 2010: 1,000 Tflops, 10 PB
• CERN Large Hadron Collider (LHC): ~10 PB/year at start, ~1000 PB in ~10 years
[Figure: total bases (log scale, 1E+08 to 1E+17) by date, 2003 to 2017; 8-month doubling; reference level: 200k tumour/normal 30x genomes] (Cochrane G, pers. comm.)

CHALLENGE: DATA INTEGRATION
Station7
• Physico-chemical properties: salinity, pressure, fluorescence, nitrate…
• Description of living communities
• DNA characteristics

Vision for the European Nucleotide Archive

AN E-INFRASTRUCTURE TO ADDRESS THE DATA DELUGE CHALLENGE
• Cooling, energy, space, system administration and HPC technologies (parallel / distributed and hierarchical file systems, HPC…)
• France Génomique is an infrastructure dedicated to the production and analysis of genomic data; its storage and data processing needs are huge
http://www.france-genomique.org

TGCC: TRÈS GRAND CENTRE DE CALCUL DU CEA
• Ranked 20th worldwide (15th in the previous ranking)
• 120,000 cores, 2-3 Pflop/s
• Machine rooms: 2 x 1,300 m², utility areas 3,000 m²
• 7.5 MW today, extensible to 12 MW, water and air cooling
• Power line: 60 MW
• Serving research and industry
• Extension dedicated to the France-Génomique community: 3,000 cores and 5 PB (10x gain on exome applications)
• Resources can be pooled across all the supercomputers

WORKFLOW OVERVIEW
Stages: Data generation; Data storage and computation; Data presentation and exploration
• QC, production management
• Data transfer (300 GB/experiment)
• Raw data archiving (optional)
• Data processing, ongoing database construction (per-project specific processes)
• Portal and data publication through per-institute Web servers
• Large read-only data sets may remain at TGCC

PROCESSING CHARACTERISTICS
Two main categories of applications
• Assembly: graph traversal and construction of index tables, large memory, multithreaded; typically a few weeks, ~16-32 cores, address space from 100 GB to a few TB.
• Intrinsically parallel applications with no strong need for synchronization:
  - Adapt the data to these applications (pre-processing, splitting…),
  - But a fairly high memory/core ratio (8 GB/core today).
General points
• A rich and productive "ecosystem": 1,500 to 2,000 programs from 200 packages, a high update rate, a need for workflows
• Long compute times (> 24 hours) that are unpredictable, or at least highly variable
• Production and workflow management, user interfaces (control, reporting), evolving towards the integration of heterogeneous data (user interfaces for exploration)

HUGE NEED FOR HIGH-PERFORMANCE DATA ANALYTICS IN THE BIOMEDICAL AREA
[Figure labels: Biological information, Public Health Informatics, Medical Informatics, Medical Imaging, Bioinformatics]
• Heterogeneity of data: multi-omics, multi-technologies
• Multi-factorial causality (genetic, epigenetic, environmental)
• High number of variables, especially with the emergence of genomics and proteomics
• Interactions between variables are a major issue in addressing the wide diversity of patients and situations
• Data quality and homogeneity issues (missing data, noise, error margins, heterogeneous data formats)

NEW PARADIGM IN CLINICAL GENOMICS

François Artiguenave, CEA
Vincent Meyer, CEA
Nizar Touleimat, CEA
Christophe Battail, CEA
Lilia Mesrob, INSERM
Aurélie Leduc, CEA
Edith Le Floch, CEA
Xavier Benigni, CEA
Florian Sandron, CEA
Solène Julien
Claude Scarpelli

M2 NRBC, 15 January 2014, Univ Orsay

Commissariat à l’énergie atomique et aux énergies alternatives
Institut de Génomique, Genoscope
2 rue Gaston Crémieux, 91000 EVRY
T. +33 (0)1 60 87 25 00 | F. +33 (0)1 60 87 25 14
Public establishment of an industrial and commercial nature | RCS Paris B 775 685 019
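
SUPPLEMENTARY SKETCH: FILTER CASCADE
The genetic and functional filters listed on the polymorphism detection pipeline slide can be read as successive predicates applied to a list of variants. The sketch below is a minimal illustration under stated assumptions, not the pipeline actually used: the `Variant` record, its field names, the effect labels and the example data are all hypothetical, and the expression-data and evolutionary-conservation filters are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical record; field names are illustrative, not from the slides
    variant_id: str
    in_variant_db: bool            # already listed in a variant database
    in_controls: bool              # observed in control samples
    segregates_with_disease: bool  # co-occurs with the disease in the family
    effect: str                    # predicted effect label


# Effects kept by the functional filter, mirroring the slide:
# amino acid change, stop, frameshift, splicing site
FUNCTIONAL_EFFECTS = {"amino_acid_change", "stop", "frameshift", "splice_site"}


def passes_genetic_filters(v: Variant) -> bool:
    """Not in variant DB, not in controls, segregates with disease."""
    return (not v.in_variant_db
            and not v.in_controls
            and v.segregates_with_disease)


def passes_functional_filters(v: Variant) -> bool:
    """Keep only variants whose predicted effect is in the listed set."""
    return v.effect in FUNCTIONAL_EFFECTS


def candidate_variants(variants: list[Variant]) -> list[Variant]:
    """Apply the genetic filters, then the functional filters."""
    return [v for v in variants
            if passes_genetic_filters(v) and passes_functional_filters(v)]


# Made-up example data, for illustration only
variants = [
    Variant("var1", True, False, True, "stop"),
    Variant("var2", False, False, True, "frameshift"),
    Variant("var3", False, False, True, "synonymous"),
    Variant("var4", False, True, True, "amino_acid_change"),
]

candidates = candidate_variants(variants)
print([v.variant_id for v in candidates])
```

Only `var2` survives both stages in this made-up example. Keeping each filter as a separate predicate makes it straightforward to report how many variants each stage removes.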