institute of genomics

GENOMICS AND HPC:
THE EXPERIENCE AND VISION OF
THE INSTITUT DE GÉNOMIQUE
François Artiguenave, PhD, eMBA
DSV/IG/CNG
OUTLINE
Presentation
Perspectives: Personalized Bioinformatics
Data Production Trends and data integration
INSTITUT DE GENOMIQUE / FRENCH GENOMICS FACILITY
DKFZ
Mission: To federate and stimulate genomics research in France
INSTITUTE OF GENOMICS
• Produce and analyze large
amounts of sequence data from
genomes of various origins
(human, plants, bacteria, etc)
• Activities include whole genome
association studies, pan-genomic
expression profiling, epigenetic studies
and whole genome sequencing
• Develop bioinformatic tools for
genome sequence analysis,
annotation and exploration
• Research on the genetics of human
diseases through internal and
collaborative research programs
• Exploration of prokaryote
biological and biochemical
diversity => chemical and
environmental applications
• Search for gene-environment
interactions =>
development of personalized
medicine
CENTRE NATIONAL DE GENOTYPAGE
• Service and research center
• High throughput technologies for human genetics
• Main European genotyping platform
• A sequencing facility
Missions
• Identify and understand the genetic and genomic causes of diseases
• Identify pertinent biomarkers for therapeutic innovation
• Manage scientific information and access to Human Genomics Data
GENOTYPING
GWAS: Omni 5, 4 samples / beadchip, ~5 000 000 markers / sample, 768 samples/week
Exome Genotyping: Human Exome, 12 samples / beadchip, > 240 000 markers / sample, 2304 samples/week
GWAS + Exome Genotyping: HumanCoreExome, 12 samples / beadchip, > 500 000 markers / sample, 2304 samples/week
Targeted Genotyping: iSelect-12/-24, 12/24 samples / beadchip, 3 000 to 90 000 markers / sample, 2304/3456 samples/week
Epigenetics: Methylation 450K, 12 samples / beadchip, > 450 000 methylation sites, 2304 samples/week
HIGH THROUGHPUT SEQUENCING
6000 Gb (raw data) in 2 weeks, i.e. roughly 400 Gb/day
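As a quick check of the throughput figure above, here is a minimal Python sketch. The 6000 Gb and two-week values come from the slide; the human genome size (about 3.1 Gb) and the 30x depth are assumptions added only to give a sense of scale.

```python
RAW_GB = 6000   # Gb of raw sequence per production period (from the slide)
DAYS = 14       # two weeks

# Average daily output; the slide rounds this down to 400 Gb/day.
daily_gb = RAW_GB / DAYS

# Assumptions for scale only (not from the slide):
GENOME_GB = 3.1   # approximate size of a human genome, in Gb
COVERAGE = 30     # sequencing depth commonly used for whole genomes

# Number of 30x human genomes that this daily output would represent.
genomes_per_day = daily_gb / (GENOME_GB * COVERAGE)

print(f"{daily_gb:.0f} Gb/day")
print(f"{genomes_per_day:.1f} genomes/day at {COVERAGE}x")
```

Under these assumptions the daily output is of the order of a few 30x human genomes per day.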
Diversity
Population studies
Functional Genomics
Translational Medicine
Mechanisms
HT Sequencing
Bioinformatics
Epigenetics
Diagnostic
HPC USES IN GENOMICS
basic research
how genomes function ("genome biology"),
epigenomics (also health).
biomedicine and health
molecular medicine and pathology,
genetic causes of common multifactorial or rare diseases,
cancer,
diagnosis and predictive medicine.
human microbiomes (metagenomics: digestive tract, airways, skin…),
infectious diseases.
agronomy
livestock,
plants.
biodiversity, environment
species inventory, evolution, variability / polymorphism,
inventory of functions, gene families,
metagenomics of natural environments (soils, waters…).
OUTLINE
Presentation
Personalized Bioinformatics
Data Production Trends and data integration
PERSPECTIVES OF PERSONALIZED MEDICINE
Human Genetic Variation
Technologies
Genotyping
Genome
Haplotyping
Data
Individual genomics (SNPs and mutations)
Applications
Diagnosis
Pharmacogenetics
Individualised healthcare
BIOINFORMATICS & MEDICAL INFORMATICS
Information
Gene Expression
DNA arrays
MS, 2D electrophoresis
Functional genomics, proteomics
Disease classification
Pharmacogenomics
Molecular causes of diseases
Molecular medicine
NGS FOR ONCOLOGY?
Andrea Ferreira Gonzalez
BIOINFORMATICS STEPS FOR PERSONALIZED GENOME ANALYSIS
CEA | 10 April 2012
Genome Medicine 2012 4:61
SEQUENCING AND POLYMORPHISMS
Step: Sequencing. Software: Casava. QC metrics: reads quality.
Step: Read mapping. Software: BWA, SAMtools, Picard Tools, GATK. QC metrics: mapping report, duplicates, coverage…
Step: Variant calling. Software: GATK. QC metrics: variant quality (mapping, coverage), variant frequency.
Software: SnpEff, SnpSift, VAAST. QC metrics: localisation, conservation, functional impact score.
Long-term storage
Polymorphism detection pipeline
1. Sequencing
2. Reads mapping and SNP calling (candidate SNPs, variants)
3. Crossmatch against dbSNP, HapMap and cross-sample controls
Genetic filters
• Not present in variant DB
• Not present in controls
• Segregate with disease
Functional filters
• Amino acid change, stop, frameshift
• Splicing site
• Expression data
• Evolutionary conservation
4. Analysis interface
5. Results display (OMIM, Ensembl, UCSC, GEO)
Pre-computation of results for each detected variant
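The genetic and functional filters listed for this pipeline can be sketched as a few set operations. This is a minimal illustration, not the institute's implementation: the `Variant` record, the effect labels and the `filter_candidates` helper are hypothetical names, and "segregate with disease" is approximated as "present in every affected individual and in no unaffected relative".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    effect: str  # hypothetical labels: "missense", "stop_gained", "frameshift", ...

# Functional filter: effects assumed to change the protein or its splicing.
DAMAGING_EFFECTS = {"missense", "stop_gained", "frameshift", "splice_site"}

def key(v):
    """Identity of a variant, independent of its annotation."""
    return (v.chrom, v.pos, v.ref, v.alt)

def filter_candidates(affected, unaffected, known_db, controls):
    """affected / unaffected: one list of Variant per individual.
    known_db / controls: sets of variant keys to exclude."""
    # Segregation: shared by all affected, absent from unaffected relatives.
    shared = set.intersection(*(set(map(key, calls)) for calls in affected))
    in_unaffected = set().union(*(set(map(key, calls)) for calls in unaffected))
    annotated = {key(v): v for v in affected[0]}
    candidates = []
    for k in sorted(shared):
        if k in known_db or k in controls or k in in_unaffected:
            continue  # genetic filters
        if annotated[k].effect not in DAMAGING_EFFECTS:
            continue  # functional filter
        candidates.append(annotated[k])
    return candidates

# Toy family: two affected individuals and one unaffected relative.
a = Variant("1", 100, "A", "G", "missense")
b = Variant("2", 200, "C", "T", "synonymous")
c = Variant("3", 300, "G", "A", "stop_gained")
d = Variant("4", 400, "T", "C", "frameshift")
result = filter_candidates(
    affected=[[a, b, c, d], [a, c, d]],
    unaffected=[[d]],
    known_db={key(c)},   # c is already catalogued in a variant database
    controls=set(),
)
print(result)
```

In this toy example only variant `a` survives: `b` is not shared by both affected individuals, `c` is already in the variant database, and `d` is carried by the unaffected relative.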
CEA | November 2012
DEEP COVERAGE IMPROVES MUTATION DETECTION
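One simple way to see why depth matters is a binomial sampling model, sketched below. This model is an assumption added for illustration, not the analysis behind the slide: if a mutation is carried by a fraction f of the reads at a locus and a caller requires at least k supporting reads, the chance of detecting it at depth N is P(X ≥ k) with X ~ Binomial(N, f). The threshold of 3 reads and the 10% allele fraction are illustrative values.

```python
from math import comb

def detection_probability(depth, allele_fraction, min_alt_reads=3):
    """P(at least min_alt_reads variant reads) under Binomial(depth, allele_fraction)."""
    p_too_few = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_too_few

# A subclonal mutation present in 10% of reads, at increasing depth.
for depth in (30, 100, 300):
    print(depth, round(detection_probability(depth, 0.10), 3))
```

Under these assumptions a 10% variant is missed a substantial fraction of the time at 30x, while at 100x and beyond it is detected almost always.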
SCALING-UP WGS ANALYSIS
Today: X00 whole genomes
• Poorly automated
• Heterogeneous technology
• No strong demand
Scaling up
• Algorithmic optimization
• Automation of "QC" and first treatments
• Cybersecurity
• Personalized service
Big Data: massive genome data
PERSONALISED MEDICINE: ~X0,000 genomes
NG HT-OMICS TECHNOLOGIES GIVE ACCESS TO "ALL"
CELLULAR DIMENSIONS
INTEGRATIVE APPROACH
[Figure: classic approach versus integrative approach, across Sample 1 to Sample 4 and GENE1 to GENE5]
Classic approach: genome sequence features, deterministic 0/1 decision
Integrative approach: sequence, copy number, expression and shRNA features are weighted and combined into a genome integrative score
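The contrast between the two approaches can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the weights, the feature names and the scaling of each feature to the range 0 to 1 are hypothetical, and a weighted average is only one possible way to "weight and combine".

```python
# Hypothetical weights per data type; in practice they would be learned or tuned.
FEATURE_WEIGHTS = {
    "sequence": 0.4,
    "copy_number": 0.2,
    "expression": 0.3,
    "shrna": 0.1,
}

def classic_call(features):
    """Classic approach: deterministic 0/1 decision from the sequence feature alone."""
    return 1 if features.get("sequence", 0.0) > 0 else 0

def integrative_score(features, weights=FEATURE_WEIGHTS):
    """Integrative approach: weighted average of all available features (each in [0, 1])."""
    used = {name: w for name, w in weights.items() if name in features}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted feature available for this gene")
    return sum(features[name] * w for name, w in used.items()) / total

# A gene with no sequence mutation but with copy-number, expression and shRNA evidence.
gene = {"sequence": 0.0, "copy_number": 1.0, "expression": 0.8, "shrna": 0.5}
print(classic_call(gene))
print(round(integrative_score(gene), 2))
```

For this toy gene the classic call is 0 because no sequence variant is present, while the integrative score is about 0.49 because the other data types still carry evidence.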
NETWORK-BASED CLASSIFICATION
Integrating heterogeneous data
“State” quantification and regulatory
explanation
Regulatory models and interaction models
Identification of “pieces of regulatory
networks” as biomarkers for predictive
models
Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:140
OUTLINE
Presentation
Personalized Bioinformatics
Data Production Trends and data integration
VOLUME AND GROWTH OF GENOMICS DATA
BGI, 2010: 1,000 Tflops, 10 PB
CERN Large Hadron Collider (LHC): ~10 PB/year at start, ~1000 PB in ~10 years
[Figure: total bases versus date, 2003 to 2017, log scale from 1E+08 to 1E+17 bases; 8-month doubling; reference level: 200k tumour/normal 30x genomes]
(Cochrane G, Pers com)
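The 8-month doubling time quoted on this slide translates into growth factors as follows. This is a minimal sketch assuming the doubling time stays constant, which is an extrapolation rather than a claim from the slide.

```python
import math

DOUBLING_MONTHS = 8  # doubling time of total bases, from the slide

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth after a given number of months."""
    return 2 ** (months / doubling_months)

def months_to_grow(factor, doubling_months=DOUBLING_MONTHS):
    """Time needed for the data volume to grow by a given factor."""
    return doubling_months * math.log2(factor)

print(round(growth_factor(12), 2))     # growth per year
print(growth_factor(120))              # growth over ten years
print(round(months_to_grow(1000), 1))  # months needed for a 1000-fold increase
```

With an 8-month doubling time the volume grows about 2.8-fold per year and 2^15 = 32768-fold in ten years, and a 1000-fold increase takes a little under 80 months.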
CHALLENGE:
DATA INTEGRATION
Station7
Physico-chemical properties:
salinity, pressure,
fluorescence, nitrate…
Description of living communities
DNA characteristics
Vision for the European Nucleotide Archive
An e-infrastructure to address the data deluge challenge
• Cooling
• Energy
• Space
• System administration and HPC technologies (parallel / distributed and hierarchical file systems, HPC…)
France Génomique is an infrastructure dedicated to the
production and analysis of genomic data – storage and data
processing needs are huge
http://www.france-genomique.org
TGCC: TRÈS GRAND CENTRE DE CALCUL DU CEA (CEA VERY LARGE COMPUTING CENTRE)
Ranked 20th worldwide (15th in the previous ranking)
120,000 cores, 2-3 Pflop/s
Machine rooms: 2 x 1,300 m2
Technical facilities: 3,000 m2
7.5 MW today, extensible to 12 MW, water and air cooling
Power line: 60 MW
Serving research and industry
Extension dedicated to the France Génomique community:
3,000 cores and 5 PB (10x gain on exome applications)
Resources can be pooled across all the supercomputers
WORKFLOW OVERVIEW
Data generation
Data storage and computation
Data presentation and exploration
• QC, production management
• Data transfer (300 GB/experiment)
• Raw data archiving (optional)
• Data processing, on-going database construction (per-project specific processes)
• Portal and data publication through per-institute Web servers
• Large read-only data sets may remain at TGCC
SPECIFICS OF THE PROCESSING WORKLOADS
Two main categories of applications
Assembly: graph traversal and index table construction, large memory, multithreaded
- typically: a few weeks, ~16-32 cores, address space from 100 GB to a few TB.
Intrinsically parallel applications, with no strong need for synchronization:
- the data must be adapted to these applications (pre-processing, splitting…),
- but the memory-to-core ratio is fairly high (8 GB/core today).
General points:
a rich and productive "ecosystem": 1,500 to 2,000 programs from 200 packages,
high update rate,
need for workflows,
long (> 24 hours) and unpredictable, or at least highly variable, compute times,
production and workflow management, user interfaces (control, reporting),
a shift towards heterogeneous data integration needs (user interfaces for exploration).
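The second category above (intrinsically parallel applications fed by pre-split data) follows a split, process, merge pattern that can be sketched as below. The function names and the GC-counting task are hypothetical stand-ins for a real per-read analysis, and a thread pool is used only to keep the sketch self-contained; on a cluster each chunk would more typically be a separate process or batch job.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, chunk_size):
    """Pre-processing step: cut the input into independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def gc_count(reads):
    """Per-chunk work with no need to synchronize with other chunks."""
    return sum(read.count("G") + read.count("C") for read in reads)

def process_in_parallel(reads, chunk_size=2, workers=4):
    """Run the per-chunk work concurrently, then merge the partial results."""
    chunks = split_into_chunks(reads, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(gc_count, chunks))
    return sum(partial_counts)

reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
total_gc = process_in_parallel(reads)
print(total_gc)
```

Because the chunks are independent, the only shared step is the final merge, which is why these workloads scale with the number of cores as long as each core has enough memory for its chunk.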
Biological information
Public Health Informatics
Huge need for high-performance data analytics in the biomedical area
Medical Informatics
• Heterogeneity of data: multi-omics, multi-technologies
• Multi-factorial causality (genetic, epigenetic, environmental)
• High number of variables, especially with the emergence of genomics and proteomics
• Interactions between variables are major stakes in addressing the wide diversity of patients and situations
• Data quality and homogeneity issues (missing data, noise, error margins, heterogeneous data formats)
Medical Imaging
Bioinformatics
NEW PARADIGM IN CLINICAL GENOMICS
François Artiguenave, CEA
Vincent Meyer, CEA
Nizar Touleimat, CEA
Christophe Battail, CEA
Lilia Mesrob, INSERM
Aurélie Leduc, CEA
Edith Le Floch, CEA
Xavier Benigni, CEA
Florian Sandron, CEA
Solène Julien
Claude Scarpelli
M2 NRBC -- 15 January 2014 -- Univ Orsay
Commissariat à l’énergie atomique et aux énergies alternatives
Institut de Génomique – Genoscope
2 rue Gaston Crémieux, 91000 EVRY
T. +33 (0)1 60 87 25 00 | F. +33 (0)1 60 87 25 14
Etablissement public à caractère industriel et commercial
| RCS Paris B 775 685 019