Big data in cancer research:
DNA sequencing and personalised medicine
Philippe Hupé
BIGDATA Conference
04/04/2013
Deciphering the cancer genome with high-throughput technologies
● Cancer is a disease of the genes
[Figure: normal karyotype compared with a cancer karyotype]
● Sequence the cancer genome (i.e. read its DNA sequence) to:
→ Understand the molecular mechanisms of tumour progression
→ Tailor the therapy to each patient individually
● Use high-throughput sequencing methods (next-generation sequencing)
30 years ago... the era of DNA sequencing
Walter Gilbert
Harvard, Nobel laureate, 1980 (prize shared with Frederick Sanger)
Co-inventor, with Allan Maxam, of the Maxam-Gilbert DNA sequencing method (1977),
published the same year as Sanger's sequencing method
“I expect that within a few years, our technology will be able to
sequence one megabase/technician-year. At that rate 100
technicians could sequence the genome in 30 years. An effort
to improve the technology over a 10-year period should raise
the rate by a factor of 10.”
The Scientist, October 20, 1986
Evolution of sequencing technologies and decreasing cost
Year  Genome       Cost ($)       Duration    Technology  Nb. of scientists
2003  HGP          2,700,000,000  13 years    Sanger      2,800
2007  Venter       100,000,000    4 years     Sanger      31
2008  Watson       1,500,000      4.5 months  Roche 454   27
2009  (not named)  50,000         4 weeks     Helicos     3
Sources: Pushkarev et al. (2009), Wadman et al. (2008)
[Figure: sequencing platforms: Roche 454, Illumina, SOLiD, Helicos]
In 2013, it costs around $5,000 to sequence a human genome in one week with
one technician (about 1,500 times faster than Gilbert's prediction)
→ Toward the $1,000 genome
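The "1,500 times faster" figure can be checked with a few lines of arithmetic. The sketch below assumes the comparison is in elapsed time: the 30 years of Gilbert's 100-technician scenario against one week in 2013. The variable names and the 52-weeks-per-year convention are illustrative assumptions.

```python
# Gilbert's 1986 scenario: 100 technicians sequence the genome in 30 years.
GILBERT_YEARS = 30
GILBERT_TECHNICIANS = 100

# 2013 figure from the slide: one technician, one week.
MODERN_WEEKS = 1
MODERN_TECHNICIANS = 1

WEEKS_PER_YEAR = 52  # assumption: whole calendar weeks

# Speed-up in elapsed (wall-clock) time.
wall_clock_speedup = GILBERT_YEARS * WEEKS_PER_YEAR / MODERN_WEEKS

# Speed-up once the number of technicians is also taken into account.
per_technician_speedup = (
    wall_clock_speedup * GILBERT_TECHNICIANS / MODERN_TECHNICIANS
)

print(f"Elapsed-time speed-up: about {wall_clock_speedup:.0f}x")
print(f"Per-technician speed-up: about {per_technician_speedup:.0f}x")
```

In elapsed time the ratio is 1,560, in line with the rounded figure on the slide. Per technician the gain is a hundred times larger.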
Data tsunami in cancer research
Low-cost sequencing + availability to every lab = data tsunami
Cost is halved every:
● CPU (Moore's law): 18 months
● Storage (Kryder's law): 12-14 months
● Network (Butters' law): 9 months
● NGS: 5 months
→ Computational challenges
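To see why these halving times create a computational challenge, one can compare how far each cost falls over the same period. This is a minimal sketch: the five-year horizon and the 13-month midpoint used for storage are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Halving times in months, as listed on the slide.
HALVING_MONTHS = {
    "CPU (Moore's law)": 18,
    "Storage (Kryder's law)": 13,  # assumption: midpoint of the 12-14 month range
    "Network (Butters' law)": 9,
    "NGS": 5,
}


def cost_reduction_factor(halving_months: float, elapsed_months: float) -> float:
    """Factor by which a cost is divided after elapsed_months."""
    return 2 ** (elapsed_months / halving_months)


HORIZON_MONTHS = 60  # assumption: five-year horizon

for name, halving in HALVING_MONTHS.items():
    factor = cost_reduction_factor(halving, HORIZON_MONTHS)
    print(f"{name}: cost divided by about {factor:,.0f} in {HORIZON_MONTHS} months")
```

Under these assumptions, sequencing cost falls by a factor of 4,096 in five years while CPU cost falls by roughly 10, so data production outpaces the capacity to compute on it.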
Next-generation sequencing... some figures...
Sequencing with Illumina HiSeq 2500:
– 6 billion sequences
– 1 sequence = 100 bases (A, T, C, G)
– 1 experiment = 600 billion bases = 200,000 “Les Misérables”
– 1 Tb of data (per week)
● Human genome = 3 billion bases = 1,000 “Les Misérables”
● Reference human genome (known sequence) = dictionary
● Cancer genome = faulty copy of the dictionary
● In cancer, genes (= words) contain mutations (= mistakes)
gene1 = GIRAFFE → gene1 = GILAFFE
● Cancer creates new words = fusion genes
gene1 = GIRAFFE, gene2 = ZEBRA → new gene = GIBRA
→ The 6 billion sequences are compared with the reference genome to find the
mutations and fusion genes, taking into account the fact that the sequencer itself
makes errors when reading the sequence
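The dictionary analogy can be turned into a toy program. The sketch below only illustrates the two word examples above. The function names are hypothetical, and it ignores sequencing errors, which real methods must model.

```python
def find_substitutions(reference_word: str, observed_word: str):
    """Return (position, reference_letter, observed_letter) for each mismatch."""
    if len(reference_word) != len(observed_word):
        raise ValueError("this toy comparison only handles equal-length words")
    return [
        (i, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference_word, observed_word))
        if ref != obs
    ]


def find_fusion_breakpoint(observed_word: str, first_gene: str, second_gene: str):
    """Return k if observed_word is the first k letters of first_gene followed
    by the end of second_gene, or None if no such split exists."""
    for k in range(1, len(observed_word)):
        head, tail = observed_word[:k], observed_word[k:]
        if first_gene.startswith(head) and second_gene.endswith(tail):
            return k
    return None


# Mutation: one letter of the word differs from the dictionary.
print(find_substitutions("GIRAFFE", "GILAFFE"))

# Fusion: the new word starts like gene1 and ends like gene2.
print(find_fusion_breakpoint("GIBRA", "GIRAFFE", "ZEBRA"))
```

The first call reports that position 2 (counting from 0) changed from R to L. The second finds the breakpoint after "GI", the point where the word switches from GIRAFFE to ZEBRA.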
Extraction of the biological signal from the raw data
● Development of algorithms and statistical methods
● Interdisciplinary work with bioinformaticians, computer scientists, biologists,
mathematicians, statisticians and algorithm specialists
● HPC infrastructure
Pieces of the cancer genome (sequenced reads):
CGAGCTG
ACGAGCT
TCCTAGC
GCTCCTA
TTTACGA
AGCTCCT
TTTACGA
AGCTCCT
ACGACTT
ACTACGA
GGCCAAC
CGGCCAA
AGCTGCG
CGAGCTG
CTACGAG
CATCTAC
Reference genome sequence = dictionary
ACTACGACTCTACGAGCATCTACGAGCTACTAGCGATCACGAGCTGCGAGCAACGGCCAAC
[Figure: reads placed along the reference; the bases that differ are labelled as mutations]
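The comparison of reads with the reference can be sketched on the slide's own toy data. The code below places each read where it disagrees least with the reference and lists the differing bases. It is a deliberately naive sketch: it allows no gaps and has no error model, so it is not how production aligners work, and the function names are hypothetical.

```python
# Reference and reads copied from the slide (duplicate reads listed once).
REFERENCE = "ACTACGACTCTACGAGCATCTACGAGCTACTAGCGATCACGAGCTGCGAGCAACGGCCAAC"
READS = [
    "CGAGCTG", "ACGAGCT", "TCCTAGC", "GCTCCTA", "TTTACGA", "AGCTCCT",
    "ACGACTT", "ACTACGA", "GGCCAAC", "CGGCCAA", "AGCTGCG", "CTACGAG",
    "CATCTAC",
]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def place_read(reference: str, read: str):
    """Return (offset, distance) of the best ungapped placement.
    Ties are broken in favour of the lowest offset."""
    candidates = (
        (hamming(reference[offset:offset + len(read)], read), offset)
        for offset in range(len(reference) - len(read) + 1)
    )
    distance, offset = min(candidates)
    return offset, distance


def list_mismatches(reference: str, read: str, offset: int):
    """Return (reference_position, reference_base, read_base) for each mismatch."""
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if reference[offset + i] != base
    ]


for read in READS:
    offset, distance = place_read(REFERENCE, read)
    print(read, offset, list_mismatches(REFERENCE, read, offset))
```

Some reads match the reference exactly, while others, such as TCCTAGC, differ by one base. The read TTTACGA fits two locations equally well, with one mismatch each. That ambiguity, added to sequencer errors, is why statistical methods are needed to separate true mutations from noise.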
Visualisation of the significant fusions
[Figure: significant fusions; panel labels: intra-chromosome fusions]
Source: MCF-7 breast cancer cell line, Hampton et al., Genome Research 2009
Application to personalised medicine: the SHIVA clinical trial
Is molecularly targeted therapy better than conventional therapy?
[Figure: trial design; labels: molecular profile, molecular abnormality, targeted agent, chemotherapy]
→ compare the efficacy of molecularly targeted therapy based on
tumor molecular profiling versus conventional therapy in patients
with refractory cancer
SHIVA clinical trial: the workflow
[Figure: workflow diagram; timeline label: 4 weeks]
● Clinic: patient's inclusion, biopsy
● Shipment to CRB and shipment to pathology
● DNA extraction; IHC (RO/RP/RA)
● Shipment of DNA to the Affymetrix platform: Affymetrix Cytoscan HD
→ Bioinformatics analysis: detection of amplified/deleted genes
→ List of amplified/deleted genes
→ Validation of amplified/deleted genes by IHC
● Shipment of DNA to the sequencing platform: Ion Torrent sequencing
→ Bioinformatics analysis: detection of mutated genes
● Bioinformatics data integration
● Elaboration of a report that is sent to the Molecular Biology Board
Therapeutic decision
The therapeutic decision is based on a report with the list of molecular abnormalities
Simple decision rules:
● If STK11 is mutated, targeted therapy = everolimus
● Other simple rules are used for other targeted therapies
→ Cancer biology is much more complex and these “naive” rules need to be improved
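Such "naive" rules amount to a lookup table from abnormality to agent. The sketch below encodes only the STK11 rule given above. The table structure, the function name and the example report are hypothetical, and nothing here is a clinical recommendation.

```python
# Only the STK11 -> everolimus rule comes from the slide; the other rules
# mentioned there are not listed, so none are invented here.
DECISION_RULES = {
    "STK11": "everolimus",
}


def propose_targeted_therapies(mutated_genes, rules=DECISION_RULES):
    """Return {gene: agent} for every reported abnormality that has a rule."""
    return {gene: rules[gene] for gene in mutated_genes if gene in rules}


# Hypothetical report: a list of mutated genes for one patient.
report = ["TP53", "STK11"]
print(propose_targeted_therapies(report))
```

Each gene is looked up independently, with no notion of interactions between abnormalities. This is the limitation the slide points to.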
Cancer is a complex disease
[Figure: multiple biological layers; interactions between chemical species]
The multidimensional nature of cancer (genome, proteome, epigenome, kinome, etc.)
has to be considered to unravel the complexity of the disease. Mathematical models and
computational systems biology are needed to improve current decision rules
and understand the emergent properties of cancer cells.
→ To perform such integrative analyses with sophisticated mathematical models,
this multidimensional information must be integrated within an efficient information
system.
Data integration is a major challenge in cancer research
Private data: medical images, copy number data, clinical data, NGS data, MS data,
gene expression data, phenotyping data, biobank data, RPPA data
Public data: Reactome, TCGA, CCLE, ICGC
A large Volume of patients' data is disseminated across a large Variety of databases,
which grow in size at a huge Velocity. To extract as much of the hidden Value
as possible from these data, we must face challenges at:
→ the technical level: develop a powerful computing architecture
→ the organisational and management levels: define the procedures to collect
data with the highest confidence and quality
→ the scientific level: create sophisticated mathematical models to predict
disease evolution and patient risk
→ At Institut Curie we are currently building an information system to fully
integrate all the molecular, biological and clinical data
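One common way to integrate scattered sources is to put a small adapter in front of each one and merge their records by patient identifier. The sketch below shows that idea only. It is not a description of the Institut Curie system, and every function name, identifier and value in it is made up for illustration.

```python
from collections import defaultdict


# Hypothetical adapters: each returns {patient_id: record} for one source.
def clinical_source():
    return {"P001": {"age": 54}, "P002": {"age": 61}}  # made-up values


def ngs_source():
    return {"P001": {"mutated_genes": ["STK11"]}}  # made-up values


def integrate(sources):
    """Merge the records of all sources into one view per patient."""
    merged = defaultdict(dict)
    for source_name, fetch in sources.items():
        for patient_id, record in fetch().items():
            merged[patient_id][source_name] = record
    return dict(merged)


integrated = integrate({"clinical": clinical_source, "ngs": ngs_source})
for patient_id, view in sorted(integrated.items()):
    print(patient_id, view)
```

A patient missing from one source, like P002 here, simply has no entry for it, so downstream models must cope with incomplete profiles.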
Can we dream of an online prediction system to help therapeutic prediction?
[Figure: private data (NGS data, RPPA data, gene expression data, clinical data; labels: LIMS, wrapper)
and public data (Reactome, ...; label: wrapper) feed a centralised bioinformatics database and a
virtual database, which support the therapeutic decision]
● Integrative analyses aim at building signatures to
predict disease evolution (e.g. risk of metastasis)
● Every day, for several patients, information is collected:
– pathological complete response (pCR)
– survival
– response to therapy
– molecular profiles
– etc.
● Re-evaluate prediction rules in real time, taking
this new information into account
● Apply online machine learning techniques
[Figure: timeline; mathematical models are trained on the observed pCR and used to predict the pCR of each new patient]
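A minimal sketch of what "online" learning means here: the model predicts for each new patient, then updates itself once the outcome is observed, without retraining from scratch. Logistic regression with stochastic gradient descent is an assumed choice for illustration, not a method named in the talk. The class name, the two features and the patient stream are made up.

```python
import math


class OnlineLogisticModel:
    """Logistic regression updated one patient at a time (stochastic gradient descent)."""

    def __init__(self, n_features: int, learning_rate: float = 0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        """Predicted probability of a pathological complete response (pCR)."""
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, observed_pcr: int):
        """One gradient step on the log-loss for a single observed outcome."""
        error = self.predict_proba(features) - observed_pcr
        self.bias -= self.learning_rate * error
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]


# Made-up stream of (features, observed pCR) pairs, in order of arrival.
patient_stream = [
    ([1.0, 0.0], 1),
    ([0.0, 1.0], 0),
    ([1.0, 0.0], 1),
    ([0.0, 1.0], 0),
]

model = OnlineLogisticModel(n_features=2)
for features, observed_pcr in patient_stream:
    predicted = model.predict_proba(features)  # prediction before the outcome is known
    model.update(features, observed_pcr)       # re-evaluate the rule with the new outcome
    print(f"predicted pCR probability {predicted:.3f}, observed {observed_pcr}")
```

Each patient is processed once, so the cost of an update does not grow with the number of patients already seen.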
Towards P4 medicine
● The term “P4 medicine” was coined by Leroy Hood (president of the Institute for Systems Biology)
● The practice of medicine is mainly reactive, i.e. the physician reacts to the disease state
of the patient and little is done to prevent the occurrence of the disease.
● Predictive medicine was first introduced by Jean Dausset (Nobel Prize in Medicine, 1980).
P4 medicine:
– Predictive: consider the genetic background of the individual and their environment
– Preventive: adapting lifestyle, taking preventive drugs
– Personalised: tailor the treatment to the unique features of the individual (such as
the patient's genetic background, the tumour's genetic and epigenetic landscape, living
environment)
– Participatory: many healthcare options, which require in-depth exchanges between
the individual and their physician
→ P4 medicine = managing a patient's health instead of managing a patient's disease
Big basket with a large variety of data
Data integration + mathematical models
→ leverage new information
Welcome to GATTACA