Presentation Lengauer (PDF

Data Analysis and Modeling in
the Modern Life Sciences
Thomas Lengauer
Max-Planck-Institut für Informatik
Saarbrücken
Structure of HIV protease with inhibitor
Overview
1. Nature vs. engineering
2. Nature talking
2.1 The letters
2.2 The words
2.3 The phrases
2.4 The stories
3. Modeling, analysis and prediction
4. Challenges
©Thomas Lengauer, 2013
Rational Design in Engineering
Limited resources
Planned design
Designer cleans up
after herself
Design requires
understanding of
the design
Approximation to
the optimum is
sufficient
Clear hierarchical
structure
Intel 10-core XEON E7 Processor
2.6 billion transistors
25nm technology
Modular separation
of tasks
http://www.silicon.de/wpcontent/uploads/legacy_images/story_media/41551389/5588345603_f12cb51d4b_b.jpg
Rational Design in Engineering
Combustion engine, parts display
from http://www.rsportscars.com/eng/articles/images01/bmw_engine_award_05_1600.jpg
Evolutionary design
Evolutionary design happens over
large time scales
And uses ample resources (large
populations, complex mechanisms
of selection)
Physical scoring functions (free
energy) are optimized exactly
Nature does not plan but tries
things out
Nature does not clean up after
herself
Nature does not require
understanding of her designs
Signal path for programmed cell death
http://www.cellsignal.com/reference/pathway/images/
Apoptosis_Overview.jpg
Complex many-faceted design
No clear separation of tasks
Nature is a strange engineer
from http://www.rube-goldberg.com/gallery_02.php
Character of Biological Systems
Results of physico-chemical processes guided by the laws of Nature
Structure and interaction of biomolecules are the result of optimizing the free
energy of molecular ensembles
and of evolutionary history
Evolution = Reproduction + Variation + Selection
Variation can be captured theoretically, within limits
Selection encompasses complex aspects of the environment and can only
rarely be reduced to simple theoretically analyzable features
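The formula Evolution = Reproduction + Variation + Selection can be illustrated with a toy simulation. This is a minimal sketch, not a model from the talk: the bit-string genomes, the population size, the mutation rate and the fitness function (number of bits set) are all illustrative assumptions. In particular, selection is reduced here to a single simple score, which is exactly the kind of simplification the slide says is rarely possible in real biology.

```python
import random

def evolve(fitness, genome_length=20, population_size=50, generations=100,
           mutation_rate=0.02, seed=0):
    """Toy evolutionary loop: reproduction + variation + selection."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        # Reproduction with variation: copy each survivor, flipping bits at random.
        offspring = [[bit ^ (rng.random() < mutation_rate) for bit in parent]
                     for parent in survivors]
        population = survivors + offspring
    return max(population, key=fitness)

# Hypothetical fitness: the number of bits set to 1.
best = evolve(fitness=sum)
print(sum(best), "of", len(best), "bits set")
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.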
The Letters: Biomolecules
DNA
Storage molecule
Very homogeneous overall structure
Subtle variations in fine structure
Exquisitely packaged in the cell nucleus
Figure: DNA double helix; other helix variants
Data repositories
DNA sequence databases,
162 million sequences
Proteins
Molecular machines
Structurally highly diverse
Figure: HIV protease, molecular scissors
Figure: Kinesin, http://www.youtube.com/watch?v=wJyUtbn0O5Y
Data repositories
Protein sequence databases (UniProt),
32 million sequences
Protein structure database (PDB),
89,000 protein structures
RNA
Evolutionarily oldest all-round molecule
Stepchild of biology until fifteen
years ago
Has been moving into the center of
research
Figure: Ribosome, courtesy Helmut Grubmüller
Data repositories
RNA sequence and structure databases
Transcriptomics databases (GEO),
thousands of datasets, 1 million samples
The Words: Molecular Interactions
Proteins
bind to small molecules, e.g. to
facilitate metabolism
Figure: Glutamine Synthetase, http://www.pdb.org/pdb/101/motm.do?momID=30
Figure: Dihydrofolate Reductase, http://www.pdb.org/pdb/101/motm.do?momID=34
Data repositories
Protein-ligand database (Relibase)
Enzyme database BRENDA (2.7 million
enzyme occurrences, 510,000 references)
love to form large complexes
GroEL/ES – chaperone, 21 proteins of 2 types
ATP Synthase
Proteasome – protein shredder, 28 proteins of 2 types
Nuclear pore, 460 proteins of 30 types; Alber et al. Nature 450, 7170 (2007)
Synaptic vesicle, 410 proteins of over 80 types; Takamori et al., Cell 127 (2006)
Data repositories
Protein complex database (CORUM),
roughly 2900 complexes (2000
unique)
bind to DNA to regulate cellular
processes
Figure: Transcription factor “reads off” DNA,
from http://www.albany.edu/cancergenomics/faculty/ikuznetsov/research.html
Data repositories
Transcription factor databases
(e.g. TRANSFAC)
18,000 transcription factors
33,000 binding sites
280,000 promoter sequences
package the genome in the cell
nucleus, thereby regulating the cell
Figure: Chromatin fiber,
from http://www.biol.ethz.ch/IMB/groups/richmond/projects/chromatin_fiber
and http://www.stkate.edu/physics/Astrobiology/
RNA
sometimes acts like an enzyme
Figure: Hammerhead ribozyme
is involved in cell regulation at
many levels
Figure: miRNA silencing mRNA
Data repositories
Micro-RNA databases (miRBase),
1500 miRNAs
The Phrases: Molecular Pathways and Networks
Metabolic networks: The production
department of the cell
Genome-scale metabolic networks are
known for dozens of (mostly microbial)
organisms
Metabolic networks are homogeneous
Mathematical methods exist for their
stationary and nonstationary analysis
Kinetic data on the involved reactions are
mostly missing
Figure: Sugar metabolism in different organisms; Schuster, Fell (2007)
Figure: Boehringer-Mannheim Chart,
see http://www.expasy.ch/cgi-bin/search-biochem-index
Data repositories
Metabolic pathway databases (e.g. KEGG)
430 pathway maps
in over 225,000 variations
involving 2,500 organisms
and 9,200 chemical reactions
addressing 1300 human diseases
involving 9,800 drugs and 17,000 metabolites
The Phrases: Molecular Pathways and Networks
Regulatory network: Steering
department of the cell
Three levels of cell regulation
Transcriptional regulation:
Reading off genes
Figure: Cellular transcription machinery
Werner, T. (2007). Analyzing Regulatory Regions in Genomes. Bioinformatics –
From Genomes to Therapies. T. Lengauer. Weinheim, Wiley-VCH. 1: 159-196.
Data repositories
Regulatory pathway databases
(e.g. TRANSPATH)
53,000 reactions
involving 30,000 compounds
61 pathway maps
Posttranscriptional regulation:
Regulating the synthesis and
degradation of gene products
Waterhouse, Helliwell, Nat Genet 4, 1 (2003)
Data repositories
Micro-RNA databases (miRBase),
46,000 sequences
in 193 species
Epigenetic regulation:
Regulating the packaging of the
genome
Figure from http://www.stkate.edu/physics/Astrobiology/
Data repositories
DNA methylation databases
Epigenetic modification databases
Rapidly growing
The ENCODE project (a ten-year
project finished in 2012) has
collected genome-wide data on
regulatory elements
The Phrases: Molecular Pathways and Networks
Protein-protein interactions fulfill
many roles in the cell
gene regulation
communication within cells (signal
transduction)
communication between cells
Protein interactions are collected
cell-wide
data are very noisy
Data repositories
Protein interaction databases
(e.g. IntAct)
over 300,000 interactions
involving 65,000 molecules
Manually curated human protein interaction network
Stelzl et al., Cell 122, 6 (2005)
The Stories: Biological Processes
Dynamics of molecular networks
Kinetic data are largely missing,
even for metabolic networks
Thus modeling happens at
higher levels of abstraction
Switching networks (on-off)
or in restricted scenarios
steady state
Involved genes
Gene expression levels over time
Circadian rhythm in Drosophila
Smolen et al., Biophys J 86, 5 (2004)
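Modeling at the abstraction level of switching (on-off) networks can be sketched with a small Boolean network. The two-gene negative feedback loop below is a hypothetical example chosen for illustration; it is not the circadian model of Smolen et al., and the gene names and wiring are assumptions.

```python
def step(state):
    """Synchronous update of a hypothetical two-gene negative feedback loop."""
    a, b = state
    next_a = 1 - b   # assumption: gene B represses gene A
    next_b = a       # assumption: gene A activates gene B
    return (next_a, next_b)

def simulate(state, n_steps):
    """Return the list of states visited, starting from the given state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory

trajectory = simulate((1, 0), 8)
for t, (a, b) in enumerate(trajectory):
    print(t, "A on " if a else "A off", "B on" if b else "B off")
```

No kinetic constants are needed: the only inputs are the wiring and the update rule. With this wiring the network cycles through all four states and returns to its start after four steps, a crude on-off picture of an oscillation.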
The Stories: Biological Processes
Enrichment of genetic and
epigenetic aberrations
Genetic aberrations in prostate cancer
Rahnenführer et al., Bioinformatics 21, 10 (2005)
The Stories: Biological Processes
Viral evolution of HIV under
drug pressure
Resistance mutations of HIV protease
Beerenwinkel et al., J Inf Diseases 191, 11 (2005)
Finding Protein Structure
1. Physics/Chemistry
Protein structure is the result
of minimizing the free energy of
a molecular ensemble
The computation of energy is
difficult
The energy landscape is
complex
This approach is only feasible in local
scenarios
Refinement of plausible
approximate structures
Figure: energy landscape (E: energy; S1, S2: structure parameters),
from http://gold.cchem.berkeley.edu/research_path.html
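Why energy minimization works only in local scenarios can be seen on a toy landscape. The sketch below assumes a made-up one-dimensional energy function with two minima, where the single parameter s stands in for the structure parameters. It is not a physical force field, and the step size and iteration count are arbitrary choices.

```python
def energy(s):
    """Hypothetical energy landscape with two minima of different depth."""
    return s**4 - 3 * s**2 + s

def gradient(s):
    """Derivative of the toy energy with respect to s."""
    return 4 * s**3 - 6 * s + 1

def refine(start, step_size=0.01, n_steps=5000):
    """Plain gradient descent: walks downhill into the nearest minimum."""
    s = start
    for _ in range(n_steps):
        s -= step_size * gradient(s)
    return s

for start in (-2.0, 2.0):
    s_min = refine(start)
    print(f"start {start:+.1f} -> minimum at {s_min:+.3f}, energy {energy(s_min):+.3f}")
```

The two starting points end up in different minima, and only one of them is the global one. A local method can therefore refine a plausible approximate structure, but it cannot be expected to find the right structure from an arbitrary start.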
Finding Protein Structure
1. Physics/Chemistry
2. History
To find a first structure we carry over
knowledge from related proteins
Figure: target protein (structure unknown) and template protein
Find a template protein via sequence
comparison (History)
Alignment of the target sequence (upper rows) with the template sequence (lower rows):
  1        MSALDNSIRVEVKTEYIEQQSSPED-EKYLFSYTITII---NL  39
           .....||..::: |:::.|.:.|:. :..|.|.:.:|:   |.
 51 NPSVTVITIQASNSTELDM-TDFVFQAAVPKTFQLQLLSPSSSIVPAFNT  99

 40 GE--------QAAKLETRHWIITDANGKTSEVQGAGVVGETPTIPPNTAY  81
    |.        ...|.:.|..|....|.|.|.:|.   :.|....||. ::
100 GTITQVIKVLNPQKQQLRMRIKLTYNHKGSAMQD---LAEVNNFPPQ-SW  145

 82 QYTSGTVLDTPFGIMYGTYGMVSESGEHFNAIIKPFRLATPGLLH  126
    |
146 Q  146
Identical amino acids: 16%
Similar amino acids: 31%
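The percentage of identical amino acids quoted for an alignment can be computed from the two aligned rows. The sketch below is a minimal illustration on a made-up pair of short sequences. The function name and the gap character are assumptions, and tools differ in the denominator they use (all alignment columns versus ungapped columns only), so both are reported. Counting similar amino acids would additionally need a substitution matrix, which is not shown.

```python
def identity_stats(aligned_a, aligned_b, gap="-"):
    """Count identical columns in a pairwise alignment given as two equal-length rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    pairs = list(zip(aligned_a, aligned_b))
    # Columns in which neither row has a gap.
    ungapped = sum(1 for x, y in pairs if x != gap and y != gap)
    # Columns in which both rows carry the same residue.
    identical = sum(1 for x, y in pairs if x == y and x != gap)
    return {
        "identical": identical,
        "percent_of_columns": 100.0 * identical / len(pairs),
        "percent_of_ungapped": 100.0 * identical / ungapped if ungapped else 0.0,
    }

# Hypothetical toy alignment, not taken from the slide.
stats = identity_stats("MKT-AYI", "MRTQA-I")
print(stats)
```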
Finding Protein Structure
Template-based structure
prediction is successful and
the method of choice for
sequence identities above
30-40%
1. Find a template protein via sequence comparison (History)
2. Model the target protein onto the template
3. Refine the model (Physics/Chemistry)
Figure: template protein; structure model of the target protein; actual structure of the target protein
Jones et al., Proteins 61, Supp 7 (2005)
Finding Protein Structure
If no template protein can be
found, the structure of the
target protein must be
predicted de novo
A fragment-based approach
pieces together the structure
of the target protein from
templates of fragments of the
protein
This effectively extends the
history approach
The quality of the results varies for
different proteins
There are intrinsically hard
proteins for this approach
Ben-David et al., Proteins 77, Suppl. 9 (2009) 50-65
Approaches to Prediction and Modeling
The two approaches: Physics/Chemistry and History reflect
the hybrid nature of biological systems
History uses pattern mining to discover relationships and
similarities in molecular (mostly sequence) data
There is a third approach, Association – somewhat more
general than History – that is based on data mining/pattern
recognition but does not build on a concept of history
Due to the complexity of the molecular processes, most
bioinformatics predictions use pattern mining, few use
physics/chemistry
Approaches to Prediction and Modeling
Predictions via physics/chemistry
Molecular docking
Predictions via history
Taxonomy and phylogenetic trees,
comparative genomics
Predictions via association
Uncovering genotype-phenotype
relationships
Recent Data Explosion in Biology
1. New generation sequencing techniques
• Accuracy of genome sequences is achieved via oversampling
• Millions of sequences per sequencing experiment
2. Unfolding of genomic data
• A complex organism (mammal) has only one genome
• But many epigenomes
• 200 tissues
• Epigenomes change with environmental conditions, age and disease
3. From species data to data of individuals
• Personal -omics data are increasingly generated
• Sequencing has become a commodity
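How oversampling yields accuracy can be sketched with a per-position majority vote over overlapping reads. This is a simplified illustration only: it assumes each read's offset in the genome is already known and ignores base quality, whereas real pipelines have to align reads first and use statistical variant callers. The reads and the function name are made up.

```python
from collections import Counter

def consensus(reads, length):
    """Majority vote per position over reads given as (offset, sequence) pairs."""
    columns = [Counter() for _ in range(length)]
    for offset, sequence in reads:
        for i, base in enumerate(sequence):
            columns[offset + i][base] += 1
    # Positions without any coverage are reported as "N".
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)

# Hypothetical reads covering an 8-base region; the second read has an error (C instead of G).
reads = [(0, "ACGTA"), (0, "ACCTA"), (1, "CGTAC"), (2, "GTACG"), (3, "TACGT")]
print(consensus(reads, 8))
```

The erroneous base is outvoted because three other reads cover the same position. Higher coverage makes such corrections more reliable, at the price of more raw data.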
Recent Data Explosion in Biology
Landmark projects
Project | Goal | Budget | Time period
Human Genome Project | Reference human genome | $ 5 billion | 1988-2003
International HapMap Project | Low resolution variation | $ 150 million | 2002-2005+
1000 Genomes Project | High resolution variation | $ 120 million | 2008-2012
International Cancer Genome Consortium | 25,000 cancer genomes, 50 tumor types | $ multi hundred million | 2008-
ENCODE | Annotation of the human genome | $ 100 million | 2003-2012
International Human Epigenome Consortium | 1000 reference epigenomes | ≈ $ 300 million | 2011-2016
Genome 10K Project | 10,000 vertebrate genomes | $ 100 million | 2010-
modENCODE | Annotation of the genomes of C. elegans and D. melanogaster | $ 140 million | 2008-2013
Resource Issues
Storage
One human genome at 60x coverage (≈ 200 Gbp of sequence) requires ≈ 2 TB for storage
and analysis
Runtime
One machine produces a human genome in two days
Max ≈ 45 Tbp/day, 225 genomes, 450 TB/day
LHC: 40 TB/day
Facebook: 500 TB/day
Super-Moore
Computing power has doubled every 14 months
Sequencing volume doubles every 5 months
Never enough processing, storage, people
Estimates by Sven-Eric Schelhorn
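The figures on this slide fit together by simple arithmetic, which the sketch below reproduces. All inputs are taken from the slide except the approximate human genome size of 3.2 Gbp, which is an assumption added to show where the 200 Gbp come from. The 70-month horizon is an arbitrary choice that makes both doubling times divide evenly.

```python
# Values from the slide.
COVERAGE = 60                 # sequencing depth
GBP_PER_GENOME = 200          # raw sequence per 60x genome, in Gbp
STORAGE_TB_PER_GENOME = 2     # storage and analysis footprint, in TB
MAX_TBP_PER_DAY = 45          # maximum sequencing throughput, in Tbp/day

# Assumption (not on the slide): approximate size of the human genome in Gbp.
HUMAN_GENOME_GBP = 3.2
raw_gbp = COVERAGE * HUMAN_GENOME_GBP          # about 192 Gbp, rounded to 200 on the slide

genomes_per_day = MAX_TBP_PER_DAY * 1000 / GBP_PER_GENOME
storage_tb_per_day = genomes_per_day * STORAGE_TB_PER_GENOME

def growth_factor(months, doubling_time_months):
    """Growth after the given number of months for a fixed doubling time."""
    return 2 ** (months / doubling_time_months)

computing = growth_factor(70, 14)    # computing power doubles every 14 months
sequencing = growth_factor(70, 5)    # sequencing volume doubles every 5 months

print(genomes_per_day, "genomes/day,", storage_tb_per_day, "TB/day")
print("after 70 months: computing x", computing, ", sequencing x", sequencing)
```

Under the slide's doubling times, sequencing volume outgrows computing power by a factor of 512 within 70 months, which is the sense of "Super-Moore".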
Data issues
Two sources of “noise”
Technical limitations of the measurement process
• Undesired, needs to be corrected for
Biological variation
• By design of Nature, needs to be taken into account and understood
Interlinking heterogeneous data
Ontologies, consistency of notation, harmonizing data standards, diversity of
metadata
Need to develop standards for experimental protocols and for data quality
MIAME: Minimum information about a microarray experiment
Privacy issues
New concepts of anonymization are needed, as individuals can be identified from
their genomic data and associated information
Analysis issues
Validation has to appropriately handle
The noise in the data
The lack of knowledge on the studied system
Spurious signals
Biological data are characterized by many variables (e.g. genes) and few
samples (e.g. patients)
Overfitting is a great danger leading to models of lower predictive power
(curse of dimensionality)
Testing many hypotheses on the same data leads to chance signals (multiple
testing problem)
Finding promising hypotheses to test
Figure: hypothesis space (labels: Utopia, Unique hypothesis, Bioinformatics, Reductionist method, Discovery Science)
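The multiple testing problem mentioned above is usually handled by correcting the significance threshold. The sketch below compares an uncorrected threshold with two standard corrections, Bonferroni and Benjamini-Hochberg. The p-values are made up for illustration and the talk does not say which correction the author uses.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only if p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k being the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values from ten tests (e.g. ten genes).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [p <= 0.05 for p in p_values]
print("uncorrected:", sum(naive))
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

With thousands of genes and few samples, the gap between the uncorrected and the corrected counts is typically much larger than in this small example.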
Where We Still Have Fundamental Problems
Ensuring and appropriately reporting data quality is a severe bottleneck
Separate biological variation from technical noise
Develop standards for data quality
Biological phenotypes are often not sufficiently described
Statistical inference usually does not uncover causal relationships
Covering scales of space and time requires data integration and methodical
research
Space: molecule (10^-9 m) – molecular complex – cell – tissue – organ –
organism (1 m)
Time: from molecular vibrations (10^-17 s) to long-term developments (disease,
ageing, 10^7 s)
Bioinformatic models are never exact
Prediction methods need to assess the confidence of their own predictions
Perspectives
Data, especially sequencing data, will grow at an explosive rate
Sequencing technology is in rapid development
Single cell sequencing will provide a new level of data quality
Image data will become a new data type that will pervade the field
Image Data
Larval development of C. elegans
Fly brain, image and model
Brain and spinal cord of a fish
From Long, F., J. Zhou, et al. (2012). PLoS Comput Biol 8(6): e1002519.
Perspectives (continued)
Dynamic processes will increasingly be modeled
The technology will increasingly reach into the clinic
personalized / individualized / precision medicine
We need to secure a consensus in society pertaining to these
developments
This will involve an open discussion and careful addressing of the risks and
ethical issues involved
Thanks for your
attention!