Data Analysis and Modeling in the Modern Life Sciences
Thomas Lengauer
Max-Planck-Institut für Informatik, Saarbrücken
Figure: Structure of the HIV protease protein with an inhibitor

Overview
1. Nature vs. engineering
2. Nature talking
   2.1 The letters
   2.2 The words
   2.3 The phrases
   2.4 The stories
3. Modeling, analysis and prediction
4. Challenges
©Thomas Lengauer, 2013

1. Nature vs. engineering

Rational Design in Engineering
• Limited resources
• Planned design
• Designer cleans up after herself
• Designer requests understanding of her design
• Approximation to the optimum is sufficient
• Clear hierarchical structure
• Modular separation of tasks
Figure: Intel 10-core XEON E7 processor, 2.6 billion transistors, 25 nm technology (http://www.silicon.de/wpcontent/uploads/legacy_images/story_media/41551389/5588345603_f12cb51d4b_b.jpg)
Figure: Combustion engine, part display (from http://www.rsportscars.com/eng/articles/images01/bmw_engine_award_05_1600.jpg)

Evolutionary Design
• Evolutionary design happens over large time scales
• It uses ample resources (large populations, complex mechanisms of selection)
• Physical scoring functions (free energy) are optimized exactly
• Nature does not plan but tries things out
• Nature does not clean up after herself
• Nature does not request understanding of her designs
• Complex, many-faceted design
• No clear separation of tasks
Figure: Signal path for programmed cell death (http://www.cellsignal.com/reference/pathway/images/Apoptosis_Overview.jpg)

Nature is a strange engineer
Figure from http://www.rube-goldberg.com/gallery_02.php

Character of Biological Systems
• Results of physico-chemical processes guided by the laws of Nature
• Structure and interaction of biomolecules are the result of optimizing the free energy of molecular ensembles and of evolutionary history
• Evolution = Reproduction + Variation + Selection
  • Variation can be captured theoretically, within limits
  • Selection encompasses complex aspects of the environment and can only rarely be reduced to simple, theoretically analyzable features

2. Nature talking
2.1 The letters

The Letters: Biomolecules
DNA
• Storage molecule
• Very homogeneous overall structure
• Subtle variations in fine structure
• Exquisitely packaged in the cell nucleus
• Data repositories: DNA sequence databases, 162 million sequences
Figures: DNA double helix; other helix variants
Proteins
• Molecular machines
• Structurally highly diverse
Figures: HIV protease, molecular scissors; kinesin
Video: http://www.youtube.com/watch?v=wJyUtbn0O5Y
• Data repositories: protein sequence databases (Uniprot), 32 million sequences; protein structure database (PDB),
89,000 protein structures

RNA
• Evolutionarily oldest all-round molecule
• Stepchild of biology until fifteen years ago
• Has been moving into the center of research
• Data repositories: RNA sequence and structure databases; transcriptomics databases (GEO), thousands of datasets, 1 million samples
Figure: Ribosome, courtesy Helmut Grubmüller

2.2 The words

The Words: Molecular Interactions
Proteins
• Bind to small molecules, e.g. to facilitate metabolism
  Data repositories: protein-ligand database (Relibase); enzyme database BRENDA (2.7 million enzyme occurrences, 510,000 references)
  Figures: Glutamine Synthase (http://www.pdb.org/pdb/101/motm.do?momID=30); Dihydrofolate Reductase (http://www.pdb.org/pdb/101/motm.do?momID=34)
• Love to form large complexes
  GroEL/ES (chaperone): 21 proteins of 2 types
  ATP Synthase
  Proteasome (protein shredder): 28 proteins of 2 types
  Nuclear pore: 460 proteins of 30 types. Alber et al., Nature 450, 7170 (2007)
  Synaptic vesicle: 410 proteins of over 80 types. Takamori et al., Cell 127 (2006)
  Data repositories: protein complex database (CORUM), roughly 2900 complexes (2000 unique)
• Bind to DNA to regulate cellular processes
  Data repositories: transcription factor databases (e.g.
TRANSFAC) 18,000 transcription factors, 33,000 binding sites, 280,000 promoter sequences
  Figure: Transcription factor "reads off" DNA (from http://www.albany.edu/cancergenomics/faculty/ikuznetsov/research.html)
• Package the genome in the cell nucleus, thereby regulating the cell
  Figure: Chromatin fiber (from http://www.biol.ethz.ch/IMB/groups/richmond/projects/chromatin_fiber and http://www.stkate.edu/physics/Astrobiology/)
RNA
• Sometimes acts like an enzyme
  Figure: Hammerhead ribozyme
• Is involved in cell regulation at many levels
  Figure: miRNA silencing mRNA
• Data repositories: micro-RNA databases (miRBase), 1500 miRNAs

2.3 The phrases

The Phrases: Molecular Pathways and Networks
Metabolic networks: The production department of the cell
• Genome-scale metabolic networks are known for dozens of (mostly microbial) organisms
• Metabolic networks are homogeneous
• Mathematical methods exist for their stationary and nonstationary analysis
• Kinetic data on the involved reactions are mostly missing
• Data repositories: metabolic pathway databases (e.g. KEGG)
  430 pathway maps in over 225,000 variations
  involving 2,500 organisms and 9,200 chemical reactions
  addressing 1,300 human diseases
  involving 9,800 drugs and 17,000 metabolites
Figure: Sugar metabolism in different organisms. Schuster, Fell (2007)
Figure: Boehringer-Mannheim Chart, see http://www.expasy.ch/cgi-bin/search-biochem-index

The Phrases: Molecular Pathways and Networks
Regulatory network: Steering department of the cell
Three levels of cell regulation
• Transcriptional regulation: Reading off genes
  Data repositories: regulatory pathway databases (e.g. TRANSPATH), 53,000 reactions involving 30,000 compounds, 61 pathway maps
Figure: Cellular transcription machinery. Werner, T. (2007). Analyzing Regulatory Regions in Genomes. Bioinformatics – From Genomes to Therapies. T. Lengauer. Weinheim, Wiley-VCH. 1: 159-196.
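
Note on the stationary analysis of metabolic networks mentioned above: a common formulation (an assumption here, the slides do not spell it out) describes the network by a stoichiometric matrix S and looks for flux vectors v with S·v = 0, so that no internal metabolite accumulates. The sketch below computes a basis of such steady-state fluxes for a hypothetical four-reaction toy network; the network, the matrix `S`, and the helper `null_space` are illustrative inventions, not taken from the presentation.

```python
from fractions import Fraction

def null_space(matrix):
    """Return a basis of all v with matrix @ v = 0 (Gauss-Jordan elimination)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivot_cols = []
    r = 0
    for c in range(n_cols):
        # Look for a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        # Clear column c in every other row.
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
        if r == len(rows):
            break
    free_cols = [c for c in range(n_cols) if c not in pivot_cols]
    basis = []
    for f in free_cols:
        # Set one free flux to 1 and solve for the pivot fluxes.
        v = [Fraction(0)] * n_cols
        v[f] = Fraction(1)
        for row_idx, p in enumerate(pivot_cols):
            v[p] = -rows[row_idx][f]
        basis.append(v)
    return basis

# Hypothetical toy network (rows: metabolites A, B; columns: reactions R1..R4)
# R1: uptake -> A, R2: A -> B, R3: B -> export, R4: A -> drain
S = [
    [1, -1, 0, -1],  # balance of A
    [0, 1, -1, 0],   # balance of B
]

for flux in null_space(S):
    print([int(x) for x in flux])
```

Each printed vector is one steady-state flux mode of the toy network. Real genome-scale analyses add reversibility constraints and flux bounds, which this sketch leaves out.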
The Phrases: Molecular Pathways and Networks
Regulatory network: Steering department of the cell
Three levels of cell regulation
• Transcriptional regulation: Reading off genes
• Posttranscriptional regulation: Regulating the synthesis and degradation of gene products
  Data repositories: micro-RNA databases (miRBase), 46,000 sequences in 193 species
  Figure source: Waterhouse, Helliwell, Nat Genet 4, 1 (2003)
• Epigenetic regulation: Regulating the packaging of the genome
  Data repositories: DNA methylation databases; epigenetic modification databases (rapidly growing)
  Figure from http://www.stkate.edu/physics/Astrobiology/
The ENCODE project (a ten-year project finished in 2012) has collected genome-wide data on regulatory elements

The Phrases: Molecular Pathways and Networks
Protein-protein interactions fulfill many roles in the cell
• Gene regulation
• Communication within cells (signal transduction)
• Communication between cells
Protein interactions are collected cell-wide
• The data are very noisy
Data repositories: protein interaction databases (e.g. IntAct), over 300,000 interactions involving 65,000 molecules
Figure: Manually curated human protein interaction network. Stelzl et al., Cell 122, 6 (2005)

2.4 The stories

The Stories: Biological Processes
Dynamics of molecular networks
• Kinetic data are largely missing, even for metabolic networks
• Thus modeling happens at higher levels of abstraction: switching networks (on-off)
• or in restricted scenarios: steady state
Figure: Circadian rhythm in Drosophila (involved genes; gene expression levels over time). Smolen et al., Biophys J 86, 5 (2004)

The Stories: Biological Processes
Enrichment of genetic and epigenetic aberrations
Figure: Genetic aberrations in prostate cancer. Rahnenführer et al., Bioinformatics 21, 10 (2005)

The Stories: Biological Processes
Viral evolution of HIV under drug pressure
Figure: Resistance mutations of HIV protease. Beerenwinkel et al., J Inf Diseases 191, 11 (2005)

3. Modeling, analysis and prediction

Finding Protein Structure
1. Physics/Chemistry
Protein structure is the result of minimizing the free energy of a molecular ensemble
• The computation of energy is difficult
• The energy landscape is complex
• This approach is only feasible in local scenarios
  • Refinement of plausible approximate structures
Figure: Energy landscape (E: energy; S1, S2: structure parameters), from http://gold.cchem.berkeley.edu/research_path.html

Finding Protein Structure
2. History
To find a first structure we carry over knowledge from related proteins
Figure: Target protein and template protein
Find a template protein via sequence comparison (History)

Target      1        MSALDNSIRVEVKTEYIEQQSSPED-EKYLFSYTITII---NL  39
                     .....||..::: |:::.|.:.|:. :..|.|.:.:|:   |.
Template   51 NPSVTVITIQASNSTELDM-TDFVFQAAVPKTFQLQLLSPSSSIVPAFNT  99

Target     40 GE--------QAAKLETRHWIITDANGKTSEVQGAGVVGETPTIPPNTAY  81
              |.        ...|.:.|..|....|.|.|.:|.   :.|....||. ::
Template  100 GTITQVIKVLNPQKQQLRMRIKLTYNHKGSAMQD---LAEVNNFPPQ-SW 145

Target     82 QYTSGTVLDTPFGIMYGTYGMVSESGEHFNAIIKPFRLATPGLLH 126
              |
Template  146 Q 146

Identical amino acids: 16%
Similar amino acids: 31%

Finding Protein Structure
Template-based structure prediction is successful and the method of choice for sequence identities above 30-40%
1. Find a template protein via sequence comparison (History)
2. Model the target protein onto the template
3.
Refine the model (Physics/Chemistry)
Figure: Structure model of the target protein; actual structure of the target protein. Jones et al., Proteins 61, Supp 7 (2005)

Finding Protein Structure
• If no template protein can be found, the structure of the target protein must be predicted de novo
• A fragment-based approach pieces together the structure of the target protein from templates of fragments of the protein
• This effectively extends the history approach
• The quality of the results varies for different proteins
• There are intrinsically hard proteins for this approach
Figure source: Ben-David et al., Proteins 77, Suppl. 9 (2009) 50-65

Approaches to Prediction and Modeling
• The two approaches, Physics/Chemistry and History, reflect the hybrid nature of biological systems
• History uses pattern mining to discover relationships and similarities in molecular (mostly sequence) data
• There is a third approach, Association, somewhat more general than History, that is based on data mining/pattern recognition but does not build on a concept of history
• Due to the complexity of the molecular processes, most bioinformatics predictions use pattern mining, few use physics/chemistry

Approaches to Prediction and Modeling
• Predictions via physics/chemistry: molecular docking
• Predictions via history: taxonomy and phylogenetic trees, comparative genomics
• Predictions via association: uncovering genotype-phenotype relationships

4. Challenges

Recent Data Explosion in Biology
1. New generation sequencing techniques
   • Accuracy of genome sequences is achieved via oversampling
   • Millions of sequences per sequencing experiment
2. Unfolding of genomic data
   • A complex organism (mammal) has only one genome
   • But many epigenomes
   • 200 tissues
   • Epigenomes change with environmental conditions, age and disease
3. From species data to data of individuals
   • Personal -omics data are increasingly generated
   • Sequencing has become a commodity

Recent Data Explosion in Biology
Landmark projects

Project                                    Goal                                                        Budget                    Time period
Human Genome Project                       Reference human genome                                      $5 billion                1988-2003
International HapMap Project               Low resolution variation                                    $150 million              2002-2005+
1000 Genomes Project                       High resolution variation                                   $120 million              2008-2012
International Cancer Genome Consortium     25,000 cancer genomes, 50 tumor types                       $ multi hundred million   2008-
ENCODE                                     Annotation of the human genome                              $100 million              2003-2012
International Human Epigenome Consortium   1000 reference epigenomes                                   ≈ $300 million            2011-2016
Genome 10K Project                         10,000 vertebrate genomes                                   $100 million              2010-
modENCODE                                  Annotation of the genomes of C. elegans and D. melanogaster $140 million              2008-2013

Resource Issues
Storage
• One human genome (60x) of 200 Gbp requires ≈ 2 TB for storage and analysis
Runtime
• One machine produces a human genome in two days
• Max ≈ 45 Tbp/day, 225 genomes, 450 TB/day
• For comparison: LHC 40 TB/day, Facebook 500 TB/day
Super-Moore
• Computing power has doubled every 14 months
• Sequencing volume doubles every 5 months
Never enough processing, storage, people
Estimates by Sven-Eric Schelhorn

Data issues
Two sources of "noise"
• Technical limitations of the measurement process
  • Undesired, needs to be corrected for
• Biological variation
  • By design of Nature, needs to be taken into account and understood
Interlinking heterogeneous data
• Ontologies, consistency of notation, harmonizing data standards, diversity of metadata
Need to develop standards for experimental protocols and for data quality
• MIAME: Minimum information about a microarray experiment
Privacy issues
• New concepts of anonymization are needed, as individuals can be inferred from their genomic data and associated information

Analysis issues
Validation has to appropriately handle
• The noise in the data
• The lack of knowledge on the studied system
Spurious signals
• Biological data are characterized by many variables (e.g. genes) and few samples (e.g.
patients)
• Overfitting is a great danger, leading to models of lower predictive power (curse of dimensionality)
• Repeated testing of the same hypothesis leads to chance signals (multiple testing problem)
Finding promising hypotheses to test
Figure: Hypothesis space, with labels: Utopia, Unique hypothesis, Bioinformatics, Reductionist method, Discovery Science

Where We Still Have Fundamental Problems
• Ensuring and appropriately reporting data quality is a severe bottleneck
  • Separate biological variation from technical noise
  • Develop standards for data quality
• Biological phenotypes are often not sufficiently described
• Statistical inference usually does not uncover causal relationships
• Covering scales of space and time requires data integration and methodical research
  • Space: molecule (10^-9 m), molecular complex, cell, tissue, organ, organism (1 m)
  • Time: from molecular vibrations (10^-17 s) to long-term developments (disease, ageing, 10^7 s)
• Bioinformatic models are never exact
  • A prediction method needs to assess the confidence of its own predictions

Perspectives
• Data, especially sequencing data, will grow at an explosive rate
• Sequencing technology is in rapid development
• Single cell sequencing will provide a new level of data quality
• Image data will become a new data type that will pervade the field

Image Data
Figures: Larval development of C. elegans; fly brain, image and model; brain and spinal cord of a fish
From Long, F., J. Zhou, et al. (2012). PLoS Comput Biol 8(6): e1002519.
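
Note on the multiple testing problem named under "Analysis issues": the effect can be made concrete with a small simulation. The sketch below assumes a setting with many independent variables (e.g. genes) that carry no real signal, so every p-value is uniform on [0, 1]. The function names, the significance level of 0.05, and the Bonferroni correction are choices made for this illustration; the slides do not prescribe a particular correction method.

```python
import random

def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def simulate_null_tests(n_tests, alpha=0.05, seed=0):
    """Draw p-values for n_tests true null hypotheses and count 'discoveries'.

    Returns (uncorrected_hits, bonferroni_hits). Every hit is a chance signal,
    because no variable carries a real effect in this simulation.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_tests)]
    uncorrected_hits = sum(p < alpha for p in p_values)
    bonferroni_hits = sum(p < alpha / n_tests for p in p_values)
    return uncorrected_hits, bonferroni_hits

if __name__ == "__main__":
    n_genes = 20000  # hypothetical number of variables tested
    uncorrected, corrected = simulate_null_tests(n_genes)
    print("expected chance signals:", n_genes * 0.05)
    print("observed without correction:", uncorrected)
    print("observed with Bonferroni correction:", corrected)
    print("family-wise error rate for 20 tests:", round(family_wise_error_rate(0.05, 20), 3))
```

With 20,000 null variables, roughly a thousand pass an uncorrected threshold of 0.05 purely by chance, which is why validation has to account for the number of hypotheses tested.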
Perspectives
• Dynamic processes will increasingly be modeled
• The technology will increasingly reach into the clinic
  • Personalized / individualized / precision medicine
• We need to secure a consensus in society pertaining to these developments
  • This will involve an open discussion and careful addressing of the risks and ethical issues involved

Thanks for your attention!
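
Note on the "Resource Issues" figures: the numbers quoted there can be checked with a few lines of arithmetic. The sketch below takes the slide's values as given (200 Gbp and about 2 TB per 60x human genome, a maximum of about 45 Tbp/day, doubling times of 14 and 5 months) and assumes 1 Tbp = 1000 Gbp. The five-year horizon and all variable names are arbitrary choices for illustration.

```python
# Values quoted on the "Resource Issues" slide
BASES_PER_GENOME_GBP = 200       # one human genome at 60x coverage
STORAGE_PER_GENOME_TB = 2        # storage and analysis footprint per genome
MAX_THROUGHPUT_TBP_PER_DAY = 45  # maximum sequencing output per day

# 45 Tbp/day divided by 200 Gbp per genome
genomes_per_day = MAX_THROUGHPUT_TBP_PER_DAY * 1000 / BASES_PER_GENOME_GBP
storage_per_day_tb = genomes_per_day * STORAGE_PER_GENOME_TB

def growth_factor(months, doubling_time_months):
    """Multiplicative growth after `months` for a given doubling time."""
    return 2 ** (months / doubling_time_months)

HORIZON_MONTHS = 60  # hypothetical five-year horizon
computing_growth = growth_factor(HORIZON_MONTHS, 14)
sequencing_growth = growth_factor(HORIZON_MONTHS, 5)

if __name__ == "__main__":
    print("genomes per day:", genomes_per_day)
    print("storage per day (TB):", storage_per_day_tb)
    print("computing power growth over 5 years:", round(computing_growth, 1))
    print("sequencing volume growth over 5 years:", round(sequencing_growth, 1))
```

The first two results reproduce the slide's 225 genomes and 450 TB per day. The growth factors show the "Super-Moore" gap: over five years sequencing volume grows by a factor of 4096, computing power by a factor of about 20.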