BMC Bioinformatics 2015, Volume 16 Suppl 3
http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S3

MEETING ABSTRACTS Open Access

Highlights from the Third International Society for Computational Biology (ISCB) European Student Council Symposium 2014
Strasbourg, France. 6 September 2014
Published: 13 February 2015

These abstracts are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S3

A1
Highlights from the Third European International Society for Computational Biology (ISCB) Student Council Symposium 2014
Margherita Francescatto1, Susanne MA Hermans2, Sepideh Babaei3, Esmeralda Vicedo4, Alexandre Borrel5,6,7, Pieter Meysman8,9*
1Department of Genome Biology for Neurodegenerative Diseases, German Center for Neurodegenerative Diseases (DZNE), Tübingen, Germany
2Computational Discovery and Design (CDD) group, Centre for Molecular and Biomolecular Informatics (CMBI), Radboudumc, Nijmegen, The Netherlands
3Delft Bioinformatics Lab, Delft University of Technology, The Netherlands
4Department for Bioinformatics and Computational Biology, Institut für Informatik, TU München, Munich, Germany
5INSERM, UMRS-973, MTi, Paris, France
6University Paris Diderot, Sorbonne Paris Cité, UMRS-973, MTi, Paris, France
7University of Helsinki, Division of Pharmaceutical Chemistry, Faculty of Pharmacy, Finland
8Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
9Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
BMC Bioinformatics 2015, 16(Suppl 3):A1

Abstract. In this meeting report, we give an overview of the talks, presentations and posters presented at the third European Symposium of the International Society for Computational Biology (ISCB) Student Council.
The event was organized as a satellite meeting of the 13th European Conference for Computational Biology (ECCB) and took place in Strasbourg, France on September 6th, 2014.

Introduction: The ISCB Student Council (SC) is the student organization of the International Society for Computational Biology. Its members are typically PhD students in the fields of bioinformatics or computational biology, but also include scientists at different stages of their careers. They come from all around the world and share a passion for bioinformatics and computational biology. The mission of the SC is to support the development of the next generation of computational biologists. This is achieved through the provision of scientific events, networking opportunities, soft-skills training, educational resources and career advice, while attempting to influence policy processes affecting science and education. The European Student Council Symposium (ESCS) is one of the activities organized by the SC as a satellite meeting accompanying the European Conference for Computational Biology (ECCB). It is therefore the European spin-off of the Student Council Symposium (SCS), which celebrated its 10th anniversary this year [1] and is a satellite meeting of the annual Intelligent Systems for Molecular Biology (ISMB) conference. Since 2010, the ESCS has been organized every two years, in the years when ECCB is not held jointly with ISMB.

Scope and format of the meeting: This year, the 3rd ESCS took place in Strasbourg, France on September 6th in conjunction with the 13th ECCB conference. The main goal of the meeting was to create opportunities for young researchers to meet and hold discussions with peers from all over the world, so that ideas could be exchanged and networks built. In addition, three highly successful principal investigators were invited to deliver inspiring keynote talks. We received more than 30 abstract submissions from students who wished to present their work at the symposium.
These submissions were peer-reviewed by an independent program committee, and eight abstracts were selected for oral presentations. Another eighteen abstracts were selected to be presented as posters. Thanks to the generous contributions of our sponsors, we were able to provide four travel fellowships to support student attendance at ESCS. Overall, almost 30 delegates from 13 different countries attended the symposium, and the program included three inspiring keynote lectures, eight contributed student presentations, and a lively poster session. The oral presentations were divided into three themed sessions, namely Modeling, Systems Biology, and Networks and Statistics. For the first time in an event organized by the SC, the five delegates with the best posters were given the opportunity to present their work in a flash presentation. This ensured that all attendees had the chance to hear and see the top poster selection, unconstrained by the crowding inherent in a normal poster session. All abstracts of the accepted oral presentations are included in this meeting report. Abstracts of the poster presentations can be found online in the symposium booklet http://escs2014.iscbsc.org/escs-booklet.

Keynotes: The consistent theme of the ESCS keynotes was the different aspects of dealing with and interpreting the massive amounts of biological data that are nowadays available, often publicly and without restrictions on use. In the morning, Dr. Lennart Martens introduced the concept of ‘Saprotrophics’, a field that many bioinformaticians might be working in without realising it and that has even piqued the interest of the social sciences [2]. The central idea behind Saprotrophics is that, with the appropriate methods, new knowledge can be obtained from massive amounts of public data, in directions that go far beyond the original intention and purpose. Although such analyses come with their own set of unique challenges, these can be overcome with proper approaches. Dr.
Martens gave an overview of such challenges and of possible ways to tackle them, and showed some interesting applications.

© 2015 various authors, licensee BioMed Central Ltd. All articles published in this supplement are distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The second keynote, by Dr. Jeroen de Ridder, underlined the critical importance of scale in biological data sets. Depending on the scale used to analyze and interpret data, the features and patterns that emerge can change quite dramatically; this is comparable to the change in perception we have of a landscape when we are flying over it or walking in it. Through an array of working examples, Dr. de Ridder guided the audience into understanding that meaningful new insights in molecular data analyses can be achieved by accounting for the importance of scale and using scale-aware analysis tools.

Finally, the keynote from Dr. Lars Juhl Jensen concerned the efforts needed to collect and combine data from different sources into a single, biologically meaningful network. In his talk, Dr. Jensen detailed the efforts and techniques that were necessary to construct the STRING database [3]. This database combines data derived from different curated databases with results from refined automatic text-mining techniques and computational prediction approaches. Several of these methods have been integrated into web-based resources, which can be used to construct other databases and are extremely valuable for systems biology applications.

Student presentations: From all abstracts submitted to the symposium, the best eight were selected for oral presentations, which were divided into three sessions.
Session 1: Modeling: Information about the DNA replication mechanisms is scarce or absent for many viruses. Kazlauskas et al. [4] reported an analysis of DNA replication genes across more than 1500 viral genomes. This analysis allowed Kazlauskas and colleagues to identify previously unknown replication components in these genomes. Conformational changes are often a critical step for the functionality of a variety of proteins. Narunsky et al. [5] introduced ConTemplate, a web server able to suggest potential conformations for proteins with an established molecular structure, based on structural similarity to other proteins with known conformations.

Session 2: Systems biology: Proteochemometrics is the modelling of the bioactivity of ligands against different targets. Cortes et al. [6] demonstrated that a Bayesian inference scheme can be successfully applied to this problem within the contexts of isoform-selective cyclooxygenase inhibition and large-scale cancer cell line drug sensitivity. Understanding the manner in which small compounds inhibit protein-protein interactions would greatly help in the design of the next generation of therapeutic compounds. Kuenemann et al. [7] studied small-molecule inhibitors of protein-protein interactions to identify new putative 3D characteristics that support inhibition. While rich information sources exist for protein interaction data, their adaptive nature remains poorly understood. Using advanced pattern mining techniques, Naulaerts et al. [8] discovered dynamic interaction patterns in lists of differentially expressed proteins that could be related to cancer states.

Session 3: Networks and statistics: DNA methylation is an important epigenetic marker that has been shown to be involved in gene silencing. Döring et al. [9] modeled the differences in sequence bias that exist for methylation determination through microarray hybridization and bisulfite sequencing.
The identification of critical residues is of great interest for the field of protein engineering. Armenta-Medina et al. [10] introduced a hybrid approach called ANMA.SCA to determine the importance of a residue in proteins, based on coevolution and cross-correlation of simulated atomic fluctuations. Gene duplications are notoriously hard to position correctly in phylogenetic reconstructions of genomic evolutionary history. Peres et al. [11] have developed a new method to improve the positioning of gene duplications in gene trees produced by TreeBest.

Award Winners: At ESCS, four awards were given to the best presenters of the day, namely two for oral presentations and two for poster presentations. The attendees determined the winners by scoring the different oral presentations based on presentation style, novelty of the work presented, slide layout and clarity of the message. The best presentation award went to Mélaine Kuenemann, while the runner-up prize went to Isidro Cortes. The best posters were selected during the noon poster session based on preferences expressed by the symposium attendees through stickers. The presenters of the five top-scoring posters were given the chance to give a 5-minute flash presentation during the main meeting. From these five flash presentations, the award winners were determined by an independent jury. Poster presentation first place went to Jakob Jespersen, and second place to Aurélie Pirayre.

Conclusions: Like previous editions, the third ESCS was a great success, characterized by talks of high profile and quality, both at the level of keynotes and submitted work. This is confirmed by the results of an online survey that participants were asked to fill in. Most participants agreed that the quality of the symposium was high to excellent, and that the balance between keynotes and submitted talks was good.
This year, we noted a decrease in the number of participants in comparison to the ESCS of two years ago, similar to what was observed in this year’s SCS [1]. An informal survey among students attending the main conference who did not register for ESCS showed that the main reasons for not attending were either conflicting workshops taking place on the same day or unfamiliarity with the Student Council and its activities. Considering this, we recommend that the organizers of future symposia implement targeted strategies to improve the dissemination of announcements concerning the event in order to reach a larger pool of potential delegates. We also observed that we received far more applications for the ESCS travel fellowships than we were able to fund. This, together with the explicit declaration in some of the applications that attending the symposium would only be possible if a travel fellowship were awarded, suggests that the lack of funding also contributed to the drop in the number of delegates, and underlines the importance of maintaining and possibly expanding the Travel Fellowship program of ISCB and its SC. Overall, we received very positive responses from all attendees, with many comments on the high quality of the oral presentations, both from keynotes and students.

Future perspectives: Next year the ISMB and ECCB conferences will be co-organised in Dublin, Ireland, from July 10th to 14th. This meeting will serve as the location for the 11th SCS and therefore the next ESCS will only take place in 2016. For information on the Student Council and other events we organize for students in computational biology and bioinformatics, please visit our website: http://www.iscbsc.org.

Acknowledgements: The success of an event the size of the European Student Council Symposium depends on the commitment of many.
We are greatly indebted to ECCB 2014 conference chairs Marie-Dominique Devignes and Yves Moreau for giving us the opportunity to have the 3rd European Student Council Symposium in Strasbourg. We are especially thankful for the logistical support and invaluable advice of the ECCB organizing committee, specifically the Workshops and Tutorials chairs Olivier Poch and Mario Albrecht, and our ECCB intermediary Magali Michaut. We deeply appreciate their continued support of the ISCB Student Council and the symposium. Further, we would like to acknowledge the support of the ISCB Board of Directors and their trust in our vision. The Student Council would also like to thank our keynote speakers, Dr. Martens, Dr. de Ridder and Dr. Jensen, for volunteering their time to contribute to the success of the symposium and to promote the next generation of computational biologists. Furthermore, we would like to thank everyone on the organizing committee; without them, there would have been no symposium. We would also like to thank the SCS2014 chairs, Farzana Rahman and Tomas Di Domenico, for the synergetic symposium collaboration. In addition, we would like to thank the BMC Bioinformatics editorial office for their help in publishing this report. We are also extremely grateful for the financial support that we received from our sponsors. This year ESCS was supported by GdrBIM, IMGT, Syngenta, Novartis, BASF and Roche. Without their support many of the opportunities that we offered to the delegates at the 3rd European Student Council Symposium would not have been possible.

References
1. Rahman F, Di Domenico T: Highlights from the Tenth International Society for Computational Biology (ISCB) Student Council Symposium 2014. BMC Bioinformatics 2015, 16(Suppl 2):A1.
2. Mackenzie A, McNally R: Living Multiples: How Large-scale Scientific Datamining Pursues Identity and Differences. Theory, Culture & Society 2013, 30:72-91.
3. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ: STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research 2013, 41(Database issue):D808-15.
4. Kazlauskas D, Venclovas C: Viral DNA replication: new insights and discoveries from large scale computational analysis. BMC Bioinformatics 2015, 16(Suppl 3):A2.
5. Narunsky A, Ashkenazy H, Kolodny R, Ben-Tal N: Using ConTemplate and the PDB to explore conformational space: On the detection of rare protein conformations. BMC Bioinformatics 2015, 16(Suppl 3):A3.
6. Cortes-Ciriano I, van Westen G, Murrell D, Lenselink E, Bender A, Malliavin TE: Applications of Proteochemometrics - From Species Extrapolation to Cell Line Sensitivity Modelling. BMC Bioinformatics 2015, 16(Suppl 3):A4.
7. Kuenemann MA, Bourbon LML, Labbé CM, Villoutreix BO, Sperandio O: An exploration of the 3D chemical space has highlighted a specific shape profile for the compounds intended to inhibit protein-protein interactions. BMC Bioinformatics 2015, 16(Suppl 3):A5.
8. Naulaerts S, Meysman P, Vanden Berghe W, Laukens K: Mining the human proteome for conserved mechanisms. BMC Bioinformatics 2015, 16(Suppl 3):A6.
9. Döring M, Gasparoni G, Gries J, Nordstrom K, Lutsik P, Walter J, Pfeifer N: Identification and Analysis of Methylation Call Differences between Bisulfite Microarray and Bisulfite Sequencing Data with Statistical Learning Techniques. BMC Bioinformatics 2015, 16(Suppl 3):A7.
10. Armenta-Medina D, Perez-Rueda E: Hybrid approaches for the detection of networks of critical residues involved in functional motions in protein families. BMC Bioinformatics 2015, 16(Suppl 3):A8.
11. Peres A, Roest Crollius H: Improving duplicated nodes position in vertebrate gene trees. BMC Bioinformatics 2015, 16(Suppl 3):A9.
A2
Viral DNA replication: new insights and discoveries from large scale computational analysis
Darius Kazlauskas*, Česlovas Venclovas
Institute of Biotechnology, Vilnius University, Lithuania
BMC Bioinformatics 2015, 16(Suppl 3):A2

Background: The ability to replicate is essential for all living entities. Duplication of genetic information is carried out by replication proteins. DNA replication has been well studied in T7 and T4 phages and in herpes viruses; however, information about replication mechanisms in other groups of viruses is either scarce or missing altogether. Double-stranded (ds) DNA viruses infect cells from all domains of life, evolve fast and are very diverse. Their genome size varies from 5 to 2,500 kbp.

Results and conclusions: To better understand viral DNA replication, we identified replication proteins in dsDNA viruses using current state-of-the-art homology detection methods. Over 150,000 proteins from 1,574 genomes were analyzed. We found that the composition of the replication machinery depends on the virus genome size. Small viruses (<40 kbp) use protein-primed DNA replication or rely on replication proteins from the host. Large viruses (>140 kbp) have their own RNA-primed replication apparatus, often supplemented with processivity factors and DNA topoisomerases to increase replication speed and efficiency. This insight led us to search for "missing" replication components in large genomes and resulted in the discovery of single-stranded DNA binding (SSB) proteins in larger eukaryotic viruses. Surprisingly, these proteins turned out to be homologs of SSB proteins previously thought to be specific for T7-like phages. Additionally, by analyzing the herpes viral helicase-primase complex, we found that one of its components, UL8, is a highly diverged inactivated B-family DNA polymerase.
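The size-dependent pattern reported in this abstract can be written down as a simple rule of thumb. The sketch below is illustrative only and is not the authors' analysis: the function name and the category labels are hypothetical, the thresholds are the <40 kbp and >140 kbp cut-offs quoted above, and genomes between the two cut-offs are labelled as not characterized because the abstract does not describe them.

```python
def replication_strategy(genome_size_kbp: float) -> str:
    """Map a dsDNA virus genome size (in kbp) to the replication pattern
    reported in the abstract. Labels are hypothetical shorthand."""
    if genome_size_kbp <= 0:
        raise ValueError("genome size must be positive")
    if genome_size_kbp < 40:
        # Small viruses: protein-primed replication or host replication proteins.
        return "protein-primed or host-dependent"
    if genome_size_kbp > 140:
        # Large viruses: own RNA-primed replication apparatus.
        return "own RNA-primed apparatus"
    # The abstract gives no statement for genomes of 40-140 kbp.
    return "not characterized in the abstract"


# Example genome sizes in kbp; 5 and 2500 are the extremes quoted in the text,
# 90 is an arbitrary intermediate value chosen for illustration.
for size in (5, 90, 2500):
    print(size, "kbp ->", replication_strategy(size))
```

A real analysis would of course work from the detected replication proteins rather than from genome size alone; the function only restates the reported trend.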
A3
Using ConTemplate and the PDB to explore conformational space: on the detection of rare protein conformations
Aya Narunsky1*, Haim Ashkenazy2, Rachel Kolodny3, Nir Ben-Tal1
1Department of Biochemistry and Molecular Biochemistry, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel
2The Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel
3Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel
BMC Bioinformatics 2015, 16(Suppl 3):A3

Background: Conformational changes mediate important protein functions, such as opening and closing of channel gates, activation and inactivation of enzymes, etc. The entire conformational repertoire of a given query protein may not be known; however, it may be possible to infer unknown conformations from other proteins. We developed the ConTemplate method to exploit the richness of the Protein Data Bank (PDB) [1] for this purpose. ConTemplate uses a three-step process to suggest alternative conformations for a query protein with one known conformation [2]. First, ConTemplate uses GESAMT to scan the PDB for proteins that share structural similarity with the query [3]. Next, for each of the collected proteins, additional known conformations are detected using BLAST [4], and clustered into a predefined number of clusters [5]. Finally, MODELLER [6] builds models of the query in various conformations, each representative of a cluster.

Results: We demonstrate the application of ConTemplate with S100A6, a member of the S100 family of Ca2+ binding proteins. The vast majority of proteins in this family bind Ca2+ through helix-loop-helix EF-hand motifs. The structure of the protein includes four helices connected by three loops. Calcium binding is coupled to a conformational change, in which helix 3 changes its orientation with respect to helix 4 (Figure 1A and 1B) [7].
Helix 2 also changes its positioning with respect to the rest of the protein upon calcium binding, but the change is not as dramatic. The RMSD between the Ca2+-bound and -free conformations is 4.46 Å. The EF-hand motif is found in many PDB entries. Yet, known structures of the Ca2+-free conformation are relatively rare. These features make the protein an interesting example for examining how the performance of ConTemplate is affected by the distribution of conformations in the PDB: the highly abundant Ca2+-bound conformation may populate a very large cluster, which could mask the Ca2+-free conformation. Thus, finding the latter conformation could be challenging. Starting from the Ca2+-free conformation as a query, it is sufficient to set the number of clusters at 2 to retrieve both the Ca2+-bound and -free conformations. ConTemplate reproduces the Ca2+-bound conformation with RMSD of 1.6 Å (Figure 1C). This is based on the query’s structural similarity to the Ca2+-free conformation of another member of the family, the S100A2 protein [8], and the bound conformation of this protein [9]. The sequence identity between the two proteins is 47%. When the number of clusters is set to be larger than 2, each cluster represents either the Ca2+-bound or the Ca2+-free conformation. On the other hand, using the abundant Ca2+-bound conformation as a query, even with up to three clusters, the process retrieves only variants of the (initial) bound conformation. Only when the number of clusters is four or larger do we obtain at least one cluster representing the Ca2+-free conformation. In general, the ability to predict the other conformation improves as the number of clusters increases. For example, with 17 clusters, 4 clusters represent the rare conformation, and ConTemplate reproduces the Ca2+-free conformation with RMSD of 2.43 Å (Figure 1D).
This is based on the query’s structural similarity to the bound conformation of another member of the family, the S100A12 protein [10], and the known free conformation of this protein [11]. The sequence identity between the query and the template is 42%.

Figure 1 (abstract A3) ConTemplate results demonstrated using the S100A6 Ca2+ binding protein. The Ca2+-free (A) and -bound (B) conformations are shown in the upper panels; helix 3 is marked in red, and the calcium ions in magenta. C. Reproducing the Ca2+-bound conformation, starting from the Ca2+-free conformation as a query. The maximal RMSD between the query and similar proteins is set to 1.2 Å, the minimal Q-score to 0.4, and the number of clusters is set to 2. D. Reproducing the Ca2+-free conformation, starting from the Ca2+-bound conformation as a query. The similarity cutoffs are the same as in C; the number of clusters is set to 17.

Conclusions: ConTemplate suggests putative conformations for a query protein with at least one known structure, based on the query’s structural similarity to other proteins. In principle, the clustering method enables the detection of distinct conformations, including local conformational changes. However, it may be necessary to adjust ConTemplate’s parameters to reveal such changes, especially when looking for rare conformations. When ConTemplate suggests models that are similar to the query, and the clusters are very large, this may indicate that less-common conformations of the query are masked by highly abundant conformations. Increasing the number of clusters may enable the rarer conformations to be detected. When the additional conformation is not known, it is not trivial to detect the “correct” conformation among the suggested models. A careful examination of the similar proteins and their conformational changes can be useful for selecting the most probable conformations for the query. In addition, if the number of clusters is large enough, a pathway between the query conformation and a putative conformation may be found, with other models serving as intermediates. Identification of such a pathway could provide insight into the physiological relevance of a newly detected conformation.

Acknowledgements: A.N. and H.A. are funded in part by the Edmond J. Safra Center for Bioinformatics at Tel Aviv University.

References
1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235-242.
2. Narunsky A, Ben-Tal N: ConTemplate: exploiting the protein databank to propose ensemble of conformations of a query protein of known structure. BMC Bioinformatics 2014, 15(Suppl 3):A5.
3. Krissinel E: Enhanced fold recognition using efficient short fragment clustering. J Mol Biochem 2012, 1(2):76-85.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
5. Choi IG, Kwon J, Kim SH: Local feature frequency profile: a method to measure structural similarity in proteins. Proc Natl Acad Sci USA 2004, 101(11):3797-3802.
6. Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234(3):779-815.
7. Otterbein LR, Kordowska J, Witte-Hoffmann C, Wang CL, Dominguez R: Crystal structures of S100A6 in the Ca(2+)-free and Ca(2+)-bound states: the calcium sensor mechanism of S100 proteins revealed at atomic resolution. Structure 2002, 10(4):557-567.
8. Koch M, Diez J, Fritz G: Crystal structure of Ca2+-free S100A2 at 1.6-A resolution. J Mol Biol 2008, 378(4):933-942.
9. Koch M, Fritz G: The structure of Ca2+-loaded S100A2 at 1.3-A resolution. FEBS J 2012, 279(10):1799-1810.
10. Moroz OV, Antson AA, Grist SJ, Maitland NJ, Dodson GG, Wilson KS, Lukanidin E, Bronstein IB: Structure of the human S100A12-copper complex: implications for host-parasite defence. Acta Crystallogr D Biol Crystallogr 2003, 59(Pt 5):859-867.
11. Moroz OV, Blagova EV, Wilkinson AJ, Wilson KS, Bronstein IB: The crystal structures of human S100A12 in apo form and in complex with zinc: new insights into S100A12 oligomerisation. J Mol Biol 2009, 391(3):536-551.

A4
Applications of proteochemometrics - from species extrapolation to cell line sensitivity modelling
Isidro Cortes-Ciriano1*, Gerard JP van Westen2, Daniel S Murrell3, Eelke B Lenselink4, Andreas Bender3, Therese E Malliavin1
1Institut Pasteur, Unité de Bioinformatique Structurale; CNRS UMR 3825; Département de Biologie Structurale et Chimie, 25, rue du Dr Roux, 75015, Paris, France
2ChEMBL Group, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD, Hinxton, Cambridge, UK
3Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
4Division of Medicinal Chemistry, Leiden Academic Center for Drug Research, Leiden, The Netherlands
BMC Bioinformatics 2015, 16(Suppl 3):A4

Background: Proteochemometrics (PCM) is a predictive bioactivity modelling method which simultaneously models the bioactivity of multiple ligands against multiple targets. PCM permits exploration of the selectivity and promiscuity of ligands on biomolecular systems of different complexity. This includes proteins and even cell-line models [1,2]. The suitability of PCM for predicting compound polypharmacology has been validated both retrospectively and in prospective experiments [1,2].
In practice, each ligand-target interaction is encoded by the concatenation of ligand and target descriptor vectors, which are used to train a single machine learning model. The inclusion of both chemical and target information enables extrapolation and interpolation in both the chemical and the biological space. Therefore, PCM permits the prediction of compound bioactivities on targets not present in the training phase [3].

Results: In this contribution, we show a methodological advancement in the field [4], namely how Bayesian inference (Gaussian Processes) can be successfully applied in the context of PCM for (i) the prediction of compound bioactivity along with the error estimation of the prediction; (ii) the determination of the applicability domain of a PCM model; and (iii) the inclusion of experimental uncertainty of bioactivity measurements. We illustrate how the application of PCM can be useful in medicinal chemistry to concomitantly optimize compound selectivity and potency, in the context of two application scenarios: (a) modelling isoform-selective cyclooxygenase inhibition; and (b) large-scale cancer cell line drug sensitivity prediction, where we benchmark the predictive signal of basal gene expression, gene copy-number variation, exome sequencing, and protein abundance data. We present the R package Chemically Aware Model Builder (camb) [5], which is able to perform the above-mentioned modelling tasks. camb is an open source platform for the generation of Structure-Activity and Structure-Property models. The functionalities of camb include: (i) standardisation of chemical structure representation, (ii) calculation of 905 one-dimensional descriptors and 14 fingerprints for small molecules, (iii) 8 types of amino acid descriptors, (iv) 13 whole protein sequence descriptors, and (v) training, validation and visualization of predictive models.
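The encoding and the Bayesian prediction scheme described in this abstract can be illustrated with a minimal sketch. It is not the camb package (which is written in R) and not the authors' implementation: the descriptor values, bioactivities, function names, the RBF kernel and its length scale are all assumptions chosen for illustration. The sketch concatenates ligand and target descriptors into one input vector per pair, fits a Gaussian Process with a noise term standing in for experimental uncertainty, and returns a predictive variance alongside each predicted bioactivity.

```python
import math

def encode_pair(ligand_desc, target_desc):
    """PCM input: ligand and target descriptor vectors, concatenated."""
    return list(ligand_desc) + list(target_desc)

def rbf_kernel(x, y, length_scale):
    """Squared-exponential kernel (an assumed choice of covariance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / aug[i][i]
    return x

def gp_predict(train_x, train_y, query, noise_var=0.01, length_scale=2.0):
    """Posterior mean and latent variance of a GP with constant mean.
    noise_var plays the role of experimental uncertainty on the labels."""
    n = len(train_x)
    y_mean = sum(train_y) / n
    centred = [y - y_mean for y in train_y]
    gram = [[rbf_kernel(train_x[i], train_x[j], length_scale)
             + (noise_var if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    k_star = [rbf_kernel(x, query, length_scale) for x in train_x]
    mean = y_mean + sum(k * w for k, w in zip(k_star, solve(gram, centred)))
    variance = 1.0 - sum(k * s for k, s in zip(k_star, solve(gram, k_star)))
    return mean, max(variance, 0.0)

# Hypothetical toy descriptors and bioactivities (not real data).
ligands = {"ligand_a": [0.0, 0.0], "ligand_b": [3.0, 0.0]}
targets = {"target_1": [0.0, 0.0], "target_2": [0.0, 3.0]}
training = [("ligand_a", "target_1", 7.0),
            ("ligand_a", "target_2", 5.0),
            ("ligand_b", "target_1", 6.0)]

train_x = [encode_pair(ligands[l], targets[t]) for l, t, _ in training]
train_y = [y for _, _, y in training]

# A pair seen in training versus a ligand-target combination never measured.
mean_seen, var_seen = gp_predict(
    train_x, train_y, encode_pair(ligands["ligand_a"], targets["target_1"]))
mean_new, var_new = gp_predict(
    train_x, train_y, encode_pair(ligands["ligand_b"], targets["target_2"]))
print("seen pair:  mean=%.2f variance=%.3f" % (mean_seen, var_seen))
print("new pair:   mean=%.2f variance=%.3f" % (mean_new, var_new))
```

In this toy setting the unmeasured combination receives a larger predictive variance than the measured one, which is the kind of signal that could be used to judge whether a query falls inside a model's applicability domain.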
Conclusions: Overall, the application of PCM in these two case scenarios lets us conclude that PCM is a suitable technique, on these data, to model the activity of ligands exhibiting diverse bioactivity profiles across a panel of targets, which can range from protein binding sites (a) to cancer cell lines (b). The camb package constitutes a platform encompassing all steps for the generation of predictive models from chemical structures and their associated bioactivities/properties, which will provide reproducibility and simplify the generation of predictive bioactivity/property models.

References
1. van Westen GJP, Wegner JK, Ijzerman AP, van Vlijmen HWT, Bender A: Proteochemometric Modeling as a Tool to Design Selective Compounds and for Extrapolating to Novel Targets. Med Chem Commun 2011, 2:16-30.
2. Cortes-Ciriano I, Ain QU, Subramanian V, Lenselink EB, Mendez-Lucio O, Ijzerman AP, Wohlfahrt G, Prusis P, Malliavin TE, van Westen GJP, Bender A: Polypharmacology Modelling Using Proteochemometrics (PCM): Recent Methodological Developments, Applications to Target Families, and Future Prospects. Med Chem Commun in press.
3. van Westen GJP, Wegner JK, Geluykens P, Kwanten L, Vereycken I, Peeters A, Ijzerman AP, van Vlijmen HWT, Bender A: Which Compound to Select in Lead Optimization? Prospectively Validated Proteochemometric Models Guide Preclinical Development. PLoS ONE 2011, 6:e27518.
4. Cortes-Ciriano I, van Westen GJP, Lenselink EB, Murrell DS, Bender A, Malliavin TE: Proteochemometric Modelling in a Bayesian framework. J Cheminf 2014, 6:35.
5. Murrell DS, Cortes-Ciriano I, van Westen GJP, Stott IP, Bender A, Malliavin TE, Glen RC: Chemically Aware Model Builder (camb): An R package for property and bioactivity modeling of small molecules. [http://www.github.com/cambDI/camb].
A5
An exploration of the 3D chemical space has highlighted a specific shape profile for the compounds intended to inhibit protein-protein interactions
Mélaine A Kuenemann1,2, Laura ML Bourbon1,2, Céline M Labbé1,2,3, Bruno O Villoutreix1,2,3, Olivier Sperandio1,2,3*
1 Université Paris Diderot, Sorbonne Paris Cité, UMRS 973 Inserm, Paris 75013, France; 2 Inserm, U973, Paris 75013, France; 3 CDithem, Faculté de Pharmacie, 1 rue du Prof Laguesse, 59000 Lille, France
E-mail: [email protected]
BMC Bioinformatics 2015, 16(Suppl 3):A5
Background: The vital role of protein-protein interactions (PPI) in life makes them the subject of a growing number of drug discovery projects. Yet, the specific properties of PPI (often described as flat, large and hydrophobic) require a dramatic paradigm shift in the way we design the small compounds meant to modulate them for therapeutic purposes. To this end, successful inhibitors of PPI targets (iPPI) may be used to discover what singular properties make this type of inhibitor capable of binding to such intricate surfaces. Among the properties from which lessons could be learnt, the 3D characteristics of iPPI have been pinpointed as essential. Understanding the putative shape profile of iPPI could help the design of a new generation of inhibitors.
Results: In an attempt to identify 3D characteristics, we collected the bioactive conformations of 84 orthosteric iPPI and compared them to those of 1282 inhibitors of conventional targets (e.g., enzymes), gathered from different databases (2P2I [1], PDBbind [2], PDB). Because the known heavier and more hydrophobic character of iPPI could conceal other characteristics, we required that none of the identified descriptors correlate with the hydrophobicity or the size of the compound. Four 3D characteristics were highlighted (Figure 1).
They describe either the shape of the compounds (globularity) or the 3D distributions of the hydrophobic and hydrophilic interacting regions of the compounds (IW4, EDmin3, CW2: VolSurf descriptors [3]). More specifically, the most essential property revealed in the analysis (EDmin3) illustrates how iPPI manage to bind to the hydrophobic patches often present at the core of PPI targets. The newly identified properties were further confirmed as characteristic of iPPI using much larger datasets, including our iPPI-DB [4], e-Drug3D [5] and a representative subset of BindingDB [6].
Figure 1 (abstract A5) Bioactive conformation of compound 1MQ as co-crystallized with Mdm2 (PDB code 4JVE). The compound is represented as a transparent molecular surface and molecular sticks. The values of the highlighted descriptors are: EDmin3 = -3.18 kcal/mol (represented by the green molecular field calculated using MOE 2012.10 at an energy level of -2.4 kcal/mol using a dry probe), IW4 = 4.13 (represented by the pink molecular field calculated using MOE 2012.10 at an energy level of -5.5 kcal/mol using a water probe), glob = 0.20 (represented by the molecular surface), and CW2 = 1.90 (represented by the proportion of pink surface over the full molecular surface).
Conclusions: Identifying low-molecular-weight iPPI is known to be a difficult task. This has usually been translated into designing compounds with greater size, aromaticity, and hydrophobicity. Yet, lessons are being learnt from iPPI bioactive conformations in an attempt to circumvent this trend.
During this analysis, we demonstrated that the capacity to bind a protein-protein interface partially relies on the combination of several structural and electrostatic features, including the globularity and the distribution of hydrophilic regions but, most importantly, of hydrophobic interacting regions. More distinctively, iPPI seem to be characterized by a significantly higher efficiency in binding the hydrophobic patches often present at PPI interfaces. The absence of correlation of this type of property with the hydrophobicity and the size of the compounds could open new ways to design iPPI with improved ligand and lipophilic efficiencies and may allow the scientific community to anticipate an era of more drug-like iPPI.
References
1. Basse MJ, Betzi S, Bourgeas R, Bouzidi S, Chetrit B, Hamon V, Morelli X, Roche P: 2P2Idb: a structural database dedicated to orthosteric modulation of protein-protein interactions. Nucleic Acids Research 2013, 41(Database issue):D824-827.
2. Wang R, Fang X, Lu Y, Wang S: The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry 2004, 47(12):2977-2980.
3. Cruciani G, Pastor M, Guba W: VolSurf: a new tool for the pharmacokinetic optimization of lead compounds. European Journal of Pharmaceutical Sciences 2000, 11(Suppl 2):S29-39.
4. Labbé CM, Laconde G, Kuenemann MA, Villoutreix BO, Sperandio O: iPPI-DB: a manually curated and interactive database of small non-peptide inhibitors of protein-protein interactions. Drug Discovery Today 2013, 18(19-20):958-968.
5. Pihan E, Colliandre L, Guichou JF, Douguet D: e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics 2012, 28(11):1540-1541.
6. Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK: BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research 2007, 35(Database issue):D198-201.

A6
Mining the human proteome for conserved mechanisms
Stefan Naulaerts1,2*, Pieter Meysman1,2, Wim Vanden Berghe3, Kris Laukens1,2
1 ADReM research group, Department of Mathematics and Computer Science, University of Antwerp, Belgium; 2 Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Belgium; 3 Laboratory of Protein Science, Proteomics and Epigenetic Signaling (PPES), Department of Biomedical Sciences, University of Antwerp, Belgium
BMC Bioinformatics 2015, 16(Suppl 3):A6
Background: All cells are subject to ever-changing environments to which they have to adapt, using their sensory system to provide input for the regulatory systems that integrate the information and trigger the eventual effectors. These cascades constitute a very complex cellular wiring that is highly relevant due to its medical importance. The omnipresent application of high-throughput analysis techniques has resulted in an unprecedented level of available detail about gene expression and various aspects of cellular proteins, such as abundance, function and localization, often captured in well-curated compendia that are publicly available. Although these information-rich inventories exist, the adaptive nature of protein complexes and signalling cascades remains poorly understood, as the current predominant approaches are not always suited to describe the associations between proteins. For example, binary protein interactions do not necessarily occur in vivo, as the proteins could be expressed in different compartments of the cell or at different time points. This severely complicates the analysis of any protein interaction data. It thus remains a challenge to find out how biological entities cooperate to regulate the cellular response to stimuli.
Methods: We used an integrative method, reliant on advanced pattern mining approaches, to gain a deeper understanding of protein network dynamics. To this end, we created a compendium consisting of a large number of proteomics papers for Homo sapiens that report differentially expressed proteins in cell lines. Next, we analysed this collection with frequent itemset mining to identify proteins that often co-occur in publications, and used these patterns as the backbone structure of our further analysis. These patterns of co-occurring proteins were enriched with additional attributes, such as gene expression correlation, protein localization and functional coherence metrics derived from the Gene Ontology tree [1], and used as a filter on top of an integrated binary protein interaction network, obtained by fusing several of the most popular resources.
Results: We found that several proteins and GO functions, such as transcriptional regulation, are consistently reported and deemed significant regardless of the research topic. Furthermore, we were able to find associations across the various “omics” levels that are conserved in a wide range of human cancers, and managed to identify lists of frequently occurring patterns that can be used to discriminate between pre- and post-metastatic tumour development.
Conclusions: Pattern-based analysis on multiple “omics” levels can be used to identify the cellular logic circuits and holds many promising applications in the biotechnological and biomedical areas.
Reference
1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, et al: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25(1):25-29.
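The frequent itemset mining step described in the Methods of abstract A6 can be illustrated with a small Apriori-style sketch in Python. This is only an assumption about how such a step might look, not the authors' pipeline: each publication is treated as a set of reported proteins, and the gene symbols, the toy `papers` list and the `min_support` value are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {frozenset of items: support count} for all frequent itemsets.

    A transaction is the set of proteins reported in one publication; an
    itemset is frequent if it occurs in at least `min_support` transactions.
    """
    transactions = [set(t) for t in transactions]
    result = {}

    # Level 1: frequent single proteins.
    counts = Counter(item for t in transactions for item in t)
    current = set()
    for item, count in counts.items():
        if count >= min_support:
            itemset = frozenset([item])
            result[itemset] = count
            current.add(itemset)

    size = 2
    while current and size <= max_size:
        items = sorted(set().union(*current))
        # Apriori pruning: keep a candidate only if all of its
        # (size - 1)-subsets were frequent at the previous level.
        candidates = [
            frozenset(c) for c in combinations(items, size)
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
        current = set()
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                result[cand] = support
                current.add(cand)
        size += 1
    return result

# Toy compendium: proteins reported as differentially expressed per paper.
papers = [
    {"TP53", "VIM", "ACTB"},
    {"TP53", "VIM"},
    {"TP53", "HSPA5"},
    {"VIM", "TP53", "HSPA5"},
    {"ACTB"},
]

if __name__ == "__main__":
    for itemset, support in sorted(frequent_itemsets(papers, 3).items(),
                                   key=lambda kv: (-kv[1], sorted(kv[0]))):
        print(sorted(itemset), support)
```

The resulting itemsets would then be the "patterns" that are annotated with further attributes (expression correlation, localization, GO coherence) before being used to filter an interaction network.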
A7
Identification and analysis of methylation call differences between bisulfite microarray and bisulfite sequencing data with statistical learning techniques
Matthias Döring1*, Gilles Gasparoni2, Jasmin Gries2, Karl Nordström2, Pavlo Lutsik2, Jörn Walter2, Nico Pfeifer1
1 Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany; 2 Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
BMC Bioinformatics 2015, 16(Suppl 3):A7
Background: DNA methylation is an epigenetic modification known to play a prime role in gene silencing and is an important topic in epigenetic research. However, due to technology-dependent errors there are inconsistencies between methylation measurements from different methods [1]. Incorrect methylation calls could result in the discovery of spurious associations between methylation patterns and specific phenotypes in epigenome-wide association studies (EWAS). We worked towards assigning a measure of confidence to individual CpGs in order to down-weight or exclude positions with inconsistent measurements in such studies. We used methylation measurements from the Infinium HumanMethylation450 microarray (β450K) and whole genome bisulfite sequencing (βWGBS) to evaluate whether locus-specific measurement differences, Δβ = β450K − βWGBS, are predictable using statistical learning techniques.
Methods: Methylation for Illumina WGBS data from HepaRGd7R2 was called with Bis-SNP [2], while methylation for Infinium 450K data from the same cell line was determined using RnBeads [3] and normalized with BMIQ [4]. For a uniform feature representation, we considered windows of reads overlapping with CpGs on the microarray (Figure 1). As predictors we examined sets of read sequences, their consensus sequences (with and without base frequencies), and non-sequence features such as base quality and depth of coverage. To obtain a predictive model independent of the methylation state, we masked CpG positions by introducing gaps or zeroing base frequencies. To predict Δβ, we built support vector regression models based on Illumina WGBS data. Read similarity was measured with numerical, string [5-7], and set kernels [8]. We introduced the notion of hybrid string kernels to afford a similarity measure for both numeric and string input simultaneously. These kernels are based on scaling the motif similarity scores of two sequences according to the similarity of their base frequency profiles.
Figure 1 (abstract A7) Data preprocessing. (1) Only reads overlapping with a CpG on the Infinium 450K chip are retained. (2) Windows are extended to the left and right of each CpG according to the maximum read length, yielding a uniform feature representation. (3) For each CpG, a consensus sequence is formed from its corresponding set of reads. Additionally, the position-specific frequency of each base is extracted. (4) Finally, CpG positions are masked by introducing gaps in the sequence or zeroing frequencies.
Results: For a read-based set kernel utilizing the weighted degree kernel with shifts [6], we found that the predicted values of Δβ correlated significantly with the observed outcomes (r = 0.37, p-value < 2.2 · 10^−16). Furthermore, the hybrid weighted degree kernel (r = 0.234) outperformed the weighted degree kernel with shifts (r = 0.22) by also considering the frequencies of individual bases in addition to the consensus sequences. Non-sequence features were less predictive of the outcome than the sequence, e.g., RBF kernels on base quality and depth of coverage attained only correlations of r = 0.057 and r = 0.003 with the outcome, respectively.
Conclusion: To our knowledge, this is the first approach indicating that differences between methylation measurements from bisulfite sequencing and the Infinium HumanMethylation450 microarray are predictable from the reads. The results suggest that features besides the sequence play only a minuscule role in the emergence of inconsistent methylation measurements. We were able to show that, in this scenario, set kernels and hybrid string kernels provide well-suited similarity measures. Further work is necessary to validate the model’s generalizability for data from other cell lines and to evaluate its practical merit.
Acknowledgements: Gilles Gasparoni and Karl Nordström were funded by the BMBF project 01KU1216F (DEEP). Pavlo Lutsik was funded by the European Union’s Seventh Framework Programme (FP7/2007-2013) grant agreement No. 267038 (NOTOX).
References
1. Dedeurwaerder S, Defrance M, Calonne C, Denis H, Sotiriou C, Fuks F: Evaluation of the Infinium Methylation 450K technology. Epigenomics 2011, 3(6):771-784.
2. Liu Y, Siegmund KD, Laird PW, Berman BP, et al: Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biol 2012, 13(7):R61.
3. Assenov Y, Müller F, Lutsik P, Walter J, Lengauer T, Bock C: Comprehensive Analysis of DNA Methylation Data with RnBeads. Nat Methods in press.
4. Teschendorff AE, et al: A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450K DNA methylation data. Bioinformatics 2013, 29(2):189-196.
5. Sonnenburg S, Rätsch G, Schäfer G: Learning interpretable SVMs for biological sequence classification. Research in Computational Molecular Biology Springer 2005, 389-407.
6. Rätsch G, Sonnenburg S, Schölkopf B: RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics 2005, 21(suppl 1):i369-i377.
7. Meinicke P, Tech M, Morgenstern B, Merkl R: Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics 2004, 5(1):169.
8. Gärtner T, Flach PA, Kowalczyk A, Smola AJ: Multi-Instance Kernels. Proceedings of 19th International Conference on Machine Learning San Mateo, CA: Morgan Kaufmann 2002, 179-186, Edited by Sammut C, Hoffmann A.

A8
Hybrid approaches for the detection of networks of critical residues involved in functional motions in protein families
Dagoberto Armenta-Medina1*, Ernesto Perez-Rueda1,2
1 Departamento de Ingeniería Celular y Biocatálisis, Instituto de Biotecnología, UNAM Av. Universidad 2001, Cuernavaca, Morelos CP 62210, México; 2 Unidad Multidisciplinaria de Docencia e Investigación, Sisal Facultad de Ciencias, UNAM, Sisal, Yucatán, México
E-mail: [email protected]
BMC Bioinformatics 2015, 16(Suppl 3):A8
Background: Currently there is great interest in identifying critical residues in proteins, to improve our understanding and allow for the engineering of protein families. Diverse approaches combine sequence information, structural data, dynamics analysis and functional description to determine the importance of amino acids with regard to protein function. In this work, we propose a hybrid approach for the identification of critical residues in proteins, combining the use of evolutionary information (co-evolution), cross-correlation of atomic fluctuations derived from Anisotropic Normal Mode Analysis simulations [1] (ANMA) and network analysis. We subsequently compared this method to existing approaches.
Results: By combining the information of the covariance matrix derived from Statistical Coupling Analysis (SCA) [2] and the cross-correlation matrix of atomic fluctuations derived from ANMA, it was possible to identify a network of evolutionarily coupled residues involved in relevant motions in protein families. The outstanding sites revealed by our hybrid approach (ANMA.SCA) showed a high correspondence with experimental data, confirming the critical role of these sites in the functional mobility of proteins. In addition, our approach was found to be complementary to previous approaches. It maintained a good correspondence with approaches derived from extensive molecular dynamics, while being faster and less expensive in terms of computational resources [3].
Conclusions: The hybrid approach ANMA.SCA opens a wide range of possibilities in the study of functional motion within protein families. By detecting networks of critical sites and their topology, it is able to reveal hidden aspects of protein dynamics.
Acknowledgements: DA-M acknowledges the PhD fellowship (35083) from CONACYT and (IN-204714) DGAPA. EP-R was supported by a grant (IN-204714) from DGAPA and (155116) from CONACYT.
References
1. Eyal E, Yang LW, Bahar I: Anisotropic network model: systematic evaluation and a new web interface. Bioinformatics 2006, 22(21):2619-2627.
2. Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, Ranganathan R: Evolutionary information for specifying a protein fold. Nature 2005, 437(7058):512-518.
3. Armenta-Medina D, Perez-Rueda E, Segovia L: Identification of functional motions in the adenylate kinase (ADK) protein family by computational hybrid approaches. Proteins 2011, 79(5):1662-1671.

A9
Improving duplicated nodes position in vertebrate gene trees
Amélie Peres*, Hugues Roest Crollius
Ecole Normale Supérieure, Institut de Biologie de l’ENS, IBENS, France
BMC Bioinformatics 2015, 16(Suppl 3):A9
Background: While gene phylogenies are essential for many biological evolutionary studies, phylogenetic reconstructions are difficult to model, especially when they include gene duplications. In this study, we have developed a method to improve the positions of duplications in gene trees produced by TreeBest, a widely used method at the core of the “Ensembl compara” pipeline [1].
Results: In order to automatically identify incorrectly positioned duplications, we investigated a method that relies on the confidence score, a measure between 0 and 1 introduced by TreeBest that is assigned to each duplication node. This score reflects the ratio between the number of species with a duplicated gene and the total number of species derived from this node. A well-supported duplication will thus have a score closer to 1. With our method, if a duplication node is considered to be poorly supported, it is replaced by a speciation node, and the duplication is moved to the following node, which is tested using the same method. If the new duplication node passes the test, the duplication is maintained at this new position in the tree. To test our method comprehensively, we ran it on all 20 194 phylogenetic trees available in the Ensembl compara database version 71. The resulting 20 194 edited gene trees were then compared with the original Ensembl gene trees by feeding both databases to AGORA [2], an algorithm developed in our laboratory to reconstruct ancestral gene orders. This tool allowed us to assess the quality of the new gene trees because its performance is very sensitive to the quality of the input gene trees, in particular because the length of the reconstructed ancestral chromosomal regions varies substantially depending on that quality. With the Ensembl gene trees, the number of ancestral genes increases and decreases rapidly over time, whereas with the edited gene trees the number of genes is more constant (Figure 1), which is more likely from an evolutionary perspective. Additionally, in some cases the number of ancestral genes is more reasonable. Such is the case for Boreoeutheria, the common ancestor of primates and rodents, whose genome reconstructed from the Ensembl gene trees has 30 000 genes, whereas the genome reconstructed from our edited gene trees contains only 20 000 genes. The latter value is much closer to what one would expect, because all modern Boreoeutheria descendant genomes contain between 20 000 and 25 000 genes. We also tested the N50 measurement, which is the size of an ancestral block such that 50% of genes are in blocks of that size or larger, for all reconstructed ancestral genomes. A higher N50 indicates a better ancestral genome reconstruction. Gene trees edited using our confidence score method significantly improve the N50, most notably with a threshold of 0.3 that was obtained empirically (Figure 2).
Figure 1 (abstract A9) Number of genes in ancestral genomes obtained with the original Ensembl gene trees database (in blue) and with gene trees edited with the confidence score method and a threshold of 0.3 (in red).
Figure 2 (abstract A9) N50 measurement for the Boreoeutheria genome reconstruction with the original Ensembl gene trees database (in blue) and with our gene trees edited with the confidence score method (in red). Edited trees significantly improve the N50. The optimal threshold is 0.3. Results are similar for all other ancestral genomes.
Conclusions: We find that using the confidence score method significantly improves the positions of duplications within gene trees when compared to the initial Ensembl gene tree database. The optimal value is obtained with a threshold score of 0.3, at which 39% of the 197 894 duplication nodes of the Ensembl gene tree database are edited, resulting in an increase in the N50 length for the ancestral reconstruction of the 58 vertebrate ancestors. These results suggest that our improved gene trees are more reliable.
References
1. Flicek P, Amode MR, Barrell D, Beal K, Brent S, et al: Ensembl 2012. Nucleic Acids Res 2011, 40:D84-90.
2. Muffato M: Reconstruction de génomes ancestraux chez les Vertébrés. PhD Thesis 2010.

Cite abstracts in this supplement using the relevant abstract number, e.g.: Peres and Crollius: Improving duplicated nodes position in vertebrate gene trees. BMC Bioinformatics 2015, 16(Suppl 3):A9
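As an illustration of the two quantities used in abstract A9, the following Python sketch computes a duplication confidence score and an N50 value on toy inputs. It is a reading of the verbal definitions in the abstract, not the TreeBest or AGORA code: the score is taken here to be the number of species found in both child subtrees divided by the number of species found in either, the comparison `score < threshold` is an assumption about how the 0.3 threshold is applied, and the species sets and block sizes are invented.

```python
def duplication_confidence(left_species, right_species):
    """Share of species under a duplication node that carry both gene copies.

    Assumed reading of the abstract: species present in both child subtrees
    divided by all species present under the node.
    """
    left, right = set(left_species), set(right_species)
    union = left | right
    if not union:
        return 0.0
    return len(left & right) / len(union)

def is_poorly_supported(score, threshold=0.3):
    """Flag a duplication node for editing (threshold 0.3 as in the abstract)."""
    return score < threshold

def n50(block_sizes):
    """Block size such that at least 50% of genes lie in blocks of that size or larger."""
    sizes = sorted(block_sizes, reverse=True)
    half = sum(sizes) / 2.0
    running = 0
    for size in sizes:
        running += size
        if running >= half:
            return size
    return 0

if __name__ == "__main__":
    well = duplication_confidence({"human", "mouse", "dog"}, {"human", "mouse"})
    poor = duplication_confidence({"human"}, {"mouse", "dog"})
    print(well, is_poorly_supported(well))
    print(poor, is_poorly_supported(poor))
    print(n50([10, 6, 4, 3, 2, 1]))
```

A node flagged by `is_poorly_supported` would, following the procedure in the Results, be relabelled as a speciation and the duplication re-tested at the following node; that tree-editing step is not implemented in this sketch.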