Project Capsid
Computational Algorithms for Protein Structures and Interactions
David Ritchie – 02 December 2014

Personnel

The initial members of the team are all currently associated with the Orpailleur project team at Inria Nancy.

Team Leader
• David Ritchie, DR2, Inria.

Permanent Researchers
• Marie-Dominique Devignes, CR1, CNRS.

Emeritus Researchers
• Bernard Maigret, DR1, CNRS.

PhD Students
• Seyed Ziaeddin Alborzi, PhD student (October 2014 – 2017), co-supervised by DR and MDD.
• Benoît Henry, PhD student (October 2013 – 2016), co-supervised with Tosca team.
• Gabin Personeni, PhD student (October 2013 – 2016), co-supervised with Orpailleur team.

Administrative Assistant
• Emmanuelle Deschamps, Inria.

1 Context

Many of the processes within living organisms can be studied and understood in terms of biochemical interactions between large macromolecules such as DNA, RNA, and proteins. DNA is often considered as the “genetic code” of life, because there is a direct mapping between triplets of the four common nucleic acid bases in DNA and the 20 amino acid residues which make up the majority of the protein molecules found in all living organisms. Some RNA molecules play a key role in the translation of sequences of DNA codons into protein molecules, while others catalyse various chemical transformations or regulate gene expression. For example, the physical translation of DNA into protein is orchestrated by several complex macromolecular structures such as RNA polymerase (which transcribes DNA into mRNA) and the ribosome (which synthesises new proteins according to a given sequence of mRNA codons). RNA polymerase and the ribosome are just two examples of the kinds of biomolecular machines which exist in nature. Other examples of biomolecular machines include spliceosomes (which are involved in gene expression), ATP synthases (energy transfer), myosins (locomotion), and kinesins (transportation), to name just a few.
Remarkably, the protein components of these systems spontaneously self-assemble into large macromolecular complexes which can sometimes be glimpsed using modern cryo-electron microscopy (cryo-EM). However, it is extremely difficult to obtain atomic models of these large assemblies using high resolution experimental techniques such as X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. In addition to the examples mentioned, many other biological processes are governed by complex systems of proteins which interact cooperatively to regulate the chemical composition within a cell or to carry out a wide range of biochemical processes such as photosynthesis, metabolism, and cell signalling, for example. Thus, if RNA and DNA sequences represent the biological blueprint for life, then proteins make up the three-dimensional (3D) molecular machinery. These days, it is becoming increasingly feasible to isolate and characterise some of the individual protein components of such systems, but it still remains extremely difficult to achieve detailed models of how these complex systems actually work. Consequently, a new multidisciplinary approach called integrative structural biology has emerged which aims to bring together experimental data from a wide range of sources and resolution scales in order to meet this challenge [X20, X21]. From the biomedical and industrial points of view, understanding how complex biomolecular systems work is crucial for the design of highly specific therapeutic drug molecules and for the development of clean and efficient bioengineering processes. For example, many antibiotic drug molecules are designed to interfere with the machineries or processes that exist in bacterial cells but which are different or even absent in humans. 
Modern biotechnology processes often use the enzymes of genetically altered or enhanced micro-organisms to make industrial or pharmaceutical compounds which are difficult or expensive to make using conventional synthetic chemistry techniques. Clearly, designing better drug molecules and developing cleaner and more efficient industrial processes will contribute to improving human health and quality of life.

Understanding how biological systems work at the level of 3D molecular structures presents fascinating challenges for biologists and computer scientists alike. Despite being made from a small set of simple chemical building blocks, protein molecules have a remarkable ability to self-assemble into complex molecular machines which carry out very specific biological processes. As such, these molecular machines may be considered as complex systems because their properties are much greater than the sum of the properties of their component parts. In recent decades, much scientific effort has been devoted to understanding structure-function relationships at the level of single biomolecules. However, following recent developments in high throughput sequencing and other experimental techniques, there is now much interest in taking a “systems” view of biomolecules and biomolecular processes to enrich our understanding of the complex mechanisms and relationships that exist within living organisms [X22, X23, X24]. Indeed, an emerging challenge in the 21st century is to understand, represent, and ultimately to control such relationships at the level of interacting biomolecular components [X25]. According to Kitano [X26], systems biology may be considered in terms of the structures, dynamics, control, and ultimately the design of systems with specific desired properties. Structural bioinformatics is mainly concerned with the first of these [X27, X28], but studying it could facilitate breakthroughs in the other three aspects, e.g.
studying how systems evolve in time and space, modulating disease processes (pharmaceuticals), and industrial exploitation (biotechnology). A recent European Science Foundation report highlights the increasing importance of systems biology to the biomedical sciences [X29].

As illustrated in Figure 1, biological systems may be considered at various scales, ranging from individual atoms and molecules to multi-component cellular assemblies. The atomic scale, which we call scale 1, is the scale of physical forces and mechanics. This scale may be used to explain the biochemical functions of individual protein molecules and to calculate their physical fluctuations over short time-scales (molecular dynamics). The molecular scale (scale 2) is the scale at which knowledge of the 3D shapes of biological objects (proteins, RNA, DNA, and other small molecules) may be used to explain how they interact, and how they fit together to form larger structures which may be visible in an electron or optical microscope (scale 3). The cellular scale (scale 4) is the scale of specialised functional sub-systems devoted to data storage, signalling, energy supply, assembly, regulation, repair, reproduction, defence, and so on. At this scale, the complexity is too great to represent the sub-system components as physical objects. Instead, they are often represented computationally as abstract nodes in a network, and the edges between those nodes represent various types of interaction within that “system” [X30]. In this project, we choose to focus on scales 2 and 3, although we expect our results will feed upwards to scale 4.

Figure 1: The scales of structural biology and their relationships to the EMDB (https://www.ebi.ac.uk/pdbe/emdb/) and PDB (http://www.rcsb.org/) databases, illustrated using the ribosome as an example.
From left to right: soft X-ray tomogram of a fission yeast cell (scale 4), electron tomogram of ribosomes in the cytosol (scale 3), cryo-EM reconstruction of the 80S ribosome (scale 2), X-ray crystal structure of the 50S ribosomal subunit (scale 2), X-ray atomic detail of the SmpB protein (scale 1). Figure taken from [X31].

2 Objectives

The American National Institutes of Health (NIH)¹ defines computational biology as “the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” On the other hand, the NIH defines bioinformatics as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.” According to these definitions, the objectives of the Capsid team span both computational biology and bioinformatics in the sense that we aim to develop new theoretical modeling and simulation methods for biological systems, and we want to implement the developed approaches as computational tools which may be used by researchers in the biological sciences.

¹ http://www.bisti.nih.gov/

2.1 Challenges in Structural Systems Biology

Here, we wish to focus on structural biology as our primary application domain, and we wish to develop algorithms and software to help study biological systems and phenomena from a structural point of view. In particular, we want (1) to develop algorithms which can help to model the structures of large multi-component biomolecular machines and (2) to develop tools and techniques to represent and mine knowledge of the 3D shapes of proteins and protein-protein interactions. Thus, a unifying theme of this project is the recurring problem of representing and reasoning about complex macromolecular shapes.
More specifically, we want to develop computational techniques to represent, analyse, and compare the shapes and interactions of protein molecules in order to help better understand how their 3D structures relate to their biological function. In summary, we wish to focus on the following closely related topics in structural bioinformatics:
• integrative multi-component assembly and modeling,
• new approaches for knowledge discovery in structural databases.

Because our motivation is domain-driven, we do not wish to restrict ourselves to the use of any particular algorithmic or computational technique. Nonetheless, it is natural that we should begin by exploiting the existing knowledge and skills of the initial team members.

2.2 Computational Objectives

As indicated above, structural biology is largely concerned with determining the 3D atomic structures of protein, RNA, and DNA molecules, and using these structures to model their biological properties and interactions. Each of these activities can be extremely time-consuming. Often, solving the 3D structure of even a single protein using X-ray crystallography or NMR can take many months or even years of effort.² Even simulating the interaction between two proteins using a detailed atomistic molecular dynamics simulation can consume many thousands of CPU-hours. While X-ray crystallographers, NMR spectroscopists, and molecular modelers often use conventional sequence and structure alignment tools to help propose initial structural models through the homology principle, they typically study only individual structures or interactions at a time. Due to the difficulties outlined above, only relatively few research groups are able to solve the structures of large multi-component systems. Similarly, most current algorithms for comparing protein structures, and especially those for modeling protein interactions, work only at the pair-wise level.
Of course, such calculations may be accelerated considerably by using dynamic programming (DP) or fast Fourier transform (FFT) techniques. However, it remains extremely challenging to scale up these techniques to model multi-component systems. For example, high performance computing (HPC) facilities may be used to accelerate arithmetically intensive shape-matching calculations, but this generally does not address the fundamentally combinatorial nature of many multi-component problems. It is therefore necessary to devise heuristic hybrid approaches which can be tailored to exploit various sources of domain knowledge. We therefore set ourselves the following main computational objectives:
• develop multi-component assembly techniques for integrative structural biology,
• classify and mine protein structures and protein-protein interactions.

² It is worth noting that the 2009 and 2012 Nobel prizes in chemistry were awarded for work on the atomic resolution structures of the ribosome, and the G-protein coupled receptors, respectively.

Achieving these objectives will often involve combining numerical and symbolic representations of molecular shapes and molecular interactions, developing joint projects with experimentalists, and forming collaborations with computing science experts from other teams at the LORIA (Laboratoire Lorrain de Recherche en Informatique et ses Applications)³ in Nancy and at other Inria centres. Because our research outputs will be of interest both to the structural bioinformatics community and to experimental biologists, we aim to publish our results in high profile journals such as Bioinformatics, Journal of Computational Chemistry, Journal of Structural Biology, Nucleic Acids Research, Proteins, Protein Science, and PLoS Computational Biology.
2.3 Practical Applications

Beyond general benchmarking tests used for evaluating our algorithms, the methods developed in the team will address particular biological problems for which understanding the interactions between biomolecules is essential. The range of such problems is very large and our choice is driven by existing collaborations with biology laboratories. Therefore, we will focus on the following practical applications:
• transcription factor II-D,
• prokaryotic type IV secretion systems,
• G-protein coupled receptors.

3 Scientific Project

The proteins that exist today represent the molecular product of some three billion years of evolution. Hence, comparing protein sequences and structures is important for understanding their functional and evolutionary relationships [X32, X33]. Historically, much of bioinformatics research has focused on developing mathematical and statistical algorithms to process, analyse, annotate, and compare protein and DNA sequences because such sequences represent the primary form of information in biological systems, and because these sequences are relatively easy to determine from biological samples in the laboratory. Analysing and comparing genomic sequences has provided key insights into the taxonomic and evolutionary relationships between the different organisms and species that we observe today. However, when viewed from a molecular or biophysical perspective, such sequences provide only a rather incomplete form of biological knowledge. There is growing evidence that structure-based methods can help to predict networks of protein-protein interactions (PPIs) with greater accuracy than those which do not use structural evidence [X34, X35]. As indicated above, each protein adopts its own distinct 3D shape, and groups of proteins often interact by forming large 3D complexes.
These complexes may exist as short-lived transitory associations, as in enzyme catalysis, or as long-lived multimeric systems such as the ribosome, transcription factors, cell surface and ion channel proteins. Understanding how proteins interact is crucially important for understanding the molecular mechanisms of disease. For instance, therapeutic drugs often work by modulating or blocking PPIs, and therefore PPIs represent an important class of drug target [X36, X37]. However, we still know very little about how proteins operate at the molecular level. Genome-wide proteomics studies of “model organisms” such as yeast [X38, X39, X40, X41] are providing a growing list of putative PPIs, but understanding the function of these predicted interactions requires much further biochemical and structural analysis. For example, yeast is one of the most studied model organisms and is known to have around 6,000 proteins, giving rise to between about 38,000 and 75,000 PPIs. Around 50% of these PPIs have been observed experimentally. On the other hand, the human genome encodes around 30,000 proteins, giving from 154,000 to 370,000 PPIs, of which only around 10% are known to date [X42, X43]. Nonetheless, there appears to be overwhelming evidence that all living organisms and many biological processes share a common ancestry in the tree of life. Therefore, developing techniques which can mine knowledge of the protein structures and interactions observed in yeast or other organisms is an important way to enhance our knowledge of human biology [X44].

³ http://www.loria.fr

The next two sections describe the main scientific challenges of this project. The third section describes three specific applications in which we will apply the techniques developed in collaboration with experimental biology laboratories.
It should be mentioned at this point that high resolution computational modeling of 3D structural interactions between proteins can be very challenging because proteins are intrinsically dynamical molecules at physiological conditions. Although the 3D structure of a protein is often highly constrained by an internal network of non-covalent interactions, at very short (nanosecond) time-scales the individual atomic positions within a protein rapidly and continuously fluctuate under thermal motion. Furthermore, at longer time-scales the internal side-chain conformations within a protein can flip from one local minimum to another. At even longer time-scales, larger structural subunits of α-helices and β-sheets may undergo substantial motions which can be very difficult to predict using computational techniques. On the other hand, many proteins are observed to crystallise into the same overall 3D fold, and indeed members of the same protein family often crystallise into quite similar 3D structures. Thus, there are recurring structural patterns which can be identified and classified. Nonetheless, it should be borne in mind that proteins are potentially very flexible molecules, and that we cannot expect to model PPIs with crystallographic resolution.

3.1 Integrative Multi-Component Assembly and Modeling

Participants: Ritchie, Devignes, Maigret

3.1.1 Introduction – High Dimensional Search Spaces

At the molecular level, each PPI is embodied by a physical 3D protein-protein interface. Therefore, if the 3D structures of a pair of interacting proteins are known, it should in principle be possible for a docking algorithm to use this knowledge to predict the structure of the complex. However, modeling protein flexibility accurately during docking is very computationally expensive due to the very large number of internal degrees of freedom in each protein, associated with twisting motions around covalent bonds.
Even if one assumes that the proteins are rigid, a brute-force search over the 6D docking space between two typical proteins can involve generating and testing some $10^{10}$ orientations.⁴ Therefore, it is highly impractical to use detailed force-field or geometric representations in a brute-force docking search. Instead, most protein docking algorithms use fast heuristic methods to perform an initial rigid-body search in order to locate a relatively small number of candidate binding orientations, and these are then refined using a more expensive interaction potential or force-field model, which might also include flexible refinement using molecular dynamics (MD), for example.

⁴ The estimate of $10^{10}$ trial rigid body docking orientations is based on the assumption that reasonably small atom-sized steps are used for each dimension. For example, when using 3D rotational steps of 6° and a 3D translational grid of $90^3$ elements, a typical run of the ZDOCK docking program [X45] generates and tests approximately $54{,}000 \times 90^3 = O(10^{10})$ trial docking orientations for a pair of medium-sized proteins. If we further assume that each protein consists of around $10^3$ atoms and that the computational cost of calculating atomic interactions scales quadratically in the number of particles, naively using even a simple force-field model in a brute-force docking search between two typical proteins could easily cost $O(10^{10}) \times O(10^3) \times O(10^3) = O(10^{16})$ floating point operations. This corresponds to around $O(10^7)$ CPU-seconds or $O(10^2)$ CPU-days on a modern 3 GHz processor. If the proteins are represented as geometric surface meshes instead of atoms, similar estimates could be made concerning the cost of calculating an interaction score between the two meshes. For example, the above 6D brute-force docking search between two meshes, each of around $10^3$ vertices, would involve calculating around $O(10^{16})$ vertex-vertex distances.
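As a rough check of the arithmetic in footnote 4, the following minimal Python sketch reproduces the order-of-magnitude estimate. It assumes one floating point operation per clock cycle on a 3 GHz processor, which is a simplification introduced here and not stated in the text, and the function and parameter names are purely illustrative.

```python
def brute_force_docking_cost(n_rotations=54_000, grid_side=90,
                             atoms_per_protein=1_000, flops_per_second=3.0e9):
    """Order-of-magnitude cost of a naive 6D rigid-body docking search."""
    n_translations = grid_side ** 3                 # 90^3 translational grid cells
    n_orientations = n_rotations * n_translations   # trial rigid-body poses
    pair_terms = atoms_per_protein ** 2             # quadratic atom-atom cost per pose
    total_flops = n_orientations * pair_terms
    # Assumption: one floating point operation per clock cycle.
    cpu_seconds = total_flops / flops_per_second
    return {
        "orientations": n_orientations,
        "flops": total_flops,
        "cpu_seconds": cpu_seconds,
        "cpu_days": cpu_seconds / 86_400,
    }


cost = brute_force_docking_cost()
for name, value in cost.items():
    print(f"{name:>12s}: {value:.2e}")
```

With the default values this gives about $3.9 \times 10^{10}$ orientations, $3.9 \times 10^{16}$ floating point operations, and roughly 150 CPU-days, consistent with the orders of magnitude quoted in the footnote.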
Some protein docking algorithms use geometric hashing (an object recognition technique adapted from computer vision [X46]) of cliques of surface triangles or critical points to avoid a brute-force search, but most now use 3D grid-based representations. While geometric hashing is very fast, it carries the risk that candidate solutions might be missed if there are large conformational changes between the free and bound structures. On the other hand, grid-based scoring functions can more readily cover the rigid body space exhaustively, and they can more easily be adapted to include other interaction types by describing each contribution to the energy as an integral over a product of a “potential” and a “density” term, as in classical electrostatics. For instance, in our Hex protein docking algorithm [T1], the in vacuo electrostatic interaction energy between two proteins, A and B, is calculated as [X47]

$$E = \frac{1}{2} \int \phi_A(\underline{x}) \rho_B(\underline{x}) \, d\underline{x} + \frac{1}{2} \int \phi_B(\underline{x}) \rho_A(\underline{x}) \, d\underline{x}, \qquad (1)$$

where $\phi_A(\underline{x})$ represents the electrostatic potential of protein A and $\rho_B(\underline{x})$ represents the charge density of protein B, etc. The notation used here follows the common physics convention of underlining vector quantities: $\underline{x} \equiv (x, y, z)$. Thus, $d\underline{x}$ represents an infinitesimal 3D volume element and $\int d\underline{x}$ denotes integration over all 3D space. The similarity between two 3D objects may be calculated in a similar way [T2]. The main advantage of grid-based representations, however, is that the scoring step can be accelerated greatly by using fast Fourier transform (FFT) techniques [X48]. Furthermore, by borrowing some techniques and notation from quantum mechanics, it can be shown that the FFT may equally be used to accelerate the search in rotational instead of translational coordinates [T3].
To give a simple example, the correlation score between a 3D cryo-EM density map $\rho_{\mathrm{map}}(\underline{x})$ and a 3D density representation of a high resolution protein model $\rho_{\mathrm{prot}}(\underline{x})$ may be expressed as an overlap integral of the form

$$S(x, y, z, \alpha, \beta, \gamma) = \int \hat{T}(x, y, z) \rho_{\mathrm{map}}(\underline{x}') \times \hat{R}(\alpha, \beta, \gamma) \rho_{\mathrm{prot}}(\underline{x}') \, d\underline{x}', \qquad (2)$$

where $\hat{T}(x, y, z)$ and $\hat{R}(\alpha, \beta, \gamma)$ represent 3D translation and Euler angle rotation operators, respectively. Thanks to the existence of fast 3D (Cartesian) FFT libraries, the above calculation has almost always been implemented using multiple 3D translational FFTs and by explicitly sampling the rotational space. However, performing a rotational FFT is arguably a more natural way to map what is largely a 3D rotational shape matching problem onto the computational degrees of freedom (DOFs). It is worth mentioning that the above operator notation is very useful when working with more complex expressions involving e.g. symmetry or multiple components, or if one wishes to re-write such calculations to distribute them over multiple processors, for example. Theoretically, using an FFT to search one DOF reduces the computational cost from $O(N^2)$ function evaluations to $O(N \log N)$. However, FFT techniques are not a panacea, especially for high-dimensional problems, because the $O(N \log N)$ speed-up is only obtained when exhaustively sampling each DOF. Furthermore, due to the large memory requirement and high latency time to stride over large multi-dimensional data arrays, the highest practical dimension for the FFT is normally just 3. The main difference between our Hex docking algorithm and other FFT docking algorithms is that Hex performs the docking search in rotational coordinates using 1D, 3D, or 5D FFT angular grids, whereas all other FFT-based docking algorithms work with regular 3D Cartesian grids.
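The $O(N^2)$ versus $O(N \log N)$ argument above can be illustrated with a one-dimensional toy problem. The sketch below is not the Hex algorithm: it assumes a periodic 1D grid and a purely translational search, and it uses a small textbook radix-2 FFT written in plain Python so that it has no external dependencies. All function and variable names are illustrative.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of two).

    The inverse transform is returned without the 1/N normalisation.
    """
    n = len(values)
    if n == 1:
        return list(values)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def correlate_direct(a, b):
    """Circular correlation by explicit summation: O(N^2) multiplications."""
    n = len(a)
    return [sum(a[(j + s) % n] * b[j] for j in range(n)) for s in range(n)]


def correlate_fft(a, b):
    """The same correlation via the correlation theorem: O(N log N)."""
    n = len(a)
    fa = fft([complex(v) for v in a])
    fb = fft([complex(v) for v in b])
    product = [x * y.conjugate() for x, y in zip(fa, fb)]
    return [(v / n).real for v in fft(product, inverse=True)]


# Toy "protein" density and a "map" containing the same feature shifted by 3.
probe = [0, 1, 2, 1, 0, 0, 0, 0]
true_shift = 3
density_map = [probe[(i - true_shift) % len(probe)] for i in range(len(probe))]

scores = correlate_fft(density_map, probe)
best_shift = max(range(len(scores)), key=scores.__getitem__)
print("best translational shift:", best_shift)
```

Both functions score every shift exhaustively, which is the regime in which the FFT pays off; if only a few shifts need to be scored, the direct sum over just those shifts is cheaper.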
Furthermore, in the polar Fourier representation it is possible to re-write a 5D FFT as multiple 1D FFTs, and this gives a significant speed-up on modern graphics processing units (GPUs) [T4].⁵ But here again, FFT techniques give a speed-up only when exhaustively sampling each DOF. Thus, FFT techniques are generally not suitable for sampling small regions of a large search space, as is necessarily the case in flexible docking with atomic resolution.

⁵ On current GPUs, it is necessary to use single precision floating point arithmetic for best performance. We find that Hex docking (continued)

3.1.2 Using Coarse-Grained Models

Many approaches have been proposed in the literature to take into account protein flexibility during docking. The most thorough methods rely on expensive atomistic simulations using MD. However, much of an MD trajectory is unlikely to be relevant to a docking encounter unless it is constrained to explore a putative protein-protein interface. Consequently, MD is normally only used to refine a small number of candidate rigid body docking poses. A faster approach is to model side-chain flexibility using rotamer libraries [X49], but such techniques are still very computationally expensive. A much faster, but more approximate method is to use so-called coarse-grained (CG) normal mode analysis (NMA) to reduce the number of flexible degrees of freedom to just one or a handful of the most significant vibrational modes [T5, X50, X51, X52]. However, sampling NMA-generated conformations typically leads to a quadratic increase in the number of conformations that must be cross-docked, and this can greatly increase the number of false-positive solutions [T6]. In fact, many protein docking algorithms, such as the FFT-based approaches, avoid the flexibility problem altogether by using “soft” potentials which can “absorb” forbidden steric clashes to a certain extent.
In our experience, docking ensembles of NMA conformations does not give much improvement over basic FFT-based soft docking [T6], and it is very computationally expensive to use side-chain repacking to refine candidate soft docking poses [T7]. We therefore plan to use only “soft” scoring functions for multi-component assembly, although we expect that NMA techniques will still be useful for flexibly fitting proteins into high resolution cryo-EM density maps (see below). In the last few years, CG force-field models have become increasingly popular in the MD community because they allow very large biomolecular systems to be simulated using conventional MD programs [X53]. Typically, a CG force-field representation replaces the atoms in each amino acid with between 2 and 4 “pseudo-atoms”, and it assigns each pseudo-atom a small number of parameters to represent its chemo-physical properties. By directly attacking the quadratic nature of pair-wise energy functions, coarse-graining can speed up MD simulations by up to three orders of magnitude. Nonetheless, such CG models can still produce useful models of very large multi-component assemblies [X54]. Furthermore, this kind of coarse-graining effectively integrates out many of the internal DOFs to leave a smoother but still physically realistic energy surface [X55]. We therefore plan to use simple but accurate CG force-field models such as SCORPION [X56] to score candidate configurations during multi-component assembly rapidly and accurately, without necessarily attempting to model flexibility explicitly.

3.1.3 Generating and Detecting Symmetry

Although protein-protein docking algorithms are improving [X57], it still remains challenging to produce a high resolution 3D model of a protein complex using ab initio techniques, mainly due to the problem of structural flexibility described above.
However, with the aid of even just one simple constraint on the docking search space, the quality of docking predictions can improve dramatically [T8, T3]. In particular, many protein complexes involve symmetric arrangements of one or more sub-units, and the presence of symmetry may be exploited to reduce the search space considerably [X58, X59, X60]. For example, using our operator notation (Equation (2)), we have already developed a prototype algorithm which can generate and score candidate docking orientations for monomers that assemble into cyclic ($C_n$) multimers using integrals of the form

$$E_{AB}(y, \alpha, \beta, \gamma) = \int \hat{T}(0, y, 0) \hat{R}(\alpha, \beta, \gamma) \phi_A(\underline{x}) \times \hat{R}(0, 0, \omega_n) \hat{T}(0, y, 0) \hat{R}(\alpha, \beta, \gamma) \rho_B(\underline{x}) \, d\underline{x}, \qquad (3)$$

where the identical monomers A and B are initially placed at the origin, and $\omega_n = 2\pi/n$ is the rotation about the principal n-fold symmetry axis. This example shows that complexes with cyclic symmetry have just 4 rigid body DOFs, compared to $6(n - 1)$ DOFs for non-symmetrical n-mers. Thus, when suitable constraints are available, the size of the search space may be reduced dramatically. We are currently extending this algorithm to assemble protein complexes with arbitrary point group symmetries. Although we currently use shape-based FFT correlations, the symmetry operator technique may equally be used to refine candidate solutions using a more accurate CG force-field scoring function. We also wish to develop similar techniques to detect the possible existence of symmetry in low resolution cryo-EM density maps (Section 3.1.7).

⁵ (continued) scores calculated using single precision FFTs on a GPU normally agree to within four decimal digits with the scores from double precision FFT calculations on a CPU [T4]. This level of precision is sufficient to rank different docking orientations.
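The geometric content of Equation (3) can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it takes the n-fold symmetry axis to be the z axis (the text does not say which axis is used), it assumes the monomer coordinates have already been oriented by $\hat{R}(\alpha, \beta, \gamma)$, and it does no scoring at all. The helper names are hypothetical.

```python
import math


def rotate_about_z(point, angle):
    """Rotate a 3D point about the z axis (assumed here to be the symmetry axis)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def build_cyclic_multimer(monomer, n, y_offset):
    """Return n copies of an already-oriented monomer arranged with C_n symmetry.

    Each copy is displaced by y_offset along y, as in T(0, y, 0), and then
    rotated by k * omega_n about z, with omega_n = 2*pi/n.
    """
    omega_n = 2.0 * math.pi / n
    displaced = [(x, y + y_offset, z) for (x, y, z) in monomer]
    return [[rotate_about_z(p, k * omega_n) for p in displaced] for k in range(n)]


def rigid_body_dofs(n, cyclic):
    """Rigid-body DOF count quoted in the text: 4 for C_n, 6(n-1) otherwise."""
    return 4 if cyclic else 6 * (n - 1)


# Three pseudo-atoms standing in for a monomer placed at the origin.
monomer = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.0, 1.5)]
trimer = build_cyclic_multimer(monomer, n=3, y_offset=10.0)

print("copies:", len(trimer))
print("DOFs with C3 symmetry:", rigid_body_dofs(3, cyclic=True))
print("DOFs without symmetry:", rigid_body_dofs(3, cyclic=False))
```

In a search of the kind described above, the four free parameters would be the three orientation angles of the monomer and the single offset `y_offset`; every other copy is then fixed by the symmetry operator.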
3.1.4 Assembling General Multi-Component Complexes

More generally, we wish to develop algorithms to assemble arbitrary non-symmetrical multi-component complexes in which the applied constraints will necessarily be more approximate than sharp symmetry constraints. Ideally, we would like to use prior knowledge to locate each of the components approximately correctly in 3D space, and then to use fast rotational correlations or CG potential functions to cover the rest of the rigid body search space. However, this addresses only the first part of the multi-component assembly problem. In favourable cases, pair-wise docking algorithms can provide a ranked list of predictions with a near-native orientation for the complex in the top 10 orientations, but in less favourable cases a good prediction might be found only within the top 500 or 1,000 orientations. If the goal is to assemble n proteins into a non-symmetrical complex with k possible orientations for each pair of proteins, a spanning tree argument can be used to show that there are a total of $n^{n-2} k^{n-1}$ distinct ways to form a complex [X61]. For example, $n = 6$ and $k = 10$ already give $6^4 \times 10^5 \approx 1.3 \times 10^8$ combinations. Therefore, except for very small values of n and k, it is impractical to enumerate all possible combinations, and heuristic search algorithms must be used. Recently, we applied a particle-swarm optimisation (PSO) approach to sample the search space efficiently. We found that the simple requirement that the individual proteins should not inter-penetrate provides a very useful way to eliminate many of the incorrect trial orientations, and that a near-native orientation may often be found within the first few solutions [T9]. However, the use of heuristic sampling techniques does not change the fundamental complexity of the problem, and our PSO approach becomes impractical with more than about 6 components.

3.1.5 3D Cryo-EM Reconstruction

We also want to develop related approaches for integrative cryo-EM structure modeling.
Thanks to current cryo-EM instruments and technologies, it is now feasible to capture low resolution images of very large macromolecular machines. However, transforming multiple 2D micrographs into high resolution 3D structures is an extremely labour-intensive and computationally intensive task. For the people involved, it is often also an extremely tedious task. From a computational point of view, solving 3D structures by cryo-EM is a classic inverse problem, in that the aim is to reconstruct the shape of an unknown 3D particle (here, a high resolution atomic model) by back-projecting a large number of observed low resolution 2D images. In conventional biomedical tomography, for example, a 3D image may be reconstructed from multiple 2D images using the Fourier slice theorem. However, in cryo-EM, the orientations of the initial 2D images are unknown. Hence, an initial 3D density map must be estimated and then refined iteratively. Typically, in order to achieve an acceptable 3D model, an initial 3D map is used to make 2D projections in different orientations which may be used as templates to pick further 2D images from the micrographs. By grouping and averaging these additional 2D images, a higher resolution 3D map may be made, which can then be refined by repeating the above cycle. Traditionally, this has been done using Cartesian FFT techniques. However, because we already have a good set of computational tools for working in polar coordinates, we also wish to explore the use of our polar Fourier correlation technique as a novel way to solve the initial 2D/3D reconstruction problem.

3.1.6 A Data Explosion in Cryo-EM

We must mention at this point that a technological revolution is taking place in cryo-EM, and we will soon be faced with a “data explosion” in the size and complexity of cryo-EM imaging data that needs to be processed.
The latest generation of cryo-EM instruments allows samples to be processed automatically at much greater rates than earlier instruments, and uses modern complementary metal oxide semiconductor (CMOS) direct detectors instead of the earlier charge-coupled device (CCD) cameras to record micrographs of up to 4K×4K pixels per image. CMOS detectors have a much better signal-to-noise ratio than CCDs because the CMOS chip measures the scattered electrons directly from the sample, whereas previously a phosphorescent layer had to be used to convert the scattered electrons into optical light for the CCD. Furthermore, because direct detectors are much faster than CCDs, there is less time for a sample to move as it is being imaged by the electron beam. Thus, higher resolution images may be captured. Along with the intriguing prospect of being able to capture biological systems at unprecedented levels of detail, there will also come an increasing need to analyse, annotate, and interpret the enormous volumes of data that will soon flow from the latest instruments. However, it is worth noting that while direct detectors can allow very high resolution density maps to be calculated in favourable conditions, according to the latest statistics from the public “EMDB” repository, the average resolution of the deposited cryo-EM maps is currently falling (i.e. becoming coarser), having gone from around 16 Å RMSD⁶ in 2012 to around 23 Å RMSD in 2014.⁷ Therefore, it is still very important to be able to process low resolution density maps. Indeed, a total of only 26 high resolution maps (≤ 4 Å RMSD), which is less than 1% of all maps in the EMDB, have actually been solved to date.⁸ Nonetheless, the technological advances described above will mean that the number, size, and complexity of the structures that can be studied by cryo-EM will only increase in the future.
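The Fourier slice theorem invoked in Section 3.1.5 underlies all of the reconstruction schemes discussed here, and it can be checked numerically in a few lines. The following is only an illustrative sketch using NumPy on a random synthetic volume: the variable names are our own, and the projection is taken along a grid axis so that no interpolation is needed. It is not part of any reconstruction package.

```python
import numpy as np

# Synthetic 3D "density" indexed as volume[z, y, x]; purely illustrative data.
rng = np.random.default_rng(0)
volume = rng.random((8, 8, 8))

# A 2D projection of the volume along the z axis, as a micrograph would record.
projection = volume.sum(axis=0)

# Fourier slice theorem: the 2D transform of the projection equals the
# central (k_z = 0) slice of the 3D transform of the volume.
slice_from_projection = np.fft.fft2(projection)
central_slice = np.fft.fftn(volume)[0, :, :]

max_error = float(np.abs(slice_from_projection - central_slice).max())
print("largest deviation between the two slices:", max_error)
```

In a real cryo-EM setting the projection directions are unknown and arbitrary, so the corresponding slices are tilted planes through the 3D transform, which is why the orientations must be estimated and refined iteratively.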
3.1.7 Integrative Cryo-EM Structure Modeling

However, achieving a good density map is still only part of the problem, because the final map will still often be of low resolution compared to X-ray crystallography. To achieve atomic resolution, it is necessary to fit previously solved crystallographic fragments of protein structures into the density map. However, the problem here is that large molecular machines will have multiple sub-components, some of which will be unknown, and many of which will fit each part of the map almost equally well. Thus, the general problem of building high resolution 3D models from cryo-EM data is like building a complex 3D jigsaw puzzle in which several pieces may be unknown or missing, and none of which will fit perfectly. Figure 2 illustrates the task of fitting a high resolution crystal structure into a low resolution density map using cross-correlation (CC) and normalised cross-correlation (NCC) scoring functions, calculated using our gEMfitter program [T10]. Here, different map resolutions were simulated by applying different Gaussian filters to the initial data. This figure shows that, for this example, using a NCC with a Laplacian pre-filter gives the best performance on low resolution maps. However, real maps are often much noisier than this example, and in such cases we find that the NCC without a Laplacian filter often gives better results. While modern CMOS detectors will help to reduce the problem of noise in new datasets, we believe it is still very important to have robust tools which can process the many cryo-EM datasets that already exist.⁹

⁶ In cryo-EM, the root-mean-squared deviation (RMSD) of a density map is normally calculated using a Fourier-shell overlap expression for consistency with the usual crystallographic notion of resolution.
⁷ http://www.ebi.ac.uk/pdbe/emdb/statistics_sp_res.html/.
⁸ http://www.ebi.ac.uk/pdbe/emdb/statistics_num_res.html/.

Problems due to structural flexibility also appear in cryo-EM.
One way to deal with this is to collect and classify ever more 2D images in order to build multiple 3D density maps because, in favourable cases, different macromolecular conformations may sometimes be observed directly in the micrographs [X62]. Additionally, once a high-resolution protein sub-unit has been located reasonably well in a density map, it is sometimes possible to improve the fit by using normal mode analysis to flexibly deform the high resolution structure before re-fitting it into the map [X63]. We would also like to tackle the problem of flexible density fitting. However, given the small size of the initial team, we do not consider this to be an immediate priority.

Figure 2: Illustration of the cryo-EM density fitting problem. The figure on the left shows the correlation peaks obtained by using gEMfitter to fit the recA monomer into a low resolution cryo-EM density map when using CC and NCC scoring functions with and without a Laplacian (∇²) pre-filter. (a) CC, (b) CC+∇², (c) NCC, (d) NCC+∇². Sharper peaks correspond to better scoring functions. Thus, NCC+∇² is the strongest scoring function and CC is the weakest for this example. The central figure shows the “breaking” resolution curves for the four scoring functions using simulated EM density data. In this figure, the x axis shows map resolution in Å (large positive values correspond to low resolution data), and the y axis corresponds to the root mean squared deviation (RMSD) in Å from the correct solution (small values correspond to good predictions). This figure shows that the NCC+∇² scoring function gives the best overall performance for this example. The figure on the right shows the top-scoring fit (red protein backbone) overlaid on the correct position (blue backbone) that was obtained using NCC with a Laplacian pre-filter. Figure adapted from [T10].
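To make the scoring functions compared in Figure 2 more concrete, the sketch below shows one simple way to compute a normalised cross-correlation with an optional Laplacian pre-filter. It is a minimal illustration under stated assumptions, not the gEMfitter implementation: the function names are hypothetical, only integer translations are searched by brute force (no rotations, no FFT acceleration), and periodic boundaries are assumed for convenience.

```python
import numpy as np

def laplacian(vol):
    """Six-neighbour finite-difference Laplacian with periodic wrap-around."""
    out = -6.0 * vol
    for axis in range(3):
        out = out + np.roll(vol, 1, axis=axis) + np.roll(vol, -1, axis=axis)
    return out

def ncc(a, b):
    """Normalised cross-correlation of two volumes of identical shape."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_shift(target, probe, max_shift=3, prefilter=False):
    """Brute-force search for the integer translation of `probe` that
    maximises the NCC with `target`; returns (score, (dx, dy, dz))."""
    if prefilter:
        target, probe = laplacian(target), laplacian(probe)
    best = (-2.0, (0, 0, 0))
    shifts = range(-max_shift, max_shift + 1)
    for dx in shifts:
        for dy in shifts:
            for dz in shifts:
                moved = np.roll(probe, (dx, dy, dz), axis=(0, 1, 2))
                score = ncc(target, moved)
                if score > best[0]:
                    best = (score, (dx, dy, dz))
    return best

# Toy data: a Gaussian blob plays the role of a low resolution sub-unit,
# and the "map" is simply a translated copy of it.
grid = np.indices((16, 16, 16)).astype(float)
blob = np.exp(-((grid[0] - 8) ** 2 + (grid[1] - 7) ** 2 + (grid[2] - 8) ** 2) / 8.0)
target = np.roll(blob, (2, -1, 3), axis=(0, 1, 2))

print(best_shift(target, blob))
print(best_shift(target, blob, prefilter=True))
```

The Laplacian emphasises surfaces and contours in the density, which tends to sharpen the correlation peak on clean data but also amplifies noise; this is consistent with the observation above that the unfiltered NCC can be preferable for noisy experimental maps.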
Another very challenging computational problem is how first to locate all of the components correctly in a large density map. We are collaborating with the Biocomputing group of Annick Dejaegere at the Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC) in Strasbourg to construct approximate “bead” models of macromolecular assemblies, which may then be used as fuzzy volumetric restraints for fitting high resolution crystallographic protein structures into low resolution density maps. To extend this approach, we want to explore ways to use biological knowledge to define connectivity restraints on the beads themselves. For example, if it is known that two residues from different proteins must be in contact, this requirement may be used to define a simple distance restraint. Furthermore, we expect that some types of data could be transformed directly into distance restraints, such as data from mutagenesis experiments and fluorescence or bioluminescence resonance energy transfer measurements (FRET or BRET, respectively) [X57, T8]. On the other hand, other types of data such as small-angle X-ray scattering (SAXS) curves might only be useful when validating a final model.

⁹ The EMDB currently contains a total of 2,666 3D density maps (http://www.ebi.ac.uk/pdbe/emdb/statistics_main.html/). It would be quite impractical to re-generate all of this data using the latest instruments.

One way to deal with such heterogeneous data is to construct and optimise a multi-objective scoring function of the form [X64]

$$F(x_1, \ldots, x_n) = \sum_{i=1}^{n} f(x_i) + \sum_{i=1}^{n} \sum_{j=i+1}^{n} g(x_i, x_j), \qquad (4)$$

where the $x_i$ represent free variables, and $f(x_i)$ and $g(x_i, x_j)$ represent single-body and pair-wise scoring functions, respectively. For example, $f(x_i)$ might represent the score for fitting a protein into a cryo-EM density map, and $g(x_i, x_j)$ might represent a docking score between two proteins.
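The structure of Equation (4) can be sketched in a few lines of code. The example below is purely illustrative: the one-dimensional "poses" and the toy functions `fit_score` and `pair_score` are hypothetical stand-ins for a density-fitting score and a docking score, and higher scores are assumed to be better.

```python
from itertools import combinations

def total_score(x, f, g):
    """Evaluate F(x_1..x_n) = sum_i f(x_i) + sum_{i<j} g(x_i, x_j)."""
    single_body = sum(f(xi) for xi in x)
    pair_wise = sum(g(xi, xj) for xi, xj in combinations(x, 2))
    return single_body + pair_wise

def fit_score(xi):
    """Toy single-body term: reward poses lying inside the 'map' [0, 10]."""
    return 1.0 if 0.0 <= xi <= 10.0 else -1.0

def pair_score(xi, xj):
    """Toy pair-wise term: penalise two components closer than 2 units."""
    return -5.0 if abs(xi - xj) < 2.0 else 0.0

well_packed = [1.0, 4.0, 7.0]   # all inside the map, no overlaps
overlapping = [1.0, 2.0, 7.0]   # first two components overlap

print(total_score(well_packed, fit_score, pair_score))
print(total_score(overlapping, fit_score, pair_score))

# The number of terms grows as n + n(n-1)/2, so the pair-wise part dominates.
n = len(well_packed)
print("terms:", n + n * (n - 1) // 2)
```

Note that evaluating F for one assignment is cheap; the difficulty lies in the number of candidate assignments, which is why reducing the number of non-trivial pair-wise terms matters so much in practice.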
However, as indicated above, with more than a handful of proteins, the search space is too large to enumerate blindly. On the other hand, if some prior knowledge or hypotheses about the solution are available to reduce the number of pair-wise terms, it can be advantageous to decompose the graph of $g(x_i, x_j)$ into a junction tree [X64, X65] in which nodes represent groups of coupled variables and edges represent dependencies between nodes. This allows optimisation techniques such as non-serial DP (which can assign optimal values to variables in a non-prescribed order [X66]) to be used to exploit regions of sparse dependencies in the junction tree in order to eliminate variables and to find a global solution efficiently. This approach has been used successfully in the multi-component cryo-EM fitting problem [X64]. However, it is essential that the tree width (i.e. the largest number of variables in a node) be small because the overall computational cost depends exponentially on this quantity.

Although we do not have a precise roadmap to a solution for the assembly problem, we wish to proceed firstly by putting more emphasis on the single-body terms in the scoring function, and secondly by using fast CG representations and knowledge-based distance restraints to prune large regions of the pair-wise search spaces. For example, we want to improve the cryo-EM density fitting calculations by adding the surface skin model that we originally developed for protein docking [T1]. The idea here is to apply the usual common-sense strategy when solving a 2D jigsaw puzzle of trying to place the edge pieces first. Because the sub-units in the final 3D model should pack together quite tightly, there are obviously close parallels between the cryo-EM density fitting problem and the multi-component docking problem. Since we know that proteins cannot physically interpenetrate, we want to use fast CG representations of proteins to prune large regions of the search space.
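The pruning idea can be sketched as follows. This is a minimal illustration that assumes each protein is reduced to a handful of spherical beads of equal radius and that only translations are tested; the function names and the 3 Å bead radius are arbitrary choices for the example, not parameters of any existing program.

```python
import numpy as np

def beads_clash(beads_a, beads_b, radius=3.0):
    """Return True if any bead of A overlaps any bead of B, i.e. if two
    bead centres are closer than the sum of their (equal) radii."""
    a = np.asarray(beads_a, dtype=float)
    b = np.asarray(beads_b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]          # all pair-wise offsets
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)  # squared distances
    return bool((dist2 < (2.0 * radius) ** 2).any())

def prune_translations(fixed_beads, mobile_beads, translations, radius=3.0):
    """Keep only the trial translations of the mobile component that do
    not cause it to interpenetrate the fixed component."""
    mobile = np.asarray(mobile_beads, dtype=float)
    kept = []
    for t in translations:
        if not beads_clash(fixed_beads, mobile + np.asarray(t, dtype=float), radius):
            kept.append(t)
    return kept

fixed = [(0.0, 0.0, 0.0)]
mobile = [(0.0, 0.0, 0.0)]
trial = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
print(prune_translations(fixed, mobile, trial))
```

Because such a test needs only a few bead-bead distances per trial pose, it can reject impossible placements long before any expensive correlation or docking score is evaluated.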
Using such ideas, we wish to start exploring how to combine volumetric shape matching, multi-component docking, and distance constraints in a practical and tractable way. Subsequently, we will build on the experience gained with a view to combining more diverse knowledge-based constraints with, for example, DP optimisation techniques.

Because all of the problems described here are computationally intensive, we will rely heavily on multiprocessor devices and parallel processing techniques to accelerate the calculations. For example, we successfully adapted our Hex protein docking and 3D-Blast shape matching algorithms to use high performance graphics processor units (GPUs) to accelerate the calculations [T4]. In the near future, we plan to explore the suitability of the emerging “Many Integrated Cores” (MIC)¹⁰ devices for high throughput 3D shape matching and docking.

¹⁰ http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

3.2 Classifying and Mining Protein Structures and Protein-Protein Interactions

Participants: Devignes, Ritchie, Maigret

3.2.1 Introduction – Prerequisites for Data Mining

The scientific discovery process is very often based on cycles of measurement, classification, and generalisation. It is easy to argue that this is especially true in the biological sciences. Often, proteins may be divided into modular sub-units called domains, which can be associated with specific biological functions. Thus, a protein domain may be considered as the evolutionary unit of biological structure and function [X67]. A widely used collection of protein domain families is “Pfam” [X68], constructed from multiple alignments of protein sequences. However, while it is well known that the 3D structures of protein domains are often more evolutionarily conserved than their one-dimensional (1D) amino acid sequences, comparing 3D structures is much more difficult than comparing 1D sequences.
Therefore, until recently, most evolutionary studies of proteins have compared and clustered 1D amino acid and nucleotide sequences rather than 3D molecular structures. Indeed, efficient pattern matching algorithms such as FASTA [X69] and BLAST [X70] are now standard tools for searching nucleotide and amino acid sequence databases, but there is still no generally accepted standard for how to align and compare two similar protein structures [X71]. Nonetheless, it is widely accepted that in distantly related proteins, structure is more conserved than sequence [X72]. Furthermore, comparing structures can allow us to detect evolutionary relationships and similarities which cannot be seen beyond a “twilight zone” of around 25% sequence similarity [X73]. Hence there is a strong scientific need to develop new tools to study the structural relationships within and between protein families. In structural biology, two widely used protein domain structure classifications are SCOP [X74] and CATH [X75]. These classifications were created to catalogue the space of protein folds and to help identify functional and evolutionary relationships, especially those which might exist beyond the sequence twilight zone. Both SCOP and CATH describe proteins in a four-level hierarchy, and both populate their hierarchies using various sequence-based and structure-based comparison tools. Both classifications also require the use of considerable human expertise to deal with novel structures which cannot be classified using automatic tools. However, it is becoming increasingly difficult to keep such structural classifications up to date [X76]. For example, an increasing number of cases have been reported where a single fold family can give rise to multiple functions, or indeed where certain families have had to be re-classified as new structures have been found [X77]. 
While the majority of proteins appear to contain just one domain, around 30% to 40% of proteins contain two or more Pfam domains, and studying the nature of such multi-domain proteins could provide new evolutionary insights [X78]. In any case, as more and more new 3D structures become available from structural genomics projects, it is now clear that the space of protein folds is a smooth multi-dimensional space which cannot easily be sub-divided into a simple hierarchy. Indeed, according to Bourne and Shindyalov [X79], “the ultimate and almost certainly unanswerable question is, can we establish a structure-based phylogenetic tree that evolved from a single common ancestor – the original protein fold?” Consequently, the need for fast and reliable computational structural comparison tools and the opportunity to discover novel structural and evolutionary relationships has never been greater. Our research will contribute to these two aspects, first by improving comparison tools, and second by exploring novel classification paradigms.

3.2.2 Quantifying Structural Similarity

Concerning the structure comparison problem, we recently developed a new protein structure alignment algorithm called Kpax (illustrated in Figure 3) which combines an efficient DP-based scoring function with a simple but novel Gaussian representation of protein backbone shape [T11]. This means that we can now quantitatively compare 3D protein domains at a similar rate of throughput to conventional 1D Needleman-Wunsch sequence comparisons. Currently, the main limitations of Kpax are that it cannot handle repeats or transpositions of domains, and it does not detect alternative sub-optimal alignments. It only returns one (the highest scoring) global alignment for each pair. On the other hand, we recently compared Kpax with a large number of other structure alignment programs, and we found Kpax to be the fastest and amongst the most accurate in a CATH family recognition test [T12].
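To illustrate how a DP alignment can be driven by a Gaussian similarity score rather than by a symbolic substitution matrix, the sketch below implements a generic Needleman-Wunsch global alignment over sequences of feature vectors. It is not the Kpax algorithm: the feature vectors, the linear gap penalty, and the width `sigma` are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def gaussian_similarity(u, v, sigma=1.0):
    """Similarity in (0, 1] that decays with the distance between u and v."""
    d2 = float(np.sum((np.asarray(u, float) - np.asarray(v, float)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def global_align(a, b, gap=-0.5, sigma=1.0):
    """Needleman-Wunsch global alignment of two lists of feature vectors.
    Returns (score, list of aligned index pairs)."""
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)   # leading gaps in b
    score[0, :] = gap * np.arange(m + 1)   # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1, j - 1] + gaussian_similarity(a[i - 1], b[j - 1], sigma)
            score[i, j] = max(match, score[i - 1, j] + gap, score[i, j - 1] + gap)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sim = gaussian_similarity(a[i - 1], b[j - 1], sigma)
        if np.isclose(score[i, j], score[i - 1, j - 1] + sim):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(score[i, j], score[i - 1, j] + gap):
            i -= 1
        else:
            j -= 1
    return float(score[n, m]), pairs[::-1]

# Toy "structures": b is a copy of a with its second element deleted.
a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
b = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(global_align(a, b))
```

As in the text, this scheme returns a single highest-scoring global alignment, so it shares the stated limitation of not reporting repeats, transpositions, or sub-optimal alternatives.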
Figure 3: Illustration of the Kpax structural alignment algorithm. (A) By exploiting the highly predictable tetrahedral geometry around each Cα atom, successive peptide fragments may be compared in a common local coordinate frame because similar secondary structure elements appear in similar orientations at the origin. Pairs of peptide fragments may then be compared rapidly using products of local-frame Gaussian functions. (B) The Gaussian parameters are derived using statistics from the CATH database. (C) Overlays of secondary structures may then be calculated rapidly using DP. The superpositions obtained by Kpax are often tighter than those of the widely used TM-Align algorithm [X80]. Arrows highlight particularly tight regions of the Kpax alignment.

In general, protein alignment algorithms aim to maximise the number of aligned residues while simultaneously trying to minimise the root mean squared deviation (RMSD) between the aligned Cα atoms. But here again, structural flexibility can cause considerable problems. If protein structures are assumed (incorrectly) to be rigid, as is the case for most current alignment algorithms, these two objectives are in conflict, because aligning more residues generally increases the RMSD. Consequently, there is often some debate about what defines a “biologically meaningful” alignment, or whether one particular alignment algorithm is “better” than another, especially when comparing structures from different protein families. The latest version of Kpax (unpublished) can calculate flexible alignments at no extra cost, and thus promises to avoid such issues when comparing more distantly related protein folds and fold families.

3.2.3 Formalising and Exploiting Domain Knowledge

Concerning protein structure classification, we aim to explore novel classification paradigms to circumvent the problems encountered with existing hierarchical classifications of protein folds and domains.
In particular, it will be interesting to set up fuzzy clustering methods taking advantage of our previous work on gene functional classification [T13], but instead using Kpax domain-domain similarity matrices. A non-trivial issue with fuzzy clustering is how to handle similarity rather than mathematical distance matrices, and how to find the optimal number of clusters, especially when using a non-Euclidean similarity measure. We will adapt the algorithms and the calculation of quality indices to the Kpax similarity measure.

More fundamentally, it will be necessary to integrate this classification step in the more general process leading from data to knowledge called Knowledge Discovery in Databases (KDD) [X81]. The KDD process can be divided into three steps: data preparation, data mining, and result interpretation. It is a largely iterative process not only because data mining can be re-executed many times with various parameters, but also because interpretation of mining results often leads the expert to define new datasets to be mined. It is our experience that in addition to a human expert’s knowledge, using formalised domain knowledge can help to guide the KDD process during each of the three steps [T14]. For example, integrating domain-domain similarity measures with knowledge about domain binding sites, as introduced by us in the KBDOCK approach [T15, T16], can help in selecting interesting subsets of domain pairs before clustering. Another example where domain knowledge can be useful is during result interpretation: several sources of knowledge have to be used to explicitly characterise each cluster and to help decide its validity. Thus, it will be useful to be able to express data models, patterns, and rules in a common formalism using a defined vocabulary for concepts and relationships.
Existing approaches such as the Molecular Interaction (MI) format [X82] developed by the Human Proteome Organization (HUPO) mostly address the experimental wet lab aspects leading to data production and curation [X83]. A different point of view is represented in the Interaction Network Ontology (INO),¹¹ which is a community-driven ontology that is being developed to standardise and integrate data on interaction networks and to support computer-assisted reasoning [X84]. However, this ontology does not integrate basic 3D concepts and structural relationships. With the help of the Resource Index at the NCBO portal [X85], we will survey all existing ontologies referring to protein-protein interactions and we will introduce whenever required the representation of structural features for reasoning about interactions. For example, the abstract concept of Domain Family Binding Site that we introduced in our KBDOCK approach must be integrated at its proper place in an ontology in order to formalize its use in case-based homology modeling [T15]. Using such formalisms and symbolic relationships will be beneficial, if not essential, when classifying the 3D shapes of proteins at the domain family level.

3.2.4 3D Protein Shape Mining

Thanks to our KBDOCK and Kpax projects, we already have a rich set of tools with which we can start to process and compare all known protein structures and PPIs according to their component Pfam domains. Starting from Pfam-defined structural families, we wish to perform a completely automatic clustering of all structural domains in order to define a fresh structure-based classification of the protein universe.
Linking this new classification to the latest “SIFTS” (Structure Integration with Function, Taxonomy and Sequence) functional annotations between standard UniProt¹² sequence identifiers and PDB structures [X86] could then provide a useful way to discover new structural and functional relationships which are difficult to detect in existing classification schemes such as CATH or SCOP. Going further, we also want to perform all-against-all binding site comparisons and docking calculations using our polar Fourier correlation algorithm. We will then apply fuzzy or non-fuzzy classification methods to the correlation matrices obtained in order to define optimal shape-density based classifications of protein binding sites and protein binding partners.

3.2.5 Mining Protein Interaction Networks

From a systems biology point of view, each known protein-protein interaction may be considered as an edge linking two nodes in a network. Human diseases are often related to malfunctions or imbalances in such networks, caused by natural genetic variations in our DNA called single nucleotide polymorphisms (SNPs). However, determining the molecular origins of disease processes is extremely challenging. We will use symbolic logic-based techniques to represent, reason about, and mine the relationships between disease profiles, SNPs, and molecular interactions. In particular, we will combine these techniques with molecular modelling methods to understand the consequences of genetic polymorphisms and mutations on protein shapes and protein-protein interactions.

¹¹ http://www.ino-ontology.org/
¹² http://www.uniprot.org/

Although several groups have developed pair-wise docking algorithms, there has been very little work so far on how to assemble these individual interactions into structural and functional networks. Furthermore, while several ontologies and mark-up languages exist for the biology and chemistry domains (see e.g.
http://www.biosharing.org/), to our knowledge there does not yet exist any kind of standard for describing molecular interactions at the structural level. Therefore, we will develop a formal framework to represent and manipulate structural knowledge about molecular interactions. This should lead to richer computational models of biological systems, and should open the way for the application of more sophisticated techniques for mining 3D molecular interactions.

3.2.6 Modeling Evolutionary Relationships Between 3D Protein Structures

Although Bourne and Shindyalov’s question concerning a possible “structural phylogenetic tree” is highly thought-provoking, we believe it is probably unanswerable because it seems very unlikely that there could ever have been just one ancestral protein fold. Instead, we prefer to ask some simpler but still challenging questions. Namely, can we devise a computational model to explain how one ancestral structure might diverge into two or more descendants? Or, more concretely, given some existing 3D structures of proteins from nearby family groups in the CATH or SCOP classifications, can we build a parsimonious model of how those structures might have physically mutated from some hypothetical parent structure? Going a little further, we would then ask, can we identify which existing structure most closely resembles the ancestral structure and therefore would provide a tangible 3D model of the ancestral structure? Going still further, we might even ask, can we detect examples of structural convergence, in which a given fold might arise from two different ancestral folds (in analogy to functional convergence, as observed in certain proteases)? If protein molecules could be represented definitively as 1D strings, the above questions could be explored using conventional sequence alignment-based scoring techniques (e.g. using the notion of edit distance).
In order to do ancestral inferencing from genetic sequence data, various techniques have been developed in the standard framework of population genetics, such as likelihood maximisation [X87], importance sampling [X88, X89], and Markov chain Monte Carlo models [X90, X91]. However, because symbolic alignment approaches start to break down beyond the 25% similarity twilight zone, they become almost useless if the aim is to measure the differences between different structural families. Therefore, in collaboration with Nicolas Champagnat of the Institut Élie Cartan de Lorraine (IECL), we wish to extend current ancestral inferencing techniques with new structural similarity scoring functions based on our Kpax software in order to study the ancestral relationships between protein folds in a completely sequence-independent way. This represents a novel combination of computational strategies which could open the way to answering a number of contemporary and far-reaching questions in structural biology. For example, comparing the structural ancestry of proteins with the known phylogeny of species could help to confirm or inform their function, and more generally could help to achieve a structural classification of proteins which could explain the emergence of specialised functional domains and which could reveal interesting structural relationships between the diverse fold families that we see today.

3.3 Specific Applications

The following applications are described here in order of priority, with the first two being very closely linked to the team’s main objectives.

3.3.1 Transcription Factor II-D

Participants: Ritchie, Devignes

Transcription factor II-D (TFIID) is one of the key enzymes responsible for transcribing DNA into complementary RNA. This is the first step of translating a gene into a functional protein. TFIID is a large structure consisting of some 16 protein molecules, although the precise 3D arrangement of these components is still unknown [X92].
The 3D structure of TFIID is being actively studied by the cryo-EM group of Patrick Schultz and the molecular modeling group of Annick Dejaegere at the IGBMC in Strasbourg. Figure 4 shows some views of this system. There is considerable interest in solving the 3D structure of TFIID and other related transcription factors because producing high resolution models of such complexes could have significant bio-medical implications. A manual placement of some of the protein subunits into a low resolution EM map has been proposed (P. Schultz, unpublished). However, several shortcomings of commonly used FFT correlation methods became apparent during the attempts to automate this process due to the difficulties of dealing with data at different scales of both size and resolution. Therefore, the TFIID system represents a clear example of the kind of integrative structural biology problem that we wish to tackle (scales 2 and 3), as it exemplifies the need both for powerful 3D image processing techniques and for the ability to incorporate biological knowledge to focus the calculations. Indeed, we consider strengthening our collaborations with the IGBMC and building new collaborations with other groups in the cryo-EM field to be an essential long-term component of the Capsid project. We expect the collaboration with the IGBMC to continue for at least the next four years, and we will try to build collaborations with other cryo-EM teams as well.

Figure 4: Cryo-EM density maps and micrographs of TFIID. (a) Two views of a cryo-EM map containing TFIID. (b) Three X-ray sub-units fitted into the density map. (c) Evidence for movement of the TFIIA sub-unit, which is a co-factor that binds to and stabilises TFIID. (d) Micrographs of TFIID-TFIIA-Rap1-DNA complexes showing the formation of a DNA loop. Rap1 is one of several further co-factors necessary for transcription. Figure taken from [X93].
3.3.2 Prokaryotic Type IV Secretion Systems

Participants: Devignes, Ritchie

Prokaryotic type IV secretion systems constitute a fascinating example of a family of nanomachines capable of translocating DNA and protein molecules through the cell membrane from one cell to another [X94]. The complete system involves at least 12 proteins, and is illustrated in Figure 5. The structure of the core channel involving three of these proteins has recently been determined by cryo-EM experiments [X95, X96]. However, the detailed nature of the interactions between the remaining components and those of the core channel remains to be resolved. Therefore, these secretion systems represent another family of complex biological systems (scales 2 and 3) that call for integrated modeling approaches to fully understand their machinery.

Within the framework of the LORIA-MBI platform (see Section 6.2), MD Devignes has initiated a collaboration with Nathalie Leblond of the Genome Dynamics and Microbial Adaptation (DynAMic) laboratory (UMR 1128, Université de Lorraine, INRA) on the discovery of new integrative conjugative elements (ICEs) and integrative mobilisable elements (IMEs) in prokaryotic genomes. These elements use Type IV secretion systems for transferring DNA horizontally from one cell to another. We have discovered more than 40 new ICEs/IMEs by systematic exploration of 72 Streptomyces genomes. As these elements encode all or a subset of the components of the Type IV secretion system, they constitute a valuable source of sequence data and constraints for modeling these systems in 3D. A collaboration with a crystallography group working with the DynAMic laboratory is planned for producing and crystallising the most challenging components. This set of 3D protein sub-units will be used for testing our algorithms with the objective of (i) reconstituting the already solved core channel and (ii) predicting the interactions leading to a complete, functionally active Type IV secretion system.
Another interesting aspect of this particular system is that, unlike other secretion systems, the Type IV secretion systems are not restricted to a particular group of bacteria. These nanomachines display a broad phylogenetic distribution, and constitute an interesting topic for exploring structural evolution. We expect to continue our collaboration with the DynAMic team for at least the next four years.

Figure 5: The structure of an archetypal Type IV secretion system (A. tumefaciens). (A) Schematic predicted structure involving 12 different proteins [X94]. (B) Cryo-EM reconstruction of the central core region [X95].

3.3.3 G-protein Coupled Receptors

Participants: Maigret, Ritchie

G-protein coupled receptors (GPCRs) are cell surface proteins which detect chemical signals (scale 1) outside a cell and which transform these signals into a cascade of cellular changes (scale 4). Figure 6 shows the structure of a recently solved example of a GPCR system, the β2-adrenergic receptor [X97]. Historically, the most well documented signaling cascade is the one driven by G-protein trimers (guanine nucleotide binding proteins) [X98] which ultimately regulate many cellular processes such as transcription, enzyme activity, and homeostasis. But other pathways have recently been associated with the signals triggered by GPCRs, involving other proteins such as arrestins and kinases which drive other important cellular activities. For example, β-arrestin activation can block GPCR-mediated apoptosis (cell death). Malfunctions in such processes are related to diseases such as diabetes, neurological disorders, cardiovascular disease, and cancer. Thus, GPCRs are one of the main protein families targeted by therapeutic drugs [X99] and the focus of much bio-medical research. Indeed, approximately 40–50% of current therapeutic molecules target GPCRs. However, despite enormous efforts, the main difficulty here is the lack of experimentally solved 3D structures for most GPCRs.
Hence, computational modeling tools are widely recognized as necessary to help understand how GPCRs function, and thus to support biomedical innovation and drug design.

Figure 6: The X-ray structure of a classic GPCR system, the β2-adrenergic receptor, shown in an artist’s representation of the cell membrane (light blue). The trans-membrane receptor domain is in blue, the intra-cellular G-protein trimer structures are in red, and a small agonist molecule bound to the receptor is shown in orange. Figure taken from http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2012/.

In collaboration with medicinal chemistry colleagues in the universities of Bari (Italy) and Ramon Llull (Spain), we have long been interested in using computational techniques to develop new small-molecule inhibitors of the CCR5 and CXCR4 receptors which are attacked by human immunodeficiency virus (HIV) [T17, T18, T19]. Together with Catherine Llorens-Cortes at the Centre for Interdisciplinary Research in Biology (CIRB; UMR 7241) at Collège de France, we are studying another GPCR called the APJ apelin receptor, which is involved in the regulation of cardiovascular function (and which also appears to be one of the co-receptors of HIV) [X100, X101]. One promising route to develop therapeutic molecules to control heart disease is to design new small molecules which mimic the apelin signaling peptide but which have better transport properties and which are degraded less quickly than the natural peptide. We participated in the discovery of the only non-peptide ligand for APJ to have been found to date [X102]. As well as modeling the structures of GPCRs and working on GPCR-targeted drug discovery, we are developing new algorithms to improve conformational sampling of mutually flexible docking partners (as is the case in the APJ/apelin system), and to speed up the virtual screening pipeline.
While recent technological advances now make it possible to run ever longer MD simulations, analysing the enormous datasets which result (often many terabytes) is now becoming a major bottleneck. Hence we are interested in developing novel clustering techniques to detect “interesting” events in long MD simulations. The apelin receptor system also nicely exemplifies our interest in the relationships between genetic variations and human diseases (Section 3.2). For example, while the apelin peptide appears to be the same in all mammalian species, several single nucleotide polymorphisms (SNPs) in the human APJ gene have been associated with different cardiovascular disease profiles [X101]. We are aware that research on GPCRs is highly competitive. Nonetheless, one of us (B. Maigret) is an expert on modeling GPCR structures, and it is natural that the team will continue to contribute in this area for at least the duration of B. Maigret’s emeritus position at the LORIA, i.e. at least until the end of 2015.

4 The Team
4.1 Permanent Researchers

David Ritchie has a PhD in Computing Science (University of Aberdeen, 1998), a Masters in Artificial Intelligence (Aberdeen, 1995), and a Bachelors in Chemistry (University of Bristol, 1978). Before coming to France, he spent 9 years as a lecturer in the Department of Computing Science at the University of Aberdeen (1999–2008). Thanks to an ANR Chaires d’Excellence award, he joined Amedeo Napoli’s Orpailleur team at the LORIA in January 2009. He then obtained a permanent position with Inria in October 2010. He obtained his Habilitation à Diriger les Recherches (HDR) in 2011 from Université Henri Poincaré (now part of Université de Lorraine). Concerning research outputs, he is probably best known for his novel spherical polar Fourier correlation technique for protein docking and 3D molecular shape matching. His Hex docking software is one of the most widely used protein docking programs available (over 33,000 downloads).
He has published some 40 international journal articles (900 citations and H=17 in Thomson ISI) in the fields of structural bioinformatics and chemoinformatics. His research has been funded by grants worth approximately €1M from ANR and the British BBSRC and EPSRC. Throughout his career, “Dave” Ritchie has been deeply involved in scientific computing, both as a professional software developer in the oil and chemicals industries (1979–1994) and later as an academic researcher. His scientific motivation centres around his desire to understand how complex biological systems work at the molecular level. His practical contributions toward this goal involve developing novel and efficient computational and knowledge-based techniques to represent and study the 3D shapes of biological molecules and their complexes. He still enjoys writing his own software, but he also firmly believes that the best way to make significant progress is to bring together a team of experts with complementary skills.

Marie-Dominique Devignes was trained at the Ecole Normale Supérieure (1977–1982). She obtained her Masters in Biochemistry and Physiology in 1979 at the University of Paris VI and VII and the Agrégation in Biochemistry and Physiology in 1980. After her PhD in Molecular Biology in 1982, she spent 18 months in Germany. She was subsequently recruited by the CNRS in 1983. She received the bronze medal of the CNRS in 1986 and her Doctorat d’Etat ès Sciences (equivalent to HDR) in 1988. She worked in the field of Human Genetics at the CNRS in Villejuif from 1989 to 2000 and turned to computational biology when joining the LORIA in Nancy in 2001. She now coordinates several collaborative bioinformatics projects around the LORIA. She is internationally recognised in the fields of data integration and knowledge-based approaches for bioinformatics. She has published over 45 international journal articles which have some 650 citations.
In 2014 she will chair the European Conference on Computational Biology (ECCB), which is the largest European conference in this field.

Bernard Maigret originally studied Chemical Engineering at the Ecole Nationale Supérieure de Strasbourg. He received his Doctorat d’Etat ès Sciences Physiques in 1975. From 1993 to 1997 he was the head of the CNRS unit no. 510 “Interactions Moléculaires.” From 2003 to 2006, he was the head of the eDAM team (Équipe de Dynamique des Assemblages Membranaires) of UMR 7565 in Nancy. During his scientific career he has published some 190 papers (4800 citations, H=39) on subjects ranging from quantum mechanics, molecular dynamics, and virtual drug screening, to molecular graphics, clustering, and optimisation.

4.2 Team Software

The team members have already developed several techniques and tools for computational and knowledge-based modeling of protein structures and interactions which will provide useful foundations for this project. In particular, we mention here in approximate order of maturity (oldest first):
• Hex – state of the art protein docking using polar Fourier correlations – http://hex.loria.fr/.
• HexServer – a GPU-powered web server for protein docking – http://hexserver.loria.fr/.
• Intelligo – a vector based semantic similarity measure for biological processes and molecular function based on gene ontology (GO) terms – http://intelligo.loria.fr/.
• KBDOCK – a 3D database of structural protein-protein domain interactions – http://kbdock.loria.fr/.
• Kpax – a 3D protein and peptide structure database search and alignment algorithm – http://kpax.loria.fr/.
• gEMpicker – a parallel GPU-based particle picking tool for cryo-EM – http://gem.loria.fr/.
• gEMfitter – a parallel GPU-based cryo-EM density matching and docking tool – http://gem.loria.fr/.
Although the Hex docking program might be considered as our “flagship” software, we do not intend to develop it much further as a protein-protein docking tool.
Because Hex is fundamentally a rigid-body docking algorithm, it is not well suited to modeling structural flexibility during docking, and the exponential term in the spherical polar basis functions means that high resolution docking correlation calculations are limited to protein domains of up to around 150 amino acid residues. Nonetheless, we do wish to re-structure the polar Fourier correlation code in Hex in order to provide a general rotational FFT library. Together with the GPU-accelerated Cartesian cross-correlation codes that we developed in gEMfitter, this new library will be important for the cryo-EM assembly part of the project. We also expect that the main Hex program will provide a useful test-harness with which to evaluate the new CG potentials that we plan to use during multicomponent assembly. We expect that Kpax and the KBDOCK database will play important roles in the 3D shape mining part of this project. Kpax provides a natural way to score the structural similarity of pairs of protein domains, and we are currently extending it to calculate multiple flexible alignments within groups of similar protein structures. The KBDOCK database contains information on all currently known 3D protein-protein interactions, classified according to Pfam families. It therefore represents an important resource for analysing 3D protein binding sites and interfaces and for identifying structure-function relationships.

5 Positioning
5.1 Positioning within Inria
Inria Domain: Santé, Biologie et Planète Numériques
Inria Theme: Biologie Numérique

The main motivation for proposing a new Inria team is to achieve greater institutional visibility and support for the scientific problems we wish to address. This will help the team members to build scientific momentum and to obtain further research funding at both national and international levels.
Forming a focused Inria team for structural bioinformatics will also create a stimulating environment with which to attract and train bright young students who wish to work in this area. The permanent members of the Capsid team currently belong to the Orpailleur project team. While the Orpailleur team received an excellent assessment in the latest (2011) Inria Evaluation exercise, the evaluation report also noted that the team risked being spread across too many axes. Consequently, the Project Committee at Inria Nancy recommended that the life sciences sub-group should accelerate its plans to form a new team. Therefore, this proposal is consistent with the scientific strategy of the Inria Nancy Grand Est centre. More globally, Inria has long recognised the important contribution that the Computational Sciences can make to our economic and societal well-being. In the Computational Biology and Bioinformatics theme, several Inria teams are working on topics such as high-throughput sequence analysis (Bonsai, GenScale), cellular modeling and molecular imaging (Beagle, Morpheme, Serpico), and integrative and systems biology (Amib, Ibis), while other teams are developing sophisticated symbolic techniques to analyse genomic-scale biological information (Bamboo, Dyliss, Magnome). Although molecular structure is sometimes an important consideration for these teams, only Frédéric Cazals (ABS), Rumen Andonov (GenScale), and Jérôme Azé and Julie Bernauer (Amib) make protein structure a central interest. Andonov’s work on protein structure alignment uses constraint-based solver techniques to generate provably optimal alignments according to a specified criterion (but often with a high computational cost). In contrast, our Kpax approach is designed for very fast database searching while still giving tight, but not necessarily optimal, 3D superpositions.
Azé and Bernauer use 3D Voronoi representations and statistical learning techniques to construct potentials for scoring protein-protein and protein-RNA interactions [X103]. More recently, Cazals used a Voronoi representation to define a purely geometric description of the environment of each atom within a protein-protein interface [X104]. In contrast, our KBDOCK database approach uses a very simple shape clustering method to represent families of protein binding sites purely symbolically. Thus, while ABS begins from geometric foundations, our KBDOCK approach aims to bridge the gap between shape-based and knowledge-based representations of molecular interactions. It is worth noting that Cazals is also interested in modeling large multi-component protein complexes. In particular, his group recently developed a geometric “toleranced model” (TOM) approach for verifying the correctness of large multi-component models [X105]. Because the TOM approach represents each protein as a union of 18 balls, it may be considered as a kind of CG geometric representation. Although the TOM approach itself does not provide a way to build multi-component models from scratch, it could clearly be used to assess or validate models proposed by Capsid. Thus, there are opportunities for ABS and Capsid to collaborate constructively. Still, a distinctive feature of the Capsid team will be its combination of shape-based and knowledge-based approaches for structural biology. Concerning HPC, we can point out that several areas of bioinformatics are computationally intensive and involve working with very large datasets. For high throughput sequencing, the GenScale team is studying the use of advanced in-memory indexing techniques in conjunction with parallel processing using multiple levels of granularity. Similarly, the Bonsai team is building and collecting GPU-accelerated sequence analysis tools as part of their Biomanycores project.
In contrast, our aim is to use HPC to accelerate 3D biomolecular shape comparisons. In this context, our work has some links to bio-medical image processing teams such as Athena and Asclepios because the calculation of 3D volumes, gradients, and moments is often a common feature of 3D imaging problems. We mention here that our main Inria collaborator is Sergei Grudinin (team Nano-D), with whom we collaborate formally on an ANR project (see Section 7.1).

5.2 Positioning within the LORIA and the University of Lorraine

The LORIA is a mixed research unit (UMR 7503) which hosts permanent researchers from Inria, CNRS, and the University of Lorraine. These researchers are distributed across some 27 teams, of which 16 are common with Inria project teams. The LORIA teams are organised into five departments: 1. Algorithms, Computation, Image & Geometry; 2. Formal Methods; 3. Networks, Systems and Services; 4. Knowledge & Language Management; 5. Complex Systems & Artificial Intelligence. The 2012 AERES report on the LORIA ranked all of these departments as “A+” or “A” on all assessment criteria. However, it noted that the computational biology activities within the Orpailleur team (department 4) are not closely linked to the principal theme of that department, although it acknowledged the application of knowledge-based approaches to the life sciences. Consequently, one of the report’s conclusions was that the computational biology theme should move to department 5. It is therefore proposed that the Capsid team will join department 5 – Complex Systems & Artificial Intelligence. Nonetheless, the new team will still maintain close collaborations with Orpailleur. For example, we recently obtained a studentship from the IAEM doctoral school to work on biomedical knowledge discovery which will be jointly supervised by Adrien Coulet (Orpailleur).
The current teams of department 5 (Cortex, Kiwi, Maia, and Neurosys) study computational neuroscience, multi-agent autonomous systems, planning, robotics, cellular automata, and emergent or collective behaviours in biological systems and social networks. Hence, there exist interesting complementarities and opportunities for cross-fertilisation. For example, our work with GPCRs could be of interest to the Neurosys team, which studies brain activity and anesthesia, because several of the receptors on the surface of nerve cells are GPCRs. Conversely, some of the optimisation and planning techniques being developed in the Maia team could help to find new ways to model protein flexibility, or to help solve multi-component docking problems. Furthermore, there are some close parallels between mining human preferences or very large social networks and searching for interesting biological relationships in protein interaction networks. Thus, there could also be opportunities for collaborations with members of the Kiwi team.

5.3 National Positioning

Beyond Inria, our main national collaborators in structural biology are currently at the IGBMC in Strasbourg. The Strasbourg teams are members of the French Infrastructure for Integrated Structural Biology (FRISBI; footnote 13), the national network to promote the use of experimental techniques to solve large biomolecular structures. We are working with the cryo-EM group of Patrick Schultz at the IGBMC through a CNRS “PEPS” award (Projet Exploratoire / Premier Soutien) and a LORIA-funded postdoc project to develop highly parallel correlation techniques on GPU clusters for automatic 2D particle picking and 3D shape-density matching. The structural modeling team of Annick Dejaegere is developing a CG bead model of multi-component complexes. Apart from Inria-ABS, this team is one of the few groups in France who are actively developing new computational algorithms for integrative structural biology.
Excluding classical force-field MD modeling software, which is mainly developed in biophysics and biochemistry laboratories, only a few groups in France or indeed world-wide are developing ab initio protein docking algorithms. The group of Chantal Prévost at the IBPC in Paris developed Ptools, a programmable docking framework that includes a normal mode-based model of protein flexibility [X106]. While this allows protein flexibility to be simulated during docking, it remains computationally expensive, which makes it rather unsuitable for large-scale studies. Another French group that should be mentioned here is that of Anne Poupon (now team Bios in Tours) and Jérôme Azé (Inria-Amib). Their approach uses machine learning techniques to train a scoring function based on Voronoi tessellations of protein shape [X103]. We have a manuscript in preparation with Anne Poupon on modeling a multi-component GPCR signaling complex.

Concerning Inria’s scientific strategy, by building links and collaborations with external researchers in biology and medicine, the activities of the Capsid team will closely match Inria’s objective to “integrate multiscale data, both temporally and spatially, in order to model complex biological systems” (Inria Strategic Plan 2013–2017; footnote 14). By developing computational methods to help investigate the molecular basis of diseases such as heart disease and diabetes, the team will help to address the “Health and Well-Being” strategic objective which is also mentioned in the Strategic Plan. The activities of the team will raise the profile of Inria’s participation in several national organisations:
• IFB (Institut Français de Bioinformatique) – http://www.renabi.fr/ (formerly ReNaBI: Réseau National des plateformes Bioinformatiques).
• FRISBI (French Infrastructure for Integrated Structural Biology) – http://frisbi.eu/.
• SFBI (Société Française de Bio-Informatique) – http://www.sfbi.fr/.
• SFCI (Société Française de Chémoinformatique) – http://www.chemoinformatique.fr/.
• GdR 3003 Bioinformatique Moléculaire – http://www.gdr-bim.u-psud.fr/.
• Aviesan (Alliance nationale pour les sciences de la Vie et de la Santé) – http://www.aviesan.fr/.

13 http://frisbi.eu/
14 http://www.inria.fr/institut/strategie/plan-strategique

5.4 International Positioning

International progress in computational protein docking is assessed in the European conference on Critical Assessment of PRedicted Interactions (CAPRI) [X107] and its American partner conference, Modeling of Protein Interactions in Genomes (MPIG) [X108]. During the last 10 years, we have submitted docking predictions for almost all of the targets in the CAPRI experiment, and we have contributed to nearly all of the CAPRI and MPIG meetings through oral presentations and posters. While our Hex polar Fourier correlation approach can often produce acceptable predictions and is still one of the fastest, several groups get better results because they are more skilled in the use of domain knowledge and because they use more sophisticated physico-chemical scoring functions and more expensive refinement protocols. Based on results presented by Lensink and Wodak [X109] at the CAPRI-2013 conference (footnote 15), some leading protein docking groups are those of Alexandre Bonvin (U Utrecht; footnote 16), Sandor Vajda (U Boston; footnote 17), Zhiping Weng (U Massachusetts; footnote 18), Paul Bates (Cancer Research UK; footnote 19), and Juan Fernandez-Recio (Barcelona Supercomputing Center; footnote 20). Currently, very few of the CAPRI participants use CG representations, presumably because most of the targets in CAPRI have been relatively small crystallographic dimers or trimers. However, a notable exception is the ATTRACT algorithm of Martin Zacharias (Technical University of Munich; footnote 21). The ATTRACT approach has also recently been used in cryo-EM density fitting [X110].
Mathematically, the most similar docking algorithm to ours is FRODOCK [X111], developed by the groups of Pablo Chacon (CSIC, Madrid) and Ruben Abagyan (U California San Diego). The same authors also developed a fast rotational correlation technique called ADP_EM for cryo-EM density fitting [X112]. Like us, they exploit the special rotational properties of the spherical harmonic basis functions to accelerate the rotational search, but they use numerically sampled radial shells for the radial coordinate whereas we use orthogonal Gauss-Laguerre basis functions. This means that translations must be calculated numerically in FRODOCK and ADP_EM, whereas they are calculated analytically in Hex. Although nowadays it is difficult to make significant progress in ab initio docking algorithms, the CAPRI experiment shows that the best docking models are often achieved when biological knowledge is used to drive the calculation or to filter the results (“data-driven docking”) [X113]. Several groups have published databases of structural protein-protein interactions [X114], and several others have published homology-based (or “template-based”) docking protocols [X115, X116, X117]. However, it still remains unclear how best to link structural databases and docking algorithms in a reliable way. More fundamentally, there is still no generally accepted way to define what actually constitutes a protein binding site or to quantify whether or not two binding sites are structurally similar. In this respect, our KBDOCK approach represents one of the newest and most promising developments to have been described [T15]. When tested on the widely used Protein Docking Benchmark [X118], KBDOCK almost invariably finds a good docking model if a suitable homology template exists in its case base. While CAPRI continues to attract new participants, still only a few groups are using docking algorithms to study protein interactions on a large scale.
To our knowledge, the only genomic-scale docking experiment to have been reported to date was made by the group of Patrick Aloy at the Institute for Research in Biomedicine (IRB) in Barcelona, who performed 3,700 protein docking calculations in yeast [X119]. This study focused on performing high-throughput pair-wise docking calculations, and it did not attempt to use biological knowledge or restraints to guide the calculations. Subsequently, a large-scale docking experiment by the group of Alfonso Valencia at the Spanish Cancer Research Centre in Madrid was carried out using our Hex docking program [X120]. By examining the profiles of docking scores obtained, they found that the true interactions between 56 pairs of known interactors could often be distinguished from a background of 922 non-interactors. These results indicate that even simple shape-based algorithms can produce a useful docking signal, which could potentially be used to train a learning algorithm.

15 http://tintin.science.uu.nl/CAPRI2013/home/home
16 http://www.nmr.chem.uu.nl/~abonvin/
17 http://structure.bu.edu/
18 http://zlab.umassmed.edu/zlab/
19 http://www.london-research-institute.org.uk/research/paul-bates
20 http://www.bsc.es/life-sciences/protein-interactions-and-docking
21 http://www.t38.ph.tum.de/index.php?id=17

The main groups that we are aware of who are developing multi-component assembly algorithms are those of Willy Wriggers (D E Shaw Research, New York; footnote 22), Andrei Sali (U California, San Francisco; footnote 23), and Haim Wolfson (U Tel Aviv; footnote 24). Wriggers’ Sculptor program [X121] is a graphical interface to his Situs toolkit for cryo-EM modeling [X122]. This allows the user to build a multi-component model by incrementally adding one protein sub-unit at a time into an EM density map. The groups of Sali and Wolfson aim for more automated assembly of the sub-units. They use spanning-tree [X61] and junction-tree [X64, X65] techniques to represent the combinatorial search space.
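As a rough illustration of how a spanning tree can tame this combinatorial choice, the following minimal Python sketch picks, from a set of pairwise interface scores between sub-units, the N−1 interfaces that connect all N sub-units with the highest total score (Kruskal's algorithm with a union-find structure). The sub-unit names, the scores, and the function name max_spanning_tree are hypothetical and chosen only for this example; this is not the procedure of [X61] or of any of the programs cited above.

```python
from typing import Dict, List, Tuple

# (sub-unit A, sub-unit B, pairwise score); a higher score means a better interface.
Edge = Tuple[str, str, float]


def max_spanning_tree(subunits: List[str], edges: List[Edge]) -> List[Edge]:
    """Return the edges of a maximum-score spanning tree (Kruskal's algorithm)."""
    # Union-find forest: each sub-unit starts in its own component.
    parent: Dict[str, str] = {s: s for s in subunits}

    def find(s: str) -> str:
        # Follow parent links to the component root, halving the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    tree: List[Edge] = []
    # Visit candidate interfaces from best to worst score.
    for a, b, score in sorted(edges, key=lambda e: e[2], reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The interface joins two separate components, so keep it.
            parent[root_a] = root_b
            tree.append((a, b, score))
    return tree


# Hypothetical sub-units and pairwise scores, invented for illustration only.
SUBUNITS = ["A", "B", "C", "D"]
SCORES = [
    ("A", "B", 9.0),
    ("A", "C", 4.0),
    ("B", "C", 7.5),
    ("B", "D", 2.0),
    ("C", "D", 6.0),
    ("A", "D", 1.0),
]

TREE = max_spanning_tree(SUBUNITS, SCORES)
for a, b, score in TREE:
    print(f"{a}-{b}: {score}")
```

In this toy case the tree keeps the interfaces A-B, B-C, and C-D, so only three of the six candidate pairs need to be considered when placing the sub-units. A real assembly method would also have to check that the chosen pairwise poses are geometrically compatible with one another, which this sketch does not attempt.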
Sali’s group has recently made their Integrative Modeling Platform (IMP) software (footnote 25) publicly available [X123]. This consists of Python modules for manipulating protein structures, cryo-EM density maps, and SAXS profiles and other data, and for applying various kinds of distance and volume restraints to guide the scoring function. IMP includes the DOMINO junction-tree solver developed by Wolfson’s group [X64]. By combining diverse multi-resolution data on the component protein structures and their interactions, Sali’s team recently used IMP to help locate the positions of the major substructures in the very large nuclear pore complex (NPC) comprising a total of 456 proteins [X124, X125]. Nonetheless, this endeavour required an enormous multi-disciplinary effort from many teams. In our estimation, integrative structural biology is one of today’s exciting frontiers of science. While we do not aspire to compete with the IMP approach, we believe we can make a useful contribution to the field by focusing on algorithmic aspects of the multi-component assembly problem. Concerning our plans to model 3D evolutionary relationships, several reviews in structural biology have pointed to the evident structural relationships between different protein families (see, e.g., [X126, X127, X128]). The originators of the SCOP and CATH protein structure classifications such as Alexey Murzin (U Cambridge), Christine Orengo (University College, London), and Willie Taylor (NIMR, London) are now considered as world experts on this subject. These groups, and many others [X129], are still actively developing tools to compare and classify protein structures. However, to our knowledge, nobody has yet attempted to compute 3D distances between related structures from an evolutionary point of view.
Furthermore, the idea of using 3D structural similarity scoring to reconstruct the ancestry of different protein families is also novel, partly because until very recently no sufficiently efficient scoring algorithm had been available, but mainly because the basic notion of constructing a mathematical model of 3D protein structure evolution is also completely new, and will involve combining both micro-evolutionary and macro-evolutionary approaches.

22 http://www.deshawresearch.com/members_c-b_wriggers.html
23 http://salilab.org/
24 http://www.cs.tau.ac.il/~wolfson/
25 http://www.salilab.org/imp/

6 Collaborations and Technology Transfer
6.1 Research Collaborations

Because our scientific objectives stand at the interface between informatics and biology, it is very important that we develop collaborations with external researchers in biology and medicine. This will ensure that we understand the state of the art in the target domain, and that we can address relevant problems at the frontiers of research. Around the University of Lorraine we have a number of bioinformatics collaborations with colleagues in the life sciences. Several of these collaborations are currently supported by the MBI platform which is funded jointly by Inria and Région Lorraine. This platform supports several projects concerning biomolecular modeling, systems biology, and knowledge discovery. As well as providing bioinformatics resources, the platform also provides a framework for training biology colleagues in molecular dynamics and data mining methods.
Current MBI projects include modeling the interactions between polyphenols and certain biocatalysts (ENSAIA), clustering very large molecular datasets (with LORIA-Qgar), comparative genomic studies of forest-microorganism ecosystems (INRA-IAM), comparing protein interaction networks to discover disease genes (with CHU Nancy), studying drug side effects (with MPI Saarbrücken), protein secondary structure prediction (LORIA-ABC), modeling promoter sequences in bacteria (INRA), and genetic cohort studies (Inria-BIGS, IECL, CHU-Nancy). We are also participating in the EXPLOR project (Ensemble de Calcul Scientifique pour la LORraine; porteur: Gérald Monard, SRSMC), which is working to set up a shared computational Mésocentre for local researchers in the physical and life sciences. We are actively developing further local collaborations. For example, with Chris Chipot’s eDAM team (UMR 7565), we are preparing a proposal for the Human Frontiers in Science Program (HFSP) for a project on modeling protein-protein interactions in the brain. With Nathalie Leblond and Gérard Guedon from the DynAMic team (UMR 1128), we recently submitted a short proposal to the ANR to study integrative and conjugative elements in bacteria. This proposal was not selected for a full proposal, but we have since submitted a proposal for a joint INRA-Inria studentship for the same project. We recently obtained funding for a “PEPS Mirabelle” project with Philippe Jonveaux and other colleagues from the Faculty of Medicine (INSERM 954) to support a doctoral thesis project to study the use of Linked Open Data (footnote 26) in the biomedical domain. As mentioned above, our main external collaborators in France are in Grenoble and Strasbourg. Our ANR-funded project with Sergei Grudinin (Inria-NanoD) also involves the team of Valentin Gordeliy at the Institut de Biologie Structurale in Grenoble.
In Strasbourg, we are working with members of the Integrated Structural Biology Department at the IGBMC (Schultz and Dejaegere) to help reconstruct 3D structures from low resolution 3D cryo-EM maps. We are also working with colleagues in Reims and Lyon to build a project on modeling systems of extracellular proteins. At the international level, we have several long-standing collaborators including Sandor Vajda (Structural Bioinformatics Laboratory, U Boston), Antonio Carrieri (Dipartimento Farmaco-Chimico, Università di Bari), Jordi Teixidó (Institut Químic de Sarrià, Universitat Ramon Llull), and Tim Clark (Computer Chemistry Center, U Erlangen). Our main strategy to increase our network of collaborations will be to participate in and to help organise relevant national and international conferences (e.g. JOBIM, GGMM, CAPRI, ECCB, ISMB) and societies (e.g. SFBI, SFCI, ISCB). This in turn will lead to opportunities to form more formal collaborations through joint projects funded by the national and European funding agencies.

26 http://lod-cloud.net/

6.2 Technology Transfer

We are involved in technology transfer with both local and national organisations. While we do not expect that our activities will directly lead to marketable software, some of our projects could lead to patentable discoveries. Where appropriate, we will consult with Inria business advisors and lawyers concerning the protection of our intellectual property and the transfer and exploitation of our results. All three of us (DR, MDD, and BM) are scientific advisors to Harmonic Pharma (see below). The following list summarises our current technology transfer partners:
• Harmonic Pharma: this LORIA-CNRS spin-out company aims to add therapeutic value to drug-like molecules and to reposition existing drugs – http://www.harmonicpharma.com/.
• BioProlor: this is a consortium of six regional enterprises that are collaborating with the University of Lorraine, INRA, CNRS, INSERM, and ourselves to develop new pharmaceutical and cosmetic products – http://www.bioprolor.com/.
• IFB: through our MBI platform (Modélisation de Biomolécules et leurs Interactions), we are one of the partners of the north-east section of the IFB (formerly ReNaBI: Réseau National des plateformes Bioinformatiques), which includes the labs CIB and LIFL (Lille), IGBMC (Strasbourg), and MMP (Reims) – http://www.renabi.fr/.

Because funding for the MBI platform ended in December 2013, one of us (MDD) is working to create a new platform for interdisciplinary engineering in Lorraine (project InterBioNum). This will be a resource shared amongst the supervising institutions (tutelles) for bioinformatics consulting, training, and technology transfer. We expect this platform to be closely aligned with the new regional Mésocentre, which will also coordinate training and technology transfer.

7 Funding

7.1 Current Projects

The following list summarises our currently funded research projects:

• ANR “PEPSI” (Polynomial Expansions of Protein Structures and Interactions), 2011 – 2015, joint with Inria Grenoble, €162K for Inria Nancy.
• ANR “IFB” (Institut Français de Bioinformatique), 2013, joint with CNRS, CEA, INRA, INSERM, €60K for Inria Nancy.
• FUI/FEDER project “LBS” (Le Bois Santé – to exploit wood products in the pharmaceutical and nutrition domains), 2013 – 2015, €57K (approx).
• CNRS-UL PEPS “EXPLOD-BioMed” (Exploring the Linked Open Data (LOD) for knowledge discovery: Applications to the biomedical domain), 2013 – 2014, with CHU Nancy, €15K.

We have recently obtained two doctoral bursaries:

• Bourse Doctorale de l’IAEM Ecole Doctorale (Exploring the Linked Open Data (LOD) for knowledge discovery: Applications to the biomedical domain), 2013 – 2016, joint supervision with Adrien Coulet (MdC, Université de Lorraine), €132K (approx).
• Bourse Doctorale de la Fédération Charles Hermite (Modeling evolutionary relationships between three-dimensional protein structures), 2013 – 2016, joint supervision with Nicolas Champagnat (Inria/IECL), €132K (approx), co-funded by Région Lorraine.

7.2 Future Funding Strategy

We will propose projects for PhD studentships in the annual competitions of the IAEM doctoral school and Inria’s doctoral and post-doctoral training programmes. Where appropriate, we will also apply for co-funding from Région Lorraine or from industrial partners (CIFRE). On a larger scale, we will discuss with colleagues in other Inria teams the possibility of proposing an Action d’Envergure for structural bioinformatics, and we will seek to form partnerships with other labs for ANR projects. Naturally, we will monitor calls for proposals for opportunities in large national (e.g. LABEX), European (e.g. ERC), and international (e.g. ANR International) programmes.

References

Team References

[T1] D. W. Ritchie and G. J. L. Kemp. Protein docking using spherical polar Fourier correlations. Proteins: Structure, Function, Genetics, 39(2):178–194, 2000.
[T2] L. Mavridis and D. W. Ritchie. 3D-blast: 3D protein structure alignment, comparison, and classification using spherical polar Fourier correlations. In Pacific Symposium on Biocomputing 2010, pages 281–292, Hawaii, USA, January 2010. World Scientific Publishing.
[T3] D. W. Ritchie, D. Kozakov, and S. Vajda. Accelerating protein-protein docking correlations using a six-dimensional analytic FFT generating function. Bioinformatics, 24(4):810–823, 2008.
[T4] D. W. Ritchie and V. Venkatraman. Ultra-fast FFT protein docking on graphics processors. Bioinformatics, 26:2398–2405, 2010.
[T5] D. Mustard and D. W. Ritchie. Docking essential dynamics eigenstructures. Proteins: Structure, Function, Bioinformatics, 60:269–274, 2005.
[T6] V. Venkatraman and D. W. Ritchie. Flexible protein docking refinement using pose-dependent normal mode analysis.
Proteins, 80:2262–2274, 2012.
[T7] A. Ghoorah, M. Smaïl-Tabbone, M.-D. Devignes, and D. W. Ritchie. Protein docking using case-based reasoning. Proteins, 81:2150–2158, 2013.
[T8] D. W. Ritchie. Recent progress and future directions in protein-protein docking. Current Protein and Peptide Science, 9(1):1–15, 2008.
[T9] V. Venkatraman and D. W. Ritchie. Predicting multicomponent protein assemblies using an ant colony approach. International Journal of Swarm Intelligence Research, 3:19–31, 2012.
[T10] T. V. Hoang, X. Cavin, and D. W. Ritchie. gEMfitter: a highly parallel FFT-based 3D density fitting tool with GPU texture memory acceleration. Journal of Structural Biology, 184:348–354, 2013.
[T11] D. W. Ritchie, A. W. Ghoorah, L. Mavridis, and V. Venkatraman. Fast protein structure alignment using Gaussian overlap scoring of backbone peptide fragment similarity. Bioinformatics, 28:3274–3281, 2012.
[T12] L. Mavridis, V. Venkatraman, and D. W. Ritchie. A comprehensive comparison of protein structural alignment algorithms. In 3DSIG – 8th Structural Bioinformatics and Computational Biophysics Meeting, volume 8, page 89, Long Beach, California, 2012. ISMB.
[T13] M.-D. Devignes, S. Benabderrahmane, M. Smaïl-Tabbone, N. Amedeo, and O. Poch. Functional classification of genes using semantic distance and fuzzy clustering approach: Evaluation with reference sets and overlap analysis. International Journal of Computational Biology and Drug Design, 5(3/4):245–260, 2012.
[T14] A. Coulet, M. Smaïl-Tabbone, A. Napoli, and M.-D. Devignes. Ontology-based knowledge discovery in pharmacogenomics. In H. R. Arabnia and Q.-N. Tran, editors, Software Tools and Algorithms for Biological Systems, Advances in Experimental Medicine and Biology, pages 357–66. Springer, 2011.
[T15] A. Ghoorah, M.-D. Devignes, M. Smaïl-Tabbone, and D. W. Ritchie. Spatial clustering of protein binding sites for template based protein docking. Bioinformatics, 27:2820–2827, 2011.
[T16] A. Ghoorah, M.-D.
Devignes, M. Smaïl-Tabbone, and D. W. Ritchie. KBDOCK 2013: a spatial classification of 3D protein domain family interactions. Nucleic Acids Research, 42:D389–D395, 2014.
[T17] A. Fano, D. W. Ritchie, and A. Carrieri. Modelling the structural basis of human CCR5 chemokine receptor function: from homology model-building and molecular dynamics validation to agonist and antagonist docking. Journal of Chemical Information and Modeling, 46(3):1223–1235, 2006.
[T18] V. I. Pérez-Nueno, D. W. Ritchie, O. Rabal, R. Pascual, J. I. Borrell, and J. Teixidó. Comparison of ligand-based and receptor-based virtual screening of HIV entry inhibitors for the CXCR4 and CCR5 receptors using 3D ligand shape matching and ligand-receptor docking. Journal of Chemical Information and Modeling, 48(3):509–533, 2008.
[T19] V. I. Pérez-Nueno, S. Pettersson, D. W. Ritchie, J. I. Borrell, and J. Teixidó. Discovery of novel HIV entry inhibitors for the CXCR4 receptor by prospective virtual screening. Journal of Chemical Information and Modeling, 49(4):810–823, 2009.

External References

[X20] A. B. Ward, A. Sali, and I. A. Wilson. Integrative structural biology. Biochemistry, 6122:913–915, 2013.
[X21] C. Morris. Towards a structural biology work bench. Acta Crystallographica, D69:681–682, 2013.
[X22] T. Ideker, T. Galitski, and L. Hood. A new approach to decoding life. Annual Review of Genomics and Human Genetics, 2:343–372, 2001.
[X23] T. Ideker and R. Sharan. Protein networks in disease. Genome Research, 18:644–652, 2008.
[X24] R. Sharan and T. Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology, 24:427–433, 2006.
[X25] A.S.J. Melquiond, E. Karaca, P.L. Kastritis, and A.M.J.J. Bonvin. Next challenges in protein-protein docking: from proteome to interactome and beyond. WIREs Computational Molecular Science, 2:642–651, 2011.
[X26] H. Kitano. Systems biology: a brief overview. Science, 295:1662–1664, 2002.
[X27] P. Aloy and R. B. Russell.
Structural systems biology: modelling protein interactions. Nature Reviews Molecular and Cell Biology, 7:188–197, 2006.
[X28] P. Beltrao, C. Kiel, and L. Serrano. Structures in systems biology. Current Opinion in Structural Biology, 17:378–384, 2007.
[X29] M. Makarow, L. Højgaard, and R. Ceulemans. Advancing systems biology for medical applications. ESF Science Policy Briefing, 35:1–12, 2008.
[X30] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13:2498–2504, 2003.
[X31] A. Gutmanas, T. J. Oldfield, A. Patwardhan, S. Sen, S. Velankar, and G. J. Kleywegt. The role of structural bioinformatics resources in the era of integrative structural biology. Acta Crystallographica, D69:710–721, 2013.
[X32] M. L. Sierk and G. J. Kleywegt. Déjà vu all over again: Finding and analyzing protein structure similarities. Structure, 12:2103–2111, 2004.
[X33] R. A. Goldstein. The structure of protein evolution and the evolution of protein structure. Current Opinion in Structural Biology, 18:170–177, 2008.
[X34] P. J. Kundrotas, Z. W. Zhu, and I. A. Vakser. GWIDD: Genome-wide protein docking database. Nucleic Acids Research, 38:D513–D517, 2010.
[X35] Q. C. Zhang, D. Petrey, L. Deng, L. Qiang, Y. Shi, C. A. Thu, B. Bisikirska, C. Lefebvre, D. Accili, T. Hunter, T. Maniatis, A. Califano, and B. Honig. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature, 490:556–560, 2012.
[X36] M. R. Arkin and J. A. Wells. Small-molecule inhibitors of protein-protein interactions: progressing towards the dream. Nature Reviews Drug Discovery, 3:301–317, 2004.
[X37] D. González-Ruiz and H. Gohlke. Targeting protein-protein interactions with small molecules: challenges and perspectives for computational binding epitope detection and ligand finding.
Current Medicinal Chemistry, 13:2607–2625, 2006.
[X38] P. Uetz et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403:623–671, 2000.
[X39] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences, 98:4569–4574, 2001.
[X40] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. M. Michon, and C. M. Cruciat. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147, 2002.
[X41] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S.L. Adams, A. Millar, P. Taylor, K. Bennett, and K. Boutilier. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183, 2002.
[X42] P. Aloy and R. B. Russell. Ten thousand interactions for the molecular biologist. Nature Biotechnology, 22:1317–1321, 2004.
[X43] G. T. Hart, A. K. Ramani, and E. M. Marcotte. How complete are current yeast and human protein interaction networks? Genome Biology, 7:120, 2006.
[X44] P. Bork, L. J. Jensen, C. von Mering, A. K. Ramani, I. Lee, and E. M. Marcotte. Protein interaction networks from yeast to human. Current Opinion in Structural Biology, 14:292–299, 2004.
[X45] R. Chen, L. Li, and Z. Weng. ZDOCK: an initial-stage protein-docking algorithm. Proteins: Structure, Function, Genetics, 52:80–87, 2003.
[X46] O. Bachar, D. Fischer, R. Nussinov, and H. J. Wolfson. A computer vision based technique for 3D sequence-independent structural comparison of proteins. Protein Engineering, 6:279–288, 1993.
[X47] J. D. Jackson. Classical Electrodynamics. Wiley, New York, 1975.
[X48] E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. A. Friesem, C. Aflalo, and I. A. Vakser.
Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques. Proceedings of the National Academy of Sciences, 89:2195–2199, 1992.
[X49] J. J. Gray, S. Moughan, C. Wang, O. Schueler-Furman, B. Kuhlman, C. A. Rohl, and D. Baker. Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, 331:281–299, 2003.
[X50] S. E. Dobbins, V. I. Lesk, and M. J. E. Sternberg. Insights into protein flexibility: The relationship between normal modes and conformational change upon protein–protein docking. Proceedings of the National Academy of Sciences, 105(30):10390–10395, 2008.
[X51] A. May and M. Zacharias. Energy minimization in low-frequency normal modes to efficiently allow for global flexibility during systematic protein-protein docking. Proteins: Structure, Function, Bioinformatics, 70:794–809, 2008.
[X52] I.H. Moal and P.A. Bates. Swarmdock and the use of normal modes in protein-protein docking. Int. J. Mol. Sci., 11(10):3623–3648, 2010.
[X53] M. Baaden and S. R. Marrink. Coarse-grained modelling of protein-protein interactions. Current Opinion in Structural Biology, 23:878–886, 2013.
[X54] M. G. Saunders and G. A. Voth. Coarse-graining of multiprotein assemblies. Current Opinion in Structural Biology, 22:144–150, 2012.
[X55] H. I. Ingólfsson, C. A. Lopez, J. J. Uusitalo, D. H. de Jong, S. M. Gopal, X. Periole, and S. R. Marrink. The power of coarse graining in biomolecular simulations. WIREs Computational Molecular Science, DOI:10.1002/wcms.1169, 2013.
[X56] N. Basdevant, B. Borgis, and T. Ha-Duong. Modeling protein-protein recognition in solution using the coarse-grained force field SCORPION. Journal of Chemical Theory and Computation, 9:803–813, 2012.
[X57] M. F. Lensink and S. J. Wodak. Docking and scoring protein interactions: CAPRI 2009. Proteins: Structure, Function, Bioinformatics, 78:3073–3084, 2010.
[X58] A. Berchanski and M.
Eisenstein. Construction of molecular assemblies via docking: modeling of tetramers with D2 symmetry. Proteins: Structure, Function, Genetics, 53:817–829, 2003.
[X59] B. Pierce, W. Tong, and Z. Weng. M-ZDOCK: a Grid-Based approach for Cn symmetric multimer docking. Bioinformatics, 21(8):1472–1478, 2005.
[X60] D. Schneidman-Duhovny, Y. Inbar, R. Nussinov, and H. J. Wolfson. Geometry-based flexible and symmetric protein docking. Proteins, 60(2):224–231, 2005.
[X61] Y. Inbar, H. Benyamini, R. Nussinov, and H. J. Wolfson. Prediction of multimolecular assemblies by multiple docking. Journal of Molecular Biology, 349:435–447, 2005.
[X62] H. E. White, E. V. Orlova, S. Chen, L. Wang, A. Ignatiou, B. Gowen, T. Stromer, T. M. Franzmann, M. Haslbeck, J. Buchner, and H. R. Saibil. Multiple distinct assemblies reveal conformational flexibility in the small heat shock protein Hsp26. Journal of Structural Biology, 14:1197–1204, 2006.
[X63] P. Chacón and J. R. López-Blanco. Journal of Structural Biology, 184:261–270, 2013.
[X64] K. Lasker, M. Topf, A. Sali, and H. Wolfson. Inferential optimization for simultaneous fitting of multiple components into a cryoEM map of their assembly. Journal of Molecular Biology, 388:180–194, 2009.
[X65] K. Lasker, A. Sali, and H. J. Wolfson. Determining macromolecular assembly structures by molecular docking and fitting into an electron density map. Proteins: Structure, Function, Bioinformatics, 78:3205–3211, 2010.
[X66] U. Bertele. Nonserial Dynamic Programming. Academic Press, New York, 1972.
[X67] S. Yang and P. E. Bourne. The evolutionary history of protein domains viewed by species phylogeny. PLoS One, 4:e8378, 2009.
[X68] R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunasekaran, G. Ceric, K. Forslund, L. Holm, E. L. L. Sonnhammer, S. R. Eddy, and A. Bateman. The Pfam protein families database. Nucleic Acids Research, 38:D211–D222, 2010.
[X69] D. J. Lipman and W. R. Pearson.
Rapid and sensitive protein similarity searches. Science, 227:1435–1441, 1985.
[X70] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[X71] M. J. Sippl and M. Wiederstein. A note on difficult structure alignment problems. Bioinformatics, 24:426–427, 2008.
[X72] V. B. R. Boojala and P. E. Bourne. Protein Structure and Evolution and the SCOP Database. In: Structural Bioinformatics (eds P.E. Bourne, H. Weissig). Wiley-Liss, New Jersey, 2003.
[X73] G. S. Chan, Y. Hong, K. D. Ko, G. Bhardwaj, E. C. Holmes, R. L. Patterson, and D. B. van Rossum. Phylogenetic profiles reveal evolutionary relationships within the twilight zone of sequence similarity. Proceedings of the National Academy of Sciences, 105:13474–13479, 2008.
[X74] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.
[X75] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH - A hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
[X76] L. Holm, S. Kääriäinen, P. Rosenström, and A. Schenkel. Searching protein structure databases with DaliLite v.3. Bioinformatics, 24:2780–2781, 2008.
[X77] A. Andreeva, D. Howarth, S. E. Brenner, T. J. P. Hubbard, C. Chothia, and A. G. Murzin. SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Research, 32:D226–D229, 2004.
[X78] H. Tordai, A. Nagy, K. Farkas, L. Bányai, and L. Patthy. Modules, multidomain proteins and organismic complexity. FEBS Journal, 272:5064–5078, 2005.
[X79] P. E. Bourne and I. N. Shindyalov. Structure Comparison and Alignment. In: Structural Bioinformatics (eds P.E. Bourne, H. Weissig). Wiley-Liss, New Jersey, 2003.
[X80] Y. Zhang and J. Skolnick.
TM-align: a protein structure alignment algorithm based on TM-score. Nucleic Acids Research, 33(7):2302–2309, 2005.
[X81] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An overview. AI Magazine, 13:57–70, 1992.
[X82] H. Hermjakob et al. The HUPO PSI’s molecular interaction format – a community standard for the representation of protein interaction data. Nature Biotechnology, 22(2):177–183, 2004.
[X83] S. Orchard et al. Protein interaction data curation: the international molecular exchange (IMEx) consortium. Nature Methods, 9(4):345–350, 2012.
[X84] A. Özgur, Z. Xiang, D. R. Radev, and Y. He. Mining of vaccine-associated IFN-γ gene interaction networks using the vaccine ontology. Journal of Biomedical Semantics, 2 (Suppl 2):S8, 2011.
[X85] C. Jonquet, P. Lependu, S. Falconer, A. Coulet, N.F. Noy, M.A. Musen, and N.H. Shah. NCBO resource index: Ontology-based search and mining of biomedical resources. Web Semantics, 9:316–324, 2011.
[X86] S. Velankar, J. M. Dana, J. Jacobsen, G. van Ginkel, P. J. Gane, J. Luo, T. J. Oldfield, C. O’Donovan, M.-J. Martin, and G. J. Kleywegt. SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Research, 41:D483–D489, 2012.
[X87] R. C. Griffiths and S. Tavaré. Ancestral inference in population genetics. Statistical Science, 9:307–319, 1994.
[X88] S. Tavaré, D. J. Balding, R. C. Griffiths, and P. Donnelly. Inferring coalescence times from molecular sequence data. Genetics, 145:505–518, 1997.
[X89] M. Stephens and P. Donnelly. Inference in molecular population genetics. Journal of the Royal Statistical Society, B62:605–655, 2000.
[X90] M. Kuhner, J. Yamamoto, and J. Felsenstein. Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics, 140:1421–1430, 1995.
[X91] S. Tavaré. Ancestral inference in population genetics. Lectures on probability theory and statistics.
Lecture Notes in Mathematics, 1837:1–188, 2004.
[X92] G. Papai, P. A. Weil, and P. Schultz. New insights into the function of transcription factor TFIID from recent structural studies. Current Opinion in Genetics and Development, 21:219–224, 2011.
[X93] G. Papai, M. K. Tripathi, C. Ruhlmann, J. H. Layer, P. A. Weil, and P. Schultz. TFIIA and the transactivator Rap1 cooperate to commit TFIID for transcription initiation. Nature, 465:956–961, 2011.
[X94] C. E. Alvarez-Martinez and P. J. Christie. Biological diversity of prokaryotic type IV secretion systems. Microbiology and Molecular Biology Reviews, 73:775–808, 2011.
[X95] R. Fronzes, E. Schäfer, L. Wang, H. R. Saibil, E. V. Orlova, and G. Waksman. Structure of a type IV secretion system core complex. Science, 323:266–268, 2011.
[X96] A. Rivera-Calzada, R. Fronzes, C. G. Savva, V. Chandran, P. W. Lian, T. Laeremans, E. Pardon, H. Steyaert, J. Remaut, G. Waksman, and E. V. Orlova. Structure of a bacterial type IV secretion core complex at subnanometre resolution. EMBO Journal, 32:1195–1204, 2013.
[X97] S. G. F. Rasmussen et al. Crystal structure of the β2 adrenergic receptor–Gs protein complex. Nature, 477:549–557, 2011.
[X98] A. G. Gilman. G proteins: transducers of receptor-generated signaling. Annual Review of Biochemistry, 56:615–649, 1987.
[X99] D. Filmore. It’s a GPCR world. Modern Drug Discovery, 7:24–28, 2004.
[X100] M. J. Kleinz and I. B. Wilkinson. Emerging roles of apelin in biology and medicine. Pharmacology and Therapeutics, 107:198–211, 2005.
[X101] S. L. Pitkin, J. J. Maguire, T. I. Bonner, and A. P. Davenport. International union of basic and clinical pharmacology. LXXIV. Apelin receptor nomenclature, distribution, pharmacology, and function. Pharmacological Reviews, 62:331–342, 2010.
[X102] X. Iturrioz, R. Alvear-Perez, N. De Mota, C. Franchet, F. Guillier, V. Leroux, H. Dabire, M. Le Jouan, H. Chabane, R. Gerbier, D. Bonnet, A. Berdeaux, B. Maigret, J.-L. Galzi, M. Hibert, and C. Llorens-Cortes.
Identification and pharmacological properties of E339-3D6, the first nonpeptidic apelin receptor agonist. FASEB Journal, 24:1506–1517, 2010.
[X103] J. Bernauer, J. Azé, J. Janin, and A. Poupon. A new protein-protein docking scoring function based on interface residue properties. Bioinformatics, 23:555–562, 2007.
[X104] B. Bouvier, R. Grünberg, M. Nilges, and F. Cazals. Shelling the Voronoi interface of protein-protein complexes reveals patterns of residue conservation, dynamics, and composition. Proteins, 76:677–692, 2009.
[X105] T. Dreyfus, V. Doye, and F. Cazals. Assessing the reconstruction of macromolecular assemblies with toleranced models. Proteins: Structure, Function, Bioinformatics, 80:2125–2136, 2012.
[X106] A. Saladin, S. Fiorucci, P. Poulain, C. Prévost, and M. Zacharias. PTools: an open source molecular docking library. BMC Structural Biology, 9(1):27, 2009.
[X107] J. Janin, K. Henrick, J. Moult, L. Ten Eyck, M. J. E. Sternberg, S. Vajda, I. Vakser, and S. J. Wodak. CAPRI: a critical assessment of predicted interactions. Proteins: Structure, Function, Genetics, 52:2–9, 2003.
[X108] S. Vajda, I. A. Vakser, M. J. E. Sternberg, and J. Janin. Modeling of protein interactions in genomes. Proteins: Structure, Function, Genetics, 47(4):444–446, 2002.
[X109] M. F. Lensink and S. J. Wodak. Docking, scoring, and affinity prediction in CAPRI. Proteins, 81:2082–2095, 2013.
[X110] S. J. de Vries and M. Zacharias. ATTRACT-EM: a new method for computational assembly of large molecular machines using cryo-EM maps. PLoS One, 12:e49733, 2012.
[X111] J. I. Garzón, J. R. López-Blanco, C. Pons, J. Kovacs, R. Abagyan, and P. Chacón. FRODOCK: a new approach for fast rotational protein-protein docking. Bioinformatics, 25:2544–2551, 2009.
[X112] J. I. Garzón, J. Kovacs, R. Abagyan, and P. Chacón. ADP_EM: fast exhaustive multi-resolution docking for high throughput coverage. Bioinformatics, 23:427–433, 2007.
[X113] C. Dominguez, R. Boelens, and A. M. J. J. Bonvin.
HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. Journal of the American Chemical Society, 125:1731–1737, 2003.
[X114] N. Tuncbag, G. Kar, O. Keskin, A. Gursoy, and R. Nussinov. A survey of available tools and web servers for analysis of protein-protein interactions and interfaces. Briefings In Bioinformatics, 10(3):217–232, 2009.
[X115] D. Korkin, F. P. Davis, F. Alber, T. Luong, M.-Y. Shen, V. Lucic, M. B. Kennedy, and A. Sali. Structural modeling of protein interactions by analogy: application to PSD-95. PLoS Computational Biology, 2(11):e153, 2006.
[X116] P. J. Kundrotas, M. F. Lensink, and E. Alexov. Homology-based modeling of 3D structures of protein-protein complexes using alignments of modified sequence profiles. International Journal of Biological Macromolecules, 43(2):198–208, 2008.
[X117] G. Launay and T. Simonson. Homology modelling of protein-protein complexes: a simple method and its possibilities and limitations. BMC Bioinformatics, 9:427, 2008.
[X118] H. Hwang, T. Vreven, J. Janin, and Z. Weng. Protein-protein docking benchmark version 4.0. Proteins: Structure, Function, Bioinformatics, 78:3111–3114, 2010.
[X119] R. Mosca, C. Pons, J. Fernandez-Recio, and P. Aloy. Pushing structural information into the yeast interactome by high-throughput protein docking experiments. PLoS Computational Biology, 5(8):e1000490, 2009.
[X120] M. N. Wass, G. Fuentes, C. Pons, P. Pazos, and A. Valencia. Towards the prediction of protein interaction partners using physical docking. Molecular Systems Biology, 7(469):1–8, 2011.
[X121] S. Birmanns, M. Rusu, and W. Wriggers. Using Sculptor and Situs for simultaneous assembly of atomic components into low-resolution maps. Journal of Structural Biology, 173:428–435, 2010.
[X122] W. Wriggers. Using Situs for the integration of multi-resolution structures. Biophysical Review, 2:21–27, 2010.
[X123] D. Russel, K. Lasker, B. Webb, J. Velazquez-Muriel, E. Tjioe, D. Schneidman, B.
Peterson, and A. Sali. Putting the pieces together: Integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biology, 10:e1001244, 2012.
[X124] F. Alber et al. Determining the architectures of macromolecular assemblies. Nature, 450:683–694, 2007.
[X125] F. Alber et al. The molecular architecture of the nuclear pore complex. Nature, 450:695–701, 2007.
[X126] N. V. Grishin. Fold change in evolution of protein structures. Journal of Structural Biology, 134:167–185, 2001.
[X127] L. N. Kinch and N. V. Grishin. Evolution of protein structures and function. Current Opinion in Structural Biology, 12:400–408, 2002.
[X128] A. Andreeva and A. G. Murzin. Evolution of protein fold in the presence of functional constraints. Current Opinion in Structural Biology, 16:399–408, 2006.
[X129] H. Hasegawa and L. Holm. Advances and pitfalls of protein structure alignment. Current Opinion in Structural Biology, 19:341–348, 2009.