Project Capsid
Computational Algorithms for Protein Structures and Interactions
David Ritchie – 02 December 2014
Personnel
The initial members of the team are all currently associated with the Orpailleur project team at Inria Nancy.
Team Leader
• David Ritchie, DR2, Inria.
Permanent Researchers
• Marie-Dominique Devignes, CR1, CNRS.
Emeritus Researchers
• Bernard Maigret, DR1, CNRS.
PhD Students
• Seyed Ziaeddin Alborzi, doctorant (October 2014 – 2017), co-supervised by DR and MDD.
• Benoît Henry, doctorant (October 2013 – 2016), co-supervised with Tosca team.
• Gabin Personeni, doctorant (October 2013 – 2016), co-supervised with Orpailleur team.
Administrative Assistant
• Emmanuelle Deschamps, Inria.
1 Context
Many of the processes within living organisms can be studied and understood in terms of biochemical interactions between large macromolecules such as DNA, RNA, and proteins. DNA is often considered as
the “genetic code” of life, because there is a direct mapping between triplets of the four common nucleic
acid bases in DNA and the 20 amino acid residues which make up the majority of the protein molecules
found in all living organisms. Some RNA molecules play a key role in the translation of sequences of DNA
codons into protein molecules, while others catalyse various chemical transformations or regulate gene
expression. For example, the physical translation of DNA into protein is orchestrated by several complex
macromolecular structures such as RNA polymerase (which transcribes DNA into mRNA) and the ribosome
(which synthesises new proteins according to a given sequence of mRNA codons). RNA polymerase and
the ribosome are just two examples of the kinds of biomolecular machines which exist in nature. Other
examples of biomolecular machines include spliceosomes (which are involved in gene expression), ATP
synthases (energy transfer), myosins (locomotion), and kinesins (transportation), to name just a few. Remarkably, the protein components of these systems spontaneously self-assemble into large macromolecular
complexes which can sometimes be glimpsed using modern cryo-electron microscopy (cryo-EM). However,
it is extremely difficult to obtain atomic models of these large assemblies using high resolution experimental
techniques such as X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.
In addition to the examples mentioned, many other biological processes are governed by complex
systems of proteins which interact cooperatively to regulate the chemical composition within a cell or to
carry out a wide range of biochemical processes such as photosynthesis, metabolism, and cell signalling.
Thus, if RNA and DNA sequences represent the biological blueprint for life, then proteins make up
the three-dimensional (3D) molecular machinery. These days, it is becoming increasingly feasible to isolate
and characterise some of the individual protein components of such systems, but it still remains extremely
difficult to achieve detailed models of how these complex systems actually work. Consequently, a new
multidisciplinary approach called integrative structural biology has emerged which aims to bring together
experimental data from a wide range of sources and resolution scales in order to meet this challenge [X20,
X21].
From the biomedical and industrial points of view, understanding how complex biomolecular systems
work is crucial for the design of highly specific therapeutic drug molecules and for the development of
clean and efficient bioengineering processes. For example, many antibiotic drug molecules are designed to
interfere with the machineries or processes that exist in bacterial cells but which are different or even absent
in humans. Modern biotechnology processes often use the enzymes of genetically altered or enhanced
micro-organisms to make industrial or pharmaceutical compounds which are difficult or expensive to make
using conventional synthetic chemistry techniques. Clearly, designing better drug molecules and developing
cleaner and more efficient industrial processes will contribute to improving human health and quality of life.
Understanding how biological systems work at the level of 3D molecular structures presents fascinating challenges for biologists and computer scientists alike. Despite being made from a small set of simple
chemical building blocks, protein molecules have a remarkable ability to self-assemble into complex molecular machines which carry out very specific biological processes. As such, these molecular machines may
be considered as complex systems because their properties are much greater than the sum of the properties of their component parts. In recent decades, much scientific effort has been devoted to understanding
structure-function relationships at the level of single biomolecules. However, following recent developments
in high throughput sequencing and other experimental techniques, there is now much interest in taking a
“systems” view of biomolecules and biomolecular processes to enrich our understanding of the complex
mechanisms and relationships that exist within living organisms [X22, X23, X24]. Indeed, an emerging challenge in the 21st century is to understand, represent, and ultimately to control such relationships at the level
of interacting biomolecular components [X25].
According to Kitano [X26], systems biology may be considered in terms of the structures, dynamics,
control, and ultimately the design of systems with specific desired properties. Structural bioinformatics is
mainly concerned with the first of these [X27, X28], but studying it could facilitate breakthroughs in the other
three aspects, e.g. studying how systems evolve in time and space, modulating disease processes (pharmaceuticals), and industrial exploitation (biotechnology). A recent European Science Foundation report
highlights the increasing importance of systems biology to the biomedical sciences [X29].
As illustrated in Figure 1, biological systems may be considered at various scales, ranging from individual
atoms and molecules to multi-component cellular assemblies. The atomic scale, which we call scale 1, is
the scale of physical forces and mechanics. This scale may be used to explain the biochemical functions
of individual protein molecules and to calculate their physical fluctuations over short time-scales (molecular
dynamics). The molecular scale (scale 2) is the scale at which knowledge of the 3D shapes of biological
objects (proteins, RNA, DNA, and other small molecules) may be used to explain how they interact, and how
they fit together to form larger structures which may be visible in an electron or optical microscope (scale
3). The cellular scale (scale 4) is the scale of specialised functional sub-systems devoted to data storage,
signalling, energy supply, assembly, regulation, repair, reproduction, defence, and so on. At this scale level,
the complexity is too great to represent the sub-system components as physical objects. Instead, they are
often represented computationally as abstract nodes in a network and the edges between those nodes
represent various types of interaction within that “system” [X30]. In this project, we choose to focus on
scales 2 and 3, although we expect our results will feed upwards to scale 4.
Figure 1: The scales of structural biology and their relationships to the EMDB (https://www.ebi.ac.uk/pdbe/emdb/) and
PDB (http://www.rcsb.org/) databases, illustrated using the ribosome as an example. From left to right: soft X-ray tomogram
of a fission yeast cell (scale 4), electron tomogram of ribosomes in the cytosol (scale 3), cryo-EM reconstruction of the
80S ribosome (scale 2), X-ray crystal structure of the 50S ribosomal subunit (scale 2), X-ray atomic detail of the SmpB
protein (scale 1). Figure taken from [X31].
2 Objectives
The American National Institutes of Health (NIH)¹ defines computational biology as “the development and
application of data-analytical and theoretical methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.” On the other hand, the NIH
defines bioinformatics as “research, development, or application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data.” According to these definitions, the objectives of the Capsid
team span both computational biology and bioinformatics in the sense that we aim to develop new theoretical
modeling and simulation methods for biological systems, and we want to implement the developed
approaches as computational tools which may be used by researchers in the biological sciences.
¹ http://www.bisti.nih.gov/
2.1 Challenges in Structural Systems Biology
Here, we wish to focus on structural biology as our primary application domain, and we wish to develop
algorithms and software to help study biological systems and phenomena from a structural point of view. In
particular, we want (1) to develop algorithms which can help to model the structures of large multi-component
biomolecular machines and (2) to develop tools and techniques to represent and mine knowledge of the 3D
shapes of proteins and protein-protein interactions. Thus, a unifying theme of this project is the recurring
problem of representing and reasoning about complex macromolecular shapes. More specifically, we want
to develop computational techniques to represent, analyse, and compare the shapes and interactions of
protein molecules in order to help better understand how their 3D structures relate to their biological function.
In summary, we wish to focus on the following closely related topics in structural bioinformatics:
• integrative multi-component assembly and modeling,
• new approaches for knowledge discovery in structural databases.
Because our motivation is domain-driven, we do not wish to restrict ourselves to the use of any particular
algorithmic or computational technique. Nonetheless, it is natural that we should begin by exploiting the
existing knowledge and skills of the initial team members.
2.2 Computational Objectives
As indicated above, structural biology is largely concerned with determining the 3D atomic structures of
protein, RNA, and DNA molecules, and using these structures to model their biological properties and
interactions. Each of these activities can be extremely time-consuming. Often, solving the 3D structure of even a
single protein using X-ray crystallography or NMR can take many months or even years of effort.² Even
simulating the interaction between two proteins using a detailed atomistic molecular dynamics simulation can
consume many thousands of CPU-hours. Although most X-ray crystallographers, NMR spectroscopists, and
molecular modelers use conventional sequence and structure alignment tools to help propose initial
structural models through the homology principle, they typically study only one structure or interaction
at a time. Due to the difficulties outlined above, only relatively few research groups are able to solve the
structures of large multi-component systems.
Similarly, most current algorithms for comparing protein structures, and especially those for modeling
protein interactions, work only at the pair-wise level. Of course, such calculations may be accelerated
considerably by using dynamic programming (DP) or fast Fourier transform (FFT) techniques. However, it
remains extremely challenging to scale up these techniques to model multi-component systems. For example,
high performance computing (HPC) facilities may be used to accelerate arithmetically intensive
shape-matching calculations, but this generally does not overcome the fundamentally combinatorial nature
of many multi-component problems. It is therefore necessary to devise heuristic hybrid approaches which
can be tailored to exploit various sources of domain knowledge. We therefore set ourselves the following
main computational objectives:
• develop multi-component assembly techniques for integrative structural biology,
• classify and mine protein structures and protein-protein interactions.
² It is worth noting that the 2009 and 2012 Nobel prizes in chemistry were awarded for work on the atomic
resolution structures of the ribosome, and the G-protein coupled receptors, respectively.
Achieving these objectives will often involve combining numerical and symbolic representations of molecular
shapes and molecular interactions, developing joint projects with experimentalists, and forming
collaborations with computing science experts from other teams at the LORIA (Laboratoire Lorrain de Recherche
en Informatique et ses Applications)³ in Nancy and at other Inria centres. Because our research outputs
will be of interest both to the structural bioinformatics community and to experimental biologists, we aim
to publish our results in high profile journals such as Bioinformatics, Journal of Computational Chemistry,
Journal of Structural Biology, Nucleic Acids Research, Proteins, Protein Science, and PLoS Computational
Biology.
2.3 Practical Applications
Beyond general benchmarking tests used for evaluating our algorithms, the methods developed in the team
will address particular biological problems for which understanding the interactions between biomolecules
is essential. The range of such problems is very large and our choice is driven by existing collaborations
with biology laboratories. Therefore, we will focus on the following practical applications:
• transcription factor II-D,
• prokaryotic type IV secretion systems,
• G-protein coupled receptors.
3 Scientific Project
The proteins that exist today represent the molecular product of some three billion years of evolution. Hence,
comparing protein sequences and structures is important for understanding their functional and evolutionary
relationships [X32, X33]. Historically, much of bioinformatics research has focused on developing mathematical and statistical algorithms to process, analyse, annotate, and compare protein and DNA sequences
because such sequences represent the primary form of information in biological systems, and because
these sequences are relatively easy to determine from biological samples in the laboratory. Analysing
and comparing genomic sequences has provided key insights into the taxonomic and evolutionary relationships between the different organisms and species that we observe today. However, when viewed from
a molecular or biophysical perspective, such sequences provide only a rather incomplete form of biological knowledge. There is growing evidence that structure-based methods can help to predict networks of
protein-protein interactions (PPIs) with greater accuracy than those which do not use structural evidence
[X34, X35].
As indicated above, each protein adopts its own distinct 3D shape, and groups of proteins often interact
by forming large 3D complexes. These complexes may exist as short-lived transitory associations, as in
enzyme catalysis, or as long-lived multimeric systems such as the ribosome, transcription factors, cell surface and ion channel proteins. Understanding how proteins interact is crucially important for understanding
the molecular mechanisms of disease. For instance, therapeutic drugs often work by modulating or blocking
PPIs, and therefore PPIs represent an important class of drug target [X36, X37]. However, we still know very
little about how proteins operate at the molecular level. Genome-wide proteomics studies of “model
organisms” such as yeast [X38, X39, X40, X41] are providing a growing list of putative PPIs, but understanding
the function of these predicted interactions requires much further biochemical and structural analysis. For
example, yeast is one of the most studied model organisms and is known to have around 6,000 proteins,
giving rise to between about 38,000 and 75,000 PPIs. Around 50% of these PPIs have been observed
experimentally. On the other hand, the human genome encodes around 30,000 proteins, giving from 154,000
to 370,000 PPIs, of which only around 10% are known to date [X42, X43]. Nonetheless, there appears to be
overwhelming evidence that all living organisms and many biological processes share a common ancestry
in the tree of life. Therefore, developing techniques which can mine knowledge of the protein structures
and interactions observed in yeast or other organisms is an important way to enhance our knowledge of
human biology [X44].
³ http://www.loria.fr
The next two sections describe the main scientific challenges of this project. The third section describes
three specific applications in which we will apply the techniques developed in collaboration with experimental
biology laboratories. It should be mentioned at this point that high resolution computational modeling of 3D
structural interactions between proteins can be very challenging because proteins are intrinsically dynamic
molecules under physiological conditions. Although the 3D structure of a protein is often highly constrained
by an internal network of non-covalent interactions, at very short (nanosecond) time-scales the individual
atomic positions within a protein rapidly and continuously fluctuate under thermal motion. Furthermore, at
longer time-scales the internal side-chain conformations within a protein can flip from one local minimum
to another. At even longer time-scales, larger structural subunits of α-helices and β-sheets may undergo
substantial motions which can be very difficult to predict using computational techniques. On the other hand,
many proteins are observed to crystallise into the same overall 3D fold, and indeed members of the same
protein family often crystallise into quite similar 3D structures. Thus, there are recurring structural patterns
which can be identified and classified. Nonetheless, it should be borne in mind that proteins are potentially
very flexible molecules, and that we cannot expect to model PPIs with crystallographic resolution.
3.1 Integrative Multi-Component Assembly and Modeling
Participants: Ritchie, Devignes, Maigret
3.1.1 Introduction – High Dimensional Search Spaces
At the molecular level, each PPI is embodied by a physical 3D protein-protein interface. Therefore, if the 3D
structures of a pair of interacting proteins are known, it should in principle be possible for a docking algorithm
to use this knowledge to predict the structure of the complex. However, modeling protein flexibility accurately
during docking is very computationally expensive due to the very large number of internal degrees of
freedom in each protein, associated with twisting motions around covalent bonds. Even if one assumes that the
proteins are rigid, a brute-force search over the 6D docking space between two typical proteins can involve
generating and testing some $10^{10}$ orientations.⁴ Therefore, it is highly impractical to use detailed force-field
or geometric representations in a brute-force docking search. Instead, most protein docking algorithms use
fast heuristic methods to perform an initial rigid-body search in order to locate a relatively small number of
candidate binding orientations, and these are then refined using a more expensive interaction potential or
force-field model, which might also include flexible refinement using molecular dynamics (MD), for example.
⁴ The estimate of $10^{10}$ trial rigid body docking orientations is based on the assumption that reasonably small
atom-sized steps are used for each dimension. For example, when using 3D rotational steps of 6° and a 3D
translational grid of $90^3$ elements, a typical run of the ZDOCK docking program [X45] generates and tests
approximately $54{,}000 \times 90^3 = O(10^{10})$ trial docking orientations for a pair of medium-sized proteins. If we
further assume that each protein consists of around $10^3$ atoms and that the computational cost of calculating
atomic interactions scales quadratically in the number of particles, naively using even a simple force-field model
in a brute-force docking search between two typical proteins could easily cost
$O(10^{10}) \times O(10^3) \times O(10^3) = O(10^{16})$ floating point operations. This corresponds to around $O(10^7)$
CPU-seconds or $O(10^2)$ CPU-days on a modern 3 GHz processor. If the proteins are represented as geometric
surface meshes instead of atoms, similar estimates could be made concerning the cost of calculating an
interaction score between the two meshes. For example, the above 6D brute-force docking search between two
meshes, each of around $10^3$ vertices, would involve calculating around $O(10^{16})$ vertex-vertex distances.
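The cost estimate in the footnote can be reproduced with a few lines of arithmetic. The sketch below uses the figures quoted there; the assumptions that each atom pair costs one floating point operation and that a 3 GHz core sustains about one such operation per clock cycle are ours, for illustration only.

```python
# Back-of-envelope cost of a brute-force rigid-body docking search.
# The sampling figures are the illustrative ones quoted in the footnote.

rotations = 54_000          # 3D rotational samples at 6-degree steps
grid_points = 90 ** 3       # 3D translational grid of 90^3 cells
atoms_a = atoms_b = 1_000   # about 10^3 atoms per protein

orientations = rotations * grid_points        # number of trial poses
operations = orientations * atoms_a * atoms_b # assumed: one operation per atom pair

ops_per_second = 3e9        # assumed: one operation per cycle on a 3 GHz core
cpu_seconds = operations / ops_per_second
cpu_days = cpu_seconds / 86_400

print(f"orientations ~ {orientations:.1e}")
print(f"operations   ~ {operations:.1e}")
print(f"CPU time     ~ {cpu_seconds:.1e} s, i.e. about {cpu_days:.0f} days")
```

Under these assumptions the totals fall in the same orders of magnitude as the footnote, which is all that the estimate is meant to convey.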
Some protein docking algorithms use geometric hashing (an object recognition technique adapted from
computer vision [X46]) of cliques of surface triangles or critical points to avoid a brute-force search, but
most now use 3D grid-based representations. While geometric hashing is very fast, it carries the risk that
candidate solutions might be missed if there are large conformational changes between the free and bound
structures. On the other hand, grid-based scoring functions can more readily cover the rigid body space
exhaustively, and they can more easily be adapted to include other interaction types by describing each
contribution to the energy as an integral over a product of a “potential” and a “density” term, as in classical
electrostatics. For instance, in our Hex protein docking algorithm [T1], the in vacuo electrostatic interaction
energy between two proteins, A and B, is calculated as [X47]
$$E = \frac{1}{2}\int \phi_A(\underline{x})\,\rho_B(\underline{x})\,d\underline{x} + \frac{1}{2}\int \phi_B(\underline{x})\,\rho_A(\underline{x})\,d\underline{x}, \qquad (1)$$
where $\phi_A(\underline{x})$ represents the electrostatic potential of protein A and $\rho_B(\underline{x})$ represents the charge density of
protein B, etc. The notation used here follows the common physics convention of underlining vector
quantities: $\underline{x} \equiv (x, y, z)$. Thus, $d\underline{x}$ represents an infinitesimal 3D volume element and $\int d\underline{x}$ denotes integration
over all 3D space. The similarity between two 3D objects may be calculated in a similar way [T2].
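On a regular grid, Equation (1) reduces to a sum over voxels. The following is a minimal sketch of that discretisation, not the Hex implementation; the function name, the shared-grid assumption, and the random test arrays are hypothetical.

```python
import numpy as np

def interaction_energy(phi_a, rho_b, phi_b, rho_a, voxel_volume):
    """Riemann-sum approximation of Equation (1) on a shared 3D grid."""
    term_ab = np.sum(phi_a * rho_b)   # integral of phi_A * rho_B
    term_ba = np.sum(phi_b * rho_a)   # integral of phi_B * rho_A
    return 0.5 * voxel_volume * (term_ab + term_ba)

# Toy data only: random arrays stand in for real potentials and charge densities.
rng = np.random.default_rng(1)
shape = (16, 16, 16)
phi_a, rho_a, phi_b, rho_b = (rng.standard_normal(shape) for _ in range(4))
energy = interaction_energy(phi_a, rho_b, phi_b, rho_a, voxel_volume=1.0)
print(f"E = {energy:.3f} (arbitrary units)")
```

Writing each contribution as a product of a "potential" grid and a "density" grid is what allows the same code path to be reused for other interaction types.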
The main advantage of grid-based representations, however, is that the scoring step can be accelerated
greatly by using fast Fourier transform (FFT) techniques [X48]. Furthermore, by borrowing some techniques
and notation from quantum mechanics, it can be shown that the FFT may equally be used to accelerate the
search in rotational instead of translational coordinates [T3]. To give a simple example, the correlation score
between a 3D cryo-EM density map $\rho_{map}(\underline{x})$ and a 3D density representation of a high resolution protein
model $\rho_{prot}(\underline{x})$ may be expressed as an overlap integral of the form
$$S(x, y, z, \alpha, \beta, \gamma) = \int \hat{T}(x, y, z)\,\rho_{map}(\underline{x}') \times \hat{R}(\alpha, \beta, \gamma)\,\rho_{prot}(\underline{x}')\,d\underline{x}', \qquad (2)$$
where $\hat{T}(x, y, z)$ and $\hat{R}(\alpha, \beta, \gamma)$ represent 3D translation and Euler angle rotation operators, respectively.
Thanks to the existence of fast 3D (Cartesian) FFT libraries, the above calculation has almost always been
implemented using multiple 3D translational FFTs and by explicitly sampling the rotational space. However,
performing a rotational FFT is arguably a more natural way to map what is largely a 3D rotational shape
matching problem onto the computational DOFs. It is worth mentioning that the above operator notation is
very useful when working with more complex expressions involving e.g. symmetry or multiple components,
or if one wishes to re-write such calculations to distribute them over multiple processors, for example.
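To make the conventional translational approach concrete, the sketch below evaluates the overlap score of Equation (2) for one fixed rotation and all grid translations at once using a Cartesian FFT. It assumes periodic boundaries and a particular sign convention for the shift; the function name and test data are hypothetical, and the outer loop over sampled rotations is omitted.

```python
import numpy as np

def translational_scores(rho_map, rho_prot):
    """Return S[t] = sum_x rho_map[x + t] * rho_prot[x] for every periodic grid shift t."""
    f_map = np.fft.fftn(rho_map)
    f_prot = np.fft.fftn(rho_prot)
    # Correlation theorem: a product in Fourier space gives all shifts in one pass.
    return np.fft.ifftn(f_map * np.conj(f_prot)).real

# Toy test: the "map" is the protein density shifted by a known amount.
rng = np.random.default_rng(0)
rho_prot = rng.random((8, 8, 8))
true_shift = (2, 3, 1)
rho_map = np.roll(rho_prot, true_shift, axis=(0, 1, 2))

scores = translational_scores(rho_map, rho_prot)
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
print("recovered shift:", tuple(int(i) for i in best_shift))
```

In a full search this calculation would be repeated for each sampled rotation of the protein model, which is the explicit rotational sampling referred to above.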
Theoretically, using an FFT to search one degree of freedom (DOF) reduces the computational cost from
$O(N^2)$ function evaluations to $O(N \log N)$. However, FFT techniques are not a panacea, especially for
high-dimensional problems, because the $O(N \log N)$ speed-up is only obtained when exhaustively
sampling each DOF. Furthermore, due to the large memory requirement and high latency time to stride over
large multi-dimensional data arrays, the highest practical dimension for the FFT is normally just 3. The main
difference between our Hex docking algorithm and other FFT docking algorithms is that Hex performs the
docking search in rotational coordinates using 1D, 3D, or 5D FFT angular grids, whereas all other FFT-based
docking algorithms work with regular 3D Cartesian grids. Furthermore, in the polar Fourier representation
it is possible to re-write a 5D FFT as multiple 1D FFTs, and this gives a significant speed-up on modern
graphics processor units (GPUs) [T4].⁵ But here again, FFT techniques give a speed-up only when
exhaustively sampling each DOF. Thus, FFT techniques are generally not suitable for sampling small regions of a
large search space, as is necessarily the case in flexible docking with atomic resolution.
⁵ On current GPUs, it is necessary to use single precision floating point arithmetic for best performance. We
find that Hex docking (footnote continues in Section 3.1.3)
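The difference between the two cost regimes can be illustrated for a single rotational DOF. In the sketch below, two functions sampled on a ring of N angles are scored at every relative rotation, first explicitly and then with one FFT pair. The one-degree sampling, the function names, and the random test signal are assumptions made for illustration; this is not the polar Fourier expansion used in Hex.

```python
import numpy as np

def direct_scan(f, g):
    """O(N^2): score every cyclic rotation of g against f explicitly."""
    n = len(f)
    return np.array([np.dot(f, np.roll(g, s)) for s in range(n)])

def fft_scan(f, g):
    """O(N log N): the same scores from one forward and one inverse FFT."""
    return np.fft.ifft(np.fft.fft(f) * np.conj(np.fft.fft(g))).real

n = 360                                  # assumed: one sample per degree
rng = np.random.default_rng(2)
f = rng.random(n)
g = np.roll(f, -40)                      # g is f rotated back by 40 samples

scores = fft_scan(f, g)
print("best rotation (samples):", int(np.argmax(scores)))
print("direct cost ~", n ** 2, "multiplications; FFT cost ~", int(n * np.log2(n)))
```

Both routines return the whole score profile, which is why the FFT pays off only when every rotation is wanted and not when a few angles near a known pose are to be tested.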
3.1.2 Using Coarse-Grained Models
Many approaches have been proposed in the literature to take into account protein flexibility during
docking. The most thorough methods rely on expensive atomistic simulations using MD. However, much of an
MD trajectory is unlikely to be relevant to a docking encounter unless it is constrained to explore a putative
protein-protein interface. Consequently, MD is normally only used to refine a small number of candidate
rigid body docking poses. A faster approach is to model side-chain flexibility using rotamer libraries [X49],
but such techniques are still very computationally expensive. A much faster, but more approximate method
is to use so-called coarse-grained (CG) normal mode analysis (NMA) to reduce the number of flexible
degrees of freedom to just one or a handful of the most significant vibrational modes [T5, X50, X51, X52].
However, sampling NMA-generated conformations typically leads to a quadratic increase in the number of
conformations that must be cross-docked, and this can greatly increase the number of false-positive
solutions [T6]. In fact, many protein docking algorithms, such as the FFT-based approaches, avoid the flexibility
problem altogether by using “soft” potentials which can “absorb” forbidden steric clashes to a certain
extent. In our experience, docking ensembles of NMA conformations does not give much improvement over
basic FFT-based soft docking [T6], and it is very computationally expensive to use side-chain repacking to
refine candidate soft docking poses [T7]. We therefore plan to use only “soft” scoring functions for
multi-component assembly, although we expect that NMA techniques will still be useful for flexibly fitting proteins
into high resolution cryo-EM density maps (see below).
In the last few years, CG force-field models have become increasingly popular in the MD community
because they allow very large biomolecular systems to be simulated using conventional MD programs [X53].
Typically, a CG force-field representation replaces the atoms in each amino acid with 2 to 4
“pseudo-atoms”, and it assigns each pseudo-atom a small number of parameters to represent its chemo-physical
properties. By directly attacking the quadratic nature of pair-wise energy functions, coarse-graining can
speed up MD simulations by up to three orders of magnitude. Nonetheless, such CG models can still
produce useful models of very large multi-component assemblies [X54]. Furthermore, this kind of
coarse-graining effectively integrates out many of the internal DOFs to leave a smoother but still physically realistic
energy surface [X55]. We therefore plan to use simple but accurate CG force-field models such as
SCORPION [X56] to score candidate configurations during multi-component assembly rapidly and accurately,
without necessarily attempting to model flexibility explicitly.
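The effect of coarse-graining on the number of pair-wise terms can be sketched as follows. For simplicity this example places a single pseudo-atom at the centroid of each residue, which is cruder than the 2 to 4 pseudo-atoms per residue described above and is not the SCORPION mapping; the function names and the protein sizes are hypothetical.

```python
import numpy as np

def coarse_grain(coords, residue_ids):
    """Replace the atoms of each residue by one pseudo-atom at their centroid."""
    coords = np.asarray(coords, dtype=float)
    residue_ids = np.asarray(residue_ids)
    labels = np.unique(residue_ids)
    return np.array([coords[residue_ids == r].mean(axis=0) for r in labels])

def pair_terms(n_a, n_b):
    """Number of inter-molecular pair terms between two sets of particles."""
    return n_a * n_b

# Toy protein: 125 residues of 8 atoms each (assumed sizes), random coordinates.
rng = np.random.default_rng(3)
residue_ids = np.repeat(np.arange(125), 8)
atoms = rng.random((residue_ids.size, 3)) * 50.0
beads = coarse_grain(atoms, residue_ids)

all_atom = pair_terms(len(atoms), len(atoms))
cg = pair_terms(len(beads), len(beads))
print(f"{len(atoms)} atoms -> {len(beads)} beads; pair terms reduced {all_atom // cg}-fold")
```

The printed ratio reflects only the reduction in pair-wise terms for this particular mapping; it is not an estimate of the overall simulation speed-up.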
3.1.3 Generating and Detecting Symmetry
Although protein-protein docking algorithms are improving [X57], it still remains challenging to produce
a high resolution 3D model of a protein complex using ab initio techniques, mainly due to the problem
of structural flexibility described above. However, with the aid of even just one simple constraint on the
docking search space, the quality of docking predictions can improve dramatically [T8, T3]. In particular,
many protein complexes involve symmetric arrangements of one or more sub-units, and the presence of
symmetry may be exploited to reduce the search space considerably [X58, X59, X60]. For example, using
our operator notation (Equation (2)), we have already developed a prototype algorithm which can generate
and score candidate docking orientations for monomers that assemble into cyclic ($C_n$) multimers using
integrals of the form
$$E_{AB}(y, \alpha, \beta, \gamma) = \int \hat{T}(0, y, 0)\hat{R}(\alpha, \beta, \gamma)\,\phi_A(\underline{x}) \times \hat{R}(0, 0, \omega_n)\hat{T}(0, y, 0)\hat{R}(\alpha, \beta, \gamma)\,\rho_B(\underline{x})\,d\underline{x}, \qquad (3)$$
where the identical monomers A and B are initially placed at the origin, and $\omega_n = 2\pi/n$ is the rotation about
the principal n-fold symmetry axis. This example shows that complexes with cyclic symmetry have just 4
rigid body DOFs, compared to $6(n - 1)$ DOFs for non-symmetrical n-mers. Thus, when suitable constraints
are available, the size of the search space may be reduced dramatically. We are currently extending this
algorithm to assemble protein complexes with arbitrary point group symmetries. Although we currently use
shape-based FFT correlations, the symmetry operator technique may equally be used to refine candidate
solutions using a more accurate CG force-field scoring function. We also wish to develop similar techniques
to detect the possible existence of symmetry in low resolution cryo-EM density maps (Section 3.1.7).
⁵ (continued from Section 3.1.1) scores calculated using single precision FFTs on a GPU normally agree to
within four decimal digits with the scores from double precision FFT calculations on a CPU [T4]. This level of
precision is sufficient to rank different docking orientations.
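The operator sequence in Equation (3) can be read as a recipe for generating the coordinates of a $C_n$ multimer from the four parameters $(y, \alpha, \beta, \gamma)$. The sketch below is a hedged illustration of that recipe and not the prototype algorithm itself: it assumes a ZYZ Euler angle convention, the z axis as the symmetry axis, and hypothetical function names.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def euler_zyz(alpha, beta, gamma):
    """Rotation R(alpha, beta, gamma) in the ZYZ convention (an assumption here)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)

def build_cyclic_multimer(monomer, n, y, alpha, beta, gamma):
    """Return n copies of an (m, 3) monomer, centred at the origin, with C_n symmetry about z."""
    # Orient the monomer, then push it out along the y axis: T(0, y, 0) R(alpha, beta, gamma).
    posed = monomer @ euler_zyz(alpha, beta, gamma).T + np.array([0.0, y, 0.0])
    omega = 2.0 * np.pi / n
    # Copy k is obtained by applying R(0, 0, k * omega_n) to the posed monomer.
    return [posed @ rot_z(k * omega).T for k in range(n)]

rng = np.random.default_rng(4)
monomer = rng.standard_normal((20, 3))
monomer -= monomer.mean(axis=0)          # centre at the origin
copies = build_cyclic_multimer(monomer, n=5, y=12.0, alpha=0.3, beta=1.1, gamma=-0.4)
print("search DOFs with C_n symmetry: 4; without symmetry:", 6 * (5 - 1))
```

Because every copy is generated from the same four parameters, scoring a candidate multimer requires evaluating only one monomer-monomer interface, which is the source of the reduction in search space described above.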
3.1.4 Assembling General Multi-Component Complexes
More generally, we wish to develop algorithms to assemble arbitrary non-symmetrical multi-component
complexes in which the applied constraints will necessarily be more approximate than sharp symmetry
constraints. Ideally, we would like to use prior knowledge to locate each of the components approximately
correctly in 3D space, and then to use fast rotational correlations or CG potential functions to cover the rest
of the rigid body search space. However, this addresses only the first part of the multi-component assembly
problem. In favourable cases, pair-wise docking algorithms can provide a ranked list of predictions with a
near-native orientation for the complex in the top 10 orientations, but in less favourable cases a good prediction might be found only within the top 500 or 1,000 orientations. If the goal is to assemble n proteins into
a non-symmetrical complex with k possible orientations for each pair of proteins, a spanning tree argument
can be used to show that there are a total of $n^{n-2} k^{n-1}$ distinct ways to form a complex [X61]. Therefore,
except for very small values of n and k it is impractical to enumerate all possible combinations, and
heuristic search algorithms must be used. Recently, we applied a particle-swarm optimisation (PSO) approach to sample the search space efficiently. We found that the simple requirement that the individual
proteins should not inter-penetrate provides a very useful way to eliminate many of the incorrect trial orientations, and that a near-native orientation may often be found within the first few solutions [T9]. However,
the use of heuristic sampling techniques does not change the fundamental complexity of the problem, and
our PSO approach becomes impractical with more than about 6 components.
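The growth of the spanning-tree count quoted above is easy to tabulate. The short sketch below evaluates $n^{n-2} k^{n-1}$ for a few values of n and k; the function name is hypothetical and the chosen values of k simply echo the pair-wise docking list lengths mentioned in the text.

```python
def assembly_count(n, k):
    """Ways to join n proteins with k candidate poses per pair: n^(n-2) * k^(n-1)."""
    if n < 2:
        return 1                      # a single protein has one trivial "assembly"
    return n ** (n - 2) * k ** (n - 1)

for k in (10, 500, 1000):
    for n in (2, 3, 4, 6, 8):
        print(f"n={n}, k={k:>4}: {assembly_count(n, k):.3e} candidate assemblies")
```

The factor $n^{n-2}$ counts the labelled spanning trees on n components and $k^{n-1}$ counts the pose choices along the n - 1 edges of each tree, so the total grows far faster than any realistic enumeration budget.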
3.1.5 3D Cryo-EM Reconstruction
We also want to develop related approaches for integrative cryo-EM structure modeling. Thanks to current
cryo-EM instruments and technologies, its is now feasible to capture low resolution images of very large
macromolecular machines. However, transforming multiple 2D micrographs into high resolution 3D structures is an extremely labour-intensive and computationally intensive task. For the people involved, it is often
also an extremely tedious task. From a computational point of view, solving 3D structures by cryo-EM is a
classic inverse problem, in that the aim is to reconstruct the shape of an unknown 3D particle (here, a high
resolution atomic model) by back-projecting a large number of observed low resolution 2D images. In conventional biomedical tomography, for example, a 3D image may be reconstructed from multiple 2D images
using the Fourier slice theorem. However, in cryo-EM, the orientations of the initial 2D images are unknown.
Hence, an initial 3D density map must be estimated and then refined iteratively. Typically, in order to achieve
an acceptable 3D model, an initial 3D map is used to make 2D projections in different orientations which
may be used as templates to pick further 2D images from the micrographs. By grouping and averaging these
additional 2D images, a higher resolution 3D map may be made, which can then be refined by repeating
the above cycle. Traditionally, this has been done using Cartesian FFT techniques. However, because we
already have a good set of computational tools for working in polar coordinates, we also wish to explore
the use of our polar Fourier correlation technique as a novel way to solve the initial 2D/3D reconstruction
problem.
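The Fourier slice theorem mentioned above can be checked numerically on a toy example. The sketch below is a 2D analogue only (a 2D image projected onto one axis), using a naive discrete Fourier transform so that it needs no external libraries. All function names are hypothetical, and it does not represent our polar Fourier approach or any production reconstruction code.

```python
import cmath


def dft1(signal):
    """Naive 1D discrete Fourier transform, O(N^2), for illustration only."""
    n = len(signal)
    return [
        sum(signal[x] * cmath.exp(-2j * cmath.pi * u * x / n) for x in range(n))
        for u in range(n)
    ]


def dft2(image):
    """2D DFT of image[x][y], returned as F[u][v]."""
    along_y = [dft1(row) for row in image]                # along_y[x][v]
    along_x = [dft1(list(col)) for col in zip(*along_y)]  # along_x[v][u]
    return [list(row) for row in zip(*along_x)]           # F[u][v]


def max_slice_error(image):
    """Largest difference between DFT(projection) and the v = 0 slice of the 2D DFT."""
    projection = [sum(row) for row in image]   # integrate the image along y
    slice_1d = dft1(projection)
    central = [row[0] for row in dft2(image)]  # F[u][0]
    return max(abs(a - b) for a, b in zip(slice_1d, central))


image = [
    [0, 1, 2, 0, 0],
    [1, 3, 5, 2, 0],
    [0, 2, 4, 1, 0],
    [0, 0, 1, 0, 0],
]
# The two quantities agree up to floating-point rounding.
print(max_slice_error(image) < 1e-9)
```

In cryo-EM the same relationship holds between each 2D projection image and a central plane of the 3D Fourier transform, but the orientation of that plane is not known in advance, which is why the iterative refinement described above is needed.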
3.1.6
A Data Explosion in Cryo-EM
We must mention at this point that a technological revolution is taking place in cryo-EM, and we will soon
be faced with a “data explosion” in the size and complexity of cryo-EM imaging data that needs to be processed. The latest generation of cryo-EM instruments allows samples to be processed automatically at much
greater rates than earlier instruments, and uses modern complementary metal oxide semiconductor (CMOS)
direct detectors instead of the earlier charge-coupled device (CCD) cameras to record micrographs of up
to 4K×4K pixels per image. CMOS detectors have a much better signal-to-noise ratio than CCDs because
the CMOS chip measures the scattered electrons directly from the sample, whereas previously a phosphorescent layer had to be used to convert the scattered electrons into optical light for the CCD. Furthermore,
because direct detectors are much faster than CCDs, there is less time for a sample to move as it is being
imaged by the electron beam. Thus, higher resolution images may be captured. Along with the intriguing
prospect of being able to trap biological systems in unprecedented levels of detail, there will also come an
increasing need to analyse, annotate, and interpret the enormous volumes of data that will soon flow from
the latest instruments.
However, it is worth noting that while direct detectors can allow very high resolution density maps to
be calculated in favourable conditions, according to the latest statistics from the public “EMDB” repository,
the average resolution of the deposited cryo-EM maps is currently getting worse, having gone from around 16 Å
RMSD (footnote 6) in 2012 to around 23 Å RMSD in 2014 (footnote 7). Therefore, it is still very important to be able to process low
resolution density maps. Indeed, a total of only 26 high resolution maps (≤ 4 Å RMSD), which is less than
1% of all maps in the EMDB, have actually been solved to date (footnote 8). Nonetheless, the technological advances
described above will mean that the number, size, and complexity of the structures that can be studied by
cryo-EM will only increase in the future.
3.1.7
Integrative Cryo-EM Structure Modeling
But achieving a good density map is still only part of the problem, because the final map will still often
be of low resolution compared to X-ray crystallography. To achieve atomic resolution, it is necessary to
fit previously solved crystallographic fragments of protein structures into the density map. However, the
problem here is that large molecular machines will have multiple sub-components, some of which will be
unknown, and many of which will fit each part of the map almost equally well. Thus, the general problem of
building high resolution 3D models from cryo-EM data is like building a complex 3D jigsaw puzzle in which
several pieces may be unknown or missing, and none of which will fit perfectly. Figure 2 illustrates the task
of fitting a high resolution crystal structure into a low resolution density map using cross-correlation (CC)
and normalised cross-correlation (NCC) scoring functions, calculated using our gEMfitter program [T10].
6 In cryo-EM, the root-mean-squared deviation (RMSD) of a density map is normally calculated using a Fourier-shell overlap expression for consistency with the usual crystallographic notion of resolution.
7 http://www.ebi.ac.uk/pdbe/emdb/statistics_sp_res.html/.
8 http://www.ebi.ac.uk/pdbe/emdb/statistics_num_res.html/.
Here, different map resolutions were simulated by applying different Gaussian filters to the initial data. This
figure shows that, for this example, using a NCC with a Laplacian pre-filter gives the best performance on
low resolution maps. However, real maps are often much noisier than this example, and in such cases we
find that the NCC without a Laplacian filter often gives better results. While modern CMOS detectors will
help to reduce the problem of noise in new datasets, we believe it is still very important to have robust tools
which can process the many cryo-EM datasets that already exist (footnote 9).
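The difference between the CC and NCC scoring functions can be illustrated with a one-dimensional toy example. The sketch below is not gEMfitter code: the function names and the data are hypothetical, the search is over 1D translations only, and the Laplacian pre-filter is not shown.

```python
import math


def cc(window, template):
    """Plain cross-correlation: the dot product of window and template."""
    return sum(a * b for a, b in zip(window, template))


def ncc(window, template):
    """Normalised cross-correlation: mean-centred and scaled to [-1, 1]."""
    mean_w = sum(window) / len(window)
    mean_t = sum(template) / len(template)
    dw = [a - mean_w for a in window]
    dt = [b - mean_t for b in template]
    norm = math.sqrt(sum(a * a for a in dw) * sum(b * b for b in dt))
    return 0.0 if norm == 0 else sum(a * b for a, b in zip(dw, dt)) / norm


def best_offset(density, template, score):
    """Offset at which the template scores highest against the density."""
    m = len(template)
    scores = [score(density[s:s + m], template) for s in range(len(density) - m + 1)]
    return max(range(len(scores)), key=scores.__getitem__)


template = [1, 3, 1]                           # the "sub-unit" to be located
density = [0, 0, 1, 3, 1, 0, 0, 5, 5, 5, 0]    # true copy at offset 2, bright blob at 7

print(best_offset(density, template, cc))   # 7: CC is attracted to the bright blob
print(best_offset(density, template, ncc))  # 2: NCC recovers the true position
```

Plain CC rewards high density wherever it occurs, whereas NCC rewards agreement in shape, which is one reason why the normalised score tends to give sharper peaks.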
Problems due to structural flexibility also appear in cryo-EM. One way to deal with this is to collect and
classify ever more 2D images in order to build multiple 3D density maps because, in favourable cases,
different macromolecular conformations may sometimes be observed directly in the micrographs [X62].
Additionally, once a high-resolution protein sub-unit has been located reasonably well in a density map, it is
sometimes possible to improve the fit by using normal mode analysis to deform flexibly the high resolution
structure before re-fitting it into the map [X63]. We would also like to tackle the problem of flexible density
fitting. However, given the small size of the initial team, we do not consider this to be an immediate priority.
[Figure 2, central plot: RMSD (Å) versus map resolution (Å) for the CC, NCC, CC + ∇² and NCC + ∇² scoring functions.]
Figure 2: Illustration of the cryo-EM density fitting problem. The figure on the left shows the correlation peaks obtained
by using gEMfitter to fit the recA monomer into a low resolution cryo-EM density map when using CC and NCC scoring
functions with and without a Laplacian (∇²) pre-filter. (a) CC, (b) CC+∇², (c) NCC, (d) NCC+∇². Sharper peaks
correspond to better scoring functions. Thus, NCC+∇² is the strongest scoring function and CC is the weakest for this
example. The central figure shows the “breaking” resolution curves for the four scoring functions using simulated EM
density data. In this figure, the x axis shows map resolution (large positive values correspond to low resolution data), and
the y axis corresponds to the root mean squared deviation (RMSD) from the correct solution (small values correspond
to good predictions). This figure shows that the NCC+∇² scoring function gives the best overall performance for this
example. The figure on the right shows the top-scoring fit (red protein backbone) overlaid on the correct position (blue
backbone) that was obtained using NCC with a Laplacian pre-filter. Figure adapted from [T10].
Another very challenging computational problem is how first to locate all of the components correctly in
a large density map. We are collaborating with the Biocomputing group of Annick Dejaegere at the Institut de
Génétique et de Biologie Moléculaire et Cellulaire (IGBMC) in Strasbourg to construct approximate “bead”
models of macromolecular assemblies, which may then be used as fuzzy volumetric restraints for fitting high
resolution crystallographic protein structures into low resolution density maps. To extend this approach, we
want to explore ways to use biological knowledge to define connectivity restraints on the beads themselves.
For example, if it is known that two residues from different proteins must be in contact, this requirement may
be used to define a simple distance restraint. Furthermore, we expect that some types of data could be
transformed directly into distance restraints, such as data from mutagenesis experiments and fluorescence
or bioluminescence resonance energy transfer measurements (FRET or BRET, respectively) [X57, T8]. On
9 The EMDB currently contains a total of 2,666 3D density maps (http://www.ebi.ac.uk/pdbe/emdb/statistics_main.html/). It would be quite impractical to re-generate all of this data using the latest instruments.
the other hand, other types of data such as small-angle X-ray scattering (SAXS) curves might only be useful
when validating a final model.
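As an illustration of how a known residue-residue contact might be turned into a distance restraint, the sketch below uses a flat-bottomed harmonic penalty. This functional form, the function name, and the parameter names are assumptions made for the example; the text above only requires that the restraint be simple.

```python
import math


def restraint_penalty(pos_a, pos_b, max_distance, weight=1.0):
    """Flat-bottomed harmonic penalty for a pair of 3D positions.

    The penalty is zero while the two positions are within max_distance of
    each other, and grows quadratically with the excess distance otherwise.
    """
    excess = math.dist(pos_a, pos_b) - max_distance
    return weight * excess ** 2 if excess > 0 else 0.0


bead_a = (0.0, 0.0, 0.0)
bead_b = (3.0, 4.0, 0.0)                        # 5 Å from bead_a
print(restraint_penalty(bead_a, bead_b, 8.0))   # 0.0: restraint satisfied
print(restraint_penalty(bead_a, bead_b, 3.0))   # 4.0: violated by 2 Å
```

A FRET or BRET measurement could be encoded in the same way by choosing `max_distance` from the experimental distance estimate, although the appropriate tolerance would have to be calibrated for each type of data.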
One way to deal with such heterogeneous data is to construct and optimise a multi-objective scoring
function of the form [X64]
$$F(x_1, \ldots, x_n) = \sum_{i=1}^{n} f(x_i) + \sum_{i=1}^{n} \sum_{j=i+1}^{n} g(x_i, x_j), \qquad (4)$$
where the $x_i$ represent free variables, and $f(x_i)$ and $g(x_i, x_j)$ represent single-body and pair-wise scoring
functions, respectively. For example, $f(x_i)$ might represent the score for fitting a protein into a cryo-EM
density map, and $g(x_i, x_j)$ might represent a docking score between two proteins. However, as indicated
above, with more than a handful of proteins, the search space is too large to enumerate blindly. On the
other hand, if some prior knowledge or hypotheses about the solution are available to reduce the number of
pair-wise terms, it can be advantageous to decompose the graph of $g(x_i, x_j)$ into a junction tree [X64, X65]
in which nodes represent groups of coupled variables and edges represent dependencies between nodes.
This allows optimisation techniques such as non-serial DP (which can assign optimal values to variables in a
non-prescribed order [X66]) to be used to exploit regions of sparse dependencies in the junction tree in order
to eliminate variables and to find a global solution efficiently. This approach has been used successfully in
the multi-component cryo-EM fitting problem [X64]. However, it is essential that the tree width (i.e. the largest
number of variables in a node) be small because the overall computational cost depends exponentially on
this quantity.
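The benefit of sparse pair-wise terms can be seen in the simplest possible case, in which only consecutive variables are coupled, so that the junction tree is a chain whose nodes each hold two variables. The sketch below maximises a score of the form of Equation 4 under that assumption, both by brute force and by eliminating variables one at a time. It is an illustration only, with hypothetical names and arbitrary integer scores, and is not the method of [X64].

```python
from itertools import product


def total_score(f, g, xs):
    """F for a chain: single-body terms plus terms coupling x_i and x_(i+1)."""
    single = sum(f[i][x] for i, x in enumerate(xs))
    pair = sum(g[i][xs[i]][xs[i + 1]] for i in range(len(xs) - 1))
    return single + pair


def brute_force(f, g, n, k):
    """Maximise F by enumerating all k**n assignments."""
    best_xs = max(product(range(k), repeat=n), key=lambda xs: total_score(f, g, xs))
    return total_score(f, g, best_xs), best_xs


def chain_dp(f, g, n, k):
    """Maximise the same F in O(n k^2) time by dynamic programming."""
    best = list(f[0])   # best[b]: best score of x_0..x_i given x_i = b
    back = []           # back[i-1][b]: best value of x_(i-1) given x_i = b
    for i in range(1, n):
        new_best, choice = [], []
        for b in range(k):
            a = max(range(k), key=lambda a: best[a] + g[i - 1][a][b])
            new_best.append(best[a] + g[i - 1][a][b] + f[i][b])
            choice.append(a)
        best = new_best
        back.append(choice)
    xs = [max(range(k), key=best.__getitem__)]
    for choice in reversed(back):   # trace the optimal assignment backwards
        xs.append(choice[xs[-1]])
    xs.reverse()
    return total_score(f, g, xs), tuple(xs)


n, k = 4, 3   # 4 variables, 3 candidate values each (arbitrary toy scores)
f = [[(3 * i + 5 * a) % 7 for a in range(k)] for i in range(n)]
g = [[[(2 * i + 3 * a + 4 * b + a * b) % 5 for b in range(k)] for a in range(k)]
     for i in range(n - 1)]
print(brute_force(f, g, n, k)[0] == chain_dp(f, g, n, k)[0])  # same optimum
```

The brute-force search visits k^n assignments, whereas the DP visits of the order of n k^2 combinations. For a general junction tree, the exponent 2 is replaced by the number of variables in the largest node, which is why a small tree width is essential.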
Although we do not have a precise roadmap to a solution for the assembly problem, we wish to proceed
firstly by putting more emphasis on the single-body terms in the scoring function, and secondly by using
fast CG representations and knowledge-based distance restraints to prune large regions of the pair-wise
search spaces. For example, we want to improve the cryo-EM density fitting calculations by adding the
surface skin model that we originally developed for protein docking [T1]. The idea here is to apply the usual
common-sense strategy when solving a 2D jigsaw puzzle of trying to place the edge pieces first. Because
the sub-units in the final 3D model should pack together quite tightly, there are obviously close parallels
between the cryo-EM density fitting problem and the multi-component docking problem. Since we know that
proteins cannot physically interpenetrate, we want to use fast CG representations of proteins to prune large
regions of the search space. Using such ideas, we wish to start exploring how to combine volumetric shape
matching, multi-component docking, and distance constraints in a practical and tractable way. Subsequently,
we will build on the experience gained with a view to combining more diverse knowledge-based constraints
with e.g. DP optimisation techniques.
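The idea of using a coarse-grained representation to discard interpenetrating placements can be sketched as follows. Here each protein is reduced to a handful of spherical beads, and trial placements are restricted to translations for brevity. The bead format, the function names, and the tolerance parameter are all assumptions made for this example.

```python
import math


def clashes(beads_a, beads_b, tolerance=0.0):
    """True if any bead of A overlaps any bead of B by more than tolerance.

    Each bead is a tuple (x, y, z, radius), with all lengths in the same unit.
    """
    for xa, ya, za, ra in beads_a:
        for xb, yb, zb, rb in beads_b:
            if math.dist((xa, ya, za), (xb, yb, zb)) < ra + rb - tolerance:
                return True
    return False


def translate(beads, shift):
    """Return a copy of the beads moved by the vector shift = (dx, dy, dz)."""
    dx, dy, dz = shift
    return [(x + dx, y + dy, z + dz, r) for x, y, z, r in beads]


def prune(fixed, mobile, trial_shifts):
    """Keep only the trial translations that do not make the two proteins clash."""
    return [s for s in trial_shifts if not clashes(fixed, translate(mobile, s))]


fixed = [(0.0, 0.0, 0.0, 5.0)]    # one 5 Å bead at the origin
mobile = [(0.0, 0.0, 0.0, 5.0)]   # an identical bead to be placed
shifts = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
print(prune(fixed, mobile, shifts))  # [(10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
```

Because the test involves only a few beads per protein, it is cheap enough to apply before any more expensive density fitting or docking score is evaluated. A small positive tolerance could be used to allow for the soft overlaps that coarse-grained models inevitably produce.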
Because all of the problems described here are computationally intensive, we will rely heavily on multiprocessor devices and parallel processing techniques to accelerate the calculations. For example, we successfully adapted our Hex protein docking and 3D-Blast shape matching algorithms to use high performance
graphics processor units (GPUs) to accelerate the calculations [T4]. In the near future, we plan to explore the
suitability of the emerging “Many Integrated Cores” (MIC) devices (footnote 10) for high throughput 3D shape matching
and docking.
10 http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
3.2
Classifying and Mining Protein Structures and Protein-Protein Interactions
Participants: Devignes, Ritchie, Maigret
3.2.1
Introduction – Prerequisites for Data Mining
The scientific discovery process is very often based on cycles of measurement, classification, and generalisation. It is easy to argue that this is especially true in the biological sciences. Often, proteins may be
divided into modular sub-units called domains, which can be associated with specific biological functions.
Thus, a protein domain may be considered as the evolutionary unit of biological structure and function [X67].
A widely used collection of protein domain families is “Pfam” [X68], constructed from multiple alignments of
protein sequences. However, while it is well known that the 3D structures of protein domains are often more
evolutionarily conserved than their one-dimensional (1D) amino acid sequences, comparing 3D structures
is much more difficult than comparing 1D sequences. Therefore, until recently, most evolutionary studies
of proteins have compared and clustered 1D amino acid and nucleotide sequences rather than 3D molecular structures. Indeed, efficient pattern matching algorithms such as FASTA [X69] and BLAST [X70] are
now standard tools for searching nucleotide and amino acid sequence databases, but there is still no generally accepted standard for how to align and compare two similar protein structures [X71]. Nonetheless,
it is widely accepted that in distantly related proteins, structure is more conserved than sequence [X72].
Furthermore, comparing structures can allow us to detect evolutionary relationships and similarities which
cannot be seen beyond a “twilight zone” of around 25% sequence similarity [X73]. Hence there is a strong
scientific need to develop new tools to study the structural relationships within and between protein families.
In structural biology, two widely used protein domain structure classifications are SCOP [X74] and CATH
[X75]. These classifications were created to catalogue the space of protein folds and to help identify functional and evolutionary relationships, especially those which might exist beyond the sequence twilight zone.
Both SCOP and CATH describe proteins in a four-level hierarchy, and both populate their hierarchies using
various sequence-based and structure-based comparison tools. Both classifications also require the use of
considerable human expertise to deal with novel structures which cannot be classified using automatic tools.
However, it is becoming increasingly difficult to keep such structural classifications up to date [X76]. For example, an increasing number of cases have been reported where a single fold family can give rise to multiple
functions, or indeed where certain families have had to be re-classified as new structures have been found
[X77]. While the majority of proteins appear to contain just one domain, around 30% to 40% of proteins contain two or more Pfam domains, and studying the nature of such multi-domain proteins could provide new
evolutionary insights [X78]. In any case, as more and more new 3D structures become available from structural genomics projects, it is now clear that the space of protein folds is a smooth multi-dimensional space
which cannot easily be sub-divided into a simple hierarchy. Indeed, according to Bourne and Shindyalov
[X79], “the ultimate and almost certainly unanswerable question is, can we establish a structure-based phylogenetic tree that evolved from a single common ancestor – the original protein fold?” Consequently, the
need for fast and reliable computational structural comparison tools and the opportunity to discover novel
structural and evolutionary relationships has never been greater. Our research will contribute to these two
aspects, first by improving comparison tools, and second by exploring novel classification paradigms.
3.2.2
Quantifying Structural Similarity
Concerning the structure comparison problem, we recently developed a new protein structure alignment
algorithm called Kpax (illustrated in Figure 3) which combines an efficient DP-based scoring function with a
simple but novel Gaussian representation of protein backbone shape [T11]. This means that we can now
quantitatively compare 3D protein domains at a similar rate of throughput to conventional 1D Needleman-Wunsch sequence comparisons. Currently, the main limitations of Kpax are that it cannot handle repeats or
transpositions of domains, and it does not detect alternative sub-optimal alignments. It only returns one (the
highest scoring) global alignment for each pair. On the other hand, we recently compared Kpax with a large
number of other structure alignment programs, and we found Kpax to be the fastest and amongst the most
accurate, in a CATH family recognition test [T12].
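For readers unfamiliar with DP-based alignment, the sketch below shows a generic Needleman-Wunsch global alignment driven by an arbitrary residue-residue similarity matrix. It is not the Kpax implementation: in Kpax the similarity values would come from the Gaussian backbone representation, whereas here they are derived from letter matches purely to provide test data, and the function name and gap penalty are assumptions. Like the current Kpax, the sketch returns a single highest-scoring global alignment.

```python
def global_alignment(sim, gap=-1):
    """Needleman-Wunsch alignment for a similarity matrix sim[i][j].

    Returns the optimal score and the list of aligned index pairs (i, j).
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # align i with j
                dp[i - 1][j] + gap,                    # gap in the second structure
                dp[i][j - 1] + gap,                    # gap in the first structure
            )
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


a, b = "ACGT", "AGT"
sim = [[2 if x == y else -1 for y in b] for x in a]  # toy similarity values
print(global_alignment(sim))  # (5, [(0, 0), (2, 1), (3, 2)])
```

The cost is proportional to the product of the two lengths, which is what makes structure comparison feasible at a rate similar to sequence comparison once a fast pair-wise similarity is available.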
Figure 3: Illustration of the Kpax structural alignment algorithm. (A) By exploiting the highly predictable tetrahedral
geometry around each Cα atom, successive peptide fragments may be compared in a common local coordinate frame
because similar secondary structure elements appear in similar orientations at the origin. Pairs of peptide fragments
may then be compared rapidly using products of local-frame Gaussian functions. (B) The Gaussian parameters are
derived using statistics from the CATH database. (C) Overlays of secondary structures may then be calculated rapidly
using DP. The superpositions obtained by Kpax are often tighter than those of the widely used TM-Align algorithm [X80].
Arrows highlight particularly tight regions of the Kpax alignment.
In general, protein alignment algorithms aim to maximise the number of aligned residues while simultaneously trying to minimise the root mean squared deviation (RMSD) between the aligned Cα atoms. But
here again, structural flexibility can cause considerable problems. If protein structures are assumed (incorrectly) to be rigid, as is the case for most current alignment algorithms, these two objectives conflict,
because aligning more residues generally increases the RMSD. Consequently, there is often some debate about what defines a “biologically meaningful” alignment,
or whether one particular alignment algorithm is “better” than another, especially when comparing structures
from different protein families. The latest version of Kpax (unpublished) can calculate flexible alignments at
no extra cost, and thus promises to avoid such issues when comparing more distantly related protein folds
and fold families.
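The tension between the number of aligned residues and the RMSD can be seen in a small worked example. The sketch below computes the RMSD over a set of already superposed, already paired Cα coordinates. It performs no optimal superposition, and the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired 3D coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


ca_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ca_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 4.0)]

print(rmsd(ca_a[:3], ca_b[:3]))  # 0.0 when only the three matching atoms are aligned
print(rmsd(ca_a, ca_b))          # 2.0 when the displaced fourth atom is included
```

Including the fourth pair raises the aligned count from three to four, but at the price of a worse RMSD. A rigid alignment must choose one side of this trade-off, whereas a flexible alignment could in principle accommodate the displaced atom.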
3.2.3
Formalising and Exploiting Domain Knowledge
Concerning protein structure classification, we aim to explore novel classification paradigms to circumvent
the problems encountered with existing hierarchical classifications of protein folds and domains. In particular
it will be interesting to set up fuzzy clustering methods taking advantage of our previous work on gene
functional classification [T13], but instead using Kpax domain-domain similarity matrices. A non-trivial issue
with fuzzy clustering is how to handle similarity rather than mathematical distance matrices, and how to find
the optimal number of clusters, especially when using a non-Euclidean similarity measure. We will adapt
the algorithms and the calculation of quality indices to the Kpax similarity measure.
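One simple way to obtain fuzzy memberships from a similarity matrix is sketched below. It assumes similarities scaled to the range [0, 1], converts them to dissimilarities as 1 - s, and assigns memberships to a given set of cluster representatives (medoids) using the inverse-dissimilarity weighting familiar from fuzzy c-means. These choices, and all of the names, are assumptions made for illustration. The sketch does not address how to choose the medoids or the number of clusters, and it is not the method of [T13].

```python
def fuzzy_memberships(similarity, medoids, m=2.0):
    """Membership of every item in every cluster, each row summing to 1.

    similarity[i][j] is assumed to lie in [0, 1]; medoids lists the index of
    the representative item of each cluster; m > 1 is the fuzzifier.
    """
    memberships = []
    for row in similarity:
        d = [1.0 - row[c] for c in medoids]   # dissimilarity to each medoid
        if any(x <= 0.0 for x in d):
            # The item coincides with one or more medoids: share it among them.
            hits = [x <= 0.0 for x in d]
            u = [1.0 / sum(hits) if h else 0.0 for h in hits]
        else:
            w = [(1.0 / x) ** (1.0 / (m - 1.0)) for x in d]
            u = [x / sum(w) for x in w]
        memberships.append(u)
    return memberships


similarity = [
    [1.0, 0.8, 0.5, 0.2],
    [0.8, 1.0, 0.5, 0.2],
    [0.5, 0.5, 1.0, 0.5],
    [0.2, 0.2, 0.5, 1.0],
]
for u in fuzzy_memberships(similarity, medoids=[0, 3]):
    print([round(x, 3) for x in u])
```

In this toy matrix, item 1 belongs mostly to the cluster represented by item 0, while item 2 is equally similar to both representatives and is therefore shared equally between the two clusters, which is the kind of behaviour that a strict hierarchy cannot express.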
More fundamentally, it will be necessary to integrate this classification step in the more general process
leading from data to knowledge called Knowledge Discovery in Databases (KDD) [X81]. The KDD process
can be divided into three steps: data preparation, data mining, and result interpretation. It is a largely iterative
process not only because data mining can be re-executed many times with various parameters, but also
because interpretation of mining results often leads the expert to redefine new datasets to be mined. It
is our experience that in addition to a human expert’s knowledge, using formalised domain knowledge
can help to guide the KDD process during each of the three steps [T14]. For example, integrating domain-domain similarity measures with knowledge about domain binding sites, as introduced by us in the KBDOCK
approach [T15, T16], can help in selecting interesting subsets of domain pairs before clustering.
Another example where domain knowledge can be useful is during result interpretation: several sources
of knowledge have to be used to explicitly characterise each cluster and to help decide its validity. Thus,
it will be useful to be able to express data models, patterns, and rules in a common formalism using a
defined vocabulary for concepts and relationships. Existing approaches such as the Molecular Interaction
(MI) format [X82] developed by the Human Proteome Organization (HUPO) mostly address the experimental
wet lab aspects leading to data production and curation [X83]. A different point of view is represented in
the Interaction Network Ontology (INO) (footnote 11), which is a community-driven ontology that is being developed to
standardise and integrate data on interaction networks and to support computer-assisted reasoning [X84].
However, this ontology does not integrate basic 3D concepts and structural relationships. With the help of
the Resource Index at the NCBO portal [X85], we will survey all existing ontologies referring to protein-protein
interactions and we will introduce whenever required the representation of structural features for reasoning
about interactions. For example, the abstract concept of Domain Family Binding Site that we introduced in
our KBDOCK approach must be integrated at its proper place in an ontology in order to formalize its use in
case-based homology modeling [T15]. Using such formalisms and symbolic relationships will be beneficial,
if not essential, when classifying the 3D shapes of proteins at the domain family level.
3.2.4
3D Protein Shape Mining
Thanks to our KBDOCK and Kpax projects, we already have a rich set of tools with which we can start to
process and compare all known protein structures and PPIs according to their component Pfam domains.
Starting from Pfam-defined structural families, we wish to perform a completely automatic clustering of all
structural domains in order to define a fresh structure-based classification of the protein universe. Linking
this new classification to the latest “SIFTS” (Structure Integration with Function, Taxonomy and Sequence)
functional annotations between standard UniProt (footnote 12) sequence identifiers and PDB structures [X86] could
then provide a useful way to discover new structural and functional relationships which are difficult to detect
in existing classification schemes such as CATH or SCOP. Going further, we also want to perform all-against-all binding site comparisons and docking calculations using our polar Fourier correlation algorithm. We will
then apply fuzzy or non-fuzzy classification methods to the correlation matrices obtained in order to define
optimal shape-density based classifications of protein binding sites and protein binding partners.
3.2.5
Mining Protein Interaction Networks
From a systems biology point of view, each known protein-protein interaction may be considered as an edge
linking two nodes in a network. Human diseases are often related to malfunctions or imbalances in such
networks, caused by natural genetic variations in our DNA called single nucleotide polymorphisms (SNPs).
However, determining the molecular origins of disease processes is extremely challenging. We will use
symbolic logic-based techniques to represent, reason about, and mine the relationships between disease
profiles, SNPs, and molecular interactions. In particular, we will combine these techniques with molecular
modelling methods to understand the consequences of genetic polymorphisms and mutations on protein
shapes and protein-protein interactions.
11 http://www.ino-ontology.org/
12 http://www.uniprot.org/
Although several groups have developed pair-wise docking algorithms, there has been very little work so
far on how to assemble these individual interactions into structural and functional networks. Furthermore,
while several ontologies and mark-up languages exist for the biology and chemistry domains (see e.g.
http://www.biosharing.org/), to our knowledge there does not yet exist any kind of standard for describing
molecular interactions at the structural level. Therefore, we will develop a formal framework to represent
and manipulate structural knowledge about molecular interactions. This should lead to richer computational
models of biological systems, and should open the way for the application of more sophisticated techniques
for mining 3D molecular interactions.
3.2.6
Modeling Evolutionary Relationships Between 3D Protein Structures
Even if Bourne and Shindyalov’s question concerning a possible “structural phylogenetic tree” is highly
thought-provoking, we believe it is probably unanswerable because it seems very unlikely that there could
ever have been just one ancestral protein fold. Instead, we prefer to ask some simpler but still challenging
questions. Namely, can we devise a computational model to explain how one ancestral structure might
diverge into two or more descendants? Or, more concretely, given some existing 3D structures of proteins
from nearby family groups in the CATH or SCOP classifications, can we build a parsimonious model of
how those structures might have physically mutated from some hypothetical parent structure? Going a little
further, we would then ask, can we identify which existing structure most closely resembles the ancestral
structure and therefore would provide a tangible 3D model of the ancestral structure? Going still further, we
might even ask, can we detect examples of structural convergence, in which a given fold might arise from
two different ancestral folds (in analogy to functional convergence, as observed in certain proteases)?
If protein molecules could be represented definitively as 1D strings, the above questions could be explored using conventional sequence alignment-based scoring techniques (e.g. using the notion of edit distance). In order to do ancestral inferencing from genetic sequence data, various techniques
have been developed in the standard framework of population genetics, such as likelihood maximisation
[X87], importance sampling [X88, X89], and Markov chain Monte Carlo models [X90, X91]. However, because symbolic alignment approaches start to break down beyond the 25% similarity twilight zone, they
become almost useless if the aim is to measure the differences between different structural families. Therefore, in collaboration with Nicolas Champagnat of the Institut Élie Cartan de Lorraine (IECL), we wish to
extend current ancestral inferencing techniques with new structural similarity scoring functions based on our
Kpax software in order to study the ancestral relationships between protein folds in a completely sequence-independent way. This represents a novel combination of computational strategies which could open the
way to answering a number of contemporary and far-reaching questions in structural biology. For example,
comparing the structural ancestry of proteins with the known phylogeny of species could help to confirm or
inform their function, and more generally could help to achieve a structural classification of proteins which
could explain the emergence of specialised functional domains and which could reveal interesting structural
relationships between the diverse fold families that we see today.
3.3
Specific Applications
The following applications are described here in order of priority, with the first two being very closely linked
to the team’s main objectives.
3.3.1
Transcription Factor II-D
Participants: Ritchie, Devignes
Transcription factor II-D (TFIID) is one of the key enzymes responsible for transcribing DNA into complementary RNA. This is the first step of translating a gene into a functional protein. TFIID is a large structure
consisting of some 16 protein molecules, although the precise 3D arrangement of these components is still
unknown [X92]. The 3D structure of TFIID is being actively studied by the cryo-EM group of Patrick Schultz
and the molecular modeling group of Annick Dejaegere at the IGBMC in Strasbourg. Figure 4 shows some
views of this system. There is considerable interest in solving the 3D structure of TFIID and other related
transcription factors because producing high resolution models of such complexes could have significant
bio-medical implications. A manual placement of some of the protein subunits into a low resolution EM
map has been proposed (P. Schultz, unpublished). However, several shortcomings of commonly used FFT
correlation methods became apparent during the attempts to automate this process due to the difficulties of
dealing with data at different scales of both size and resolution. Therefore, the TFIID system represents a
clear example of the kind of integrative structural biology problem that we wish to tackle (scales 2 and 3), as
it exemplifies the need both for powerful 3D image processing techniques and to be able to incorporate biological knowledge to focus the calculations. Indeed, we consider strengthening our collaborations with the
IGBMC and building new collaborations with other groups in the cryo-EM field to be an essential long-term
component of the Capsid project. We expect the collaboration with the IGBMC to continue for at least the
next four years, and we will try to build collaborations with other cryo-EM teams as well.
Figure 4: Cryo-EM density maps and micrographs of TFIID. (a) Two views of a cryo-EM map containing TFIID. (b)
Three X-ray sub-units fitted into the density map. (c) Evidence for movement of the TFIIA sub-unit, which is a co-factor
that binds to and stabilises TFIID. (d) Micrographs of TFIID-TFIIA-Rap1-DNA complexes showing the formation of a
DNA loop. Rap1 is one of several further co-factors necessary for transcription. Figure taken from [X93].
3.3.2
Prokaryotic Type IV Secretion Systems
Participants: Devignes, Ritchie
Prokaryotic type IV secretion systems constitute a fascinating example of a family of nanomachines capable
of translocating DNA and protein molecules through the cell membrane from one cell to another [X94].
The complete system involves at least 12 proteins, and is illustrated in Figure 5. The structure of the core
channel involving three of these proteins has recently been determined by cryo-EM experiments [X95, X96].
However, the detailed nature of the interactions between the remaining components and those of the core
channel remains to be resolved. Therefore, these secretion systems represent another family of complex
biological systems (scales 2 and 3) that call for integrated modeling approaches to fully understand their
machinery.
In the frame of the LORIA-MBI platform (see Section 6.2), MD Devignes has initiated a collaboration with
Nathalie Leblond of the Genome Dynamics and Microbial Adaptation (DynAMic) laboratory (UMR 1128, Université de Lorraine, INRA) on the discovery of new integrative conjugative elements (ICEs) and integrative
mobilisable elements (IMEs) in prokaryotic genomes. These elements use Type IV secretion systems for
transferring DNA horizontally from one cell to another. We have discovered more than 40 new ICEs/IMEs
by systematic exploration of 72 Streptomyces genomes. As these elements encode all or a subset of the
components of the Type IV secretion system, they constitute a valuable source of sequence data and constraints for modeling these systems in 3D. A collaboration with a crystallography group working with the
DynAMic laboratory is planned for producing and crystallising the most challenging components. This set of
3D protein sub-units will be used for testing our algorithms with the objective of (i) reconstituting the already
solved core channel and (ii) predicting the interactions leading to a complete and functionally active Type IV secretion system. Another interesting aspect of this particular system is that, unlike other secretion systems, the
Type IV secretion systems are not restricted to a particular group of bacteria. These nanomachines display
a broad phylogenetic distribution, and constitute an interesting topic for exploring structural evolution. We
expect to continue our collaboration with the DynAMic team for at least the next four years.
Figure 5: The structure of an archetypal Type IV secretion system (A. tumefaciens). (A) Schematic predicted structure
involving 12 different proteins [X94]. (B) Cryo-EM reconstruction of the central core region [X95].
3.3.3
G-protein Coupled Receptors
Participants: Maigret, Ritchie
G-protein coupled receptors (GPCRs) are cell surface proteins which detect chemical signals (scale 1) outside a cell and which transform these signals into a cascade of cellular changes (scale 4). Figure 6 shows the
structure of a recently solved example of a GPCR system, the β2 -adrenergic receptor [X97]. Historically, the
best documented signaling cascade is the one driven by G-protein trimers (guanine nucleotide binding proteins) [X98], which ultimately regulate many cellular processes such as transcription, enzyme activity,
and homeostasis. But other pathways have recently been associated with the signals triggered
by GPCRs, involving other proteins such as arrestins and kinases which drive other important cellular activities. For example, β -arrestin activation can block GPCR-mediated apoptosis (cell death). Malfunctions
in such processes are related to diseases such as diabetes, neurological disorders, cardiovascular disease,
and cancer. Thus, GPCRs are one of the main protein families targeted by therapeutic drugs [X99] and the
focus of much bio-medical research. Indeed, approximately 40–50% of current therapeutic molecules target
GPCRs. However, despite enormous efforts, the main difficulty here is the lack of experimentally solved 3D
structures for most GPCRs. Hence, computational modeling tools are widely recognized as necessary to
help understand GPCR function and thus to support biomedical innovation and drug design.
Figure 6: The X-ray structure of a classic GPCR system, the β2 -adrenergic receptor, shown in an artist’s representation
of the cell membrane (light blue). The trans-membrane receptor domain is in blue, the intra-cellular G-protein trimer
structures are in red, and a small agonist molecule bound to the receptor is shown in orange. Figure taken from
http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2012/.
In collaboration with medicinal chemistry colleagues in the universities of Bari (Italy) and Ramon-Lull
(Spain), we have long been interested in using computational techniques to develop new small-molecule
inhibitors of the CCR5 and CXCR4 receptors which are attacked by human immunodeficiency virus (HIV)
[T17, T18, T19]. Together with Catherine Llorens-Cortes at the Centre for Interdisciplinary Research in
Biology (CIRB; UMR 7241) at Collège de France, we are studying another GPCR called the APJ apelin
receptor, which is involved in the regulation of cardiovascular function (and which also appears to be one
of the co-receptors of HIV) [X100, X101]. One promising route to develop therapeutic molecules to control
heart disease is to design new small molecules which mimic the apelin signaling peptide but which have
better transport properties and which are degraded less quickly than the natural peptide. We participated in
the discovery of the only non-peptide ligand for APJ to have been found to date [X102].
As well as modeling the structures of GPCRs and working on GPCR-targeted drug discovery, we are developing new algorithms to improve conformational sampling of mutually flexible docking partners (as is the
case in the APJ/apelin system), and to speed up the virtual screening pipeline. While recent technological
advances now make it possible to run ever longer MD simulations, analysing the enormous datasets which
result (often many terabytes) is now becoming a major bottleneck. Hence we are interested in developing
novel clustering techniques to detect “interesting” events in long MD simulations. The apelin receptor system also nicely exemplifies our interest in the relationships between genetic variations and human diseases
(Section 3.2). For example, while the apelin peptide appears to be the same in all mammalian species,
several single nucleotide polymorphisms (SNPs) in the human APJ gene have been associated with different cardiovascular disease profiles [X101]. We are aware that research on GPCRs is highly competitive.
Nonetheless, one of us (B. Maigret) is an expert on modeling GPCR structures, and it is natural that the
team will continue to contribute in this area for at least the duration of B. Maigret’s emeritus position at the
LORIA, i.e. at least until the end of 2015.
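To make the idea of detecting "interesting" events in long MD simulations more concrete, the following minimal sketch applies single-pass leader clustering to trajectory frames and flags each change of cluster between consecutive frames as a candidate event. It is an illustration only and not the technique under development in the team: the function names, the RMSD cut-off, and the assumption that the frames have already been superposed are all hypothetical choices made for this example.

```python
import math

def rmsd(frame_a, frame_b):
    """RMSD between two pre-superposed frames, each a list of (x, y, z) tuples."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must contain the same number of atoms")
    total = sum((a - b) ** 2
                for p, q in zip(frame_a, frame_b)
                for a, b in zip(p, q))
    return math.sqrt(total / len(frame_a))

def leader_cluster(frames, cutoff):
    """Single-pass leader clustering of trajectory frames (hypothetical helper).

    Each frame joins the nearest existing leader within `cutoff`,
    otherwise it founds a new cluster. Returns (labels, leaders, events),
    where `events` lists the frame indices at which the cluster label
    differs from that of the previous frame.
    """
    leaders = []   # indices of the frames that founded each cluster
    labels = []    # cluster label assigned to each frame
    for i, frame in enumerate(frames):
        best, best_d = None, cutoff
        for c, j in enumerate(leaders):
            d = rmsd(frame, frames[j])
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            leaders.append(i)
            best = len(leaders) - 1
        labels.append(best)
    events = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return labels, leaders, events
```

Because each frame is compared only with the current cluster leaders, the cost grows with the number of clusters rather than with the square of the number of frames, which matters for trajectories of terabyte size. The result depends on frame order and on the cut-off, so a sketch like this is best seen as a first filter.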
4
The Team
4.1
Permanent Researchers
David Ritchie has a PhD in Computing Science (University of Aberdeen, 1998), a Masters in Artificial
Intelligence (Aberdeen, 1995), and a Bachelors in Chemistry (University of Bristol, 1978). Before coming to
France, he spent 9 years as a lecturer in the Department of Computing Science at the University of Aberdeen
(1999–2008). Thanks to an ANR Chaires d’Excellence award, he joined Amedeo Napoli’s Orpailleur team
at the LORIA in January 2009. He then obtained a permanent position with Inria in October 2010. He
obtained his Habilitation à Diriger les Recherches (HDR) in 2011 from Université Henri Poincaré (now part
of Université de Lorraine). Concerning research outputs, he is probably best known for his novel spherical
polar Fourier correlation technique for protein docking and 3D molecular shape matching. His Hex docking
software is one of the most widely used protein docking programs available (over 33,000 downloads).
He has published some 40 international journal articles (900 citations and H=17 in Thomson ISI) in the
fields of structural bioinformatics and chemoinformatics. His research has been funded by grants worth
approximately €1M from ANR and the British BBSRC and EPSRC.
Throughout his career, “Dave” Ritchie has been deeply involved in scientific computing, both as a professional software developer in the oil and chemicals industries (1979–1994) and later as an academic
researcher. His scientific motivation centres around his desire to understand how complex biological systems work at the molecular level. His practical contributions toward this goal involve developing novel and
efficient computational and knowledge-based techniques to represent and study the 3D shapes of biological
molecules and their complexes. He still enjoys writing his own software, but he also firmly believes that the
best way to make significant progress is to bring together a team of experts with complementary skills.
Marie-Dominique Devignes was trained at the Ecole Normale Supérieure (1977–1982). She obtained
her Masters in Biochemistry and Physiology in 1979 at the University of Paris VI and VII and the Agrégation
in Biochemistry and Physiology in 1980. After her PhD in Molecular Biology in 1982, she spent 18 months
in Germany. She was subsequently recruited by the CNRS in 1983. She received the bronze medal of the
CNRS in 1986 and her Doctorat d’Etat ès Sciences (equivalent to HDR) in 1988. She worked in the field
of Human Genetics at the CNRS in Villejuif from 1989 to 2000 and turned to computational biology when
joining the LORIA in Nancy in 2001. She now coordinates several collaborative bioinformatics projects
around the LORIA. She is internationally recognised in the fields of data integration and knowledge-based
approaches for bioinformatics. She has published over 45 international journal articles which have some
650 citations. In 2014 she will chair the European Conference on Computational Biology (ECCB), which is
the largest European conference in this field.
Bernard Maigret originally studied Chemical Engineering at the Ecole Nationale Supérieure de Strasbourg. He received his Doctorat d’Etat ès Sciences Physiques in 1975. From 1993 to 1997 he was the head
of the CNRS unit no. 510 “Interactions Moléculaires.” From 2003 to 2006, he was the head of the eDAM
team (Équipe de Dynamique des Assemblages Membranaires) of UMR 7565 in Nancy. During his scientific
career he published some 190 papers (4800 citations, H=39) on subjects ranging from quantum mechanics,
molecular dynamics, and virtual drug screening, to molecular graphics, clustering, and optimisation.
4.2
Team Software
The team members have already developed several techniques and tools for computational and knowledge-based modeling of protein structures and interactions which will provide useful foundations for this project.
In particular, we mention here in approximate order of maturity (oldest first):
• Hex – state of the art protein docking using polar Fourier correlations – http://hex.loria.fr/.
• HexServer – a GPU-powered web server for protein docking – http://hexserver.loria.fr/.
• Intelligo – a vector based semantic similarity measure for biological processes and molecular function
based on gene ontology (GO) terms – http://intelligo.loria.fr/.
• KBDOCK – a 3D database of structural protein-protein domain interactions – http://kbdock.loria.fr/.
• Kpax – a 3D protein and peptide structure database search and alignment algorithm – http://kpax.loria.fr/.
• gEMpicker – a parallel GPU-based particle picking tool for cryo-EM – http://gem.loria.fr/.
• gEMfitter – a parallel GPU-based cryo-EM density matching and docking tool – http://gem.loria.fr/.
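As a simple illustration of what a vector-based similarity over GO annotations can look like, the sketch below represents each gene product as a sparse vector of weighted GO terms and compares two such vectors by their cosine. This is a plain cosine measure written for illustration; it makes no claim to reproduce the Intelligo measure itself, and the term identifiers and weights used with it are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an unannotated gene is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical annotations: placeholder term names with made-up weights.
gene_a = {"GO:term_A": 1.0, "GO:term_B": 2.0}
gene_b = {"GO:term_B": 2.0, "GO:term_C": 1.0}
similarity = cosine_similarity(gene_a, gene_b)
```

A plain cosine treats every pair of distinct GO terms as unrelated. A measure that exploits the structure of the ontology would also need some notion of similarity between the terms themselves.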
Although the Hex docking program might be considered as our “flagship” software, we do not intend
to develop it much further as a protein-protein docking tool. Because Hex is fundamentally a rigid-body
docking algorithm, it is not well suited to modeling structural flexibility during docking, and the exponential
term in the spherical polar basis functions means that high resolution docking correlation calculations are
limited to protein domains of up to around 150 amino acid residues. Nonetheless, we do wish to re-structure
the polar Fourier correlation code in Hex in order to provide a general rotational FFT library. Together with
the GPU-accelerated Cartesian cross-correlation codes that we developed in gEMfitter, this new library will
be important for the cryo-EM assembly part of the project. We also expect that the main Hex program will
provide a useful test-harness with which to evaluate the new CG potentials that we plan to use during multi-component assembly. We expect that Kpax and the KBDOCK database will play important roles in the 3D
shape mining part of this project. Kpax provides a natural way to score the structural similarity of pairs of
protein domains, and we are currently extending it to calculate multiple flexible alignments within groups of
similar protein structures. The KBDOCK database contains information on all currently known 3D protein-protein interactions, classified according to Pfam families. It therefore represents an important resource for
analysing 3D protein binding sites and interfaces and for identifying structure-function relationships.
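The Cartesian cross-correlation mentioned above rests on the correlation theorem: the correlation of two gridded functions over all translations can be obtained from one forward transform of each function and one inverse transform of their product. The following one-dimensional, CPU-only sketch illustrates the principle. It is a toy written for this purpose and not the gEMfitter code; a real density-fitting calculation would work on 3D grids, would normally use an optimised FFT library, and would also search over rotations. All function names here are hypothetical.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT of a list of complex numbers.

    The length must be a power of two. The inverse transform is
    returned unnormalised (the caller divides by the length).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def circular_cross_correlation(template, target):
    """Return corr, where corr[k] = sum_m template[m] * target[(m + k) % n]."""
    n = len(template)
    if len(target) != n:
        raise ValueError("both grids must have the same length")
    f_template = fft([complex(v) for v in template])
    f_target = fft([complex(v) for v in target])
    product = [a.conjugate() * b for a, b in zip(f_template, f_target)]
    return [c.real / n for c in fft(product, inverse=True)]

def best_shift(template, target):
    """Translation (in grid steps) at which the template best matches the target."""
    corr = circular_cross_correlation(template, target)
    return max(range(len(corr)), key=corr.__getitem__)
```

For a grid of n points the FFT route costs on the order of n log n operations for all translations at once, compared with n squared for the direct sum, which is why this formulation is well suited to parallel hardware such as GPUs.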
5
Positioning
5.1
Positioning within Inria
Inria Domain: Santé, Biologie et Planète Numériques
Inria Theme: Biologie Numérique
The main motivation for proposing a new Inria team is to achieve greater institutional visibility and
support for the scientific problems we wish to address. This will help the team members to build scientific
momentum and to obtain further research funding at both national and international levels. Forming a
focused Inria team for structural bioinformatics will also create a stimulating environment in which to attract
and train bright young students who wish to work in this area. The permanent members of the Capsid team
currently belong to the Orpailleur project team. While the Orpailleur team received an excellent assessment
in the latest (2011) Inria Evaluation exercise, the evaluation report also noted that the team risked being
spread across too many axes. Consequently, the Project Committee at Inria Nancy recommended that the
life sciences sub-group should accelerate its plans to form a new team. Therefore, this proposal is consistent
with the scientific strategy of the Inria Nancy Grand Est centre.
More globally, Inria has long recognised the important contribution that the Computational Sciences can
make to our economic and societal well-being. In the Computational Biology and Bioinformatics theme,
several Inria teams are working on topics such as high-throughput sequence analysis (Bonsai, Genscale),
cellular modeling and molecular imaging (Beagle, Morpheme, Serpico), and integrative and systems biology
(Amib, Ibis), while other teams are developing sophisticated symbolic techniques to analyse genomic-scale
biological information (Bamboo, Dyliss, Magnome). Although molecular structure is sometimes an important
consideration for these teams, only Frédéric Cazals (ABS), Rumen Andonov (GenScale), and Jérôme Azé
and Julie Bernauer (Amib) make protein structure a central interest. Andonov’s work on protein structure
alignment uses constraint-based solver techniques to generate provably optimal alignments according to
a specified criterion (but often with a high computational cost). In contrast, our Kpax approach is designed
for very fast database searching while still giving tight, but not necessarily optimal, 3D superpositions. Azé
and Bernauer use 3D Voronoi representations and statistical learning techniques to construct potentials for
scoring protein-protein and protein-RNA interactions [X103]. More recently, Cazals used a Voronoi representation to define a purely geometric description of the environment of each atom within a protein-protein
interface [X104]. In contrast, our KBDOCK database approach uses a very simple shape clustering method
to represent families of protein binding sites purely symbolically. Thus, while ABS begins from geometric
foundations, our KBDOCK approach aims to bridge the gap between shape-based and knowledge-based
representations of molecular interactions.
It is worth noting that Cazals is also interested in modeling large multi-component protein complexes. In
particular, his group recently developed a geometric “toleranced model” (TOM) approach for verifying the
correctness of large multi-component models [X105]. Because the TOM approach represents each protein
as a union of 18 balls, it may be considered as a kind of CG geometric representation. Although the TOM
approach itself does not provide a way to build multi-component models from scratch, it could clearly be
used to assess or validate models proposed by Capsid. Thus, there are opportunities for ABS and Capsid
to collaborate constructively. Still, a distinctive feature of the Capsid team will be its combination of shape-based
and knowledge-based approaches for structural biology.
Concerning HPC, we can point out that several areas of bioinformatics are computationally intensive and
involve working with very large datasets. For high throughput sequencing, the GenScale team is studying
the use of advanced in-memory indexing techniques in conjunction with parallel processing at multiple levels of granularity. Similarly, the Bonsai team is building and collecting GPU-accelerated sequence
analysis tools as part of their Biomanycores project. In contrast, our aim is to use HPC to accelerate 3D
biomolecular shape comparisons. In this context, our work has some links to bio-medical image processing
teams such as Athena and Asclepios because the calculation of 3D volumes, gradients, and moments is
often a common feature of 3D imaging problems. We mention here that our main Inria collaborator is Sergei
Grudinin (team Nano-D), with whom we collaborate formally on an ANR project (see Section 7.1).
5.2
Positioning within the LORIA and the University of Lorraine
The LORIA is a mixed research unit (UMR 7503) which hosts permanent researchers from Inria, CNRS,
and the University of Lorraine. These researchers are distributed across some 27 teams, of which 16 are
common with Inria project teams. The LORIA teams are organised into five departments: 1. Algorithms,
Computation, Image & Geometry; 2. Formal Methods; 3. Networks, Systems and Services; 4. Knowledge
& Language Management; 5. Complex Systems & Artificial Intelligence. The 2012 AERES report on the
LORIA ranked all of these departments as “A+” or “A” on all assessment criteria. However, it noted that the
computational biology activities within the Orpailleur team (department 4) are not closely linked to the principal
theme of that department, although it acknowledged the application of knowledge-based approaches to the
life sciences. Consequently, one of the report’s conclusions was that the computational biology theme should
move to department 5. It is therefore proposed that the Capsid team will join department 5 – Complex
Systems & Artificial Intelligence. Nonetheless, the new team will still maintain close collaborations with
Orpailleur. For example, we recently obtained a studentship from the IAEM doctoral school to work on
biomedical knowledge discovery which will be jointly supervised by Adrien Coulet (Orpailleur).
The current teams of department 5 (Cortex, Kiwi, Maia, and Neurosys) study computational neuroscience, multi-agent autonomous systems, planning, robotics, cellular automata, and emergent or collective
behaviours in biological systems and social networks. Hence, there exist interesting complementarities and
opportunities for cross-fertilisation. For example, our work with GPCRs could be of interest to the Neurosys
team, which studies brain activity and anesthesia, because several of the receptors on the surface of nerve
cells are GPCRs. Conversely, some of the optimisation and planning techniques being developed in the
Maia team could help to find new ways to model protein flexibility, or to help solve multi-component docking
problems. Furthermore, there are some close parallels between mining human preferences or very large
social networks and searching for interesting biological relationships in protein interaction networks. Thus,
there could also be opportunities for collaborations with members of the Kiwi team.
5.3
National Positioning
Beyond Inria, our main national collaborators in structural biology are currently at the IGBMC in Strasbourg.
The Strasbourg teams are members of the French Infrastructure for Integrated Structural Biology (FRISBI),13
the national network to promote the use of experimental techniques to solve large biomolecular structures.
We are working with the cryo-EM group of Patrick Schultz at the IGBMC through a CNRS “PEPS” award
(Projet Exploratoire / Premier Soutien) and a LORIA-funded postdoc project to develop highly parallel correlation techniques on GPU clusters for automatic 2D particle picking and 3D shape-density matching. The
structural modeling team of Annick Dejaegere is developing a CG bead model of multi-component complexes. Apart from Inria-ABS, this team is one of the few groups in France who are actively developing new
computational algorithms for integrative structural biology.
Excluding classical force-field MD modeling software, which is mainly developed in biophysics and biochemistry laboratories, only a few groups in France or indeed world-wide are developing ab initio protein
docking algorithms. The group of Chantal Prévost at the IBPC in Paris developed Ptools, a programmable
docking framework that includes a normal mode-based model of protein flexibility [X106]. While this allows
protein flexibility during docking to be simulated efficiently, it is computationally expensive, which makes it
rather unsuitable for large-scale studies. Another French group that should be mentioned here is that of
Anne Poupon (now team Bios in Tours) and Jérôme Azé (Inria-Amib). Their approach uses machine learning techniques to train a scoring function based on Voronoi tessellations of protein shape [X103]. We have
a manuscript in preparation with Anne Poupon on modeling a multi-component GPCR signaling complex.
Concerning Inria’s scientific strategy, by building links and collaborations with external researchers in
biology and medicine, the activities of the Capsid team will closely match Inria’s objective to “integrate multiscale data, both temporally and spatially, in order to model complex biological systems” (Inria Strategic Plan
2013–2017).14 By developing computational methods to help investigate the molecular basis of diseases
such as heart disease and diabetes, the team will help to address the “Health and Well-Being” strategic
objective which is also mentioned in the Strategic Plan. The activities of the team will raise the profile of
Inria’s participation in several national organisations:
• IFB (Institut Français de Bioinformatique) – http://www.renabi.fr/
(formerly ReNaBI: Réseau National des plateformes Bioinformatiques).
• FRISBI (French Infrastructure for Integrated Structural Biology) – http://frisbi.eu/.
• SFBI (Société Française de Bio-Informatique) – http://www.sfbi.fr/.
• SFCI (Société Française de Chémoinformatique) – http://www.chemoinformatique.fr/.
• GdR 3003 Bioinformatique Moléculaire – http://www.gdr-bim.u-psud.fr/.
• Aviesan (Alliance nationale pour les sciences de la Vie et de la Santé) – http://www.aviesan.fr/.
13 http://frisbi.eu/
14 http://www.inria.fr/institut/strategie/plan-strategique
5.4
International Positioning
International progress in computational protein docking is assessed in the European conference on Critical
Assessment of PRedicted Interactions (CAPRI) [X107] and its American partner conference, Modeling of
Protein Interactions in Genomes (MPIG) [X108]. During the last 10 years, we have submitted docking predictions for almost all of the targets in the CAPRI experiment, and we have contributed to nearly all of the
CAPRI and MPIG meetings through oral presentations and posters. While our Hex polar Fourier correlation
approach can often produce acceptable predictions and is still one of the fastest, several groups get better
results because they are more skilled in the use of domain knowledge and because they use more sophisticated physico-chemical scoring functions and more expensive refinement protocols. Based on results
presented by Lensink and Wodak [X109] at the CAPRI-2013 conference,15 some leading protein docking
groups are those of Alexandre Bonvin (U Utrecht),16 Sandor Vajda (U Boston),17 Zhiping Weng (U Massachusetts),18 Paul Bates (Cancer Research UK),19 and Juan Fernandez-Recio (Barcelona Supercomputer
Center).20 Currently, very few of the CAPRI participants use CG representations, presumably because most
of the targets in CAPRI have been relatively small crystallographic dimers or trimers. However, a notable
exception is the ATTRACT algorithm of Martin Zacharias (Technical University of Munich).21 The ATTRACT
approach has also recently been used in cryo-EM density fitting [X110].
Mathematically, the most similar docking algorithm to ours is FRODOCK [X111], developed by the groups
of Pablo Chacon (CSIC, Madrid) and Ruben Abagyan (U California San Diego). The same authors also
developed a fast rotational correlation technique called ADP_EM for cryo-EM density fitting [X112]. Like
us, they exploit the special rotational properties of the spherical harmonic basis functions to accelerate the
rotational search, but they use numerically sampled radial shells for the radial coordinate whereas we use
orthogonal Gauss-Laguerre basis functions. This means that translations must be calculated numerically in
FRODOCK and ADP_EM, whereas they are calculated analytically in Hex.
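The role of the basis functions described above can be summarised in a few lines of notation. The symbols below are assumptions introduced for illustration and do not appear elsewhere in this document: a_{nlm} and b_{nlm} are expansion coefficients, y_{lm} are spherical harmonics, R_{nl} are orthonormal radial functions, and q is a scale parameter. The radial form shown is one commonly quoted Gauss-Laguerre choice, given up to normalisation, and may differ in detail from the one used in Hex.

```latex
% Expansion of a 3D property (e.g. shape density) in a spherical polar basis
\tau(r,\theta,\phi) = \sum_{n=1}^{N} \sum_{l=0}^{n-1} \sum_{m=-l}^{l}
    a_{nlm}\, R_{nl}(r)\, y_{lm}(\theta,\phi)

% Assumed Gauss-Laguerre radial functions (normalisation constant omitted)
R_{nl}(r) \propto e^{-\rho/2}\, \rho^{l/2}\, L_{n-l-1}^{(l+1/2)}(\rho),
    \qquad \rho = r^{2}/q

% A rotation mixes only the m index within each (n,l) block
a'_{nlm} = \sum_{m'=-l}^{l} D^{(l)}_{mm'}(\alpha,\beta,\gamma)\, a_{nlm'}

% With an orthonormal real basis, an overlap score is a coefficient dot product
S = \int \tau_A(\mathbf{r})\, \tau_B(\mathbf{r})\, dV = \sum_{n,l,m} a_{nlm}\, b_{nlm}
```

The third line is the property that both approaches exploit for the rotational search. The difference lies in the radial part: with analytic radial functions, the effect of a translation on the coefficients can also be expressed in closed form, whereas numerically sampled radial shells require the translated function to be re-sampled.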
Although nowadays it is difficult to make significant progress in ab initio docking algorithms, the CAPRI
experiment shows that the best docking models are often achieved when biological knowledge is used to
drive the calculation or to filter the results (“data-driven docking”) [X113]. Several groups have published
databases of structural protein-protein interactions [X114], and several others have published homology-based (or “template-based”) docking protocols [X115, X116, X117]. However, it still remains unclear how
best to link structural databases and docking algorithms in a reliable way. More fundamentally, there is still
no generally accepted way to define what actually constitutes a protein binding site or to quantify whether or
not two binding sites are structurally similar. In this respect, our KBDOCK approach represents one of the
newest and most promising developments to have been described [T15]. When tested on the widely used
Protein Docking Benchmark [X118], KBDOCK almost invariably finds a good docking model if a suitable
homology template exists in its case base.
While CAPRI continues to attract new participants, still only a few groups are using docking algorithms
to study protein interactions on a large scale. To our knowledge, the only genomic-scale docking experiment to have been reported to date was made by the group of Patrick Aloy at the Institute for Research in
Biomedicine (IRB) in Barcelona, who performed 3,700 protein docking calculations in yeast [X119]. This
study focused on performing high-throughput pair-wise docking calculations, and it did not attempt to use
15 http://tintin.science.uu.nl/CAPRI2013/home/home
16 http://www.nmr.chem.uu.nl/~abonvin/
17 http://structure.bu.edu/
18 http://zlab.umassmed.edu/zlab/
19 http://www.london-research-institute.org.uk/research/paul-bates
20 http://www.bsc.es/life-sciences/protein-interactions-and-docking
21 http://www.t38.ph.tum.de/index.php?id=17
biological knowledge or restraints to guide the calculations. Subsequently, a large-scale docking experiment
by the group of Alfonso Valencia at the Spanish Cancer Research Centre in Madrid was carried out using
our Hex docking program [X120]. By examining the profiles of docking scores obtained, they found that the
true interactions between 56 pairs of known interactors could often be distinguished from a background of
922 non-interactors. These results indicate that even simple shape-based algorithms can produce a useful
docking signal, which could potentially be used to train a learning algorithm.
The main groups that we are aware of who are developing multi-component assembly algorithms are
those of Willy Wriggers (D E Shaw Research, New York),22 Andrei Sali (U California, San Francisco),23 and
Haim Wolfson (U Tel Aviv).24 Wriggers’ Sculptor program [X121] is a graphical interface to his Situs toolkit
for cryo-EM modeling [X122]. This allows the user to build a multi-component model by incrementally adding
one protein sub-unit at a time into an EM density map. The groups of Sali and Wolfson aim for more automated assembly of the sub-units. They use spanning-tree [X61] and junction-tree [X64, X65] techniques to
represent the combinatorial search space. Sali’s group has recently made their Integrative Modeling Platform (IMP) software25 publicly available [X123]. This consists of Python modules for manipulating protein
structures, cryo-EM density maps, and SAXS profiles and other data, and for applying various kinds of distance and volume restraints to guide the scoring function. IMP includes the DOMINO junction-tree solver
developed by Wolfson’s group [X64]. By combining diverse multi-resolution data on the component protein
structures and their interactions, Sali’s team recently used IMP to help locate the positions of the major
substructures in the very large nuclear pore complex (NPC) comprising a total of 456 proteins [X124, X125].
Nonetheless, this endeavour required an enormous multi-disciplinary effort from many teams. In our estimation, integrative structural biology is one of today’s exciting frontiers of science. While we do not aspire
to compete with the IMP approach, we believe we can make a useful contribution to the field by focusing on
algorithmic aspects of the multi-component assembly problem.
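As a toy illustration of how a spanning tree can tame the combinatorial search space of multi-component assembly, the sketch below takes a list of scored pairwise interactions between sub-units and selects a maximum-weight spanning tree with Kruskal's algorithm, so that N sub-units are connected by N-1 pairwise interfaces. It is not the method implemented in IMP, DOMINO, or any of the programs cited above; the function name, the scores, and the convention that a higher score means a more favourable pair are all assumptions made for this example.

```python
def max_spanning_tree(n_units, scored_pairs):
    """Kruskal's algorithm for a maximum-weight spanning tree (or forest).

    n_units      -- number of sub-units, labelled 0 .. n_units - 1
    scored_pairs -- iterable of (score, i, j), higher score = more favourable
    Returns the selected interfaces as a list of (i, j, score).
    """
    parent = list(range(n_units))

    def find(x):
        # Union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for score, i, j in sorted(scored_pairs, reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:        # skip pairs that would close a cycle
            parent[root_i] = root_j
            tree.append((i, j, score))
    return tree

# Hypothetical pairwise docking scores between four sub-units.
pairs = [(9.0, 0, 1), (7.5, 1, 2), (7.0, 0, 2), (3.0, 2, 3), (1.0, 0, 3)]
interfaces = max_spanning_tree(4, pairs)
```

A single tree built from the best-scoring pairs ignores steric clashes between sub-units that are not directly connected, so in practice many candidate trees and many candidate poses per edge would have to be enumerated and checked.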
Concerning our plans to model 3D evolutionary relationships, several reviews in structural biology have
pointed to the evident structural relationships between different protein families (see, e.g., [X126, X127,
X128]). The originators of the SCOP and CATH protein structure classifications such as Alexey Murzin (U
Cambridge), Christine Orengo (University College, London), and Willie Taylor (NIMR, London) are now considered as world experts on this subject. These groups, and many others [X129], are still actively developing
tools to compare and classify protein structures. However, to our knowledge, nobody has yet attempted to
compute 3D distances between related structures from an evolutionary point of view. Furthermore, the idea
of using 3D structural similarity scoring to reconstruct the ancestry of different protein families is also novel,
partly because until very recently no sufficiently efficient scoring algorithm has been available, but mainly
because the basic notion of constructing a mathematical model of 3D protein structure evolution is also
completely new, and will involve combining both micro-evolutionary and macro-evolutionary approaches.
6
Collaborations and Technology Transfer
6.1
Research Collaborations
Because our scientific objectives stand at the interface between informatics and biology, it is very important that we develop collaborations with external researchers in biology and medicine. This will ensure that
22 http://www.deshawresearch.com/members_c-b_wriggers.html
23 http://salilab.org/
24 http://www.cs.tau.ac.il/~wolfson/
25 http://www.salilab.org/imp/
we understand the state of the art in the target domain, and that we can address relevant problems at the
frontiers of research.
Around the University of Lorraine we have a number of bioinformatics collaborations with colleagues in
the life sciences. Several of these collaborations are currently supported by the MBI platform which is funded
jointly by Inria and Région Lorraine. This platform supports several projects concerning biomolecular modeling, systems biology, and knowledge discovery. As well as providing bioinformatics resources, the platform
also provides a framework for training biology colleagues in molecular dynamics and data mining methods.
Current MBI projects include modeling the interactions between polyphenols and certain biocatalysts (ENSAIA), clustering very large molecular datasets (with LORIA-Qgar), comparative genomic studies of forest-microorganism ecosystems (INRA-IAM), comparing protein interaction networks to discover disease genes
(with CHU Nancy), studying drug side effects (with MPI Saarbrücken), protein secondary structure prediction (LORIA-ABC), modeling promoter sequences in bacteria (INRA), and genetic cohort studies (Inria-BIGS,
IECL, CHU-Nancy). We are also participating in the EXPLOR project (Ensemble de Calcul Scientifique
pour la LORraine; porteur: Gérald Monard, SRSMC), which is working to set up a shared computational
Mésocentre for local researchers in the physical and life sciences.
We are actively developing further local collaborations. For example, with Chris Chipot’s eDAM team
(UMR 7565), we are preparing a proposal for the Human Frontier Science Program (HFSP) for a project
on modeling protein-protein interactions in the brain. With Nathalie Leblond and Gérard Guedon from the
DynAMic team (UMR 1128), we recently submitted a short proposal to the ANR to study integrative and
conjugative elements in bacteria. This proposal was not selected for a full proposal, but we have since
submitted a proposal for a joint INRA-Inria studentship for the same project. We recently obtained funding
for a “PEPS Mirabelle” project with Philippe Jonveaux and other colleagues from the Faculty of Medicine
(INSERM 954) to support a doctoral thesis project to study the use of Linked Open Data (http://lod-cloud.net/) in the biomedical
domain.
As mentioned above, our main external collaborators in France are in Grenoble and Strasbourg. Our
ANR-funded project with Sergei Grudinin (Inria-NanoD) also involves the team of Valentin Gordeliy at the
Institut de Biologie Structurale in Grenoble. In Strasbourg, we are working with members of the Integrated
Structural Biology Department at the IGBMC (Schultz and Dejaegere) to help reconstruct 3D
structures from low-resolution cryo-EM maps. We are also working with colleagues in Reims and Lyon
to build a project on modeling systems of extracellular proteins. At the international level, we have several
long-standing collaborators including Sandor Vajda (Structural Bioinformatics Laboratory, Boston University), Antonio Carrieri (Dipartimento Farmaco-Chimico, Università di Bari), Jordi Teixidó (Institut Químic de Sarrià,
Universitat Ramon Llull), and Tim Clark (Computer Chemistry Center, U Erlangen).
Our main strategy to increase our network of collaborations will be to participate in and to help organise
relevant national and international conferences (e.g. JOBIM, GGMM, CAPRI, ECCB, ISMB) and societies
(e.g. SFBI, SFCI, ISCB). This in turn will lead to opportunities to establish more formal collaborations through
joint projects funded by national and European funding agencies.
6.2 Technology Transfer
We are involved in technology transfer with both local and national organisations. While we do not expect
that our activities will directly lead to marketable software, some of our projects could lead to patentable
discoveries. Where appropriate, we will consult with Inria business advisors and lawyers concerning the
protection of our intellectual property and the transfer and exploitation of our results. All three of us (DR,
MDD, and BM) are scientific advisors to Harmonic Pharma (see below). The following list summarises our
current technology transfer partners:
• Harmonic Pharma: this LORIA-CNRS spin-out company aims to add therapeutic value to drug-like
molecules and to reposition existing drugs – http://www.harmonicpharma.com/.
• BioProlor: this is a consortium of six regional enterprises who are collaborating with the University of
Lorraine, INRA, CNRS, INSERM, and ourselves to develop new pharmaceutical and cosmetic products – http://www.bioprolor.com/.
• IFB: through our MBI platform (Modélisation de Biomolécules et leurs Interactions), we are one of
the partners of the north-east section of the IFB (formerly ReNaBI: Réseau National des plateformes Bioinformatiques) which includes the labs CIB and LIFL (Lille), IGBMC (Strasbourg), and MMP
(Reims) – http://www.renabi.fr/.
Because funding for the MBI platform ended in December 2013, one of us (MDD) is working to create a new
platform for interdisciplinary engineering in Lorraine (project InterBioNum). This will be a resource shared
among the supervising institutions (tutelles) for bioinformatics consulting, training, and technology transfer. We expect this platform
to be closely aligned with the new regional Mésocentre, which will also coordinate training and technology
transfer.
7 Funding
7.1 Current Projects
The following list summarises our currently funded research projects:
• ANR “PEPSI” (Polynomial Expansions of Protein Structures and Interactions), 2011 – 2015, joint with
Inria Grenoble, €162K for Inria Nancy.
• ANR “IFB” (Institut Français de Bioinformatique), 2013, joint with CNRS, CEA, INRA, INSERM, €60K
for Inria Nancy.
• FUI/FEDER project “LBS” (Le Bois Santé – to exploit wood products in the pharmaceutical and nutriment domains), 2013 – 2015, €57K (approx).
• CNRS-UL PEPS “EXPLOD-BioMed” (Exploring the Linked Open Data (LOD) for knowledge discovery:
Applications to the biomedical domain), 2013 – 2014, with CHU Nancy, €15K.
We have recently obtained two doctoral bursaries:
• Bourse Doctorale de l’IAEM Ecole Doctorale (Exploring the Linked Open Data (LOD) for knowledge
discovery: Applications to the biomedical domain), 2013 – 2016, joint supervision with Adrien Coulet
(MdC, Université de Lorraine), €132K (approx).
• Bourse Doctorale de la Fédération Charles Hermite (Modeling evolutionary relationships between
three-dimensional protein structures), 2013 – 2016, joint supervision with Nicolas Champagnat (Inria/IECL), €132K (approx), co-funded by Région Lorraine.
7.2 Future Funding Strategy
We will propose projects for PhD studentships in the annual competitions of the IAEM doctoral school and
Inria’s doctoral and post-doctoral training programmes. Where appropriate, we will also apply for co-funding
from Région Lorraine or from industrial partners (CIFRE). On a larger scale, we will discuss with colleagues
in other Inria teams the possibility of proposing an Action d’Envergure for structural bioinformatics, and we will
seek to form partnerships with other labs for ANR projects. Naturally, we will monitor calls for proposals for
opportunities in large national (e.g. LABEX), European (e.g. ERC), and international (e.g. ANR International)
programmes.
References
Team References
[T1] D. W. Ritchie and G. J. L. Kemp. Protein docking using spherical polar Fourier correlations. Proteins:
Structure, Function, Genetics, 39(2):178–194, 2000.
[T2] L. Mavridis and D. W. Ritchie. 3D-blast: 3D protein structure alignment, comparison, and classification
using spherical polar Fourier correlations. In Pacific Symposium on Biocomputing 2010, pages 281–292,
Hawaii, USA, January 2010. World Scientific Publishing.
[T3] D. W. Ritchie, D. Kozakov, and S. Vajda. Accelerating protein-protein docking correlations using a
six-dimensional analytic FFT generating function. Bioinformatics, 24(4):810–823, 2008.
[T4] D. W. Ritchie and V. Venkatraman. Ultra-fast FFT protein docking on graphics processors. Bioinformatics, 26:2398–2405, 2010.
[T5] D. Mustard and D. W. Ritchie. Docking essential dynamics eigenstructures. Proteins: Structure,
Function, Bioinformatics, 60:269–274, 2005.
[T6] V. Venkatraman and D. W. Ritchie. Flexible protein docking refinement using pose-dependent normal
mode analysis. Proteins, 80:2262–2274, 2012.
[T7] A. Ghoorah, M. Smaïl-Tabbone, M.-D. Devignes, and D. W. Ritchie. Protein docking using case-based
reasoning. Proteins, 81:2150–2158, 2013.
[T8] D. W. Ritchie. Recent progress and future directions in protein-protein docking. Current Protein and
Peptide Science, 9(1):1–15, 2008.
[T9] V. Venkatraman and D. W. Ritchie. Predicting multicomponent protein assemblies using an ant colony
approach. International Journal of Swarm Intelligence Research, 3:19–31, 2012.
[T10] T. V. Hoang, X. Cavin, and D. W. Ritchie. gEMfitter: a highly parallel FFT-based 3D density fitting tool
with GPU texture memory acceleration. Journal of Structural Biology, 184:348–354, 2013.
[T11] D. W. Ritchie, A. W. Ghoorah, L. Mavridis, and V. Venkatraman. Fast protein structure alignment
using Gaussian overlap scoring of backbone peptide fragment similarity. Bioinformatics, 28:3274–
3281, 2012.
[T12] L. Mavridis, V. Venkatraman, and D. W. Ritchie. A comprehensive comparison of protein structural
alignment algorithms. In 3DSIG – 8th Structural Bioinformatics and Computational Biophysics Meeting, volume 8, page 89, Long Beach, California, 2012. ISMB.
[T13] M.-D. Devignes, S. Benabderrahmane, M. Smaïl-Tabbone, A. Napoli, and O. Poch. Functional
classification of genes using semantic distance and fuzzy clustering approach: Evaluation with reference sets and overlap analysis. International Journal of Computational Biology and Drug Design,
5(3/4):245–260, 2012.
[T14] A. Coulet, M. Smaïl-Tabbone, A. Napoli, and M.-D. Devignes. Ontology-based knowledge discovery
in pharmacogenomics. In H. R. Arabnia and Q.-N. Tran, editors, Software Tools and Algorithms for
Biological Systems, Advances in Experimental Medicine and Biology, pages 357–66. Springer, 2011.
[T15] A. Ghoorah, M.-D. Devignes, M. Smaïl-Tabbone, and D. W. Ritchie. Spatial clustering of protein
binding sites for template based protein docking. Bioinformatics, 27:2820–2827, 2011.
[T16] A. Ghoorah, M.-D. Devignes, M. Smaïl-Tabbone, and D. W. Ritchie. KBDOCK 2013: a spatial classification of 3D protein domain family interactions. Nucleic Acids Research, 42:D389–D395, 2014.
[T17] A. Fano, D. W. Ritchie, and A. Carrieri. Modelling the structural basis of human CCR5 chemokine
receptor function: from homology model-building and molecular dynamics validation to agonist and
antagonist docking. Journal of Chemical Information and Modeling, 46(3):1223–1235, 2006.
[T18] V. I. Pérez-Nueno, D. W. Ritchie, O. Rabal, R. Pascual, J. I. Borrell, and J. Teixidó. Comparison of
ligand-based and receptor-based virtual screening of HIV entry inhibitors for the CXCR4 and CCR5
receptors using 3D ligand shape matching and ligand-receptor docking. Journal of Chemical Information and Modeling, 48(3):509–533, 2008.
[T19] V. I. Pérez-Nueno, S. Pettersson, D. W. Ritchie, J. I. Borrell, and J. Teixidó. Discovery of novel
HIV entry inhibitors for the CXCR4 receptor by prospective virtual screening. Journal of Chemical
Information and Modeling, 49(4):810–823, 2009.
External References
[X20] A. B. Ward, A. Sali, and I. A. Wilson. Integrative structural biology. Science, 339(6122):913–915,
2013.
[X21] C. Morris. Towards a structural biology work bench. Acta Crystallographica, D69:681–682, 2013.
[X22] T. Ideker, T. Galitski, and L. Hood. A new approach to decoding life. Annual Review of Genomics and
Human Genetics, 2:343–372, 2001.
[X23] T. Ideker and R. Sharan. Protein networks in disease. Genome Research, 18:644–652, 2008.
[X24] R. Sharan and T. Ideker. Modeling cellular machinery through biological network comparison. Nature
Biotechnology, 24:427–433, 2006.
[X25] A.S.J. Melquiond, E. Karaca, P.L. Kastritis, and A.M.J.J. Bonvin. Next challenges in protein-protein
docking: from proteome to interactome and beyond. WIREs Computational Molecular Science,
2:642–651, 2011.
[X26] H. Kitano. Systems biology: a brief overview. Science, 295:1662–1664, 2002.
[X27] P. Aloy and R. B. Russell. Structural systems biology: modelling protein interactions. Nature Reviews
Molecular and Cell Biology, 7:188–197, 2006.
[X28] P. Beltrao, C. Kiel, and L. Serrano. Structures in systems biology. Current Opinion in Structural
Biology, 17:378–384, 2007.
[X29] M. Makarow, L. Højgaard, and R. Ceulemans. Advancing systems biology for medical applications. ESF Science Policy Briefing, 35:1–12, 2008.
[X30] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski,
and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction
networks. Genome Research, 13:2498–2504, 2003.
[X31] A. Gutmanas, T. J. Oldfield, A. Patwardhan, S. Sen, S. Velankar, and G. J. Kleywegt. The role of
structural bioinformatics resources in the era of integrative structural biology. Acta Crystallographica,
D69:710–721, 2013.
[X32] M. L. Sierk and G. J. Kleywegt. Déjà vu all over again: Finding and analyzing protein structure
similarities. Structure, 12:2103–2111, 2004.
[X33] R. A. Goldstein. The structure of protein evolution and the evolution of protein structure. Current
Opinion in Structural Biology, 18:170–177, 2008.
[X34] P. J. Kundrotas, Z. W. Zhu, and I. A. Vakser. GWIDD: Genome-wide protein docking database.
Nucleic Acids Research, 38:D513–D517, 2010.
[X35] Q. C. Zhang, D. Petrey, L. Deng, L. Qiang, Y. Shi, C. A. Thu, B. Bisikirska, C. Lefebvre, D. Accili,
T. Hunter, T. Maniatis, A. Califano, and B. Honig. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature, 490:556–560, 2012.
[X36] M. R. Arkin and J. A. Wells. Small-molecule inhibitors of protein-protein interactions: progressing
towards the dream. Nature Reviews Drug Discovery, 3:301–317, 2004.
[X37] D. González-Ruiz and H. Gohlke. Targeting protein-protein interactions with small molecules: challenges and perspectives for computational binding epitope detection and ligand finding. Current
Medicinal Chemistry, 13:2607–2625, 2006.
[X38] P. Uetz et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
Nature, 403:623–627, 2000.
[X39] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid
analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences,
98:4569–4574, 2001.
[X40] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. M.
Michon, and C. M. Cruciat. Functional organization of the yeast proteome by systematic analysis of
protein complexes. Nature, 415:141–147, 2002.
[X41] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S.L. Adams, A. Millar, P. Taylor, K. Bennett, and
K. Boutilier. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass
spectrometry. Nature, 415:180–183, 2002.
[X42] P. Aloy and R. B. Russell. Ten thousand interactions for the molecular biologist. Nature Biotechnology,
22:1317–1321, 2004.
[X43] G. T. Hart, A. K. Ramani, and E. M. Marcotte. How complete are current yeast and human protein
interaction networks? Genome Biology, 7:120, 2006.
[X44] P. Bork, L. J. Jensen, C. von Mering, A. K. Ramani, I. Lee, and E. M. Marcotte. Protein interaction
networks from yeast to human. Current Opinion in Structural Biology, 14:292–299, 2004.
[X45] R. Chen, L. Li, and Z. Weng. ZDOCK: an initial-stage protein-docking algorithm. Proteins: Structure,
Function, Genetics, 52:80–87, 2003.
[X46] O. Bachar, D. Fischer, R. Nussinov, and H. J. Wolfson. A computer vision based technique for 3D
sequence-independent structural comparison of proteins. Protein Engineering, 6:279–288, 1993.
[X47] J. D. Jackson. Classical Electrodynamics. Wiley, New York, 1975.
[X48] E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. A. Friesem, C. Aflalo, and I. A. Vakser. Molecular
surface recognition: Determination of geometric fit between proteins and their ligands by correlation
techniques. Proceedings of the National Academy of Sciences, 89:2195–2199, 1992.
[X49] J. J. Gray, S. Moughan, C. Wang, O. Schueler-Furman, B. Kuhlman, C. A. Rohl, and D. Baker. Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, 331:281–299, 2003.
[X50] S. E. Dobbins, V. I. Lesk, and M. J. E. Sternberg. Insights into protein flexibility: The relationship
between normal modes and conformational change upon protein–protein docking. Proceedings of
the National Academy of Sciences, 105(30):10390–10395, 2008.
[X51] A. May and M. Zacharias. Energy minimization in low-frequency normal modes to efficiently allow for
global flexibility during systematic protein-protein docking. Proteins: Structure, Function, Bioinformatics, 70:794–809, 2008.
[X52] I.H. Moal and P.A. Bates. Swarmdock and the use of normal modes in protein-protein docking. Int.
J. Mol. Sci., 11(10):3623–3648, 2010.
[X53] M. Baaden and S. R. Marrink. Coarse-grained modelling of protein-protein interactions. Current
Opinion in Structural Biology, 23:878–886, 2013.
[X54] M. G. Saunders and G. A. Voth. Coarse-graining of multiprotein assemblies. Current Opinion in
Structural Biology, 22:144–150, 2012.
[X55] H. I. Ingólfsson, C. A. Lopez, J. J. Uusitalo, D. H. de Jong, S. M. Gopal, X. Periole, and S. R.
Marrink. The power of coarse graining in biomolecular simulations. WIREs Comput. Mol. Sci.,
DOI:10.1002/wcms.1169, 2013.
[X56] N. Basdevant, D. Borgis, and T. Ha-Duong. Modeling protein-protein recognition in solution using the
coarse-grained force field SCORPION. Journal of Chemical Theory and Computation, 9:803–813,
2012.
[X57] M. F. Lensink and S. J. Wodak. Docking and scoring protein interactions: CAPRI 2009. Proteins:
Structure, Function, Bioinformatics, 78:3073–3084, 2010.
[X58] A. Berchanski and M. Eisenstein. Construction of molecular assemblies via docking: modeling of
tetramers with D2 symmetry. Proteins: Structure, Function, Genetics, 53:817–829, 2003.
[X59] B. Pierce, W. Tong, and Z. Weng. M-ZDOCK: a Grid-Based approach for Cn symmetric multimer
docking. Bioinformatics, 21(8):1472–1478, 2005.
[X60] D. Schneidman-Duhovny, Y. Inbar, R. Nussinov, and H. J. Wolfson. Geometry-based flexible and
symmetric protein docking. Proteins, 60(2):224–231, 2005.
[X61] Y. Inbar, H. Benyamini, R. Nussinov, and H. J. Wolfson. Prediction of multimolecular assemblies by
multiple docking. Journal of Molecular Biology, 349:435–447, 2005.
[X62] H. E. White, E. V. Orlova, S. Chen, L. Wang, A. Ignatiou, B. Gowen, T. Stromer, T. M. Franzmann,
M. Haslbeck, J. Buchner, and H. R. Saibil. Multiple distinct assemblies reveal conformational flexibility
in the small heat shock protein Hsp26. Journal of Structural Biology, 14:1197–1204, 2006.
[X63] P. Chacón and J. R. López-Blanco. Journal of Structural Biology, 184:261–270, 2013.
[X64] K. Lasker, M. Topf, A. Sali, and H. Wolfson. Inferential optimization for simultaneous fitting of multiple
components into a cryoEM map of their assembly. Journal of Molecular Biology, 388:180–194, 2009.
[X65] K. Lasker, A. Sali, and H. J. Wolfson. Determining macromolecular assembly structures by molecular docking and fitting into an electron density map. Proteins: Structure, Function, Bioinformatics,
78:3205–3211, 2010.
[X66] U. Bertele. Nonserial Dynamic Programming. Academic Press, New York, 1972.
[X67] S. Yang and P. E. Bourne. The evolutionary history of protein domains viewed by species phylogeny.
PLoS One, 4:e8378, 2009.
[X68] R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunasekaran,
G. Ceric, K. Forslund, L. Holm, E. L. L. Sonnhammer, S. R. Eddy, and A. Bateman. The Pfam protein
families database. Nucleic Acids Research, 38:D211–D222, 2010.
[X69] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science, 227:1435–
1441, 1985.
[X70] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool.
Journal of Molecular Biology, 215:403–410, 1990.
[X71] M. J. Sippl and M. Wiederstein. A note on difficult structure alignment problems. Bioinformatics,
24:426–427, 2008.
[X72] V. B. R. Boojala and P. E. Bourne. Protein Structure and Evolution and the SCOP Database. In:
Structural Bioinformatics (eds P.E. Bourne, H. Weissig). Wiley-Liss, New Jersey, 2003.
[X73] G. S. Chan, Y. Hong, K. D. Ko, G. Bhardwaj, E. C. Holmes, R. L. Patterson, and D. B. van Rossum.
Phylogenetic profiles reveal evolutionary relationships within the twilight zone of sequence similarity. Proceedings of the National Academy of Sciences, 105:13474–13479, 2008.
[X74] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins
database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–
540, 1995.
[X75] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH - A
hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.
[X76] L. Holm, S. Kääriäinen, P. Rosenström, and A. Schenkel. Searching protein structure databases with
DaliLite v.3. Bioinformatics, 24:2780–2781, 2008.
[X77] A. Andreeva, D. Howarth, S. E. Brenner, T. J. P. Hubbard, C. Chothia, and A. G. Murzin. SCOP
database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Research, 32:D226–D229, 2004.
[X78] H. Tordai, A. Nagy, K. Farkas, L. Bányai, and L. Patthy. Modules, multidomain proteins and organismic
complexity. FEBS Journal, 272:5064–5078, 2005.
[X79] P. E. Bourne and I. N. Shindyalov. Structure Comparison and Alignment. In: Structural Bioinformatics
(eds P.E. Bourne, H. Weissig). Wiley-Liss, New Jersey, 2003.
[X80] Y. Zhang and J. Skolnick. TM-align: a protein structure alignment algorithm based on TM-score.
Nucleic Acids Research, 33(7):2302–2309, 2005.
[X81] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: An
overview. AI Magazine, 13:57–70, 1992.
[X82] H. Hermjakob et al. The HUPO PSI’s molecular interaction format – a community standard for the
representation of protein interaction data. Nature Biotechnology, 22(2):177–183, 2004.
[X83] S. Orchard et al. Protein interaction data curation: the international molecular exchange (IMEx)
consortium. Nature Methods, 9(4):345–350, 2012.
[X84] A. Özgür, Z. Xiang, D. R. Radev, and Y. He. Mining of vaccine-associated IFN-γ gene interaction
networks using the vaccine ontology. Journal of Biomedical Semantics, 2 (Suppl 2):S8, 2011.
[X85] C. Jonquet, P. Lependu, S. Falconer, A. Coulet, N.F. Noy, M.A. Musen, and N.H. Shah. NCBO
resource index: Ontology-based search and mining of biomedical resources. Web Semantics, 9:316–324,
2011.
[X86] S. Velankar, J. M. Dana, J. Jacobsen, G. van Ginkel, P. J. Gane, J. Luo, T. J. Oldfield, C. O’Donovan,
M.-J. Martin, and G. J. Kleywegt. SIFTS: structure integration with function, taxonomy and sequences
resource. Nucleic Acids Research, 41:D483–D489, 2012.
[X87] R. C. Griffiths and S. Tavaré. Ancestral inference in population genetics. Statistical Science, 9:307–
319, 1994.
[X88] S. Tavaré, D. J. Balding, R. C. Griffiths, and P. Donnelly. Inferring coalescence times from molecular
sequence data. Genetics, 145:505–518, 1997.
[X89] M. Stephens and P. Donnelly. Inference in molecular population genetics. Journal of the Royal
Statistical Society, B62:605–655, 2000.
[X90] M. Kuhner, J. Yamamoto, and J. Felsenstein. Estimating effective population size and mutation rate
from sequence data using Metropolis-Hastings sampling. Genetics, 140:1421–1430, 1995.
[X91] S. Tavaré. Ancestral inference in population genetics. Lectures on probability theory and statistics.
Lecture Notes in Mathematics, 1837:1–188, 2004.
[X92] G. Papai, P. A. Weil, and P. Schultz. New insights into the function of transcription factor TFIID from
recent structural studies. Current Opinion in Genetics and Development, 21:219–224, 2011.
[X93] G. Papai, M. K. Tripathi, C. Ruhlmann, J. H. Layer, P. A. Weil, and P. Schultz. TFIIA and the transactivator Rap1 cooperate to commit TFIID for transcription initiation. Nature, 465:956–961, 2011.
[X94] C. E. Alvarez-Martinez and P. J. Christie. Biological diversity of prokaryotic type IV secretion systems.
Microbiology and Molecular Biology Reviews, 73:775–808, 2011.
[X95] R. Fronzes, E. Schäfer, L. Wang, H. R. Saibil, E. V. Orlova, and G. Waksman. Structure of a type IV
secretion system core complex. Science, 323:266–268, 2011.
[X96] A. Rivera-Calzada, R. Fronzes, C. G. Savva, V. Chandran, P. W. Lian, T. Laeremans, E. Pardon,
J. Steyaert, H. Remaut, G. Waksman, and E. V. Orlova. Structure of a bacterial type IV secretion core
complex at subnanometre resolution. EMBO Journal, 32:1195–1204, 2013.
[X97] S. G. F. Rasmussen et al. Crystal structure of the β2 adrenergic receptor–Gs protein complex. Nature,
477:549–557, 2011.
[X98] A. G. Gilman. G proteins: transducers of receptor-generated signaling. Annual Review of Biochemistry, 56:615–649, 1987.
[X99] D. Filmore. It’s a GPCR world. Modern Drug Discovery, 7:24–28, 2004.
[X100] M. J. Kleinz and I. B. Wilkinson. Emerging roles of apelin in biology and medicine. Pharmacology
and Therapeutics, 107:198–211, 2005.
[X101] S. L. Pitkin, J. J. Maguire, T. I. Bonner, and A. P. Davenport. International union of basic and
clinical pharmacology. LXXIV. Apelin receptor nomenclature, distribution, pharmacology, and function.
Pharmacological Reviews, 62:331–342, 2010.
[X102] X. Iturrioz, R. Alvear-Perez, N. De Mota, C. Franchet, F. Guillier, V. Leroux, H. Dabire, M. Le Jouan,
H. Chabane, R. Gerbier, D. Bonnet, A. Berdeaux, B. Maigret, J.-L. Galzi, M. Hibert, and C. Llorens-Cortes. Identification and pharmacological properties of E339-3D6, the first nonpeptidic apelin receptor agonist. FASEB Journal, 24:1506–1517, 2010.
[X103] J. Bernauer, J. Azé, J. Janin, and A. Poupon. A new protein-protein docking scoring function based
on interface residue properties. Bioinformatics, 23:555–562, 2007.
[X104] B. Bouvier, R. Grünberg, M. Nilges, and F. Cazals. Shelling the Voronoi interface of protein-protein
complexes reveals patterns of residue conservation, dynamics, and composition. Proteins, 76:677–
692, 2009.
[X105] T. Dreyfus, V. Doye, and F. Cazals. Assessing the reconstruction of macromolecular assemblies
with toleranced models. Proteins: Structure, Function, Bioinformatics, 80:2125–2136, 2012.
[X106] A. Saladin, S. Fiorucci, P. Poulain, C. Prévost, and M. Zacharias. PTools: an open source molecular
docking library. BMC Structural Biology, 9(1):27, 2009.
[X107] J. Janin, K. Henrick, J. Moult, L. Ten Eyck, M. J. E. Sternberg, S. Vajda, I. Vakser, and S. J. Wodak.
CAPRI: a critical assessment of predicted interactions. Proteins: Structure, Function, Genetics,
52:2–9, 2003.
[X108] S. Vajda, I. A. Vakser, M. J. E. Sternberg, and J. Janin. Modeling of protein interactions in genomes.
Proteins: Structure, Function, Genetics, 47(4):444–446, 2002.
[X109] M. F. Lensink and S. J. Wodak. Docking, scoring, and affinity prediction in CAPRI. Proteins,
81:2082–2095, 2013.
[X110] S. J. de Vries and M. Zacharias. ATTRACT-EM: a new method for computational assembly of large
molecular machines using cryo-EM maps. PLOS One, 12:e49733, 2012.
[X111] J. I. Garzón, J. R. López-Blanco, C. Pons, J. Kovacs, R. Abagyan, and P. Chacón. FRODOCK: a
new approach for fast rotational protein-protein docking. Bioinformatics, 25:2544–2551, 2009.
[X112] J. I. Garzón, J. Kovacs, R. Abagyan, and P. Chacón. ADP_EM: fast exhaustive multi-resolution
docking for high throughput coverage. Bioinformatics, 23:427–433, 2007.
[X113] C. Dominguez, R. Boelens, and A. M. J. J. Bonvin. HADDOCK: a protein-protein docking approach
based on biochemical or biophysical information. Journal of the American Chemical Society,
125:1731–1737, 2003.
[X114] N. Tuncbag, G. Kar, O. Keskin, A. Gursoy, and R. Nussinov. A survey of available tools and
web servers for analysis of protein-protein interactions and interfaces. Briefings In Bioinformatics,
10(3):217–232, 2009.
[X115] D. Korkin, F. P. Davis, F. Alber, T. Luong, M.-Y. Shen, V. Lucic, M. B. Kennedy, and A. Sali. Structural
modeling of protein interactions by analogy: application to PSD-95. PLoS Computational Biology,
2(11):e153, 2006.
[X116] P. J. Kundrotas, M. F. Lensink, and E. Alexov. Homology-based modeling of 3D structures of protein-protein complexes using alignments of modified sequence profiles. International Journal of Biological
Macromolecules, 43(2):198–208, 2008.
[X117] G. Launay and T. Simonson. Homology modelling of protein-protein complexes: a simple method
and its possibilities and limitations. BMC Bioinformatics, 9:427, 2008.
[X118] H. Hwang, T. Vreven, J. Janin, and Z. Weng. Protein-protein docking benchmark version 4.0.
Proteins: Structure Function and Bioinformatics, 78:3111–3114, 2010.
[X119] R. Mosca, C. Pons, J. Fernandez-Recio, and P. Aloy. Pushing structural information into the
yeast interactome by high-throughput protein docking experiments. PLoS Computational Biology,
5(8):e1000490, 2009.
[X120] M. N. Wass, G. Fuentes, C. Pons, P. Pazos, and A. Valencia. Towards the prediction of protein
interaction partners using physical docking. Molecular Systems Biology, 7(469):1–8, 2011.
[X121] S. Birmanns, M. Rusu, and W. Wriggers. Using Sculptor and Situs for simultaneous assembly of
atomic components into low-resolution maps. Journal of Structural Biology, 173:428–435, 2010.
[X122] W. Wriggers. Using Situs for the integration of multi-resolution structures. Biophysical Reviews,
2:21–27, 2010.
[X123] D. Russel, K. Lasker, B. Webb, J. Velazquez-Muriel, E. Tjioe, D. Schneidman, B. Peterson, and
A. Sali. Putting the pieces together: Integrative modeling platform software for structure determination
of macromolecular assemblies. PLoS Biology, 10:e1001244, 2012.
[X124] F. Alber et al. Determining the architectures of macromolecular assemblies. Nature, 450:683–694,
2007.
[X125] F. Alber et al. The molecular architecture of the nuclear pore complex. Nature, 450:695–701, 2007.
[X126] N. V. Grishin. Fold change in evolution of protein structures. Journal of Structural Biology, 134:167–
185, 2001.
[X127] L. N. Kinch and N. V. Grishin. Evolution of protein structures and function. Current Opinion in
Structural Biology, 12:400–408, 2002.
[X128] A. Andreeva and A. G. Murzin. Evolution of protein fold in the presence of functional constraints.
Current Opinion in Structural Biology, 16:399–408, 2006.
[X129] H. Hasegawa and L. Holm. Advances and pitfalls of protein structure alignment. Current Opinion in
Structural Biology, 19:341–348, 2009.