3DSig 2008 - Najmanovich Research Group

3DSig 2008: Structural Bioinformatics and Computational Biophysics Satellite Meeting, July 18-19, 2008, Toronto, Canada
3DSig Organizing Committee:
Ilan Samish, University of Pennsylvania
Melissa Landon, Brandeis University
Rafael Najmanovich, European Bioinformatics Institute
John Moult, University of Maryland Biotechnology Institute
3DSig Scientific Committee:
John Moult, University of Maryland Biotechnology Institute
Brian Shoichet, UCSF
Ivet Bahar, Pittsburgh U.
Phil Bourne, UCSD
Tanja Kortemme, UCSF
Alfonso Valencia, CNIO-Madrid
Tamar Schlick, New York University
Table of Contents
Program
Keynote Abstracts
Oral Presentation Abstracts
Laptop Presentation Abstracts
List of Registrants
Index by Abstract Number
Day 1: Protein Structure, Function and Dynamics
Time | ID | Title | Presenting author
8:55 | Opening remarks – Ilan Samish
Session 1 Predicting, analyzing and evaluating dynamic function (Chair: Melissa Landon)
9:05 | K1 | Toward elucidating allosteric mechanisms of function via structure-based analysis of protein dynamics | Ivet Bahar
9:35 | 64 | 4D Structure-based Function Prediction | Dariya S. Glazer, Randall J. Radmer & Russ B. Altman
9:55 | 68 | An Automatic Server for Function Prediction Evaluation | Michael Tress, Alfonso Valencia, Michael Sternberg & Mark Wass
10:15 | Coffee
Session 2 Joint with Automatic Function Prediction SIG (Chair: John Moult)
10:35 | K2 | On the nature of protein fold space: extracting functional information from apparently remote structural neighbors | Donald Petrey, Markus Fischer & Barry Honig
11:15 | AFP | Assessing functional novelty of PSI structures via structure-function analysis of large and diverse superfamilies | Benoît H Dessailly, Oliver C Redfern & Christine A Orengo
11:35 | 16 | The evolution of protein function driven by a multi-domain repertoire (MGMS awardee) | Syed Ali & Michael Sternberg
11:55 | K3 | Prediction of functional characteristics based on sequence and structure | Alfonso Valencia
12:35 | Lunch & Poster/Laptop session (odd ID numbers)
Session 3 Protein – nucleic acid complexes (Chair: Chakra Chennubhotla)
15:00 | K4 | Chromatin structure insights revealed by mesoscale modeling | Tamar Schlick
15:30 | 66 | Predicting DNA-binding affinity of modularly designed zinc finger proteins | Peter Zaback, Jeffry D. Sander, J. Keith Joung, Daniel F. Voytas & Drena Dobbs
15:50 | 29 | Minor groove electrostatics and binding specificity | Remo Rohs, Sean West, Peng Liu & Barry Honig
16:10 | 60 | Combining Predictions of Protein Structure and Protein-RNA Interaction to Model the Structure of the Human Telomerase Complex | Ben A. Lewis, Mateusz Kurcinski, Deepak Reyon, Jae-Hyung Lee, Vasant Honavar, Robert L. Jernigan, Andrzej Kolinski, Andrzej Kloczkowski & Drena Dobbs
16:30 | Coffee
Session 4 From protein structure to mechanism (Chair: Roland Dunbrack)
17:00 | 67 | Channeling protein structure analysis towards understanding cough dynamics | Ilan Samish & William F. DeGrado
17:20 | 31 | Classification of mechanistically diverse enzyme superfamilies according to similarities in reaction mechanism | Daniel Almonacid & Patricia C. Babbitt
17:40 | Discussion Panel I: Dynamics is all? (Moderators: Ivet Bahar, Tanja Kortemme & Yaoqi Zhou)
18:30 |
19:00 | Dinner (7:00 Reception, 7:45 Dinner)
21:00 | K5 | I am not a PDBid I am a Biological Macromolecule | Philip Bourne
Day 2: Zooming in: proteins, residues, atoms, cofactors and drugs
Time | ID | Title | Presenting author
Session 5 From protein stability and flexibility to folding and design (Chair: Yaoqi Zhou)
9:00 | K6 | Conformational flexibility and sequence diversity in computational protein design | Tanja Kortemme
9:30 | 26 | Algorithms for protein design | Ivelin Georgiev, Cheng-Yu Chen & Bruce Randall Donald
9:50 | 50 | Poing: a fast and simple model for protein structure prediction | Benjamin Jefferys, Lawrence Kelley & Michael Sternberg
10:10 | 9 | Proteins: coexistence of stability and flexibility (MGMS awardee) | Shlomi Reuveni, Rony Granek & Joseph Klafter
10:30 | Coffee
Session 6 Ligand binding prediction and analysis (Chair: Warren Gallin)
11:00 | K7 | Hits, Leads & Artifacts from Virtual and High-Throughput Screening | Brian Shoichet
11:30 | 17 | Predicting small ligand binding sites on proteins using low-resolution structures | Andrew Bordner
11:50 | 19 | Scoring confidence index: statistical evaluation of ligand binding mode predictions | Maria Zavodszky, Andrew Stumpff-Kan, David Lee & Michael Feig
12:10 | 22 | Functional insights from binding sites similarities complement existing methods for prediction of protein function | Rafael Najmanovich & Janet Thornton
12:30 | Lunch & Poster/Laptop session (even ID numbers)
Session 7 New algorithms - from docking to drug discovery (Chair: Graham Wood)
15:00 | 14 | Logic-based drug discovery | Michael Sternberg, Stephen Muggleton, Ata Amini, Huma Lodhi, David Gough & Paul Shrimpton
15:30 | 43 | Conformational free energy of protein structures: computing upper and lower bounds | Hetunandan Kamisetty & Christopher Langmead
15:50 | 6 | Crystal contacts as nature's docking solutions | Eugene Krissinel
16:10 | 18 | Geofold: a mechanistic model to study the effect of topology on protein unfolding pathways and kinetics | Vibin Ramakrishnan, Saeed Salem, Saipraveen Srinivasan, Wilfredo Colon, Mohammed Zaki & Chris Bystroff
16:30 | Coffee
Session 8 Residue level structure prediction (Chair: BK Lee)
17:00 | 38 | The next generation of the backbone dependent rotamer library | Maxim Shapovalov & Roland Dunbrack
17:20 | 65 | Two stage residue-residue contact predictor | George Shackelford & Kevin Karplus
17:40 | Discussion II - Ligand binding (Moderator: Brian Shoichet)
18:25 | Closing Remarks – Rafael Najmanovich & John Moult
18:30 | End of 3DSig 2008
proteins without clear similarities with proteins of known
structure and function. Surprisingly far less attention has
been dedicated to the prediction of function, i.e. binding
sites, in proteins with clear homologs of known structure
(homology based function prediction), a non-trivial problem
that is of direct interest for experimental biologists. In this
presentation I will review some of the methods and
resources that my group has developed in this area (López
G, Valencia A, Tress ML. firestar--prediction of
functionally important residues using structural templates
and alignment reliability. Nucleic Acids Res. 2007
Jul;35(Web Server issue):W573-7. Lopez G, Valencia A,
Tress M. FireDB--a database of functionally important
residues from proteins of known structure. Nucleic Acids
Res. 2007 Jan;35(Database issue):D219-23.) A second large
area of activity is the prediction of
functional sites and more specifically the detection of
binding specificity sites that regulate the differential
interaction of proteins with specific substrates/effectors in
the context of large protein families. A full range of methods
for analysis of the variation in multiple sequence alignments
have been published. Still, it is fair to say that in general we
do not sufficiently understand the basic principles behind
the organization of specificity sites. I will present here our
recent efforts to analyze systematically the characteristics of
specificity sites in large collections of protein families and
structures (Rausell et al., in preparation). Finally, the third
field in which significant progress has been made in
recent years is the extraction of functional information
directly from the scientific literature. I will review the
current status of the text-mining methodology applied to
biological problems (Krallinger M, Hirschman L, Valencia
A. Linking genes to literature: text-mining, information
extraction and retrieval applications for Biology. Genome
Biology 2008, in press), describe some of the current efforts
to integrate text mining methods in function prediction
pipelines (Krallinger M, Rojas AM, Valencia A. Creating
reference datasets for Systems Biology applications using
text mining. New York Acad Sci. 2008. In press), and its
application to specific biological problems (Krallinger et al.,
in preparation). The integration of the methods developed in
these three areas with the many other new and old function
prediction strategies certainly remains a key future
challenge.
K4: CHROMATIN STRUCTURE INSIGHTS
REVEALED BY MESOSCALE MODELING
Tamar Schlick and Gaurav Arya in collaboration with S.
Grigoryev, S. Correll, and C. Woodcock (New York
University)
Eukaryotic chromatin is the fundamental protein/nucleic
acid unit that stores the genetic material. Understanding how
chromatin fibers fold and unfold in physiological conditions
(divalent ions, with linker histones) is important for
interpreting fundamental biological processes like DNA
replication and transcription regulation. Using a mesoscopic
model of oligonucleosome chains and tailored sampling
protocols, we elucidate the energetics of oligonucleosome
folding/unfolding and the role of each histone tail, linker
histones, and divalent ions in regulating chromatin structure.
KEYNOTE ABSTRACTS
K1: TOWARD ELUCIDATING ALLOSTERIC
MECHANISMS OF FUNCTION VIA STRUCTURE-BASED ANALYSIS OF PROTEIN DYNAMICS
Ivet Bahar (University of Pittsburgh)
____________________________________________
K2: ON THE NATURE OF PROTEIN FOLD SPACE:
EXTRACTING FUNCTIONAL INFORMATION
FROM APPARENTLY REMOTE STRUCTURAL
NEIGHBORS
Donald Petrey, Markus Fischer & Barry Honig (Howard
Hughes Medical Institute and Department of Biochemistry
and Molecular Biophysics, Center for Computational
Biology and Bioinformatics, Columbia University)
It has become increasingly apparent that geometric
relationships often exist between regions of two proteins
that have quite different global topologies. In this report, we
examine whether such relationships can be used to infer a
functional and evolutionary connection between the two
proteins in question. Our results indicate that there are often
unexpected functional similarities between proteins that
would normally be considered to be structurally dissimilar.
This suggests that, in analogy to protein sequence motifs,
locally similar geometric regions can be used to infer
functional relationships. The development of methods that
can detect common structural motifs should significantly
enhance our ability to extract information from structural
and functional databases.
K3: PREDICTION OF FUNCTIONAL
CHARACTERISTICS FROM STRUCTURE,
SEQUENCE AND PAPERS
Alfonso Valencia (Spanish National Cancer Research
Centre)
The limitations of the current function prediction
methodology, which is essentially based on the
extrapolation from database annotation of similar sequences,
are well known (López G, Rojas A, Tress M, Valencia A.
Assessment of predictions submitted for the CASP7
function prediction category. Proteins. 2007;69 Suppl 8:165-74, and Valencia A. Automatic annotation of protein
function. Curr Opin Struct Biol. 2005 Jun;15(3):267-74.
Review). Considerable scientific efforts are dedicated to the
development of computational methods that work outside
this paradigm and extract information from alternative
sources. In this talk I will focus on three areas of
bioinformatics to which my group has made recent
contributions.
The extrapolation of functional annotations, in particular
binding and catalytic sites, from the analysis of conserved
structural features is one of the more challenging fields of
Structural Bioinformatics. The availability of large
collections of proteins of known structure and poorly
characterized function has channeled most of the efforts
towards the very hard problems of predicting function for
can begin to answer. Questions such as, how pervasive are
references to structure across the biomedical literature?
What can be extracted that provides valuable automated
annotation? What previously unexpected and meaningful
associations can be made between structures by virtue of
their co-occurrence in the literature? The majority of the
structural biology literature has not been open until now, so
these are more questions for the future than today, but we
will discuss some initial findings and work that is being
done to leverage these associations [4].
Providing more of an identity to a macromolecular structure
does not necessarily come from the literature, but can come
from the community at large. Efforts such as Proteopedia [5]
exemplify this wisdom of crowds approach. Yet another
approach that puts a human face on a structure is the notion
of a mashup where the traditional content as found in a
database or journal article is combined with multimedia
content to create a different kind of learning experience
[6,7]. Will this profoundly change how we study 3D
structure in the future? Time will provide the answer to this
question, but we at least believe that in the vernacular of The
Prisoner, number six will be identified for who he really is
in the next few years.
[1] K. Henrick, Z. Feng, W.F. Bluhm, D. Dimitropoulos,
J.F. Doreleijers, S. Dutta, J.L. Flippen-Anderson, J. Ionides,
C. Kamada, E. Krissinel, C.L. Lawson, J.L. Markley, H.
Nakamura, R. Newman, Y. Shimizu, J. Swaminathan, S.
Velankar, J. Ory, E.L. Ulrich, W. Vranken , J. Westbrook,
R. Yamashita, H. Yang, J. Young, M. Yousufuddin, H.M.
Berman 2008 Nucleic Acids Research. 36: D426-D433.
[2] N. Deshpande, K.J. Addess, W.F. Bluhm, J.C. Merino-Ott, W. Townsend-Merino, Q. Zhang, C. Knezevich, L.
Chen, Z. Feng, R. Kramer Green, J.L. Flippen-Anderson, J.
Westbrook, H.M. Berman and P.E. Bourne 2005 The RCSB
Protein Data Bank: A Redesigned Query System and
Relational Database Based on the mmCIF Schema Nucleic
Acids Research. 33: D233-D237.
[3] P.E. Bourne 2005 In the Future will a Biological
Database Really be Different from a Biological Journal?
PLoS Comp. Biol. 1(3) e34.
[4] J.L. Fink, S. Kushch, P. Williams and P.E. Bourne 2008
BioLit: Integrating Biological Literature with Databases
Nucleic Acids Research. 36: W385-W389.
http://biolit.ucsd.edu.
[5] Eran Hodis, Eric Martz, Jaime Prilusky and Joel
Sussman 2008 http://www.proteopedia.org.
[6] J.L. Fink and P.E.Bourne 2007 Reinventing Scholarly
Communication for the Electronic Age. CT Watch, 3, 26-31.
[7] P.E.Bourne, J.L.Fink, M.Gerstein 2008 Open Access:
Taking Full Advantage of the Content PLoS Comp. Biol.
4(3) e1000037.
K6: CONFORMATIONAL FLEXIBILITY AND
SEQUENCE DIVERSITY IN COMPUTATIONAL
PROTEIN DESIGN
Tanja Kortemme (UCSF)
The overall compact topologies reconcile features of the
zigzag model with straight linker DNAs with the solenoid
model with bent linker DNAs for optimal fiber organization
and reveal a dynamic synergism of internal and external
factors in chromatin compaction.
K5: I AM NOT A PDBID I AM A BIOLOGICAL
MACROMOLECULE
Philip E. Bourne, Parker Williams and J. Lynn Fink
(Skaggs School of Pharmacy and Pharmaceutical Sciences,
University of California San Diego)
For the few in this audience who can see the parody between
the title of this talk and the quotes from The Prisoner "I am
not a number I am a person" or “I am not a number I am a
free man” the theme of my talk may be apparent. For the
remainder, may I first suggest you take a look, as we all do
for so many things these days, at the Wikipedia page
http://en.wikipedia.org/wiki/The_Prisoner . Somewhat of a
parody within a parody as the wisdom of crowds is also
featured in this talk. In The Prisoner, Patrick McGoohan
(left) strived to have his true identity recognized, so it is
with a macromolecular structure in the Protein Data Bank
(PDB). While strides have been made to create a better
identity for a PDBid through the wwPDB remediation effort
[1], and these will be summarized, PDB entries remain
somewhat featureless, some would say unannotated, with
respect to function, structural features, interactions with
other proteins and so on. Each site supporting the same
primary raw PDB data creates something of an identity,
typically through the associated UniProt sequence which
provides the necessary association to a variety of biological
resources [2]. Notwithstanding, either little else is known
about the structure, as is true of many structures determined
not through a functional motivation, but via structural
genomics, or what is known is found only in the literature.
In other words the data resides in one place, typically a
database, and the knowledge associated with that data
resides somewhere else, typically in one or more journal
articles [3]. This makes comprehending the full meaning of
a structure more difficult than it need be. The issue becomes
how to break this tradition either pre or post the
deposition/publication process? Pre is hard because it
involves changing scientists’ perceptions of what constitutes
a database entry versus a publication. Post has just been
made a little easier with the emergence of open access (OA)
publishing. Among other things OA implies that journal
articles will contain associated metadata amenable to
manipulation by computer. This is not quite the same as text
mining, which relies solely on establishing syntactic and
semantic relationships in written text. Here additional
tagging at the various stages of authoring and publication
can be brought to bear. This provides some interesting
prospects and raises some interesting questions which we
K7: HITS, LEADS, AND ARTIFACTS FROM
VIRTUAL AND HIGH-THROUGHPUT SCREENING
Brian Shoichet (UCSF)
Our work aims to improve structure-based function
prediction methods, such as FEATURE [1], by coupling
them to structural diversity generating methods, such as
Molecular Dynamics (MD) simulations. Our test function
was Ca2+ binding and our test set consisted of 5 molecules,
with two structures for each molecule: a HOLO (with Ca2+
present in the structure) and an APO (no Ca2+ present in the
structure). For each structure, a 1 nanosecond MD
simulation with explicit solvent was created using
GROMACS [2] software suit. For each system, 401
structures were extracted from the simulation trajectory, one
every 2.5 picoseconds.
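The snapshot count follows directly from the sampling interval: 1 ns sampled every 2.5 ps gives 400 intervals, and including the starting frame gives 401 structures. The short sketch below only reproduces that arithmetic; the function name snapshot_times is hypothetical and is not part of GROMACS or FEATURE.

```python
def snapshot_times(total_ps=1000.0, stride_ps=2.5):
    """Return the sampling times (in ps), including the frame at t = 0."""
    n_frames = int(round(total_ps / stride_ps)) + 1  # 400 intervals + first frame
    return [i * stride_ps for i in range(n_frames)]


times = snapshot_times()
print(len(times), times[0], times[-1])
```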
Based on physico-chemical properties across several
concentric spherical shells FEATURE determines whether a
3D structure contains a local environment that resembles a
site of interest, which for this work is a calcium binding site.
Using FEATURE we scanned the structures generated over
the course of the simulation over a 1 Å grid, identifying potential
centers of calcium binding. Several such points were
identified in each structure. This posed a new challenge: to
determine whether the identified points represented a single
putative calcium binding site or several, within a single
structure and among all structures generated by MD for each
ensemble. Such analysis is important in order to identify
true positive results. In general terms this challenge exists
whenever sites need to be tracked within a set of structures
generated by methods that explore conformational space.
Slight side chain deviations preclude simple geometric
comparisons between points in Cartesian space within
different structures in the structural ensemble generated by
MD simulations. We propose the following clustering
scheme as a plausible solution to this challenge. First,
FEATURE hits are compared in the bounds of their
respective structures. A Wilcoxon distance (z-value)
between all the pairs of hits within each structure is
calculated using the paired Wilcoxon rank sum test based on
the 50 atoms closest to each of the hits. Given a Wilcoxon
distance cut-off, all the hits for this structure are clustered,
and the Cartesian coordinates of the centers of the newly formed clusters are calculated. Second, the cluster centers
from all structures are compared. A Wilcoxon distance
between all the pairs of cluster centers is calculated using
the paired Wilcoxon rank sum test based on the 50 atoms
closest to those centers in respective structures.
Then the cluster centers are clustered based on a given
Wilcoxon distance cut-off to form super-clusters. These
super-clusters represent the independent sites
identified by FEATURE coupled to MD simulations as
putative calcium binding sites. Additionally, super-clusters
can be matched to known Ca2+ binding sites by comparing
them with the location of the bound Ca2+ ions in the HOLO structures.
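The two-stage scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (threshold_clusters, centroid, super_clusters), the single-linkage grouping rule and the toy coordinates are all hypothetical. The distance is passed in as a function; the Euclidean distance used in the usage lines is only a stand-in for the Wilcoxon z-value computed from the 50 atoms nearest each hit, which the abstract does not describe in enough detail to reproduce.

```python
from itertools import combinations


def threshold_clusters(items, distance, cutoff):
    """Group items so that any pair within `cutoff` ends up in the same cluster."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if distance(items[i], items[j]) <= cutoff:
            parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


def centroid(points):
    """Mean Cartesian coordinates of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def super_clusters(hits_per_structure, distance, cutoff):
    """Stage 1: cluster hits within each structure. Stage 2: cluster the centers."""
    centers = []
    for hits in hits_per_structure:
        for cluster in threshold_clusters(hits, distance, cutoff):
            centers.append(centroid(cluster))
    return threshold_clusters(centers, distance, cutoff)


def euclidean(a, b):
    # Stand-in distance for the demo only (assumption, not the Wilcoxon z-value).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Two toy "structures" with hits near two distinct locations.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (20.0, 0.0, 0.0)],
    [(0.5, 0.5, 0.0), (21.0, 0.0, 0.0)],
]
sites = super_clusters(frames, euclidean, cutoff=3.0)
print(len(sites))
```

Passing the distance as a parameter keeps the grouping logic separate from the choice of similarity measure, so an environment-based score could be substituted without changing the two-stage structure.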
In our dataset, there were 12 Ca2+ binding sites in the
HOLO structures and 11 equivalent Ca2+ binding sites in
the APO structures; one site in a single APO structure is
destroyed by mutations. By itself, FEATURE identified 7
sites in the HOLO and 3 sites in the APO structures. When
coupled with structural ensembles, FEATURE identified 10
sites in the HOLO and 6 sites in the APO structures. As
such, we observed a 60% improvement in sensitivity when
ORAL PRESENTATION ABSTRACTS
64: 4D STRUCTURE-BASED FUNCTION
PREDICTION
Dariya S. Glazer (Genetics Department, Stanford
University, USA), Randall J. Radmer (SIMBIOS National
Center, Stanford University, USA) and Russ B. Altman
(Departments of Bioengineering and Genetics, Stanford
University, USA).
Structural dynamics of molecules play an important role
in function execution, and as such should be considered
by structure-based function prediction methods. We
demonstrate the value of coupling molecular dynamics to
function prediction methods, and propose a solution to
the challenge of comparing 3D environments in
equivalent structures.
There are numerous computational methods which may
assist experimental efforts in predicting molecular function.
These methods rely on sequence and/or structural similarity
which can exist in molecules that perform similar functions.
Structure-based methods depend on correctness of 3D
structural models generated by X-ray crystallography or
Nuclear Magnetic Resonance (NMR) spectroscopy.
Unfortunately, the validity of many such structures suffers
from inherent limitations of the methods used to generate
them: crystal packing conditions, experimental
modifications, averaging of coordinates, solvent
composition. With the increasing number of structures being
solved by the Structural Genomics initiatives, which do not
bear similarity to already known folds, it is imperative that
function prediction methods overcome limitations imposed
on them by imperfect static structures.
Typically, in order to assign putative function, function
prediction methods scan a single 3D structure of a molecule.
However, molecules are not static entities, and the
intramolecular dynamics are very important for molecular
function. Therefore, coupling function prediction methods to
molecular dynamics may improve their performance.
Several methods exist that explore conformational space of
molecules. These methods generate ensembles of structures
that allow glimpses at the dynamic motions of molecules.
When coupled to structure diversity generating methods,
function prediction algorithms would examine many
structures for each molecule, and thus have many
opportunities to assign function correctly.
molecular dynamics were considered by the function
prediction method.
In order to validate these results, we explored another Ca2+
binding site prediction method based on valence by Nayal et
al. [3]. In the same dataset the valence method identified 1
site in the HOLO and 0 sites in the APO structures. When
coupled to the same structural ensembles as FEATURE
examined, the valence method identified 10 sites in the
HOLO and 1 site in the APO structures. With this method,
we observed a 1000% improvement in sensitivity when it
took into account dynamics of the molecules.
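The reported gains are consistent with pooling the HOLO and APO site counts for each method. The sketch below assumes that pooling, since the abstract does not state how the percentages were computed; relative_improvement is a hypothetical helper name, and the counts are the ones quoted in the text.

```python
def relative_improvement(before, after):
    """Percentage increase in the number of sites found."""
    return 100.0 * (after - before) / before


# FEATURE: 7 HOLO + 3 APO sites on static structures, 10 + 6 with MD ensembles.
feature_gain = relative_improvement(7 + 3, 10 + 6)

# Valence method: 1 HOLO + 0 APO sites on static structures, 10 + 1 with ensembles.
valence_gain = relative_improvement(1 + 0, 10 + 1)

print(feature_gain, valence_gain)
```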
In our work, we have demonstrated that performance of
structure-based function prediction methods can be
improved by considering the dynamic nature of molecules.
Additionally, we proposed a solution to the challenge of
identifying equivalent 3D environments in spatially
distributed structures that are otherwise identical.
REFERENCES:
1. Halperin, I., Glazer, D.S., Wu, S., and Altman, R.B., The
Feature Framework for Protein Function Annotation:
Modelling New Functions, Improving Performance, and
Extending to Novel Applications. BMC Genomics, 2008 (In
print).
2. Lindahl, E., Hess, B., and Spoel, D.v.d., Gromacs 3.0: A
Package for Molecular Simulation and Trajectory Analysis.
J Mol Modeling, 2001. 7: p. 306.
3. Nayal, M. and Cera, E.D., Predicting Ca2+-Binding Sites
in Proteins. Proc Natl Acad Sci USA, 1994. 91: p. 817.
Dramatic improvements in high throughput sequencing
technologies have led to a substantial increase in whole-genome sequencing projects. The rapid growth in sequenced
genomes is leading to radical changes in our understanding
of genomics and provides unparalleled opportunities for
research.
However, while genome-sequencing projects are generating
almost unimaginable numbers of protein sequences, these
sequences are not annotated with functional information.
The spectacular increase in unannotated sequences is
widening the gap between sequenced genes and known
protein functions. Experimental procedures for
characterising protein function are expensive, time
consuming and difficult to automate, so researchers are
turning increasingly to computational annotation to close the
gap. Providing functional annotations for the torrent of new
sequence information is one of the greatest challenges facing
computational biology today and it is clear that function
prediction is becoming an increasingly important field.
Function assignment is far from simple. Although functional
annotations can be transferred by homology, a common
evolutionary origin does not guarantee identical function
and the more distant the evolutionary relationship, the less
reliable the transfer will be. Although protein 3D structure
can be of use in predicting function, predicting function for
proteins with known structure still presents researchers with
problems. While structure may be conserved within a
superfamily of proteins, it is not always true that function is
conserved to the same extent.
Function prediction was included in CASP6 for the first
time with the aim of discovering whether computational
methods could use 3D structure to add useful molecular or
biological information to the target proteins. However,
CASP is an experiment that evaluates the state of structure
prediction and is based on structures that can be hidden from
the predictors, thus making predictions blind.
The same cannot be done with the function prediction
category. The assessment of function was hampered by the
lack of new functional information. In fact, with the
exception of bound ligands, the assessors had no more
functional information at the end of the experiment than was
available to the predictors during the experiment.
One other somewhat surprising development was the low
number of predicting groups that entered the function
prediction experiment. The prediction of function is an
important and growing field, as evinced by the numbers of
GO-based prediction servers that are already working or in
development, so it was unfortunate that so few groups were
prepared to participate in the experiment. It is almost
certainly true that the slow release of functional information
that hampered the assessment was also the cause of this low
turnout.
There are a number of difficulties in running a function
prediction assessment in CASP, and the CASP assessment
format and the slow release of functional information are not
ideal for a rapidly developing field where predictors need to
make use of the results and the evaluation in order to refine
their methods. The main problem for an experiment like
68: AN AUTOMATIC SERVER FOR FUNCTION
PREDICTION EVALUATION
Michael Tress (Spanish National Cancer Research Centre,
Spain), Alfonso Valencia (Protein Design Group, Centro
Nacional de Biotecnologia, Madrid, Spain), Michael
Sternberg (Imperial College London, UK) and Mark Wass
(Imperial College London, UK)
Whole-genome sequencing projects are generating
unannotated sequences in increasing numbers. There is a
great deal of interest in predicting function for these
proteins and many groups are developing methods to
predict GO functional terms. Here we present a server
that will perform a continuous assessment of structure-based function prediction methods.
in proteins [3], they are very clade specific and unique
enough to build accurate evolutionary trees [4]. A larger
fraction of proteins in eukaryotes than prokaryotes are
multi-domain; the trend formally known as ‘domain
accretion’. Domain accretion is believed to reflect the
increasing complexity brought about by domain multiplicity
and the formation of novel domain combinations [5].
Most studies of multi-domain proteins to date have been
confined to the structural basis of their formation with
limited analysis of the relationship to function e.g. [3, 6, 7].
The function approach was taken to some extent by George
and co-workers, with the integration of catalytic activity (a
small fraction of the function space) to protein domains [8].
However, their work was focused on single domain
functions and ignores the interaction between domains, an
important consideration in a predominantly multi-domain
repertoire. Here, we have developed a novel domain-function map, encompassing single and multi-domain
combinations, providing the first comprehensive
examination of the structure-function relationship from a
multi-domain perspective. The study has allowed a
numerical approach for the systematic analysis of the
structural basis for functional change, which has been made
possible by the recent development of a graph based
function ontology database, the Gene Ontology resource [9].
The domain-function map integrates a domain-to-sequence
and a function-to-sequence map, using co-occurrence scores
to associate domain combinations with functions. SWISS-PROT sequences [10] are assigned and represented as a
combination of SCOP domains [2] using homology
detection procedures [11,12,13,14], and functionally
annotated using the GOA [15] database. The sequences are
clustered using CD-HIT [16] such that no two sequences
have greater than 40% sequence identity, to prevent over
representation of domain combinations. The domain
combinations are associated with functional terms using co-occurrence ratios to normalise for convergent (functions
may be performed by more than one domain combination)
and divergent (domain combinations may perform multiple
functions) evolution. The SCOP superfamily database [2] is
used for domain representation, while the Gene Ontology
resource [9] provides functional description.
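One way to picture such a co-occurrence score is sketched below. The abstract does not give the exact formula, so this is an assumed form: for each (domain combination, function) pair it reports the fraction of sequences with that combination that carry the function, and the fraction of sequences with that function that have the combination. The function name cooccurrence_ratios and the placeholder labels (dom_A, term_1, and so on) are hypothetical and are not real SCOP or GO identifiers.

```python
from collections import Counter


def cooccurrence_ratios(annotated):
    """annotated: iterable of (domain_combination, set_of_function_terms).

    Returns {(combination, term): (share_of_combination, share_of_term)}.
    """
    combo_n, term_n, pair_n = Counter(), Counter(), Counter()
    for combo, terms in annotated:
        combo_n[combo] += 1
        for term in terms:
            term_n[term] += 1
            pair_n[(combo, term)] += 1
    return {
        (combo, term): (n / combo_n[combo], n / term_n[term])
        for (combo, term), n in pair_n.items()
    }


# Toy input: three non-redundant sequences with placeholder labels.
sequences = [
    (("dom_A",), {"term_1", "term_2"}),
    (("dom_A",), {"term_1"}),
    (("dom_A", "dom_B"), {"term_2"}),
]
ratios = cooccurrence_ratios(sequences)
print(ratios[(("dom_A", "dom_B"), "term_2")])
```

Reporting both fractions separately makes the two effects named in the text visible: a low first value reflects a combination spread over several functions, and a low second value reflects a function shared by several combinations.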
Functional diversity follows a Pareto distribution with most
domain combinations encoding a few functions, while a few
are very functionally diverse. The domains central in the
domain-function network are also central in the protein
interaction network [17] and in taxonomic distribution [18].
The most functionally promiscuous domains in the
repertoire include the P-loop NTP hydrolase and Rossmann
domains. Our results show that functional diversity
decreases with increasing number of domains within a
combination (multi-domain combinations perform a more
limited set of functions compared to single domains), while
the architectural specificity of functions (the number of
domain combinations that perform a particular function)
increases when encoded by combinations with increasing
number of domains (multi-domain combinations perform
more architecturally specific functions, with less
evolutionary convergence, than single domains).
CASP is the fact that it may take several years for functional
annotations to be known.
After CASP 6 and 7 the need to organize a more effective
blind function prediction category is obvious. The prediction
of function is important and it is crucial that it is properly
assessed.
We are developing a server to assess the prediction of
function in a continuous fashion. The server will be similar
in concept to the EVA/LiveBench structure prediction
evaluation servers in that the assessment will be automatic
and built on updates from the PDB. Servers will have to
predict GO terms for each of the sequences, but since the
function will not immediately be known the assessment will
take place some time after the release and will be revisited
periodically.
The predictions will be assessed with a range of methods,
since there is no single definitive method to assess GO term
prediction. The server will assess the prediction of GO
Molecular Function, Cellular Component and Biological
Process terms where possible, and targets will be
handicapped by prediction difficulty.
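One of the possible assessment methods can be sketched as follows. This is an illustrative assumption, not the server's actual scoring: predicted and annotated GO terms are first expanded to include their ancestors, and precision and recall are then computed on the expanded sets. The toy graph and term names are hypothetical.

```python
def with_ancestors(terms, parents):
    """Return the given terms plus all of their ancestors in the GO graph."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen

def precision_recall(predicted, annotated, parents):
    """Precision and recall over ancestor-expanded GO term sets."""
    p = with_ancestors(predicted, parents)
    a = with_ancestors(annotated, parents)
    hits = len(p & a)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(a) if a else 0.0
    return precision, recall

# Hypothetical toy graph: child -> list of parents.
toy_parents = {
    "kinase": ["transferase"],
    "transferase": ["catalytic"],
    "hydrolase": ["catalytic"],
    "catalytic": ["root"],
}
# A prediction that is more specific than the known annotation.
precision, recall = precision_recall({"kinase"}, {"transferase"}, toy_parents)
```

Here an over-specific prediction recovers the whole annotation (recall 1.0) but is penalised for the extra, unconfirmed term (precision 0.75).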
16: THE EVOLUTION OF PROTEIN FUNCTION
DRIVEN BY A MULTI-DOMAIN REPERTOIRE
Syed Ali & Michael Sternberg (Imperial College London, UK)
We present a novel map of protein domain (SCOP)
combinations to functions (GO) using co-occurrence scores, to allow a pan-genomic
analysis of functional evolution. Using simple metrics to
define change in domain organisation and function we
analysed functional transfer via domains, showing a
clear correlation between domain combination and
function.
During the course of evolution, forms of life with increasing
complexity have arisen, the driving force behind which has
been the expansion of the protein repertoire giving rise to
proteins with novel functions. Proteins in the repertoire have
been formed by the genetic mechanisms of gene divergence,
duplication and recombination [1]. These mechanisms are
paralleled at the proteome level with the processes of
domain divergence, duplication and recombination, where a
domain is an evolutionary unit that folds independently in
protein structure space [2]. The earliest evolution of the
repertoire began with the ab initio formation of protein
domains giving rise to single-domain proteins. However,
later in the evolutionary process as domain recombination
became a major force, multi-domain proteins became more
prominent in protein space. Although only a tiny fraction
(<0.5%) of all possible domain combinations are observed
Furthermore, based on taxonomic diversity and the protein
interaction network, we confer domain age and propose a
simplistic simulation for the evolution of the domain
repertoire. The simulation is used to measure the functional
effects of domain divergence, duplication and recombination
(defined on the basis of domain arrangements within
proteins). Our analysis suggests that early in the
evolutionary process domain divergence is the leading
mechanism for functional diversity, while domain
recombination becomes the major force as the repertoire
expands. This suggests that the current stage of evolution is
focused on domain multiplicity where domains are reused,
possibly a more evolutionary ‘cost-effective’ approach for
expanding the function space, and provides an explanation
for the multi-domain protein repertoire we see today.
Consequently, scientists have begun to discover genomes in
which most proteins are the product of extensive
recombination [6, 19].
With this focus on domain multiplicity it is important to
understand the functional consequence of domain
interactions within proteins. Using a modified Hamming
distance to calculate change in domain combination (that
calculates the number of domains added or removed to
change from one combination to another), and the directed
acyclic graph provided by the Gene Ontology (GO) resource
to measure functional change (as the shortest distance
between two GO terms via a common ancestor), we show
that change in domain composition causes a correlated
change in function (see Figure), providing evidence for an
evolutionary unit of function within the structural domain.
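The two metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions: domain combinations are treated as multisets (domain order is ignored), the GO graph is given as a child-to-parents mapping, and all names and toy terms are hypothetical rather than taken from the authors' implementation.

```python
from collections import Counter, deque

def domain_change(comb_a, comb_b):
    """Number of domains added or removed to turn comb_a into comb_b."""
    a, b = Counter(comb_a), Counter(comb_b)
    return sum((a - b).values()) + sum((b - a).values())

def ancestor_distances(term, parents):
    """Shortest upward distance from a term to each of its ancestors."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist

def go_distance(term_a, term_b, parents):
    """Shortest path between two GO terms via a common ancestor."""
    da = ancestor_distances(term_a, parents)
    db = ancestor_distances(term_b, parents)
    common = da.keys() & db.keys()
    if not common:
        return None  # e.g. terms from different ontologies
    return min(da[t] + db[t] for t in common)

# Hypothetical toy graph: child -> list of parents.
toy_parents = {
    "kinase": ["transferase"],
    "transferase": ["catalytic"],
    "hydrolase": ["catalytic"],
    "catalytic": ["root"],
}
d_domains = domain_change(("A", "B"), ("A", "C", "C"))    # remove B, add C twice
d_function = go_distance("kinase", "hydrolase", toy_parents)
```

Pairing `domain_change` with `go_distance` for many pairs of annotated combinations gives the two variables whose correlation is reported in the text.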
The relation between domain combinations highlights the
difficulty of inferring function for multi-domain proteins
from knowledge of any single domain; it is important to
decipher the set of domains within a sequence associated
with a specific function. As such, our domain combination-to-function map can provide a valuable tool for improved
function prediction.
REFERENCES:
1. Chothia C, Gough J, Vogel C, Teichmann SA. Science
1998, 300: 1701-1703.
2. Murzin AG, Brenner SE, Hubbard T, Chothia C. J. of
Molecular Biology 1995, 247: 536-540.
3. Apic G, Huber W, Teichmann SA. J. of Structure &
Functional Genomics 2003, 4: 67-78.
4. Yang S, Doolittle RF, Bourne PE. PNAS 2004, 102(2):
362-378.
5. Koonin E, Aravind L, Kondrashov A. Cell 2000, 101(6):
573-576.
6. Teichmann SA, Park J, Chothia C. PNAS 1998, 95:
14658-14663.
7. Apic G, Gough J, Teichmann SA. J. of Molecular Biology
2001, 310: 311-325
8. George RA, Spriggs RV, Thornton JM, Al-Lazikani B,
Swindells MB. ISMB Bioinformatics 2004, 20 Suppl 1:
I130-I136.
9. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M,
Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C et. al
Nuc. Acids Research 2004, 32: D258-261.
10. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker
WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez
R et. al Nuc. Acids Research 2006, 34: D187-189.
11. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang
Z, Miller W, Lipman DJ. Nuc. Acids Research 1997, 25:
3389-3402.
12. Schäffer AA, Wolf YI, Ponting CP, Koonin EV,
Aravind L, Altschul SF. Bioinformatics 1999, 15(12): 1000-1011.
13. Eddy SE. Bioinformatics 1998, 14: 755-763.
14. Bennet-Lovsey RM, Herbert AD, Sternberg MJ, Kelley
LA. Proteins 2008, 70(3): 611-625.
15. Camon E, Magrane M, Barrell D, Lee V, Dimmer E,
Maslen J, Binns D, Harte N, Lopez R, Apweiler R. Nuc.
Acids Research 2004, 32(1): D262-266.
16. Li W, Godzik A. Bioinformatics 2006, 22: 1658-1659.
17. Park J, Lappe M, Teichmann SA. J. of Molecular
Biology 2001, 307: 929-938.
18. Park J, Bolser D. Genome Informatics 2001, 12: 135-140.
19. Gough J, Karplus K, Hughey R, Chothia C. J. of
Molecular Biology 2001, 313(4): 903-919.
66: PREDICTING DNA-BINDING AFFINITY OF
MODULARLY DESIGNED ZINC FINGER
PROTEINS
Peter Zaback1, Jeffry D. Sander1, J. Keith Joung2, Daniel F. Voytas2 & Drena Dobbs1
(1Iowa State University, USA,
2Center for Cancer Research, and Center for Computational
and Integrative Biology, Massachusetts General Hospital;
Department of Pathology, Harvard Medical School, USA,
3Department of Genetics, Cell Biology & Development and
Beckman Center for Genome Engineering, University of
Minnesota, Minneapolis, USA)
Consisting of modular nucleic acid binding domains,
C2H2 zinc finger proteins provide an excellent
framework for engineering “customized” sequence-specific DNA binding proteins. We present new methods
that accurately predict both in vivo and in vitro
efficacies of zinc finger proteins engineered by modular
design.
Replication and regulated expression of information
encoded in genomes require proteins that bind to DNA with
high sequence specificity. Researchers have long sought to
fully understand the molecular mechanisms underlying this
specificity, with the goal of developing powerful new tools
for both research and gene therapy. Zinc finger proteins
(ZFPs) bind specific DNA motifs using highly similar
helical “finger” domains that recognize adjacent DNA
triplets [1].
In the “modular assembly” approach to engineering novel
zinc finger proteins, individual modules are assembled into a
three-finger array expected to target a specific 9 bp target
sequence. In practice, however, ZFPs engineered using this
approach display a wide range of binding specificities and
affinities and function with highly variable success rates [2].
Due to this erratic behavior, it was previously impossible to
predict whether or not a particular ZFP would function in
vivo.
Here, we demonstrate that it is possible to predict which
combinations of zinc finger modules are most likely to
successfully target specific sites in genomic DNA, based on
existing in vitro binding data for individual modules. Using
previously characterized GNN-specific modules[3] in the
standardized framework provided by the Zinc Finger
Consortium[4] (http://zincfingers.org), we designed and
assembled 27 different three-finger arrays and assessed their
binding to cognate target sites in vivo using a quantitative
bacterial two-hybrid assay. For 7 of the assembled ZFP
arrays, we also directly measured binding affinities in vitro
using fluorescence anisotropy. Our predicted DNA binding
affinities were highly correlated with binding constants
measured in vitro (r = 0.91) and in vivo (r = 0.80). Similar
accuracy was achieved on an independently generated and
tested set of 23 zinc finger proteins[2]. By providing the first
validated system for ranking genomic target sites, this work
should lead to significantly enhanced success rates for
modularly designed zinc finger proteins. An updated server
that facilitates ZFP design is available at:
http://bindr.gdcb.iastate.edu/ZiFiT/ [5].
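The agreement between predicted and measured affinities is reported above as a correlation coefficient r. As a reminder of what that statistic measures, here is a minimal sketch of the Pearson correlation, assuming that is the coefficient meant. The numbers are hypothetical placeholders and are not data from this study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical placeholder values, chosen only to exercise the function.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(predicted, measured)
```

A value of r near 1 means the predicted ranking of arrays tracks the measured one closely, which is what matters when choosing among candidate target sites.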
REFERENCES:
[1] Miller, J., McLachlan, A. D. & Klug, A. (1985)
Repetitive zinc-binding domains in the protein transcription
factor IIIA from Xenopus oocytes. EMBO J, 4, 1609-1614.
[2] Ramirez, C.L., Foley, J.E., Wright, D.A., Muller-Lerch,
F., Rahman, S.H., Cornu, T.I., Winfrey, R.J., Sander, J.D.,
Fu, F., Townsend, J.A. et al. (2008) Unexpected failure rates
for modular assembly of engineered zinc fingers. Nat
Methods, 5, 374-375.
[3] Segal, D.J., Dreier, B., Beerli, R.R. and Barbas, C.F.,
3rd. (1999) Toward controlling gene expression at will:
selection and design of zinc finger domains recognizing
each of the 5'-GNN-3' DNA target sequences. Proc Natl
Acad Sci U S A, 96, 2758-2763.
[4] Wright, D.A., Thibodeau-Beganny, S., Sander, J.D.,
Winfrey, R.J., Hirsh, A.S., Eichtinger, M., Fu, F., Porteus,
M.H., Dobbs, D., Voytas, D.F. et al. (2006) Standardized
reagents and protocols for engineering zinc finger nucleases
by modular assembly. Nat Protoc, 1, 1637-1652.
[5] Sander, J.D., Zaback, P., Joung, J.K., Voytas, D.F. and
Dobbs, D. (2007) Zinc Finger Targeter (ZiFiT): an
engineered zinc finger/target site design tool. Nucleic Acids
Res, 35, W599-605.
29: MINOR GROOVE ELECTROSTATICS
PROVIDES A MOLECULAR ORIGIN FOR
PROTEIN-DNA SPECIFICITY
Remo Rohs (HHMI & Columbia University, USA),
Sean West (Columbia University, USA),
Peng Liu (Columbia University, USA), and Barry Honig (HHMI & Columbia University,
USA).
Hox proteins confer specificity by reading the structure
and electrostatic potentials of the minor groove. Local
shape recognition is distinct from known readout
mechanisms and is used by proteins binding AT-rich
DNA. Base sequence induces structures that enhance
negative electrostatic potentials and attract basic side
chains into the minor groove.
The molecular basis for protein-DNA recognition and its
specificity is still largely unknown. Complexes of proteins
from various families bound to DNA have been solved by
means of X-ray crystallography and NMR spectroscopy.
However, the molecular mechanisms through which proteins
specifically recognize their DNA binding sites are only
partially understood. Direct readout through specific
contacts between amino acids and bases dominates
recognition within the DNA major groove. Different base
pairs account for specific patterns of hydrogen bond donors
and acceptors in the major groove with thymine offering, in
addition, a methyl group for hydrophobic contacts. Direct
readout in the minor groove is limited because there is no
differentiation in terms of the location of hydrogen bond
donors or acceptors between A-T and T-A or between G-C
and C-G base pairs.
Indirect readout accounts for the recognition of the overall
shape of a DNA binding site by proteins. Overall shape is a
function of base sequence and comprises global deformation
effects such as DNA bending. It has been shown for the
papillomavirus E2 protein, for example, that its binding
affinity is affected by base pairs which are not contacted by
the protein but which facilitate bending that enables protein
contacts with base pairs in other regions of the binding site
[1, 2].
In a recent study of the Hox family of transcription factors,
we have identified a third mode of protein-DNA recognition
that involves recognition of minor groove shape [3, 4]. Hox
proteins bind DNA by making nearly identical major groove
contacts via the recognition helices of their homeodomains.
In vivo specificity, however, depends on extended and
unstructured regions that link Hox homeodomains to a DNA
and structure with electrostatic potential in the DNA minor
groove as a result of shape-induced electrostatic focusing.
Local shape recognition also explains the avoidance of TpA
base pair steps in some transcription factor binding sites, an
observation that we validated for the tumor suppressor
protein p53. Our observation of the causal relationship
between minor groove structure and enhanced negative
electrostatic potentials reveals the biological function of A-tract motifs. In addition, our results suggest recognition of
local DNA shape as a novel readout mechanism crucial for
proteins that bind DNA with narrow minor groove regions.
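The A-tract definition used in this abstract (three or more consecutive ApT or ApA (TpT) base pair steps, with TpA steps breaking the tract) can be turned into a simple sequence scan. The sketch below is an illustration only: the function name and the example sequences are hypothetical, and it says nothing about groove width or electrostatics, which require the structure-based calculations described in the abstract.

```python
# Dinucleotide steps that extend an A-tract; TpA is deliberately absent.
A_TRACT_STEPS = {"AA", "AT", "TT"}

def find_a_tracts(seq, min_steps=3):
    """Return (start, end) slices of runs with >= min_steps A-tract steps."""
    seq = seq.upper()
    tracts = []
    start = None
    for i in range(len(seq) - 1):
        if seq[i:i + 2] in A_TRACT_STEPS:
            if start is None:
                start = i
        else:
            # The run of steps start..i-1 has just ended.
            if start is not None and i - start >= min_steps:
                tracts.append((start, i + 1))
            start = None
    # Handle a run that reaches the end of the sequence.
    if start is not None and (len(seq) - 1) - start >= min_steps:
        tracts.append((start, len(seq)))
    return tracts

# Hypothetical sequences, not the fkh250 or fkh250con* sites.
with_tract = find_a_tracts("GCAAATTGC")
broken_by_tpa = find_a_tracts("AATAAT")
```

In the first sequence the scan finds one tract covering AAATT. In the second, the central TpA step splits the AT-rich region into two runs that are each too short to count.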
[1] R. Rohs, H. Sklenar, and Z. Shakked, Structure 13,
1499-509 (2005).
[2] Commentary on [1]: T. Siggers, T. Silkov, and B. Honig,
Structure 13, 1400-1 (2005).
[3] R. Joshi, J. M. Passner, R. Rohs, R. Jain, A. Sosinsky,
M. A. Crickmore, V. Jacob, A. K. Aggarwal, B. Honig, and
R. S. Mann, Cell 131, 530-43 (2007).
[4] Commentary on [3]: S. C. Harrison, Nat. Struct. Mol.
Biol. 14, 1118-9 (2007).
[5] B. Honig and A. Nicholls, Science 268, 1144-9 (1995).
[6] H. Sklenar, D. Wustner, and R. Rohs, J. Comput. Chem.
27, 309-15 (2006).
60: COMBINING PREDICTIONS OF PROTEIN
STRUCTURE AND PROTEIN-RNA INTERACTIONS
TO MODEL HUMAN TELOMERASE STRUCTURE
bound cofactor, Extradenticle (Exd). Crystal structures were
determined for one of the eight Drosophila Hox proteins, Sex
combs reduced (Scr), bound to its specific DNA sequence
(fkh250) and a consensus Hox-Exd site (fkh250con*).
Comparison of the structures of these two Hox-Exd-DNA
ternary complexes demonstrates that the overall arrangement
of the proteins is similar but additional Scr residues are
ordered in the fkh250 complex. The intrusion of these
residues into the minor groove is shown in Figures A and B
with the accessibility surface of the DNA binding sites
color-coded for shape. Specifically, an Arg and His residue
insert into a narrow region of the fkh250 minor groove
whereas they are disordered when presented with the
fkh250con* sequence. Arg5 also inserts into the minor
groove in a region where the groove is narrow in both
sequences (blue plots in Figures C and D).
The electrostatic potential is affected by the shape and
charge distribution of macromolecules [5]. For both the
fkh250 and fkh250con* sequences, there is a near-perfect
correlation between minor groove width and the magnitude
of the negative electrostatic potential (red plots in Figures C
and D). This data reveals a relationship between groove
geometry and the insertion of basic amino acids into the
minor groove. This finding is particularly important as both
minor groove contacts only seen in the fkh250 complex
(Arg3 and His-12) were shown to be critical for specific in vitro and in vivo Scr properties.
The recognition of local shape by a single protein implies
that the DNA conformation being recognized is an intrinsic
property of the base sequence, and thus, already prevalent in
unbound DNA rather than induced by protein binding. That
is, since the fkh250 and fkh250con* complexes only differ
in their DNA sequence, the distinct minor groove shape in
each must be a property of the base sequence. All-atom
Monte Carlo simulations [6] of the free DNA binding sites
predict a similar sequence-dependence of minor groove
shape as seen in the crystal structures [3]. These simulations
predict a single minor groove width minimum in
fkh250con* and two minima in fkh250 (green plots in
Figures C and D) as a result of different locations of the
TpA base pair step in both sequences. Our results on Hox-DNA recognition indicate that the intrinsically narrow minor
groove of fkh250 induces an enhanced negative electrostatic
potential, which in turn attracts the positively charged
Arg/His pair.
Our current studies focus on whether the local shape
recognition that we found for Hox proteins is of a more
general nature. Electrostatics calculations along with MC
structure predictions of DNA binding sites indicate that
homeodomain proteins are an example of a family that
employs this readout mechanism. Homeodomains bind to A-tracts, which are rigid AT-rich DNA regions of three or
more consecutive ApT or ApA (TpT) base pair steps.
Narrow minor grooves are a common structural feature of
A-tracts. TpA steps break A-tract structure since they act as
flexible hinges due to unfavorable stacking interactions. Our
studies on Hox proteins have proven that the location of a
TpA step is key for the intrinsic structure of a binding site.
Strikingly, our data shows a correlation of A-tract sequence
Ben A. Lewis1, Mateusz Kurcinski2, Deepak Reyon1, Jae-Hyung Lee1, Vasant Honavar1, Robert L Jernigan1, Andrzej
Kolinski2, Andrzej Kloczkowski2 & Drena Dobbs1 (1Iowa
State University, USA, 2University of Warsaw, Poland)
Telomerase is a ribonucleoprotein enzyme pivotal in
cellular senescence and aging. Despite its importance,
high resolution structures of the enzyme with or without
its RNA component have proved difficult to obtain. This
study uses machine learning predictions of RNA binding
sites, along with template-based and de novo protein
structure prediction, to develop a tentative model for the
holoenzyme.
Telomerase is a ribonucleoprotein enzyme that adds
telomeric DNA repeat sequences to the ends of linear
chromosomes. The enzyme is pivotal in cellular senescence
and aging, and because it is overexpressed in ~90% of
human cancers, it is also a potential therapeutic target.
Despite its importance, a high-resolution structure of the
telomerase enzyme has been elusive, with high-resolution
structures of only two of its four protein domains having
been determined: those of the N-terminal domain (TEN) and
RNA binding domain (TRBD) from the telomerase reverse
transcriptase subunit (TERT) of Tetrahymena thermophila
(1,2). Structures of the reverse transcriptase (RT) and C-terminal (TEC) domains have not yet been reported.
Moreover, while secondary and tertiary structural elements
within the human telomerase RNA component (hTERC)
have been identified through NMR spectroscopy (3), co-crystallization of telomerase with its intrinsic RNA
component has not yet been accomplished.
We have used sequence-based machine learning classifiers
(Naive Bayes and SVM) to identify amino acid residues in
telomerase that are likely to make direct contact with either
DNA or RNA (4). More recently, we generated structural
models for the human and yeast TEN domains by homology
modeling and threading, using the experimentally determined
Tetrahymena TEN structure as a template (5) and, based on
comparative analyses, suggested that the RNA-binding
surfaces of the human and Tetrahymena enzymes are likely
conserved.
Building on these initial studies, here we present:
- structural models for all four telomerase protein domains,
generated using an ultrafast coarse-grained CABS approach
(6) for template-based modeling of each domain, followed
by accurate all-atom molecular dynamics simulations for
structural refinement
- a comparison of our models of TEN and TRBD with
experimentally-determined structures of the corresponding
Tetrahymena protein domains
- a preliminary model for the complete human TERT
complex (lacking the RNA subunit), generated using a rigid
docking procedure
- a refined model for the complete human TERT complex,
generated by performing CABS simulations covering all
TERT domains, followed by a model selection procedure
based on hierarchical clustering and all-atom refinement to
produce the final model
- preliminary results in which RNA-binding residue
predictions are used to position folded portions of the human
telomerase RNA component (hTERC) structure within the
modeled protein complex
Taken together, these results indicate that computational
approaches can be used to gain valuable insight into the
structure and function of ribonucleoprotein complexes for
which high-resolution structural information is incomplete.
(1) Jacobs et al., Nat. Struct. Mol. Biol. (2006), 13:218-225
(2) Rouda and Skordalakes, Structure (2007), 15:1403-1412
(3) Theimer et al., Mol. Cell. (2005), 17:671-682
(4) Terribilini et al., RNA (2006), 12:1450-1462
(5) Lee et al., Pac Symp. Biocomput. (2008), 13:501-512
(6) Kurcinski and Kolinski, J. Steroid Biochem. Mol. Biol.
(2007), 103:357-360
67: CHANNELING PROTEIN STRUCTURE
ANALYSIS TOWARDS UNDERSTANDING COUGH
DYNAMICS
Ilan Samish & William F. DeGrado (Department of
Chemistry and Department of Biochemistry and Biophysics,
University of Pennsylvania, USA)
The M2 influenza proton channel is a major drug
target of the flu virus as well as a model structure
for membrane protein channels. Following recent
X-ray and NMR structural elucidation, we utilized an array of structural
bioinformatics methods to understand, and suggest a
dynamic mechanism for, this slow-conducting channel.
Influenza virus infection is a major public health concern,
causing significant morbidity, mortality, and economic
losses worldwide. No less important, the M2 ion channel of the virus is
among the smallest bona fide channels with full properties
of ion selectivity and activation, thus providing a minimal
model for studying channels. Mechanistically, the influenza
virion is engulfed by a lung epithelial cell and
compartmentalized into an endosome. The low pH of the
endosome induces proton leakage into the virion via the M2
proton channel resulting in uncoating of the viral RNA [1].
Indeed M2 blockers, e.g. amantadine, were utilized as
influenza drugs until the emergence of new strains that are
generally resistant to this once commonly prescribed drug.
Following the recent elucidation of the protein structure via
X-ray crystallography [2] and via NMR [3] we aimed at
gaining insight into a possible mechanism; especially as the
proposed models exhibit marked differences [4] and as the
full dynamic mechanism is yet to be deciphered. We were
aided by the fact that each one of the four transmembrane
helices was crystallized in a different conformation within
the asymmetric tetramer, enabling us to construct four
symmetric models, each in a different conformation.
As the specific focus was the dynamic properties, special
emphasis was put on bioinformatic datamining of the
crystallographic snapshots towards dynamic insight.
Methods included distribution of normalized B-factors along
the transmembrane helices, analysis of hydrogen bonds
energetics and dynamics, analysis of local structural
deformation with an emphasis on backbone tilting, normal
mode analysis, distribution of pore radii and the local
flexibility of the pore lining atoms. A comparative analysis
was conducted across the different available structural models, as
well as a hybrid model constructed from the high-resolution crystal structure
and the more complete NMR model.
Further comparison was conducted to a structure that was
simulated via molecular dynamics for 20 nanoseconds with
explicit solvation.
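The abstract does not say how the B-factors were normalized. A common choice, assumed here purely for illustration, is a per-structure z-score, which makes flexibility profiles comparable between structures of different resolution. The function name and input values are hypothetical.

```python
import statistics

def normalize_b_factors(b_factors):
    """Z-score normalization: (B - mean) / standard deviation.

    The z-score is an assumed normalization scheme; the abstract does
    not specify which one the authors used.
    """
    mean = statistics.fmean(b_factors)
    sd = statistics.pstdev(b_factors)
    if sd == 0:
        # All residues equally mobile: no relative signal.
        return [0.0 for _ in b_factors]
    return [(b - mean) / sd for b in b_factors]

# Hypothetical per-residue B-factors along one helix.
z_scores = normalize_b_factors([10.0, 20.0, 30.0])
```

Residues with positive normalized values are more mobile than the average for that structure, so peaks along a helix mark candidate regions of elevated dynamics.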
Cumulatively, the analysis suggests a dynamic mechanism
for this slow channel that may act more like a transporter
than like a channel. Backbone regions of elevated dynamics
exhibit local deformations in the helical structures including
To capture information about enzymes that follow
“chemistry-constrained” evolution, our group developed the
Structure-Function Linkage Database (SFLD) [3]. The
SFLD holds detailed information on the reactions catalyzed
by members of six different mechanistically diverse enzyme
superfamilies: amidohydrolase, crotonase, enolase, haloacid
dehalogenase, terpene cyclase, and vicinal oxygen chelate.
In total, the SFLD covers 6499 sequences, 392 structures
and 165 different reactions. Each superfamily member
maintains the ability to catalyze a key mechanistic step that
is mediated by conserved active site residues and/or
cofactors, but different families in these superfamilies use
that common mechanistic step in different chemical
reactions and/or with different substrates.
Enzymes are typically classified using measures of
similarity relating to sequence, structure and overall
function. In the SFLD, enzymes are classified into
superfamilies according to sequence and structure
conservation and conservation of unique residues that
catalyze the superfamily’s common mechanistic step.
Conservation of additional residues is used to define
subgroups and families within each superfamily. Recently,
O’Boyle and colleagues developed a novel method that
measures similarity of enzymes based upon the explicit
mechanism of the catalyzed reaction [4]. This method opens
a new avenue for classification of enzymes.
Here, we have used measurements of reaction mechanism
similarity to classify enzymes of the mechanistically diverse
superfamilies in the SFLD. Each overall reaction is
described as a sequence of mechanistic steps (or partial
reactions). Each step is then represented as the set of bond
changes occurring in the transformation from substrate(s) to
product(s) in that step. Similarity between sets of bond
changes for each possible combination of steps among two
reactions is computed using Tanimoto coefficients and
stored in a similarity matrix. To obtain the total similarity
between sequences of steps (“step similarity” or
“mechanism similarity”), an alignment of the steps is
performed using the Needleman-Wunsch algorithm. To take
into account the maximum possible similarity that can be
calculated given the two reaction sequences under
comparison, a new Tanimoto coefficient is computed using
the number of steps in each reaction and the Needleman-Wunsch similarity as inputs. Additionally, similarity of
overall reactions is computed, also using Tanimoto
coefficients, by representing the set of bond changes
occurring in the transformation of the overall substrate(s) to
overall product(s) of the reaction catalyzed (“overall
similarity”). In our study, reversibility of enzyme reactions
is considered explicitly by inverting the bond changes in
each set of bonds and by inverting the order of the steps in
the reaction sequences.
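The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: bond changes are plain string labels, the gap penalty in the alignment is zero, only the alignment score (not the traceback) is computed, and reversibility is not handled. All function names and the toy reactions are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of bond changes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def align_score(steps_a, steps_b, gap=0.0):
    """Needleman-Wunsch global alignment score over mechanistic steps.

    Step-to-step similarity is the Tanimoto coefficient; the zero gap
    penalty is an assumption made for this sketch.
    """
    n, m = len(steps_a), len(steps_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + tanimoto(steps_a[i - 1], steps_b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    return score[n][m]

def mechanism_similarity(steps_a, steps_b):
    """Tanimoto-style normalization of the alignment score by step counts."""
    s = align_score(steps_a, steps_b)
    denom = len(steps_a) + len(steps_b) - s
    return s / denom if denom else 1.0

# Hypothetical two-step and three-step reactions sharing their first step.
rxn_a = [{"C-H broken", "O-H formed"}, {"C-O formed", "C-Cl broken"}]
rxn_b = [{"C-H broken", "O-H formed"}, {"C=C formed"}, {"C-H formed", "O-H broken"}]
step_sim = mechanism_similarity(rxn_a, rxn_b)
self_sim = mechanism_similarity(rxn_a, rxn_a)
```

With one identical step out of two and three, the alignment score is 1 and the normalized similarity is 1 / (2 + 3 - 1) = 0.25, while a reaction compared with itself scores 1.0.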
Our results quantitatively show that for mechanistically
diverse enzyme superfamilies, the overall reactions can vary
greatly, but the similarity among reaction steps is always
high. We use as an example chloromuconate cycloisomerase
and dipeptide epimerase, both members of the muconate
cycloisomerase subgroup of the enolase superfamily. The
former enzyme catalyzes the cycloisomerization of
dynamic bifurcated hydrogen bonds. Unlike other four-helix
bundle channels, this protein does not exhibit large
concerted backbone-mediated interhelical sliding motions
and does not exhibit a constitutive 'open' conformation. The
mechanism agrees with previous experiments and provides a
starting point for further mutational analysis and biophysical
characterization. Moreover, the newly derived local
structure-function-dynamics relationships provide important
insight for the continuing efforts to develop drugs against this
important disease.
REFERENCES
1. Pinto, L.H. and R.A. Lamb, The M2 proton channels of
influenza A and B viruses. J Biol Chem, 2006. 281(14): p.
8997-9000.
2. Stouffer, A.L., et al., Structural basis for the function and
inhibition of an influenza virus proton channel. Nature,
2008. 451(7178): p. 596-9.
3. Schnell, J.R. and J.J. Chou, Structure and mechanism of
the M2 proton channel of influenza A virus. Nature, 2008.
451(7178): p. 591-5.
4. Miller, C., Ion channels: coughing up flu's proton
channels. Nature, 2008. 451(7178): p. 532-3.
31: CLASSIFICATION OF MECHANISTICALLY
DIVERSE ENZYME SUPERFAMILIES ACCORDING
TO SIMILARITIES IN REACTION MECHANISM
Daniel E. Almonacid & Patricia C. Babbitt (UCSF, USA).
We classify enzymes from mechanistically diverse
superfamilies in the Structure-Function Linkage Database using a
novel algorithm that quantifies similarity in reaction
mechanisms. We conclude that traditional approaches of
classification of enzymes based on structure and function
similarity are effectively complemented by clustering
according to reaction mechanism.
During evolution, gene duplication and sequence divergence
generate functionally different but structurally related
proteins. Gerlt and Babbitt cite three possible strategies that
lead to divergence of function in homologous proteins [1]:
(i) substrate specificity-constrained evolution (substrate
specificity is conserved whilst chemistry changes); (ii)
chemistry-constrained evolution (chemistry is conserved
whilst the substrate specificity is changed); and (iii) active
site-constrained evolution (neither chemistry nor substrate
specificity is maintained, and the conserved active site
residues support different reactions). Structural and
functional analysis of the known protein universe suggests
that of the three strategies, chemistry-constrained evolution
is dominant [2].
Biochemistry, Duke University, USA) & Bruce Randall Donald
(Computer Science Department, Duke University, USA)
chlorinated muconates by forming a C-O bond to create a 5-membered ring, and eliminating HCl by cleaving a C-Cl
bond. The latter enzyme, instead, catalyzes the
epimerization of dipeptides, with the preferred substrate
often L-Ala-D/L-Glu. This overall transformation is attained
by the cleavage of a C-H bond, and its re-formation from the
opposite face of the double bond in the intermediate. The
overall reaction similarity for this pair of reactions is zero
according to our measure as no bond changes are shared
between the reactions. In terms of mechanistic steps,
however, the reactions are highly similar. Chloromuconate
cycloisomerase catalyzes a two-step reaction, and dipeptide
epimerase a three-step reaction. Despite the different
number of steps, both reactions share an identical step
consisting of the abstraction of the α-proton to a carboxylic
acid in the substrate resulting in a stabilized enolate anion
intermediate. Furthermore, the enol-to-keto tautomerization
that occurs in the step after the proton abstraction is also
shared by both enzymes. Compared to the traditional
approach of classifying enzymes according to overall
reaction similarity (such as that of the Enzyme
Commission), the method based on step similarity is better
able to capture these elements of functional conservation.
Our results also indicate that divergence of sequence and
active site residues does not necessarily imply divergence of
reaction mechanism. This is the case, for instance, of D-tartrate dehydratase, enolase and o-succinylbenzoate
synthase.
These three enzymes are highly divergent and belong to
different subgroups within the enolase superfamily, yet they
share identical sets of bond changes in each of their two
mechanistic steps. Conversely, we found that not all
members of the same subgroup within a superfamily use the
same mechanism to perform catalysis, as with the case of
chloromuconate cycloisomerase and dipeptide epimerase
discussed above. This implies that the relationship between
sequence/structure and function is yet more complicated
than previously envisaged. As chemistry-constrained
evolution is the major player of divergent evolution, we
expect our study to be useful for guiding functional
annotation of new homologues of known superfamilies. To
provide access to these results, work is underway to create a
knowledgebase to validate and predict overall
transformations and mechanisms of enzyme reactions and to
help guide engineering of enzyme functions by identifying
enzyme templates capable of catalyzing the key mechanistic
step of a transformation.
1. Gerlt, J.A. and Babbitt, P.C. Annu. Rev. Biochem., 2001,
70: 209-246.
2. Bartlett, G.J.; Borkakoti, N. and Thornton, J.M. J. Mol.
Biol., 2003, 331: 829-860.
3. Pegg, S.C.-H.; Brown, S.D.; Ojha, S.; Seffernick, J.;
Meng, E.C.; Morris, J.H.; Chang, P.J.; Huang, C.C.; Ferrin,
T.E. and Babbitt, P.C. Biochemistry, 2006, 45: 2545-2555.
4. O'Boyle, N.M.; Holliday, G.L.; Almonacid, D.E. and
Mitchell, J.B.O. J. Mol. Biol., 2007, 368: 1484-1499.
26: ALGORITHMS FOR PROTEIN DESIGN
Ivelin Georgiev (Duke University, Computer Science
Department, USA), Cheng-Yu Chen (Department of
We present a suite of provably-accurate algorithms for
computational protein design developed in our lab. We report
the application of our algorithms to switch the substrate
specificity of a non-ribosomal peptide synthetase (NRPS) enzyme.
Experimental tests on a set of the top in silico predictions
showed the desired improvement in substrate specificity,
confirming the feasibility of our approach.
Background and Motivation
Protein redesign aims at improving target protein properties,
such as increasing the stability of the protein, switching an
enzyme's specificity towards a non-cognate substrate, or
redesigning the protein so that it will perform a completely
novel function. Exhaustively testing protein mutations in
vitro is infeasible, due to the enormous size of the space of
possible mutations. Computational in silico approaches can
efficiently and accurately explore the combinatorial space of
candidate solutions, and have proven valuable for protein
redesign and protein engineering. Typically, structure-based
protein design approaches aim at identifying the single
global minimum energy conformation (GMEC) for an input
model consisting of a rigid protein backbone, rigid rotamers,
and a pairwise energy function. Here we present K*
(pronounced "K-Star") [1,2], a provably-accurate ensemble-based (as opposed to GMEC-based) algorithm for protein
design and protein-ligand binding prediction. We further
present MinDEE [2] and BD [3], provably accurate
enhancements to the traditional Dead-End Elimination
(DEE) algorithms that guarantee the identification of the
GMEC with, respectively, continuously flexible rotamers
and a flexible backbone. We describe additional techniques
and approaches that are combined with our K*, MinDEE,
and BD algorithms into a general suite for computational
protein design.
Approach
K*: K* [1,2] is a statistical mechanics-derived algorithm
that computes Boltzmann-weighted partition functions over
energy-minimized conformational ensembles and generates
a provably-accurate approximation to Kd, the binding
constant for a given protein-ligand complex. For a given
protein, a set of mutations, and a target substrate, K*
computes a Kd approximation score for each candidate
mutant with the target substrate (for computational
efficiency, MinDEE (see below) and sophisticated pruning
filters are applied during the mutation search). Mutants are
then ranked according to the computed scores; top-ranked
mutants are predicted to have the desired specificity.
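As a rough illustration of the ensemble-based scoring described above, the sketch below computes Boltzmann-weighted partition functions from lists of conformational energies and ranks mutants by a bound-to-unbound ratio. The ratio form q_PL / (q_P * q_L), the function names, the energy units (kcal/mol) and the example values are assumptions made for illustration; the pruning filters and the provable approximation guarantees of K* are not reproduced here.

```python
import math

# Assumed units: energies in kcal/mol, temperature 298.15 K.
RT = 0.0019872 * 298.15

def log_partition_function(energies, rt=RT):
    """Return ln(sum_c exp(-E_c / RT)) over an ensemble of conformations."""
    e_min = min(energies)
    # Factor out the lowest energy so the exponentials cannot overflow.
    scaled = sum(math.exp(-(e - e_min) / rt) for e in energies)
    return math.log(scaled) - e_min / rt

def log_kstar_score(complex_e, protein_e, ligand_e, rt=RT):
    """Return ln(q_PL / (q_P * q_L)); the ratio form is an assumption."""
    return (log_partition_function(complex_e, rt)
            - log_partition_function(protein_e, rt)
            - log_partition_function(ligand_e, rt))

def rank_mutants(ensembles, rt=RT):
    """ensembles maps a mutant name to (complex, protein, ligand) energy lists.

    Returns mutant names sorted from highest to lowest score.
    """
    scores = {name: log_kstar_score(*e, rt=rt) for name, e in ensembles.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Working with logarithms keeps the score finite even when the ensemble energies are large in magnitude; the ranking is unchanged because the logarithm is monotonic.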
MinDEE: The MinDEE theorems [2] extend their traditional
DEE analogs to achieve provable correctness even when
rotamers are not rigid and are allowed to flex and minimize
from their initial conformations (as given by the rotamer
library used). In our algorithm, rotameric energy
minimization is performed by allowing the rotamer chi
angles to flex within a predefined continuous voxel. The
main difference between MinDEE and traditional DEE is
that, by computing voxel-constrained ranges of energies
instead of rigid energies, MinDEE takes into account
possible energy changes during rotameric energy
minimization.
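For orientation, the sketch below implements the traditional rigid-rotamer DEE criterion that MinDEE extends: rotamer r at position i is pruned if its best-case energy is still worse than the worst-case energy of some competing rotamer t at the same position. This is not MinDEE itself; the voxel-constrained energy ranges and the resulting correction to the criterion are omitted, and the data layout (`self_e`, `pair_e`) is a hypothetical choice made for this example.

```python
def dee_prune(self_e, pair_e):
    """Single pass of the traditional (rigid) DEE criterion.

    self_e[i][r]        : self energy of rotamer r at position i
    pair_e[(i, j)][r][s]: pairwise energy of rotamer r at i and s at j, i < j
    Returns the set of (position, rotamer) pairs that can be pruned.
    """
    def pair(i, r, j, s):
        # Each position pair is stored once, keyed with the smaller index first.
        if i < j:
            return pair_e[(i, j)][r][s]
        return pair_e[(j, i)][s][r]

    n = len(self_e)
    pruned = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(self_e[i])):
            # Best case for r: most favourable partner at every other position.
            best_r = self_e[i][r] + sum(
                min(pair(i, r, j, s) for s in range(len(self_e[j])))
                for j in others)
            for t in range(len(self_e[i])):
                if t == r:
                    continue
                # Worst case for t: least favourable partner everywhere.
                worst_t = self_e[i][t] + sum(
                    max(pair(i, t, j, s) for s in range(len(self_e[j])))
                    for j in others)
                if best_r > worst_t:
                    pruned.add((i, r))
                    break
    return pruned
```

In practice the pass is usually repeated until no further rotamers are eliminated; a single pass is shown to keep the example short.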
BD: Unlike traditional DEE and MinDEE, BD is provably
accurate with backbone flexibility. BD places restraining
boxes around each residue in a protein, in order to define a
continuous family of backbone conformations with small
phi/psi changes that nonetheless can cause global shifts in
the backbone coordinates. Upper and lower bounds on the
pairwise rotameric energy interactions are then precomputed
within the defined restraining boxes and used to determine
which rotamers are provably not part of the respective
GMEC. An analogous approach, but for finite sets of
backbone conformations defined by backrub-type motions,
will be presented as part of the main conference program of
ISMB 2008 [4].
Both MinDEE and BD are fully compatible with K*, and
can thus be used as pre-processing filters to prune the
majority of the candidate mutations and conformations that
must be subsequently evaluated by the K* ensemble-based
partition function computation algorithm. We will describe
additional computational and modeling approaches
incorporated into our algorithms for improved
computational efficiency and prediction accuracy.
Results
In computational tests, allowing additional rotamer/backbone
flexibility as part of the protein design algorithms was
shown to result in significantly lower-energy conformations
than those generated by the rigid-rotamer/rigid-backbone
traditional DEE-based algorithms. We applied our K*
algorithm in a redesign to switch the substrate specificity of
the adenylation domain of the NRPS enzyme GrsA-PheA
from the wildtype substrate Phe towards several non-cognate substrates. Experimental tests on a set of the top in
silico predictions showed the desired improvement in
substrate specificity, confirming the feasibility of our
approach.
REFERENCES
50: POING - A FAST AND SIMPLE MODEL FOR
PROTEIN STRUCTURE PREDICTION
Benjamin Jefferys, Lawrence Kelley & Michael Sternberg
(Imperial College London, UK)
Poing is a fast new model for template-free protein
structure prediction based upon Langevin dynamics with
novel models for physicochemical effects. We have tested it
on a benchmark set and on the template-free CASP 7 targets,
and we have found its performance is comparable to the best
fragment folding methods.
Over the last two years we have been developing a
simplified approach to modelling protein folding. The
original aim was to model protein evolution, but we are
now obtaining successful results for protein structure
prediction.
The model developed thus far reduces a protein structure to
a string of C-alpha points, and for each of these a sidechain
point representing the mean location of all the non-hydrogen
sidechain atoms. This is similar to the Levitt & Warshel and
the Scheraga approaches. The novelty in comparison to
these methods lies in the increased detail of the force field
and the solvent model, developed to represent the
biophysical effects driving protein folding. The model is
designed to predict structures through iterative simulation of
a folding pathway which enforces a number of heuristic
constraints inspired by biophysical effects known to be
important for in vivo protein folding. It enforces these
constraints using classical mechanics under the Langevin
equation, involving forces between the particles representing
the protein structure. A notable feature of the force field is
that it is designed to maintain the stability of the native state.
Three features of protein folding are modelled in a novel
way.
The steric force is designed to capture some of the subtleties
of molecule packing in a complex force field between
particles. The repulsive interaction between two particles
depends upon the probability that atoms in an all-atom
model of those particles a given distance apart would clash
sterically, based upon analysis of sidechain and backbone
conformations in the PDB.
The polar interactions of the backbone (i.e. hydrogen bonds)
are modelled by initially calculating the likely position of
the O and H atoms involved in the interactions. Forces
between the relevant backbone particles aim to bring the
notional O and H atoms closer together. This novel model
for polar interactions is a compromise between adding more
particles representing the O and H atoms, slowing down
simulation, and having a simpler associative force between
backbone points, which would ignore some important
restrictions on how strands can be arranged into sheets.
[1] R. Lilien, B. Stevens, A. Anderson, and B. R. Donald. J.
Comput. Biol., 12(6-7):740-761, 2005.
[2] I. Georgiev, R. Lilien, and B. R. Donald. J. Comput.
Chem., [Epub ahead of print, 2008 Feb 21]. PMID:
18293294
[3] I. Georgiev and B. R. Donald. Bioinformatics, 23, i185-i194, 2007. Special issue on ISMB 2007, Vienna, Austria.
[4] I. Georgiev, D. Keedy, J. S. Richardson, D. C.
Richardson, and B. R. Donald. Bioinformatics. Special issue
on ISMB 2008, Toronto, Canada.
Debye density of low frequency modes. Recently, however,
it became clear that proteins can be described as fractals;
namely, geometrical objects that possess self similarity
[3,5]. Adopting the fractal point of view to proteins makes it
possible to describe within the same framework essential
information regarding topology and dynamics using three
parameters: the number of amino acids along the protein
backbone, the spectral dimension and the fractal dimension.
Based on a generalization of the Landau-Peierls instability
criterion and on a melting criterion for proteins, we derive a
relation between the spectral dimension, the fractal
dimension and the number of amino acids along the protein
backbone. In words this relation states that for every protein
the sum of the inverse of the fractal dimension and twice the
inverse of the spectral dimension is equal to unity plus a
constant, denoted "b", times inverse the logarithm of the
number of amino acids. Deviations from this equation may
render a protein unfolded. The fractal nature of proteins is
shown to bridge their seemingly conflicting properties of
stability and flexibility.
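Written as a formula, the relation stated in words above reads as follows. The symbols d_s (spectral dimension), d_f (fractal dimension) and N (number of amino acids) are notation assumed here, and the logarithm is taken to be the natural logarithm, which the text does not specify.

```latex
\frac{2}{d_s} + \frac{1}{d_f} = 1 + \frac{b}{\ln N}
```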
The spectral dimension governs the density of low
frequency normal modes, obtained using the Gaussian
Network Model (described later), of a fractal/protein. More
precisely, a power law relation, with the spectral dimension
as exponent, holds between the cumulative density of modes
and the frequency. Describing the mass fractal dimension is
most convenient using a three dimensional example. Draw a
sphere of radius "R" enclosing some lattice points in space
and calculate their mass, increase "R" and calculate again.
Do this several times and if the mass as a function of "R"
scales as R to some power this power is called the mass
fractal dimension. For a regular 3D lattice both spectral and
fractal dimensions coincide with the usual dimension of 3.
For proteins however, it is usually found that the spectral
dimension is smaller than 2 and that the fractal dimension is
smaller than 3 but larger than 2 leading to an excess of low
frequency modes and a sparser fill of space. The parameter
"b" in our equation weakly depends on temperature and
interaction parameters and hence may be considered almost
constant.
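The sphere-counting procedure described above can be sketched as follows: count the points inside spheres of increasing radius and take the slope of log(mass) against log(R) as the mass fractal dimension. Unit mass per point, the choice of sphere centre and the list of radii are assumptions of this example, not parameters taken from the study.

```python
import math

def mass_fractal_dimension(points, center, radii):
    """Estimate the mass fractal dimension by a log-log least-squares fit.

    points : iterable of (x, y, z) coordinates, each carrying unit mass
    center : (x, y, z) centre of the growing sphere
    radii  : increasing sphere radii R
    """
    log_r, log_m = [], []
    for radius in radii:
        mass = sum(1 for p in points if math.dist(p, center) <= radius)
        if mass > 0:
            log_r.append(math.log(radius))
            log_m.append(math.log(mass))
    n = len(log_r)
    mean_r = sum(log_r) / n
    mean_m = sum(log_m) / n
    # Slope of the least-squares line through (log R, log M).
    num = sum((a - mean_r) * (b - mean_m) for a, b in zip(log_r, log_m))
    den = sum((a - mean_r) ** 2 for a in log_r)
    return num / den
```

Applied to a regular cubic lattice, the estimate comes out close to 3, in line with the statement that both dimensions coincide with the usual dimension there; for a protein one would pass the alpha-carbon coordinates instead.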
Analysing the harmonic vibrations spectrum of proteins we
rely on the Gaussian Network Model (GNM) [4]. The GNM
considers proteins to be elastic networks whose nodes
correspond to the positions of the alpha-carbons in the
native structure and the interactions among nodes are
modelled as homogeneous harmonic springs. An interaction
between two nodes exists only if the nodes are separated by
less than a prefixed distance known as the interaction cutoff.
The cutoff distance is usually taken in the range of six to
seven angstrom, based on the radius of the first coordination
shell around residues observed in PDB structures. The only
information required to implement the method is the
knowledge of the native structure. GNM has been widely
applied because it yields results in agreement with X-ray
spectroscopy and NMR experiments.
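A minimal sketch of the network construction described above is given below: alpha-carbon positions become nodes, and every pair closer than the cutoff is joined by an identical spring, which yields the connectivity (Kirchhoff) matrix. The function name, the 7 angstrom default and the unit spring constant are assumptions of this example.

```python
import math

def kirchhoff_matrix(ca_coords, cutoff=7.0):
    """Build the GNM connectivity matrix from alpha-carbon coordinates.

    Off-diagonal entries are -1 for node pairs closer than the cutoff and 0
    otherwise; each diagonal entry is the number of contacts of that node.
    """
    n = len(ca_coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma
```

The normal modes follow from the eigen-decomposition of this matrix, for example with a symmetric eigensolver such as `numpy.linalg.eigh`; the low end of the resulting spectrum is what the spectral dimension characterises.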
The physics behind our equation has its roots in a paper
generalizing the Landau-Peierls criterion. Burioni et al
showed that thermodynamic instability also appears in
inhomogeneous structures and is determined by the spectral
We have enhanced the standard implicit solvent model of
the Langevin equation by ensuring that drag and kicks only
act upon parts of a particle exposed to solvent. This ensures
that the internal parts of a protein are not subject to solvent
effects, a key advantage of modelling an explicit solvent.
The solvent-accessible surface of each particle is modelled
by a sphere centered on the particle position. For particles
modelling the sidechains, the sphere radius depends upon
the amino acid type. The backbone particle radii are all the
same. The precise radii used have been optimised to
maximise the difference in accessible area between known
native and a set of non-native states for a small test set of
proteins, the aim being to destabilise non-native states.
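The idea of restricting drag and kicks to solvent-exposed parts of a particle can be illustrated with a one-dimensional Langevin step in which both terms are scaled by an exposure fraction between 0 and 1. This is only a sketch under assumptions: the linear scaling rule, the simple Euler-type update and all parameter names are hypothetical and are not taken from Poing.

```python
import math
import random

def langevin_step(x, v, force, mass, gamma, kT, exposure, dt, rng):
    """Advance one particle by one time step in one dimension.

    force    : callable returning the deterministic force at position x
    gamma    : friction coefficient of a fully exposed particle
    exposure : fraction of the particle surface exposed to solvent (0..1)
    rng      : random.Random instance supplying the Gaussian kicks
    """
    # Assumed rule: buried particles feel proportionally less solvent.
    gamma_eff = gamma * exposure
    # Random kick whose size is tied to the effective friction and kT.
    kick = math.sqrt(2.0 * gamma_eff * kT * dt) * rng.gauss(0.0, 1.0)
    v_new = v + (force(x) * dt - gamma_eff * v * dt + kick) / mass
    x_new = x + v_new * dt
    return x_new, v_new
```

With `exposure` set to zero the particle moves under the force field alone, which mirrors the statement that the interior of the protein is not subject to solvent effects.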
The simplicity of the model leads to very fast folding which
can be viewed as it proceeds using a custom visualisation
tool. Proteins up to 90 residues can fold to a stable state in
5-20 minutes. Our current predictions use 100 fold replicates
and require between 8 and 30 CPU hours. In CASP-like
testing conditions, for our test set of 30 domains of fewer than 90
amino acids, we predict structures exceeding TM-score 0.3
for 24 domains, and exceeding 0.4 for 4 domains. We also
tested the system on 12 of the template-free targets from
CASP 7, and we equal or better the world-leading Rosetta
server for six of the targets, using tens of hours of CPU time
as compared to thousands of CPU hours used by Rosetta.
9: PROTEINS: COEXISTENCE OF STABILITY AND
FLEXIBILITY
Shlomi Reuveni (School of Chemistry, Tel-Aviv
University, Israel), Rony Granek (Department of
Biotechnology Engineering, Ben-Gurion University, Israel),
Joseph Klafter (Tel Aviv University, Israel).
We introduce an equation for protein native topology
based on GNM analysis of PDB data and a
generalization of the Landau-Peierls instability criterion
for fractals. The equation relates the number of amino
acids with the fractal and spectral dimensions describing
the protein fold and was tested successfully over 543
proteins [1].
Two seemingly conflicting properties of native proteins,
such as enzymes and antibodies, are known to coexist.
While proteins need to keep their specific native fold
structure thermally stable, the native fold displays the ability
to perform flexible motions that allow proper function [2].
This conflict cannot be bridged by compact objects which
are characterized by small amplitude vibrations and by a
dimension. They demonstrated that for a spectral dimension
smaller than 2, the mean square displacement of a structural
unit (for example a single amino acid) in a system composed
of N elements, diverges in the limit of large N. This result
when used in tandem with a proper melting criterion for
proteins leads to an equation describing the native protein
fold. We are led to our equation from two different
independent pathways to protein melting. The first approach
utilizes the Gaussian Network Model (GNM). The melting
of a protein is treated in this approach in a way similar to the
melting of a solid crystal, with an additional assumption:
surface residues initiate the melting process in proteins.
Another approach is motivated by the viewpoint of a folded
protein as a collapsed polymer. It introduces a non-Lindemann criterion and a bond bending Hamiltonian rather
than the GNM Hamiltonian used in the first approach.
In order to test the validity of our equation, we calculated
the spectral and fractal dimensions for a data set of 543
proteins. Calculations were performed on known protein
structures, all of which were downloaded from the Protein
Data Bank (PDB). The proteins that were chosen may differ
in function and/or source organism and represent a wide
length scale ranging from 100 to 3000 residues. Statistical
analysis of the data gathered reveals satisfying agreement
with our equation. Furthermore, in contrast to [3], where the
authors suggested a relation similar to ours, we are able to
recover empirically the unity on the right hand side of our
equation. The results are shown in the figure attached. One
may wonder what will happen if a protein is forced to
strongly deviate from our equation and how artificial
deformations of the protein fold may lead to a breakdown of
our relation. Strong deformations of the protein fold may
actually happen in vivo as part of a natural process. A
possible example is GroEL, a protein chaperone that is
required for the proper folding of many proteins. Recent
molecular dynamics simulations demonstrate the unfolding
action of GroEL on a protein substrate. Our work provides a
theoretical framework that may help understand GroEL-induced
unfolding. In addition, our work opens new
possibilities for nanoscale and biologically inspired
engineering of catalysts, emphasizing the importance of
internal motion.
REFERENCES
1. S.Reuveni, R.Granek and J.Klafter, Proteins: Coexistence
of Stability and Flexibility, Phys. Rev. Lett. 100, 208101
(2008).
2. D. Joseph, G.A. Petsko, M. Karplus, Science, 249, 1425,
(1990).
3. R. Burioni, D. Cassi, F. Cecconi & A.Vulpiani, Proteins
55, 529 (2004).
4. T. Haliloglu, I. Bahar & B. Erman, Phys. Rev. Lett. 79:
3090 (1997).
5. R. Granek & J. Klafter, Phys. Rev. Lett. 95, 098106(1),
(2005).
17: PREDICTING SMALL LIGAND BINDING SITES
ON PROTEINS USING LOW-RESOLUTION
STRUCTURES
Andrew Bordner (Mayo Clinic, USA)
The SitePredict method uses Random Forests to predict
which protein residues bind specific metal ions and small
molecules based on evolutionary conservation and
spatial clustering of residue types. Because it requires
only a backbone structure, the method performs well for
unbound structures and can be applied to unrefined
homology models.
Specific non-covalently bound metal ions and small ligands,
such as nucleotides and cofactors, are essential for the
function and regulation of many proteins. However, the
identity of the natural ligands and their binding sites on a
particular protein are often unknown, even if a high-resolution structure is available. Computational prediction
of ligand binding sites can be used to guide their
experimental verification and thus save considerable effort.
SitePredict is a machine learning based method for
predicting binding sites of different metal ions and small
ligands on low-resolution protein structures. Because
ligands generally bind to residues that are non-contiguous in
the amino acid sequence, prediction methods that use a
protein structure, when available, are expected to perform
better than sequence-only methods. Furthermore, because
only residue-level information is required the method works
well with apo structures and may be directly applied to
homology models without side chain refinement.
SitePredict uses Random Forests trained on binding site
properties that include neighboring residue pair counts, local
enrichment of residue types, evolutionary conservation, and
a rough measure of solvent accessibility. Sites for metal
ions are 10-residue clusters located throughout the entire
protein whereas only surface pockets are considered as sites
for small molecules. Additional information on the shape of
the pocket, namely its volume and principal components, is
also included for small molecule binding site prediction.
Prediction training and validation were performed using a
comprehensive non-redundant set of protein-ligand
structures. A sufficient number of structures were found for
six different metal ions and six different small molecule
ligands. Prediction performance was assessed by the area
under the ROC curve (AUC) calculated from 10-fold
cross-validation results. Also, matching cross-validation
training and test sets contained data for proteins from
distinct sets of Pfam families in order to ensure their
independence. While prediction performance varied for
different ligands, the AUC remained at least 0.80 for all
ligands. A realistic test using apo structures for binding site
predictions resulted in only a small decrease in prediction
performance, as expected for a method that does not rely on
detailed atomic level structural information.
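For readers unfamiliar with the performance measure used above, the sketch below computes the area under the ROC curve directly from its pairwise definition: the fraction of (binding, non-binding) site pairs in which the binding site receives the higher score, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions; this is not the SitePredict evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1 or 0) and scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # correctly ordered pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy example: two binding sites (1) and two non-binding sites (0).
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In a cross-validation setting the held-out predictions from all folds would be pooled, or one AUC computed per fold, before reporting the value.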
Although the Random Forest classifier gives a binary
prediction (binding or non-binding site) the output score
contains more information. Higher values above the cutoff
indicate more confident binding site predictions. A
likelihood ratio derived from score histograms for each class
was used to calculate the prediction confidence. These
values can be used for prioritizing predictions for
experimental testing.
The Random Forest method can also estimate the relative
contribution of each site property to the overall prediction
accuracy. The top 10 most important properties were
examined for each ligand. Evolutionary conservation was
among the most important properties for all metal ions but
appeared in the top properties only for ATP and NAD
among the small molecules. Even so, removing
conservation from the input data resulted in a small decrease
in performance, showing that while it is important relative to
other variables it does not contribute inordinately to the
accuracy by itself. This is advantageous since about 20% of
the proteins did not have enough homologous sequences for
calculating evolutionary conservation. SASA and specific
residue propensity and residue pairs were also found to be
important for the metal ion predictions. The residue types
contributing most to prediction accuracy were different for
each ion and agreed with a previous analysis of common
coordinating residues (Harding 2004). Properties related to
the surface pocket shape and size were important for 4 out
of the 6 small molecules. Interestingly, no residue
propensities were among the most important properties for
any small molecule, possibly due to the larger size of these
sites compared with those for metal ions. However, residue
types appearing in important residue pairs for ATP and
NAD binding sites agreed with those in previously
identified sequence motifs.
Discrimination between different ligands was assessed by
cross-prediction in which a model trained on one ligand is
used to make predictions for proteins that bind a different
ligand. The ability of SitePredict to distinguish between two
different ligands was found to be non-symmetric, i.e. it depended
on which one was used for training. Calcium and
magnesium were the most difficult to distinguish metal
ions. This is probably related to the fact that some proteins
can bind either ion at the same site. ATP and AMP were the
most difficult to distinguish small molecules, presumably
due to their chemical similarity.
As a demonstration of the usefulness of SitePredict for
function annotation, binding site predictions for
uncharacterized proteins from PSI structural genomics
projects were examined. Several examples were found in
which the binding site predictions corroborated independent
experimental evidence and led to a consistent functional
assignment.
19: SCORING CONFIDENCE INDEX: STATISTICAL
EVALUATION OF LIGAND BINDING MODE
PREDICTIONS
Maria Zavodszky, Andrew Stumpff-Kan, David Lee, Michael Feig
(Michigan State University, USA).
We developed a statistical approach to quantify the
confidence users can have in the ability of a scoring
function to rank docked ligand poses correctly without
relying on any knowledge about correct binding modes. The
method can successfully differentiate between protein-ligand
complexes with funnel-like and flat binding energy landscapes.
Protein-ligand docking programs can generate a large
number of possible binding orientations for each ligand. The
challenge is to identify the orientations closest to the native
binding mode using a scoring method. We developed a
confidence measure of scoring performance in ranking the
docked ligand poses that does not rely on any knowledge
about the correct binding mode. The method exploits the
fact that an adequately performing scoring function captures
the roughly funnel-like shape of the binding energy
landscape, with scores generally improving as the docked
ligand orientations get closer to the correct binding mode.
For such cases, the correlation coefficient of scores versus
distances is expected to be the highest when the most native-like orientation is used as a reference. This correlation
coefficient, called the correlize score, was calculated for
each docked ligand pose and it was found to be a good
indicator of how far the docking is from the orientation
corresponding to the global minimum of the binding energy.
The correlation coefficient between the original scores and
correlize scores as well as the range of correlize scores were
found to be good measures of scoring performance. They
were combined into a single quantity, called the Scoring
Confidence Index to quantify the confidence the user can
have in the ability of a scoring function to rank the docked
poses correctly. The diagnostic ability of the Scoring
Confidence Index was tested on 50 protein-ligand
complexes scored with three commonly employed scoring
functions: AffiScore, DrugScore and X-Score. Binding
mode predictions were found to be three times more reliable
for complexes with Scoring Confidence Index values above
0.8 than for cases with lower values. This new confidence
measure of scoring performance is expected to be a valuable
tool for virtual screening applications.
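The per-pose correlation described above can be sketched as follows: each docked pose is taken in turn as the reference, distances from all poses to that reference are computed, and the Pearson correlation between scores and distances is recorded. Several details are assumptions of this example: lower scores are taken to be better, the distance is a plain RMSD without superposition, and the combination into the Scoring Confidence Index is not shown because its formula is not given here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between matched atoms, no superposition."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(pose_a, pose_b))
    return math.sqrt(total / len(pose_a))

def correlize_scores(poses, scores):
    """Correlation of scores versus distance, using each pose as the reference."""
    result = []
    for ref in poses:
        distances = [rmsd(pose, ref) for pose in poses]
        result.append(pearson(scores, distances))
    return result
```

On a funnel-like landscape the pose with the highest correlation is the one the scores converge towards; on a flat landscape all correlations stay close to zero and their range is small.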
22: FUNCTIONAL INSIGHTS FROM BINDING SITE
SIMILARITIES COMPLEMENT EXISTING
METHODS FOR THE PREDICTION OF PROTEIN
FUNCTION
More recently, we utilized IsoCleft to ask whether non-homologous binding sites that bind similar ligands can be
discriminated by means of their binding site similarities
(Najmanovich et al., 2008). This study showed that there
exists a certain level of uniqueness across non-homologous
binding sites. The ability to predict binding sites is very
sensitive to knowing the identity of the binding site atoms
within the cleft. As this information becomes less accurate,
the more difficult it is to determine what ligand the protein
may bind.
In the present work we describe a database of cognate
binding sites. Each binding site in the IsoCleft Database is
defined as the atoms in contact with a cognate ligand and
includes the respective residues' C-alpha atoms. The
database is a subset of the Procognate database (Bashton et
al., 2008) selecting one example for each Pfam
family/ligand combination where the ligand is at least 95%
similar to a ligand present in the KEGG reaction for that
protein family. For each Pfam cognate-ligand combination,
we select the example with lowest solvent accessible surface
area (McConkey et al., 2002). The IsoCleft Database
contains 1198 examples comprising 508 Pfam families and
486 ligands.
To demonstrate the usefulness of the IsoCleft method and
database in providing complementary information to
existing methods, we show here results for particular cases
of structural genomics proteins with unknown function, for
which the variety of current state of the art methods for the
prediction of function from sequence and structure present
in the ProFunc server (Laskowski et al., 2005), do not offer
functional clues.
The first example is that of PDB code 2pd0, a
Cryptosporidium parvum protein of unknown function.
Search against the IsoCleft database detects as the top two
distinct Pfam hits the product and substrate analogs of the
same purine nucleoside phosphorylase reaction in humans
and E. coli respectively.
The second and third examples correspond to cases where a
specific function could not be suggested, yet clear ligand
similarities exist between the cognate ligands that bind the
top scoring binding sites and may serve as initial guesses for
rational drug design or for narrowing down the space of
potential functions. In the case of 1sed, a hypothetical
protein from Bacillus subtilis, the three top distinct Pfam hits
are bound to D-glutamic acid, glutamine and fumarate,
three very similar molecules. In the case of 3d0j, a protein of
unknown function and origin, the three top hits are all
cofactors containing the AMP moiety.
We are currently working on setting up a web-based
interface to query the IsoCleft Database.
While no method can be accurate in all possible cases, the
examples shown here were specifically chosen to show the
potential of the IsoCleft method and associated database as a
valuable complement to the myriad of existing methods in
the quest for ever more accurate predictions of function
from structure.
REFERENCES:
Allali-Hassani A, Pan PW, Dombrovski L, Najmanovich R,
Tempel W, Dong A, Loppnau P, Martin F, Thornton JM,
Rafael Najmanovich
& Janet M. Thornton (European
Bioinformatics Institute, UK)
The detection of binding site similarities may help
pinpoint protein function and serve as a starting point
for rational drug design. In the present work we describe
the use of IsoCleft, a program to compare binding sites,
and the associated IsoCleft database on structural
genomics targets of unknown function.
Current computational methods for the prediction of
function from structure are focused on the detection of
similarities and subsequent transfer of functional annotation.
Such similarities may reflect a distant evolutionary
relationship as well as unique physico-chemical constraints
necessary for binding similar ligands.
IsoCleft is a graph-matching based method for the detection
of 3D atomic similarities introducing two innovations that
allow us to extend its applicability to the analysis of large
all-atom binding site models. IsoCleft does not require
atoms to be connected either in sequence or space.
The first innovation is to perform the graph matching in two
stages. In the first stage, an initial superimposition is
performed via the detection of the largest clique in an
association graph constructed using only C-alpha atoms of
equivalent residues in the two clefts. This superimposition is
used as a means to simplify the second all-atom graph
matching stage in which only atoms within a certain
distance threshold are considered as potentially
correspondent.
The second innovation introduced is the exploitation of the
fact, noted by Bron & Kerbosch (1973), that the algorithm
has the tendency to produce the larger cliques first in order
to implement what we call Approximate Bron & Kerbosch.
In the Approximate Bron & Kerbosch, the first clique is
selected as the solution (and the search procedure stopped)
rather than detecting all cliques in order to find the largest.
Approximate Bron & Kerbosch allows us to obtain an
optimal or nearly optimal solution in a fraction of the time
that would be needed without noticeable effects on the
results.
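The early-stopping idea can be illustrated with a basic Bron & Kerbosch recursion that returns the first maximal clique it reaches instead of enumerating all of them. This is a sketch under assumptions: it uses the simplest form of the algorithm without pivoting, visits vertices in sorted order, and takes the graph as a plain adjacency dictionary; the IsoCleft implementation and its association graph construction are not reproduced.

```python
def first_maximal_clique(adjacency):
    """Return the first maximal clique found by a Bron & Kerbosch search.

    adjacency: dict mapping each vertex to the set of its neighbours.
    """
    def expand(clique, candidates, excluded):
        # A clique is maximal when no vertex is left to extend it.
        if not candidates and not excluded:
            return clique
        for v in sorted(candidates):
            found = expand(clique | {v},
                           candidates & adjacency[v],
                           excluded & adjacency[v])
            if found is not None:
                return found     # stop at the first clique reported
            candidates = candidates - {v}
            excluded = excluded | {v}
        return None

    return expand(set(), set(adjacency), set())
```

The result is guaranteed to be a maximal clique but not necessarily the largest one; the trade-off is exactly the one described above, a nearly optimal answer for a fraction of the search time.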
In the past we have used IsoCleft to study the relation of
binding site similarities and experimentally determined
functional similarities within members of the Human
Sulfotransferase family (Najmanovich et al., 2007; Allali-Hassani et al., 2007).
Edwards AM, Bochkarev A, Plotnikov AN, Vedadi M,
Arrowsmith CH. Structural and Chemical Profiling of the
Human Cytosolic Sulfotransferases. PLoS Biology (2007)
vol. 5 (5) pp. e97.
Bashton M, Nobeli I, Thornton JM. PROCOGNATE: a
cognate ligand domain mapping for enzymes. Nucleic Acids
Research (2008) vol. 36 (Database issue) pp. D618-22.
Bron C & Kerbosch J. Algorithm 457: finding all cliques of
an undirected graph. Communications of the ACM (1973)
vol. 16 (9) pp. 575-577.
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh
M, Katayama T, Kawashima S, Okuda S, Tokimatsu T,
Yamanishi Y. KEGG for linking genomes to life and the
environment. Nucleic Acids Research (2008) vol. 36
(Database issue) pp. D480-4.
Laskowski RA, Watson JD, Thornton JM. ProFunc: a server
for predicting protein function from 3D structure. Nucleic
Acids Res (2005) vol. 33 (Web Server issue) pp. W89-93
McConkey BJ, Sobolev V, Edelman M. Quantification of
protein surfaces, volumes and atom-atom contacts using a
constrained Voronoi procedure. Bioinformatics (2002) vol.
18 (10) pp. 1365-73.
Najmanovich R, Kurbatova N, Thornton JM. Detection of
3D atomic similarities and their use in the discrimination of
small-molecule protein binding sites. Bioinformatics (2008)
vol. 24 (18) in press.
Najmanovich R, Allali-Hassani A, Morris RJ, Dombrovsky
L, Pan PW, Vedadi M, Plotnikov AN, Edwards AM,
Arrowsmith CH, Thornton JM. Analysis of binding site
similarity, small-molecule similarity and experimental
binding profiles in the human cytosolic sulfotransferase
family. Bioinformatics (2007) vol. 23 (2) pp. e104-9.
14: LOGIC-BASED DRUG DISCOVERY
derive rules from experimental data of ligand activity. In
blind trials on two GPCR targets, 50% of novel virtual
hits exhibited inhibitory activity upon experimental
screening.
Quantitative Structure-Activity Relationship (QSAR)
modelling is a central tool in drug discovery and
development because it identifies key structural features
for activity or toxicity. A variety of methods have been
developed, each with its merits and limitations. Over the
last few years we have been using a
form of logic-based machine learning based on Inductive
Logic Programming (ILP) combined with regression to
derive QSARs. The approach is able to identify key
chemical features from large datasets and to learn rules
which can be understood by medicinal chemists. This talk
will present a series of studies using logic-based machine
learning in drug discovery.
The first set of studies combined ILP with support vector
machines in an approach termed support vector inductive
logic programming (SVILP). A QSAR describing inhibition
of thermolysin had an Rsquared-CV (cross-validated squared
Pearson correlation coefficient) of 0.79, compared with 0.55
for the industry-standard method Comparative Molecular
Field Analysis (CoMFA). The learnt rules, based on the
structures of the inhibitors, correctly identified features of
thermolysin inhibition in accord with protein
crystallographic results (see Figure).
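As an illustration of the statistic quoted above, the following sketch computes a cross-validated squared Pearson correlation. It assumes leave-one-out cross-validation and a one-descriptor least-squares line purely for the sake of a short example; the abstract does not state the cross-validation scheme, and SVILP itself is not reproduced here. All function names are hypothetical.

```python
def squared_pearson(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def leave_one_out_predictions(xs, ys):
    """Predict each y from a line fitted to all the other points (assumed model)."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        n = len(train_x)
        mean_x = sum(train_x) / n
        mean_y = sum(train_y) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
                 / sum((x - mean_x) ** 2 for x in train_x))
        predictions.append(mean_y + slope * (xs[i] - mean_x))
    return predictions


def r_squared_cv(xs, ys):
    """Cross-validated squared Pearson correlation for the toy linear model."""
    return squared_pearson(ys, leave_one_out_predictions(xs, ys))


# Hypothetical descriptor values and activities.
print(r_squared_cv([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.7, 6.4, 7.8, 10.3]))
```

The key point is that every prediction entering the correlation comes from a model that never saw the corresponding compound.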
SVILP was also used to derive predictive rules for
toxicology from the DSSTOX dataset of fathead minnow
toxicity. SVILP yielded an Rsquared-CV of 0.57, compared
with 0.26 for the industry-standard TOPKAT (ref 1).
The learnt rules provided insight into key chemical alerts for
toxicity.
The SVILP approach has also been applied to model
protein-ligand interaction (3). Despite the increased use of
protein-ligand docking in the drug discovery process due to
advances in computational power, the difficulty of
accurately ranking the binding affinities of a series of
ligands docked to a protein remains largely unsolved. This
problem has led to the development of scoring functions
tailored to rank the binding affinities of a series of ligands to
a specific system. We have used SVILP to produce binding
affinity predictions of a series of ligands to a particular
protein. Our results show that SVILP performs comparably
with other state-of-the-art methods such as CoMFA on five
protein-ligand systems. The ability to display graphically
and understand the SVILP-produced rules is demonstrated.
The above studies demonstrated the applicability of SVILP
to generate accurate QSARs. A major challenge in drug
discovery is to use a QSAR to identify active molecules
from a database of possible molecules and thus to suggest
novel molecules for progression through a hit-to-lead
programme (i.e. virtual screening). In virtual screening, one
aim is to identify molecules that are sufficiently chemically
different from the currently known ligands (i.e. a novel
chemotype) so that they can be patented. In addition, novel
chemotypes may exhibit different pharmacological effects
including adverse side effects of current molecules.
The SVILP approach has recently been developed into
a more general logic-based approach, known as
Michael Sternberg (Imperial College London, UK),
Stephen Muggleton (Department of Computing, Imperial
College London, UK), Ata Amini (Equinox Pharma Ltd,
UK), Huma Lodhi (Department of Computing, Imperial
College London, UK), David Gough (Equinox Pharma Ltd,
UK), Paul Shrimpton (Structural Bioinformatics Group,
Imperial College London, UK).
A powerful approach to identify new chemotypes for
drug discovery by virtual screening is presented. The
approach is based on logic-based machine learning to
INDDEx™. We have used INDDEx™ in two blind trials
to find novel chemotypes against two series A GPCR
(G-protein coupled receptor) targets. In trial 1, the training
data consisted of 479 active antagonist molecules (<1000 nM
inhibition) and 209 inactives. The learnt QSAR was used to
screen a subset of 400,000 drug-like molecules from the
ZINC database. An initial screen identified 500 potential
hits with a predicted inhibition of <1000 nM. From this list,
molecules were removed that were chemically similar to any
molecule in the training data, so that every remaining hit
had a Tanimoto coefficient of <0.8 to each training molecule.
In total, 157 in silico hits were then purchased and tested in
primary screens, yielding 76 actives, i.e. a hit rate of just
under 50%. From these, 40 diverse molecules were selected
for secondary screening and 30 had an IC50 of <12 µM. Of
these 30, 28 were quite different from any of the training
data (Tanimoto coefficient <0.7) and it is unlikely that their
inhibitory activity could have been predicted by expert
inspection. Broadly similar results were obtained for target 2.
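The novelty filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: fingerprints are represented as Python sets of "on" feature indices, the molecule names and fingerprints are invented, and the abstract does not say which fingerprint was actually used.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def novel_hits(candidates, training, threshold=0.8):
    """Keep candidates whose similarity to every training molecule is below threshold."""
    return [name for name, fingerprint in candidates.items()
            if all(tanimoto(fingerprint, t) < threshold for t in training)]


# Hypothetical fingerprints for two training molecules and two predicted hits.
training = [{1, 2, 3, 4, 5}, {2, 3, 7, 8}]
candidates = {
    "close_analogue": {1, 2, 3, 4, 5, 6},  # shares 5 of 6 bits with a training molecule
    "new_scaffold": {1, 9, 10, 11},        # shares at most 1 of 8 bits
}
print(novel_hits(candidates, training))

# Primary-screen hit rate reported in the text: 76 actives out of 157 tested.
print(round(76 / 157, 3))
```

With the 0.8 threshold only the dissimilar candidate survives, and 76/157 is about 0.48, consistent with "just under 50%".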
Several features of INDDEx are responsible for its high
predictive accuracy of c. 50% in these two blind
trials. In particular, INDDEx can learn from a large dataset
and use information from both actives and inactives. In
addition, INDDEx is not based on global superposition but
rather identifies sub-structures that are important for
activity. Further blind studies are in progress to explore the
power of INDDEx to discover novel antagonists and
agonists using logic-based machine learning.
Amini, A., Muggleton, S.H., Lodhi, H. and Sternberg, M.J.
(2007) A novel logic-based approach for quantitative
toxicology prediction, J Chem Inf Model, 47, 998-1006.
Amini, A., Shrimpton, P.J., Muggleton, S.H. and Sternberg,
M.J. (2007) A general approach for developing
system-specific functions to score protein-ligand docked
complexes using support vector inductive logic programming,
Proteins, 69, 823-831.
FIGURE LEGEND An example of a learnt logic rule
describing the structural features of an inhibitor of
thermolysin.
43: CONFORMATIONAL FREE ENERGY OF
PROTEIN STRUCTURES: COMPUTING UPPER
AND LOWER BOUNDS
Hetunandan Kamisetty & Christopher Langmead (Carnegie
Mellon University, USA)
We describe an approach to compute the conformational
free energy (G) of a protein with a given backbone
conformation. Our technique models protein
structures with a fixed backbone as a complex probability
distribution over a set of torsion angles, represented by a set
of rotamers. Specifically, we model protein structures using
undirected probabilistic graphical models, also known as
Markov Random Fields. Our representation is complete in
that it models every atom in the protein. A probabilistic
representation confers several advantages including that it
provides a framework for predicting changes in free energy
in response to internal or external changes. For example,
structural changes due to changes in temperature, ligand
binding, and mutation, can all be cast as inference problems
over the model. Existing inference algorithms can then be
used to efficiently solve these problems.
In theory, the energy of interaction between any two
residues of the protein is non-zero. However, due to the
nature of these interactions, this energy is negligible if the
two residues are distally located. Also, if all the residues that
directly influence a pair of residues are in specific
conformations then the random variables corresponding to
these residues become conditionally independent of each
other. These conditional independencies can be compactly
encoded using a Markov Random Field (MRF).
In general, an MRF encodes the following conditional
independencies: each vertex is conditionally independent of
every other set of vertices in the graph, given its immediate
neighbors in the graph.
While an MRF allows for a compact encoding of the
probability distribution, performing statistical inference
exactly can still be expensive. In fact, even computing exact
marginals is NP-hard if the graph, like the MRF described
above, has cycles. The Junction Tree algorithm for exact
inference has a running time that is exponential in the
treewidth of the graph, which can be prohibitively expensive
in large graphs. However, recent advances within the Machine
Learning community on approximate algorithms for inference
now allow efficient computation of approximations to the
free energy. In particular, Generalized Belief Propagation
(GBP) gives estimates of free energy that have been shown to
work well in practice, even though they have no theoretical
guarantees [4]. Mean field and other variational
approximations [3] give upper bounds on the free energy,
while the methods of [2], which we shall refer to as
Tree-reweighted BP, give lower bounds on the free energy.
Since the log partition function and the free energy differ
only in the sign, we will use both in our results. When
comparing different algorithms with each other, we use
estimates of the log partition function and when comparing
them with experimental results, we will use the negatives of
the log partition estimates as free energy estimates.
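To make the direction of the mean-field bound concrete, here is a small sketch on a toy pairwise model. Everything specific in it is an assumption for illustration: three residues with two rotamers each, invented pairwise energies in units of kT, and hypothetical function names. It is not the authors' all-atom model. Naive mean field gives a lower bound on the log partition function, and therefore an upper bound on the free energy, taken here as the negative log partition function.

```python
import itertools
import math

# Hypothetical pairwise energies in kT units: E[(i, j)][a][b] for rotamer a
# of residue i and rotamer b of residue j. The three edges form a cycle.
E = {
    (0, 1): [[-1.0, 0.5], [0.2, -0.3]],
    (1, 2): [[0.0, -0.8], [0.4, 0.1]],
    (0, 2): [[0.3, 0.0], [-0.6, 0.2]],
}
N_RES, N_ROT = 3, 2


def energy(state):
    """Total energy of one rotamer assignment."""
    return sum(e[state[i]][state[j]] for (i, j), e in E.items())


def exact_log_z():
    """Exact log partition function by brute-force enumeration."""
    states = itertools.product(range(N_ROT), repeat=N_RES)
    return math.log(sum(math.exp(-energy(s)) for s in states))


def mean_field_log_z(sweeps=50):
    """Naive mean-field lower bound on log Z (entropy minus average energy)."""
    q = [[1.0 / N_ROT] * N_ROT for _ in range(N_RES)]
    for _ in range(sweeps):
        for i in range(N_RES):
            field = [0.0] * N_ROT
            for (u, v), e in E.items():
                for a in range(N_ROT):
                    if u == i:
                        field[a] += sum(q[v][b] * e[a][b] for b in range(N_ROT))
                    elif v == i:
                        field[a] += sum(q[u][b] * e[b][a] for b in range(N_ROT))
            weights = [math.exp(-f) for f in field]
            q[i] = [w / sum(weights) for w in weights]
    average_energy = sum(q[u][a] * q[v][b] * e[a][b]
                         for (u, v), e in E.items()
                         for a in range(N_ROT) for b in range(N_ROT))
    entropy = -sum(p * math.log(p) for qi in q for p in qi if p > 0)
    return entropy - average_energy


print("exact free energy:      ", -exact_log_z())
print("mean-field upper bound: ", -mean_field_log_z())
```

The bound holds for any product distribution, so it remains valid even if the coordinate updates have not fully converged.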
We will present results showing that simple bounds obtained
using Naive Mean Field and Tree-reweighted BP are
reasonably tight. While GBP is not guaranteed to give good
estimates, we show that it outperforms the two other
approaches on most data points, often significantly.
Admittedly, it is possible to find better bounds using, for
example, a structured variational approach
We describe an approach to compute the Conformational
Free Energy (G) of a protein with a fixed backbone, by
posing it as a statistical inference problem in a Markov
Random Field [1]. Using this framework, we shall describe
fast algorithms with strong theoretical guarantees to
compute lower and upper bounds for G.
instances, macromolecular complexes do not change during
crystallization and therefore crystal packing should reflect
significant, or biologically relevant, macromolecular
interactions. This assumption is exploited in most, if not all,
studies, where structural aspects of protein interactions are
inferred from crystals.
However, crystals exemplify thermodynamic systems in a
global minimum of free energy, which takes into account both
significant, biologically relevant interfaces found within
biological complexes and artifactual, inter-complex
contacts that originate from crystal packing.
Therefore, it is possible that crystals misrepresent
natural, in-solvent interactions by sacrificing their binding
energy if it is outweighed by the formation of more
energetically favourable inter-complex contacts. Although
this point is conceptually clear, no systematic study of the
correspondence between natural and in-crystal interactions
has been performed so far.
The present work approaches the outlined problem by
analyzing dimeric protein complexes obtained from
crystallographic PDB entries. Two goals are pursued. First,
we would like to find out to what degree our understanding
of macromolecular interactions allows one to reproduce
these complexes outside the crystal context. Secondly, we
would like to see whether, and if so under what
conditions, these complexes may be misrepresented by
crystals.
I will report the results of a massive docking experiment,
which included docking of 4065 non-redundant dimeric
protein complexes identified in crystal packings using PISA
software [1]. Before the docking, the complexes were
disassembled and their monomeric units were randomly
oriented in order to exclude the possibility of docking by
trivial translation. Then, a specially written program was
used to find the single most energetically favourable contact
of the units. Unlike in many other docking programs, no
geometrical scoring of docking quality was used. The
optimal docking position was identified solely by the
minimum Gibbs free energy of the generated complexes,
calculated in exactly the same way as in PISA software [1].
The described experiment corresponds to the simplest case
of bound docking: if docking calculations were exact and
crystal dimers were identical to complexes in solution, then
all dimeric structures would be reproduced.
It was found, however, that in 38% of instances, the
top-rated orientation of docked subunits was different from
the original crystal dimers (an 8 Å r.m.s.d. threshold was
used to identify successful dockings). This unexpectedly high
rate of failures shows a clear dependence on the Gibbs free
energy of dissociation, as seen in the Figure (red line). At
zero dissociation energy, when no energetically preferable
orientation of docked units may be found and successful
dockings emerge by chance, the success rate is estimated at
10%. This suggests that an average protein chain may form
about 10 geometrically suitable contacts, which agrees
reasonably well with an average of 8 interfaces per chain in
the PDB. With increasing free energy of dissociation, the
rate of failures shows a remarkable exponential decrease,
for lower bounds. While that is out of the scope of this
study, it is a promising direction for future studies.
REFERENCES
[1] Hetunandan Kamisetty, Eric P. Xing and Chris J.
Langmead, "Free Energy Estimates of All-atom Protein
Structures using Generalized Belief Propagation."
Proceedings of the Eleventh Annual International
Conference on Research in Computational Molecular
Biology (RECOMB 2007), pp:366-380.
[2] M. J. Wainwright, Tommi S. Jaakkola, Alan S. Willsky,
A new class of upper bounds on the log partition function,
IEEE Trans. on Information Theory, vol.51, pp: 2313-2335.
[3] Michael I. Jordan, Zoubin Ghahramani, Tommi S.
Jaakkola, Lawrence K. Saul, "An Introduction to Variational
Methods for Graphical Models", Learning in Graphical
Models, 1998.
[4] Yedidia, J.S., Freeman, W.T., Weiss, Y., Characterizing
Belief Propagation and its generalizations,
http://www.merl.com/reports/TR2002-35/, 2002.
6: CRYSTAL CONTACTS AS NATURE'S DOCKING
SOLUTIONS
Eugene Krissinel (EBI, Genome Campus, Hinxton,
Cambridge CB10 1SD, UK).
The assumption that crystal contacts reflect natural
macromolecular interactions forms the basis of many
studies in structural biology. However, the crystal state may
correspond to a global minimum of free energy where
biologically relevant interactions are sacrificed in favour of
nonspecific contacts. A large-scale docking experiment was
performed in order to assess the extent of
misrepresentation of natural complexes by crystal
packing.
The ability of proteins to interact with each other and form
complexes forms the basis of many important biochemical
processes. In general, protein interactions are thought to be
specific, which means that a given protein interacts
appreciably only with particular types of proteins and at
particular sites on the protein surface. This specificity is
important for research and applications, and considerable
effort in both experimental and theoretical studies
is devoted to identifying the structural aspects of protein
binding. A solution to this problem may bring about a better
understanding of protein function and give a clue for drug
discovery and design.
Most of our current knowledge of the geometry of
macromolecular interactions comes from protein
crystallography. It is commonly assumed that, in most
and no docking failures have been recorded at dissociation
energies higher than 50 kcal/mol.
In order to rationalize the obtained results, a theoretical
model for docking failures has been developed. This model
considers a finite number of geometrically suitable contacts
for each pair of docked proteins, and assumes that, firstly,
a crystal may capture geometrically different complexes with
a probability dependent on their free energy of dissociation,
and, secondly, free energy is calculated with a normally
distributed error. Fitting to experimental results in the
Figure suggests an average of 10 suitable docking contacts
(geometrically different complexes) for each pair of proteins
and a calculation error of 2.3 kcal/mol (green line). Assuming
the average of 10 contacts to be a property of crystal packing,
the model gives a finite rate of failures even at zero
calculation error, as shown by the magenta line. This line
indicates the measure of misrepresentation of dimeric
complexes by crystal packing. The weaker the protein
interaction in the complex, the higher the chance that it will
be completely lost at crystallization due to the emergence of
nonspecific, inter-complex interactions.
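The role of the calculation error alone can be illustrated with a toy Monte Carlo experiment. This is deliberately not the author's model: it ignores the probability of crystal capture, assigns every competing contact a dissociation energy of zero, and uses the quoted values (10 contacts, 2.3 kcal/mol) only as defaults. The function name and all other details are assumptions.

```python
import random


def failure_rate(dg_native, n_contacts=10, sigma=2.3, trials=20000, seed=0):
    """Fraction of trials in which a competing contact outscores the native one.

    dg_native: true dissociation free energy of the native contact (kcal/mol).
    Competing contacts are assumed to have zero dissociation energy, and every
    computed energy carries an independent normal error of width sigma.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        native = dg_native + rng.gauss(0.0, sigma)
        decoys = [rng.gauss(0.0, sigma) for _ in range(n_contacts - 1)]
        if max(decoys) > native:
            failures += 1
    return failures / trials


for dg in (0.0, 5.0, 10.0, 50.0):
    print(dg, failure_rate(dg))
```

At zero dissociation energy all ten contacts are statistically equivalent, so the native one wins about one time in ten, which matches the 10% chance success rate quoted in the text.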
The numbered spots in the Figure indicate CAPRI targets
with failure rate probabilities reported in [2]. Usually, the
low success of CAPRI dockings is attributed to algorithmic
imperfections and difficulties of bound docking. However,
the present study suggests that most CAPRI targets have
been chosen in the region where complexes are very likely
to be misrepresented by crystals. Therefore, it is possible
that, in some cases, computational docking yields correct,
lowest-energy dimers that are not found in crystal packings,
while docking solutions that were rated as successful could
be the result of mere chance. Only two higher-energy
CAPRI targets, shown as diamonds, have been successfully
docked by the program used in this study. It may also be
noted that a considerable number of CAPRI targets appear to
be unstable complexes, for which no docking solution should
be sought in the first place.
[1] E. Krissinel & K. Henrick (2007) J.Mol.Biol. 372:774-797
[2] S. Vajda (2005) Proteins, 60:176-180
estimating the barriers we construct a kinetic model
capable of determining the pathway and rate of
unfolding.
The configurational space of a polypeptide chain is
astronomically large, yet the folding of most proteins is
completed within a fraction of a second. This paradoxical
observation suggests that a pathway for folding, dictated by
physical interactions and topological constraints, must exist.
Protein topology is the primary determinant of the folding
energy landscape [1], and routes through this landscape are
like trees [2] where branches represent condensation events,
merging two substructures into one, and the branch order
represents the topological dependence of these events. The
discovery of a direct linear correlation between folding rate
and relative contact order was the first indication of a
topological dependence on the folding rate, and hints at a
mechanistic description of folding [3], but analytical models
cannot capture the details of the pathway. Recent
experimental observations by Colon endorse the view of
topology-dependent unfolding rates in a survey of
kinetically stable proteins [4]. In that study the most
kinetically stable proteins are multimeric and have complex
geometry, with features such as buried terminal strands, and
"latches" that wrap around the protein like a belt.
Our earlier model, called UNFOLD, described a protein as a
weighted secondary structure element graph [5]. Contact
energies were defined between secondary structure
elements, and min-cuts were found such that the graph was
hierarchically partitioned in the lowest-energy way at each
step. In the new model, GeoFold, we refine the energy
expression, adding configurational and sidechain entropy,
and writing the solvation energy as a function of denaturant
concentration. Also new in GeoFold is the ability to carry
out a kinetic simulation. To do so, we have defined
reasonable estimates of the rates of transition between
kinetic intermediates in the unfolding pathway.
A kinetic simulation is carried out by moving concentrations
along the edges of the unfolding graph. The unfolding graph
is composed of elemental subsystems (Figure 1a) where
substructure f is partitioned into two substructures, u1 and
u2, passing over an energy barrier in the process. It is
assumed that the two substructures, u1 and u2, are solvated
before they are separated. This means that an energy barrier
can be calculated from estimates of the solvation
energy and the gain in configurational entropy alone. The
solvation energy is assumed to be unfavorable and
proportional to the change in buried surface area, whereas
the configurational entropy change is assumed to be
positive, and its magnitude depends on the number of
degrees of freedom gained by unfolding.
Topology defines the allowable unfolding motions, which in
turn define the entropy gain at each step. Three topological
operators can be defined to describe all non-distorting linear
transformations on a chain (Figure 1b). If the chain
crosses only once from u1 to u2, then the allowable motion
is a pivot, which is the set of all rotations around a point. If
the chain crosses twice, the two crossing points define a
hinge, allowing rotations only around one axis. If the chain
18: GEOFOLD: A MECHANISTIC MODEL TO
STUDY THE EFFECT OF TOPOLOGY ON PROTEIN
UNFOLDING PATHWAYS AND KINETICS
Vibin Ramakrishnan (Institute for Bioinformatics and
Applied Biotechnology, Bangalore, India), Saeed Salem,
Saipraveen Srinivasan, Wilfredo Colon, Mohammed Zaki &
Chris Bystroff (Rensselaer Polytechnic Institute, USA)
We seek to explain the effect of
protein topology on kinetic
stability using a graph-based
model for unfolding. Proteins
open to an unfolded state by
either pivoting, hinging or separating chains. By
does not cross from u1 to u2, then the model consists of
multiple chains or disjoint segments of one chain, and the
motion is a simple translation, called a break in this study. A
break is assigned the highest entropy gain, followed by
pivots, then hinges.
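The three operators can be told apart simply by counting chain crossings between the two substructures. The sketch below illustrates that rule; the label-string representation, the function names and the decision to return `None` for more than two crossings are assumptions made here, not part of GeoFold.

```python
def count_crossings(labels):
    """Count where consecutive residues of one contiguous segment switch sides."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def classify_motion(segments):
    """Classify a partition of a structure into substructures u1 and u2.

    segments: one label string per contiguous chain segment, where each
    character ('1' or '2') says which substructure a residue belongs to.
    """
    crossings = sum(count_crossings(segment) for segment in segments)
    if crossings == 0:
        return "break"   # separate chains or disjoint segments: translation
    if crossings == 1:
        return "pivot"   # one crossing point: rotations around a point
    if crossings == 2:
        return "hinge"   # two crossing points: rotation around one axis
    return None          # assumed: no rigid motion is allowed otherwise


print(classify_motion(["1111122222"]))    # pivot
print(classify_motion(["1112222111"]))    # hinge
print(classify_motion(["1111", "2222"]))  # break
```

The returned label could then index a user-defined entropy table in which break > pivot > hinge, as described in the text.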
The structure of the unfolding graph (Figure 1c) depends on
the topology of the protein. If there are strong topological
dependencies on the possible ways to unfold the protein,
then the unfolding graph will contain one or more bottleneck
edges. The rate of passage through an edge depends on the
amount of buried surface exposed and the type of unfolding
motion. A series of bottleneck edges would lead to slower
unfolding in general.
GeoFold was applied to several proteins, some that unfold
fast (factor for inversion stimulation FIS, 1F36; protein-G,
2IGD) and others that are extremely kinetically stable
(papain, 1PPN; cyanase, 1DWK), having unfolding half-lives
measured in years or decades. An unfolding graph was
generated for each protein by finding all topologically
possible pivots, hinges and breaks, recursively, starting with
a graph node representing the complete native structure (N)
and ending in graph nodes that contain single residues. The
concentration of N was initialized to a non-zero molarity
and all other nodes were set to zero molar to start the
simulation. Concentration changes were calculated using
transition state theory, where the barrier height for unfolding
was set to the solvation energy times a Hammond factor
(theta), and the barrier height to folding was set to the
configurational energy gain minus the solvation energy
times the difference Hammond factor (1 - theta).
Concentration changes were then calculated until the
whole system reached an equilibrium state.
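A single edge of such a simulation might look like the sketch below. It is a simplified illustration under several assumptions that are not in the abstract: the edge is treated as a unimolecular two-state step, the folding barrier is read as (configurational gain) minus (1 - theta) times the solvation energy, energies are in kcal/mol, the rate prefactor is set to 1, and concentrations are advanced with explicit Euler steps. All names and numbers are hypothetical.

```python
import math

RT = 0.593  # kcal/mol near 298 K


def tst_rate(barrier, prefactor=1.0):
    """Transition-state-theory style rate for a given barrier height."""
    return prefactor * math.exp(-barrier / RT)


def simulate_edge(solvation, entropy_gain, theta=0.5, steps=2000):
    """Move concentration along one folded <-> unfolded edge until equilibrium."""
    k_unfold = tst_rate(theta * solvation)
    k_fold = tst_rate(entropy_gain - (1.0 - theta) * solvation)
    dt = 0.1 / (k_unfold + k_fold)  # small step relative to the relaxation time
    folded, unfolded = 1.0, 0.0     # all material starts in the folded node
    for _ in range(steps):
        flux = (k_unfold * folded - k_fold * unfolded) * dt
        folded -= flux
        unfolded += flux
    return folded, unfolded, k_unfold, k_fold


folded, unfolded, k_unfold, k_fold = simulate_edge(solvation=3.0, entropy_gain=2.5)
print(folded, unfolded)
```

Under this reading the equilibrium ratio of unfolded to folded material depends only on the solvation energy minus the configurational gain, while theta shifts the barrier heights and therefore the rates.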
A solvation energy factor (omega) was used to calculate the
solvation free energy from the buried surface area. The
configurational entropies of break (S_break), pivot
(S_pivot) and hinge (S_hinge) motions were also
user-defined, allowing us to empirically fit the unfolding
rates to real experimental values.
Unfolding simulations were carried out at various values of
omega near the melting point. The unfolding rates in pure
water were found by linearly extrapolating the rates to the
solvation value for pure water (omega_H2O). This is the
same method that is used experimentally.
The initial results of simulations on fast unfolders and
kinetically stable proteins show the expected trend. FIS, a
dimer, unfolds the fastest and shows 3-state kinetics, with a
dimeric intermediate state that dominates at the melting
omega. A dimeric equilibrium intermediate has been shown
experimentally for this protein [6]. Avidin (1RAV) has a
beta barrel structure that forces the protein to unfold by way
of an unfavorable hinge motion. Avidin unfolds much
more slowly than FIS. Papain (1PPN), a monomeric protein
having a complex topology with N and C-terminal latches,
unfolds even more slowly, and exhibits 2-state behavior.
The simulation data fit the experimental data qualitatively
and provide a detailed look at the unfolding pathway.
[1] Baker, D., Nature 2000, 405, 39-42.
[2] Hockenmaier, J. J., K A. and Dill, K A., Proteins 2007,
66, 1-15.
[3] Makarov, D. E., Plaxco, K. W., Protein Sci 2003, 12, 17-26.
[4] Xia, K., Manning, M., Hesham, H., Lin, Q., et al., Proc
Nat Acad Sci 2007, 104, 17329-17334.
[5] Zaki, M. J., Nadimpally, V., Bardhan, D., Bystroff, C.,
Bioinformatics (Oxford, England) 2004, 20, i386-393.
[6] Meinhold, D., Boswell, S., Colon, W., Biochemistry
2005, 44, 14715-14724.
38: THE NEXT GENERATION OF THE BACKBONE-DEPENDENT
ROTAMER LIBRARY
Maxim Shapovalov and Roland Dunbrack (Fox Chase Cancer
Center, USA).
We present the next generation of the backbone-dependent
rotamer library, which is widely used in structure
prediction and protein design programs. We have used
adaptive kernel density estimation to achieve smooth,
differentiable phi,psi-dependent probability estimates
and angles. These libraries are useful in methods that
account for backbone flexibility.
As the number of high-resolution X-ray structures has
increased, it has become possible to develop more detailed
statistical analyses of side-chain conformational data. The
backbone-dependent rotamer library, which provides
rotamer frequencies and the means and variances of dihedral
angles, is used in many homology modeling programs and
most protein design methods. As part of improving
homology modeling using the SCWRL side-chain prediction
program and other programs, we have developed the next
generation of a backbone-dependent rotamer library. Our
central goal in releasing a new rotamer library was to
provide smooth estimates of the rotamer probabilities as a
function of the backbone dihedrals phi and psi. Previous
versions of the library were quite bumpy due to a lack of
smoothing on the phi,psi grid, especially in regions of the
Ramachandran map that are not densely populated. As these
probabilities (or rather logs thereof) are often used as energy
functions in programs that allow backbone flexibility (e.g.,
Rosetta), it is important that the density estimates have
well-behaved derivatives with respect to the backbone
dihedrals.
We present a purely non-parametric approach to generate a
smooth, differentiable backbone-dependent rotamer library
for all standard protein residue types. We applied our
electron-density based method (Shapovalov and Dunbrack,
2007) to remove unreliable conformations in a set of 3000
protein chains. We used a recently developed program,
siocs, to flip Asn, Gln, and His residues according to
hydrogen bonding patterns within each crystal. To derive
probabilities of the different rotamers for each side-chain
type, we used adaptive kernel density estimation to calculate
p(phi,psi | r) for each rotamer r and Bayes’ rule to generate
p(r | phi,psi). The figure shows the probability of serine g+
(+60) rotamer vs. phi and psi.
A kernel is a Gaussian-like function used to spread out
single data points; the amount of smoothing depends on the
width of the kernel function, with greater smoothing
generated by wider kernels. An adaptive kernel varies the
width of the kernel depending on the local density of points,
such that there is greater smoothing in sparse regions of the
data set. The adaptive kernel thus reduces noise from
outliers in the data. We use both data-adaptive kernels, which
vary from data point to data point depending on a local pilot
density estimate, and query-point adaptive kernels,
where all data points have the same kernel width but the
kernel width varies with the density near the query point (in
this case, phi,psi). The rotamer probabilities use data-adaptive
kernels. To calculate mean angles and their variances as a
function of phi and psi, we used an adaptive kernel
regression, using query-adaptive kernels. The kernel used in
all of these calculations is a von Mises function, which is the
analogue of the normal distribution for periodic variables
(i.e., angles).
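The combination of von Mises kernels and Bayes' rule can be sketched as follows. For brevity the sketch uses a single fixed concentration parameter `kappa` rather than the adaptive kernels described above, and the data points, rotamer names and function names are invented for illustration; it is not the library's actual procedure.

```python
import math


def bessel_i0(x, terms=40):
    """Modified Bessel function I0 by its power series (adequate for small x)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def von_mises_2d(phi, psi, phi0, psi0, kappa):
    """Product of two von Mises kernels centred on one data point (radians)."""
    norm = (2.0 * math.pi * bessel_i0(kappa)) ** 2
    return math.exp(kappa * (math.cos(phi - phi0) + math.cos(psi - psi0))) / norm


def density(phi, psi, points, kappa):
    """Kernel density estimate of p(phi, psi | r) from the points of rotamer r."""
    return sum(von_mises_2d(phi, psi, p, s, kappa) for p, s in points) / len(points)


def rotamer_probabilities(phi, psi, data, kappa=4.0):
    """Bayes' rule: p(r | phi, psi) is proportional to p(phi, psi | r) * p(r)."""
    total = sum(len(points) for points in data.values())
    joint = {r: density(phi, psi, points, kappa) * len(points) / total
             for r, points in data.items()}
    evidence = sum(joint.values())
    return {r: value / evidence for r, value in joint.items()}


# Hypothetical (phi, psi) observations in radians for three rotamers.
data = {
    "g+": [(-1.0, 2.5), (-1.1, 2.4), (-0.9, 2.6)],
    "t": [(-1.0, -0.7), (-1.2, -0.8)],
    "g-": [(1.0, 0.5)],
}
print(rotamer_probabilities(-1.0, 2.5, data))
```

Because the kernel depends on the angles only through cosines, the estimate is automatically periodic and differentiable in phi and psi.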
One particularly difficult statistical problem is
backbone-dependent density estimates for non-rotameric
dihedral degrees of freedom, such as chi2 of Asp and Asn and
chi3 of Glu and Gln. This is effectively a regression of a
density estimate; that is, we provide p(chi2 | phi,psi,r1) for
Asp and Asn, where r1 is the chi1 rotamer. We have solved
this problem with a novel combination of query-adaptive
kernels for the backbone angles on the one hand and
data-adaptive kernels for the side-chain dihedrals on the other.
Effectively, in sparse or empty regions of the Ramachandran
map p(chi2 | phi,psi,r1) looks like a backbone-independent
estimate. In populated parts of the Ramachandran map, the
local data contribute strongly and the estimate varies
significantly from the backbone-independent estimate. We
have also applied these methods to the aromatic chi2
dihedral angles.
The new rotamer libraries improve structure prediction in
SCWRL, in particular for the aromatic amino acids and the
non-rotameric degrees of freedom. We believe the new
rotamer library will be an important step toward improving
protein structure prediction and modeling with SCWRL, and
especially for programs that rely on continuous and
differentiable energy functions such as Rosetta.
65: A TWO-STAGE RESIDUE-RESIDUE CONTACT
PREDICTOR
Actual residue-residue contacts comprise only about 3%
of possible pairs in a sequence. We use one neural
network to provide an enriched set of pairs for training a
second neural network. While the results show higher
accuracy, the gains appear to come from pairs with low
separation.
Protein structure prediction continues to be a challenge
despite the gains from model builders such as Modeller,
Rosetta, and undertaker. The best predictions today depend
on templates, known protein structures whose sequence is
sufficiently similar in part or in whole to the target
sequence. These templates provide important constraints in
building accurate models. However, there are target
sequences which have no templates.
For these targets, there is a need for other constraints,
especially in terms of the super-secondary structure, that
aspect of structure between the secondary structure and the
actual tertiary structure. One such constraint is the
knowledge that two residues are in close proximity to each
other even though they are far apart in the sequence, so
accurate predictions of these residue-residue contacts
may help in building models for such difficult targets.
We developed a predictor for CASP7 using local structure
predictions along with paired statistics including a novel
correlation statistic. Its predictions were assessed as the best
for CASP7. Since then we have developed a new neural
network for CASP8 that employs more inputs. While
developing the new predictor, we discovered that by just
using local structure predictions, we could build a good
predictor. Until then we had assumed that the paired
statistics were the main source of predictability and the local
structure predictions added only a small amount.
With this new result, we revisited an issue that arises in
developing a contact predictor: the sparseness of positive
examples. Actual contacts are only about 3% of the total
possible pairs of residues. Originally we dealt with the
sparseness by reducing the number of negative examples to
get a better balance of negative and positive examples while
training. The new two-stage predictor resolves this issue by
providing a second stage neural network with an enriched
set of predictions where the positive examples comprise
about 10% of the total examples; no balancing is required.
The first stage uses only local structure predictions and
regularized amino acid composition as inputs. We limit
resulting predictions to 10*sequence_length. Then paired
statistics are calculated for this restricted set of pairs. These
statistics along with the log(rank) of the first stage
predictions and matching local structure predictions provide
the inputs for training a second neural network. The result is
a gain of about 3% in overall accuracy.
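The enrichment step between the two stages can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout of the first-stage scores, and the use of the natural logarithm for log(rank) are all assumptions made for the example.

```python
import math

def enrich_pairs(stage1_scores, sequence_length, pairs_per_residue=10):
    """Keep the best-scoring pairs from a first-stage predictor.

    stage1_scores maps (i, j) residue-index pairs to a score, where a
    higher score means a more likely contact (assumed convention).
    Returns (pair, log_rank) tuples, best pair first, with rank 1 for
    the top-scoring pair. The natural log is an assumption.
    """
    limit = pairs_per_residue * sequence_length  # 10 * sequence_length in the text
    ranked = sorted(stage1_scores.items(), key=lambda item: item[1], reverse=True)
    return [(pair, math.log(rank))
            for rank, (pair, _score) in enumerate(ranked[:limit], start=1)]
```

The pairs returned here would then be the only ones for which paired statistics are computed and passed, together with log(rank), to the second network.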
This gain comes at a cost in the quality of the predictions.
To explain what is meant by quality, we present a new
measure called weighted accuracy. Given two residues, we
define separation as the absolute difference between the
indices of the two residues. Residue pairs with low
separation have a significantly higher probability of contact
than pairs with high separation (> 50). CASP assessors deal
with this issue by dividing predictions into three categories:
George Shackelford & Kevin Karplus (UCSC, USA)
Surface, as defined by H. Edelsbrunner at the end of the 90’s
[8]. This particular surface definition is referred to as the
Molecular Skin Surface (MSS) when applied to molecular
assemblies. It is basically comparable to the MS
representation but provides additional smoothness and
decomposability. Used in MetaMol, the MSS provides
further advantages as compared to other molecular surface
definitions. First, the surface does not self-intersect and is
everywhere tangent continuous [8]. Moreover, MSS is
composed of quadrics - whereas MS comprises torus slices - which simplify calculations. Another advantage of the MSS
is that the nature of the surface depends on a single
parameter, the shrink factor. Adjusting it allows the surface to evolve in
real time from a van der Waals surface to the MSS (close to
the MS) and finally to a simplified surface that can be very
useful for coarse-grained protein docking.
Some works already triangulate MSS [9-11], but this has
two drawbacks: (1) the surface topology must be preserved,
making the algorithm complicated (and slow); and (2) as
previously, at a certain level of zoom, the triangles generate
display artifacts. To overcome these limitations, we use a
ray-casting method. This has two advantages: (1) the ray-casting algorithm directly uses the MSS equation and does
not need to resample it; and (2) pixel-accurate images are
generated. In order to speed up the calculations
significantly, we implemented GPU ray-casting, which has
already been used to represent simple molecular models such as
“CPK” or “Balls and Sticks” [12, 13] but, to our knowledge,
our program is the first one that applies GPU ray-casting to
the more complicated case of MSS.
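Because the MSS is composed of quadric patches, the core of a ray-casting step is the intersection of a ray with a quadric, which reduces to solving a quadratic in the ray parameter. The sketch below shows that calculation on the CPU for a single, unclipped quadric written as x^T A x + b.x + c = 0. It is an illustration only: the function name and argument layout are assumptions, and MetaMol's GPU implementation and its clipping of each patch to its own region are not reproduced here.

```python
import math

def ray_quadric_hit(origin, direction, A, b, c):
    """Return the smallest t >= 0 with origin + t*direction on the quadric
    x^T A x + b.x + c = 0, or None if the ray misses it."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    A_dir = matvec(A, direction)
    A_org = matvec(A, origin)
    # Substituting the ray into the quadric gives qa*t^2 + qb*t + qc = 0.
    qa = dot(direction, A_dir)
    qb = dot(origin, A_dir) + dot(direction, A_org) + dot(b, direction)
    qc = dot(origin, A_org) + dot(b, origin) + c

    if abs(qa) < 1e-12:            # degenerate case: equation is linear in t
        if abs(qb) < 1e-12:
            return None
        t = -qc / qb
        return t if t >= 0 else None

    disc = qb * qb - 4.0 * qa * qc
    if disc < 0:                   # no real root: the ray misses the surface
        return None
    root = math.sqrt(disc)
    for t in sorted(((-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa))):
        if t >= 0:
            return t
    return None
```

Since the hit point comes from the surface equation itself, it is exact for each pixel's ray, which is the property that avoids the faceting seen with triangulated surfaces.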
As a result MetaMol is able to display MSS interactively
and with the best rendering quality. Furthermore, MetaMol
provides sophisticated lighting effects that enhance the
display quality, and it is possible to visualize the MSS
deformations with smooth transitions, which may be used
for displaying, in real time, molecular surface movement
during Molecular Dynamics simulations. See also:
http://www.loria.fr/~chavent/metamol.htm
REFERENCES:
1. Connolly, M. L., molecular surface triangulation, Journal
of Applied Crystallography, 1985, 18, pp. 499-505.
2. Varshney, A., Brooks, F. P. J. and Wright, W. V.,
Linearly Scalable Computation of
Smooth Molecular, Invited submission, IEEE Computer
Graphics and Applications, 1994.
3. Sanner, M. F., Olson, A. J. and Spehner, J. C., Reduced
surface: an efficient way to compute molecular surfaces.,
Biopolymers, 1996, 38, pp. 305-320.
4. Can, T., Chen, C.-I. and Wang, Y.-F., Efficient molecular
surface generation using levelset methods., J Mol Graph
Model, 2006, 25, pp. 442-454.
5. Bates, P. W., Wei, G. W. and Zhao, S., Minimal
molecular surfaces and their applications, J Comput Chem,
2007.
6. Vorobjev, Y. N. and Hermans, J., SIMS: computation of a
smooth invariant molecular surface, Biophys J, 1997, 73, pp.
722-32.
those with separation of 6 or greater, 12 or greater, and 24 or
greater. Accuracy is measured in all three categories for
assessment. Correct predictions with large separation can be
considered more valuable than those with small separation.
The new measure, weighted accuracy, takes the impact of
separation into account. Weighted accuracy for a prediction
is C(i,j)/p(|i-j|) where C is 1 if residues i and j are in contact
and 0 otherwise, and p(|i-j|) is the probability that the
residues with that separation are in contact. This provides a
higher value for correct predictions when the separation is
large.
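A small sketch of the measure is given below. The per-prediction term C(i,j)/p(|i-j|) follows the definition above; averaging over the predicted pairs, the function names, and the table of contact probabilities by separation are assumptions made for illustration.

```python
def weighted_accuracy(predicted_pairs, true_contacts, contact_prob_by_sep):
    """Mean of C(i,j) / p(|i-j|) over the predicted pairs.

    contact_prob_by_sep maps a separation |i-j| to the (assumed, e.g.
    empirically estimated) probability that residues at that separation
    are in contact.
    """
    if not predicted_pairs:
        return 0.0
    contacts = {tuple(sorted(pair)) for pair in true_contacts}
    total = 0.0
    for i, j in predicted_pairs:
        in_contact = 1.0 if tuple(sorted((i, j))) in contacts else 0.0
        total += in_contact / contact_prob_by_sep[abs(i - j)]
    return total / len(predicted_pairs)

def plain_accuracy(predicted_pairs, true_contacts):
    """Fraction of predicted pairs that are true contacts."""
    if not predicted_pairs:
        return 0.0
    contacts = {tuple(sorted(pair)) for pair in true_contacts}
    hits = sum(1 for pair in predicted_pairs if tuple(sorted(pair)) in contacts)
    return hits / len(predicted_pairs)
```

With hypothetical probabilities of 0.2 at separation 6 and 0.02 at separation 57, a correct prediction at separation 57 contributes ten times as much as a correct one at separation 6, so two predictors with the same plain accuracy can differ widely in weighted accuracy.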
Using this measure we show that the two-stage predictor
may provide better accuracy but lower weighted accuracy.
This can be explained if we assume that the two-stage predictor
makes more correct predictions but that those predictions have
smaller separations than those of a single-stage predictor.
LAPTOP PRESENTATION ABSTRACTS
2: METAMOL: HIGH QUALITY VISUALIZATION
OF MOLECULAR SKIN SURFACE
Matthieu Chavent (France CNRS),
Bruno Levy (France INRIA), Bernard
Maigret (France CNRS).
MetaMol is a new program that generates high-quality 3D
representations in interactive time. In contrast with existing
software that discretizes the surface with triangles or grids,
our program is based on a GPU-accelerated ray-casting
algorithm that directly uses the piecewise-defined
algebraic equation of the Molecular Skin Surface.
The Solvent Excluded Surface (SES) or Molecular Surface
(MS) is the most widely-used surface for representing
macromolecular assemblies. Starting from the pioneering
algorithm proposed by Connolly [1], numerous works have
been devoted to the improvement of the related methods, in
order to provide fast and robust generation of high quality
pictures of MS.
In 1994, Varshney et al. developed a program that was
easily parallelizable [2]. Two years later, Sanner et al. proposed a
method based on reduced surfaces [3] to visualize large
molecules (more than 10,000 atoms). More recently, using a
grid associated with a marching front algorithm, Can et al.
proposed a level-set-based method [4] while Bates et al.
defined a Minimal Molecular Surface [5]. All these
approaches are efficient but suffer from precision problems:
in the Varshney and Sanner algorithms, the molecular surface is
triangulated while, for the Can and Bates algorithms, the
surface is represented as the union of cubes so that a level of
zoom is always found where triangles or cubes appear.
Furthermore, the generated MS is not exempt from
singularities due to self intersections [6, 7, 3, 5].
With MetaMol, we tackle the problems of precision and
singularities by using the Skin
on average from the native conformation) were refined to
within 2 Å. This happened despite the lack of hydrogen-bonding or any orientation-dependent term in the DFIRE
energy function. However, success deteriorates significantly
as the initial structures of the helical/strand segments deviate
more from their respective native conformations.
Here, we propose a “dipolar” DFIRE (dDFIRE) energy
function based on the orientation angles involved in dipole-dipole interactions. This is done by treating each polar atom
as a dipole. The orientation of the dipole is defined by the
bond vectors that connect the polar atom with other heavy
atoms. The dDFIRE energy function is then extracted from
protein structures based on the distance between two atoms
and the three angles involved in dipole-dipole interactions.
This approach takes into account the hydrogen-bonding
interaction via the physical dipole-dipole interaction. More
importantly, it provides a consistent treatment for the
possible orientation-dependent interactions between polar
and nonpolar atoms and between polar atoms that are non-hydrogen-bonded. Moreover, an integrated treatment of
distance and angle dependence produces a parameter-free
statistical energy function. Existing orientation-dependent
knowledge-based energy functions are limited to either
hydrogen bonding or geometry-based orientation in coarse-grained models.
This all-atom statistical energy function was employed to
fold protein terminal regions with secondary structures.
Folding completely unfolded terminal segments is
challenging because it requires the restoration of both
mainchain and sidechain conformations. Moreover,
compared to internal regions, terminal regions are more
flexible and often exposed. This test is necessary because
native-like fragment structures are difficult to produce with
contemporary energy functions, and the prevailing structure-prediction technique is to mix and/or match known native
structures either in whole (template-based modeling) or in
part (fragment assembly). The ab initio refolding of a
completely unfolded segment also has its own biological
significance, as protein folding assisted by a prefolded
domain (pro-domain) is common in many proteins.
It is important to learn which orientation-dependent
interaction is responsible for the success of the dDFIRE
energy function in segment refolding. Three orientation
components of the dDFIRE energy function are employed to
refold five terminal regions (two single helix segments, one
two-helix bundle, one strand, and one beta hairpin of five
separate proteins). The three dDFIRE components are the
orientation dependence involving hydrogen-bonded polar
atoms only, polar-nonpolar atoms only, and polar atoms
only. (Note that the last one includes hydrogen-bonded
atoms.) The three individual orientation components can
restore the single helices in 2guzb and 1i2ta as accurately as the
full dDFIRE energy function. However, they produced
slightly less accurate structures (1.5 Å to 1.7 Å in global
rmsd) than the full dDFIRE function (0.8 Å) for the terminal two-helix
bundle in 1o82a. While every single orientation component
can refold helix-containing segments with reasonable
accuracy, they cannot restore the structures of strand-containing segments well. The orientation components
7. Geng, W., Yu, S. and Wei, G., Treatment of charge
singularities in implicit solvent models, J Chem Phys, 2007,
127, pp. 114106.
8. Edelsbrunner, H., Deformable Smooth Surface Design.,
Discrete & Computational Geometry, 1999, 21, pp. 87-115.
9. Kruithof, N. and Vegter, G., Approximation by skin
surfaces, SM '03: Proceedings of the eighth ACM
symposium on Solid modeling and applications, ACM
Press, New York, NY, USA, 2003, pp. 86-95.
10. Cheng, H.-L. and Shi, X., Guaranteed Quality
Triangulation of Molecular Skin Surfaces, Proceedings of
the conference on Visualization '04, IEEE Computer
Society, 2004, pp. 481-488.
11. Cheng, H.-L. and Shi, X., Quality Mesh Generation for
Molecular Skin Surfaces Using Restricted Union of Balls,
vis, 2005, 00, pp. 51-57.
12. Toledo, R. and Levy, B., Extending the graphic pipeline
with new GPU-accelerated primitives, Tech report, 2004.
13. Sigg, C., Weyrich, T., Botsch, M. and Gross, M., GPU-Based Ray-Casting of Quadratic Surfaces, Symposium on
Point-Based Graphics, 2006, pp. 56-65.
3: SPECIFIC INTERACTIONS FOR AB INITIO
FOLDING OF PROTEINS
Yuedong Yang (Indiana University School of Informatics, USA),
Yaoqi Zhou (Indiana University, USA)
Proteins interact via orientation-dependent interactions
between amino-acid residues. We propose a
statistical potential that consistently treats the
orientation and distance dependence of interactions
between all polar atoms and between polar and nonpolar
atoms (in addition to hydrogen-bonded atoms). The
potential is tested by ab initio refolding of protein
terminal regions.
The most well-known specific interaction in proteins is
hydrogen-bonding. Little attention, however, has been paid
to the orientation dependence of interactions between polar
atoms that are not hydrogen bonded, despite evidence of
their role in the formation of alpha helices and beta sheets.
Moreover, the possible orientation dependence of
interactions between polar and nonpolar atoms is ignored
even though the hydrophobic effect is caused by the reorientation of water molecules near a hydrophobic surface.
Recently, Zhu, Xie, and Honig compared several statistical
energy functions and physical-based energy functions and
analyzed their respective abilities to refold partially
unfolded helices or strands. They found that among the
energy functions tested, the most effective one is an all-atom, distance-dependent, pairwise statistical energy
function based on a Distance-scaled, Finite-Ideal gas
Reference (DFIRE) state. In one test, more than 80% of
conformations from 104 segments of 81 proteins (4 Å rmsd,
between hydrogen-bonded atoms and between polar and
nonpolar atoms failed to fold the C-terminal beta strand of
1fltx within 2 Å global rmsd. Additionally, none of the three
individual components can refold the C-terminal beta-hairpin of 2extb (in a dimeric form) to within 2 Å in global
rmsd.
The results reported here underline the importance of
orientation-dependent interactions, in addition to the well-studied hydrogen-bonding interaction, for the successful
restoration of specific structural segments of proteins. The
absence of orientation dependence leads to short helices or
coils rather than secondary-structure elements. These results
confirm the importance of orientation preference between
non-hydrogen-bonded atoms in the formation of secondary
structures of proteins. Additionally, the results call for
attention to the relative orientation between polar and
nonpolar atoms. So far, orientation-dependent interactions
other than hydrogen bonding have been ignored in
constructing all-atom knowledge-based or empirical energy
functions. This explains why native-like fragment structures
are difficult to produce with contemporary energy functions.
Thus, this work has significant implications for developing
more specific energy functions for folding and molecular
recognition.
Fig. 1 compares five native structures to structures whose
fragments in five different structural elements are refolded
by DFIRE and by dDFIRE, respectively. There is a clear
difference between the structures refolded by dDFIRE and
those by DFIRE. For example, dDFIRE can refold the C-terminal single helix segment of 1i2ta very well, while
DFIRE breaks it into two segments. A similar phenomenon
is observed for 2guzb and 1u84. In addition, unlike dDFIRE,
DFIRE fails to yield two helices in 1r690 (as shown) and
1o82a. Moreover, for single strands, DFIRE produces either a
strand that is coil-like (2ptl, as shown, 1fltx, 1csp) or even a
helix (2extb, as shown) while dDFIRE produces strands that
have a more normal structural pattern. There is a marked
difference in the quality of the secondary-structure segments
refolded by the two energy functions as indicated by the
local rmsd values.
REFERENCES:
[1] H. Zhou and Y. Zhou, Distance-scaled, finite ideal-gas
reference state improves structure-derived potentials of
mean force for structure selection and stability prediction,
Protein Science, 11, 2714--2726 (2002).
[2] Y. Yang and Y. Zhou, Specific interactions for ab initio
folding of protein terminal regions with secondary
structures, Proteins 71, published online Feb 7, 2008,
DOI: 10.1002/prot.21968.
Fig. 1 The segment structures (in red) refolded by DFIRE
(left) and dDFIRE (center) for five proteins as labeled are
compared to their respective native conformations (right).
The fixed portion of each protein is colored in light green.
4: STRUCTURE DETERMINATION OF PROTEIN-PROTEIN COMPLEXES USING PARAMETERS OF
THEIR OVERALL ROTATIONAL DYNAMICS
AVAILABLE VIA NMR RELAXATION DATA
Yaroslav Ryabov & Charles Schwieters (NIH, USA).
Structure and dynamics of proteins have obvious mutual
relationships. For example, the size and shape of a protein
determine rates of its overall rotational tumbling. We present a
computational approach which utilizes parameters of this
tumbling encoded in experimental NMR relaxation data for
structure determination of single domain proteins and
protein-protein complexes.
This work presents a further step in utilization of Nuclear
Magnetic Resonance (NMR) data in the Xplor-NIH
structure determination package [1]. Namely, we report a
new algorithm which uses dynamic information encoded in
NMR relaxation times for protein structure determination.
The first attempt to use the ratio of longitudinal (T1) and
transverse (T2) NMR relaxation times for refinement of
NMR protein structures was undertaken by Tjandra et
al. [2]. The authors of that work used the residue-specific
dependency of NMR relaxation times on the molecular
angle between an NH bond and the longer principal axis of
the protein diffusion tensor. While that approach provided
some improvement of protein structure quality, it suffered
from a number of limitations, primarily because the diffusion
tensor anisotropy was estimated from the dispersion of the
T1/T2 ratios.
Recently, a fast method for direct calculation of protein
diffusion tensor components has become available [3]. This
method employs an ellipsoidal approximation to the
protein’s shape. In particular, it considers the solvent
accessible surface of a hydrated protein structure mapped by
the SURF tessellation method [4]. The original algorithm [3]
treats the vertices of the tessellated mesh with Principal
Component Analysis (PCA) [5] to obtain the dimensions of an
equivalent ellipsoid and then applies Perrin’s equations
[6] to calculate components of the protein diffusion tensor
for the equivalent ellipsoid shell approximating a protein’s
shape. This method is about 500 times faster than
conventional bead algorithms [7] with comparable accuracy.
In other words, it is fast and accurate enough to be
incorporated in an integrative structure calculation
procedure. Preliminary work [8] used this method in
combination with a Simplex search algorithm to position
domains in two-domain protein complexes when only the
translational degrees of freedom were searched. In that case,
orientations of the protein domains were derived from other
considerations.
The present contribution reports the implementation of this
fast method for calculating components of protein rotational
diffusion tensor within the Xplor-NIH structure
determination package. To achieve this goal the method was
modified to calculate gradients of the chi-square function
with respect to the positions of all protein atoms. Thus, it is
[2] Tjandra, N., Garrett, D.S., Gronenborn, A.M., Bax, A.,
Clore, G.M. (1997) Nature Struct. Bio. 4, 443 - 449.
[3] Ryabov Y.E., Geraghty C., Varshney A., Fushman D.
(2006) JACS, 128, 15432 - 15444.
[4] Varshney, A., Brooks, F. P., Jr., Wright, W. V. (1994)
IEEE Comput. Graphics Appl. 14, 19-25.
[5] Jolliffe, I. T. Principal Component Analysis; Springer-Verlag: New York, 1986.
[6] Perrin, F. (1934) J. Phys. Radium, 5, 497-511; Perrin, F.
(1936) J. Phys. Radium, 7, 1-11.
[7] Garcia de la Torre, J.; Huertas, M. L.; Carrasco, B.
(2000) J. Magn. Reson. B147, 138-146.
[8] Ryabov Y., Fushman D., (2007) JACS 129, 7894-7902.
[9] Tjandra, N., Wingfield, P., Stahl, S., Bax A. (1996) J.
Biomol. NMR 8, 273-284.
[10] Yamazaki, T., Hinck, A.P., Wang, Y.X., Nicholson,
L.K., Torchia, D.A., Wingfield, P., Stahl, S.J., Kaufman,
J.D., Chang, C.H., Domaille, P.J., Lam, P.Y.S. (1996)
Protein Sci. 5, 495 - 506.
now able to explore all degrees of freedom used in protein
structure elucidation. Components of the protein diffusion
tensor, derived from experimentally measured T1 and T2
NMR relaxation times, are used as structural restraints for
standard Xplor-NIH simulated annealing protocols.
Therefore, this method essentially restrains the overall shape
of a protein molecule making it conceptually different from
the previous approach [2]. The ability to restrain the overall
shape of a protein may help to resolve the problem of poor
packing density of NMR protein structures. Here, however,
we utilize these overall shape restraints for positioning and
orienting domains in multi-domain protein complexes. This
is especially important for NMR-based methods of structure
elucidation when a small number of inter-domain NOE
distance restraints makes positioning the subunits difficult.
Figure 1 illustrates application of this method for the
particular case of the HIV-1 protease homodimer. In this
case, centers of gravity of both domains, which were treated
as rigid bodies, were initially superimposed. Then, the
position and orientation of one domain were randomized
within a cube of 60 x 60 x 60 angstrom dimensions to prepare
an ensemble of 512 different random initial conditions to
start the standard Xplor-NIH structure refinement protocol. We
used previously measured [9] components of HIV-1
protease’s rotational diffusion tensor as the only
experimental structural restraints. The refined structures
were sorted in ascending order with respect to the values of
chi-square differences between measured components of
diffusion tensor and those calculated for refined structures.
The first 5 percent of the sorted list contain structures which
are practically equivalent to each other (less than 0.001
angstrom of Root Mean Square Deviation (rmsd) for alpha-carbon positions) and very close to the reference HIV-1
structure [10] (PDB code 1BVG) derived from NOE inter-domain restraints (0.3 angstrom of alpha-carbon rmsd). The
average value of the chi-square function terms
corresponding to the diffusion tensor restraints for these 30
structures with lowest rmsd is about 5 times lower than the
same chi-square term for the structure immediately
following them in the list. This makes these structures
reliably recognizable among others and demonstrates the ability of
the method to obtain the correct arrangement of a protein’s
subunits in cases when a reference structure is not
available a priori.
The algorithm is rather fast, requiring about 200 seconds for
refinement of a single structure on a single core of a standard
desktop. Parallelization makes calculation time on an 8-core
cluster less than 4 hours.
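The ranking step described above, sorting refined structures by the chi-square difference between measured and calculated diffusion tensor components, can be sketched as follows. The abstract does not give the exact form of the chi-square term used in Xplor-NIH, so the uncertainty-weighted sum of squares below is an assumption, as are the function names and the example numbers.

```python
def chi_square(measured, calculated, sigma):
    """Sum of squared, uncertainty-scaled differences (assumed form)."""
    return sum(((m - c) / s) ** 2 for m, c, s in zip(measured, calculated, sigma))

def rank_structures(measured, sigma, candidates):
    """Sort candidate structures by ascending chi-square.

    candidates maps a structure label to the diffusion tensor components
    calculated for that refined structure. Returns (chi2, label) tuples.
    """
    scored = [(chi_square(measured, comps, sigma), label)
              for label, comps in candidates.items()]
    scored.sort()
    return scored

# Hypothetical tensor components in arbitrary units, for illustration only.
example = rank_structures(
    measured=[1.0, 2.0, 3.0],
    sigma=[0.1, 0.1, 0.1],
    candidates={"run_a": [2.0, 2.0, 3.0], "run_b": [1.0, 2.0, 3.0], "run_c": [1.1, 2.0, 3.0]},
)
```

A sharp jump in chi-square between consecutive entries of such a sorted list is what lets the well-converged structures be separated from the rest without a reference structure.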
ACKNOWLEDGEMENTS
The authors acknowledge stimulating discussions with Drs.
G.M. Clore and J.J. Kuszewski. Y.R. is supported by the
National Research Council Associateship Program (Award #
0710430). C.D.S. is supported by the Intramural Research
Program of CIT, NIH.
[1] C.D. Schwieters, J.J. Kuszewski, N. Tjandra and G.M.
Clore, (2003) 160, 66-74; C.D. Schwieters, J.J. Kuszewski,
and G.M. Clore, (2006) Progr. NMR Spectroscopy 48, 47-62.
5: FOCUSED DOCKING: A COMPUTATIONAL
APPROACH TO IMPROVE SMALL-MOLECULE
DOCKING INTO PROTEIN STRUCTURES
Dario Ghersi & Roberto Sanchez (Mount Sinai School of Medicine, USA).
A computational protocol that combines protein binding site
detection and docking is presented here and evaluated on a
set of 77 cases. The comparison with blind docking shows that
our protocol achieves a higher rate of binding site detection,
produces more accurate results and requires significantly less
computational time.
The goal of protein-ligand docking is to predict the position
and orientation of a ligand (usually a small molecule) when
it is bound to a receptor protein. When the binding site to be
targeted by the small-molecule is known, selecting a
reasonably small docking box around this site facilitates
docking by focusing sampling of the translational, rotational
and torsional degrees of freedom of the ligand. This is the
usual situation in lead optimization, where predicting the
binding mode or pose of the ligand is needed for rational
design of improved potency and selectivity, and in hit
identification through virtual screening where the goal is the
discovery of ligands, out of a large library, that are likely to
bind a protein target. The reverse question is more difficult
to address. Given a ligand, is it possible to discover its most
likely target? In this “reverse virtual screening” case, since
the binding site is not known it becomes necessary to
explore the entire protein surface by docking, a procedure
that has been named “blind docking”. Since the space where
blind docking takes place must accommodate the entire
protein and is therefore much larger than a regular docking
of protein-ligand docking for those cases where the binding
site is unknown.
This approach is especially relevant in applications such as
reverse virtual screening and structure-based functional
annotation of proteins, since it requires only the knowledge
of the three-dimensional structure of the target proteins and
can allow for the discovery of unexpected interactions that
may occur at previously unidentified binding sites.
7: SUPPORT VECTOR MACHINE-BASED
TRANSMEMBRANE PROTEIN TOPOLOGY
PREDICTION
Tim Nugent and David Jones (University College London, UK)
box, the number of energy evaluations carried out by the
docking program is usually set to a proportionally higher
value, with a corresponding increase in the running time.
This shortcoming has been partially overcome by using
known protein binding sites as targets for reverse-virtual
screening. While this approach enables faster reverse virtual
screening, it limits the universe of candidate targets to those
proteins that have clearly identified binding sites and only to
those sites within the protein. Ideally, a reverse virtual
screening approach would require only the knowledge of the
three-dimensional structure of the candidate target proteins
and would allow for the discovery of unexpected
interactions that may occur at previously unidentified
binding sites.
The use of predicted binding sites is evaluated here as a tool
to focus the docking of small molecule ligands into protein
structures, simulating cases where the real binding sites are
unknown. The resulting approach consists of a few
independent docking jobs carried out on small boxes that are
centered on the predicted binding sites, as opposed to one
larger blind docking job that samples the complete protein
structure. The assumption behind the use of a few predicted
binding sites is that only a handful of possible small-molecule binding sites exist on protein structures, and that
these sites can be reliably identified. Therefore, it is not
necessary to explore a very large number of sites and a gain
in speed is possible without a significant loss in coverage.
Tested on a set of 77 protein-ligand complexes and
compared with blind docking this approach is shown to: (1)
identify the correct binding site more frequently than blind
docking; (2) produce more accurate docking poses for the
ligand; (3) require less computational time. Additionally, the
results show that very few real binding sites are missed in
spite of focusing on 3 predicted binding sites per protein.
We also illustrate the performance of the binding site
detection algorithm on comparative models, simulating a
scenario where an experimental structure of a protein is not
available.
We present another approach for biasing the docking toward
the predicted binding sites that is alternative to running
independent docking experiments with smaller grids
centered on the predicted site. The approach consists in
masking the regions that are outside a sphere of 11.0Å
radius centered at the predicted sites by assigning to them
extremely high energy. We tested this alternative protocol
by masking all but the first three predicted sites, with a
resulting overall accuracy that is still much lower than with
any of the focused docking protocols. As a control, we
repeated the same experiment by masking one site at a time,
and we obtained results that were indistinguishable from the
ones produced by the focused docking protocol. Therefore,
we conclude that the simultaneous presence of the hot-spot
regions is suboptimal for achieving a thorough exploration
of the correct binding site, and that there is an advantage in
exploring individually the predicted sites one at a time.
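The masking alternative described in this paragraph can be sketched as follows: every grid point farther than 11.0 Å from all predicted site centers receives a very high energy, so the docking search is biased away from it. This is an illustrative sketch only. The function name, the flat list representation of the grid, and the penalty value are assumptions; the abstract says only that masked regions are assigned "extremely high energy".

```python
import math

HIGH_ENERGY = 1.0e6  # hypothetical penalty; the actual value is not given

def mask_grid(grid_points, energies, site_centers, radius=11.0):
    """Return a copy of energies with points outside every site sphere masked.

    grid_points and site_centers are sequences of (x, y, z) coordinates in
    angstroms; energies holds one value per grid point.
    """
    masked = []
    for point, energy in zip(grid_points, energies):
        inside_any_site = any(math.dist(point, center) <= radius
                              for center in site_centers)
        masked.append(energy if inside_any_site else HIGH_ENERGY)
    return masked
```

Passing three centers corresponds to the experiment that keeps the first three predicted sites at once, and passing a single center corresponds to the one-site-at-a-time control.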
Overall the results indicate that, by improving the sampling
in regions that are likely to correspond to binding sites, the
focused docking approach increases accuracy and efficiency
Due to the paucity of alpha-helical transmembrane
protein crystal structures, in silico approaches are
essential for structural analysis. We present a support
vector machine-based topology predictor that integrates
both signal peptide and re-entrant helix prediction, and
present the results of application to a number of
complete genomes.
Alpha-helical transmembrane (TM) proteins constitute
roughly 30% of a typical genome and are involved in a wide
variety of important biological processes including cell
signaling, transport of membrane-impermeable molecules
and cell recognition. However, due to the experimental
difficulties involved in obtaining high quality crystals, this
class of protein is severely under-represented in structural
databases, making up only 1% of known structures in the
PDB. Given the biological and pharmacological importance
of TM proteins, an understanding of their topology - the
total number of TM helices, their boundaries and in/out
orientation relative to the membrane - is essential for
structural and functional analysis, and directing further
experimental work. In the absence of structural data,
bioinformatic strategies thus turn to sequence-based
prediction methods.
Early prediction methods, based on the physicochemical
principle of a sliding window of hydrophobicity combined
with the 'positive-inside' rule [1], have been superseded by
machine learning approaches which prevail due to their
probabilistic orientation. These include Hidden Markov
models (HMMs), neural networks (NNs) and more recently,
support vector machines (SVMs). While NNs and HMMs
are capable of producing multiple outputs, SVMs are binary
classifiers; therefore, multiple SVMs must be employed to
classify the numerous residue preferences before being
combined into a probabilistic framework. While multiclass
ranking SVMs do exist, they are generally considered
unreliable, since in many cases no single mathematical
function exists to separate all classes of data from one
another.
Biol. 1992 May 20;225(2):487-94.
[2] Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von
Heijne and Søren Brunak. Improved prediction of signal
peptides: SignalP 3.0. J. Mol. Biol., 340:783-795, 2004.
[3] Emanuelsson O, Brunak S, von Heijne G, Nielsen H.
Locating proteins in the cell using TargetP, SignalP and
related tools. Nat Protoc. 2007;2(4):953-71.
[4] Käll L, Krogh A, Sonnhammer EL. A combined
transmembrane topology and signal peptide prediction
method. J Mol Biol. 2004 May 14;338(5):1027-36.
[5] Käll L, Krogh A, Sonnhammer EL. An HMM posterior
decoder for sequence feature prediction that includes
homology information. Bioinformatics. 2005 Jun;21 Suppl
1:i251-7.
[6] Viklund H, Granseth E, Elofsson A. Structural
classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete
genomes. J Mol Biol. 2006 Aug 18;361(3):591-603.
[7] Jones DT. Improving the accuracy of transmembrane
protein topology prediction using evolutionary information.
Bioinformatics. 2007 Mar 1;23(5):538-44.
10: DOMAIN REARRANGEMENT AND DOMAIN
CREATION IN THE EVOLUTION OF NEW
PROTEINS
However, SVMs are capable of learning complex
relationships among the amino acids within a given window
with which they are trained, particularly when provided with
evolutionary information, and are also more resilient to the
problem of over-training compared to other machine
learning methods.
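For contrast with the SVM approach, the earlier sliding-window hydrophobicity idea mentioned in this abstract can be sketched in a few lines. The abstract does not say which scale, window length or cutoff the early methods used; the Kyte-Doolittle hydropathy values, the 19-residue window and the 1.6 threshold below are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_hydropathy(sequence, window=19):
    """Mean hydropathy of every full window, indexed by window start."""
    if len(sequence) < window:
        return []
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    total = sum(values[:window])
    profile = [total / window]
    for end in range(window, len(values)):
        total += values[end] - values[end - window]  # slide by one residue
        profile.append(total / window)
    return profile

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy meets the threshold."""
    return [start for start, mean in enumerate(window_hydropathy(sequence, window))
            if mean >= threshold]
```

A profile like this cannot tell a TM helix from a signal peptide or a re-entrant helix, since all of them are hydrophobic stretches, which is the limitation that motivates the dedicated classifiers discussed here.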
One problem faced by modern topology predictors is the
discrimination between TM helices and other features
composed largely of hydrophobic residues. These include
targeting motifs such as signal peptides and signal anchors,
amphipathic helices, and re-entrant helices – membrane
penetrating helices that enter and exit the membrane on the
same side, common in many ion channel families. The high
similarity between such features and the hydrophobic profile
of a TM helix frequently leads to crossover between the
different types of predictions. Should these elements be
predicted as TM helices, the ensuing topology prediction is
likely to be disrupted. Some prediction methods, such as
SignalP [2] and TargetP [3], are effective in identifying
signal peptides, and may be used as a pre-filter prior to
analysis using a TM topology predictor. Phobius [4] uses an
HMM to successfully address the problem of signal peptides
in TM protein topology prediction, while PolyPhobius [5]
further increases accuracy by including homology
information. Other methods such as TOP-MOD [6] have
attempted to incorporate identification of re-entrant regions
into a TM topology predictor but there is significant room
for improvement.
A key element when constructing any prediction method is
the use of a high quality data set for both training and
validation purposes. Extracting a training set from available
databases requires a number of critical decisions to
be made. As an example in the case of TM proteins,
searches of databases such as the PDB using the keyword
'transmembrane' will return both genomically encoded TM
proteins as well as TM proteins that are not native, such as
venoms and bacterial colicins. Furthermore, orientation and
helix boundary errors in databases are not infrequent and
add an element of noise. While such noise is often well
tolerated by machine learning methods, the problem is more
significant in smaller data sets.
We thus present a new TM topology predictor trained and
benchmarked with full cross-validation on a novel data set
of 131 sequences, with topologies derived solely from crystal
structures. The method uses evolutionary information and
four SVMs, combining the outputs using a dynamic
programming algorithm, to return a list of predicted
topologies ranked by overall likelihood, and incorporates
signal peptide and re-entrant helix prediction. Overall, the
method predicted the correct topology and location of TM
helices for 88% of the test set, an improvement of 11% on
our previous NN-based method [7]. An additional SVM has
been trained to discriminate between TM and globular
proteins with a low false positive rate of 0.4%, making this
method highly suitable for whole genome analysis.
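The dynamic programming step that turns per-residue scores into a topology can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single per-residue score (positive favouring helix, negative favouring loop) in place of the four SVM outputs, returns only the single best segmentation rather than a ranked list, and ignores orientation, signal peptides and re-entrant helices. The function name, the scores and the length limits are all hypothetical.

```python
from math import inf

def segment_helices(scores, min_len=17, max_len=25):
    """Find the helix/loop segmentation with the highest total score.

    A residue in a helix contributes +score, a residue in a loop
    contributes -score. Helices must be min_len..max_len residues long
    and be separated by at least one loop residue.
    Returns (best_score, [(start, end), ...]) with half-open intervals.
    """
    n = len(scores)
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)

    loop = [-inf] * (n + 1)        # best score with residue i-1 in a loop
    helix = [-inf] * (n + 1)       # best score with residue i-1 closing a helix
    loop_prev = [None] * (n + 1)   # state of residue i-2 on the best loop path
    helix_start = [None] * (n + 1) # start index of the helix closing at i
    loop[0] = 0.0                  # empty prefix; a helix may start at residue 0

    for i in range(1, n + 1):
        if loop[i - 1] >= helix[i - 1]:
            loop[i], loop_prev[i] = loop[i - 1] - scores[i - 1], "L"
        else:
            loop[i], loop_prev[i] = helix[i - 1] - scores[i - 1], "H"
        for length in range(min_len, max_len + 1):
            j = i - length
            if j < 0:
                break
            candidate = loop[j] + prefix[i] - prefix[j]
            if candidate > helix[i]:
                helix[i], helix_start[i] = candidate, j

    # Trace back from the better of the two final states.
    state = "L" if loop[n] >= helix[n] else "H"
    i, helices = n, []
    while i > 0:
        if state == "H":
            j = helix_start[i]
            helices.append((j, i))
            i, state = j, "L"
        else:
            state = loop_prev[i]
            i -= 1
    helices.reverse()
    return max(loop[n], helix[n]), helices

# Toy input with short helices so the example stays readable.
toy_scores = [-1, -1, 2, 2, 2, -1, -1, 2, 2, 2, -1]
print(segment_helices(toy_scores, min_len=3, max_len=5))
```

Producing a ranked list, as the abstract describes, would require keeping the k best partial solutions at each position instead of one.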
REFERENCES
[1] von Heijne G. Membrane Protein Structure Prediction,
Hydrophobicity Analysis and the Positive-inside Rule. Mol
Diana Ekman, Åsa K. Björklund and Arne Elofsson
(Biochemistry and Biophysics, Stockholm University,
Sweden)
The metazoan lineage has unusually high rates of
domain architecture creation, and the architectures
contain relatively large numbers of domains. The
introduction of domains amenable to exon shuffling
seems to explain some of the increase. Further, most
domain families are ancient and de novo domain
creation is a rare event.
Duplication, domain rearrangement and de novo creation are
some of the mechanisms involved in evolution of new
proteins. Domain rearrangements are interesting since new
functionalities can be created through a single event,
frequently insertion of a domain at either terminus. We have
found that the rates of domain architecture creation are
similar in different phylogenetic groups and have remained
roughly constant throughout evolution. An exception is the
metazoan lineage where the rates are clearly elevated, and
their domain architectures also contain relatively large
numbers of domains. The introduction of a set of domains
amenable to exon shuffling seems to have been an important
ligands into account during the comparison. Here we
propose a novel methodology that combines the advantages
of both approaches, without being impaired by the above-mentioned limitations.
Comparison method
Our method identifies structural motifs in protein binding
pockets in a ligand-dependent manner and does not require
the proteins, or their bound ligands, to be similar. Therefore
the algorithm enables the pair-wise comparison of structures
containing different ligands that interact with different
protein folds. The procedure comprises two steps. We first
identify local similarities shared by the input structures [3].
Subsequently we analyse the coordinates of the bound
ligands, looking for the largest common fragment that has a
similar position in space. To this end the ligands belonging
to each binding pocket are superimposed according to the
same roto-translation used for the protein residues. The
algorithm then enumerates all the possible combinations of
fragments (subset of connected atoms) using a recursive
depth-first procedure and identifies the one with the highest
score. This score is defined as a trade-off between the size
of the common fragment and the fact that it should be
present in the highest possible number of bound ligands.
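The fragment enumeration described above can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: the ligands are assumed to be already superimposed, a fragment atom counts as matched when another ligand has an atom of the same element within a hypothetical tolerance `tol`, and the score is assumed to be fragment size multiplied by the number of supporting ligands. All names are illustrative.

```python
from math import dist

def connected_fragments(adjacency):
    """Yield every connected subset of atoms, grown recursively depth-first."""
    seen = set()

    def grow(fragment):
        if fragment in seen:
            return
        seen.add(fragment)
        yield fragment
        # Atoms bonded to the fragment but not yet part of it.
        frontier = {nb for atom in fragment for nb in adjacency[atom]} - fragment
        for nb in frontier:
            yield from grow(fragment | {nb})

    for atom in adjacency:
        yield from grow(frozenset([atom]))

def fragment_matches(fragment, ref_atoms, other_atoms, tol):
    """True if each fragment atom has a same-element atom nearby in the other ligand."""
    for idx in fragment:
        element, pos = ref_atoms[idx]
        if not any(el == element and dist(pos, p) <= tol for el, p in other_atoms):
            return False
    return True

def best_common_fragment(ref_atoms, adjacency, other_ligands, tol=1.0):
    """Return (score, fragment) for the best-scoring connected fragment."""
    best = (0, frozenset())
    for fragment in connected_fragments(adjacency):
        support = sum(
            fragment_matches(fragment, ref_atoms, other, tol)
            for other in other_ligands
        )
        score = len(fragment) * support  # assumed size/coverage trade-off
        if score > best[0]:
            best = (score, fragment)
    return best

# Reference ligand: a three-atom chain P-O-C given as (element, xyz).
ref = [("P", (0.0, 0.0, 0.0)), ("O", (1.5, 0.0, 0.0)), ("C", (3.0, 0.0, 0.0))]
bonds = {0: [1], 1: [0, 2], 2: [1]}
others = [
    [("P", (0.2, 0.0, 0.0)), ("O", (1.4, 0.1, 0.0))],
    [("P", (0.0, 0.3, 0.0)), ("O", (1.6, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0))],
]
print(best_common_fragment(ref, bonds, others))
```

Exhaustive enumeration grows exponentially with ligand size, so a practical implementation would need pruning; the sketch is only meant for small molecules.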
Benchmark
We devised a benchmark to test the assumption that the
presence of specific protein residues implies a discernible
preference for certain ligand fragments. To this end we
identified a set of non-redundant pairwise structural
similarities between binding pockets belonging to proteins
of different folds. Each similarity implies a roto-translation
of the binding sites and, accordingly, of the bound ligands.
Using the LIGANDSCOUT software we identified a total of
3161 pharmacophoric groups in the 210 ligands considered
and identified 450 pairs of pharmacophores which are
superimposed by the above-mentioned roto-translation. 364
pairs involved pharmacophores with compatible chemical
roles, while 86 involved non-compatible pairs. The result of
this analysis shows that the fraction of compatible pairs of
pharmacophores tends to decrease as the distance from the
protein residues increases. Moreover it is interesting to note
that the fraction of compatible pairs drops when the distance
exceeds the threshold value for the formation of hydrogen
bonds. This benchmark shows that the correspondences we
identify have a functional significance, because the
matching pharmacophores are those that effectively interact
with the residues involved in the superimposition.
Identification of structural motifs in the PDB
To demonstrate the usefulness of our approach we
performed a comparative analysis of all the binding pockets
in the PDB structures classified in SCOP (6.5 x 10^4 binding
sites). We focused on binding sites belonging to proteins of
different folds, involved in binding similar as well as
different ligands. We used sequence identity together with
the SCOP and CATH classifications to discard all the
matches involving homologous structures. This large-scale
comparison resulted in the identification of 657 protein
structural motifs associated with specific ligand fragments,
despite a high variability in the structure of the ligand as a
whole. In addition, a smaller number (570) of motifs
factor behind this explosion of new domain architectures in
metazoa.
In contrast to the domain architectures, most known domain
families existed already in the last eukaryotic common
ancestor. However, many proteins have incomplete domain
coverage, and may hence contain domains created de novo.
To investigate this, we have studied the amount of
innovation in Saccharomyces cerevisiae and found that at
least two thirds of the residues are aligned to homologs in
non-fungi, whereas only a minor fraction is specific to S.
cerevisiae.
In addition, the species specific regions are often short,
disordered sequences located at either the N- or C-terminus.
11: A NOVEL METHOD FOR THE DETECTION OF
PROTEIN LOCAL STRUCTURAL MOTIFS BINDING
SPECIFIC LIGAND FRAGMENTS
Gabriele Ausiello1, Pier Federico Gherardini1, Elena Gatti1,
Ottaviano Incani2 & Manuela Helmer-Citterich3
(1 Dept. of Biology, University of Rome Tor Vergata, Italy,
2 Dept. of Chemistry, University of Rome Tor Vergata, Italy,
3 Centro di Bioinformatica Molecolare, University of Rome
Tor Vergata, Italy)
We present an algorithm for the comparison of protein
binding pockets that identifies small structural motifs
binding specific ligand fragments. We applied this
method to all proteins of known structure, identifying
657 motifs. Some of these are present in as many as 60
folds.
Introduction
In order to understand the rules underpinning the interaction
of proteins with small ligands, a wealth of information can
be derived from the comparative analysis of binding pockets
of known structure. Such analysis can be performed starting
from either the ligand or the protein. In the former case a
number of sites that bind a molecule of interest are selected.
The ligand moieties are subsequently superimposed in order
to identify similarities and differences in the neighbouring
protein atoms [1]. This approach necessarily limits the
analysis to pockets that bind ligands with an overall similar
structure, since these are used as a reference to guide the
superimposition of the binding pockets.
Conversely, if the analysis starts from the protein side, one
can mine the PDB, looking for binding motifs which are
present in non-homologous proteins. Since such motifs have
evolved independently multiple times, they should represent
particularly favourable modes of interaction between protein
residues and ligand moieties [2]. However this approach
does not systematically take the structure of the bound
with known structure, structural alignments of the fragments
were made using STRUCTAL (5). If the score for a certain
fragment alignment was above a cut-off, the fragments were
considered as an internal repeat. In case of overlapping hits
in the same protein, the fragment pair with the highest score
was considered the correct one.
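The selection rule in the preceding paragraph (keep hits above a cut-off and, among overlapping hits, keep the highest-scoring one) can be written as a short greedy filter. This is a sketch under assumptions: hits are taken to overlap when any of their fragments share helices, fragments are given as inclusive helix ranges, and the scores and cut-off below are made-up values rather than STRUCTAL or SHRIMP output.

```python
def intervals_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def hits_overlap(hit_a, hit_b):
    """True if any fragment of one hit overlaps any fragment of the other."""
    return any(
        intervals_overlap(frag_a, frag_b)
        for frag_a in hit_a[1:]
        for frag_b in hit_b[1:]
    )

def select_repeats(hits, cutoff):
    """Keep hits scoring above the cut-off; among overlapping hits keep the best.

    Each hit is (score, fragment_1, fragment_2) with fragments given as
    inclusive (first_helix, last_helix) ranges.
    """
    accepted = []
    candidates = sorted(
        (h for h in hits if h[0] > cutoff), key=lambda h: h[0], reverse=True
    )
    for hit in candidates:
        if not any(hits_overlap(hit, kept) for kept in accepted):
            accepted.append(hit)
    return accepted

# Hypothetical hits for a 12 TMH protein.
hits = [
    (12.0, (1, 6), (7, 12)),  # 6+6 repeat
    (9.5, (2, 7), (8, 12)),   # overlaps the hit above and scores lower
    (3.0, (1, 3), (4, 6)),    # below the cut-off
]
print(select_repeats(hits, cutoff=5.0))
```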
In the human genome the most common duplication seems
to be 6+6 TMH. Of all 12 TMH proteins, two thirds contain
such a duplication. For smaller proteins (6-8 TMH) an internal
repeat was found in less than 20%. This indicates that
longer proteins are more likely to contain internal repeats.
However, it might also be so that longer repeats are easier to
detect.
The same trend can be seen in yeast (S. cerevisiae), E. coli
and in the test set of proteins with known structure: the more
transmembrane helices a chain has, the larger the fraction of
chains with an internal duplication. Internal repeats seem to
be slightly more common in E. coli than in yeast, especially in
9-11 TMH chains, and in the test set a higher fraction of the
small proteins (6 TMH) contains internal repeats. Since no
duplication events are found in the large G-protein coupled
receptor family, the fraction of 7 TMH proteins containing an
internal repeat is lower in human than in the other species,
lowering the overall fraction of duplicated genes from 34% to 22%.
One of the most evident examples of internal repeat with
known structure is an acriflavin resistance protein with 12
membrane spanning segments (1oye). Although the
sequence identity between the two halves is less than 20%,
they are structurally very similar to each other (STRUCTAL
score 1834). In search of homologues of different lengths
the sequence was blasted against a database of almost 600
bacterial genomes. After three rounds of PSI-BLAST (6) we
found frequency peaks both at homologues with 12 and 6
TM segments. The majority of the 6 TMH hits proved to be
two peptide chains involved in the Sec-complex, but some
examples were found where the proteins were a part of the
Acr family. A phylogenetic tree containing 6 and 12 TMH
homologues from ten genomes was made in order to find
out how the different proteins are related. The longer
proteins were split in two parts. The tree clearly separates
Sec proteins from proteins in the Acr family and almost all
N-terminal parts are clearly separated from the C-terminal
parts. The homologues with 6 TMH which are not Sec
proteins group together with the N-terminal or the C-terminal
parts. They are always found in pairs in the
genomes, and if one homologue is located in the N-terminal
clade of the tree, the other is found in the C-terminal clade.
This suggests an evolutionary model where a 6 TMH protein
is duplicated and then the two copies fuse together to form a
larger protein, while in the cases with two short homologues
the fusion has not taken place (yet).
REFERENCES:
1. Abramson J, Smirnova I, Kasho V, Verner G, Kaback
HR, Iwata S: Structure and mechanism of the lactose
permease of Escherichia coli. Science 2003, 301:610-615
2. Murakami S, Nakashima R, Yamashita E, Yamaguchi A:
Crystal structure of bacterial multidrug efflux transporter
AcrB. Nature 2002, 419:587-593
were identified on the structure but no common fragment
was found in the bound molecules. Overall these figures
suggest that the presence of specific residues in a binding
pocket confers a discernible preference in the identity and
position of a number of ligand atoms. Each motif is found in
at least 2 folds. 104 motifs map to three folds, 90 to 4-10
folds and a few exceptional cases involve from 17 up to 63
different folds. Such fragments are usually small compared
to the whole ligand. The 330 motifs associated with two or
more ligand atoms have been manually analysed in order to
categorise the types of fragments recognised. The results of
this classification show that the vast majority of motifs are
involved in the binding of anions (phosphate and carboxyl
groups, 215 motifs) and nucleotides (35). Other highly
represented motifs bind metals (14) and heme groups (10).
Overall these figures confirm that our methodology is
sound. Most of our results comprise motifs that are already
known in the literature as having widespread occurrence in
fold space. More importantly, this analysis highlights that no
other motifs occur with comparable frequency in the PDB.
A more in-depth analysis of nucleotide binding sites showed
for the first time their modular nature. The same portion of
the nucleotide can be recognised by different motifs and
these are variously combined in proteins with different
folds.
REFERENCES
1 Nobeli I et al. 2001; Nucleic Acids Res. 29(21):4294-309
2 Kinoshita K et al. 1999; Protein Eng. 12(1):11–14
3 Ausiello G et al. 2008; BMC Bioinformatics. 9 Suppl 2:S2
12: HOW COMMON ARE INTERNAL REPEATS IN
ALPHA-HELICAL MEMBRANE PROTEINS?
Jenny Falk & Arne Elofsson (Biochemistry and
Biophysics, Stockholm University, Sweden)
In a genomic scan for membrane proteins that contain
internal repeats we found that 40% of all TM-proteins
with more than 6 predicted TM-regions contain a
detectable duplication, in agreement with structural
data. In addition, it was possible in only a few cases
to detect the existence of the parts as separate genes.
After structure determination it has been noticed that some
membrane proteins have an internal symmetry (1,2). The
most likely explanation is that there has been a duplication
of genes, where either the whole gene or a part of it is
duplicated and added to the already existing protein
encoding gene. This would result in a longer protein which
traverses the membrane more times than the original protein.
This would provide a possibility to circumvent the
constraints the hydrophobic environment imposes on the
evolution of membrane proteins. In this study a search for
such membrane proteins has been performed, using
sequence based as well as structure based methods.
In the sequence based search transmembrane helices (TMH)
were predicted by PRODIV-TMHMM (3) and protein
profiles were made. The profiles were then split into
fragments according to the predicted topology, and the
fragments were aligned to each other using the profile-profile alignment method SHRIMP (4). In addition, for pairs
electrostatic expansions up to polynomial order L=30 on a 2
Gb personal computer. As expected, 3D correlations are
found to be considerably faster than the former 1D Hex
correlations but, surprisingly, 5D correlations are often
slower than 3D correlations. Nonetheless, we show that 5D
correlations will be advantageous when calculating multi-term knowledge-based interaction potentials.
When docking the 84 complexes of the Protein Docking
Benchmark, blind 3D shape-based correlations take around
30 minutes on a contemporary personal computer and find
acceptable solutions within the top 20 in 6 cases. However,
applying a simple angular constraint to focus the calculation
around the receptor binding site and adding electrostatics to
the correlation produces acceptable solutions within the top
20 in 28 cases. Further constraining the search to the ligand
binding site gives up to 48 solutions within the top 20, with
calculation times of just a few minutes per complex. Hence
the approach described provides a practical and fast tool for
rigid body protein-protein docking, especially when some
prior knowledge about one or both binding sites is available.
Hex is available under a no-cost academic licence from:
http://www.csd.abdn.ac.uk/hex/
15: AN APPROACH TO TRANSMEMBRANE
PROTEIN STRUCTURE PREDICTION WITH
STOCHASTIC DYNAMICAL SYSTEMS USING
BACKWARD SMOOTHING
3. Viklund H, Elofsson A: Best alpha-helical transmembrane
protein topology predictions are achieved using hidden
Markov models and evolutionary information. Protein Sci
2004, 13:1908-1917
4. Bernsel A, Viklund H, Elofsson A: Remote homology
detection of integral membrane proteins using conserved
sequence features. Proteins, in press
5. Gerstein M, Levitt M: Comprehensive assessment of
automatic structural alignment against a manual standard, the
SCOP classification of proteins. Protein Sci 1998, 7:445-456
6. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang
Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs.
Nucleic Acids Res 1997, 25:3389-3402
13: ACCELERATING AND FOCUSING PROTEIN-PROTEIN
DOCKING CORRELATIONS USING MULTI-DIMENSIONAL
ROTATIONAL FFT GENERATING FUNCTIONS
Dave Ritchie (University of Aberdeen, UK), Dima Kozakov
(Boston University, USA), Sandor Vajda (Boston University,
USA).
We have recently developed an analytic 6D polar Fourier
correlation expression for rigid-body FFT protein-protein
docking. This approach can rapidly calculate 3D
and 5D rotational correlations, and is well suited for
focusing and accelerating the calculation around known
or hypothesised binding sites when such information is
available.
Predicting how proteins interact at the molecular level is a
computationally intensive task. Many protein docking
algorithms begin by using FFT correlation techniques to
find putative rigid body docking orientations. Most such
approaches use 3D Cartesian grids and are therefore limited
to computing 3D translational correlations. However,
translational FFTs can speed up the calculation in only three
of the six rigid body degrees of freedom, and they cannot
easily incorporate prior knowledge about a complex to focus
and hence further accelerate the calculation. Furthermore,
several groups have developed multi-term interaction
potentials and others use multi-copy approaches to simulate
protein flexibility, which both add to the computational cost
of FFT-based docking algorithms. Hence there is a need to
develop more powerful and more versatile FFT docking
techniques.
We have recently developed a closed-form 6D spherical
polar Fourier correlation expression from which arbitrary
multidimensional multi-property multi-resolution FFT
correlations may be generated. The approach has been
implemented in the Hex docking program to calculate 3D
and 5D rotational correlations of protein shape and
Takashi Kaburagi and Takashi Matsumoto (Waseda
University, Japan)
A backward smoothing approach utilizing a stochastic
dynamical system with two-dimensional vector
trajectories is used to predict transmembrane protein
structures. Given a sequence of amino acids with
unknown structures, the presence/absence of each
residue in a transmembrane region is predicted by the
backward smoothing process.
In this study, we have developed a machine learning
algorithm for prediction of the structures of a single class of
protein: transmembrane proteins.
Transmembrane proteins have long been considered to be a
critical factor in understanding biological functions such as
Since the model structure employs a left-to-right topology,
the proposed scheme is expected to yield better results than
our previous prediction scheme.
From a biological point of view, this scheme can be
explained as follows: The proposed scheme is designed to
predict the annotation of the target protein from the N-terminus. Since the translation process in protein
biosynthesis starts from the N-terminus, this scheme is
expected to yield good results. Moreover, as an amino acid
chain grows in the translation process, amino acids are
added at the carboxyl end of the chain. The growing chain
immediately tends to fold into a particular conformation.
Because of this tendency, when predicting the state at
position “t,” it seems natural to use the sequence
information from position “t” to the end of the sequence and
not from the beginning of the sequence to position “t-1.”
In this study, we also compare the performance of
five prediction methods applied to our proposed
model.
The five methods that we used are as follows:
(i) The proposed method (backward smoothing)
(ii) Our previous method
(iii) The standard Viterbi method
(iv) A standard smoothing method
(v) A “forward” smoothing method
It should be noted that in order to perform experiments
accurately, it is necessary to use appropriate data sets.
Currently, one of the most difficult problems in protein
structure prediction in general, and in transmembrane protein
structure prediction in particular, is
obtaining appropriate data sets for experiments.
We selected two publicly available data sets collected for
benchmarking various algorithms. In this study, we also
discuss the accuracy of the predictions of our algorithm,
which predicts whether particular amino acids are present in
a transmembrane region. The evaluation methods that we
have followed are the same as those used in Moller et al.
For the purpose of comparison, we have tested the
performance of TMHMM, HMMTOP, and SOSUI, which
are three well-known transmembrane structure prediction
tools, using the same test data sets.
We observed that the proposed backward smoothing method
has a prediction accuracy of 92.3%. It should be noted that
the five prediction methods applied to our proposed model
used the same model and the same parameters.
It was observed that among the five methods, the proposed
backward smoothing method had the best performance.
Precise comparisons with other prediction algorithms are
difficult because the sequences used for their training may
have been different. However, for comparison purposes, we
tested the same test sequences against three well-known
tools for predicting transmembrane helices.
In this experiment, the “backward” smoothing scheme
performed better than the three well-known
prediction tools. In this study, we have proposed a novel
scheme (the backward smoothing scheme) to predict
transmembrane regions utilizing a finite-state stochastic
dynamical system.
cell signaling, ion transport, and intercellular
communication.
Because of the biological and pharmaceutical importance,
the identification of transmembrane helices in membrane
proteins is a priority. Although promising methods in X-ray
crystallography and nuclear magnetic resonance (NMR)
have begun to open avenues to the determination of these
structures, the number of known three-dimensional
structures remains small. Therefore, reliable algorithms to
predict transmembrane protein structures would be very
useful.
There are two basic methods for predicting protein
structures.
The first method is to use algorithms that are based solely on
the construction principles of proteins associated with the
physicochemical properties of amino acids. The algorithms
do not involve any sort of training. In this method,
windowed averages of physicochemical quantities are
calculated. There are several successful examples of
algorithms of this type.
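A windowed average of a physicochemical quantity can be computed in a few lines. The sketch below assumes the Kyte-Doolittle hydropathy scale and a 19-residue window, neither of which is specified in the abstract; the function name and the example sequence are illustrative only.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the abstract names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def windowed_hydropathy(sequence, window=19):
    """Return the mean hydropathy of each full window along the sequence.

    The i-th value is the average over residues i .. i+window-1, so the
    result has len(sequence) - window + 1 entries.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
profile = windowed_hydropathy("KKDE" + "LIVALLAVIGLLVAFILAL" + "RKDE", window=19)
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak, round(profile[peak], 2))
```

Windows whose average exceeds a chosen threshold are then the candidate transmembrane segments; choosing that threshold is the part that training-free methods fix by hand.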
The second method is to collect data sets of known
structures, to extract the features from the data set, and to
apply machine learning algorithms to make predictions.
Some improvements have been made in using this second
method, but further development of algorithms is necessary
to improve the reliability of predictions.
We used a novel machine learning algorithm to predict
protein structures, and we also evaluated the reliability of
the predictions.
A machine learning algorithm assumes that there are models
and associated parameters behind the available data sets.
Generally, the degree of success of a machine learning
algorithm depends on two factors: how well the model
structure characterizes the target molecule from which the
data was taken and how well the learning algorithm
incorporates the available data sets.
The major features of the proposed algorithm are as follows:
(i) The hidden Markov model (HMM) topology used in the
proposed scheme consists of open-loop connections of
submodels. The submodels are made up of two types: a
transmembrane region submodel and a loop region
submodel. A stochastic dynamical system runs concurrently
with the inner state dynamics so that once a dynamical
system leaves a particular state, it does not return to that
state. In contrast, some of the previous HMM-based
algorithms were designed to have five or seven states that
could be revisited.
(ii) The proposed scheme utilizes a finite state stochastic
dynamical system with two-dimensional vector trajectories
consisting of a hydropathy index and formal charge.
For a given sequence of amino acids of unknown structure,
the presence of each residue in a transmembrane region was
predicted by a backward smoothing process.
The proposed prediction scheme based on backward
smoothing emphasizes the dependency more on the previous
state than the previous observation data once the previous
state has been estimated.
[2] Laurie ATR, Jackson RM, Bioinformatics, 21, 1908-1916 (2005)
21: COIL WITHIN THE MEMBRANE:
STRUCTURAL ANOMALY FOR FUNCTIONAL
NEEDS
Anni Kauko, Kristoffer Illergård & Arne Elofsson (Center
for Biomembrane Research, Stockholm University, Sweden)
To model and understand membrane proteins, an
understanding of their different substructures is
crucial. We have for the first time analysed coil within
the membrane core. These polar segments make up 7% of
core residues and are buried or located at polar cavities.
They are conserved and functional, particularly in
transporters, where coil can introduce the polarity and
flexibility required for function.
Introduction
Membrane proteins perform many essential functions. They
constitute 25% of the proteome and over half of drug targets.
However, due to experimental difficulties only 1% of
structures in the PDB are membrane proteins. Therefore it is of
special importance to predict different aspects of membrane
protein structure. For this purpose it is crucial to understand
the properties of membrane protein substructures.
Traditionally helical membrane proteins have been seen as
simple regular alpha-helix bundles. However, recent
structures have shown various substructures that differ from
this view, including reentrant regions, interfacial helices and
marginally hydrophobic helices. Coil is polar due to its
backbone polar groups, and the coil within the membrane core
(7% of core residues) has been ignored so far. Here we present
the first systematic analysis of coil in the membrane core.
Structural properties
Random coil segments within the deep membrane core can
be divided into three separate classes (reentrants, breaks and
kinks). Reentrants are coil segments present in reentrant
regions that enter and exit the membrane from the same
side. Breaks are longer coil segments that clearly interrupt
the regular structure of a transmembrane helix. Kinks
represent small distortions of the helix geometry.
Coil has a higher preference for polar and charged
sidechains in the membrane core. This probably reflects the
preference of inherently polar coil for polar
environments. Moreover, glycine and proline are more
common in coil, regardless of whether the coil is located in the
membrane or in the globular region. Further, major coil
segments are typically buried or located in polar cavities,
thus preventing the polar backbone groups from being exposed
to the membrane. All these preferences are more pronounced in
reentrants and breaks than in kinks.
The proposed prediction scheme emphasizes the dependency
more on the previous state than the previous observation
data once the previous state has been estimated. Since the
model structure employs a left-to-right topology, the
proposed scheme is expected to yield better results than the
previous scheme. The experimental results suggest that the
backward smoothing scheme has a reasonably good
performance.
20: i-SITE: ENERGY-BASED METHOD FOR
PREDICTING LIGAND-BINDING SITES ON
PROTEIN STRUCTURES
Mizuki Morita, Tohru Terada, Shugo Nakamura & Kentaro
Shimizu (The University of Tokyo, Japan).
We have developed a method for predicting the
ligand-binding sites on protein structures. It is a simple
energy-based method and delivers high performance with
apo protein structures. We could also improve the accuracy
of prediction by re-ranking with amino acid
conservation scores.
Identifying ligand-binding sites on the protein surfaces is the
first step of drug design and improvement of protein
functions. We have developed a simple energy-based
method for predicting the locations of ligand binding sites
on protein 3D structures [1]. A notable feature of our
method is that it succeeds when applied to ligand-unbound
(apo) as well as ligand-bound (holo) forms of the proteins.
In our approach, the protein surface is coated with multiple
layers of probes to calculate the van der Waals interaction
energies between these probes and the protein. Energetically
favorable probes are then clustered and the resulting clusters
are ranked based on their total interaction energies.
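The probe-based procedure can be sketched in a few steps: score each probe, keep the favourable ones, cluster them, and rank the clusters by summed energy. This is not the i-SITE code. It assumes probe positions are supplied by the caller (the layered surface coating is not reproduced), a single 12-6 Lennard-Jones term with one hypothetical well depth `eps` and optimum distance `r_min` for all atoms, and single-linkage clustering with a hypothetical linking distance.

```python
from math import dist

def vdw_energy(probe, atoms, eps=0.1, r_min=4.0):
    """Sum a 12-6 Lennard-Jones term between one probe and all protein atoms."""
    total = 0.0
    for atom in atoms:
        ratio6 = (r_min / dist(probe, atom)) ** 6
        total += eps * (ratio6 * ratio6 - 2.0 * ratio6)
    return total

def cluster_points(points, link_dist):
    """Single-linkage clustering: points closer than link_dist share a cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        members, stack = [seed], [seed]
        while stack:
            current = stack.pop()
            near = [j for j in unvisited
                    if dist(points[current], points[j]) <= link_dist]
            for j in near:
                unvisited.remove(j)
            members.extend(near)
            stack.extend(near)
        clusters.append(members)
    return clusters

def rank_sites(probes, atoms, energy_cutoff=-0.05, link_dist=1.5):
    """Return [(total_energy, [probe, ...]), ...], most favourable site first."""
    scored = [(p, vdw_energy(p, atoms)) for p in probes]
    favourable = [(p, e) for p, e in scored if e <= energy_cutoff]
    points = [p for p, _ in favourable]
    sites = []
    for members in cluster_points(points, link_dist):
        total = sum(favourable[i][1] for i in members)
        sites.append((total, [points[i] for i in members]))
    sites.sort(key=lambda site: site[0])
    return sites

# Toy system: two protein atoms and four probes (all coordinates made up).
atoms = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
probes = [(4.0, 0.0, 0.0), (3.9, 1.0, 0.0), (24.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
for energy, members in rank_sites(probes, atoms):
    print(round(energy, 3), members)
```

Re-ranking by conservation, as mentioned in the abstract, would replace the final sort key with a combination of energy and a per-site conservation score.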
Our method was applied to two data sets from Laurie & Jackson:
134 proteins were used to tune the parameters, and the best
parameters were used to examine a set of 35 holo/apo
protein pairs. The results were compared with those of
two alternative methods, Q-SiteFinder [2] and Pocket-Finder [2].
In 80% (28/35) of the test cases, the ligand-binding site was
successfully predicted on a ligand-bound structure, and in
77% (27/35) it was successfully predicted on an unbound
structure. This represents a significant improvement over
conventional methods in detecting ligand-binding sites on
uncharacterized proteins. We could also improve the accuracy
of prediction by re-ranking candidate ligand-binding sites
with amino acid conservation scores.
REFERENCES:
[1] Morita M, Nakamura S, Shimizu K, Proteins, in press
Conserved and Functional
Coil within the membrane core is conserved, especially in
the case of breaks and reentrants. While in globular regions
substitution rates are equal for helices and coil, in the membrane
coil has significantly lower substitution rates. Further,
within the core, indel frequencies are equally low for helices and
coils. Even if accessibility is taken into account, coil has lower
substitution rates than helix in the membrane core. Thus
membrane coil is not more conserved because of its lower
accessibility, but most likely because of its functional
importance.
A functional role was found for ~60% of all reentrants,
for ~30% of all breaks and for a small fraction of kinks. All
functional coils (except one enzyme) are from channels and
transporters. In the classical case of potassium channels and
aquaporins, an exposed coil backbone from a reentrant
region forms a rigid selectivity filter. The second, and
perhaps most typical, case of coil functionality is a coil
segment that forms both a flexible binding site for
transported substance and is involved in large
conformational changes required for transport, exemplified
by calcium ATPase. Finally, a coil segment can form a
flexible hinge required for gating, as suggested for the
ATP/ADP carrier. Taken together, coil can provide the polarity
and flexibility required for transport. Thus coil within the
membrane represents a structural anomaly that serves the needs of
function.
23: MOLECULAR DYNAMICS SIMULATIONS
USING AN ALPHA-CARBON-ONLY KNOWLEDGE-BASED FORCE FIELD FOR PROTEIN STRUCTURE
PREDICTION
Patrick Buck and Chris Bystroff (Rensselaer Polytechnic Institute, USA)
dependencies of I-sites motifs in the protein structure
database.
The existence of strong sequence-structure correlations in
the database should enable us to develop and test folding
potentials for template-free protein structure prediction.
Knowledge-based potentials based on the statistical
occurrences of structural properties in native proteins have
proven to be the most successful approach to protein
structure prediction [8, 9]. Many of these approaches attempt
to discretize conformational space by fragment insertion
Monte Carlo [10] or chain build-up [11, 12] in folding
simulations. Although quite successful, these methods may
ignore intermediates along the folding pathway by strictly
optimizing the global fold energy [13]. Modeling folding
pathways is essential to the understanding of folding
kinetics and kinetic stability. Non-native intermediates along
the folding pathways may be required in the folding of some
knotted proteins [14].
In this study we use a reduced protein representation for
folding simulations in an alpha-carbon-only knowledge-based potential. Peptide residues are treated as beads on a
string, with backbone atoms for each residue lumped into a
single interaction center located at the position of each
alpha-carbon. Such a model can significantly reduce the cost
of computing trajectories to visualize long time-scale
dynamics such as in protein folding [15]. Recently, there has
been increased interest in simulating the physical folding
process using reduced protein representations and coarse-grained potentials [16-18]. To our knowledge, no alpha-carbon-only statistical potential for folding by molecular
dynamics simulations has been tried before, as very
few of the published statistical potentials act solely on
alpha-carbons [19-21], hinting at the difficulty
of calculating a realistic energy using a reduced model.
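To make the bead-on-a-string representation concrete, the following minimal sketch computes the internal coordinates of an alpha-carbon-only chain: the virtual bond angle at a bead and the virtual dihedral defined by four consecutive beads. The function names are illustrative assumptions and are not taken from the authors' software; the dihedral uses the standard atan2 formulation.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def virtual_bond_angle(p0, p1, p2):
    """Angle in degrees at bead p1 between beads p0 and p2."""
    u, v = sub(p0, p1), sub(p2, p1)
    c = dot(u, v) / (norm(u) * norm(v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def virtual_dihedral(p0, p1, p2, p3):
    """Signed dihedral in degrees defined by four consecutive beads."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    y = norm(b1) * dot(b0, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For a chain stored as a list of (x, y, z) tuples, calling these functions on sliding windows of three and four beads yields the angle series on which angle-dependent potentials of this kind would act.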
Our new knowledge-based energy function includes
potentials for virtual bond opening and dihedral angles,
hydrogen bond donor and acceptor probability fields, and a
local-structure dependent pair-wise potential. All are
position-specific and conditional on their unique amino acid
sequences, which differs somewhat from the more
common residue-specific potentials. As a first test of our
energy function, we folded, via Brownian Dynamics, 27
short protein segments of length 12 that were predicted to be
autonomous folding units. This set of protein segments
represented a variety of secondary structures including helix
N-caps, beta-hairpins, and a mix of loops and turns. Most of
the native structural preference was accounted for by local
virtual bond angle preferences and predicted contacts, but
the inclusion of a hydrogen bond probability field
significantly increased the observed frequency of the native
state. Additionally, the confidence of our predictions was
assessed by determining how much of the simulation was
spent in the largest cluster center compared to all other
clusters. If more than half of those structures submitted for
clustering fell into the largest cluster then those protein
segments were regarded as having a structural preference.
Of the 27 protein segments predicted, 19 were found to have
trajectories where more than half of the total simulation
could be clustered into one conformation. Additionally, 15
Folding initiation sites are short protein segments that fold
independently of their three dimensional context. We used
Brownian dynamics to fold peptides represented as
alpha-carbon positions only, guided by a
knowledge-based force field. The simulations are
extremely fast and accurately predict the structures of
folding initiation site peptides.
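As an illustration of what a Brownian dynamics update on a bead chain looks like, here is a minimal sketch of one Euler-Maruyama step of overdamped dynamics. It is not the authors' force field: the only energy term is a hypothetical harmonic spring between consecutive beads, and the values of r0, k, dt, diffusion and kT are arbitrary assumptions chosen for readability.

```python
import math
import random

def bond_forces(coords, r0=3.8, k=10.0):
    """Forces from harmonic springs U = 0.5*k*(r - r0)^2 between consecutive beads."""
    forces = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords) - 1):
        d = [coords[i + 1][a] - coords[i][a] for a in range(3)]
        r = math.sqrt(sum(x * x for x in d))
        # When the bond is stretched (r > r0), bead i is pulled towards bead i+1.
        f = k * (r - r0) / r
        for a in range(3):
            forces[i][a] += f * d[a]
            forces[i + 1][a] -= f * d[a]
    return forces

def brownian_step(coords, dt=0.001, diffusion=1.0, kT=0.6, rng=random):
    """One overdamped step: drift along the force plus Gaussian noise."""
    forces = bond_forces(coords)
    sigma = math.sqrt(2.0 * diffusion * dt)
    return [
        [x[a] + diffusion / kT * f[a] * dt + sigma * rng.gauss(0.0, 1.0)
         for a in range(3)]
        for x, f in zip(coords, forces)
    ]
```

Repeated calls to brownian_step generate a trajectory; a realistic simulation would replace bond_forces with the gradient of the full knowledge-based energy.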
Peptide sequences less than 20 residues in length can have
strong structural preferences that are independent of non-local interactions, as shown by NMR [1-3] and simulation
studies [4, 5]. It is thought that sequence patterns for these
peptides in the context of a parent sequence become
structured early in folding and exist in their native
conformation in unfolded proteins. Some of these short
sequence patterns, 3-19 residues in length, have been
captured in a structural motif library called I-sites (initiation
sites) [6] and the associated hidden Markov model
HMMSTR [7] which describes the adjacencies and
of the 19 protein segments found to have structural
preferences had cluster centers that were at least 2.5 Å away
from native. Scatter plots of energy versus RMSD to native
for all 27 protein segments in many cases showed that the
densest sampling was both closer to native and lower in
energy. For many of the best predicted structures, a
correlation was observed between the energy and distance
from native. A strong correlation suggests a funnel-like
landscape, which helps to minimize frustration
during simulations.
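The confidence criterion and the energy-versus-RMSD correlation described above can be sketched as follows. The cluster labels are assumed to come from any clustering of trajectory frames; the function names and the use of a Pearson coefficient are illustrative assumptions, since the abstract does not state which correlation measure was used.

```python
import math
from collections import Counter

def largest_cluster_fraction(labels):
    """Fraction of clustered frames that fall into the most populated cluster."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def has_structural_preference(labels, threshold=0.5):
    """True when more than `threshold` of the frames share one cluster."""
    return largest_cluster_fraction(labels) > threshold

def pearson(xs, ys):
    """Pearson correlation, e.g. between frame energies and RMSD to native."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

A positive correlation between energy and RMSD is what a funnel-like landscape would produce: frames closer to native are lower in energy.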
Initial studies, using an alpha-carbon-only potential based
on backbone virtual angles, Van der Waals repulsion and
contact energy terms, but without an orientation dependent
hydrogen bond term, showed errors in strand alignment of
beta-sheets and other irregularities that could be traced to
poor hydrogen bonding geometry. For example, three beta-strands, all rich in non-polar side-chains, would arrange
themselves in a collagen-like triple helix rather than in a
sheet. The backbone angles permitted this, and the contact
energies favored this structure, since it increases the total
number of strand-strand contacts. To capture the directional
nature of hydrogen bonds, three-dimensional energy fields
were created by binning the positions of alpha-carbons
whose backbone nitrogen donates a hydrogen around the
acceptor alpha-carbon position after transforming the donor
coordinates into the acceptor alpha-carbon frame of
reference (Figure 1).
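A minimal sketch of such a binned field is given below. It assumes, purely for illustration, that the acceptor frame is built from the acceptor alpha-carbon and its two chain neighbours, that bins are cubes of fixed width, and that the energy of a bin is the negative logarithm of its occupancy; none of these details is specified in the abstract.

```python
import math
from collections import Counter

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)

def local_frame(prev_ca, ca, next_ca):
    """Orthonormal axes attached to `ca` (fails if the three beads are collinear)."""
    x = _unit(_sub(next_ca, prev_ca))
    z = _unit(_cross(x, _sub(next_ca, ca)))
    y = _cross(z, x)
    return x, y, z

def to_local(point, origin, frame):
    """Coordinates of `point` in the frame centred on `origin`."""
    d = _sub(point, origin)
    return tuple(_dot(d, axis) for axis in frame)

def build_field(examples, bin_width=1.0):
    """examples: (prev_ca, acceptor_ca, next_ca, donor_ca) tuples from known H-bonds."""
    counts = Counter()
    for prev_ca, acc_ca, next_ca, donor_ca in examples:
        local = to_local(donor_ca, acc_ca, local_frame(prev_ca, acc_ca, next_ca))
        counts[tuple(math.floor(c / bin_width) for c in local)] += 1
    total = sum(counts.values())
    # Frequently occupied bins receive lower (more favourable) energies.
    return {b: -math.log(n / total) for b, n in counts.items()}
```

Because donor positions are expressed in the acceptor's own frame, the resulting field is independent of the overall position and orientation of each protein in the database.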
Leaving out the hydrogen bond energetic term did not
significantly change the RMSD of the largest cluster center
compared to native. However, the larger size of native-like
(< 2.0 Å) clusters affirmed that the hydrogen bond energy
significantly stabilized the native structure relative to all
other structures when compared to simulations without
hydrogen bond energy (p=0.001).
Folding a diverse set of short protein segments is
prerequisite to developing a hierarchical folding model for
larger proteins. It has long been thought that proteins fold
locally first, forming secondary structures which are then
able to nucleate tertiary contacts [22]. Recently, it has
been reported that this type of folding mechanism could be
implemented in a procedure called zipping and assembly
[23]. The success of folding a diverse set of protein
segments in the current study indicates that the zipping and
assembly technique could also be implemented with our
energy function. In finding the native conformation in
simulations of several different structural motifs that are
expected to fold autonomously, the force field passes a test
for generality and provides hope that our simplified model
could be used to fold larger sequences.
24: ENVIRONMENT-SPECIFIC SUBSTITUTION
TABLES FOR MEMBRANE PROTEINS
Sebastian Kelm (University of Oxford, UK), Jiye Shi (UCB
group, USA) & Charlotte M. Deane (University of Oxford,
UK)
membrane proteins differ
from soluble proteins on the
molecular evolution level.
Integral membrane proteins constitute about 30% of all
known proteins and play key functional roles in cells. Their
function is essential for a wide range of physiological events,
such as neurotransmitter transport, cell recognition and
nerve impulse transmission.
Membrane proteins are therefore important potential drug
targets. Despite their importance, experimentally determined
structures are rare as they are both difficult and expensive to
obtain. The value of modelling the structures of these
proteins is therefore considerable. However, there are no fully
automated tools developed specifically for the structure
prediction of membrane proteins as opposed to their
globular soluble counterparts.
The existing state of the art in membrane protein structure
prediction is based on the use of tools developed and trained
on globular proteins and then relies on manual manipulation
and specialist expertise to generate models.
In this project we are utilizing the specific structural features
that membrane proteins exhibit to develop a toolkit directed
at modelling them more accurately in a fully automated
fashion. We have created a procedure to generate
environment-specific substitution matrices for membrane
proteins. In the first instance, by comparing these matrices
to those generated from globular proteins, it is possible to
gain valuable information about the molecular evolution of
membrane proteins. In particular we can examine the
environment specific substitution rates in and out of the
membrane, as well as compare them between membrane and
soluble proteins. Furthermore, we are investigating the
contribution of biological parameters to our substitution
tables and how these compare to those of globular proteins.
In the next step, our substitution matrices shall be used for
membrane protein model validation and, ultimately, for
structural prediction, for example by homology modelling.
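The idea of an environment-specific substitution table can be sketched as follows: substitutions observed in a pairwise alignment are counted separately for each structural environment and then normalised per residue. The one-letter environment labels ("M" for membrane, "S" for soluble) and the function names are assumptions made for this example, not the authors' procedure.

```python
from collections import defaultdict

def count_substitutions(seq_a, seq_b, environments):
    """Count aligned residue pairs a -> b, split by the environment of each column."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for a, b, env in zip(seq_a, seq_b, environments):
        if a == "-" or b == "-":
            continue  # gapped columns carry no substitution information
        counts[env][a][b] += 1
    return counts

def to_probabilities(counts):
    """Row-normalise counts into P(b | a, environment)."""
    tables = {}
    for env, rows in counts.items():
        tables[env] = {}
        for a, row in rows.items():
            total = sum(row.values())
            tables[env][a] = {b: n / total for b, n in row.items()}
    return tables
```

Comparing the "M" and "S" tables for the same residue then shows directly how substitution preferences differ inside and outside the membrane.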
25: IDENTIFICATION OF NOVEL INHIBITORS FOR
UBIQUITIN C-TERMINAL HYDROLASE-L3 BY
VIRTUAL SCREENING
Kazunori Hirayama (Department of Electrical Engineering
and Bioscience, Japan, Graduate School of Advanced
Science and Engineering, Waseda University, Japan),
Shunsuke Aoki (Department of Bioscience and
Bioinformatics, Kyushu Institute of Technology, Japan),
Kaori Nishikawa (Department of Degenerative Neurological
Diseases, National Institute of Neuroscience, National
Center of Neurology and Psychiatry, Japan), Takashi
Matsumoto (Department of Electrical Engineering and
Bioscience, Graduate School of Advanced Science and
Engineering, Waseda University, Japan), Keiji Wada
(Department of Degenerative Neurological Diseases,
We present our environment-specific substitution tables
for membrane proteins, a first step towards modelling
their structure. We compare our tables to those of
soluble proteins. Our results shed new light on just how
DOCK (Ewing et al., J. Comput. Aided Mol. Des. 2001, 15,
411-428), GOLD (CCDC, Cambridge, UK), and FlexX
(BioSolveIT, GmbH, Germany). BCR-ABL tyrosine kinase
inhibitors (IC50 values from 10 to 200 microM) were
successfully identified by virtual screening of 200,000
compounds against crystal structures using DOCK (Peng et
al., Bioorg. Med. Chem. Lett. 2003, 13, 3693-3699) and an
anchor-and-grow algorithm taking into account ligand
flexibility. Human thymidine phosphorylase inhibitor (IC50
= 77 microM) was also identified by virtual screening of
250,521 compounds using DOCK (McNally et al., Bioorg.
Med. Chem. Lett. 2003, 13, 3705-3709). In addition,
metallo-beta-lactamase inhibitors (IC50 values less than 15
microM) were identified by virtual screening using GOLD
(Olsen et al., Bioorg. Med. Chem. 2006, 14, 2627-2635),
using a genetic algorithm taking into account ligand
flexibility.
The advantage of chaining different docking programs was
evaluated, and the results showed that chained virtual ligand
screening achieves reasonable accuracy and runs
more rapidly than
screening using a single program with default parameters
(Miteva, J. Med. Chem. 2005, 48, 6012-6022). In this study,
the results of chained docking to UCH-L3 crystal structure
were examined using a UCH-L3 hydrolysis activity assay to
confirm the efficacy of the DOCK-GOLD SBDD method.
We identified three inhibitors (IC50 = 100 to 150 microM)
of UCH-L3 using the DOCK-GOLD virtual screening of
32,799 compounds.
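The logic of chained screening can be sketched as a two-stage filter: a fast scoring pass prunes the library, and only the survivors are passed to the slower, more detailed scorer. The sketch below is generic; the scoring callables stand in for external docking runs, the cutoffs are parameters, and it assumes that lower values are better for the first score (an energy) and higher values are better for the second (a fitness score).

```python
def chained_screen(compounds, fast_score, slow_score, fast_cutoff, slow_cutoff):
    """Return (compound, slow score) hits, best first.

    fast_score and slow_score are callables mapping a compound id to a number.
    """
    # Stage 1: keep compounds whose fast (energy-like) score is below the cutoff.
    stage1 = [c for c in compounds if fast_score(c) < fast_cutoff]
    # Stage 2: the expensive scorer is evaluated only for stage-1 survivors.
    stage2 = [(c, slow_score(c)) for c in stage1]
    hits = [(c, s) for c, s in stage2 if s > slow_cutoff]
    return sorted(hits, key=lambda cs: cs[1], reverse=True)
```

The saving comes from stage 2 being evaluated on a small fraction of the library, which is why a chained protocol can run faster than a single detailed program applied to every compound.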
Human UCH-L3 and ubiquitin vinylmethylester (Ub-VME)
complex crystal structure data (PDB code 1XD3) was
obtained from the Protein Data Bank (PDB) (Misaghi et al.,
J. Biol. Chem. 2005, 280, 1512-1520). Hydrogens were
added to the UCH-L3-ubiquitin complex using the CVFF99
force field in the Biopolymer module of the Insight II 2000
suite (Accelrys, Inc., San Diego, CA). Energy was
minimized using the Discover 3 module of the same suite
with all heavy atoms (that is, atoms other than hydrogen)
restrained, to exclude short contacts. To use the UCH-L3
protein structure in the following docking simulations, the
structures of the UCH-L3 and Ub-VME complex were
divided into their components.
In the 3D structure of the UCH-L3-ubiquitin complex, the
ubiquitin C-terminus is buried in the cleft of the active site
among four active site residues of UCH-L3: Gln89, Cys95,
His169, and Asp184 (Johnston et al., EMBO J. 1997, 16,
3787-3796; Misaghi et al., J. Biol. Chem. 2005, 280, 1512-1520). In the virtual screening process using DOCK and
GOLD, the protein-ligand interacting site was restricted to
the binding site of the three ubiquitin C-terminal amino acid
residues, so that the outcome could be verified using a
ubiquitin C-terminal hydrolase enzymatic assay. The first
DOCK screening was performed on the 32,799 compounds
in the CNS-Set, which was pre-filtered by RPBS using the
least stringent filtering conditions (Miteva, Nucleic Acids
Res. 2006, 34, W738-744).
Virtual screening experiments were performed using UCSF
DOCK 5.4.0 (Ewing et al., J. Comput. Aided Mol. Des.
2001, 15, 411-428) and GOLD 3.0.1 (CCDC, Cambridge,
National Institute of Neuroscience, National Center of
Neurology and Psychiatry, Japan).
We screened for compounds with potential inhibitory
activity against UCH-L3 (ubiquitin C-terminal hydrolase-L3),
an apoptosis-associated de-ubiquitinating
enzyme, using the UCH-L3 structure (1XD3) and the
ChemBridge Compound Library. Using DOCK and
GOLD software, we identified ten candidate compounds,
and by enzymatic assay, we determined that three
compounds are UCH-L3 inhibitors.
Structure-based drug design (SBDD) is used to identify
potentially useful drugs because it enables faster drug
candidate identification than in vitro or in vivo biological
assays. The computer-based approach to drug screening
using molecular docking is a shortcut method that can be
employed when the crystal structure of a target protein is
known. UCH-L3 (ubiquitin C-terminal hydrolase-L3) is a
de-ubiquitinating enzyme that is a component of the
ubiquitin-proteasome system and is known to be involved in
programmed cell death. A previous high-throughput drug
screening identified an isatin derivative as a UCH-L3
inhibitor. In this study, we screened for novel inhibitors
having a different structural basis. We used in silico
structure-based drug design using human UCH-L3 crystal
structure data (PDB code 1XD3) and a virtual compound
library (ChemBridge CNS-Set) of 32,799 chemicals. In a
two-step virtual screening using DOCK software (first
screening) and GOLD software (second screening), we
identified ten candidate compounds with GOLD scores over
60. To determine whether these compounds exhibited
inhibitory effects on the de-ubiquitinating activity of UCH-L3, we performed an enzymatic assay using ubiquitin-7-amido-4-methylcoumarin (Ub-AMC) as the substrate.
Among the ten candidate compounds, we identified three
compounds with similar basic dihydro-pyrrole skeletons as
UCH-L3 inhibitors with IC50 values of 100-150 microM
(Hirayama et al., Bioorg. Med. Chem. 2007, 15, 6810-6818). Experimentally determined IC50 values were 103
microM for compound 1, 154 microM for compound 6, and
123 microM for compound 7. UCH-L3 is involved in the
protection of programmed cell death in germ cells and
photoreceptor cells in vivo (Kwon et al., Am. J. Pathol.
2004, 165, 1367-1374; Sano et al., Am. J. Pathol. 2006, 169,
132-141). Thus, the structural information we determined
regarding the UCH-L3 inhibitors may be useful in the
development of apoptosis-inducing anti-cancer drugs.
Key methodologies for docking small molecules with
proteins were developed in the early 1980s (Kuntz et al., J.
Mol. Biol. 1982, 161, 269-288), and various types of
docking simulation software are now available, such as
traditional atom-based interaction scoring that is typical to
most empirical, force-field based and statistical scoring
methods. We have introduced a novel concept of scoring
interactions based on Interacting Surface Points (ISP) that
are represented by their 3D positions, normal vectors and 23
chemical feature types including H-bond donor/acceptor,
aromatic Pi electrons and hydrophobic groups. A statistically
derived empirical scoring function is constructed using a 4-parameter geometric description of the relationship between
ISP pairs. The parameters include the distance between the
pairs of ISPs and the angles between the normal vectors. The
energy associated with each possible ISP pair is deduced
from statistics based on an inverse application of the
Boltzmann distribution function. During the statistics
collection temperature factors were considered with the
corresponding Gaussian functions applied to the atom
positions to account for the variable uncertainty of the atom
positions in the Protein Data Bank (PDB) X-ray structures.
More accurate geometric statistics have been collected from
the Cambridge Structure Database and recently incorporated
into the PDB data. Certain atoms, for example, the nitrogen
atom in the imidazole ring, may participate in very different
types of interactions at the same time (H-bonding and
aromatic Pi-stacking). The ISP representation can describe
these interactions better than the atom-based approach by
having multiple ISPs associated with the same atom but
pointing in different directions.
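The inverse application of the Boltzmann distribution mentioned above can be sketched as follows: the energy of a geometric bin is taken as -kT times the logarithm of the ratio between its observed frequency and its frequency in a reference state. The bin labels, the choice of reference counts and the pseudocount are assumptions made for illustration; the abstract does not describe how eHiTS defines them.

```python
import math

def inverse_boltzmann(observed, reference, kT=1.0, pseudocount=1.0):
    """Convert bin counts into energies E = -kT * ln(p_observed / p_reference)."""
    bins = set(observed) | set(reference)
    obs_total = sum(observed.get(b, 0) + pseudocount for b in bins)
    ref_total = sum(reference.get(b, 0) + pseudocount for b in bins)
    energies = {}
    for b in bins:
        p_obs = (observed.get(b, 0) + pseudocount) / obs_total
        p_ref = (reference.get(b, 0) + pseudocount) / ref_total
        energies[b] = -kT * math.log(p_obs / p_ref)
    return energies
```

Bins that are seen more often than the reference predicts come out with negative (favourable) energies, and under-represented bins come out positive. The pseudocount keeps empty bins from producing an infinite energy.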
The advantage of the statistically driven ISP scoring
function is demonstrated on a case study using the
Acetylcholine Binding Protein (AChBP) which has a key
cation-Pi interaction observed crystallographically for
several substrates (e.g. CCE, Nicotine, Lobeline,
Epibatidine)[2]. Empirical and force-field based scoring
functions fail to rank the correct binding pose highest even
when using DFT-6-31**B3LYP charges. In contrast, eHiTS
produces the correct pose with the best score even when
using the default statistical table and weighting scheme for
which no example from this protein family was included.
When the automated training script is run to include the
family in the knowledge base then the energy separation
between the correct pose and other generated poses
improves and provides very cleanly distinguished clusters.
Furthermore, the eHiTS score gives a good correlation with
the experimentally measured log(Kd) values for the series,
correctly rank ordering the actives.
A simple count of the various ISP types present on a ligand
provides a very compact descriptor for the ligand's
interaction activity profile. We have used these descriptors
via a machine learning technique to create a very rapid
ligand-based VHTS filter - called LASSO (Ligand Activity
in Surface Similarity Order)[3]. The descriptor is
independent of 3D conformation and is focused on the
interaction properties rather than connectivity or structural
similarity. It is therefore capable of scaffold hopping, the
process of retrieving active ligands with different underlying
structures. LASSO is demonstrated to achieve high
enrichment rates for all families included in the DUD
benchmark set[4]. LASSO offers an extremely rapid
UK) (Jones et al., J. Mol. Biol. 1997, 267, 727-748). In the
first screening using DOCK, the substrate-binding site was
defined by selecting ligand-atom-accessible spheres and
describing molecular surfaces using the
SPHERE_GENERATOR program in the DOCK suite. All
spheres within 6 angstroms of the root mean square
deviation (RMSD) from each atom of the three C-terminal
residues of energy-minimized ubiquitin were selected by the
SPHERE_SELECTOR program in the DOCK suite.
Following the first screening with rigid ligand conditions,
1,780 compounds with binding energy scores of less than -30 kcal/mol were selected for a second screening using
GOLD.
Using GOLD, the virtual tripeptide structure composed of
three C-terminal residues of the energy-minimized ubiquitin
was set as the reference ligand to define the ligand-binding
site. All protein atoms within 5 angstroms of each ligand
atom were used to define the binding site. As a result, the
binding site was modeled as having 174 active atoms
(automatically selected by GOLD software). Ligands
predicted to be tight-binders by both DOCK and GOLD
were then evaluated by further in vitro experiments.
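The distance-based definition of a binding site used above (all protein atoms within 5 angstroms of any reference-ligand atom) can be sketched in a few lines. This is a generic illustration and not GOLD's own implementation; atoms are represented as hypothetical (name, coordinates) pairs.

```python
import math

def atoms_within(protein_atoms, ligand_atoms, cutoff=5.0):
    """Names of protein atoms within `cutoff` angstroms of any ligand atom.

    protein_atoms: list of (name, (x, y, z)); ligand_atoms: list of (x, y, z).
    """
    selected = []
    for name, pos in protein_atoms:
        if any(math.dist(pos, lig) <= cutoff for lig in ligand_atoms):
            selected.append(name)
    return selected
```

For large structures a spatial grid or k-d tree would avoid the all-against-all distance loop, but the selection rule is the same.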
27: A NOVEL SCORING FUNCTION IN eHITS AND
LASSO
Zsolt Zsoldos, Danni Harris, Mehdi Mirzazadeh, Aniko Simon
(Simulated Biomolecular Systems, Canada)
A novel statistical scoring function for flexible
ligand docking is presented based
on Interacting Surface Points (ISP). Results of a case
study on AChBP with cation-Pi interactions are shown.
A QSAR descriptor based on the ISP provides a 3D
conformation independent ligand activity filtering tool,
ideal for scaffold hopping.
The primary goal of most virtual screening experiments is to
identify new lead compounds as a starting point for
developing a drug discovery pipeline. There are two typical
approaches that are sometimes combined to develop a
screening funnel: ligand-based approaches (2D similarity,
3D pharmacophore, fingerprint, surface or other QSAR
descriptor) and structure-based flexible ligand docking and
scoring approaches. The latter is often considered too slow
for the large scale screening of databases of millions of
structures, while the former approach does not provide 3D
coordinates or estimated binding energies.
The fragment-based exhaustive flexible ligand docking
engine of eHiTS has been published previously[1]. We are
now focusing our efforts on developing an innovative
scoring function for eHiTS, one which departs from the
filtering tool in excess of a million ligands per minute on a
single CPU.
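A count-based descriptor of the kind described for LASSO can be sketched as a fixed-length vector of feature-type counts. The four type names below are a hypothetical subset standing in for the 23 ISP types, which the abstract does not enumerate.

```python
from collections import Counter

# Hypothetical subset of surface-point types, for illustration only.
ISP_TYPES = ["hbond_donor", "hbond_acceptor", "aromatic_pi", "hydrophobic"]

def isp_descriptor(surface_point_types, type_order=ISP_TYPES):
    """Fixed-length count vector; independent of atom order and of 3D conformation."""
    counts = Counter(surface_point_types)
    return [counts.get(t, 0) for t in type_order]
```

Because only counts are kept, two ligands with different scaffolds but the same interaction features map to the same vector, which is what makes scaffold hopping possible with such a descriptor.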
eHiTS flexible docking has proved to be among the most
accurate pose prediction tools[5] and combined with the
LASSO ligand based filter it provides one of the highest
enrichment factors based on comparative evaluation
studies[6]. While LASSO can rapidly and efficiently reduce
the number of candidates to be docked to a few percent of
the total database, accurate flexible docking with eHiTS
used to take several minutes of CPU time per ligand on
traditional hardware architectures. The algorithm has been
recently redesigned and coded to take advantage of the Cell
B/E accelerator architecture, providing a 30- to 100-fold
speed-up[7] and bringing the runtime down to a few seconds
per ligand on a Sony Playstation PS3 gaming machine or
even faster on an IBM Cell Blade while still producing the
most accurate flexible docking.
The revolutionary hardware technology requires new
computational methods, replacing approximate precomputed grids with proximity look-up and explicit pairwise interaction computation. As a result, the calculation is
not only orders of magnitude faster, but it also provides
more accurate energy predictions. The emerging
technologies presented could also be applied to speed up
other molecular modeling related problems, e.g. QM or MD
simulations and protein folding, by multiple orders of
magnitude.
[1] Z. Zsoldos, D. Reid, A. Simon, S.B. Sadjad, A.P.
Johnson: eHiTS: a new fast, exhaustive flexible ligand
docking system; J.Mol.Graph.Modeling. (26), 1, 2007, 198-212;
[2] S.B. Hansen, G. Sulzenbacher, T. Huxford, P. Marchot,
P. Taylor, Y. Bourne: Structures of Aplysia AChBP
complexes with nicotinic agonists and antagonists reveal
distinctive binding interfaces and conformations. The
EMBO Journal (2005) 24, 3635-3646.
doi:10.1038/sj.emboj.7600828
[3] D. Reid, B.S. Sadjad, Z. Zsoldos, A. Simon: LASSO - ligand activity by surface similarity order: a new tool for
ligand based virtual screening. Journal of Computer-Aided
Molecular Design,
[4] N. Huang, B.K. Shoichet, J.J. Irwin: Benchmarking sets
for molecular docking. J. Med. Chem. 49(23), 6789-801
[5] M. Kontoyianni, L.M. McClellan, G.S. Sokol:
Evaluation of Docking Performance: Comparative Data on
Docking Algorithms, J.Med.Chem., 2004; 47(3); 558-565.
eHiTS results for the same test case added by Fedor
Zhuravlev, Assist.Prof., Technical University of Denmark:
http://www.simbiosys.ca/ehits/ehits_validation.html
[6] G.B. McGaughey, R.P. Sheridan, C.I. Bayly, C.
Culberson, C. Kreatsoulas, S. Lindsley, V. Maiorov, J. Truchon, W.D. Cornell:
Comparison of Topological, Shape, and Docking Methods
in Virtual Screening. J.Chem.Inf.Model. 2007; 47(4), 1504-19.
eHiTS results added by Merck:
http://www.simbiosys.ca/ehits/ehits_enrichment.html
[7] http://www.bio-itworld.com/inside-it/2008/05/gta4-andlife-sciences.html
28: SE: AN ALGORITHM FOR DERIVING
SEQUENCE ALIGNMENT FROM SUPERIMPOSED
STRUCTURES
Chin-Hsien Tai (1), James J. Vincent (2), Changhoon Kim (1) &
Byungkook Lee (1) ((1) National Cancer Institute, NIH, USA;
(2) Vermont Genetics Network, Department of Biology,
University of Vermont, USA)
The Seed Extension (SE) algorithm produces more
accurate sequence alignments from superimposed
structures than three other programs tested which use
the dynamic programming algorithm. SE does not
require a gap penalty and also uses less CPU time, making it suitable
for large-scale structural comparisons. It can be
implemented in other structure comparison programs.
Generating sequence alignments from superimposed
structures is an important part of structural comparison
programs and structure-based sequence alignments. The
accuracy of the alignment affects structural classification
and comparisons and possibly function prediction. Many
programs use a dynamic programming algorithm to generate
a sequence alignment from a pair of superimposed
structures. This procedure requires using a gap penalty and,
depending on the value of the penalty used, can introduce
spurious gaps and misalignments.
Here, we present a new algorithm, Seed Extension (SE), for
generating the sequence alignment from a pair of
superimposed structures. The SE algorithm first finds
“seeds”, the pairs of residues, one from each structure, that
meet a certain set of criteria for being unambiguously
equivalent. Three consecutive seeds form seed-segments,
which are extended along the diagonal of the alignment
matrix in both directions. Distance and amino acid similarity
between the residues are used to resolve conflicts that arise
during extension of more than one diagonal. SE is simple to
implement and does not require a gap penalty.
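The steps above can be sketched in simplified form. The abstract does not spell out the seed criteria or the conflict-resolution rule, so this sketch makes its own assumptions: a seed is a mutual nearest-neighbour pair of alpha-carbons within a distance cutoff, extension stops when the distance exceeds a looser cutoff, and conflicts are resolved first-come rather than by distance and amino acid similarity. It should be read as an illustration of the idea, not as the SE algorithm itself.

```python
import math

def find_seeds(ca_a, ca_b, cutoff=2.0):
    """Mutual nearest-neighbour residue pairs closer than `cutoff` angstroms."""
    def nearest(p, others):
        return min(range(len(others)), key=lambda k: math.dist(p, others[k]))
    seeds = set()
    for i, p in enumerate(ca_a):
        j = nearest(p, ca_b)
        if nearest(ca_b[j], ca_a) == i and math.dist(p, ca_b[j]) <= cutoff:
            seeds.add((i, j))
    return seeds

def seed_segments(seeds):
    """Start pairs (i, j) of three consecutive seeds lying on one diagonal."""
    return sorted((i, j) for (i, j) in seeds
                  if (i + 1, j + 1) in seeds and (i + 2, j + 2) in seeds)

def extend(ca_a, ca_b, segments, extend_cutoff=4.0):
    """Grow each seed-segment along its diagonal; returns {index_a: index_b}."""
    aligned, used_b = {}, set()

    def try_add(i, j):
        if not (0 <= i < len(ca_a) and 0 <= j < len(ca_b)):
            return False
        if i in aligned or j in used_b:
            # Already placed: continue only if it is this very pair.
            return aligned.get(i) == j
        if math.dist(ca_a[i], ca_b[j]) > extend_cutoff:
            return False
        aligned[i] = j
        used_b.add(j)
        return True

    for i, j in segments:
        for k in range(3):
            try_add(i + k, j + k)
        step = 3
        while try_add(i + step, j + step):   # extend forwards
            step += 1
        step = 1
        while try_add(i - step, j - step):   # extend backwards
            step += 1
    return aligned
```

No gap penalty appears anywhere: gaps are simply the residues that no diagonal reaches, in line with the motivation given for the method.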
The manually curated alignments in NCBI’s Conserved
Domain Database were used as reference alignments to
compare the sequence alignments generated from pairs of
superimposed structures by the SE algorithm and by three
other programs that use a dynamic programming algorithm:
Chimera, LSQMAN and SHEBA.
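One plausible way to score an alignment against a curated reference is the fraction of reference residue pairs that the test alignment reproduces. The abstract does not define its accuracy measure, so the definition below is an assumption made for illustration.

```python
def alignment_accuracy(test_pairs, reference_pairs):
    """Fraction of reference aligned pairs (i, j) that also occur in the test alignment."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("reference alignment is empty")
    return len(reference & set(test_pairs)) / len(reference)
```

Averaging this value over all structure pairs in a benchmark gives a single per-program figure of the kind compared in this study.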
The SE algorithm performed best among the four programs
tested. It gave an average accuracy of 95.9% over 582 pairs
of superimposed proteins. The average accuracies of Chimera,
LSQMAN and SHEBA were 89.9%, 90.2% and 91.0%,
respectively. For pairs of proteins with low sequence or
structural homology, the SE algorithm produced alignments
that were up to 18% more accurate, on average, than the
next best scoring program. Improvement was most
pronounced when the two superposed structures contained
equivalent helices or beta-strands that crossed at an angle.
SE also used considerably less CPU time than the dynamic
programming algorithm used in the original SHEBA. When
SE is implemented in SHEBA, replacing a standard dynamic
programming algorithm, the alignment accuracy improved
by 10% on average for protein structure pairs with RMSD
between 2 and 4 angstroms. The program is also two times
faster than with the dynamic programming algorithm routine
on average for protein pairs with about 200 residues and
more than 10 times faster when larger structures are
compared.
An example of sequence alignment generated by SE and the
dynamic programming routine used in SHEBA is shown in
the representative figure. This pair of three-helix bundle
structures belongs to the cd03439 family in CDD. SE generated
three aligned regions corresponding to the three helices; the
alignment was identical to that of CDD (100% accuracy). In
contrast, the original SHEBA with the dynamic
programming algorithm produced only two well-aligned
regions; the third region had many gaps and a small number
of inaccurately aligned residues.
The Seed Extension algorithm is available as a software
package for implementing in other structural comparison
programs.
protein design simulations, with newfound exclusion of
3-10 helix ends.
Computational studies of proteins such as homology
modeling and protein design involve the difficult task of
predicting the conformational effects of mutations. The
change of a single sidechain can have subtle, far-reaching
effects that are difficult to model accurately. The use of
discrete “rotamers” simplifies the search over sidechain
conformational space, but protein backbone cannot in
general be so easily reduced.
One exception to this rule is the “backrub,” a low-amplitude,
hinge-like motion of a dipeptide coupled to sidechain
rotamer jumps. The backrub was documented by examining
very high-resolution electron density for alternate
conformation sidechains and inferring the backbone changes
that must be involved (Davis 2006, Structure 14:265). The
backrub has now been employed to good effect in protein
design studies (Smith & Kortemme 2008, J Mol Biol;
Georgiev 2008, Bioinformatics). Importantly, however, no
direct evidence has so far been presented to support the
assumption implicit in these designs: that this dynamic, low-energy backbone motion on the timescale of rotamer
transitions for single sidechains is also relevant on the
evolutionary timescale of sidechain mutations.
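Geometrically, a hinge-like move of this kind is a rigid rotation of the atoms between two pivot points about the axis joining them. The sketch below rotates a point about an arbitrary axis using Rodrigues' formula; it assumes, for illustration, that the axis runs through the two flanking alpha-carbons, and it omits the compensating peptide rotations and the coupled rotamer change.

```python
import math

def rotate_about_axis(point, axis_start, axis_end, angle_deg):
    """Rotate `point` by `angle_deg` about the line from axis_start to axis_end."""
    ax = [e - s for s, e in zip(axis_start, axis_end)]
    n = math.sqrt(sum(c * c for c in ax))
    k = [c / n for c in ax]                       # unit vector along the axis
    v = [p - s for p, s in zip(point, axis_start)]
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kxv = [k[1] * v[2] - k[2] * v[1],
           k[2] * v[0] - k[0] * v[2],
           k[0] * v[1] - k[1] * v[0]]
    kdv = sum(a * b for a, b in zip(k, v))
    # Rodrigues' rotation formula.
    rotated = [v[a] * cos_t + kxv[a] * sin_t + k[a] * kdv * (1.0 - cos_t)
               for a in range(3)]
    return tuple(r + s for r, s in zip(rotated, axis_start))
```

Applying the function with a small angle to the central alpha-carbon and its sidechain atoms, with the flanking alpha-carbons as the axis end points, shifts the central residue while leaving both pivots fixed.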
To address this point, we have used our Top5200 structure
dataset to examine two different cases for which populations
of otherwise similar local conformation are related by a
single amino acid difference that alters an H-bond or van der
Waals contact with a neighboring chain. Both cases show
sequence-dependent bimodal backbone distributions that are
well described by the backrub motion. The first case is 4320
Phe, Tyr, or Trp residues with plus chi1 rotamers on antiparallel beta sheet, which places the aromatic ring directly
over a sidechain on the adjacent strand. If that sidechain is a
Gly, then the aromatic residue hinges downward to touch
the Gly H, while the Cbeta group of any other amino acid on
the opposite strand pushes the aromatic ring upward. The
second case is alpha-helix N-cap residues that form classic
sidechain-backbone N-cap H-bonds to the i+3 NH
(Richardson 1988, Science 240:1648). 4906 of the N-caps
were Asn or Asp (233 with psi from 165 to 170 degrees
shown in green below), and 7405 were Ser or Thr (1554 of
same subset in blue). The backbone conformations differed
consistently, where the longer N/D sidechains rotate the first
turn’s backbone away from residue i+3, while the shorter
S/T sidechains pull the first turn’s backbone toward i+3.
When examples are superimposed on the 3 atoms marked in
red below, the average N/D vs. S/T N-cap Calpha positions
are about 0.3 Angstroms apart in a backrub rotation of about
10 degrees, similar to shifts typical of rotamer backrubs. For
the helix N-caps the sequence change and the backrub occur
at the same residue, as seen earlier for rotamer changes,
while for beta aromatics the sequence change on the
adjacent strand causes a backrub shift at the aromatic.
These findings validate the inclusion of empirically
observed backbone motions such as the backrub as part of
the repertoire of “moves” for protein design and other
modeling efforts. If we allow nature to inform our notion of
30: CO-EVOLUTION OF STRUCTURAL
BIOINFORMATICS AND PROTEIN DESIGN FOR N-CAP BACKRUBS
Daniel Keedy (Department of Biochemistry, Duke University,
USA), Ed Triplett (Duke University, USA), David Richardson
(Department of Biochemistry, Duke University, USA),
Jane Richardson (Department of Biochemistry, Duke
University, USA), Ivelin Georgiev (Computer Science
Department, Duke University, USA), Cheng-Yu Chen
(Department of Biochemistry, Duke University, USA) and
Bruce Randall Donald (Duke University, USA).
The “backrub” motion, a previously described dipeptide
rotation coupled to rotamer jumps, is now documented
to occur for helix N-cap residues and for beta-sheet
aromatics related by single amino acid substitutions.
Backrubs are thus suitable in a repertoire of moves for
library or a lattice model is discrete, which is inconsistent
with the continuous characteristics of protein backbone
torsion angles. This discrete nature may restrict the search
space and cause loss of prediction accuracy.
The subject of this abstract lies in protein conformation
sampling in real space, that is, the exploration of the
continuous conformational space compatible with a given
protein sequence using a probabilistic graphical model. In
particular, we develop a Conditional Random Fields (CRF)
model [1], called CRFSampler, to learn the complicated
relationship between protein sequence and structure and
then sample the conformations of a protein using this CRF
model. CRFSampler models the sequence-structure
relationship using approximately one million parameters
and estimates them using a sophisticated discriminative
learning method. Given a protein sequence, the occurring
probability of a potential conformation (i.e., all the
backbone angles) can be accurately estimated by
CRFSampler and thus the protein conformation space can be
efficiently explored.
Instead of using fragments as basic building blocks of a
protein conformation, CRFSampler directly samples the
backbone angles at each position according to its occurring
probability calculated from the CRF model. Different from
fragment assembly methods and lattice models,
CRFSampler uses directional statistics to model the
distribution of protein backbone angles at each position and
thus can sample backbone angles from a continuous space.
The distribution parameters of angles at each backbone
position are sampled by CRFSampler using sequence
information and PSIPRED-predicted secondary structure.
CRFSampler guarantees to search through the whole
continuous conformation space so that the native structure
of a protein will not be missed. On the other hand,
CRFSampler is also efficient because it is biased towards
those conformations with high occurring probability.
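Sampling angles from a continuous distribution, rather than choosing among stored fragments, can be sketched as follows. This is not the CRFSampler implementation: it assumes that some model has already produced a mean and a concentration for each torsion angle at each position (the dictionary keys are hypothetical), and it swaps in independent von Mises distributions on phi and psi as a simple stand-in for the directional distribution used by the authors.

```python
import math
import random

def wrap(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def sample_backbone_angles(position_params, rng):
    """Draw one (phi, psi) pair per position from independent von Mises distributions.

    position_params: list of dicts with hypothetical keys
    'phi_mu', 'phi_kappa', 'psi_mu', 'psi_kappa' (means in radians).
    """
    conformation = []
    for p in position_params:
        # random.vonmisesvariate expects the mean angle in [0, 2*pi).
        phi = rng.vonmisesvariate(p["phi_mu"] % (2 * math.pi), p["phi_kappa"])
        psi = rng.vonmisesvariate(p["psi_mu"] % (2 * math.pi), p["psi_kappa"])
        conformation.append((wrap(phi), wrap(psi)))
    return conformation

# Illustrative parameters: five positions with helix-like means and tight spread.
helix_like = {"phi_mu": math.radians(-60.0), "phi_kappa": 200.0,
              "psi_mu": math.radians(-45.0), "psi_kappa": 200.0}
params = [dict(helix_like) for _ in range(5)]

rng = random.Random(0)
conformation = sample_backbone_angles(params, rng)
```

Because every draw comes from a continuous density, any angle pair has nonzero probability, while a high concentration keeps most samples near the predicted mean.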
CRFSampler uses a graph to model the relationship between
sequence and backbone angles. The backbone angles at a
single position depend on residues and secondary structures
at many positions of the target protein to be folded.
CRFSampler also models the dependency between the
angles at three consecutive positions or even more. In
CRFSampler, a sophisticated model topology (see Figure 1
for an example) and feature set can be defined to describe
the dependency between sequence and structure without
worrying about learning of model parameters. CRFSampler
is much more expressive than the FB5-HMM model [2], in
which the angles at a single position depend directly only
on the residue type at that position, and only the
interdependence between two adjacent positions is captured. Second,
CRFSampler also naturally captures the interaction between
primary sequence and secondary structure. CRFSampler can
automatically learn the relative importance of primary
sequence and secondary structure, as opposed to the FB5-HMM
model, which assumes primary sequence and secondary
structure are equally important. Finally, CRFSampler can
easily incorporate a sequence profile (i.e., a position-specific
frequency matrix) and predicted secondary structure
likelihood scores into the model to further improve sampling
backbone motion by using structurally observed backbone
distributions encoding “protein-like” behavior, we can
implicitly incorporate aspects of protein biophysics subtler
than the field has so far been able to model accurately.
To this end, we have also begun to utilize backrubs for
computational redesign of N-caps in GrsA PheA using a
new algorithm, BRDEE (see Ivelin Georgiev’s talk in the
main ISMB session). While studies by Fersht, Matthews,
Kallenbach, Presta, and others have found it possible to
introduce stabilizing N-caps where none existed before, on
the basis of the findings described above we suspected that
explicitly accounting for possible backrubs at the N-cap
position could improve the success rate of designs. A new
issue, highlighted by feedback between informatics and
design, is the difference between helix N-cap preferences for
3-10 vs. alpha-helical conformations. “Traditional” N-caps
(with sidechain-backbone H-bonds to residue i+3) appear to
be significantly less compatible with 3-10 helix starts.
32: EFFICIENT PROTEIN CONFORMATION
SAMPLING IN REAL SPACE
Jinbo Xu (Toyota Technological Institute of Chicago,
USA).
Protein conformation sampling is a major
bottleneck of ab initio folding. This abstract presents
CRFSampler, a protein conformation sampling
algorithm built upon a probabilistic graphical model,
Conditional Random Fields. CRFSampler models the
sequence-structure relationship using about a million
parameters. Preliminary results indicate that
CRFSampler can efficiently generate protein-like
conformations.
Ab initio folding has made exciting progress in the past
decade, as exemplified by the fragment assembly method
implemented in Rosetta and the hybrid method (i.e., hybrid
of fragment assembly and lattice model) implemented in
TASSER and I-TASSER. Many other groups have
developed a variety of fragment assembly methods and
lattice models for protein structure prediction, and
demonstrated success. Although these two popular structure
prediction methods achieved exciting results, several
important issues remain with protein conformation
sampling. First, due to the limited number of experimental
protein structures in PDB, it is still very difficult to have a
library of even moderate-sized fragments that can cover all
the possible local conformations of a sequence stretch.
Second, the conformational space defined by a fragment
performance. Although extremely expressive, CRFSampler
can avoid overfitting of the model parameters by
regularizing its parameters using a Gaussian prior, allowing
the user to achieve a balance between model complexity and
expressivity.
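Regularization with a Gaussian prior is conventionally written as a penalized conditional log-likelihood. The form below is the textbook one, given as a sketch rather than as the exact objective used for CRFSampler; the notation is illustrative, with $\theta$ the parameter vector, $\sigma^2$ the prior variance, and $(x^{(n)}, y^{(n)})$ the $N$ training pairs of sequence features and backbone-angle labels.

```latex
\hat{\theta} \;=\; \arg\max_{\theta}\;
  \sum_{n=1}^{N} \log P_{\theta}\!\left(y^{(n)} \mid x^{(n)}\right)
  \;-\; \frac{\lVert \theta \rVert_2^{2}}{2\sigma^{2}}
```

A small $\sigma^2$ pulls the parameters toward zero and limits overfitting, while a large $\sigma^2$ lets the model use more of its expressive capacity.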
Our experimental results indicate that using CRFSampler,
protein-like conformations can be efficiently sampled in real
space without using fragment assembly. Using only
compactness and self-avoiding constraints, CRFSampler can
quickly generate native-like conformations with quality
better than those generated by the FB5-HMM model and
Levitt's lattice model [3]. Please refer to our paper [4] for a
detailed comparison of CRFSampler, FB5-HMM, Levitt's
lattice model and Rosetta.
Currently we are developing a method for ab initio protein
structure prediction by combining CRFSampler with a
distance-dependent statistical potential and a hydrogen
bonding energy. Using the DOPE statistical potential and
BMKhbond (only backbone and C-beta atoms considered),
we can successfully fold a variety of alpha and beta proteins
such as 1FC2, 1ENH, 2CRO, 1NKL, 1TRL, 1BG8, 2GB1,
1SRO, 1PGB, 1FGP and 1DKT.
Figure 1: An example CRF model for protein conformation
sampling. In this example, the angles (represented as the
middle level of this figure) at position i depend on the
residues and secondary structure types at positions i-2, i-1, i,
i+1 and i+2 and any nonlinear combinations of them. There
is also interdependence among angles in three consecutive
positions. This CRF model can also be extended to
incorporate long-range interdependence between angles and
make use of more information such as PSI-BLAST profiles
and alignments generated from comparative modeling.
[1] John Lafferty, Andrew McCallum, and Fernando Pereira.
Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proceedings of
the 18th International Conference on Machine Learning,
pages 282–289. Morgan Kaufmann, San Francisco, CA,
2001.
[2] Thomas Hamelryck, John T. Kent, and Anders Krogh.
Sampling realistic protein conformations using local
structural bias. PLoS Comput Biology, 2(9), September
2006.
[3] Y. Xia, E. S. Huang, M. Levitt, and R. Samudrala. Ab
initio construction of protein tertiary structures using a
hierarchical approach. J Mol Biol, 300(1):171–185, June
2000.
[4] F. Zhao, S. Li, B. Sterner and J. Xu. Discriminative
Learning for Protein Conformation Sampling. PROTEINS.
2008 Apr 15. [Epub ahead of print].
33: MODELING THE INTERACTION OF MAP
KINASE PHOSPHATASE 3 WITH A NOVEL
INHIBITOR BY ACCOUNTING FOR
CONFORMATIONAL FACTORS
Ahmet Bakan1*, Gabriela Molina2*, Andreas Vogt3*,
Michael Tsang2 and Ivet Bahar1 (1Departments of
Computational Biology, 2Microbiology and Molecular
Genetics, 3Pharmacology and Chemical Biology, University
of Pittsburgh, USA)
* These authors contributed equally.
We employ flexible ligand and side-chain docking to
multiple target conformations in a two-step procedure to
pinpoint an unknown inhibitor binding site and to assess the
related mechanism of inhibition. We present its
application to the interaction of MAP kinase
phosphatase 3 with a novel inhibitor.
Molecular docking is the primary method to probe protein-ligand
interactions. The rigid-target assumption is the major
limitation to its success. In recent years, flexible ligand
and side-chain docking to multiple target conformations has
emerged as a practical approach to improve pose accuracy
and scoring. We employ this approach in a two-step
procedure to pinpoint an unknown inhibitor binding site and
to assess the related mechanism of inhibition. We present its
application to the interaction of MAP kinase phosphatase 3
(MKP3) with a novel inhibitor (BCI) identified from
zebrafish chemical screens [1]. Based on the computational
modeling, we proposed that BCI is an allosteric inhibitor
and supported the allosteric inhibition mechanism by in
vitro experiments.
The first step of the computational procedure is
identification of potential binding sites on the target protein.
To this aim, unbiased rigid protein docking simulations are
performed for all known distinct conformational states of
the target protein using AutoDock [2]. The resulting poses
are clustered to identify energetically favorable docking
sites. Favorable sites are further explored by allowing the
protein to undergo structural fluctuations in the
neighborhood of the predefined conformational states. In the
second step, flexible ligand and side-chain docking to
multiple target conformations is employed to reveal the
most favorable site. When a crystallographic structure of the
target is available, normal mode analysis is used for efficient
sampling of conformational fluctuations. When the structure
of the target is not known, multiple homology models are
used as an ensemble of accessible conformations. In the
former case, normal modes of internal motions of the target
protein are calculated using the anisotropic network model
(ANM), a simple elastic network model at residue level
resolution [3]. ANM modes relevant to the functional
motions of the protein or those affecting the geometry of a
potential binding site are selected from the low frequency
regime of the spectrum of modes. Protein conformations are
sampled along the selected modes by jointly optimizing
backbone and side-chains using an all-atom molecular
mechanics force field and harmonic restraints. For each
conformation, a diverse set of ligand docking poses are
generated using GOLD [4]. Resulting poses, reaching a total
number in the order of thousands for each potential site, are
clustered using an agglomerative clustering scheme. Well
populated and high scoring clusters are analyzed to reveal
the most likely binding site. Finally, based on the normal
mode analysis of the dynamics of the target and the location
of the most favorable binding site, an inhibition mechanism
is proposed. Altogether, these steps incorporate into
scoring the conformational factors that are generally
omitted. As opposed to selecting the highest scoring docking
pose, this approach is able to pinpoint the inhibitor binding
site.
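The clustering and ranking of docking poses can be sketched as below. The abstract specifies only "an agglomerative clustering scheme", so the single-linkage criterion, the distance cutoff, the use of pose centroids as the clustered objects, and the convention that a higher score is better are all assumptions made for illustration.

```python
import math

def agglomerative_clusters(points, cutoff):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per pose and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `cutoff`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest members.
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def rank_clusters(clusters, scores):
    """Order clusters by population first, then by their best score."""
    return sorted(clusters,
                  key=lambda c: (len(c), max(scores[i] for i in c)),
                  reverse=True)

# Hypothetical pose centroids (Angstroms) and docking scores.
pose_centroids = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 1.0, 0.0),
                  (20.0, 0.0, 0.0), (21.0, 0.0, 0.0), (40.0, 40.0, 40.0)]
pose_scores = [55.0, 60.0, 58.0, 70.0, 52.0, 45.0]

clusters = agglomerative_clusters(pose_centroids, cutoff=2.0)
ranked = rank_clusters(clusters, pose_scores)
```

In this toy data the single best-scoring pose (index 3) sits in a two-member cluster, while the most populated cluster ranks first, which mirrors the point that a site is chosen from well-populated, high-scoring clusters rather than from one top pose.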
MKP3 is a member of the MKP family that has been implicated
in the development of cancer [5]. MKP3 dephosphorylates
extracellular signal-regulated kinase 2 (ERK2) and regulates
developmental processes. Upon binding to ERK2, MKP3 is
catalytically activated [6]. A selective and potent inhibitor of
MKP3 is lacking. This approach was used to reveal the
inhibition mechanism of a novel MKP3 inhibitor, BCI (Fig.
panel A). BCI was identified in zebrafish chemical screens.
The first step of the procedure was applied to two known
states of MKP3: the low-activity state (Fig. panel B) [7] and
the high-activity state. The second step of the procedure
found that BCI preferentially binds a crevice between the
general acid loop and the nearby helix alpha7, rather than
interacting directly with the catalytic residues Asp262,
Cys293, or Arg299 (Fig. panel C). At this putative binding
site, a close interaction with Trp264, Asn335 and Phe336
was observed.
To assist in our understanding of the potential inhibition
mechanism, we explored the ANM modes of motions that
induce conformational changes at the general acid loop. Our
analysis showed that MKP3 possesses a tendency to reorient
its general acid loop to facilitate the catalytic interactions of
Asp262. We proposed that BCI binding to the accessible
crevice in the low-activity state effectively blocks the
flexibility of this loop, thereby restricting the movement of
Asp262 towards the phosphatase loop (Fig. panel D) and
inhibiting the catalytic activation induced upon ERK
binding. This inhibition mechanism was supported by
follow-up experiments using a fluorescent small-molecule
substrate of MKP3 and ERK2.
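For readers unfamiliar with the ANM, the construction of its Hessian can be sketched in a few lines. The block follows the standard ANM definition (one node per residue, identical springs between nodes within a cutoff); the 15 Angstrom cutoff, the unit spring constant, and the toy coordinates are assumptions, and nothing here reproduces the MKP3 calculation.

```python
import math

def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Build the 3N x 3N ANM Hessian for N nodes (e.g. Calpha atoms).

    Each pair of nodes within `cutoff` contributes an off-diagonal 3x3 block
    -gamma * (d d^T) / |d|^2, and each diagonal block is minus the sum of the
    off-diagonal blocks in its row.
    """
    n = len(coords)
    hessian = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [coords[j][k] - coords[i][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            if r2 > cutoff * cutoff:
                continue
            for a in range(3):
                for b in range(3):
                    block = -gamma * d[a] * d[b] / r2
                    hessian[3 * i + a][3 * j + b] += block
                    hessian[3 * j + a][3 * i + b] += block
                    hessian[3 * i + a][3 * i + b] -= block
                    hessian[3 * j + a][3 * j + b] -= block
    return hessian

# Toy four-node "structure" in Angstroms (hypothetical coordinates).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0),
              (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
hessian = anm_hessian(toy_coords)
```

The normal modes are the eigenvectors of this matrix; in practice they would be obtained with a numerical routine such as `numpy.linalg.eigh`, discarding the six zero-eigenvalue rigid-body modes and keeping the lowest-frequency remainder.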
BCI was used to probe the role of MKP3 in the development of
the zebrafish embryo. It constitutes a basis for the development
of selective inhibitors of members of the MKP family.
This work demonstrates a practical and efficient approach to
identify the binding site of an inhibitor with an unknown
inhibition mechanism. The future aim of this study is to
develop this approach as a method for lead optimization, an
application area in which a practical structure-based
approach is lacking.
REFERENCES:
[1] G. A. Molina, S. C. Watkins, M. Tsang, BMC Dev Biol
7, 62 (2007).
[2] G. M. Morris et al., J Comp Chem 19, 1639 (1998).
[3] A. R. Atilgan et al., Biophys J 80, 505 (Jan, 2001).
[4] G. Jones et al., J Mol Biol 267, 727 (Apr 4, 1997).
[5] A. Bakan, J. S. Lazo, P. Wipf, K. M. Brummond, I.
Bahar, Curr Med Chem, Manuscript submitted (2008).
[6] M. Camps et al., Science 280, 1262 (May 22, 1998).
[7] A. E. Stewart, Nat Struct Biol 6, 174 (Feb, 1999).
34: HOW GOOD CAN TEMPLATE-BASED
MODELLING BE?
Braddon K. Lance (Macquarie University, Australia),
Graham R. Wood (Macquarie University, Australia),
Charlotte M. Deane (Oxford, UK).
We quantify the best possible predictions achievable in
template-based modelling when using rigid fragments
from a single template. Achieving the optimum
positioning of template fragments yields median
improvements of 0.3 Å RMSD and 4% GDT-HA, with the
upper quartile yielding improvements of over 0.7 Å RMSD
and 10% GDT-HA.
The accuracy with which a template approximates a target is
strongly related to sequence identity, a relationship which is
well understood (Chothia and Lesk, 1986). A long-standing
challenge in template-based modelling is generating protein
structure predictions better than the best template. In
template-based modelling, the position of the template
fragments is often modified in an attempt to improve the
prediction beyond that of the template structure. The
magnitude of improvements that can be achieved via
movement of template fragments alone has not previously
been studied. We have recently quantified these possible
improvements (Lance et al., 2008).
The magnitude of improvements that may be achieved by
optimal positioning of template fragments was quantified
using CASP7 targets (Moult et al., 2007), and the
HOMSTRAD database (Mizuguchi et al., 1998). In the
CASP7 tests we used the best template for each target
structure as listed on the CASP website. With the HOMSTRAD
database we carried out comprehensive tests using all
structure pairs within each HOMSTRAD family, arbitrarily
assigning the role of target and template to each member of
the pair. Structure alignments giving corresponding
sequence alignments were calculated using TM-align
(Zhang, 2005). Within the sequence alignment, contiguous
amino-acids of length four or more that were aligned in the
target and template sequence were considered to define the
template fragments useful for approximating the target.
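The fragment definition above (runs of at least four consecutive aligned positions) can be made concrete with a short sketch. The function name, the gap character, and the example alignment are hypothetical; the only rule taken from the text is the minimum run length of four.

```python
def aligned_fragments(target_aln, template_aln, min_length=4, gap="-"):
    """Return (target_start, template_start, length) for each contiguous
    run of aligned residues that is at least `min_length` long.

    Start positions are 0-based indices into the ungapped sequences.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")

    fragments = []
    t_pos = p_pos = 0          # current residue index in target / template
    run_start, run_len = None, 0
    for t_char, p_char in zip(target_aln, template_aln):
        if t_char != gap and p_char != gap:
            if run_len == 0:
                run_start = (t_pos, p_pos)
            run_len += 1
        else:
            # A gap in either sequence ends the current run.
            if run_len >= min_length:
                fragments.append((run_start[0], run_start[1], run_len))
            run_len = 0
        if t_char != gap:
            t_pos += 1
        if p_char != gap:
            p_pos += 1
    if run_len >= min_length:
        fragments.append((run_start[0], run_start[1], run_len))
    return fragments

# Hypothetical alignment with gaps in both sequences.
target_aln = "ACDEFG--HIKLMNP-QR"
template_aln = "ACDE-GSTHIKL-NPWQR"
fragments = aligned_fragments(target_aln, template_aln)
```

Here only the runs `ACDE` and `HIKL` qualify; the shorter aligned stretches (`G`, `NP`, `QR`) fall below the four-residue minimum and are not treated as template fragments.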
In addition to the standard (unmodified) template, a
fragment-optimized template was created, in which each
template fragment was placed in its optimum position by
independent least-squares superposition onto the corresponding
target fragment. The structural similarity of the
standard and fragment-optimized templates to the target
structure was compared using RMSD, GDT-HA, GDT-TS
and HBScore.
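As a reminder of how the GDT measures relate to per-residue deviations, the sketch below computes RMSD, GDT-TS and GDT-HA from a list of Calpha distances. It is a simplification: the full GDT procedure searches over many superpositions to maximize the count at each threshold, whereas this version assumes a single fixed superposition, and the distances are invented.

```python
import math

GDT_TS_THRESHOLDS = (1.0, 2.0, 4.0, 8.0)   # Angstroms
GDT_HA_THRESHOLDS = (0.5, 1.0, 2.0, 4.0)   # Angstroms, "high accuracy"

def gdt(distances, thresholds):
    """Average, over the thresholds, of the percentage of residues whose
    deviation is within each threshold (single superposition only)."""
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= t) / n for t in thresholds]
    return 100.0 * sum(fractions) / len(thresholds)

def rmsd(distances):
    """Root-mean-square of the per-residue deviations."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

# Hypothetical per-residue Calpha deviations (Angstroms).
deviations = [0.3, 0.6, 0.9, 1.5, 2.5, 3.5, 5.0, 9.0]

gdt_ts = gdt(deviations, GDT_TS_THRESHOLDS)   # 62.5
gdt_ha = gdt(deviations, GDT_HA_THRESHOLDS)   # 43.75
overall_rmsd = rmsd(deviations)
```

Because the GDT-HA thresholds are half those of GDT-TS, GDT-HA can never exceed GDT-TS for the same set of deviations, and it is the more sensitive of the two to small fragment shifts.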
The suitability of a predicted structure for loop modelling
increases with greater accuracy of the three terminal
We have developed a predictor for residue-residue
contacts in alpha-helical TM proteins, utilizing data on
sequence space separation, amino acid content and
correlated mutations of residues.
Additional data include features unique to alpha-helical TM
regions such as the predicted distance of a residue to the
membrane center. Our predictor uses a trained classifier
based on support vector machines, a statistical method with
a good track record, which is well equipped for the diverse
data available.
A challenge in these kinds of calculations is the size of the
data the model has to digest, which results in long
computation times. We are addressing this issue
by studying the influence of the different input data,
separately and in combination, to gauge which offers
the best compromise between speed and predictive
performance. A conclusion from this work is that the
predicted distance to the membrane center is a valuable
addition to the more traditional sorts of input previously tried
for soluble proteins.
Our method's results are on par with previous methods for
soluble proteins for predictions on individual chains and
satisfactory also for whole, multi-chained, proteins. Future
prospects involve using the contact predictions as input for
other tasks, e.g. prediction of helix binding sites or as initial
input for fragment assembly algorithms.
REFERENCES
[1] E. Wallin and G. von Heijne, “Genome-wide analysis of
integral membrane proteins from eubacterial, archaean, and
eukaryotic organisms,” Protein Science, vol. 7, pp. 1029–
1038, 1998.
[2] E. Granseth, H. Viklund, and A. Elofsson, “Zpred:
predicting the distance to the membrane center for residues
in alpha-helical membrane proteins.,” Bioinformatics, vol.
22, pp. e191–e196, Jul 2006.
[3] V. Vapnik, The Nature of Statistical Learning Theory.
Springer Verlag, New York, 1995.
[4] C. J. C. Burges, “A tutorial on support vector machines
for pattern recognition,” Data Mining and Knowledge
Discovery, vol. 2, no. 2, pp. 121–167, 1998.
36: COMPUTATIONAL METHODS TO ADVANCE
FROM CRYSTALLOGRAPHIC MODEL TO
ENZYME MECHANISM AND STRUCTURE-FUNCTION RELATIONSHIPS
Troy Wymore & Adam Kraut (National Resource for
Biomedical Supercomputing, USA)
residues on each conserved fragment, the anchor regions
(Fiser et al., 2000).
To gauge the effect of fragment movement upon loop
modelling, the RMSD of the anchor regions (the three
terminal residues of each fragment) in the standard and
fragment-optimized templates were also calculated.
Our results demonstrate that optimal independent fragment
movement gives improvements over the template structure,
with mean improvement in RMSD, GDT-TS and GDT-HA
of 0.7 Angstroms, 5.4% and 6.3% respectively. For a
minority of models these improvements are substantial, with
the upper quartile showing improvements of 0.8 Angstroms
RMSD, 8.25% GDT-TS and 10% GDT-HA. Little change
was observed in the hydrogen bonding as measured using
HBScore.
The scope for improvement upon the template by rigid
fragment movement varies as a function of template quality,
with templates showing approximately 80% coverage of the
target offering the greatest scope for improvement.
The median change in anchor RMSD was close to zero; however,
the reductions in anchor RMSD were generally larger in
magnitude than the increases, indicating that the
fragment-optimised template is better for loop modelling overall.
These results demonstrate that there is still scope for much
greater improvement over the template structure via
fragment movement than is currently being realised in even
the best template-based modelling techniques.
REFERENCES
Chothia C, Lesk AM: The relation between the divergence
of sequence and structure in proteins. EMBO J 1986,
5(4):823-826.
Fiser A, Do RKG, Sali A: Modeling of loops in protein
structures. Protein Sci 2000, 9:1753-1773.
Lance BK, Wood GR, Deane CM: How good can template-based
modelling be? In Preparation.
Mizuguchi K, Deane CM, Blundell TL, Overington JP:
HOMSTRAD: a database of protein structure alignments for
homologous families. Protein Sci 1998, 7:2469-2471.
Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T,
Tramontano A: Critical assessment of methods of protein
structure prediction - Round VII. Proteins 2007, 69(S8):3-9.
Zhang Y, Skolnick J: TM-align: a protein structure
alignment algorithm based on TM-score. Nucleic Acids Res
2005, 33:2302-2309.
35: CONTACT PREDICTION FOR MEMBRANE
PROTEINS
Aron Hennerdal & Arne Elofsson (Stockholm University,
Sweden)
Due to difficulties in the experimental determination of the
structure of transmembrane (TM) proteins, only relatively
few such structures have been deposited in the Protein
Data Bank (PDB). The alpha-helical class of TM proteins is
the most common and in many ways the most interesting
since it contains many novel drug targets. Structure
prediction for this class is in its infancy and many strategies
that have been proven useful for soluble proteins have yet to
be implemented and possibly modified.
complex model, we experimented with aspects of the
Lys-Tyr-Ser catalytic triad including alternative protonation
states and side chain orientation through classical MD
simulations. We found that it was necessary to manually
manipulate the Lys159 side chain conformation from the
one present in the crystal structure in order to be optimally
placed to assist in the stabilization of the tyrosinate, Tyr155.
MD simulations of several nanoseconds (ns) were
insufficient to observe this very small change. The reactive
configurations were then used in umbrella sampling
QM/MM simulations to determine the function of active site
residues and the role that water molecules play in this
reaction. The importance of these water molecules is not
easily obtained except through these specialized simulations.
Through evolution, an S-HPCDH has arisen to catalyze the
oxidation of S-HPC. R- and S-HPCDH are one of two cases
in which pairs of stereospecific dehydrogenases act in
concert in one metabolic pathway. S-HPCDH shares only 41%
sequence identity with R-HPCDH, and its structure is
unknown. Therefore, several
comparative modeling and docking programs were
examined for their ability to output an approximate
Michaelis complex model that not only identifies the
evolved binding site but is also useful for detailed atomic
simulation. Our results show that side chain placement by
the program SCWRL3 was critical for subsequent docking
and simulation. Docking S-HPC into the active site with the
program AUTODOCK was generally successful in
determining the new location of the sulfonate-binding site.
Finally, use of these models for classical MD simulation and
subsequent QM/MM simulations of the reactions will be
presented.
37: MOLECULAR SURFACE ABSTRACTION
Gregory Cipriano, George Phillips & Michael Gleicher
(University of Wisconsin-Madison, USA)
We present tools to study protein interfaces. Our
approach uses abstracted representations of the shape of
the molecular surface and the physicochemical properties
around it at various levels of scale. We demonstrate three
applications: visual inspection, crystal contact
complementarity, and ligand pocket morphology.
Computational methods and strategies for constructing a
Michaelis complex model in two evolutionarily related
Dehydrogenases (one with a known crystallographic
model and one without) as well as the accuracy of
subsequent enzymatic reaction simulations with hybrid
Quantum Mechanical/ Molecular Mechanical (QM/MM)
methods from these different models will be presented.
Elucidating the mechanism of enzymatic reactions and
reproducing experimental reaction rates through
computational methods remains an enormous challenge
despite notable advances in computational methods and
protein crystallography. In most cases, significant
modifications of the enzyme crystallographic model must be
undertaken in order to obtain a Michaelis complex model
with all the critical interactions between substrate and
enzyme present. These modifications minimally require the
appropriate addition of protons to heavy atoms but could
also include docking of the natural substrate into the active
site, addition or reorientation of water molecules and
alternative placement of key side chains. Finally, an
appropriate and relatively computationally expensive
Quantum Mechanical (QM) method in conjunction with
simulation methods must be employed to obtain free
energy profiles of competing mechanisms. The task of
generating such a model is made even more challenging if
the structure of the enzyme has not been determined through
crystallography. Yet, use of comparative modeling
techniques is sufficient in cases of trivial sequence similarity
to generate protein models that recapitulate several aspects
of the actual structure. Unfortunately, the efficacy of
comparative protein models for subsequent investigations
with classical molecular dynamics (MD) simulation that
employ molecular mechanical (MM) force fields is
questionable and even more so if resolution of mechanistic
controversies is sought with QM/MM methods.
In this presentation, we will first describe computational
strategies for simulating R-Hydroxypropylthioethanesulfonate
Dehydrogenase (HPCDH) enzymatic reactions starting from a
crystallographic model (1.8 Å-resolution) of the enzyme that
contains a reaction product. In order to obtain a Michaelis
The goal of this project is to create tools for understanding
and characterizing protein surfaces. In particular, we aim to
enable the study of the interfaces between proteins and their
interaction partners to enable the understanding and
prediction of function based on protein structure. A premise
of this analysis is that the shape of the protein's surface in
the interacting region, and the physical and chemical
polar regions occupying a different percentage of the total
contact area. For these we were able to confirm a
hypothesized correlation between high salt concentration at
the time of crystallization, and highly polar crystal-contact
regions.
Morphological Study of Ligand Binding Pockets:
We apply surface abstraction to study the binding pockets of
common ligands. Abstracted representations are created for
over 100 PDB entries with known bound ATP ligands.
Examining the portions of the surfaces in proximity to the
ligand reveals both diversity and common patterns in the
active sites. This study is enabled by the surface abstractions
that afford statistical characterization of the diverse patches.
In the future, we expect these statistical characterizations of
protein surfaces can be applied to larger corpora of proteins
to provide tools for automated characterization, annotation,
and classification.
[1] Greg Cipriano, Michael Gleicher. "Molecular Surface
Abstraction." IEEE Transactions on Visualization and
Computer Graphics (Proceedings Visualization 2007).
October 2007.
39: HIGH-THROUGHPUT CRYSTAL STRUCTURE
PREDICTION OF DRUG-LIKE MOLECULES
Bashir Sadjad, Zsolt Zsoldos and Aniko Simon (Simulated
properties around it, are central to any interaction as they
form the interface to partners. Tools for studying protein
interaction, therefore, must consider these functional
surfaces.
However, due to their size and complexity, protein surfaces
can be difficult both to assess visually and to characterize
quantitatively. Therefore, abstracted (simplified)
representations of the functional surface are important
components of tools for studying protein interfaces.
Abstracted representations afford easier visual inspection,
more robust shape analysis, parameterizations for encoding
properties on/around the surface, and areal descriptors that
allow for statistical aggregation.
We have developed molecular surface abstractions that
provide a multi-scale representation of the molecular surface
shape and physical properties around it [1]. These
abstractions simplify the functional surface by first
selectively removing high-frequency detail in the surface
geometry. Other physicochemical properties (e.g. charge,
hydrophobicity) are then aggregated and smoothed, to
produce a coarse representation of the original fields. To
avoid bias, fields are sampled onto the surface, and then
aggregated according to the overall smoothing amount. For
surface analysis, this process can be repeated over multiple
smoothing kernel sizes, producing a hierarchy of features for
a given surface point.
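The idea of aggregating a surface field at several smoothing scales can be sketched as follows. This is not the authors' implementation: it assumes a Gaussian kernel, uses straight-line distance between sample points instead of distance measured along the surface, and works on an invented row of points with alternating charge.

```python
import math

def smooth_field(points, values, sigma):
    """Gaussian-weighted average of a scalar field sampled at surface points."""
    smoothed = []
    for p in points:
        weight_sum = 0.0
        value_sum = 0.0
        for q, v in zip(points, values):
            w = math.exp(-math.dist(p, q) ** 2 / (2.0 * sigma ** 2))
            weight_sum += w
            value_sum += w * v
        smoothed.append(value_sum / weight_sum)
    return smoothed

def feature_hierarchy(points, values, sigmas):
    """For each point, return one smoothed value per kernel size."""
    per_scale = [smooth_field(points, values, s) for s in sigmas]
    return [tuple(scale[i] for scale in per_scale) for i in range(len(points))]

# Hypothetical surface samples: six points 1 Angstrom apart, alternating charge.
surface_points = [(float(i), 0.0, 0.0) for i in range(6)]
charges = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
kernel_sizes = (0.5, 5.0)   # fine and coarse smoothing widths (Angstroms)

hierarchy = feature_hierarchy(surface_points, charges, kernel_sizes)
fine = [h[0] for h in hierarchy]
coarse = [h[1] for h in hierarchy]
```

At the fine scale each point keeps most of its own charge, while at the coarse scale the alternating pattern averages out toward zero, which is the kind of detail suppression that lets major positive and negative regions stand out.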
To date, we have explored molecular surface abstraction in
three applications:
Visual Abstraction of Protein Surfaces:
We provide a tool for visual inspection of the functional
surfaces of proteins that displays abstracted views [1]. These
views depict the surface with detail suppressed, coloring,
surface textures and symbols. The included figure shows
striped yellow patches to denote regions of the surface in
contact with known ligands, 'H' symbols to highlight
potential hydrogen bonding regions, and surface coloring to
indicate electrostatic charge, which has been abstracted to
emphasize major positive and negative regions. The
abstractions facilitate comparison: The figure depicts
ribonuclease proteins from two frog species (1M07 and
215S) whose enzymatic activity varies by a factor of 100.
The important similarities and differences between these
contact regions are readily apparent because extraneous
detail has been removed. Also note that, though the charge
distribution differs between the two surfaces, it remains
essentially the same in the contact regions.
Complementarity Analysis of Crystal Contacts:
We apply surface abstraction to study protein-protein
interaction in the context of crystal-contacts in a packed
crystal structure. In these cases, abstracting the surface can
help to 'flatten' the contact patch, as the geometric
smoothing step removes high-frequency detail that can
confound surface parameterization. This, in turn, simplifies
the task of assessing the properties of each patch, and allows
registration of one patch with another to study both sides of
a contact region. We show such a registration in the
accompanying figure. As a proof of concept, we looked at
four crystallizations of myoglobin (1BZP, 1DTI, 1JW8,
1U7R), each occupying a different space group and having
Biomolecular Systems, Canada).
We are developing a method to predict crystal structures
of drug-like molecules. This supports the inclusion of
physical/material properties in the lead optimization
process. The initial results for rigid molecules show that
our method is capable of producing crystal structures
very close to the observed experimental ones, preserving
key interactions.
Computational methods are widely used by pharmaceutical
companies. High-throughput screening (HTS) and its 'in
silico' virtual version (VHTS) have improved the process of
finding 'hits' by enabling the discovery chemists to test big
libraries of molecules against a target. While traditionally in
the lead optimization stage, drug potency and selectivity
have been the main targets of the optimization, recently
there has been a shift toward including the physical
properties of the drug form (e.g., solubility) in this stage [1].
We are trying to develop a computational method that predicts
the possible crystal structures first and their corresponding
lattice energies second. This is analogous to the VHTS
tools used for docking, and we call it high-throughput crystal
prediction, or HTCP. There are many groups that are trying
shows two sample crystal structures of a cyclic amide
overlaid.
Our closest predicted structure is shown by molecules with
green bonds and the experimental structure is the CSD
refcode RUVZEN (the picture is generated using 'mercury',
the visualization software of CCDC). The RMSD between
the 6 closest neighbors is 1.02 angstrom in this example.
The key hydrogen bonds are also shown and it can be seen
that they are preserved in the predicted structure. The closest
generated structure by our method has an average RMSD of
1.08 angstroms for the test set we used.
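The structure comparison behind these RMSD values can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the central molecules have already been overlaid and that the neighbor atoms of both structures are listed in the same order. The function name neighbor_rmsd is hypothetical.

```python
import math


def neighbor_rmsd(predicted, experimental):
    """RMSD (same units as the input, e.g. angstroms) over matched atoms.

    Assumption: both lists hold (x, y, z) tuples for the atoms of the
    neighbor molecules, in the same order, expressed in the frame in
    which the two central molecules are already overlaid.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (p[k] - e[k]) ** 2
        for p, e in zip(predicted, experimental)
        for k in range(3)
    )
    return math.sqrt(squared / len(predicted))
```

In practice the overlay step itself needs a least-squares superposition of the central molecule, which is omitted here.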
There are two major issues that we are currently working on.
The first is the ranking of the generated structures. It is
important to use a fast scoring function in the initial search.
However, once a subset of good candidates is selected for
further optimization a more accurate scoring function is
required to properly rank them. The second issue is to add
flexibility to our search, so that we can optimize the
conformation and unit cell parameters at the same time.
In the future, we plan to extend the current method to
work for more complex asymmetric unit cells as well. This
means inclusion of salts or water molecules in the unit cell
or extending the asymmetric unit cell to more than one
molecule.
REFERENCES
[1] Gardner, C.R. and Walsh, C.T. and Almarsson, O.,
Drugs as materials: valuing physical form in drug
discovery., Nature Reviews - Drug Discovery, 2004, 3(11),
926--34.
[2] Day, G. M. et al., A third blind test of crystal structure
prediction., Acta Crystallographica Section B, 2005, 61(Pt.
5), 511--527.
[3] Zsoldos, Z., et al., eHiTS: a new fast, exhaustive flexible
ligand docking system., Journal of Molecular Graphics and
Modelling, 2007, 26(1), 198--212
[4] Allen, F. H., The Cambridge Structural Database: a
quarter of a million crystal structures and rising., Acta
Crystallographica Section B, 2002, 58(1 Pt. 3), 380--388.
[5] Chan, T.M. and Sadjad, B.S., Geometric optimization
problems over sliding windows., International Journal of
Computational Geometry and Applications, 2006, 16, 145--157.
40: THE JENA LIBRARY OF BIOLOGICAL
MACROMOLECULES - JENALIB: NEW FEATURES
Rolf Huehne, Frank-Thomas Koch and Juergen Suehnel
(Fritz Lipmann Institute, Germany)
to build such a computational method and there have been
several blind tests held by the Cambridge Crystallographic
Data Centre (CCDC). The third of these, in 2004, showed
that reliable prediction of crystal structures is still a long
way off, especially for flexible molecules [2].
From the geometric search perspective, some of the current
tools use stochastic methods while some others are more
systematic but use grids that are too coarse to reliably
produce structures close enough to the experimental ones.
We are trying to develop a method that can guarantee a
certain accuracy level while having an acceptable speed.
There are some fundamental new elements in our search
approach. We start from a pair of neighbor molecules in a
hypothetical crystal structure. For this step we rely on the
fast shape fitting engine developed for the eHiTS docking
software [3]. Fixing a pair of molecules puts some
constraints on the crystal space group and unit cell
parameters. On the other hand, we only generate pairs that
satisfy a set of geometric and energetic constraints. These
constraints are extracted from statistics collected from the
Cambridge Structural Database (CSD) [4]. Some of these
constraints are purely geometric; for example, we know that
for each molecule there should be another molecule where a
certain ratio of the two molecules' surfaces is in contact with
each other. Some other constraints depend on the
physicochemical properties of the molecules and the type of
interactions they have. The bottom line is that all these
constraints are statistically validated using the vast
information stored in CSD.
Our scoring function is also based on statistics collected
from the CSD. We define a set of interaction types. For each
occurrence of an interaction type in CSD, we collect the
relative geometry of the participating atoms. Estimating the
expected probability for each configuration (i.e., an
interaction type plus the relative geometry of participating
atoms in it), we assign an energy value to each configuration
using ideas inspired by the Boltzmann distribution.
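As an illustration of how a Boltzmann-inspired statistical score of this kind can be built, the sketch below converts observed frequencies of geometry bins into energies with E = -kT ln(p_obs / p_ref). The binning, the reference probabilities, the pseudocount and the function name are all assumptions made for the example; the abstract does not specify them.

```python
import math
from collections import Counter


def boltzmann_energies(observed_bins, reference_prob, kT=1.0, pseudocount=1.0):
    """Map each geometry bin to an energy via inverse Boltzmann statistics.

    observed_bins  : iterable of bin labels, one per occurrence in the database
    reference_prob : dict bin -> expected probability (assumed to be given)
    pseudocount    : added to every bin so unseen bins get a finite energy
    """
    counts = Counter(observed_bins)
    total = sum(counts.values()) + pseudocount * len(reference_prob)
    energies = {}
    for label, p_ref in reference_prob.items():
        p_obs = (counts[label] + pseudocount) / total
        # Bins seen more often than expected get negative (favorable) energy.
        energies[label] = -kT * math.log(p_obs / p_ref)
    return energies
```

For example, with eight "short" and two "long" contacts and a uniform reference, the "short" bin receives a negative energy and the "long" bin a positive one.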
Efficiency is a major concern for our HTCP tool, mainly
because the number of structures that we generate is huge
(hundreds of thousands to millions). For this reason we have
developed special structures and algorithms to process
molecules. For example we use a set of vectors for
systematic sampling of the shape of a molecular fragment.
The shape of the fragment is represented by the lengths of
these vectors from the center of the fragment to the point
where the vector intersects the surface of the molecule. This
allows a fast generation of non-clashing neighbor molecules
as the initial pairs used for crystal structure generation. It
can also guarantee a certain level of geometric accuracy
because of the bounds proven for the vector set used (this is
the basis of some of the geometric approximation algorithms
[5]).
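The vector-based shape representation can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' data structure: the fragment is modeled as a union of atom spheres centered near the origin, the direction set is a Fibonacci lattice chosen here for convenience, and both function names are hypothetical.

```python
import math


def fibonacci_directions(n):
    """Return n unit vectors spread roughly evenly over the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        directions.append((r * math.cos(golden_angle * k),
                           r * math.sin(golden_angle * k), z))
    return directions


def radial_profile(atoms, directions):
    """Length of each direction vector from the origin to the surface.

    atoms: list of ((x, y, z), radius), relative to the fragment center.
    For every direction the farthest exit point of the ray through any
    atom sphere is used; 0.0 means the ray misses all spheres.
    """
    profile = []
    for u in directions:
        reach = 0.0
        for center, radius in atoms:
            proj = sum(ui * ci for ui, ci in zip(u, center))
            disc = proj * proj - (sum(ci * ci for ci in center) - radius * radius)
            if disc >= 0.0:
                reach = max(reach, proj + math.sqrt(disc))
        profile.append(reach)
    return profile
```

Two such profiles can then be compared direction by direction to reject clashing placements quickly, which is the kind of use the text describes.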
Our tests on a set of rigid molecules show that our method
is capable of generating a structure very close to the
experimental one. To compare two crystal structures, we
followed the method used in the aforementioned blind tests,
which is to overlay one central molecule and calculate the
RMSD value for a set of neighbor molecules. The figure
The JenaLib (www.fli-leibniz.de/IMAGE.html) offers value-added information
for all PDB and NDB database entries, e.g.: PDB/NDB
atlas pages, QuickSearch, PDB/UniProt alignments,
Disulfides are generally viewed as structurally stabilizing
elements in proteins. However, it is well known that some
disulfides are redox active and capable of being reduced
under physiological conditions. The enzymatic role of
redox-active disulfides in thiol-disulfide reductases is
generally appreciated but it is less well-known that redox-active disulfides also act as redox-sensitive switches of
protein function [1,2]. Thiol-based regulatory control of
proteins has been demonstrated to be an important
physiological control mechanism in response to changing
redox conditions. In particular, redox-control of disulfides
has been shown to mediate the oxidative stress response via
control of transcription factors and other signalling
molecules. It is likely to be important in pathological
conditions involving abnormal redox states such as
cardiovascular failure and ageing [3].
The ability to distinguish between structural and redox-active disulfides is important for elucidating protein
function. Experimentally, the two types of disulfide can be
distinguished by their redox potentials. Disulfide redox
potentials measured in thiol-disulfide oxidoreductases range
from -120mV to -270mV [4-7]. For disulfides serving
structural purposes, the redox potential can be as low as -470mV [8]. However, individual measurements of this kind
are difficult and time consuming. A computational approach
that can identify and characterize redox-active disulfides will
contribute significantly to our understanding of disulfide
redox-activity.
Our work seeks to understand the physical principles of
disulfide redox-activity in protein structure. It has been
observed that sources of strain in a protein structure, such as
residues in forbidden regions of the Ramachandran plot and
cis-peptide bonds, are found in functionally important
regions of the protein and warrant further investigation [9-11]. We hypothesize that disulfides that disobey known
rules of protein stereochemistry have functional importance
via redox activity.
The Thornton-Richardson rules of disulfide stereochemistry
specify that disulfide bonds should not be found between
cysteine pairs [12,13]:
A. on adjacent β-strands;
B. in a single helix or strand;
C. on non-adjacent strands of the same β-sheet;
D. adjacent in the sequence.
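The four rules above lend themselves to a simple check on annotated cysteine pairs. The sketch below is only an illustration under assumed inputs: the Cys record, its field names and the set of adjacent strand pairs are hypothetical, and a real analysis would derive them from a secondary structure assignment.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Cys:
    index: int                    # residue position in the sequence
    kind: Optional[str]           # "helix", "strand", or None for loops
    element: Optional[str]        # id of the helix or strand holding the residue
    sheet: Optional[str] = None   # id of the beta-sheet, for strand residues


def violated_rules(c1: Cys, c2: Cys, adjacent: Set[FrozenSet[str]]) -> List[str]:
    """Return the letters (A-D) of the rules a disulfide c1-c2 would break.

    adjacent: set of frozensets, each naming two strands that are
    neighbors in a sheet (an assumed input format).
    """
    same_element = c1.element is not None and c1.element == c2.element
    two_strands = c1.kind == "strand" and c2.kind == "strand" and not same_element
    neighbors = two_strands and frozenset((c1.element, c2.element)) in adjacent
    rules = []
    if neighbors:
        rules.append("A")   # adjacent beta-strands
    if same_element:
        rules.append("B")   # same helix or strand
    if two_strands and not neighbors and c1.sheet is not None and c1.sheet == c2.sheet:
        rules.append("C")   # non-adjacent strands of one sheet
    if abs(c1.index - c2.index) == 1:
        rules.append("D")   # adjacent in sequence
    return rules
```

A disulfide for which this function returns a non-empty list would count as "forbidden" in the sense used here.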
In previous work in our lab, we have characterized the
cross-strand disulfide (CSD): a likely redox-active disulfide that
violates rule A [14-16]. CSDs come in two flavors:
antiparallel (aCSDs), which straddle antiparallel β-strands
and, more rarely, parallel (pCSDs), which bridge parallel β-strands. aCSDs are by far the most common type of
forbidden disulfide in solved protein structures. Here we
identify seven additional subtypes that violate the Thornton-Richardson rules of disulfide stereochemistry and examine
evidence for their involvement with functional redox
activity [17].
[1] Choi, H.J., Kim, S.J., Mukhopadhyay, P., Cho, S., Woo,
J.R., Storz, G. and Ryu, S.E. (2001). Cell 105, 103-13.
[2] Littler, D.R. et al. (2004). J Biol Chem 279, 9298-305.
Jmol-based molecule viewer, SNP and PROSITE motif
mapping. Most recent new features are: PFAM domain
mapping, sequence pattern search, integration of various
data (SAPs, Exons, Domains etc.) into PDB/UniProt
alignments.
The Jena Library of Biological Macromolecules (JenaLib,
www.fli-leibniz.de/IMAGE.html) offers value-added
information for all entries included in the Protein Data Bank
(PDB) and Nucleic Acid Database (NDB), e.g.:
- PDB/NDB atlas pages and entry lists
- PDB sequence information extracted from atomic
coordinates
- PDB/UniProt alignments that clearly indicate gaps,
mutations, numbering irregularities and modified residues
- Integration of data on single amino acid polymorphisms
(SAPs), PROSITE motifs with PDB, GO and taxonomy
information
- Platform-independent Jmol-based molecule viewer that
offers integrated viewing of ligand, site, SAP, PROSITE and
SCOP information both for asymmetric and biological units
- QuickSearch option that allows searching for PDB/NDB
code, UniProt ID/Accession and other search terms in one
input field
The most recent new features are:
- PFAM domain mapping, classification tree browser and
visualization
- Integration of various data into PDB/UniProt alignment
view, e.g.: SCOP/CATH/PFAM domains, PROSITE motifs,
SAPs, Exons
- Sequence homology search option (BLAST)
- Sequence pattern search option
Offering all this information and analysis tools in one place
makes JenaLib a unique resource for the dissemination of
3D structural information on biological macromolecules.
41: COMPUTATIONAL INSIGHTS INTO REDOX-ACTIVE DISULFIDES IN PROTEIN STRUCTURES
Samuel Fan, Richard George, Naomi Haworth & Merridee
Wouters (Structural and Computational Biology Program,
Victor Chang Cardiac Research Institute, Australia).
We are characterizing potentially redox-active
disulfides in structures. Our previous studies
investigated disulfide torsional energies and a structural
motif associated with redox-activity: the cross-strand
disulfide, which links adjacent beta-strands. Here we
searched for other “forbidden” disulfides which violate
rules of protein stereochemistry and examined evidence
supporting redox activity.
[3] Humphries, K.M., Szweda, P.A. and Szweda, L.I.
(2006). Free Radical Res. 40, 1239-43.
[4] Huber-Wunderlich, M. and Glockshuber, R. (1998).
Folding & Design 3, 161-71.
[5] Krause, G. and Holmgren, A. (1991). J Biol Chem 266,
4056-66.
[6] Lin, T.Y. and Kim, P.S. (1989). Biochemistry 28, 5282-7.
[7] Wunderlich, M. and Glockshuber, R. (1993). Protein Sci
2, 717-26.
[8] Gilbert, H.F. (1990). Adv Enzymol Relat Areas Mol Biol
63, 69-172.
[9] Gunasekaran, K., Ramakrishnan, C. and Balaram, P.
(1996). J Mol Biol 264, 191-8.
[10] Pal, D. and Chakrabarti, P. (2002). Biopolymers 63,
195-206.
[11] Herzberg, O. and Moult, J. (1991). Proteins 11, 223-9.
[12] Richardson, J.S. (1981). Adv Protein Chem 34, 167-339.
[13] Thornton, J.M. (1981). J Mol Biol 151, 261-87.
[14] Haworth, N.L., Feng, L.L. and Wouters, M.A. (2006). J
Bioinform Comput Biol 4, 155-68.
[15] Haworth, N.L., Gready, J. E., George, R.A., Wouters,
M.A. (2007). in press.
[16] Wouters, M.A., Lau, K.K. and Hogg, P.J. (2004).
BioEssays 26, 73-9.
[17] Wouters, M.A., George, R.A. and Haworth, N.L. (2007).
Current Peptide and Protein Science 8, 000
prediction, assessment, and web-based visualization of
thousands of candidate models.
A critical first step in comparative modeling is the accurate
alignment of the target with the template structure.
Introducing errors in the alignment phase will ultimately
lead to incorrect models that cannot be improved without an
aggressive and time-consuming refinement phase. It has
been shown that sampling along the alignment path with
stochastic dynamic programming yields so-called ‘suboptimal’
alignments that can be as good as the
structural alignment. Between 2,000 and 5,000 alternative
alignments were generated as inputs to Modeller. By
thoroughly sampling alignment space in this manner, we
have identified several methods that can reliably identify the
most native models among an ensemble. This approach
exercises structural assessment rather than sequence-based
homology measures.
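One common way to sample alternative alignments is a stochastic traceback through a partition-function version of global dynamic programming. The sketch below shows that idea on plain strings. It is an assumption about the general technique, not the procedure actually used for SA-COMPAS: the scoring values, the linear gap penalty, the temperature and the function name are all illustrative.

```python
import math
import random


def sample_alignment(a, b, match=2.0, mismatch=-1.0, gap=-2.0,
                     temperature=1.0, rng=None):
    """Draw one global alignment with Boltzmann probability exp(score / T)."""
    rng = rng or random.Random()
    n, m = len(a), len(b)
    w_gap = math.exp(gap / temperature)

    def w_pair(i, j):
        score = match if a[i - 1] == b[j - 1] else mismatch
        return math.exp(score / temperature)

    # Forward pass: z[i][j] sums the weights of all alignments of a[:i], b[:j].
    z = [[0.0] * (m + 1) for _ in range(n + 1)]
    z[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += z[i - 1][j - 1] * w_pair(i, j)
            if i > 0:
                total += z[i - 1][j] * w_gap
            if j > 0:
                total += z[i][j - 1] * w_gap
            z[i][j] = total

    # Stochastic traceback: pick each step in proportion to its weight.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0 and j > 0:
            moves.append("pair")
            weights.append(z[i - 1][j - 1] * w_pair(i, j))
        if i > 0:
            moves.append("gap_in_b")
            weights.append(z[i - 1][j] * w_gap)
        if j > 0:
            moves.append("gap_in_a")
            weights.append(z[i][j - 1] * w_gap)
        move = rng.choices(moves, weights=weights)[0]
        if move == "pair":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "gap_in_b":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```

Calling the function repeatedly yields an ensemble of alignments; lower temperatures concentrate the samples near the optimal alignment. For long sequences the weights should be kept in log space to avoid overflow or underflow.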
Our database, called SA-COMPAS, contains a detailed
repository of template-based protein structures generated
from alternative alignments. Targets were chosen from
CASP6 and CASP7 TBM category. Each model in the
dataset has many assessment scores calculated. Assessments
currently included in SA-COMPAS are the DOPE, DFIRE, and
ProsaII atomic statistical potentials, Pcons and ProQres for
global and local quality, the Rosetta energy score, and the
CHARMM energy coupled with a generalized Born implicit
solvent model (MMGB). Two additional scores,
Psipredpercent and Psipredweight, describe the agreement
between the secondary structure predicted by Psipred and the
model’s actual secondary structure as derived by DSSP. Our
results indicate the scores based on secondary structure to be
the most effective for discriminating models with incorrect
alignments. A low Psipred score generally means that
secondary structures of the final model are not well formed.
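A secondary-structure agreement score of this kind might look like the sketch below. The exact definitions of Psipredpercent and Psipredweight are not given in the text, so the two formulas here (plain percent agreement, and agreement weighted by prediction confidence) are assumptions, as is the function name. Both inputs are assumed to be already reduced to the same three-state alphabet (H, E, C).

```python
def ss_agreement(predicted, confidence, observed):
    """Return (percent agreement, confidence-weighted percent agreement).

    predicted  : string of H/E/C states from the predictor
    confidence : per-residue confidence values (e.g. 0-9), same length
    observed   : string of H/E/C states assigned to the model
    """
    if not predicted or not (len(predicted) == len(observed) == len(confidence)):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = [p == o for p, o in zip(predicted, observed)]
    percent = 100.0 * sum(matches) / len(matches)
    total_conf = sum(confidence)
    agreed_conf = sum(c for c, hit in zip(confidence, matches) if hit)
    weighted = 100.0 * agreed_conf / total_conf if total_conf else 0.0
    return percent, weighted
```

Under this reading, a model whose helices and strands fail to form where they are confidently predicted would receive a low weighted score.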
In order to compare the effectiveness of these assessment
scores, every predicted model in the database has been
compared to the crystallographic coordinates in several
ways. Perhaps the most important similarity score calculated is
the GDT_TS score, which is standard in the CASP
assessments. Other measures included are TMscore,
MaxSub, RMSD, fraction of correct native contacts, and
percent correct chi, psi, and rho values.
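For readers unfamiliar with GDT_TS, the score averages the percentage of residues whose C-alpha atoms lie within 1, 2, 4 and 8 angstroms of their native positions. The sketch below is a simplified version that assumes a single, fixed superposition and takes the per-residue distances as input; the full score searches for the best superposition at each cutoff, so this version can only underestimate it. The function name is hypothetical.

```python
def gdt_ts_fixed_superposition(distances):
    """Simplified GDT_TS from per-residue model-to-native distances (angstroms)."""
    if not distances:
        raise ValueError("need at least one distance")
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)
```

For distances of 0.5, 1.5, 3.0 and 9.0 angstroms the fractions are 1/4, 2/4, 3/4 and 3/4, giving a score of 56.25.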
Continuing efforts to add value to the databank will include
adding more targets to enumerate the entire known fold
space as well as calculating more quality assessment scores
as they become available. Recently we have added
comparative modeling experiments that considered multiple
template structures. Given the increasing size of the PDB
today it is not uncommon to have cases where several good
templates exist and it is nontrivial to select which one will
ultimately produce a better model. Our dataset indicates
whether or not structural assessment can consistently choose
the best structure from ensembles having models built from
multiple templates.
The core capabilities of the SA-COMPAS database resource
include searching the databank by keywords (literature),
database accession numbers (PDB, Uniprot), protein
function, and sequence similarity (BLAST). Researchers
interested in comparative modeling are able to graph scores
42: SA-COMPAS: A RESOURCE FOR PREDICTION,
ASSESSMENT, AND WEB-BASED VISUALIZATION
OF COMPARATIVE PROTEIN MODELS
Adam Kraut and Troy Wymore (National Resource for
Biomedical Supercomputing, Pittsburgh, USA)
Here we present a resource to manage the prediction,
assessment, and web-based visualization of large
ensembles of predicted protein structures. Over 300,000
structures so far were predicted and assessed with model
quality assessment programs, statistical potentials, and
molecular mechanics energy calculations to compare
effectiveness in the context of comparative modeling.
Assessing the quality of predicted protein structures is an
important problem of theoretical and practical interest. The
most effective protein structure prediction methods are ones
that generate large ensembles of possibilities and then
employ statistical potentials or scoring functions to identify
the best models among these ensembles. We have developed
a set of resources called SA-COMPAS that manages the
actual interfaces that are involved in oligomerization are
inferred from X-ray crystallographic structures using
assumptions about interface surface areas and physical
properties. In many cases, these hypothetical interfaces are
correct, but in other cases they may not be. Our previous
study showed that annotations on biological units in the
Protein Data Bank (PDB) and the Protein Quaternary Structure
server (PQS) agree only about 80% of the time.
We thoroughly examined the interfaces in crystals of single
homologous proteins in SCOP families. We attempted to
answer several questions. First, when are two crystals of the
same or similar proteins really the same crystal form and
when are they not? We find, surprisingly, that having
the same space group, asymmetric unit size, and very
similar cell dimensions and angles (within 1%) does not
guarantee that two PDB entries are actually the same crystal
form, that is, that they contain similar relative orientations and
interactions within the crystal. Conversely, two crystals in
different space groups may be quite similar in terms of all of
the interfaces within each crystal. Similar crystal forms can
be combined into a crystal form group if all interfaces with
ASA ≥ 200 Å² in one entry have corresponding interfaces in
another entry and at least 2/3 of the interfaces with ASA ≥ 200 Å²
in the second entry are similar to some interfaces of the first
entry. PDB entries in a family are then divided into different
crystal form groups (“CFGs”). Second, we examined the
hypothesis used by many crystallographers to infer
biological interactions: observation of the same interface in
different crystal forms of a protein (or members of the same
family) suggests that the interface may be biologically
relevant. We compared all interfaces in the available CFGs
in each family and determined those shared by two or more
CFGs. We determined the number of CFGs with a common
interface, M, compared to the total number of different
CFGs in the same family, N. The usefulness of these
numbers is evaluated with prior benchmarks on oligomeric
interactions as well as with NMR structures. NMR
structures and the benchmark of PDB crystallographic
entries consisting of 126 dimers and larger structures and
132 monomers were used to determine whether the
existence or lack of existence of common interfaces across
multiple crystal form groups can be used to predict whether
a protein is an oligomer or not, and to identify oligomeric
interfaces if they exist. Monomeric proteins tend to have
common interfaces across only a minority of crystal form
groups (M<<N), while higher order structures exhibit
common interfaces across a majority of available crystal
form groups (See the figure which plots the number of
interfaces we find in the benchmark for dimers and
oligomers vs M and N). The data can be used to estimate the
probability that an interface is biological if two or more
crystal form groups are available. We find 36 families in
which all N out of N crystal form groups contain a particular
interface, where N≥10. These interfaces are very likely to be
physiological.
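The grouping criterion and the M-of-N count described in this paragraph can be written down compactly. The sketch below is illustrative only: interfaces are represented as (label, ASA) pairs, and the similarity test is passed in as a function, because the abstract does not say how interface similarity is computed. All names are hypothetical.

```python
MIN_ASA = 200.0  # square angstroms, the threshold quoted in the text


def same_crystal_form(entry_a, entry_b, similar):
    """Apply the crystal-form criterion to two entries.

    entry_a, entry_b : lists of (interface, asa) tuples
    similar          : function (interface, interface) -> bool (assumed given)
    """
    big_a = [x for x, asa in entry_a if asa >= MIN_ASA]
    big_b = [y for y, asa in entry_b if asa >= MIN_ASA]
    all_a = [x for x, _ in entry_a]
    all_b = [y for y, _ in entry_b]
    # Every large interface of the first entry needs a counterpart in the second.
    a_covered = all(any(similar(x, y) for y in all_b) for x in big_a)
    # At least 2/3 of the large interfaces of the second entry must match the first.
    matched_b = sum(any(similar(y, x) for x in all_a) for y in big_b)
    return a_covered and 3 * matched_b >= 2 * len(big_b)


def shared_count(crystal_form_groups, interface):
    """Return (M, N): groups containing the interface, and total groups."""
    m = sum(interface in group for group in crystal_form_groups)
    return m, len(crystal_form_groups)
```

With M and N in hand, an interface found in most or all crystal form groups of a family (M close to N) is the kind the text treats as likely to be biological.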
Third, we examined the usefulness of evolutionary
information in evaluating interfaces appearing in more than
one crystal form. It occurs often that different crystal forms
of identical proteins contain common interfaces but that
within and across ensembles, examine the correlation
between assessment scores, and visually inspect the
structure and the alignment within the web interface. We
also provide tools for generating stochastic alignments as
well as building models from alignments using
Modeller9v4. Models, alignments, and all additional
calculated data can be downloaded per target or in batch
from our server. We’ve also cross-linked our database
entries with external resources such as PDBsum, CATH,
SCOP, and the RCSB PDB.
Also of interest to researchers is the modern software
architecture used for the web interface. The implementation
of SA-COMPAS leveraged many high-quality open-source
technologies. The relational database engine is powered by
MySQL and currently contains almost 1,000,000 individual
records. Server-side tasks are written in the Ruby
programming language and follow a Model-View-Controller
(MVC) pattern. Client-side interaction is handled by the
jQuery JavaScript framework. Two well-established Java
applets are used as additional components of the web
interface. Jmol provides 3D molecular graphics and Jalview
provides sequence alignment editing and graphics. SA-COMPAS is available at http://sacompas.cb.nrbsc.org
44: STATISTICAL ANALYSIS OF INTERFACES IN
CRYSTALS OF HOMOLOGOUS PROTEINS
Qifang Xu and Roland Dunbrack (FCCC, Philadelphia, USA)
Many proteins function as homooligomers. Examination
of interfaces across different PDB entries in SCOP
families identified common interfaces, which exist in
different crystal forms. Testing against benchmark data and
NMR structures indicates that these interfaces are likely to
be biological, and they can be used to predict the oligomeric
status of a protein.
Many proteins function as homooligomers and are regulated
via their oligomeric state. Homooligomerization may be part
of allosteric regulation, or contribute to conformational and
thermal stabilities and to higher binding affinity with other
molecules. Mutations in these interfaces may be associated
with disease. For instance, Caffey disease is a
genetic disorder caused by an abnormal dimeric chain due to a
missense mutation in exon of the gene encoding the a1
chain.
For some proteins, the stoichiometry of homooligomeric
states under various conditions has been studied using gel
filtration or analytical ultracentrifugation experiments. The
interfaces involved in these assemblies may be identified
using crosslinking and mass spectrometry, solution-state
NMR, and other experiments. But for most proteins, the
plasticity, learning and memory, and is believed to be the
target for the noble gas xenon to produce general anesthesia.
NMDA receptor is a hetero-oligomer composed of three
types of subunits: NR1 and NR2 (or NR3). Its activation
requires binding of neurotransmitter glutamate to the NR2
subunits and simultaneous binding of co-agonist glycine to
the NR1 subunits. With both of these native agonists bound,
the S1S2 clefts in the extracellular binding domains are
closed and the ion channel is opened allowing permeation of
ions. When an antagonist, such as 5,7-dichlorokynurenic acid
(DCKA), is bound to NR1, the S1S2 cleft is opened and the
ion channel is closed. Here, the opened or closed model
refers to the structure with the S1S2 cleft, rather than the ion
channel, opened or closed.
This study focuses on the interaction of xenon with the
ligand-binding domains of the NMDA receptor to gain
insights into the possible mechanisms of xenon’s anesthetic
action. We chose two X-ray NMDA receptor structures and
performed xenon docking using Autodock (1) and over 20-ns MD simulations using NAMD2 (2) in the absence and
presence of xenon. The structure 1PBQ (two NR1 subunits
complexed with the antagonist DCKA) represents the
opened NMDA model (3), whereas 2A5T (dimer of one
NR1 subunit with a glycine bound and one NR2 subunit
with a glutamate bound) represents the closed model (4).
Our study revealed several potential xenon-binding sites in
the ligand-binding domain, including the interface between
Domain 1 of the two subunits and the hinge region of the
S1S2 adjacent to the glycine- and glutamate-binding sites.
Our comparative study on the molecular dynamics of the
binding domains indicates that xenon has different effects on
the closed and the opened conformations of the S1S2 cleft.
A previous investigation suggested that xenon’s anesthetic
effect might arise from xenon’s competition with the native
agonists for the binding sites (5). Although the xenon
binding sites near the glutamate and glycine binding sites
exist, xenon occupation at these sites does not displace the
native agonists in the 20-ns simulations, and the closed
model (2A5T) is unchanged and seems insensitive to xenon.
In contrast, in the opened conformation (1PBQ) when the
native agonists are absent, xenon enhances the opening of
S1S2 cleft, suggesting that xenon competition with glycine
is indirect, acting by stabilizing the opened conformation of the S1S2
cleft and thereby making glycine binding to the open cleft
less favorable.
Another possible mechanism of xenon action on NMDA
receptor involves disruption of the “communication”
between two subunits. Previous studies suggest that the
functional formation of the dimer occurs due to interaction
between the Domain 1 of two subunits (6). Our MD
simulations revealed that xenon binding at the Domain 1
interface near the hinge of S1S2 cleft is stable over the
course of the simulation. This interaction may also
contribute to the xenon’s effects on NMDA receptor
function.
This study was supported by a grant from the NIH
(R01GM066358 and R01GM056257) and NCSA through
PSC.
REFERENCES
these usually appear in only 2 or 3 such forms and are not
shared by homologous proteins. That is, they are probably
only formed under non-physiological crystallization
conditions including high protein concentration, peculiar
pH, and the presence of nonphysiological ligands. This has
previously been observed for T4 lysozyme, which has been
studied in many crystal forms.
The benchmark data indicate that when an interface is
shared in as few as two different crystal forms by divergent
proteins (<90% identity), then the interface is very likely to
be biologically important. This highlights the importance of
solving structures of related proteins. We also find that in
large families, some interfaces are restricted to one branch
of a family, indicating the evolution of an interface in one
branch of the family and/or loss in another. Finally, we
compared interfaces common to multiple crystal forms with
the annotations found in the PDB, PQS, and PISA. With an
increasing number of crystal form groups that contain a
given interface, it becomes increasingly likely that the
available annotations agree that such an interface is part of a
biologically relevant assembly. PISA is found to be the most
reliable in identifying interfaces for which the evidence, in
terms of number of crystal forms containing the interface,
seems very high. PISA is therefore the best source of
biological assembly information when only one or two
crystal forms are currently available. The PDB in particular
is missing highly likely biological interfaces in its biological
unit files for about 10% of PDB entries.
45: XENON EFFECTS ON LIGAND BINDING
DOMAIN OF NMDA RECEPTOR
Lu Liu (University of Pittsburgh School of Medicine, USA),
Yan Xu (Department of Anesthesiology, Pharmacology,
Structural Biology, and Computational Biology,
University of Pittsburgh School of Medicine, USA) & Pei
Tang (Department of Anesthesiology, Pharmacology,
Structural Biology, and Computational Biology, University
of Pittsburgh School of Medicine, USA)
MD simulations of the interaction of xenon with its putative
anesthesia target, the N-methyl-D-aspartate (NMDA)
receptor, suggest two possible mechanisms of action: to
enhance opening of the agonist binding domains to
prevent agonist binding; and to reside at the domain
interface to disrupt the functionally important interplay
between subunits.
The N-methyl-D-aspartate (NMDA) receptor is a member of
the excitatory neurotransmitter receptors essential for synaptic
1. Huey, R., Morris, G. M., Olson, A. J. and Goodsell, D. S.
(2007) J. Computational Chemistry 28, 1145-1152.
2. Phillips, J. C., Braun, R., Wang, W., Gumbart, J.,
Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R. D.,
Kale, L. and Schulten, K.
(2005) Journal of Computational Chemistry 26, 1781-1802.
3. Furukawa, H. and Gouaux, E. (2003) The EMBO Journal 22,
2873-2885.
4. Furukawa, H., Singh, S. K., Mancusso, R. and Gouaux, E.
(2005) Nature 438, 185-192.
5. Dickinson, R., Peterson, B. K., Banks, P., Simillis, C.,
Martin, J. C., Valenzuela, C. A., Maze, M., and Franks, N.
P. (2007) Anesthesiology 107, 756-767.
6. Armstrong, N., Jasti, J., Beich-Frandsen, M. and Gouaux,
E. (2006) Cell 127, 85-97.
molecule is coupled with three sodium ions and one proton, and is
followed by the counter-transport of one potassium ion,
yielding an electrogenic uptake. Five members of human
glutamate transporters (EAAT1-5) have been characterized
after the first cloning of three rat transporters GLAST,
GLT1 and EAAC1. This family also includes two neutral
amino acid transporters ASCT1 and ASCT2, as well as a
number of homologous prokaryotic amino acid and
dicarboxylate transporters (2). Topology models based on
the cysteine-scanning accessibility studies of the
mammalian and bacterial carriers were recently advanced by
the crystal structure of an archaeal transporter, GltPh (3).
The first half of the protein forms six α-helical
transmembrane (TM) helices and the second half is
comprised of two reentrant loops (HP1 and HP2), a seventh
TM helix, interrupted by a β linker and an amphipathic TM8
helix (Fig.).
Top Fig. panel. Trimeric Structure of Glutamate
transporter, GltPh (1xfh). Top view from extracellular side
(left) and side view perpendicular to membrane bilayer
(right).
Molecular dynamics simulations based on the crystal
structure of GltPh showed that the HP2 loops possess a
strong tendency to move away from substrate binding site in
the absence of a substrate. Gaussian network models (GNM)
(4) and anisotropic network models (ANM) (5) on the other
hand, suggest large-scale motions of the extracellular region
of glutamate transporters, which would facilitate the cross-linking of the single cysteine mutants made in this region.
The low frequency modes within these models have
frequently been identified to be functionally important (6).
Here, the first non-degenerate mode reveals a symmetric
opening/closing of the extracellular vestibule (top panel).
This kind of motion perhaps plays a significant role in
substrate recognition and binding.
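To make the elastic network idea concrete, the sketch below builds the ANM Hessian for a set of node coordinates (typically one node per C-alpha atom). It is a minimal illustration of the standard construction, not the code used in this study: the 15 angstrom cutoff, the uniform spring constant and the function name are assumptions. The normal modes are the eigenvectors of this matrix, which a linear algebra library would supply; six zero-eigenvalue modes correspond to rigid-body motion, and the slowest remaining modes describe the large-scale motions discussed above.

```python
def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Return the 3N x 3N ANM Hessian as a list of lists.

    coords : list of (x, y, z) node positions in angstroms
    cutoff : nodes closer than this are joined by a spring (assumed value)
    gamma  : uniform spring constant (assumed value)
    """
    n = len(coords)
    hessian = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [coords[j][k] - coords[i][k] for k in range(3)]
            dist2 = sum(x * x for x in d)
            if dist2 == 0.0 or dist2 > cutoff * cutoff:
                continue
            for a in range(3):
                for b in range(3):
                    v = gamma * d[a] * d[b] / dist2
                    # Off-diagonal blocks hold -v; diagonal blocks accumulate +v.
                    hessian[3 * i + a][3 * j + b] -= v
                    hessian[3 * j + a][3 * i + b] -= v
                    hessian[3 * i + a][3 * i + b] += v
                    hessian[3 * j + a][3 * j + b] += v
    return hessian
```

A quick sanity check is that a uniform translation of all nodes costs no energy, so the Hessian applied to a translation vector gives zero.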
Bottom Fig. panel. Symmetric opening/closing of GltPh in
first non-degenerate ANM mode. Left and Right figures
display the ANM-predicted closed and open conformations,
respectively. In the central figure, corresponding to the X-ray
structure, the basin is exposed to the EC aqueous
environment, while in the closed form contacts between
neighboring subunits occur (see for example the L34 loops
colored red).
This was confirmed by cysteine cross-linking experiments in
the cysteine-less version of EAAT1, leading to functional
defects in the glutamate transporter. A series of single
cysteine mutants made in HP2b residues formed
intersubunit cross-links spontaneously and/or catalyzed by
an oxidizing agent, copper phenanthroline (CuPh). The
substrate accumulation activity of these mutants is
completely inhibited after cross-linking, which can be
reversed by treatment with DTT. With mutants V449C and
V453C, we found that substrate or its analog D,L-threo-β-benzyloxyaspartate (TBOA) can prevent the inhibition of
uptake activity during cross-linking.
Conformational changes of biomolecules at different time
scales are associated with their functions. Ideally,
researchers would like to watch individual atoms moving
within a protein. However, it is experimentally impossible at
47: LARGE SCALE MOTIONS IN GLUTAMATE
TRANSPORTERS REVEALED BY
ELASTIC NETWORK MODELS AND CYSTEINE
CROSS-LINKING STUDIES
Indira H Shrivastava1, Jie Jiang2, Susan G. Amara2 & Ivet
Bahar1 (1Department of Computational Biology &
2Department of Neurobiology, University of Pittsburgh,
USA).
We examined the most cooperative motions of the
glutamate transporter using elastic network models. Our
study suggests that the three subunits of the protein
undergo concerted fluctuations that alternately
increase/decrease the accessibility of the central aqueous
basin to the extracellular region. These large scale
motions are supported by cysteine cross-linking
experiments in the mutants V449C and V453C of the
cysteine-less version of human excitatory amino acid
transporter (EAAT1).
Glutamate transporters, also termed excitatory amino acid
transporters (EAATs), belong to a secondary active
transporter family which utilizes the free energy stored in
ion or solute gradients. These membrane proteins remove
excess glutamate from neuronal synapses, ensuring precise
synaptic communication between neurons and preventing
glutamate toxicity. Malfunction of glutamate transporters
has been implicated in neurological diseases and
psychiatric disorders (1). The influx of one glutamate
presents significant challenges in understanding how
sequence ultimately conveys both structure and function. As
concluded from CASP, CAPRI, and similar experiments, the
best structure often does not get the top score. To more fully
succeed at these tasks, critical insights into the protein
sequence-structure-function relationships may be obtained
through the characterization of the sequence space
compatible with a protein structure.
The task of engineering a protein to assume a target
three-dimensional structure is known as protein design. Practical
applications of design include modifications of existing
proteins to affect such characteristics as stability or binding
affinity. A more ambitious goal is to design protein
sequences that will assume novel structures or acquire new
functionalities. Computational search algorithms are devised
to predict a minimal energy amino acid sequence for a
particular structure. In practice, however, an ensemble of
low energy sequences is often sought. Primarily, this is
performed since an individual predicted low energy
sequence may not necessarily fold to the target structure due
to both inaccuracies in modeling protein energetics and the
non-optimal nature of search algorithms employed. Also,
some low energy sequences may be overly stable and thus
lack the dynamic flexibility required for biological
functionality.
Thus, a thorough understanding of the low energy sequence
space will enhance protein design efforts and also allow
designers to focus on structural positions with high potential
to be successfully mutated. Moreover, the investigation of
low energy sequence ensembles will provide crucial insights
into the pseudo-physical energy force fields that have been
derived to describe structural energetics for protein design.
Significantly, numerous studies have predicted low energy
sequences, which were subsequently synthesized and
demonstrated to fold to desired structures. However, the
characterization of the sequence space defined by such
energy functions as compatible with a target structure has
not been performed in full detail. Thus, we are interested in
exploring the near-optimal sequence space induced by a
widely-used energy function (in this case, the Rosetta
function), in an attempt to comprehend the predictive
consequences of using such energy functions for protein
design.
Methods and Results
In this work, we present a conceptually novel algorithm that
rapidly predicts the set of lowest energy sequences for a
target structure. Based on the theory of probabilistic
graphical models, our algorithm performs efficient
inspection and partitioning of the near-optimal sequence
space, without making any assumptions of positional
independence. Specifically, the underlying computational
tool we utilize is the representation of the protein design
energy optimization problem as a probabilistic graphical
model and subsequent application of the max-product loopy
belief propagation algorithm for finding high probability
sequences. We thus efficiently find minimal energy
sequences when the underlying search space includes
numerous possible rotamers (discrete side-chain
present and the dynamics of proteins are mostly inferred
from sophisticated biophysical methods which measure
physical properties. Alternatively, mutagenesis studies may
provide some valuable information on the in-depth
mechanism of proteins if mutants or sulfhydryl
modifications of cysteine substitutions lock biomolecules in
specific conformational states. In contrast, computational
simulations have the unbeatable edge to describe protein
dynamics completely since they can follow the precise
position of each atom at any instant in time, provided the
high resolution crystal structure of the protein is known.
Therefore, the combination of experimental studies with
computational simulations is of great importance for
elucidating allosteric motions bearing functional
significance.
REFERENCES
1. Amara, S. G., and Fontana, A. C. (2002) Neurochem Int
41(5), 313-318.
2. Slotboom, D. J., Konings, W. N., and Lolkema, J. S.
(1999) Microbiol Mol Biol Rev 63(2), 293-307
3. Yernool, D., Boudker, O., Jin, Y., and Gouaux, E. (2004)
Nature 431(7010), 811-818
4. Bahar, I., Atilgan, A. R., and Erman, B. (1997) Fold Des
2(3), 173-181
5. Xu, C., Tobi, D., and Bahar, I. (2003) J Mol Biol 333(1),
153-168
6. Bahar, I., and Rader, A. J. (2005) Curr Opin Struct Biol
15(5), 586-592
48: ACCURATE PREDICTION OF THE NEAR-OPTIMAL
SEQUENCE SPACE FOR ATOMIC-LEVEL PROTEIN DESIGN
Menachem Fromer (The Hebrew University of Jerusalem, Israel) &
Chen Yanover
Characterization of the sequence space compatible with a
protein structure will provide insights into the
sequence-structure-function relationship. We
present a novel algorithm, based on probabilistic
graphical modeling, to obtain near optimal sequences for
protein design. Our approach obtains lower energy
ensembles as compared to state-of-the-art methods and
suggests intriguing biological insights.
Introduction
After decades of computational and experimental research
on protein structure-function relationships, the 'protein
folding problem' is solved to a certain degree. Nevertheless,
while prediction methods often find the correct fold (and
even a structure with low RMSD to the targeted structure),
the additional step of identifying the top-scoring structure
conformations) for each amino acid type, without
considering any sequence more than once.
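As an illustration of the max-product idea described above, the following is a minimal Python sketch, not the authors' implementation. It runs min-sum message passing (max-product expressed in the energy domain) on a toy pairwise energy model and returns only the single lowest-energy assignment; the enumeration of an ensemble of near-optimal sequences is not shown. The function names, the three-position chain and all energy values are assumptions made for illustration.

```python
from itertools import product


def min_sum_bp(unary, pairwise, n_iters=50):
    """Loopy min-sum belief propagation (max-product in the energy domain).

    unary:    {position: [energy of each state]}
    pairwise: {(i, j): E} with E[a][b] = energy of state a at i and state b at j
    Returns {position: index of the selected state}.
    """
    nbrs = {i: [] for i in unary}
    energy = {}
    for (i, j), table in pairwise.items():
        nbrs[i].append(j)
        nbrs[j].append(i)
        energy[(i, j)] = table
        energy[(j, i)] = [list(col) for col in zip(*table)]  # transposed view

    # msgs[(i, j)][b] is the message from i to j about state b of j.
    msgs = {(i, j): [0.0] * len(unary[j]) for i in nbrs for j in nbrs[i]}

    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            out = [
                min(
                    unary[i][a]
                    + energy[(i, j)][a][b]
                    + sum(msgs[(k, i)][a] for k in nbrs[i] if k != j)
                    for a in range(len(unary[i]))
                )
                for b in range(len(unary[j]))
            ]
            shift = min(out)  # normalise so messages stay bounded
            new[(i, j)] = [m - shift for m in out]
        msgs = new  # synchronous update

    assignment = {}
    for i in unary:
        belief = [
            unary[i][a] + sum(msgs[(k, i)][a] for k in nbrs[i])
            for a in range(len(unary[i]))
        ]
        assignment[i] = belief.index(min(belief))
    return assignment


def total_energy(assignment, unary, pairwise):
    e = sum(unary[i][s] for i, s in assignment.items())
    e += sum(t[assignment[i]][assignment[j]] for (i, j), t in pairwise.items())
    return e


def brute_force(unary, pairwise):
    """Exhaustive search, feasible only for toy problems."""
    keys = sorted(unary)
    candidates = (dict(zip(keys, states))
                  for states in product(*(range(len(unary[k])) for k in keys)))
    return min(candidates, key=lambda a: total_energy(a, unary, pairwise))


# Toy model (hypothetical energies): three positions on a chain 0 - 1 - 2.
UNARY = {0: [0.0, 1.0], 1: [0.5, 0.0, 2.0], 2: [1.0, 0.0]}
PAIRWISE = {
    (0, 1): [[0.0, 2.0, 1.0], [1.5, 0.0, 0.3]],
    (1, 2): [[0.2, 1.0], [2.0, 0.1], [0.0, 0.0]],
}

if __name__ == "__main__":
    print(min_sum_bp(UNARY, PAIRWISE))
```

On tree-shaped models such as this chain, message passing recovers the exact minimum, which is why it can be checked against exhaustive search here. On graphs with cycles the same updates are only approximate and may not converge.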
We benchmark the performance of our novel algorithm on a
diverse set of protein design examples taken from the
literature and show that it consistently yields sequences of
lower energy than those derived from state-of-the-art
techniques (e.g. DEE, A*, Monte Carlo simulated
annealing). Thus, we find that previously presented search
techniques do not depict the low energy space as
precisely. We also observe that for cases when the complete
set of lowest energy sequences can be exhaustively
enumerated, the algorithm empirically obtains this set.
Examination of the predicted ensembles indicates that, for
each structure, the amino acid identity at a majority of
positions must be chosen extremely selectively so as to not
incur significant energetic penalties. We investigate this
high degree of sequence and biochemical similarity and
demonstrate how more diverse near-optimal sequences can
be predicted by our algorithm in order to systematically
overcome this bottleneck for computational design.
Furthermore, we exploit our in-depth analysis of a collection
of low energy sequences to generate novel biological
hypotheses. This is possible since, in effect, a set of low
energy sequences for the target structure characterizes
sequences well-suited to fold to the structure. This
information is summarized in a sequence profile
(position-specific scoring matrix, PSSM) that tabulates the positional
amino acid probabilities for sequences predicted to fold to
the structure. These profiles were then studied and used to
suggest an interpretation of previously observed
experimental design results for the calmodulin (CaM)
protein.
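A sequence profile of the kind mentioned above can be tabulated in a few lines. The sketch below is a hedged illustration only: the function name, the optional pseudocount and the example sequences are assumptions, and it computes plain positional frequencies rather than any particular scoring scheme used by the authors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(sequences, pseudocount=0.0):
    """Return one {amino acid: probability} dict per position.

    All sequences must have the same length (one residue per design position).
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        profile.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


if __name__ == "__main__":
    # Hypothetical low energy sequences for a three-position design.
    ensemble = ["ACD", "ACE", "GCD", "ACD"]
    for pos, column in enumerate(sequence_profile(ensemble)):
        seen = {aa: p for aa, p in column.items() if p > 0}
        print(pos, seen)
```

A position where one amino acid carries nearly all of the probability corresponds to the highly selective positions described earlier in the abstract.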
In conclusion, the novel methodologies introduced here
accurately portray the sequence space compatible with a
protein structure, thus providing a powerful instrument for
future work on protein design. In addition, we have supplied
a generic and customizable scheme to yield heterogeneous
low energy sequences predicted to fold to a target structure.
By providing an arsenal of varied (yet near-optimal)
sequences, this protocol adds a layer of robustness to the
design process and can thus play a critical role in enabling
protein scientists to successfully continue developing novel
proteins at an ever-increasing pace and scale.
CATH database. We found that the distribution of
protein conformational diversity is very heterogeneous
also at the S60 level of homologous superfamilies. We
found that this distribution is correlated with functional
diversification.
It is well established that the native state of a protein is
better described by a set of conformers with about the same
energy and in dynamic equilibrium. This conformational
diversity is a key feature of proteins for understanding their
functions and sequence-structure relationships. Since the pioneering
experiments of Max Perutz in the early 60s with his studies
on the T and R forms of hemoglobin, the study of protein
conformations has had a central role in several areas of structural
biology, such as functional characterization, drug design,
development of docking and structural alignment techniques,
and the understanding of protein evolution.
Here we study the extension and distribution of protein
conformational diversity in proteins. To this end, we have
used proteins with more than one crystallographic structure
as derived from the CATH structural database. We collected
all the proteins sharing the first 7 codes corresponding to
CATH structural classification. In those cases where different
chains of a given oligomeric structure were present, a single
representative domain was chosen randomly. After this, we
obtained 7700 proteins, of which 45% have at least 2
crystallographic structures. Using this derived database we
estimated the conformational diversity for each protein,
using different measures of structural similarity such as RMS,
TMscore, GDT and Maxsub scores. An all versus all
calculation for those structural similarity scores was
performed using the structures for each protein in the
derived database mentioned above. Then, for each score the
maximum structural dissimilarity was registered and these
values were used to estimate conformational diversity.
This information was complemented with properties
taken from other databases, such as protein length, oligomeric
state (PQS and PDB), taxonomy (NCBI taxonomy
database), functional annotation (GO terms and EC), and
presence of ligands (PDB).
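The all versus all step described above reduces to taking the maximum pairwise dissimilarity over the conformers of one protein. The following sketch shows that reduction under stated assumptions: the function names are hypothetical, and a superposition-free distance-matrix RMSD stands in for the RMS, TMscore, GDT and Maxsub scores named in the text, which would be plugged in through the `dissimilarity` argument.

```python
from itertools import combinations
from math import dist, sqrt


def drmsd(coords_a, coords_b):
    """Distance-matrix RMSD between two conformers with matching atom order.

    Compares intramolecular distances, so no superposition is required.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("conformers must contain the same atoms")
    pairs = list(combinations(range(len(coords_a)), 2))
    squared = sum(
        (dist(coords_a[i], coords_a[j]) - dist(coords_b[i], coords_b[j])) ** 2
        for i, j in pairs
    )
    return sqrt(squared / len(pairs))


def conformational_diversity(conformers, dissimilarity=drmsd):
    """Maximum dissimilarity over all pairs of conformers of one protein."""
    if len(conformers) < 2:
        return 0.0
    return max(dissimilarity(a, b) for a, b in combinations(conformers, 2))


if __name__ == "__main__":
    # Hypothetical three-atom conformers (coordinates in arbitrary units).
    extended = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    shifted = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (7.0, 5.0, 5.0)]
    bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    print(conformational_diversity([extended, shifted, bent]))
```

Because the maximum is taken over pairs, adding a redundant structure (such as the translated copy above) does not change the estimate.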
In general we found that conformational diversity does not
depend on protein length or on the number of crystallized
structures for each protein. Using the CATH structural
classification, each of the three main structural classes
(mainly alpha, mainly beta and mixed alpha and beta) seems
to have a similar content of conformational diversity.
However, architectures and topologies for each class showed
a clear heterogeneity in conformational diversity extension.
At this level we found a strong correlation with functional
diversity using the GO terms classification of proteins. These
results reflect the great diversity found when the
homologous superfamily level was evaluated. At this level,
and in spite of the great structural similarity, the
heterogeneity in conformational diversity is observed up to the
S60 level (homologous families with more than 60%
identity). We also found that the extension of
conformational diversity does not depend on the
presence/absence of ligands.
49: DISTRIBUTION AND EXTENSION OF PROTEIN
CONFORMATIONAL DIVERSITY
Ezequiel Iván Juritz, Sebastián Fernández Alberti and
Gustavo Parisi (Quilmes National University, Argentina)
We studied the extension and distribution of protein
conformational diversity as derived from the analysis of
probes (small organic molecules or functional groups), each
in a very large number of poses. The binding free energy
expression includes truncated van der Waals interactions
(both attractive and repulsive), a simplified PB electrostatic
term, and a structure-based pairwise potential. This function
provides adequate accuracy and can be written in the form
of a sum of correlations, which is suitable for calculation
using FFT. After adapting the energy function to model
probe binding to integral membrane proteins, the mapping
applied to the open and closed channel structures provided
very informative results. The open structure (determined by
X-ray crystallography) has its energetically most important
“hot spot” at the experimental drug binding site. No other
binding site is found, as some residues at the external
binding sites are not present in the structure, but there is a
strong free energy “field” toward the internal site. We also
docked amantadine and have shown that it binds at the
experimentally determined position inside the channel. For
the closed (NMR-derived) structure the mapping finds “hot
spots” both at the internal site and at all four external sites.
Docking of amantadine shows that it does not fit inside the
closed structure, in spite of the strong hot spot, but binds
outside as seen in the NMR structure. These results suggest
that at high pH the four externally bound inhibitors improve
the stability of the closed state, but as the pH is decreased
and the channel would open, the inhibitor may shift to the
internal site blocking proton transfer.
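The remark above that the energy expression is a sum of correlations, and therefore suited to FFT, can be illustrated with a small NumPy sketch. This is not the mapping software described in the abstract: the grids, term names and function names are hypothetical, periodic boundaries are assumed, and only translations (not rotations) of the probe are scored.

```python
import numpy as np


def fft_correlation(receptor_grid, probe_grid):
    """score[t] = sum_x receptor[x + t] * probe[x] for every translation t.

    Both grids must have the same shape (pad the probe with zeros);
    boundaries are treated as periodic.
    """
    r_hat = np.fft.fftn(receptor_grid)
    p_hat = np.fft.fftn(probe_grid)
    return np.real(np.fft.ifftn(r_hat * np.conj(p_hat)))


def direct_correlation(receptor_grid, probe_grid):
    """Same quantity by explicit summation, for checking small grids only."""
    axes = tuple(range(receptor_grid.ndim))
    scores = np.zeros(receptor_grid.shape)
    for t in np.ndindex(*receptor_grid.shape):
        shifted = np.roll(receptor_grid, tuple(-s for s in t), axis=axes)
        scores[t] = np.sum(shifted * probe_grid)
    return scores


def total_score(receptor_terms, probe_terms):
    """Sum of correlations, one per energy term (e.g. vdW, electrostatics)."""
    return sum(fft_correlation(r, p) for r, p in zip(receptor_terms, probe_terms))


def best_translation(scores):
    """Grid index of the lowest-energy translation."""
    return np.unravel_index(np.argmin(scores), scores.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    receptor = rng.normal(size=(8, 8, 8))   # hypothetical receptor potential
    probe = np.zeros((8, 8, 8))
    probe[:2, :2, :2] = 1.0                 # hypothetical 2x2x2 probe
    scores = total_score([receptor], [probe])
    print(best_translation(scores))
```

Each FFT-based term costs on the order of N log N operations for N grid points, whereas the explicit double loop costs N squared, which is what makes an exhaustive translational search practical.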
52: EXPLORING THE ACTIVATION MECHANISM
OF A G-PROTEIN-COUPLED PROTEIN RECEPTOR,
RHODOPSIN, USING NORMAL MODES FROM
COARSE-GRAINED ELASTIC NETWORK MODELS
IN MOLECULAR DYNAMICS SIMULATIONS
Our results indicate that the distribution of protein
conformational diversity is not uniform in the structural
space provided by CATH database. There exists a strong
heterogeneity in the extension of protein conformational
diversity also within the S60 level. Although several reports
indicate that homologous proteins with the same overall
fold share their dynamics and structural deformations, here
we found that the extension of the conformational diversity
reached by a given protein is strongly influenced by
functional constraints during evolution.
51: ANALYSIS OF POTENTIAL PROTON CHANNEL
INHIBITION MECHANISMS BY COMPUTATIONAL
PROTEIN MAPPING
Dima Kozakov, Gwo-Yu Chuang, Dmitry Beglov, Ryan
Brenke and Sandor Vajda (Boston University, USA).
The influenza A virus proton channel in the open state
binds an inhibitor in the middle of the four-helix
channel, whereas in the closed state it binds four
inhibitor molecules on the outer surface. Computational
solvent mapping, a technique developed to determine
“hot spot” regions of proteins, resolves the apparent
controversy between the two binding mechanisms.
The integral membrane protein M2 of influenza virus forms
a pH-gated proton channel which is necessary for infection
and hence it is an important drug target. A recent X-ray
structure of the transmembrane region captures the channel
in the open state. The channel was also crystallized with the
inhibitor amantadine, bound in the middle of the four-helix
bundle.
An NMR structure, published at the same time, shows the
channel in the closed state, and reveals four amantadine-like
inhibitor molecules binding at the channel’s lipid-exposed
outer surface. Despite similarities in the structure, the
different binding of the inhibitor in the open and closed states
implies different mechanisms of inhibition. Based on the X-ray structure,
in the open state the drug blocks the channel and prevents
proton transfer, whereas by the NMR structure the four drug
molecules bound on the outside stabilize the closed state. In
view of this contradiction it is not clear how well the in vitro
structures represent the in vivo mechanism of inhibition.
To study this controversy we applied computational protein
mapping, a technique developed for the characterization of
protein binding sites. The method performs an efficient
global search based on the Fast Fourier Transform (FFT)
correlation approach to evaluate the binding of a number of
Basak Isin1, Klaus Schulten2, Emad Tajkhorshid2, & Ivet
Bahar1 (1Department of Computational Biology, University
of Pittsburgh, USA, 2Beckman Institute, Department of
Physics, University of Illinois at Urbana-Champaign, USA).
Rhodopsin is a member of the pharmaceutically relevant
G-protein-coupled receptor (GPCR) family and serves as a
prototype for understanding their activation. We studied
functional motions of rhodopsin at atomic detail in the
presence of water and lipids by proposing a new
molecular dynamics protocol that utilizes normal modes
derived from the anisotropic network model.
G protein–coupled receptors (GPCRs) are involved in a
number of clinically important ligand-receptor processes
surface. We seek to explore the global dynamics, while
incorporating the effects of explicit residues and interactions
with lipid and water at atomic detail. We propose for this
purpose an algorithm, referred to as ANM-restrained MD,
which uses the deformations derived from ANM analysis as
restraints in MD trajectories. This permits us to sample the
collective motions that are otherwise beyond the range of
conventional MD simulations. With this new approach, we
seek to incorporate the realism and accuracy of MD into
ENM analysis while taking advantage of ENM to accelerate
MD simulations. The steps of ANM-restrained MD can be
summarized as follows (9):
1. Normal modes are generated using ANM. A subset of low
frequency, global modes that are sufficiently decoupled
from others, is selected.
2. Starting from the first mode associated with the lowest
eigenvalue, harmonic restraints are applied in two opposite
directions (plus and minus) in MD simulations.
3. The resulting two conformations are then subjected to
energy minimization to relieve possible unrealistic
distortions induced by the restraints. The conformer with the
lower energy is then selected as the starting structure for the
application of the next mode as new harmonic restraints.
4. When all modes in the subset are utilized in MD, a new
set of modes is generated by ANM for the next cycle of
ANM-restrained MD and the procedure described above is
repeated using the new subset of modes.
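Steps 1 and 2 above can be sketched with NumPy as follows. This is a simplified illustration, not the published protocol: the 15 Å cutoff, the unit spring constant, the displacement amplitude and all function names are assumptions, and the MD runs and energy minimization of steps 2 to 4 are not shown. The sketch only builds the ANM Hessian from Cα coordinates, extracts the lowest-frequency non-rigid modes and produces the two target geometries (plus and minus) toward which harmonic restraints could pull.

```python
import numpy as np


def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """3N x 3N ANM Hessian for N nodes (e.g. C-alpha atoms).

    Nodes closer than `cutoff` are joined by springs of constant `gamma`.
    """
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = float(d @ d)
            if dist2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / dist2
            hessian[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hessian[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hessian[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hessian[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    return hessian


def low_frequency_modes(coords, n_modes=3, cutoff=15.0):
    """Eigenvalues and per-node displacement vectors of the slowest modes.

    The first six eigenvectors (rigid-body translation and rotation of a
    connected network) are skipped.
    """
    eigvals, eigvecs = np.linalg.eigh(anm_hessian(coords, cutoff))
    keep = slice(6, 6 + n_modes)
    modes = eigvecs[:, keep].T.reshape(n_modes, -1, 3)
    return eigvals[keep], modes


def restraint_targets(coords, mode, amplitude=1.0):
    """Target coordinates in the plus and minus directions of one mode."""
    return coords + amplitude * mode, coords - amplitude * mode


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ca = rng.normal(size=(30, 3)) * 8.0      # hypothetical C-alpha coordinates
    values, modes = low_frequency_modes(ca, n_modes=2, cutoff=15.0)
    plus, minus = restraint_targets(ca, modes[0], amplitude=2.0)
    print(values, plus.shape, minus.shape)
```

In the protocol described above, the lower-energy of the two resulting conformers would seed the next mode, and the modes would be recomputed once the subset is exhausted.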
Figure 1 left panel shows the ribbon diagram of rhodopsin
color coded by the residue-RMSDs between the starting and
end conformations of the simulations, from red (least
mobile) to blue (most mobile). TM helices and the
cytoplasmic loops are labeled. We identify two highly stable
regions in rhodopsin, one clustered near the chromophore,
the other near the cytoplasmic ends of transmembrane
helices H1, H2 and H7.
The hinge site in the vicinity of the chromophore (Figure 1,
right, bottom panel) includes residues that are directly
affected by the isomerization of retinal, as well as those
stabilizing the all-trans conformation. We compared the hinge site
residues in the chromophore binding pocket with experiments
investigating the decay rate of active rhodopsin (10). These
experiments have been useful in estimating the role of a
given amino acid in the structure and function of rhodopsin.
Eleven of the 16 residues of the hinge site were studied in these
experiments and found to affect the stability of the active
state. Along with the validated hinges, the 5 untested residues
are proposed to be critical for active state stability and good
candidates for decay experiments.
In the second stable region (Figure 1, top, right panel), we
found that two water molecules located in the cavity
between helices H1, H2 and H7, connect the highly
conserved NPXXY motif on H7 to the highly conserved N-D
pair on H1 and H2. This supports the previous suggestions
that water molecules in the interior of GPCRs could play
critical roles in regulating their activity (11,12).
The CP ends of H3, H4, H5 and H6, and the connecting
loops CL2 and CL3 at the CP region, are highly mobile with
high RMSDs leading to the exposure of the ERY motif
crucial for G-protein binding (Figure 1, left panel).
and perform diverse functions including responses to light,
odorant molecules, neurotransmitters, and hormones. The
crystal structures in inactive states are available for only two
GPCRs, rhodopsin and the beta-adrenergic receptor, and no
structure has yet been determined for an active state of any
GPCR (1). Rhodopsin, the vertebrate dim-light
photoreceptor, is one of the best-characterized members of
the GPCR family. The structure-function studies of rhodopsin
provide the fundamental basis for understanding how
members of the GPCR family work.
Like all GPCRs, rhodopsin comprises cytoplasmic (CP),
transmembrane (TM), extracellular (EC) domains and
contains a bundle of seven TM helices (H1-H7)(2). The CP
region includes three CP loops (CL1-CL3), a soluble helix
(H8) and the C-terminus (C) (see Figure 1, left panel).
Seven helices (H1-H7) span the TM region. This TM bundle
encloses the chromophore, 11-cis-retinal, which is covalently bound
to Lys296 on H7. 11-cis-retinal (colored orange in
Figure 1) acts as an antagonist in the dark. The EC region
consists of three loops (EL1-EL3) and the N-terminus (N).
Light absorption by rhodopsin isomerizes 11-cis-retinal to
all-trans. Then, in the chromophore binding pocket, structural
perturbations trigger the rearrangement of helices and the
exposure of critical sites for G-protein binding on the CP
domain (3,4).
Despite the extensive biophysical and biochemical data on
rhodopsin activation, details about how the conformational
changes for activation are triggered and the molecular
mechanisms explaining the experimental data on the active
state of rhodopsin still remain unknown.
For exploring the biologically relevant, long timescale
motions of large structures, elastic network models (ENMs)
such as anisotropic network models (ANM) have been
successfully used while avoiding expensive computations
(5,6). ENMs are based on the topology of inter-residue
contacts in the native structure. They assume that
many functional mechanisms of proteins are intrinsically
defined by their 3-dimensional structure. Interactions
between residues in close proximity are represented by
harmonic potentials with a uniform spring constant, and
network junctions are usually identified by the Cα atoms.
Low frequency motions, also referred to as ‘global’ modes,
are insensitive to the details of the models and energy
parameters used in normal mode analyses. Despite their
numerous insightful applications, ENM methods have
limitations. They lack information on residue specificities,
atomic details, side chain motions, and the effects of
interactions with the environment such as the lipids and
water molecules on proteins. On the other hand, Molecular
Dynamics (MD) simulations provide atomic-level detail
with high temporal resolution for both harmonic and
anharmonic motions. However, the standard MD is not
efficient for sampling large conformational changes
spanning periods of time longer than microseconds
especially for large macromolecules (7,8).
Here, our aim is to find at atomic detail the biologically
relevant conformations of rhodopsin which couple retinal
isomerization to conformational changes in both the TM
domain and the critical G-protein binding sites on the CP
REFERENCES
1. Kobilka, B. and G. F. Schertler. 2008. New G-protein-coupled
receptor crystal structures: insights and limitations.
Trends Pharmacol. Sci 29:79-83.
2. Palczewski, K., T. Kumasaka, T. Hori, C. A. Behnke, H.
Motoshima, B. A. Fox, I. Le Trong, D. C. Teller, T. Okada,
R. E. Stenkamp, M. Yamamoto, and M. Miyano. 2000.
Crystal structure of rhodopsin: A G protein-coupled
receptor. Science 289:739-45.
3. Isin, B., A. J. Rader, H. K. Dhiman, J. Klein-Seetharaman,
and I. Bahar. 2006. Predisposition of the dark
state of rhodopsin to functional changes in structure.
Proteins 65:970-983.
4. Klein-Seetharaman, J. 2002. Dynamics in rhodopsin.
Chembiochem 3:981-6.
5. Atilgan, A. R., S. R. Durell, R. L. Jernigan, M. C.
Demirel, O. Keskin, and I. Bahar. 2001. Anisotropy of
fluctuation dynamics of proteins with an elastic network
model. Biophys. J. 80:505-515.
6. Bahar, I., A. R. Atilgan, and B. Erman. 1997. Direct
evaluation of thermal fluctuations in proteins using a
single-parameter harmonic potential. Fold. Des. 2:173-181.
7. Sotomayor, M. and K. Schulten. 2007. Single-molecule
experiments in vitro and in silico. Science 316:1144-1148.
8. Tajkhorshid, E., A. Aksimentiev, I. Balabin, M. Gao, B.
Isralewitz, J. C. Phillips, F. Zhu, and K. Schulten. 2003.
Large scale simulation of protein mechanics and function.
Adv. Protein Chem. 66:195-247.
9. Isin, B., K. Schulten, E. Tajkhorshid, and I. Bahar. 2008.
Mechanism of Signal Propagation upon Retinal
Isomerization: Insights from Molecular Dynamics
Simulations of Rhodopsin Restrained by Normal Modes.
Biophys. J.
10. Farrens, D. L. and H. G. Khorana. 1995. Structure and
function in rhodopsin. Measurement of the rate of
metarhodopsin II decay by fluorescence spectroscopy. J Biol
Chem 270:5073-6.
11. Lehmann, N., U. Alexiev, and K. Fahmy. 2007. Linkage
between the intramembrane H-bond network around aspartic
acid 83 and the cytosolic environment of helix 8 in
photoactivated rhodopsin. J. Mol Biol 366:1129-1141.
12. Okada, T., Y. Fujiyoshi, M. Silow, J. Navarro, E. M.
Landau, and Y. Shichida. 2002. Functional role of internal
water molecules in rhodopsin revealed by X-ray
crystallography. Proc. Natl. Acad. Sci. U. S. A 99:5982-5987.
53: TOPS++FATCAT: FAST FLEXIBLE
STRUCTURAL ALIGNMENT USING CONSTRAINTS
DERIVED FROM TOPS+ STRINGS MODEL
Mallika Veeramalai (Joint Center for Molecular Modeling,
Burnham Institute for Medical Research, USA), Yuzhen Ye
(School of Informatics, Indiana University, Bloomington,
USA), & Adam Godzik (Joint Center for Molecular
Modeling, Burnham Institute for Medical Research, USA).
TOPS++FATCAT provides FATCAT accuracy and insights into
protein structural changes at a speed comparable to sequence
alignments towards interactive structure similarity
searches.
Protein structure analysis and comparison are major
challenges in structural bioinformatics. Despite the existence
of many tools and algorithms, very few of them have
managed to capture the intuitive understanding of protein
structures developed in structural biology, especially in the
context of rapid database searches. Such intuitions could
help speed up similarity searches and make it easier to
understand the results of such analyses. We developed a
TOPS++FATCAT algorithm that uses an intuitive
description of the proteins’ structures as captured in the
popular TOPS diagrams to limit the search space of the
aligned fragment pairs (AFPs) in the flexible alignment of
protein structures performed by the FATCAT [1] algorithm.
Here we explore constraints obtained from the TOPS+
strings alignment, which identifies topologically equivalent
secondary structure elements (alpha helices, beta strands,
and loops) for this purpose. For benchmarking and
comparison, we have used the PDB40 dataset of 1,901
protein domain pairs (DP) corresponding to SCOP version
1.61 from the ASTRAL database [4]. The
TOPS++FATCAT algorithm is faster than FATCAT by
more than an order of magnitude with a minimal cost in
classification and alignment accuracy. For beta-rich proteins
its accuracy is better than FATCAT, because the TOPS+
strings models [2,3] contain important information about the
parallel and anti-parallel hydrogen-bond patterns between
the beta-strand SSEs (Secondary Structural Elements). The
overall results for all protein classes show that
TOPS++FATCAT performance is only slightly lower (3%–
7% AUC value difference) as compared to FATCAT while
providing a significant, more than 10-fold speedup. We
show that the TOPS++FATCAT errors, rare as they are, can
be clearly linked to oversimplifications of the TOPS
diagrams and can be corrected by the development of more
precise secondary structure element definitions.
TOPS++FATCAT provides FATCAT accuracy and insights
into protein structural changes at a speed comparable to
sequence alignments, opening up a possibility of interactive
protein structure similarity searches.
Figure 1 - The schematic illustration of FATCAT structural
alignment by chaining AFPs in a constrained alignment
region defined by TOPS alignment output. (a) In FATCAT,
two fragments form an AFP (shown as a line in the graph)
according to the criteria (see text). (b) The alignment of
secondary structure elements from TOPS+ comparison is
TOPS++FATCAT: a fast flexible structural alignment
using constraints derived from TOPS+ strings models.
Intuitive topological constraints help to prune the search
space involved in FATCAT comparison process. The
used to define the constrained area for AFP detection, in
which each pair of aligned secondary structure elements
defines an “eligible” block (shown as filled squares). These
blocks may be disconnected, and we need to connect them
with connecting blocks (shown as open squares). (c) We add
a buffer area surrounding the constrained area defined in (b)
(shown as the area enclosed by dashed lines) to get the
constrained alignment region for FATCAT alignment (shown
as the area enclosed by dark lines). (d) Only those AFPs within
the constrained alignment region are used in the dynamic
programming algorithm for chaining.
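Panels (b) to (d) of the figure can be sketched as a boolean mask over residue pairs. The code below is an illustrative approximation, not the TOPS++FATCAT source: the data layout, the function names and the rectangular treatment of the buffer are assumptions. Aligned secondary structure elements give eligible blocks, gaps between consecutive blocks give connecting blocks, every block is widened by a buffer, and only AFPs lying entirely inside the resulting region are kept for chaining.

```python
def constrained_region(sse_pairs, len1, len2, buffer=2):
    """Boolean len1 x len2 mask of residue pairs allowed in the alignment.

    sse_pairs: ordered list of ((start1, end1), (start2, end2)) with inclusive
    residue ranges of aligned secondary structure elements.
    """
    allowed = [[False] * len2 for _ in range(len1)]

    def fill(range1, range2):
        (s1, e1), (s2, e2) = range1, range2
        for i in range(max(0, s1 - buffer), min(len1, e1 + buffer + 1)):
            for j in range(max(0, s2 - buffer), min(len2, e2 + buffer + 1)):
                allowed[i][j] = True

    previous = None
    for range1, range2 in sse_pairs:
        fill(range1, range2)                      # eligible block
        if previous is not None:                  # connecting block
            fill((previous[0][1], range1[0]), (previous[1][1], range2[0]))
        previous = (range1, range2)
    return allowed


def filter_afps(afps, allowed):
    """Keep AFPs (start1, start2, length) that lie fully inside the mask.

    AFPs are assumed to fit within both sequences.
    """
    return [
        (i, j, length)
        for (i, j, length) in afps
        if all(allowed[i + k][j + k] for k in range(length))
    ]


if __name__ == "__main__":
    # Hypothetical SSE alignment between two 20-residue domains.
    sse_alignment = [((0, 4), (0, 4)), ((10, 14), (12, 16))]
    mask = constrained_region(sse_alignment, 20, 20, buffer=1)
    candidates = [(0, 0, 5), (6, 7, 4), (0, 15, 4)]
    print(filter_afps(candidates, mask))
```

Restricting the AFP set in this way shrinks the input to the dynamic programming step, which is the source of the speedup reported in the abstract.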
REFERENCES
1. Ye Y, Godzik A: Flexible structure alignment by chaining
aligned fragment pairs allowing twists. Bioinformatics 2003,
19 Suppl 2:II246-II255.
2. Veeramalai M: A novel method for comparing
topological models of protein structures enhanced with
ligand information. PhD Degree Thesis. Department of Computing
Science: University of Glasgow; 2005.
3. Veeramalai M, Gilbert D: A Novel Method for
Comparing Topological Models of Protein Structures
Enhanced with Ligand Information (Bioinformatics – under review).
4. Chandonia J-M, Walker NS, Lo Conte L, Koehl P, Levitt
M, Brenner SE: ASTRAL compendium enhancements.
Nucleic Acids Research 2002, 30:260-263.
54: CONFORMATIONAL DIVERSITY MODULATES
PROTEIN SEQUENCE DIVERGENCE
We studied how the presence of conformational diversity
constrains protein sequence evolution. We found that in
60% of the cases one of the conformers dominates the
structural constraints on sequence divergence. This also
indicates the importance of structural deformations for
designing new models of protein evolution.
We have used a set of 75 proteins with different conformers
taken from the literature and from the database of
macromolecular movements. For each conformer we have
used the SCPE to obtain a whole set of site-specific
substitution matrices. Using maximum likelihood
calculations performed with the program HYPHY, a set of
homologous proteins and a phylogenetic tree, we studied
how well the different conformers reproduce the substitution
pattern found in the alignment. As a null hypothesis we have
used the JTT model of protein evolution, which does not take
into account protein structure for the derivation of
substitution matrices. For model comparison we have used a
likelihood ratio test, which provides us with a statistical
framework to evaluate how well the models reproduce the
sequence divergence pattern found in the alignments. Also,
using distance matrix comparison between the conformers,
we estimated which of the structures is the “open” and
which is the “closed” form.
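As a rough illustration of the model comparison step, the sketch below turns two log-likelihoods into a likelihood ratio statistic and an approximate p-value. The function name, the example log-likelihoods, and the use of a chi-square reference distribution with one degree of freedom are assumptions made for illustration; the calculations in the study itself were performed with HYPHY.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, p_value) for a likelihood ratio test.

    Assumes a chi-square approximation with ONE degree of freedom,
    which is an illustrative simplification, not the study's setup.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        # The alternative model does not fit better than the null.
        return stat, 1.0
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for one alignment (null = JTT-like model,
# alternative = structure-aware model).
stat, p = likelihood_ratio_test(lnl_null=-5230.4, lnl_alt=-5221.9)
print(f"LRT statistic = {stat:.2f}, p = {p:.2e}")
```

A small p-value would suggest that the structure-aware model explains the alignment better than the null model, under the stated assumptions.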
First of all, we found that in 80% of the proteins, the SCPE
model outperforms the JTT model, which is in good agreement
with previous results. We then compared how well each of
the conformers for a given protein describes the substitution
pattern found in the alignment. This was evaluated using
SCPE runs for each of the conformers. We found that in over
60% of the cases one conformer is a better model than the
other. This is an important result that indicates that, over the
set of conformers studied, there is one that dominates the
constraints on sequence divergence. Moreover, in 60%
of these cases, the “open” form is the one that better
describes the sequence divergence of the homologous
proteins. We were unable to find a clear correlation between
the RMS deviation calculated between conformers and the
maximum likelihood performance. This may be related to the
fact that SCPE could be more sensitive to conformational
changes than the RMS deviation is.
Our results indicate that conformational diversity constrains
protein sequence evolution and also indicate the importance
of protein dynamics or structural deformations for designing
new models of protein evolution. A molecular evolution
model that takes into account a combination of the structural
constraints of a set of conformers for a given protein is able
to describe with superior accuracy the sequence
divergence within a protein family.
It is well established that the conservation of protein
structure during evolution modulates sequence divergence.
However, recent evidence supports the view that the native
state of any protein is better described as an ensemble of
protein conformations. Here we study how the different
conformations of a given protein constrain the substitution
pattern observed in the sequence.
To study how the conservation of protein structure
constrains sequence divergence, we have developed the
Structurally Constrained Protein Evolution (SCPE) model.
The SCPE simulates sequence divergence with special
consideration to the conservation of protein structure. These
simulations allow the derivation of site-specific substitution
matrices that we found outperform protein evolution models
that do not consider protein structure explicitly.
55: RENAMING DIASTEREOTOPIC ATOMS FOR
CONSISTENT PDB-WIDE ANALYSES
Christopher Bottoms & Dong Xu (University of
Missouri-Columbia, USA)
Biological chemistry is very stereospecific. However,
diastereotopic atoms of small molecules are often
given names that
Ezequiel Iván Juritz, Sebastián Fernández Alberti and
Gustavo Parisi (Quilmes National University, Argentina)
method of Cieplak and Wisniewski [13] for spatial
comparisons. However, instead of using CIP priorities, we
take advantage of the inherent “chirality” of atom names.
This allows for use of idealized ligands in any conformation
for naming diastereotopic atoms of query ligands in any
other, or the same, conformation. It is also less
computationally expensive than attempting to superpose an
ideal and a query ligand.
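One way to picture the comparison described above is with a signed volume (determinant) test: the sign of the determinant built from the two diastereotopic atoms and a third neighbor around the prochiral center does not change under rotation, so an idealized ligand in any conformation can serve as the reference. The sketch below is an illustration under that assumption, not the authors' implementation; the atom names, coordinates, and function names are hypothetical.

```python
def signed_volume(center, a, b, c):
    """Determinant of the vectors center->a, center->b, center->c."""
    ax, ay, az = (a[i] - center[i] for i in range(3))
    bx, by, bz = (b[i] - center[i] for i in range(3))
    cx, cy, cz = (c[i] - center[i] for i in range(3))
    return (ax * (by * cz - bz * cy)
            - ay * (bx * cz - bz * cx)
            + az * (bx * cy - by * cx))


def handedness(coords, center, pair, third):
    """+1 or -1 for the arrangement pair[0], pair[1], third around center."""
    vol = signed_volume(coords[center], coords[pair[0]],
                        coords[pair[1]], coords[third])
    return 1 if vol > 0 else -1


def conform_names(query, ideal, center, pair, third):
    """Return query coordinates with the pair's names swapped if needed."""
    fixed = dict(query)
    if handedness(query, center, pair, third) != handedness(ideal, center, pair, third):
        first, second = pair
        fixed[first], fixed[second] = query[second], query[first]
    return fixed


# Hypothetical atom names and coordinates, for illustration only.
ideal = {"P": (0, 0, 0), "O1": (1, 0, 0), "O2": (0, 1, 0), "O3": (0, 0, 1)}
# Same group rotated 90 degrees about z, with O1 and O2 mislabeled.
query = {"P": (0, 0, 0), "O1": (-1, 0, 0), "O2": (0, 1, 0), "O3": (0, 0, 1)}

fixed = conform_names(query, ideal, "P", ("O1", "O2"), "O3")
print(fixed["O1"], fixed["O2"])
```

Because only the sign of a determinant is compared, no superposition of the ideal and query ligands is needed in this sketch.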
REFERENCES
1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN,
Weissig H, Shindyalov IN, Bourne PE: The Protein Data
Bank (http://www.rcsb.org/). Nucleic Acids Res 2000,
28(1):235-242.
2. Faig M, Bianchet MA, Winski S, Hargreaves R, Moody
CJ, Hudnott AR, Ross D, Amzel LM: Structure-based
development of anticancer drugs: complexes of
NAD(P)H:quinone oxidoreductase 1 with chemotherapeutic
quinones. Structure 2001, 9(8):659-667.
3. Bressi JC, Verlinde CL, Aronov AM, Shaw ML, Shin SS,
Nguyen LN, Suresh S, Buckner FS, Van Voorhis WC, Kuntz
ID, Hol WG, Gelb MH: Adenosine analogues as selective
inhibitors of glyceraldehyde-3-phosphate dehydrogenase of
Trypanosomatidae via structure-based drug design. J Med
Chem 2001, 44(13):2080-2093.
4. Newsletter 1984. European Journal of Biochemistry 1984,
138(1):5-7.
5. Eckstein F: Nucleoside phosphorothioates. Annu Rev
Biochem 1985, 54:367-402.
6. Cech TR, Herschlag D, Piccirilli JA, Pyle AM: RNA
catalysis by a group I ribozyme. Developing a model for
transition state stabilization. J Biol Chem 1992,
267(25):17479-17482.
7. Padgett RA, Podar M, Boulanger SC, Perlman PS: The
stereochemical course of group II intron self-splicing.
Science (New York, NY) 1994, 266(5191):1685-1688.
8. Domanico PL, Rahil JF, Benkovic SJ: Unambiguous
stereochemical course of rabbit liver fructose bisphosphatase
hydrolysis. Biochemistry 1985, 24(7):1623-1628.
9. Tsai MD: Use of phosphorus-31 nuclear magnetic
resonance to distinguish bridge and nonbridge oxygens of
oxygen-17-enriched nucleoside triphosphates.
Stereochemistry of acetate activation by acetyl coenzyme A
synthetase. Biochemistry 1979, 18(8):1468-1472.
10. Schultze P, Feigon J: Chirality errors in nucleic acid
structures. Nature 1997, 387(6634):668.
11. Waszkowycz B: Towards improving compound
selection in structure-based virtual screening. Drug Discov
Today 2008, 13(5-6):219-226.
12. Good A: Structure-based virtual screening protocols.
Curr Opin Drug Discov Devel 2001, 4(3):301-307.
13. Cieplak T, Wisniewski J: A new effective algorithm for
the unambiguous identification of the stereochemical
characteristics of compounds during their registration in
databases. Molecules 2001, 6:915-926.
14. Bottoms C, Xu D: Wanted: Unique names for unique
atom positions. PDB-wide analysis of diastereotopic atom
do not uniquely distinguish them from each other. We
describe a tool for renaming their diastereotopic atoms
based on idealized ligands.
Often accompanying the macromolecules deposited in the
Protein Data Bank (PDB) [1] are smaller molecules of
biological importance. Some of these are energy-carrying
cofactors, such as ATP, coenzyme A, and nicotinamide
adenine dinucleotide (NAD). Some analogs of these
molecules are either drugs or can be used in drug design [2,
3]. Like other biologically relevant molecules, many of
these small molecules contain chiral or prochiral centers. An
atom is a chiral center if four different chemical groups are
attached to it. A chiral configuration can be designated R or
S, depending on the arrangement of the attached groups. If,
however, two of these groups are identical, then the center
atom is prochiral, meaning that it would become chiral if
either of the identical groups were replaced by a unique
group. These two groups are called diastereotopic, i.e., if
either were replaced with a unique group, the molecule
would become one or another diastereomer. Within a pair of
diastereotopic atoms, one is designated pro-R and the other
pro-S, indicating the configuration of the chiral center that
would result from replacing that atom with a group that has
higher priority than its otherwise identical partner.
The pro-S and pro-R oxygen atoms of nucleic acid strands
are named “OP1” and “OP2”, respectively [4]. Many
enzymes treat the pro-R and pro-S oxygen atoms of DNA
and RNA differently [5]. These diastereotopic oxygen atoms
are also treated differently in RNA-intron splicing [6, 7].
Small diphosphate-containing molecules also participate in
enzymatic reactions in which the distinction between
diastereotopic atoms or groups is important [5, 8, 9].
Unfortunately, many of these diastereotopic atoms do not
have standardized names (see the figure, which shows
diphosphate groups from two different NAD molecules of
the PDB file 2OHX). Consistent naming of diastereotopic
atoms is necessary when performing all-atom superpositioning
or all-atom root mean square deviation (RMSD) calculations
[10]. It is also necessary for data mining in the PDB, e.g.,
structure-based virtual screening for drug candidates [11,
12]. Using the determinant algorithm of Cieplak and
Wisniewski [13], we conducted a systematic PDB-wide
analysis on the diastereotopic oxygen atom names of small
molecules containing diphosphate [14].
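To make the RMSD point concrete, the sketch below pairs atoms by name and computes an RMSD without any superposition. It is a minimal illustration with hypothetical atom names and coordinates: two copies of the same geometry give a nonzero RMSD as soon as the names of a diastereotopic pair are swapped.

```python
import math


def rmsd_by_name(mol_a, mol_b):
    """RMSD over atoms paired by name; no superposition is performed."""
    names = sorted(set(mol_a) & set(mol_b))
    if not names:
        raise ValueError("no shared atom names")
    total = sum(math.dist(mol_a[n], mol_b[n]) ** 2 for n in names)
    return math.sqrt(total / len(names))


# Hypothetical coordinates: identical geometry, O1/O2 names swapped.
reference = {"P": (0.0, 0.0, 0.0), "O1": (1.0, 0.0, 0.0), "O2": (0.0, 1.0, 0.0)}
swapped = {"P": (0.0, 0.0, 0.0), "O1": (0.0, 1.0, 0.0), "O2": (1.0, 0.0, 0.0)}

print(rmsd_by_name(reference, reference))  # identical names and geometry
print(rmsd_by_name(reference, swapped))    # inflated purely by naming
```

The second value is an artifact of naming alone, which is the kind of spurious result that consistent names are meant to prevent.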
The lack of standardized naming conventions for
diastereotopic atoms of small molecules has left the ad hoc
names assigned to many of these atoms non-unique, which
may create problems in data-mining of the PDB. Therefore,
researchers designing PDB-wide analyses need to consider
this issue to avoid spurious results. We previously provided
a tool for renaming diastereotopic oxygen atoms of
diphosphate-containing molecules
(http://digbio.missouri.edu/ddan/DDAN.htm), but at this
conference we present a more general tool. This tool
compares the naming conventions of idealized ligands and
query ligands. Names of diastereotopic atom pairs in query
ligands are swapped, as needed, to make them conform to
the idealized ligands. Like our previous tool, this uses the
molecular bonds, and electrostatic potentials of known eh1-like
motifs. For example, the experimentally determined
structure of an eh1 peptide bound to the WD domain and a
previous study on eh1-like motifs were used.
Models of the WD domain interacting with eh1 motifs were
generated using the DeepView program and the SWISS-MODEL
server. SWISS-MODEL used specialized software and up-to-date
databases to build models of putative eh1 sequences
and evaluate their quality. Information gained on bond
lengths, binding sites, local shape complementarity,
interaction potentials, energies, and models' stabilities was
used to devise a scoring function. Theoretical eh1-like
sequences were assessed and ranked using the scoring
function. This was used to predict which putative eh1-like
motifs likely had a potential repressive function.
Motif recognition techniques and bioinformatics searches
for experimental data within the NCBI protein databases
were employed to verify the predictions. The results showed
that the scoring function captured a general correspondence
between putative motifs' characteristics and the likelihood
that they were found in transcription factors of various
species. Conversely, the scoring function was very reliable
in predicting which putative sequences were not found in
nature.
This report discusses findings on inter-motif bonds, charge
and polarity of residues, the secondary structure of eh1-like
motifs, bonds between the motifs and the WD domain, and
the like. The study indicated that mutations in motifs' residues
could produce only limited changes in the tertiary structures
and still preserve motifs' functionality.
This study identified several new eh1-like motifs. NCBI
Blast searches confirmed that these motifs were conserved
in transcription factors of several species, implying that they
likely had transcriptional roles. The results of this study may
be used to predict other regulatory motifs. Given the
importance of transcriptional regulation, this report on the
prediction and evaluation of new eh1-like motifs will
facilitate further studies of transcriptional and regulatory
mechanisms.
names of small molecules containing diphosphate. BMC
Bioinformatics 2008, 9(Suppl 9):S16.
56: PREDICTING NEW ENGRAILED HOMOLOGY
MOTIFS FROM STRUCTURAL AND ENERGY
STUDIES OF THE WD PROPELLER DOMAIN
BINDINGS TO KNOWN MOTIFS
Danielle S. Dalafave (The College of New Jersey, USA)
Data mining and computational techniques were used
to predict new eh1-like motifs and evaluate their
structure, stability, and functionality. To the best of the
author's knowledge,
this is the first report on using structural and energetic
considerations to predict eh1-like motifs that bind to WD
domains of Gro/TLE transcriptional corepressors.
Data mining and computational techniques were used to
predict new engrailed homology-1 (eh1)-like motifs and
evaluate their 3D structures, amino acid sequences, stability,
and possible functionality. Eh1-like motifs bind to WD
domains of the Gro/TLE corepressors to provide
transcriptional repressive functions. To the best of the
author's knowledge, this is the first study that uses a
combination of compositional, structural, and energetic
considerations to predict new eh1 motifs that bind to WD
domains.
Reliable methods that predict molecules involved in
proteins' binding would greatly enhance our understanding
of proteins' capacities for selective recognition and could
potentially lead to new disease intervention methods. At
times, experimental studies may be difficult to perform and
computational methods need to be employed.
Transcription factors are proteins with important roles in
controlling the transcription of genetic information from
DNA to RNA. The Gro/TLE protein family performs its gene
repression functions via transcription factors, rather than
through direct interactions with DNA. Gro/TLE can bind to
diverse transcription factors, some of which belong to
systems whose abnormal activities may lead to cancers. The
WD domain is a highly conserved region of Gro/TLE. X-ray
studies showed that the WD domain forms a beta-propeller,
which recognizes specific transcription factors.
Experiments have suggested that eh1-like motifs bind to the
pore region of the WD propeller to provide their repressive
function. A consensus motif sequence is FSBXXBBX,
where F = Phe, S = Ser, B = branched hydrophobic amino
acid residue, and X = nonpolar or charged residue. When Y
(Tyr) or H (His) is substituted in the first position, the new
motif also binds Gro/TLE corepressors.
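The consensus above can be written as a simple pattern for scanning sequences. The sketch below is only an illustration: the residue sets chosen for B (Ile, Leu, Val) and for X (a set of nonpolar plus charged residues) are assumptions, since the text does not list them, and the test sequences are made up.

```python
import re

# Assumed residue classes; the abstract does not enumerate them.
BRANCHED = "ILV"                       # B: branched hydrophobic residues
NONPOLAR_OR_CHARGED = "AGILMFPVWDEKR"  # X: nonpolar or charged residues

# FSBXXBBX, with Tyr or His also allowed in the first position.
EH1_PATTERN = re.compile(
    f"[FYH]S[{BRANCHED}][{NONPOLAR_OR_CHARGED}]{{2}}"
    f"[{BRANCHED}]{{2}}[{NONPOLAR_OR_CHARGED}]"
)


def find_eh1_like(sequence):
    """Return (start, motif) pairs for non-overlapping pattern matches."""
    return [(m.start(), m.group())
            for m in EH1_PATTERN.finditer(sequence.upper())]


# Made-up sequence containing one match at position 3.
print(find_eh1_like("MKTFSIDELLKQQ"))
```

A pattern match of this kind only flags candidates; in the study, candidates were further assessed with structural models and a scoring function.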
As a first step in this study, available experimental and
theoretical information was analyzed to gain insights into
structural and sequence constraints, inter- and intra-
57: USE OF EVOLUTIONARY INFORMATION IN
MODEL QUALITY EVALUATION FOR PROTEIN
STRUCTURE PREDICTION
Nicolas Palopoli (Universidad Nacional de Quilmes,
Argentina), Diego Gomez Casati (IIB-INTECH, Argentina)
have selected a number of decoys publicly available on the
Web (5,6,7). They have been built by comparative modeling
from a template structure, which was taken as the native
structure and served as the reference for our comparisons. We
ran the SCPE for the template and decoy structures and
assessed the results through different scoring functions,
including global maximum likelihood comparisons,
estimation under different cutoffs of the number of
structurally constrained sites (defined as sites where the
Z-score of the log likelihood for the site against the
distribution of log likelihoods for the same site exceeds the
desired cutoff) and partial sum of log likelihoods for
structurally constrained sites, which has proven to be the
most successful measure. When comparing the results,
without considering any particular scoring scheme, we have
found that the native structure is ranked among the top three
decoys in 74% of the cases, while it is selected as the best
structure in 51% of them.
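The partial-sum score described above can be sketched as follows. This is a minimal illustration, assuming that each site comes with a sample of log likelihoods forming its reference distribution; how that distribution is built is not specified in the text, and all names and numbers below are hypothetical.

```python
from statistics import mean, stdev


def constrained_sites(site_lnl, reference_lnl, cutoff=2.0):
    """Indices of sites whose log likelihood Z-score exceeds the cutoff.

    site_lnl[i]      : log likelihood of site i under the model being scored
    reference_lnl[i] : sample of log likelihoods for the same site
                       (hypothetical reference distribution)
    """
    selected = []
    for i, (value, sample) in enumerate(zip(site_lnl, reference_lnl)):
        spread = stdev(sample)
        if spread == 0:
            continue  # Z-score undefined for a constant reference sample
        z = (value - mean(sample)) / spread
        if z > cutoff:
            selected.append(i)
    return selected


def partial_sum(site_lnl, sites):
    """Sum of log likelihoods restricted to the selected sites."""
    return sum(site_lnl[i] for i in sites)


# Made-up values for three sites.
site_lnl = [-1.0, -3.0, -0.5]
reference_lnl = [
    [-3.0, -3.2, -2.8, -3.1],
    [-3.0, -3.1, -2.9, -3.0],
    [-2.0, -2.2, -1.8, -2.1],
]
sites = constrained_sites(site_lnl, reference_lnl, cutoff=2.0)
print(sites, partial_sum(site_lnl, sites))
```

Decoys could then be ranked by this partial sum, with the cutoff treated as a tunable parameter.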
Our results indicate that the use of evolutionary information
could indeed aid in the discrimination of native structures
when combined with structural information. The SCPE has
been shown to be very promising as a tool for the validation
step in protein structure prediction. We still need to address
some important issues before our method comes into
common use. We found that SCPE works better with
structures longer than a hundred residues, where the number
of structurally constrained residues is statistically significant.
We also depend on the availability of adequate
structural alignments (in terms of the number of sequences
they comprise). We are currently working on a unique
scoring function which would allow us to rank decoy
structures in the absence of a reference native structure.
REFERENCES:
(1) Tramontano A. An account of the Seventh Meeting of
the Worldwide Critical Assessment of Techniques for
Protein Structure Prediction. FEBS Journal 2007,
274(7):1651-1654.
(2) Tan CW. Using neural networks and evolutionary
information in decoy discrimination for protein tertiary
structure prediction. BMC Bioinformatics 2008, Feb
11;9:94.
(3) Parisi G, Echave J. Generality of the Structurally
Constrained Protein Evolution model: assessment on
representatives of the four main fold classes. Gene 2005,
345(1):45-53.
(4) Sander C, Schneider R. The HSSP database of protein
structure-sequence alignments. Nucleic Acids Res. 1994
Sep;22(17):3597-9.
(5) Samudrala R, Levitt M. Decoys 'R' Us: a database of
incorrect conformations to improve protein structure
prediction. Protein Sci. 2000 Jul;9(7):1399-401.
(6) David Baker's Lab, http://www.bakerlab.org
(7) CASP6,
http://www.predictioncenter.org/casp6/Casp6.html
58: COMPUTATIONAL DISCOVERY OF SMALL
MOLECULAR WEIGHT PROTEIN INTERACTION
INHIBITORS
Lidio Meireles (Department of Computational Biology,
University of Pittsburgh, USA), Alexander Doemling
& Gustavo Parisi (Universidad Nacional de Quilmes,
Argentina)
We present a novel method for validation of protein
models using evolutionary information based on our
SCPE program. By testing it with a set of publicly
available decoys we found that our method is suitable for
discriminating among models, thus being useful for
protein structure prediction.
It is widely known that, while the primary sequence of a
protein drives the adoption of a characteristic tertiary
structure, it is not as conserved as its three-dimensional
structure, which mostly accounts for the function of the
protein. Thus, knowledge of the three-dimensional structure
of a protein can often be very helpful for understanding its
biological activity. When the structure has not been
experimentally solved, reliable computational methods
capable of predicting the structure of a protein from its
amino acid sequence become extremely helpful. Different
techniques are suitable for this task, such as comparative
modeling, fold recognition or ab initio predictions, each of
them displaying its own strengths and limitations. No matter
which is taken, a critical stage when predicting protein
structure involves discriminating among the proposed
structural models or decoys (1). Common approaches
involve energy, statistical potentials describing interactions
among atoms, and structural comparisons and clustering
between the proposed decoys. Besides, many model quality
assessment programs provide some sort of a scoring
function capable of ranking decoys according to different
structural features. Though the inclusion of evolutionary
information has proved to be helpful in decoy discrimination
(2), it has not been extensively and explicitly used in
prediction methods.
Here we present a novel method for decoy discrimination
and model selection based on structural features combined
with evolutionary information. It is based on the Structurally
Constrained Protein Evolution (SCPE) model program
developed in our group (3). The SCPE algorithm simulates
protein evolution by introducing random mutations into the
evolving sequences and selecting them against too much
structural perturbation. By running SCPE for different
decoys we could obtain decoy-specific, site-specific sets of
substitution matrices; they represent the evolution of sites
under the constraints imposed by each decoy structure. The
models could then be compared through their set of
matrices, by estimating the likelihood of the evolutionary
model for a given set of sequences and a fixed topology
(both derived from HSSP database (4) of structure-based
sequence alignments). Under our criteria, the best structural
model would be the one that better explains the sequence
divergence in the set of homologous sequences. For the
SCPE to be able to discriminate between correct and wrong
models, it needs to take proper account of evolutionary
information while being sensitive to structural dissimilarity.
We would expect to develop a proper ranking scheme which
would lead us to rank native-like structures among the top
results, while leaving quite dissimilar decoys (at higher
RMSD) far behind. As a dataset for testing our method we
determined by comparing the SASA of the residues of each
protein in the free and in the bound state.
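The anchor identification step can be sketched as a ranking by buried surface area. The example assumes that per-residue SASA values for the free and bound states have already been computed with some external tool; the residue labels, numbers, and function name are hypothetical.

```python
def rank_anchor_candidates(sasa_free, sasa_bound, top_n=3):
    """Rank residues by SASA buried upon complexation (free minus bound)."""
    buried = {res: sasa_free[res] - sasa_bound[res] for res in sasa_free}
    ranked = sorted(buried.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up per-residue SASA values in square angstroms.
sasa_free = {"TRP23": 210.0, "LEU26": 150.0, "PHE19": 180.0, "SER20": 60.0}
sasa_bound = {"TRP23": 15.0, "LEU26": 40.0, "PHE19": 30.0, "SER20": 55.0}

for residue, delta in rank_anchor_candidates(sasa_free, sasa_bound, top_n=2):
    print(residue, delta)
```

Residues at the top of this list would be the candidate anchors passed on to the library generation step.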
Virtual Library Generation
We generate a library of small molecular weight compounds
specifically designed to mimic the chemistry and structure
of deeply buried anchor residues identified in the previous
step. To generate the compounds, we use a fragment-based
approach associated with multicomponent reaction (MCR)
scaffolds. MCRs are convergent and efficient ways to access
and generate a large and diverse chemical space using only
one reaction step (“one-pot”) [4]. By specifying basic
molecular scaffolds and small molecular fragments,
including anchor analog fragments, compounds are
synthesized in silico by the software ChemAxon Reactor [5].
Virtual Screening
Virtual compounds incorporating anchor analogs are pre-docked
by fitting the anchor analog to the anchor residue
provided by the protein-protein complex structure,
significantly simplifying and expediting the virtual pipeline
compared to de novo small molecule docking. Following
anchor fitting, the compounds are energy minimized in the
context of the acceptor protein and the top compounds are
predicted from the set of best ranked structures.
Our methodology is innovative and includes several
benefits. For example, our virtual library is not restricted to
any specific database or commercially available compounds,
but instead it is constructed on demand biased by the
specific protein target and the anchors revealed by the
protein-protein complex. Moreover, because the library is
constructed from multicomponent reaction scaffolds, the
compounds are straightforward and efficient to synthesize
using standard protocols. Docking is extremely fast, on the
order of a second per compound, as it merely involves
anchor fitting followed by energy minimization. However,
perhaps most important is the fact that our compounds are
designed to mimic the chemistry and structure of anchors of
PPIs that were favored by nature. Since anchors bury the
most SASA upon binding, their conformation is critical not
only for binding in vivo but also in computational docking
methods.
REFERENCES
[1] Wells JA, McClendon CL. Reaching for high-hanging
fruit in drug discovery at protein-protein interfaces. Nature
2007;450(7172):1001-1009.
[2] Rajamani D, Thiel S, Vajda S, Camacho CJ. Anchor
residues in protein-protein interactions. Proc Natl Acad Sci
U S A 2004;101(31):11287-11292.
[3] Clackson T, Wells JA. A hot spot of binding energy in a
hormone-receptor interface. Science 1995;267(5196):383-386.
[4] Doemling A. Recent progress in isocyanide-based
multicomponent reaction chemistry. Chemical Reviews
2006;106:17.
[5] György Pirok, Nóra Máté, Jenő Varga, József Szegezdi,
Miklós Vargyas, Szilárd Dóránt, and Ferenc Csizmadia.
Making "real" molecules in virtual space. J. Chem. Inf.
Model. 2006; 46, 563-568.
(Departments of Pharmacy and Chemistry, University of
Pittsburgh, USA) & Carlos Camacho (Department of
Computational Biology, University of Pittsburgh, USA).
Protein-protein interactions have proven to be difficult
targets for drug discovery. To face this challenge, we
propose a computational pipeline involving novel and
chemically accessible target-specific libraries, which by
design include sidechain analogs that mimic anchor
residues from protein-protein complex structures.
Protein-protein interactions (PPIs) constitute an emerging
class of targets for pharmaceutical intervention with the
PDB providing a highly valuable source for structural
information on protein interactions [1]. However, the
diversity of PPIs does not fit well in the current drug
discovery paradigm that focuses almost exclusively on
screening large historical collections of (commercially
available) small molecular weight compounds. Despite
computational limitations on the sampling of chemical space
and scoring of protein-small molecule docked
conformations, in silico screening methods continue to be
developed and improved as credible and complementary
alternatives to high-throughput biochemical compound
screening.
In order to overcome the aforementioned limitations, we
have developed a virtual screening technology of virtual
libraries that by design have a built-in amino acid hot spot,
or “anchor”, burying deep into acceptor proteins. Key to our
methodology is the concept of anchor residues or hot spots
which have been shown to play an important role in the
early stages of molecular recognition [2, 3]. Moreover, there
is good evidence that in many cases the anchoring grooves
are relatively unchanged upon complexation [2], thus
providing a uniquely well-characterized starting point for
docking a small molecule. Our computational pipeline for
discovery of small molecular weight inhibitors of protein
interactions starts with the PDB of a protein-protein
complex and ends with a ranked list of compounds likely to
inhibit the underlying protein interaction. The method can
be described in three steps, as follows:
Anchor Identification
Anchor residues are reliably identified as those residues
undergoing the largest change in solvent-accessible surface
area (SASA) upon complexation. This can be quickly
transmembrane dimer at a one angstrom R.M.S.D. without
the aid of experimental data. We have applied it to obtain a
near-atomistic structural model of the betaglycan
transmembrane homodimer, a type III TGF-beta receptor
family member. The top-ranking model is in excellent
agreement with our mutational data obtained with
TOXCAT, an in vivo assay for association of helices in the
E. coli membrane. The positions at which mutations most
disrupt dimerization all map to the interaction interface, and
the effects of experimental and in silico mutagenesis on
association equilibria are in very good agreement. A second
round of mutagenesis, performed on a selection of mutations
that were predicted to be either tolerable or disruptive,
further confirms the model. These included a
complementary double mutant that was successfully
predicted to rescue a disruptive single mutation according to
the structural model. Hence, as this case study demonstrates,
our novel in silico modeling protocol assists in
understanding extensive mutational and biophysical data on
this important transmembrane protein. In a complementary
fashion, the modeling assists in focusing the experimental
efforts on the key loci in this protein and can also be utilized
for designing structures with desired alterations.
REFERENCES
1 Senes A, Engel DE, DeGrado WF. "Folding of helical
membrane proteins: the role of polar, GxxxG-like and
proline motifs." Curr Opin Struct Biol. 2004 14(4), 465-79
2 Senes A, Ubarretxena-Belandia I, Engelman DM. "The
Ca–H···O hydrogen bond: a determinant of stability and
specificity in transmembrane helix interactions." Proc Natl
Acad Sci U S A. 2001 98(16), 9056-61
3 Senes A, Gerstein M, Engelman DM. "Statistical analysis
of amino acid patterns in transmembrane helices: the GxxxG
motif occurs frequently and in association with beta-branched
residues at neighboring positions." J Mol Biol.
2000 296(3), 921-36
4 Walters RF, DeGrado WF “Helix-packing motifs in
membrane proteins” PNAS 2006, 103 13658-63.
61: COMPARING SEQUENCE AND STRUCTURE-BASED
CLASSIFIERS FOR PREDICTING RNA
BINDING SITES IN SPECIFIC FAMILIES OF RNA
BINDING PROTEINS
Michael Terribilini, Cornelia Caragea, Deepak Reyon, Ben
Lewis, Li Xue, Jeffry Sander, Jae-Hyung Lee, Robert L
Jernigan, Vasant Honavar, Krishna Rajan & Drena Dobbs
(Iowa State University, USA).
We evaluate machine learning classifiers for predicting
RNA binding residues in proteins, using either
sequence-based information only, or a combination of
sequence- and structure-derived information, and quantitate
the relative contributions of these different input types to
59: A FRAGMENT BASED METHOD FOR THE
PREDICTION OF ATOMISTIC MODELS OF
TRANSMEMBRANE HELIX-HELIX INTERACTION
Alessandro Senes1,2, Dan W Kulp1, David T. Moore1 &
William F DeGrado1.
(1Department of Biophysics and Biochemistry,
University of Pennsylvania, USA, 2present address:
University of Wisconsin, Madison, USA)
We present a general method for modeling transmembrane
proteins based on combinatorial fragment-based libraries of
natural proteins. It can be applied
for ab initio modeling as well as utilize experimental
constraints. A derived model of the TGF-beta receptor
betaglycan transmembrane dimer supports our extensive
mutational data and biophysical characterization.
Membrane proteins are a large and medically important
class of proteins. However, their structural characterization
by X-ray crystallography, NMR and other biophysical
techniques is generally challenging and thus they are
significantly under-represented in the structural database.
We present a computational method that can provide
atomistic structural predictions of interacting
transmembrane helices. The factors that stabilize membrane
protein folding and association are different from those that
apply in solution, and therefore dedicated computational
methods need to be developed. The largest group of
membrane proteins has a helical bundle topology, and the
association of the hydrophobic helices is driven by detailed
complementary packing, hydrogen bonding (1), often
including networks of weak Ca-H...O hydrogen bonds (2),
which are favored by sequence motifs comprising a patch of
small interfacial residues (3). The helical pairs often adopt a
number of frequent interhelical geometries, as observed in
the available crystal structures (4). Our method is based on a
large fragment-based combinatorial library of natural
backbones that is biased to sample these common
interaction motifs. The method consists of two phases. First,
the most likely structural candidates for the primary
sequence are selected from the comprehensive pools of
backbone templates using a sequence-based score derived
from sequence alignment statistics and structural constraints,
e.g. sterics and hydrogen bonding potential. This stage can
also incorporate experimentally derived information, such as
mutagenesis data. Second, highly detailed three-dimensional
models are assembled from the candidate backbones by
placing the side chains from large designated rotamer
libraries that have been pre-optimized for each individual
backbone. The ensemble of the most energetically favorable
models is screened for compatibility with any available
experimental evidence, and used to guide further
experimental validation.
The method is very rapid and can produce very detailed
complementary packing. For example, the method
reproduced the structure of the glycophorin A
evaluated ensemble classifiers that use combinations of the
input data types described above.
Results
Our results, partially summarized in Table 1 below, indicate
that when classifiers are evaluated on the basis of AUC for
ROC curves, the best “overall” performance is obtained
using ensemble classifiers that use amino acid sequence
information in combination with either: i) PSSMs (derived
from sequence homologs identified using BLAST); or ii)
spatial neighbor information (extracted from PDB structures
of proteins). Results obtained using “custom” classifiers
trained to predict RNA-binding residues in specific families
of RNA binding proteins, comparisons of our results with
those published by others, and comparisons of our
predictions with available experimental data for several
clinically important ribonucleoprotein complexes will also
be presented.
Table 1: Comparison of Classifiers for RNA-binding Site
Prediction
Classifier - AUC for ROC
Sequence-based - 0.74
Structure-based - 0.77
PSSM-based - 0.80
Ensemble - 0.81
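For readers unfamiliar with the metric in Table 1, the AUC of a ROC curve equals the probability that a randomly chosen binding residue receives a higher score than a randomly chosen non-binding residue. The sketch below computes it directly from that definition; the labels and scores are made up and are unrelated to the values in the table.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up per-residue labels (1 = RNA-binding) and classifier scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.8]
print(round(roc_auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of binding from non-binding residues.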
REFERENCES
1) Terribilini, M., Lee, J.H., Yan, C., Jernigan, R., Honavar,
V., and Dobbs, D. 2006. Prediction of RNA binding sites in
proteins based on amino acid sequence. RNA 12: 1450-1462.
2) Terribilini, M., Lee, J.H., Yan, C., Jernigan, R., Carpenter,
S., Honavar, V., and Dobbs, D. 2006. Identifying interaction
sites in “recalcitrant” proteins: predicted protein and RNA
binding sites in Rev proteins of HIV-1 and EIAV agree with
experimental data. Proc. Pac. Symp. Biocomput.
62: R-ALIGN: A ROBUST STATISTICS BASED
SUPERPOSITION ALGORITHM FOR PROTEINS
Chakra Chennubhotla and Ivet Bahar (Department of
Computational Biology, University of Pittsburgh, USA)
overall prediction performance. We also present novel
classifiers optimized for specific families of RNA binding
proteins.
Introduction
Protein-RNA interactions play critical roles in a wide range
of biological processes. Previously, we developed a machine
learning approach for predicting which amino acids in an
RNA-binding protein mediate protein-RNA interactions,
using only the amino acid sequence of the protein as input
(http://bindr.gdcb.iastate.edu/RNABindR/) (1,2). Here we
report an evaluation of the relative contributions of
sequence, structural features, and evolutionary information
to performance of algorithms for predicting RNA-binding
residues in proteins. In this study, we train and test multiple
classifiers using several benchmark datasets, including a
non-redundant dataset of 181 RNA-binding polypeptide
chains with <30% sequence identity (RB181), and “custom”
datasets comprising sets of related RNA-binding proteins.
We systematically compare results obtained using simple
classifiers that use only one type of information as input
(e.g., Naïve Bayes classifier, using only amino acid
sequence as input) with results obtained using ensemble
classifiers that exploit specific combinations of input
information (e.g., an ensemble of Naïve Bayes classifiers
that use the amino acid sequence, information from
sequence homologs and/or the identities of spatial neighbors
in known structures as input). We also attempt to generate
“custom” classifiers for predicting RNA binding sites in
specific families of RNA binding proteins (i.e., those
sharing similar sequences or structures).
Methods
Interfaces from known protein-RNA complexes in the PDB
were extracted to generate a non-redundant set of 181
RNA-binding protein chains (RB181). The input to the
sequence-based classifier was a window of amino acid
identities for contiguous residues in the protein sequence. A
Naïve Bayes classifier was trained using leave-one-out cross
validation: one sequence was chosen as the test case and all
other proteins in the dataset were used as the training set.
This procedure was repeated until every protein had been
used as the test case. The input to the structure-based
classifier was a window of amino acid identities for spatial
neighbors within the protein structure. To generate the input,
we calculated the distance between each pair of residues in
the structure. The window for each residue was built from
the amino acid identities of the nearest n neighboring
residues. A Naïve Bayes classifier was then trained and
tested using the same leave-one-out cross validation
procedure as for the sequence-based classifier. The input to
the PSSM-based classifier was a window of PSSM vectors
for residues contiguous in the protein sequence. PSSMs
were generated using PSI-BLAST against the NCBI nr
database. A support vector machine classifier was then
trained and tested using ten-fold cross validation. The
protein sequences were split into ten disjoint sets; for each
round of cross validation, one set was used as the test set
and the other nine sets were used for training. We also
We present R-Align, a
robust statistics based superposition algorithm for
finding 3D similarities in protein structures over a
hierarchy of scales, from global to local. R-Align (1)
distinguishes core residues from flexible ones; (2)
identifies rigidly moving domains along with linker
regions and (3) provides a metric for ranking
similarities.
Problem
To visualize and understand structural variation in flexible
proteins, the first step is to superimpose the two
robust than measuring the standard-deviation). Then, noting
that the median of the absolute values of samples from a
Normal distribution is roughly 2/3 of the standard deviation
of a Normal distribution, we set s = 1.5*median(absolute-value-of-errors).
4 Compute weights for each error term using the weight
function W(e; s).
Weigh the corresponding terms in the Kabsch least-square
estimation algorithm. Solve the weighted least-squares error
to find the rotation and translation parameters for
superposition. The weights start to emphasize core regions
that are structurally more stable than flexible regions.
5 Repeat until convergence. Given a converged solution,
decide if there is sufficient support from the data to accept
the superposition (e.g., the sum of weights has to be greater
than a threshold). Discard the solution if there is insufficient
support.
6 For each verified solution, separate inliers from outliers. In
particular, the scale parameter s affects the point at which
the influence of the outliers begins to decrease. By
examining the influence function Psi(e; s) we can deduce
that outlier rejection begins where the second derivative of
Rho(e; s) is zero. This means, an error e that is greater than
[s/sqrt(3)] has a reduced influence and will be viewed as an
outlier. From this, a threshold on weights can be used to
identify inliers and outliers.
7 Repeat steps 2 to 7 until there is no more conformational
change that needs to be explained, i.e., all the residues have
been accounted for.
In summary, we discussed how to estimate the scale
parameter s from the data (step 3); how core regions are
emphasized over flexible portions in an automatic way (step
4); and how to identify multiple rigidly moving domains and
consequently the hinge regions (steps 6 and 7). The weights
derived from any given alignment help us rank the level of
similarity between two structures.
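The iteratively reweighted loop summarized above can be illustrated with a small self-contained sketch. It is not the authors' implementation, and it rests on several assumptions: it works in 2D, where the weighted least-squares rotation has a closed form, instead of using the 3D Kabsch step; it starts from uniform weights rather than a random local alignment; it assumes Rho(e; s) = e^2 / (e^2 + s^2); and `min_scale`, the function names, and the toy coordinates are hypothetical.

```python
import math
from statistics import median


def gm_weight(e, s):
    """Weight W(e; s) = Psi(e; s) / e for the assumed Rho = e^2 / (e^2 + s^2)."""
    return 2.0 * s * s / (e * e + s * s) ** 2


def weighted_fit(P, Q, w):
    """Weighted least-squares rotation angle and translation taking P onto Q (2D)."""
    total = sum(w)
    pcx = sum(wi * p[0] for wi, p in zip(w, P)) / total
    pcy = sum(wi * p[1] for wi, p in zip(w, P)) / total
    qcx = sum(wi * q[0] for wi, q in zip(w, Q)) / total
    qcy = sum(wi * q[1] for wi, q in zip(w, Q)) / total
    cross = dot = 0.0
    for wi, p, q in zip(w, P, Q):
        px, py, qx, qy = p[0] - pcx, p[1] - pcy, q[0] - qcx, q[1] - qcy
        cross += wi * (px * qy - py * qx)
        dot += wi * (px * qx + py * qy)
    theta = math.atan2(cross, dot)
    c, sn = math.cos(theta), math.sin(theta)
    return theta, qcx - (c * pcx - sn * pcy), qcy - (sn * pcx + c * pcy)


def transform(P, theta, tx, ty):
    """Rotate the points by theta and translate them by (tx, ty)."""
    c, sn = math.cos(theta), math.sin(theta)
    return [(c * x - sn * y + tx, sn * x + c * y + ty) for x, y in P]


def robust_superpose(P, Q, max_iter=50, min_scale=1e-6):
    w = [1.0] * len(P)                     # assumed start: plain least squares
    theta = None
    for _ in range(max_iter):
        new_theta, tx, ty = weighted_fit(P, Q, w)
        moved = transform(P, new_theta, tx, ty)
        errs = [math.dist(a, b) for a, b in zip(moved, Q)]   # step 2: deviations
        s = max(1.5 * median(errs), min_scale)               # step 3: scale
        w = [gm_weight(e, s) for e in errs]                  # step 4: weights
        converged = theta is not None and abs(new_theta - theta) < 1e-12
        theta = new_theta
        if converged:                                        # step 5
            break
    inliers = [e <= s / math.sqrt(3.0) for e in errs]        # step 6
    return theta, (tx, ty), inliers


# Toy data: ten "core" points move rigidly, two "flexible" points move further.
P = [(float(x), float(y)) for x in range(5) for y in range(2)]
P += [(5.0, 0.0), (5.0, 1.0)]
Q = transform(P, 0.5, 1.0, 2.0)
Q[-2:] = [(x + 3.0, y - 3.0) for x, y in Q[-2:]]
theta, shift, inliers = robust_superpose(P, Q)
```

On this toy input the loop settles on the rotation of the rigid core and flags the two displaced points as outliers; a second pass restricted to the outliers would correspond to step 7.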
Results
We applied R-Align successfully to many dynamic proteins
with two known conformations. Fig. 1 shows the first six
snapshots needed for R-Align to identify a rigidly moving
domain in GroEL while aligning chain A of 1AON (blue) to
chain A of 1OEL (red). In the full implementation of the
algorithm, we use several different methods for generating
initial guesses for the rotation and translation parameters.
Additionally, we address the issue of leverage points - these
are residues having extreme influence on the estimator for
some initial guesses. We reduce leverage problems by
controlling the spatial support of the robust superposition
algorithm. We highlight several errors that can potentially
arise in comparative modeling and fold recognition targets,
including over-segmentation (breaking the protein into
several substructures each having roughly similar motion
parameters) and under-segmentation (joining two or more
independently moving substructures into one rigid domain).
REFERENCES
1. Kabsch, W. 1976. A solution for the best rotation to relate
two sets of vectors. Acta Crystallogr. A32:922-923.
2. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and W.
A. Stahel. Robust Statistics: The Approach Based on
conformations of a protein. A standard least-squares
superposition estimates the optimal rotation and translation
parameters by minimizing the squared error between the
coordinates of the corresponding atoms in the two
conformations [1].
In the language of robust statistics, the least-squares solution is
sensitive to gross errors or outliers [2,3], i.e., a large
deviation arising from even a single residue can greatly
distort the estimation of the transformation matrix
parameters. In fact, a least-squares superposition often
produces a physically inappropriate result, as it fails to
distinguish core residues that are structurally stable
from flexible residues that move a lot between multiple
conformations. Additionally, the conformational change
may involve rigid movement of more than one domain. A
least-squares formulation that seeks a single set of rotational
and translation parameters ignores this possibility [4]. We
address these problems by introducing R-Align, a robust
statistics based alignment algorithm.
Robust Statistics and Robust Estimators
The goal is to estimate the rigid-body parameters that can
explain the (rigid) motion of a bulk of residues and identify
deviating substructures for further treatment. To this end, we
introduce an estimator function Rho(e; s) which provides a
cost for any error e at a given scale s. We choose the robust
estimator of Geman and McClure, whose shape is
such that it assigns quadratic cost to low errors (just as least-squares)
but a fixed cost for large deviations. If the scale
parameter s is very large, the Geman-McClure function behaves
like a least-squares estimator.
Given the estimator, we define two new functions: an
influence function Psi(e; s) given by the first derivative of
the estimator Rho(e; s), and a weight function W(e; s)
= Psi(e; s)/e. The Geman-McClure influence function Psi(e; s)
increases linearly for small values of the error e. Then,
as the deviations increase further, the influence function
eventually stops increasing and then begins to decrease. By
decreasing, it gives less influence to residues with
particularly large deviations (we call these outliers).
Importantly, unlike the least-squares formulation, the
influence function goes to zero as the error goes to infinity.
Interestingly, W(e; s) has a bell shape resembling a Gaussian, implying
low weights for large errors [5]. In comparison, the weight
function for the least-squares estimator is a fixed quantity!
We next outline the various steps involved in using the
robust estimator function in finding structural similarities.
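To make the three functions concrete, here is a minimal sketch. It assumes the common normalization Rho(e; s) = e^2 / (e^2 + s^2) for the Geman-McClure estimator; the abstract does not spell out the constants, so this exact form is an assumption, chosen because it gives an influence function that peaks at e = s/sqrt(3).

```python
import math


def rho(e, s):
    """Assumed Geman-McClure cost: ~(e/s)^2 for |e| << s, saturating at 1."""
    return e * e / (e * e + s * s)


def psi(e, s):
    """Influence function: first derivative of rho with respect to e."""
    return 2.0 * e * s * s / (e * e + s * s) ** 2


def weight(e, s):
    """Weight function W(e; s) = Psi(e; s) / e."""
    return 2.0 * s * s / (e * e + s * s) ** 2


s = 1.5
peak = s / math.sqrt(3.0)      # d(psi)/de = 0 here, i.e. rho'' = 0
small, large = 0.01 * s, 100.0 * s
```

Evaluating `psi` on either side of `peak` shows the influence rising and then falling, while `rho(large, s)` stays close to 1, which is the fixed cost for large deviations described above.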
R-Align: An Iteratively Reweighted Least Squares
Superposition Algorithm
1 Start from an initial guess for the rotation and translation
parameters. For this, use the Kabsch least-squares algorithm on
a random set of spatially close residues (local alignment).
2 Measure the deviations (i.e., errors): the distance between
the coordinates of a given atom j in molecule 1 and
molecule 2 after aligning the two conformations.
3 Update the scale parameter using the deviations. We
assume that the bulk of the residues undergoing a coherent
rigid motion have deviations that are normally distributed.
However, because of the mix of core and flexible regions,
we first measure the median of the errors (which is more
Influence Functions. John Wiley and Sons, New York, NY,
1986.
3. Allan Jepson, Foundations of Computer Vision, CSC487,
University of Toronto.
ftp://ftp.cs.utoronto.ca/pub/jepson/teaching/vision/2503/robustEstimation.pdf
4. Zemla, A. LGA: a method for finding 3D similarities in
protein structures.
Nucleic Acids Research, 2003, 31(13):3370-3374.
5. Damm, K. L. and H. A. Carlson. Gaussian-Weighted
RMSD Superposition of Proteins: A Structural Comparison
for Flexible Proteins and Predicted Protein Structures.
Biophys. J. 90:4558-4573.
63: PHYLOGENY-BASED SCORING OF
STRUCTURAL COVERAGE IN PROTEIN FAMILIES
Natasha Sefcovic (UCSD, USA), Christian Zmasek
(Burnham Institute, USA) & Adam Godzik (Burnham
Institute, USA)
Samuel Flores, Chris Bruns & Russ B. Altman (Stanford
University, USA)
RNAs play a pervasive role in gene expression and
regulation. Their structure and dynamics are crucially
important for understanding function. An internal
coordinate representation allows us to freeze bond length
and angle vibrations and rigidify secondary structure,
leading us to recover observed motions of RNA at low
computational cost.
RNAs play a pervasive role in gene expression and
regulation. Their structure and dynamics are crucially
important for understanding function. Attempts to solve
these have been stymied by the long time scales involved,
and the lack of an accurate treatment of counterions.
Freezing bond length and angle vibrations and rigidifying
secondary structure using internal coordinates leads to a
significant reduction in the number of energy evaluations
and an increase in the permissible length of time steps.
The effect of solvent and counterions can be treated by
implicit or explicit means. By combining these methods,
experimentally observed motions of RNA can be recovered.
The results suggest extensibility to large systems such as the
ribosome, which are difficult to study by conventional means.
The recently announced SimTK library for multibody
dynamics contains many tools to make macromolecular
simulations tractable. It is possible to rigidify arbitrary
portions of the molecule or molecules, creating extended
bodies whose internal interactions need not be calculated.
Forces and rubber-band-like elements can be applied at
arbitrary points. Molecules, fragments, or atoms can be
constrained to a hypothetical ground or to each other, in one
or more of their six degrees of freedom. Contact elements
can keep specified molecules within given spatial
boundaries. The time evolution can be controlled by
choosing the time integrator (including variable step size
integrators), and by adding a thermostat and velocity
dampers. These tools can potentially be used to model
systems much larger than are usually considered tractable.
In this work we lay the groundwork for ribosomal dynamics
by predicting the structure and dynamics of the HIV-1
Transactivation Response Element (TAR), a small molecule
used as a model system for RNA dynamics.
In the first stage, we economically generate a large number
of conformations of TAR by rigidifying the two helices in
the molecule, and allowing bond rotations in the junction
connecting the helices. The time evolution is computed with
Coulomb and Van der Waals interactions turned off. We
then evaluate the ability of a Knowledge Based potential
Proteins can be organized into families based on their
common ancestry, which we can recognize from sequence
similarity. If the three-dimensional structure of at least one
protein in the family has been solved, this family is
considered to have "structural coverage," since we can
usually predict structures of all of the other proteins in the
family by comparative modeling. However, the accuracy of
such models critically depends on the level of structural
similarity between templates and modeling targets;
therefore, the quality of the structural coverage depends on
the number and distribution of the proteins with the solved
structures in the family. While intuitively obvious, so far no
quantitative measure of the quality of structural coverage
has been developed.
Here, we propose a quantitative measure of the structural
coverage of a protein family, in which we use distances
along a phylogenetic tree to calculate and compare the
impact of specific proteins on the structural coverage. We
explain our measure using several examples and compare it
to several alternatives. We show that the choice of proteins
that have been solved for their own individual reasons, as
recorded to date in the Protein Data Bank, does not provide
optimal coverage of the family as a whole.
With such a measure, we can now begin to provide exact
answers to questions such as: How many experimental
structures do we need to achieve a specific level of model
quality? What is the optimal order for target selection?
Would solving structures of some subsets of proteins
provide better modeling coverage of the family than solving
others? A quantitative measure of structural coverage for
protein families is, therefore, needed to have a rational and
meaningful discussion of the goals and achievements not
only of structural genomics, but also of those researchers
interested in protein structure, function and evolution, as well
as those in the modeling community. We trust that our
proposal of such specific scoring systems will begin to add
substance to this discussion.
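As a purely illustrative sketch of how tree distances could feed such a score, the code below computes the mean path length from each family member to its nearest solved structure. This is a hypothetical stand-in, not the measure proposed in this abstract (which is not specified here); the tree, branch lengths, and function names are all assumptions.

```python
from collections import defaultdict
import heapq


def tree_distances(edges, source):
    """Path lengths from `source` to every node of a weighted tree (Dijkstra)."""
    adj = defaultdict(list)
    for a, b, length in edges:
        adj[a].append((b, length))
        adj[b].append((a, length))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                      # stale heap entry
        for nxt, length in adj[node]:
            nd = d + length
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def mean_distance_to_template(edges, leaves, solved):
    """Hypothetical score: mean tree distance from each leaf to its nearest solved leaf."""
    per_template = [tree_distances(edges, t) for t in solved]
    return sum(min(d[leaf] for d in per_template) for leaf in leaves) / len(leaves)


# Hypothetical toy family: four proteins (A-D) attached to two internal nodes.
edges = [("n1", "A", 0.1), ("n1", "B", 0.2), ("n1", "n2", 0.5),
         ("n2", "C", 0.1), ("n2", "D", 0.3)]
leaves = ["A", "B", "C", "D"]
score_one = mean_distance_to_template(edges, leaves, ["A"])
score_two = mean_distance_to_template(edges, leaves, ["A", "C"])
```

Comparing the score before and after adding a candidate structure gives one simple way to express the impact of a specific protein on coverage, and hence to compare candidate targets.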
69: INTERNAL COORDINATE METHODS FOR
MACROMOLECULAR STRUCTURE AND
DYNAMICS
trained on known protein structures, to find correct
conformations from among the thus-generated decoys.
In a second stage, we use an all-atoms SimTK analysis to
model the TAR's motion under explicit water, explicit
counterion conditions. We show how using a small amount
of water around each molecule compares to the use of an
extensive water environment. We comment on the effect of
the ion environment on the conformational dynamics of
TAR. The results show how the dynamics of small RNAs
can be computed accurately and economically using our
novel internal coordinate dynamics code. Possible
extensions to larger systems are discussed.
Owing to bacterial resistance to current antibiotic
drugs, there is special interest in developing antibacterial
peptide agents to address the resistance problem.
Antibacterial peptides have several advantages over current
drugs: they have rapid microbicidal activity as a
consequence of their natural role in responding to pathogenic
challenges, and they have multiple targets either within the cell
or in the bacterial cell membrane. Because of their net
positive charge, they are attracted electrostatically to, and
interact with, the cytoplasmic membrane. They interact with
the charged components in the outer layer of the bacterial
surface: with phosphate in the lipopolysaccharides of Gram-negative
bacteria or with the lipoteichoic acids on Gram-positive
surfaces [2].
The structural studies of AMEs have shown that
although there is no common folding motif for binding
aminoglycosides, antibiotic-binding pockets are
consistently lined with negatively charged residues so as to
effectively attract and bind the positively charged drugs.
Wright et al. [3] examined the cationic peptides to study
their function as broad-spectrum inhibitors of AMEs. It was
identified that the antimicrobial peptide Indolicidin and
analogs thereof were able to inhibit APH (3')-IIIa [4], AAC
(6')-Ii [5], and the bi-functional enzyme APH (2")-AAC (6').
This signifies that in principle it may be possible to develop
broad-spectrum inhibitors of AMEs so as to combat
aminoglycoside antibiotic resistance. However, peptides are
unsuitable as therapeutic agents due to their poor
pharmacokinetic properties, and as such Indolicidin and its
analogs must be viewed as leads for the development of
peptidomimetics.
To advance the design of such peptidomimetics it is
imperative that detailed information on the interactions
between cationic peptides and different AMEs is obtained.
For this reason, a series of Monte Carlo based conformational
searches [6,7] is performed on a group of available cationic
antimicrobial peptides: Indolicidin, its two analogs (CP10A
and CP11CN) and an ensemble of Indolicidin derivatives
with lengths of less than 14 amino acids (GW11-GW28),
against two different AMEs, APH(3')-IIIa and AAC(6')-Ii.
The predicted binding sites of these peptides are evaluated
by calculating binding affinities and analyzing their
interaction modes with the key sites of the substrate
binding pockets.
The calculated binding affinities of peptides in
complex with APH (3')-IIIa correlate highly
with the experimental data for peptides with lengths
of more than ten and less than eight amino acids, whereas
the calculated binding affinities for the peptides against AAC
(6')-Ii are in complete agreement with the corresponding
experimental data.
These observations validate the efficiency of our
computational approach, which was then applied to a set of
docking studies on a new group of designed peptides. The
peptides either partly occupy both the ATP and aminoglycoside
binding pockets and form a bridge-like binding site between
them, or locate only in the vicinity of the aminoglycoside
binding site, by which they inhibit antibiotic activity rather than
ATP (e.g., two peptides in ribbon representation in the Figure).
70: MONTE CARLO CONFORMATIONAL SEARCH
ON CATIONIC PEPTIDE INHIBITORS OF
ANTIBIOTIC RESISTANCE ENZYMES
Laleh Alisaraie, Albert M. Berghuis (Department of
Biochemistry, McGill University, Montreal, Canada )
Results of investigations on cationic peptide
inhibitors are presented. Binding sites of the peptides are
identified and evaluated. The calculated binding affinities
are validated against the experimental data and, based on
these observations, a novel potential peptide inhibitor is
introduced as a lead compound for development of a
peptidomimetic adjuvant that can inhibit enzyme-mediated resistance to aminoglycoside antibiotics.
Aminoglycosides are often prescribed as broad-spectrum
antibiotics. The available data from the last decades show
increasing bacterial resistance to the available
aminoglycoside antibiotic classes. Resistance to
aminoglycosides is usually due to a dramatic increase
in the enzymatic activity of aminoglycoside-modifying enzymes
(AMEs): aminoglycoside acetyltransferases (AACs),
nucleotidyltransferases (ANTs) and phosphotransferases
(APHs).
Aminoglycoside resistance often appears as a result of
plasmid-borne genes encoding AMEs, which in different
groups are capable of targeting the A site of the 30S bacterial
ribosome. Chemical modification of these species is
catalyzed by AMEs, among which AACs and APHs are the
main culprits owing to the wide variety of their
chemical modifications of the antibiotics, not only in terms of
the modification sites but also in several unique resistance
profiles and protein structure designations [1].
Among the inhibitors in the designed database, GW31, with
a length of six amino acids (ILAWAW), demonstrates a
high binding affinity and strong interactions with the key
amino acids in the substrate binding pockets of both
APH (3')-IIIa and AAC (6')-Ii. In the Figure, GW31 (spheres) is
shown occupying the binding site of ATP (green stick) in
complex with APH (3')-IIIa.
GW31 is introduced as a novel lead compound for the
development of a peptidomimetic adjuvant that can inhibit
enzyme-mediated resistance to aminoglycoside antibiotics.
REFERENCES:
1. K. J. Shaw, P. N. Rather, R. S. Hare, G. H. Miller.,
Microbiol. Rev., 1993, 57, 138-163
2. R. E. Hancock, A. Rozek., FEMS Microbiol. Lett., 2002,
206, 143-149
3. D. Boehr, K. Draker, K. Koteva, M. Bains, R. Hancock,
G. Wright., Chemistry & Biology, 2003, 10, 189-196
4. D. H. Fong, A. M. Berghuis., EMBO J., 2002, 21, 2323-2331
5. D. L. Burk, B. Xiong, C. Breitbach, A. M. Berghuis.,
Acta Crystallogr., Sect. D, 2005, 61, 1273-1279
6. C. McMartin, R. Bohacek, J. Computer-Aided. Mol. Des,
1997, 11, 333-344
7. L. Alisaraie, L. Haller, G. Fels., Chem. Inf. Model., 2006,
46, 1174 -118
Last Name
Chuang
Cipriano
Dalafave
del Sol
Dintyala
Subrahmanya
Dixit
Doxey
Dunbrack
First Name
Syed
(Nabil)
Daniel
Gabriele
Ivet
Ahmet
Andrew
Christopher
Phil
Mitchell
Patrick
Forbes
Chris
Matthieu
Prof.
Chakra
Gwo-Yu
Gregory
Danielle
Antonio
Venkata
Ravikant
Surjit
Andrew
Roland
Durek
Pawel
Ekman
Ellis
Falk
Fan
Flores
Fromer
Gallin
Georgiev
Ghersi
Gippert
Glazer
Hall
Hart
Heifets
Heilbut
Hennerdal
Hirayama
Hou
Huehne
Isin
Jaitly
Jefferys
Jiang
Juritz
Kaburagi
Kamisetty
Kauko
Keedy
Diana
Jonathan
Jenny
Samuel
Samuel
Menachem
Warren
Ivelin
Dario
Garry
Dariya
David
Reece
Abraham
Adrian
Aron
Kazunori
Zhenglin
Rolf
Dr. Basak
Navdeep
Benjamin
Bo
Ezequiel
Takashi
Hetunandan
Anni
Daniel
Ali
Almonacid
Ausiello
Bahar
Bakan
Bordner
Bottoms
Bourne
Brittnacher
Buck
Burkowski
Bystroff
Chavent
Chennubhotla
Country
Affiliation
Imperial College
University of California
University of Rome - Tor Vergata
University of Pittsburgh
Mayo Clinic
Univ of Missouri-Columbia
University of California
University of Washington
Rensselaer Polytechnic Institute
University of Waterloo
Rensselaer Polytechnic Institute
CNRS/INRIA
Boston University
University of Wisconsin-Madison
The College of New Jersey
Fujirebio Inc
Cornell University
Zymeworks Inc.
University of Waterloo
Fox Chase Cancer Center
Max-Planck-Institute of Molecular
Plant Physiology
The Arrhenius Laboratories
Statistics Department. DEFS
The Arrhenius Laboratories
Victor Chang Cardiac Research Institute
Stanford University
The Hebrew University of Jerusalem
University of Alberta
Duke University
Mount Sinai School of Medicine
Novozymes A/S
Stanford University
Boston University
Genentech, Inc
University of Toronto
Boston University
Center for Biomembrane Research
Waseda University
Pioneer, A DuPont Company
Fritz Lipmann Institute (FLI)
Imperial College
University of Massachusetts Amherst
Unidad de Fisicoquimica - CEI
Waseda University
University of Pittsburgh
Stockholm University
Duke University
UK
USA
Italy
USA
USA
USA
USA
USA
USA
USA
Canada
USA
France
USA
USA
USA
USA
Japan
USA
Canada
Canada
USA
Germany
Sweden
Australia
Sweden
Australia
USA
Israel
Canada
USA
USA
Denmark
USA
USA
USA
Canada
USA
Sweden
Japan
USA
Germany
USA
Canada
UK
USA
Argentina
Japan
USA
Sweden
USA
Kelm
Korkin
Kortemme
Kozakov
Sebastian
Dmitry
Tanja
Dima
Kraut
Adam
Krissinel
Landon
Lario
Lewis
Lilien
Liu
Meireles
Mettu
Moll
Morita
Eugene
Melissa
Paula
Byungkook
(BK)
Benjamin
Ryan
Lu
Lidio
Ramgopal
Mark
Mizuki
Moult
John
Najmanovich
Nugent
Palopoli
Peterson
Qiu
Reuveni
Reyon
Ritchie
Rohs
Rux
Ryabov
Safi
Samish
Schlick
Sefcovic
Senes
Seto
Shackelford
Shoichet
Shrivastava
Sierk
Sternberg
Stogios
Taylor
Rafael
Timothy
Nicolas
Matthew
Dr. Yang
Shlomi
Deepak
Dave
Remo
John
Yaroslav
Maria
Ilan
Tamar
Natasha
Alessandro
Marian
George
Brian
Indira
Michael
Michael
Peter J
Todd
JeanFrancois
Lee
Tomb
Tress
Michael
Vajda
Sandor
Valencia
Alfonso
Veeramalai
Mallika
Wallach
Wood
Izhar
Graham
Wymore
Troy
University of Oxford
University of Missouri
University of California, San Francisco
Boston University
National Resource for Biomedical
Supercomputing
European Bioinformatics Institute
Brandeis University
Zymeworks, Inc.
National Institutes of Health
Iowa State University
University of Massachusetts Amherst
Rice University
The University of Tokyo
University of Maryland Biotechnology
Institute
European Bioinformatics Institute
University College London
Unidad de Fisicoquimica - CEI
The MITRE Corporation
GlaxoSmithKline
Tel Aviv University
University of Aberdeen
Howard Hughes Medical Institute
Wistar Institute
NIH
University of Pennsylvania
New York University
Joint Center for Structural Genomics
University of Wisconsin, Madison
Bayer HealthCare Pharmaceuticals Inc.
University of California, Santa Cruz
UCSF
Saint Vincent College
Imperial College
NIST
E.I.DuPont De Nemours & Co., Inc
Spanish National Cancer Research
Centre
Boston University
Spanish National Cancer Research
Centre
Burnham Institute for Medical
Research
Macquarie University
National Resource for Biomedical
Supercomputing
UK
USA
USA
USA
USA
UK
USA
Canada
USA
USA
Canada
USA
USA
USA
USA
Japan
UK
UK
Argentina
USA
China
Israel
USA
UK
USA
USA
USA
Canada
USA
USA
USA
USA
USA
USA
USA
USA
USA
UK
Canada
USA
USA
Spain
USA
Spain
USA
Canada
Australia
USA
mpeterson @mitre.org
Xu
Xu
Yamanishi
Zaback
Zavodszky
Zhang
Zhou
Zhu
Zsoldos
Jinbo
Qifang
Yoshihiro
Peter
Maria
Yi
Ming
Hongbo
Zsolt
Toyota Tech Inst at Chicago
Fox Chase Cancer Center
Ecole des Mines de Paris
Michigan State University
University of Massachusetts
Columbia University
Max-Planck-Institut für Informatik
SimBioSys Inc.
USA
USA
France
USA
USA
USA
USA
Germany
Canada
Index by abstract ID
Fig
ID Title
Presenting author
Page
Ivet Bahar
6
Donald Petrey, Markus
Fischer & Barry Honig
6
Alfonso Valencia
6
Tamar Schlick
6
Philip Bourne
7
Tanja Kortemme
7
Brian Shoichet
7
MetaMol: High quality visualization of Molecular
Skin Surface
Matthieu Chavent, Bruno
Levy & Bernard Maigret
28
3 Specific interactions for ab initio folding of proteins
Yuedong Yang & Yaoqi
Zhou
K1 Toward elucidating allosteric mechanisms of function via structure-based analysis of protein dynamics
K2 On the nature of protein fold space: extracting functional information from apparently remote structural neighbors
K3 Prediction of functional characteristics based on sequence and structure
K4 Chromatin structure insights revealed by mesoscale modeling
K5 I am not a PDBid I am a Biological Macromolecule
K6 Conformational flexibility and sequence diversity in computational protein design
K7 Hits, Leads & Artifacts from Virtual and High-Throughput Screening
2
4 Structure determination of protein-protein complexes using parameters of their overall rotational dynamics available via NMR relaxation data
5 Focused docking: a computational approach to improve small-molecule docking into protein structures
6 Crystal contacts as nature's docking solutions
29
Yaroslav Ryabov & Charles
Schwieters
30
Dario Ghersi & Roberto
Sanchez
31
Eugene Krissinel
24
7
Support Vector Machine-based Transmembrane
Protein Topology Prediction
Tim Nugent & David Jones
32
9
Proteins: coexistence of stability and flexibility
(MGMS awardee)
Shlomi Reuveni, Rony
Granek & Joseph Klafter
18
10
Diana Ekman, Åsa K.
Björklund and Arne
Elofsson
33
Gabriele Ausiello, Pier
Federico Gherardini, Elena
Gatti, Ottaviano Incani &
Manuela Helmer-Citterich
34
Jenny Falk & Arne
Elofsson
35
Dave Ritchie, Dima
Kozakov & Sandor Vajda
36
Michael Sternberg, Stephen
Muggleton, Ata Amini,
Huma Lodhi, David Gough
& Paul Shrimpton
22
Takashi Kaburagi &
Takashi Matsumoto
36
Syed Ali & Michael
Sternberg
10
Andrew Bordner
19
Domain rearrangement and domain creation in the
evolution of new proteins
11 A novel method for the detection of protein local structural motifs binding specific ligand fragments
12 How common are internal repeats in alpha-helical membrane proteins?
13 Accelerating and Focusing Protein-Protein Docking Correlations Using Multi-Dimensional Rotational FFT Generating Functions
14 Logic-based drug discovery
15 An Approach to Transmembrane Protein Structure Prediction with Stochastic Dynamical Systems using Backward Smoothing
16 The evolution of protein function driven by a multi-domain repertoire (MGMS awardee)
17 Predicting small ligand binding sites on proteins using low-resolution structures
18 Geofold: a mechanistic model to study the effect of topology on protein unfolding pathways and kinetics
Vibin Ramakrishnan, Saeed Salem, Saipraveen Srinivasan, Wilfredo Colon, Mohammed Zaki & Chris Bystroff
25
19
Scoring confidence index: statistical evaluation of
ligand binding mode predictions
Maria Zavodszky, Andrew
Stumpff-Kan, David Lee &
Michael Feig
20
20
i-SITE: Energy-based method for predicting ligand-binding sites on protein structures
Mizuki Morita, Tohru
Terada, Shugo Nakamura &
Kentaro Shimizu
37
21
Coil within the membrane: Structural anomaly for
functional needs
Anni Kauko, Kristoffer
Illergård & Arne Elofsson
37
22 Functional insights from binding sites similarities complement existing methods for prediction of protein function
Rafael Najmanovich & Janet Thornton
20
23 Molecular Dynamics Simulations Using An Alpha-Carbon-Only Knowledge-Based Force Field For Protein Structure Prediction
Patrick Buck and Chris
Bystroff
39
Environment-Specific Substitution Tables for
Membrane Proteins
Sebastian Kelm, Jiye Shi &
Charlotte M. Deane
40
25 Identification of novel inhibitors for ubiquitin C-terminal hydrolase-L3 by virtual screening
Kazunori Hirayama,
Shunsuke Aoki, Kaori
Nishikawa, Takashi
Matsumoto, Keiji Wada
40
26 Algorithms for protein design
Ivelin Georgiev, Cheng-Yu
Chen & Bruce Randall
Donald
16
27 A novel scoring function in eHiTS and LASSO
Zsolt Zsoldos, Danni
Harris, Mehdi Mirzazadeh,
Aniko Simon
42
28 SE: An algorithm for deriving sequence alignment from superimposed structures
Chin-Hsien Tai, James J. Vincent, Changhoon Kim & Byungkook Lee
43
24
29 Minor groove electrostatics and binding specificity
30 Co-Evolution of Structural Bioinformatics and Protein Design for N-cap Backrubs
31 Classification of mechanistically diverse enzyme superfamilies according to similarities in reaction mechanism
32 Efficient Protein Conformation Sampling in Real Space
Remo Rohs, Sean West,
Peng Liu & Barry Honig
12
Daniel Keedy, Ed Triplett,
David Richardson, Jane
Richardson, Ivelin Georgiev,
Cheng-Yu Chen & Bruce R.
Donald
44
Daniel Almonacid &
Patricia C. Babbitt
15
Jinbo Xu
45
33 Modeling the Interaction of MAP Kinase Phosphatase 3 with a Novel Inhibitor by Accounting for Conformational Factors
Ahmet Bakan, Gabriela Molina, Andreas Vogt, Michael Tsang & Ivet Bahar
46
34 How good can template-based modelling be?
Braddon K. Lance, Graham R. Wood & Charlotte M. Deane
47
35 Contact Prediction for Membrane Proteins
36 Computational Methods to Advance from Crystallographic Model to Enzyme Mechanism and Structure-Function Relationships
37 Molecular Surface Abstraction
Aron Hennerdal & Arne
Elofsson
48
Troy Wymore & Adam
Kraut
48
Gregory Cipriano, George
Phillips & Michael Gleicher
49
Maxim Shapovalov &
Roland Dunbrack
26
38
The next generation of the backbone dependent
rotamer library
39
High-Throughput Crystal Structure Prediction of
Drug-Like Molecules
Bashir Sadjad, Zsolt
Zsoldos and Aniko Simon
50
40
The Jena Library of Biological Macromolecules JenaLib: New Features
Rolf Huehne, Frank-Thomas Koch & Juergen Suehnel
51
41 Computational insights into redox-active disulfides in protein structures
Samuel Fan, Richard George, Naomi Haworth & Merridee Wouters
42 SA-COMPAS: A resource for prediction, assessment, and web-based visualization of comparative protein models
Adam Kraut and Troy
Wymore
53
Hetunandan Kamisetty &
Christopher Langmead
23
Qifang Xu and Roland
Dunbrack
54
43
Conformational free energy of protein structures:
computing upper and lower bounds
44
Statistical analysis of interfaces in crystals of
homologous proteins
45 Xenon Effects on Ligand Binding Domain of NMDA Receptor
Lu Liu, Yan Xu & Pei Tang
47 Large Scale Motions in Glutamate Transporters Revealed by Elastic Network Models and Cysteine Cross-linking Studies
52
Indira H Shrivastava, Jie
Jiang, Susan G. Amara &
Ivet Bahar
55
56
Menachem Fromer & Chen
Yanover
48
Accurate Prediction of the Near-Optimal Sequence
Space for Atomic-Level Protein Design
49 Distribution and extension of protein conformational diversity
Ezequiel Iván Juritz, Sebastián Fernández Alberti & Gustavo Parisi
58
50
Poing: a fast and simple model for protein structure
prediction
Benjamin Jefferys,
Lawrence Kelley & Michael
Sternberg
17
Dima Kozakov, Gwo-Yu
Chuang, Dmitry Beglov,
Ryan Brenke & Sandor
Vajda
59
51 Analysis of potential proton channel inhibition mechanisms by computational protein mapping
57
52 Exploring the Activation Mechanism of a G-Protein-Coupled Protein Receptor, Rhodopsin, Using Normal Modes from Coarse-grained Elastic Network Models in Molecular Dynamics Simulations
Basak Isin, Klaus Schulten, Emad Tajkhorshid & Ivet Bahar
59
53 TOPS++FATCAT: fast flexible structural alignment using constraints derived from TOPS+ Strings Model
Mallika Veeramalai, Yuzhen Ye & Adam Godzik
61
54 Conformational Diversity modulates protein sequence divergence
Ezequiel Iván Juritz, Sebastián Fernández Alberti & Gustavo Parisi
62
55
Renaming diastereotopic atoms for consistent PDB-wide analyses
Christopher Bottoms &
Dong Xu
62
Danielle S. Dalafave
64
56 Predicting new engrailed homology motifs from structural and energy studies of the WD propeller domain bindings to known motifs
57
Use of evolutionary information in model quality
evaluation for protein structure prediction
Nicolas Palopoli, Diego
Gomez Casati & Gustavo
Parisi
64
58 Computational Discovery of Small Molecular Weight Protein Interaction Inhibitors
Lidio Meireles, Alexander Doemling & Carlos Camacho
65
59 A fragment based method for the prediction of atomistic models of transmembrane helix-helix interaction
Alessandro Senes, Dan W
Kulp, David T. Moore &
William F DeGrado
60 Combining Predictions of Protein Structure and Protein-RNA Interaction to Model the Structure of the Human Telomerase Complex
Ben A. Lewis, Mateusz Kurcinski, Deepak Reyon, Jae-Hyung Lee, Vasant Honavar, Robert L. Jernigan, Andrzej Kolinski, Andrzej Kloczkowski & Drena Dobbs
67
13
61 Comparing Sequence and Structure-based Classifiers for Predicting RNA Binding Sites in Specific Families of RNA Binding Proteins
Michael Terribilini, Cornelia Caragea, Deepak Reyon, Ben Lewis, Li Xue, Jeffry Sander, Jae-Hyung Lee, Robert L Jernigan, Vasant Honavar, Krishna Rajan & Drena Dobbs
67
62
R-Align: A Robust Statistics Based Superposition
Algorithm for Proteins
Chakra Chennubhotla &
Ivet Bahar
68
63
Phylogeny-based scoring of structural coverage in
protein families
Natasha Sefcovic, Christian
Zmasek & Adam Godzik
70
Dariya S. Glazer, Randall J.
Radmer & Russ B. Altman
8
George Shackelford &
Kevin Karplus
27
66 Predicting DNA-binding affinity of modularly designed zinc finger proteins
Peter Zaback, Jeffry D. Sander, J. Keith Joung, Daniel F. Voytas & Drena Dobbs
11
67
Channeling protein structure analysis towards
understanding cough dynamics
Ilan Samish & William F.
DeGrado
14
68
An Automatic Server for Function Prediction
Evaluation
Michael Tress, Alfonso
Valencia, Michael Sternberg
& Mark Wass
9
64 4D Structure-based Function Prediction
65 Two stage residue-residue contact predictor
69 Internal coordinate methods for macromolecular structure and dynamics
Samuel Flores, Chris Bruns & Russ B. Altman
70
70 Monte Carlo Conformational Search on Cationic Peptide Inhibitors of Antibiotic Resistance Enzymes
Laleh Alisaraie, Albert M. Berghuis
71
