Kevin Drew Thesis Intro - Bonneau Lab

Predicting old functions and designing
new ones: Genome scale protein
annotation & Peptidomimetic design
by
Kevin Drew
A dissertation submitted in partial fulfilment
of the requirements for the degree of
Doctor of Philosophy
Department of Basic Medical Science
New York University
May 2013
Richard Bonneau, PhD
Introduction Only
© Kevin Drew
All Rights Reserved, 2013
Introduction
Erwin Schrödinger in his 1944 book, What is Life?, described life as a system capable of exporting
entropy [1]. Simply, if the living system is unable to retain its own order and export entropy
to the environment, it will reach equilibrium and die. Conversely, if it is capable of transferring
entropy from inside the system to outside while retaining order, it will continue living. This
description of life is neat and concise. The mechanisms that achieve such a task are, however,
enormously complex. Proteins, polymeric chains of amino acids, are responsible for much of this
sophisticated functionality.
This thesis is, broadly, about protein function. The study of protein function, as with nearly
all biological fields, depends on the intimate relationship between the concept of the gene and
its end product, most notably protein. Genes are passed from cell to cell, organism to organism,
generation to generation and most often encode biochemically active proteins that interact with
each other (and other molecules) to carry out the functions of the cell. Understanding genes,
their resulting proteins, and their interactions within the cell is perhaps the fundamental focus
of modern biology.
There are many ways to go about this understanding but ultimately the interactions of life
are chemical and best described at the atomic and molecular level. The goal of this atomic
and molecular approach is to identify, for example, the specific atoms or amino acid residues
responsible for a specific function of a protein produced by a gene that is linked to a phenotype
of interest (such as a disease state). If we have a clear understanding of protein function at this
level of resolution, we can then begin to manipulate their functions to, for instance, cure diseases.
These challenges are enormous but this thesis addresses two specific avenues (among many)
for using atomic and molecular approaches to better describe and engineer biology, allowing
greater understanding and utility. The two avenues revolve around the computational software
Rosetta which has the ability to both model and design atomic structure.∗ The first approach
uses the modeling aspect of Rosetta as part of a genome annotation pipeline that annotates
structure and function to proteins of unknown function. The second approach uses the design
aspect of Rosetta to design specific molecular inhibitors to protein interactions, thus modulating
their function. These two approaches address important problems in the field of protein function.
∗ “If the only tool you have is a hammer, you tend to see every problem as a nail.” - Abraham Maslow
0.1 Roots of the field
The roots of this research rely heavily on work which began in the middle of the last century when
the concept of the gene was nonphysical, the genetic material was still unknown and molecular
biology had yet to evolve from genetics and biochemistry (see timeline in figure 1).
In 1941, George Beadle and Edward Tatum, attempting to understand how genes control biochemical function through experiments with the model system Neurospora, conclusively linked
the gene to a biochemically active protein [2]. Using x-ray mutagenesis, they found strains incapable of synthesizing vitamin B6 which was due to a lack of functional protein. These effects were
inherited by further generations of the strains — suggesting the gene held the information necessary for a specific biochemical function. This discovery was the pivot point on which the fields
of genetics and biochemistry would eventually merge. In 1944, Erwin Schrödinger, a quantum
physicist, published a series of lectures, What is Life? [1], in which he contemplated the physical
description of life. This book inspired many researchers to think about the chemical
nature of the gene. Common thought of the day was that protein was the genetic material and
DNA was just a structural component of the cell. Oswald Avery conducted an experiment in
1944 that suggested DNA, not protein, was the chemical component of the gene [3], and this
work was furthered by Hershey and Chase in 1952, who isotopically labeled protein and DNA to
show that DNA was definitively the chemical component of the gene [4]. In 1953, Watson and Crick
determined the structure of DNA, providing the greatest evidence that DNA and the genetic material
were one and the same [5]. The next dozen or so years provided the “Central Dogma” [6], the
genetic code [7, 8] and a near complete description of the relationship of genetic codons to amino
acids [9]. Intercalated within these events was Linus Pauling’s discovery in 1949 that a genetic
mutation caused a molecular change in the protein hemoglobin that altered its function [10].
Figure 1: Timeline of important protein structure and function discoveries
Besides these events being foundational discoveries of molecular biology, they provide the basis
for functional annotation of the proteins in a genome. If one knows the DNA sequence of a gene,
one will know the protein sequence that that gene encodes. Knowing this, along with Pauling’s
observation that the protein’s sequence is related to its function, provides a tremendously powerful
insight into determining the function of proteins.
Fast forward four decades (with a brief stopover in 1977 to mention Frederick Sanger’s sequencing technology and the publication of the first sequenced genome, bacteriophage
phi X174 [11]) to 2001 and the complete sequence of Human DNA [12, 13] is known along with
model organisms of E. coli [14], Yeast [15], C. elegans [16], Drosophila melanogaster [17] and
Arabidopsis thaliana [18]. With this vast amount of sequence data for genes and their resulting
proteins and with the observation that a gene’s sequence is related to its function, we have a lot
of the pieces necessary to determine a gene’s function.
0.2 Anfinsen’s dogma
Unfortunately, however, uncovering general principles that determine a protein’s function from
its sequence has been more complex than uncovering the code that translates DNA sequence into
protein sequence. To understand the relationship between protein sequence and protein function,
we must think more about the configuration of atoms that is determined by a protein’s sequence
and ultimately its shape. A protein’s function is determined by the protein’s sequence of amino
acids, but only indirectly. More directly, a protein’s function is determined by the arrangement
of atoms in three dimensional space. For example, the enzyme, Ribonuclease A, consists of 124
amino acids (372 DNA coding bases) and functions to cleave single stranded RNA. Christian
Anfinsen, in his studies of Ribonuclease A, showed that denaturing and reducing the protein’s
disulfide bonds destroyed its ability to cleave RNA. More importantly, however, he showed that
when the disulfides are re-oxidized, the protein’s function returns, albeit at a lower percentage
of the native protein [19, 20]. This showed that the protein could be chemically unfolded and
then recover its shape and therefore its function without any external factors. In other words, all
the information necessary for the protein’s fold and function was in its sequence. Unbeknownst
to him at the time, the reduction of disulfide bonds in Ribonuclease A and subsequent unfolding
caused the disruption of important residues for catalytic activity. Interestingly, none of the
residues important for Ribonuclease A function, His12, Lys41, His119 and Asp121, are near each
other in linear sequence space but it is their relative orientation in three dimensional space that
allows each residue to carry out its chemical function (figure 2).
Figure 2: Ribonuclease A: a model for protein folding. Protein structure (pdbid: 7RSA) of
Ribonuclease A, showing residues important for function (His12, Lys41, His119 and Asp121) and residues
important for conformational stability (Cys26, Cys81, Cys58, Cys110). Anfinsen’s work on Ribonuclease
A laid the foundation for the field of protein folding.
0.3 Evolutionary relationships of protein structure
Another observation about the relationship between protein sequence and structure came through
examining the first experimental structures of proteins. It was noticed almost immediately after
the x-ray crystal structure of the tetrameric hemoglobin had been solved that each chain
resembled the structure of myoglobin, a protein whose structure was solved just a few years
earlier [21, 22]. At the time, the sequences of these proteins were not fully known so further
comparisons of the sequence - structure relationship were limited. But once protein sequencing
technologies improved and more sequences became available, Perutz and Kendrew noted that only
9 residues out of 140 total amino acids in hemoglobins from various vertebrates were identical
even though their overall three dimensional structures were similar [23].
Later in 1986, Chothia and Lesk analyzed the protein structures available at the time by
comparing both sequences and structures of homologous protein pairs [24]. They discerned a
trend where protein structure was more conserved than protein sequence. This observation
generalized Perutz and Kendrew’s observation of the globin family sequence-structure relationship
and suggests that when available, protein structure similarity is a better metric to infer homology
than protein sequence.
Further studies of the relationship between protein structure and protein sequence continued
as the number of protein sequences and structures grew. Contrary to the early days of protein
crystallography, where the structures of some proteins were known and their sequences were not,
protein sequences soon became available in abundance while protein crystallography continued
at a slow pace. Because of the difficulty in experimentally solving protein structures, it became
of interest to infer protein structure from sequence alone. One notable analysis by Chris Sander,
later improved by Burkhard Rost, laid out parameters of protein sequence similarity (i.e.
residue similarity and length of alignment) that could be used to accurately determine similar
structures between pairs of proteins [25, 26]. This allowed one to infer the structure of one protein
from another if their sequences were similar enough. Interestingly, there seems to be a realm
of sequence space, coined the “Twilight Zone of protein sequence alignment”, where a pair of
proteins shows no detectable sequence similarity but has nearly identical three dimensional folds. In other
words, two proteins in this region that share structure similarity but not sequence similarity are
still likely to have a common evolutionary origin (and possibly a conserved molecular function).
From this one can reasonably infer that pairs of proteins with high sequence similarity may
be predicted to have similar functions because of the likely similarity of structures. But what
about protein pairs in the “Twilight Zone”? Is it possible to find relationships among proteins
whose sequences have diverged beyond the detectable limits? Without sequence similarity, we
would have to rely on structure similarity. Unfortunately, experimentally determining protein
structure is notoriously expensive and laborious. We are left with a need to theoretically predict
the three dimensional structure of a protein from its sequence alone. This is what is known as the
protein folding problem.
0.4 Protein folding problem
In lecturing about the protein folding problem in 1969, Cyrus Levinthal calculated that for a
100 residue protein, there would be 10^300 separate conformations available [27]. Essentially, a
sequence can exhibit 10^300 different ‘folds’ and it is the theoretician’s job to pick the one that
represents the one observed in nature. Levinthal went on to note that nature surely did not sample
all conformations, so there must be identifiable rules that limit the search space of protein folding
and one such rule is that local structure formation of the protein backbone nucleates the rest of
folding (another is the observation that hydrophobic packing in the core of the protein is a main
driving force of folding [28]).
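To put Levinthal's number in perspective, the short sketch below estimates how long an exhaustive search over 10^300 conformations would take. The sampling rate of 10^13 conformations per second and the age of the universe used for comparison are illustrative assumptions, not figures taken from Levinthal's lecture.

```python
import math

# Figure quoted in the text for a 100 residue protein.
LOG10_CONFORMATIONS = 300

# Assumed values, for illustration only.
SAMPLES_PER_SECOND = 1e13        # hypothetical sampling rate
SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10


def log10_years_for_exhaustive_search(log10_conformations, samples_per_second):
    """Return log10 of the years needed to visit every conformation once."""
    log10_seconds = log10_conformations - math.log10(samples_per_second)
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


log10_years = log10_years_for_exhaustive_search(LOG10_CONFORMATIONS, SAMPLES_PER_SECOND)
log10_universe_ages = log10_years - math.log10(AGE_OF_UNIVERSE_YEARS)
print(f"Exhaustive search: on the order of 10^{log10_years:.1f} years")
print(f"Relative to the age of the universe: 10^{log10_universe_ages:.1f} times longer")
```

Working in log10 keeps the arithmetic within ordinary floating point range. Whatever sampling rate is assumed, the conclusion is the same: random search cannot explain folding, so the search space must be constrained.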
Rosetta is one of many computational algorithms∗ that take advantage of local structure
formation prior to global folding while attempting to predict the three dimensional structure of
a protein [29]. To do this, Rosetta combines many small fragments of experimentally determined
structures whose sequences match portions of the sequence of interest. This approximation
removes a large number of degrees of freedom from the system and therefore makes the problem
more tractable. Rosetta then uses a Monte Carlo simulated annealing approach to sample the
remaining degrees of freedom. After each change of a degree of freedom, the new conformation
is compared to the old conformation using an energy based score function and is accepted
or rejected based on the Metropolis criterion. The score function comprises several terms,
including one that favors hydrophobic residues in the core of the protein and hydrophilic residues
exposed to solvent.
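The Monte Carlo simulated annealing procedure with the Metropolis criterion can be illustrated with a minimal sketch. This is not Rosetta code: the score function `toy_score`, the Gaussian perturbation of a single angle, and the geometric cooling schedule are all hypothetical stand-ins chosen to keep the example self-contained, and no fragment insertion is modeled.

```python
import math
import random


def metropolis_accept(old_score, new_score, temperature, rng):
    """Metropolis criterion: always accept improvements, otherwise accept
    with probability exp(-(new - old) / temperature)."""
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / temperature)


def toy_score(angles):
    """Hypothetical stand-in for an energy based score function: lowest when
    every angle equals -60 degrees."""
    return sum((a + 60.0) ** 2 for a in angles) / 1000.0


def anneal(angles, steps=5000, t_start=2.0, t_end=0.05, seed=0):
    """Perturb one degree of freedom per step while lowering the temperature."""
    rng = random.Random(seed)
    current = list(angles)
    current_score = toy_score(current)
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end (an assumption).
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.gauss(0.0, 10.0)  # change a single degree of freedom
        trial_score = toy_score(trial)
        if metropolis_accept(current_score, trial_score, temperature, rng):
            current, current_score = trial, trial_score
    return current, current_score


start = [120.0, -150.0, 60.0, 0.0]
final, final_score = anneal(start)
print(f"start score {toy_score(start):.2f}, final score {final_score:.2f}")
```

Accepting some uphill moves early, while the temperature is high, lets the search escape local minima. As the temperature falls the search becomes increasingly greedy.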
∗ “Computer Science is no more about computers than astronomy is about telescopes.” - EW Dijkstra
Using the Rosetta algorithm, Bonneau et al. [30] achieved a significant milestone in protein
folding by correctly predicting the folds of several proteins at the 2000 CASP4 competition (The
Critical Assessment of protein Structure Prediction is a biennial competition which attempts to
determine the state of the art among protein folding algorithms) [31]. Building on this result,
they observed the Rosetta structure predictions were accurate enough to compare to known
structures in the Protein Data Bank (PDB) and find proteins with similar structures [32]. The
Rosetta predicted structure of one such target, Bacteriocin AS-48, was compared to all known
structures and matched PDB entry 1NKL, a NK-lysin protein (the experimental structures match
to 3Å RMSD, see figure 3). Both are functionally related as lysins but — similar to Perutz and
Kendrew’s observation of the globin family described above — when the NK-lysin protein and
Bacteriocin AS-48 protein sequences are compared, only 14% of residues are identical. The
low sequence similarity and the high structural similarity of this pair puts it squarely in the
“Twilight Zone” of protein sequence alignment. Additionally, the fact that a predicted structure
was accurate enough to match a highly similar structure in the PDB showed promise that this
procedure would be useful to functionally annotate proteins on a genome scale.
Figure 3: Bacteriocin AS-48 CASP4 target. The Rosetta structure prediction from CASP4 matched
1NKL in the PDB [30]. The proteins have low sequence similarity, high structural similarity and related
functions.
0.5 Scaling Up
These results from Rosetta’s CASP4 predictions provided a proof of concept for further studies
into genome annotation using the method of comparing predicted structures to experimentally
determined structures to infer function. The computational cost of the Rosetta de novo algorithm
is quite large and prediction for a single protein domain is estimated to take over a year on a
single CPU. Moreover, there are four thousand protein domains that are suitable for de novo
structure prediction in the human proteome, which would therefore take over four thousand years on
a single CPU computer. This cost is a large barrier to any such project that attempts to use
structure predictions to annotate genomes. Fortunately, there have been tremendous advances in
distributed computing, specifically grid computing. The premise behind grid computing is that
use of individual computers is sporadic with short times of intense use and long periods where
the computer sits idle. If we construct programs in such a way to run on computers while they
sit idle and pause our program when they are in use, then we can utilize these idle CPU cycles.
IBM’s World Community Grid (WCG) provides such infrastructure to construct programs in this
way so that we can distribute Rosetta structure predictions across thousands of idle computers.
Hundreds of thousands of volunteers all over the world download the WCG screen saver onto
their personal computers, which includes the Rosetta executable and while their computers are
idle, the screen saver will download a protein sequence from WCG servers and run our structure
prediction algorithm. It is this ability to distribute computation across many computers that
allows structure predictions to be obtained in a timely manner.
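The benefit of distributing the work can be estimated with simple arithmetic, sketched below. The total of four thousand CPU-years comes from the estimate above. The machine counts and the fraction of time each volunteer machine sits idle are hypothetical values, and the sketch assumes perfect parallelism with no redundant work units or scheduling overhead.

```python
def wall_clock_years(cpu_years, n_machines, idle_fraction):
    """Wall-clock time when work is spread over volunteer machines that
    contribute only while they are idle."""
    if n_machines <= 0 or not 0.0 < idle_fraction <= 1.0:
        raise ValueError("need at least one machine and 0 < idle_fraction <= 1")
    return cpu_years / (n_machines * idle_fraction)


# From the text: about four thousand domains at roughly one CPU-year each.
TOTAL_CPU_YEARS = 4000.0

for machines in (1, 1000, 100000):
    # idle_fraction=0.5 is an assumed value, not a measured one.
    years = wall_clock_years(TOTAL_CPU_YEARS, machines, idle_fraction=0.5)
    print(f"{machines:>7} machines: {years:.3f} years")
```

Under these assumptions, a grid of one hundred thousand half-idle machines reduces millennia of single-CPU computation to roughly a month of wall-clock time.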
0.6 Proteins and their interactions
Determining the molecular function of individual proteins is an important goal but even if all
proteins were fully annotated, it would not complete our understanding of the cell. This is because
proteins do not act in isolation but rather generally interact with other proteins to perform their
molecular functions. Therefore, studying the partners with which a single protein interacts is as
important as determining the molecular function of the individual protein.∗
One way to study a specific protein interaction is to remove its occurrence and observe the
effect on the cell (or organism) when the interaction is absent. This is similar to genetic knock
out studies where a gene is removed from a genome, which often leads to uncovering a function
or process for which the gene is required. Modulating a protein interaction will provide a similar tool
for studying the effect of that interaction on processes in the cell. Additionally, having a way to
modulate a protein interaction would be beneficial for drug development because many diseases
are caused by misregulated protein interactions. Disrupting an interaction may provide a way to
resolve this misregulation [33].
A good starting point toward understanding how to modulate protein interactions is to look
at the chemical makeup of the interface. Soon after the first crystal structures were determined,
Chothia and Janin (1975) [34] studied three structures of protein-protein interactions. They noted
three points: 1) amino acid side chains at the interfaces were well packed, 2) interfaces spanned
around 1500 Å² and 3) a large portion of the free energy of binding came from hydrophobic
residues at the interaction interface.
Once it became feasible to mutate protein sequences in an efficient manner, Bogan and Thorn
analyzed mutagenesis data at protein interfaces [35]. They collected a database of interface
alanine point mutations along with the change in binding free energy (ΔΔG) caused by each mutation. They
observed the majority of binding affinity was due to just a few amino acids termed “hotspot”
residues while identities of other residues at the interface had little effect on affinity.
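The hotspot idea can be made concrete with a small sketch that converts dissociation constants into ΔΔG values using ΔΔG = RT ln(Kd,mutant / Kd,wild type) and flags residues above a cutoff. The dissociation constants and position labels are made up for illustration and are not data from [35]. The 2.0 kcal/mol cutoff is a commonly used threshold and is treated here as an assumption.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def ddg_from_kd(kd_wild_type, kd_mutant, temperature_k=298.15):
    """ddG of binding in kcal/mol; positive values mean the mutation weakens binding."""
    return R_KCAL * temperature_k * math.log(kd_mutant / kd_wild_type)


def hotspots(ddg_by_residue, cutoff=2.0):
    """Residues whose alanine mutation costs at least `cutoff` kcal/mol."""
    return sorted(res for res, value in ddg_by_residue.items() if value >= cutoff)


# Hypothetical dissociation constants (molar) for alanine mutants; not real data.
KD_WILD_TYPE = 1e-9
kd_alanine_mutants = {"pos10": 5e-7, "pos14": 1e-7, "pos17": 5e-9, "pos21": 2e-9}

ddg = {res: ddg_from_kd(KD_WILD_TYPE, kd) for res, kd in kd_alanine_mutants.items()}
for res in sorted(ddg):
    print(f"{res}: ddG = {ddg[res]:.2f} kcal/mol")
print("hotspots:", hotspots(ddg))
```

In this toy interface only two of the four positions exceed the cutoff, mirroring the observation that a few residues account for most of the binding affinity.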
Later studies of experimental structures of protein interactions deposited in the PDB by
Bullock et al. [36] showed that many hotspot residues in protein interactions are on an α helix at the
interface.
From these observations, one can begin to develop a rational approach to creating molecular
antagonists to protein interactions that will inhibit their occurrence in the cell. A starting point
to disrupting a protein interaction is to mimic the hotspot residues of the interface so that the
target will recognize it with high affinity and outcompete the native partner.
∗ It should be noted that some Gene Ontology Molecular Function terms do in fact describe a protein’s function
as interacting with another protein (e.g. GO:0005515, “Protein Binding”).
0.7 Peptidomimetics
Using this idea, a naïve approach might be to excise an alpha helix containing hotspot residues
from one side of the interaction and use that as an inhibitor. This, in theory, will be recognized
by the target protein but in practice there is no guarantee the native α helical conformation will
be a low energy conformation for the isolated peptide. Additionally, a short peptide the size
of an α helix is subject to degradation by proteases in the cell. Peptidomimetics are a class of
molecular scaffolds that address these concerns; they are stable in solution, resistant to proteolysis
and have functional groups that mimic the side chains on protein secondary structures.
A subclass of peptidomimetics that has been successfully used to inhibit protein interactions
is helical mimetics. The Gellman lab has created a helical mimetic antagonist to the Bcl-2
family of anti-apoptotic proteins using alpha-beta peptides which form stable helices [37]. An
alternative method used by the Arora group is to nucleate a helix by converting a hydrogen bond
between the i and i+4 residues to a covalent linker. Using this approach they have created an
inhibitor of the HIF-1α/p300 interaction, which is involved in the angiogenesis pathway [38]. One
final example of a successful helical mimetic used for protein interaction inhibition is that of Ernst
et al., who used the simple molecular scaffold of a terphenyl to disrupt oligomerization of gp41, a
protein necessary for HIV entry into the host cell [38]. These successes show the promise of using
peptidomimetics for inhibiting protein interactions.
0.8 Rosetta Design
Using peptidomimetics to mimic secondary structures at protein interfaces is a logical first step
in developing interaction inhibitors. It is often the case, however, that this strategy alone will
produce only a mediocre binder to the target protein and further optimization is necessary
to produce a better binder. The molecular modeling software, Rosetta, is well suited for this
optimization. Rosetta has been used to design novel protein folds [39], redesign protein
interfaces [40] and enzymes [41], and more recently design a protein binder to influenza hemagglutinin which
disrupts its function [42]. The design algorithm iterates between a conformational optimization
step and a sequence optimization step. During the conformational optimization step, random
changes are made to the protein’s degrees of freedom (e.g. phi and psi angles) in an attempt to
lower the score evaluated by Rosetta’s energy function. This step is generally the same as the
one described above for de novo structure prediction. During the sequence optimization step,
random substitutions are made to residues in the protein from a library of potential amino acids
(generally these are the 20 canonical amino acids) in an attempt, again, to lower the Rosetta
energy function score. Substitutions are accepted if they lower the score; otherwise they are
accepted with a probability that decreases exponentially with the energy increase (the
Metropolis criterion). Iterating between these steps often
leads to a conformation and sequence that is lower in energy than the ones used to begin.
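The sequence optimization step can be sketched in a few lines. The example below is a toy, not the Rosetta design algorithm: the score function `toy_score` simply rewards hydrophobic residues at buried positions and polar residues at exposed positions, the burial pattern is invented, and the conformational optimization step is omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # substitution library: the 20 canonical amino acids
HYDROPHOBIC = set("AFILMVWY")         # a simple hydrophobic set, assumed for this toy


def toy_score(sequence, buried):
    """Hypothetical score: -1 for each buried hydrophobic or exposed polar
    residue, +1 for each mismatch. Lower is better."""
    score = 0
    for aa, is_buried in zip(sequence, buried):
        score += -1 if (aa in HYDROPHOBIC) == is_buried else 1
    return score


def design_sequence(sequence, buried, steps=2000, temperature=0.2, seed=0):
    """Propose random single-residue substitutions and accept or reject each
    one with the Metropolis criterion."""
    rng = random.Random(seed)
    current = list(sequence)
    current_score = toy_score(current, buried)
    for _ in range(steps):
        trial = list(current)
        trial[rng.randrange(len(trial))] = rng.choice(AMINO_ACIDS)
        trial_score = toy_score(trial, buried)
        delta = trial_score - current_score
        # Uphill substitutions become exponentially less likely as delta grows.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
    return "".join(current), current_score


start_sequence = "G" * 10
burial_pattern = [True, False] * 5  # invented pattern of buried/exposed positions
designed, designed_score = design_sequence(start_sequence, burial_pattern)
print(start_sequence, toy_score(start_sequence, burial_pattern))
print(designed, designed_score)
```

Enlarging `AMINO_ACIDS` is the only change needed to let the same loop propose additional residue types. This is the sense in which a larger substitution library increases the sequence diversity available to design.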
There is a limit however to the sequence diversity provided by using a substitution library
containing just the 20 canonical amino acids. Certain applications that have more flexibility in
their chemical synthesis (e.g. solid phase synthesis instead of expressing recombinant DNA in a
cell) have the advantage of creating molecules out of any arbitrary amino acid rather than just
the canonical 20 amino acids. Renfrew et al. [43] exploited this fact when they built a library of
noncanonical amino acids within Rosetta and designed a binder to calpain-1 protein. Synthesis of
peptidomimetics often uses solid phase techniques, and peptidomimetics therefore make excellent design candidates
for using these noncanonical libraries. Many of the noncanonical amino acids in the library are
analogues of one of the canonical amino acids, for instance a phenylalanine in which a hydrogen atom on the
ring is replaced by a fluorine atom. These analogues provide additional amino acids
for Rosetta to suggest as substitutions at hotspot residues that will attempt to increase binding
affinity of mediocre binders. Using a peptidomimetic scaffold in combination with Rosetta design
and the chemically diverse noncanonical amino acids offers a powerful approach in developing
high affinity protein interaction inhibitors.
0.9 The structure of this thesis
In this thesis, I describe advancements in both areas of protein structure/function prediction and
the computational design of protein interaction inhibitors. During my graduate work I have also
had excellent collaborations which were tangential to or fall outside the focus of my thesis, which I
will briefly describe here.
Several projects have grown out of the genome scale protein structure and function prediction
pipeline project. First, I was involved in work in collaboration with Mike Boxem, which experimentally
determined the minimal region of protein domains required to interact with their partners. Our
database of protein structure annotations was important for the analysis of these regions to show
they were in folded well-ordered regions of the proteins. This work was published in Cell [44].
Additionally, our database was used to help the Eichenberger lab at NYU define domains important in the spore coat formation of Bacillus subtilis. This work was published in Molecular
Microbiology [45]. I collaborated with the Purugganan lab at NYU to use protein structure as a
canvas to map sites of positive selection derived from evolutionary analysis of plant species. This
work has been published in Genome Biology and Evolution along with a web server which users
can use to view three dimensional structures highlighted by sites important to the evolution of
the protein [46]. Another collaboration which resulted in a publication with the Landthaler lab
at MDC-Berlin in Molecular Cell, involved finding unexpected enriched superfamilies in a set of
novel mRNA binding proteins for which we had structure predictions [47]. This paper also uses
structure predictions to predict function of these novel mRNA binding proteins. This extends
the idea that structure predictions are a valuable predictor for protein function, something that I
describe in the body of this thesis. Noah Youngs in our lab has applied this idea further to several
genomes using sophisticated machine learning techniques and preliminary work is currently in
review. Finally, I have developed a web based interface to the protein structure annotations using
protein protein interactions as an entry point. This interface, BioNetBuilder, is a Cytoscape [48]
plugin which builds protein interaction networks based on public interaction data for easy visual
display and access to additional data such as structure and function annotations. This work was
published in Bioinformatics [49] and a follow up paper describes a redesign of the software and
the construction of the Chicken interactome [50].
Collaborations relating to modeling of noncanonical backbones in Rosetta include participation
in the CAPRI (Critical Assessment of PRediction of Interactions) challenge with the Gray
lab at Johns Hopkins. This involved predicting the binding mode of an oligosaccharide to a protein
and although the challenge results have yet to be made public, the work has been submitted for
presentation at the CAPRI meeting. Also in collaboration with the Gray lab at Johns Hopkins
is the development of web server infrastructure for Rosetta applications, which will be submitted
for publication soon.
I was also involved in a project to predict temperature sensitive mutations in proteins of
interest using structure modeling by Rosetta as a predictor. This work was published in the 2011
Rosetta Special Collection in PLoS ONE [51].
0.9.1 Chapter description
In the first chapter of this thesis, I describe structure annotation of protein domains in over 100
genomes. Building on the 2000 CASP4 results by Bonneau et al. [30] and the utility of IBM’s
World Community Grid, using protein structure predictions to annotate proteins of
unknown function on the genome scale is feasible. I have applied the Rosetta de novo structure
prediction algorithm to protein domains run on the grid. The resulting structure predictions
were then compared to proteins with known structures to classify the subject protein domain
into a SCOP superfamily where all the members are inferred to have a common evolutionary
origin and therefore likely a similar function. Using a double blind benchmark to evaluate the
error in our classifier, we correctly predict the SCOP superfamily nearly 50% of the time over
the whole benchmark and reach nearly 80% for predictions deemed high confidence.
In the second chapter, I describe how SCOP superfamily classifications of protein domains
were then combined with Gene Ontology (GO) cellular component and biological process annotations
in a naïve Bayes framework for GO molecular function prediction. This function prediction
method using structure information outperformed the method without structure information
suggesting structure is a valuable predictor of molecular function.
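How structure information can change a naïve Bayes prediction is illustrated by the minimal sketch below. It is not the classifier described in the second chapter: the training examples, the superfamily, component and process labels, and the add-one smoothing are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (features_dict, label). Returns the count tables."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (feature name, label) -> value counts
    values = defaultdict(set)              # feature name -> observed values
    for features, label in examples:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(name, label)][value] += 1
            values[name].add(value)
    return label_counts, feature_counts, values


def predict(model, features, alpha=1.0):
    """Return the label with the highest posterior, using add-alpha smoothing."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n in label_counts.items():
        logp = math.log(n / total)  # prior
        for name, value in features.items():
            count = feature_counts[(name, label)][value]
            logp += math.log((count + alpha) / (n + alpha * len(values[name])))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Made-up training annotations; the labels are not real SCOP or GO identifiers.
examples = [
    ({"superfamily": "sf_kinase_like", "component": "cytoplasm", "process": "signaling"}, "kinase activity"),
    ({"superfamily": "sf_kinase_like", "component": "membrane", "process": "signaling"}, "kinase activity"),
    ({"superfamily": "sf_dna_binding", "component": "nucleus", "process": "transcription"}, "DNA binding"),
    ({"superfamily": "sf_dna_binding", "component": "nucleus", "process": "signaling"}, "DNA binding"),
]
model = train(examples)

query = {"superfamily": "sf_kinase_like", "component": "nucleus", "process": "signaling"}
without_structure = {k: v for k, v in query.items() if k != "superfamily"}
print("with structure:   ", predict(model, query))
print("without structure:", predict(model, without_structure))
```

In this toy data set the component and process annotations alone favor one function, while adding the superfamily feature shifts the prediction to the other. It is this kind of contribution from structure that the benchmark in the second chapter measures.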
The third chapter of this thesis describes the development of a framework within the Rosetta
molecular modeling suite to model and rationally design helical mimetics. I focus on the peptidomimetic molecular scaffold, oligooxopiperazine (OOP), which mimics one face of an alpha
helix.
In the fourth chapter, I describe how within the Rosetta framework I made several designs of
OOPs targeting the p53 MDM2 protein interaction, a relevant cancer drug target. These designs
have been synthesized and experimentally validated to target MDM2 and competitively inhibit
the p53 MDM2 protein interaction.
Finally, I discuss in the fifth chapter future perspectives of my work.
Bibliography
[1] Erwin Schrödinger. What is life?: The physical aspect of the living cell. The University
Press, Cambridge, 1944.
[2] G W Beadle and E L Tatum. Genetic control of biochemical reactions in Neurospora. Proc
Natl Acad Sci U S A, 27(11):499–506, Nov 1941.
[3] O T Avery, C M MacLeod, and M McCarty. Studies on the chemical nature of the substance
inducing transformation of pneumococcal types: Induction of transformation by a
desoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med, 79(2):137–58,
Feb 1944.
[4] A D Hershey and M Chase. Independent functions of viral protein and nucleic acid
in growth of bacteriophage. J Gen Physiol, 36(1):39–56, May 1952.
[5] J D Watson and F H Crick. Molecular structure of nucleic acids; a structure for
deoxyribose nucleic acid. Nature, 171(4356):737–8, Apr 1953.
[6] F H Crick. On protein synthesis. Symp Soc Exp Biol, 12:138–63, 1958.
[7] F H Crick, L Barnett, S Brenner, and R J Watts-Tobin. General nature of
the genetic code for proteins. Nature, 192:1227–32, Dec 1961.
[8] P Lengyel, J F Speyer, and S Ochoa. Synthetic polynucleotides and the amino
acid code. Proc Natl Acad Sci U S A, 47:1936–42, Dec 1961.
[9] M Nirenberg, P Leder, M Bernfield, R Brimacombe, J Trupin, F Rottman, and C O’Neal.
RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl
Acad Sci U S A, 53(5):1161–8, May 1965.
[10] L Pauling and H A Itano. Sickle cell anemia, a molecular disease. Science,
109(2835):443, Apr 1949.
[11] F Sanger, G M Air, B G Barrell, N L Brown, A R Coulson, C A Fiddes, C A Hutchison,
P M Slocombe, and M Smith. Nucleotide sequence of bacteriophage phi X174 DNA. Nature,
265(5596):687–95, Feb 1977.
[12] E S Lander, L M Linton, B Birren, C Nusbaum, M C Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, R Funke, D Gage, K Harris, A Heaford, J Howland, L Kann,
J Lehoczky, R LeVine, P McEwan, K McKernan, J Meldrim, J P Mesirov, C Miranda,
W Morris, J Naylor, C Raymond, M Rosetti, R Santos, A Sheridan, C Sougnez, N Stange-Thomann, N Stojanovic, A Subramanian, D Wyman, J Rogers, J Sulston, R Ainscough,
S Beck, D Bentley, J Burton, C Clee, N Carter, A Coulson, R Deadman, P Deloukas, A Dunham, I Dunham, R Durbin, L French, D Grafham, S Gregory, T Hubbard, S Humphray,
A Hunt, M Jones, C Lloyd, A McMurray, L Matthews, S Mercer, S Milne, J C Mullikin, A Mungall, R Plumb, M Ross, R Shownkeen, S Sims, R H Waterston, R K Wilson,
L W Hillier, J D McPherson, M A Marra, E R Mardis, L A Fulton, A T Chinwalla, K H
Pepin, W R Gish, S L Chissoe, M C Wendl, K D Delehaunty, T L Miner, A Delehaunty,
J B Kramer, L L Cook, R S Fulton, D L Johnson, P J Minx, S W Clifton, T Hawkins,
E Branscomb, P Predki, P Richardson, S Wenning, T Slezak, N Doggett, J F Cheng,
A Olsen, S Lucas, C Elkin, E Uberbacher, M Frazier, R A Gibbs, D M Muzny, S E Scherer,
J B Bouck, E J Sodergren, K C Worley, C M Rives, J H Gorrell, M L Metzker, S L Naylor,
R S Kucherlapati, D L Nelson, G M Weinstock, Y Sakaki, A Fujiyama, M Hattori, T Yada,
A Toyoda, T Itoh, C Kawagoe, H Watanabe, Y Totoki, T Taylor, J Weissenbach, R Heilig,
W Saurin, F Artiguenave, P Brottier, T Bruls, E Pelletier, C Robert, P Wincker, D R
Smith, L Doucette-Stamm, M Rubenfield, K Weinstock, H M Lee, J Dubois, A Rosenthal, M Platzer, G Nyakatura, S Taudien, A Rump, H Yang, J Yu, J Wang, G Huang,
J Gu, L Hood, L Rowen, A Madan, S Qin, R W Davis, N A Federspiel, A P Abola, M J
Proctor, R M Myers, J Schmutz, M Dickson, J Grimwood, D R Cox, M V Olson, R Kaul,
C Raymond, N Shimizu, K Kawasaki, S Minoshima, G A Evans, M Athanasiou, R Schultz,
B A Roe, F Chen, H Pan, J Ramser, H Lehrach, R Reinhardt, W R McCombie, M de la
Bastide, N Dedhia, H Blöcker, K Hornischer, G Nordsiek, R Agarwala, L Aravind, J A
Bailey, A Bateman, S Batzoglou, E Birney, P Bork, D G Brown, C B Burge, L Cerutti,
H C Chen, D Church, M Clamp, R R Copley, T Doerks, S R Eddy, E E Eichler, T S
Furey, J Galagan, J G Gilbert, C Harmon, Y Hayashizaki, D Haussler, H Hermjakob,
K Hokamp, W Jang, L S Johnson, T A Jones, S Kasif, A Kaspryzk, S Kennedy, W J Kent,
P Kitts, E V Koonin, I Korf, D Kulp, D Lancet, T M Lowe, A McLysaght, T Mikkelsen,
J V Moran, N Mulder, V J Pollara, C P Ponting, G Schuler, J Schultz, G Slater, A F
Smit, E Stupka, J Szustakowski, D Thierry-Mieg, J Thierry-Mieg, L Wagner, J Wallis,
R Wheeler, A Williams, Y I Wolf, K H Wolfe, S P Yang, R F Yeh, F Collins, M S Guyer,
J Peterson, A Felsenfeld, K A Wetterstrand, A Patrinos, M J Morgan, P de Jong, J J
Catanese, K Osoegawa, H Shizuya, S Choi, Y J Chen, J Szustakowki, and International
Human Genome Sequencing Consortium. Initial sequencing and analysis of the human
genome. Nature, 409(6822):860–921, Feb 2001.
[13] J C Venter, M D Adams, E W Myers, P W Li, R J Mural, G G Sutton, H O Smith,
M Yandell, C A Evans, R A Holt, J D Gocayne, P Amanatides, R M Ballew, D H Huson,
J R Wortman, Q Zhang, C D Kodira, X H Zheng, L Chen, M Skupski, G Subramanian,
P D Thomas, J Zhang, G L Gabor Miklos, C Nelson, S Broder, A G Clark, J Nadeau,
V A McKusick, N Zinder, A J Levine, R J Roberts, M Simon, C Slayman, M Hunkapiller,
R Bolanos, A Delcher, I Dew, D Fasulo, M Flanigan, L Florea, A Halpern, S Hannenhalli,
S Kravitz, S Levy, C Mobarry, K Reinert, K Remington, J Abu-Threideh, E Beasley, K Biddick, V Bonazzi, R Brandon, M Cargill, I Chandramouliswaran, R Charlab, K Chaturvedi,
Z Deng, V Di Francesco, P Dunn, K Eilbeck, C Evangelista, A E Gabrielian, W Gan,
W Ge, F Gong, Z Gu, P Guan, T J Heiman, M E Higgins, R R Ji, Z Ke, K A Ketchum,
Z Lai, Y Lei, Z Li, J Li, Y Liang, X Lin, F Lu, G V Merkulov, N Milshina, H M Moore,
A K Naik, V A Narayan, B Neelam, D Nusskern, D B Rusch, S Salzberg, W Shao, B Shue,
J Sun, Z Wang, A Wang, X Wang, J Wang, M Wei, R Wides, C Xiao, C Yan, A Yao,
J Ye, M Zhan, W Zhang, H Zhang, Q Zhao, L Zheng, F Zhong, W Zhong, S Zhu, S Zhao,
D Gilbert, S Baumhueter, G Spier, C Carter, A Cravchik, T Woodage, F Ali, H An, A Awe,
D Baldwin, H Baden, M Barnstead, I Barrow, K Beeson, D Busam, A Carver, A Center,
M L Cheng, L Curry, S Danaher, L Davenport, R Desilets, S Dietz, K Dodson, L Doup,
S Ferriera, N Garg, A Gluecksmann, B Hart, J Haynes, C Haynes, C Heiner, S Hladun,
D Hostin, J Houck, T Howland, C Ibegwam, J Johnson, F Kalush, L Kline, S Koduru,
A Love, F Mann, D May, S McCawley, T McIntosh, I McMullen, M Moy, L Moy, B Murphy, K Nelson, C Pfannkoch, E Pratts, V Puri, H Qureshi, M Reardon, R Rodriguez, Y H
Rogers, D Romblad, B Ruhfel, R Scott, C Sitter, M Smallwood, E Stewart, R Strong,
E Suh, R Thomas, N N Tint, S Tse, C Vech, G Wang, J Wetter, S Williams, M Williams,
S Windsor, E Winn-Deen, K Wolfe, J Zaveri, K Zaveri, J F Abril, R Guigó, M J Campbell, K V Sjolander, B Karlak, A Kejariwal, H Mi, B Lazareva, T Hatton, A Narechania,
K Diemer, A Muruganujan, N Guo, S Sato, V Bafna, S Istrail, R Lippert, R Schwartz,
B Walenz, S Yooseph, D Allen, A Basu, J Baxendale, L Blick, M Caminha, J Carnes-Stine, P Caulk, Y H Chiang, M Coyne, C Dahlke, A Mays, M Dombroski, M Donnelly,
D Ely, S Esparham, C Fosler, H Gire, S Glanowski, K Glasser, A Glodek, M Gorokhov,
K Graham, B Gropman, M Harris, J Heil, S Henderson, J Hoover, D Jennings, C Jordan,
J Jordan, J Kasha, L Kagan, C Kraft, A Levitsky, M Lewis, X Liu, J Lopez, D Ma, W Majoros, J McDaniel, S Murphy, M Newman, T Nguyen, N Nguyen, M Nodell, S Pan, J Peck,
M Peterson, W Rowe, R Sanders, J Scott, M Simpson, T Smith, A Sprague, T Stockwell,
R Turner, E Venter, M Wang, M Wen, D Wu, M Wu, A Xia, A Zandieh, and X Zhu. The
sequence of the human genome. Science, 291(5507):1304–51, Feb 2001.
[14] F R Blattner, G Plunkett, 3rd, C A Bloch, N T Perna, V Burland, M Riley, J Collado-Vides, J D Glasner, C K Rode, G F Mayhew, J Gregor, N W Davis, H A Kirkpatrick, M A
Goeden, D J Rose, B Mau, and Y Shao. The complete genome sequence of Escherichia
coli K-12. Science, 277(5331):1453–62, Sep 1997.
[15] A Goffeau, B G Barrell, H Bussey, R W Davis, B Dujon, H Feldmann, F Galibert, J D Hoheisel, C Jacq, M Johnston, E J Louis, H W Mewes, Y Murakami, P Philippsen, H Tettelin,
and S G Oliver. Life with 6000 genes. Science, 274(5287):546, 563–7, Oct 1996.
[16] C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a
platform for investigating biology. Science, 282(5396):2012–8, Dec 1998.
[17] M D Adams, S E Celniker, R A Holt, C A Evans, J D Gocayne, P G Amanatides, S E
Scherer, P W Li, R A Hoskins, R F Galle, R A George, S E Lewis, S Richards, M Ashburner,
S N Henderson, G G Sutton, J R Wortman, M D Yandell, Q Zhang, L X Chen, R C
Brandon, Y H Rogers, R G Blazej, M Champe, B D Pfeiffer, K H Wan, C Doyle, E G
Baxter, G Helt, C R Nelson, G L Gabor, J F Abril, A Agbayani, H J An, C Andrews-Pfannkoch, D Baldwin, R M Ballew, A Basu, J Baxendale, L Bayraktaroglu, E M Beasley,
K Y Beeson, P V Benos, B P Berman, D Bhandari, S Bolshakov, D Borkova, M R Botchan,
J Bouck, P Brokstein, P Brottier, K C Burtis, D A Busam, H Butler, E Cadieu, A Center,
I Chandra, J M Cherry, S Cawley, C Dahlke, L B Davenport, P Davies, B de Pablos,
A Delcher, Z Deng, A D Mays, I Dew, S M Dietz, K Dodson, L E Doup, M Downes,
S Dugan-Rocha, B C Dunkov, P Dunn, K J Durbin, C C Evangelista, C Ferraz, S Ferriera,
W Fleischmann, C Fosler, A E Gabrielian, N S Garg, W M Gelbart, K Glasser, A Glodek,
F Gong, J H Gorrell, Z Gu, P Guan, M Harris, N L Harris, D Harvey, T J Heiman, J R
Hernandez, J Houck, D Hostin, K A Houston, T J Howland, M H Wei, C Ibegwam, M Jalali,
F Kalush, G H Karpen, Z Ke, J A Kennison, K A Ketchum, B E Kimmel, C D Kodira,
C Kraft, S Kravitz, D Kulp, Z Lai, P Lasko, Y Lei, A A Levitsky, J Li, Z Li, Y Liang,
X Lin, X Liu, B Mattei, T C McIntosh, M P McLeod, D McPherson, G Merkulov, N V
Milshina, C Mobarry, J Morris, A Moshrefi, S M Mount, M Moy, B Murphy, L Murphy,
D M Muzny, D L Nelson, D R Nelson, K A Nelson, K Nixon, D R Nusskern, J M Pacleb,
M Palazzolo, G S Pittman, S Pan, J Pollard, V Puri, M G Reese, K Reinert, K Remington,
R D Saunders, F Scheeler, H Shen, B C Shue, I Sidén-Kiamos, M Simpson, M P Skupski,
T Smith, E Spier, A C Spradling, M Stapleton, R Strong, E Sun, R Svirskas, C Tector,
R Turner, E Venter, A H Wang, X Wang, Z Y Wang, D A Wassarman, G M Weinstock,
J Weissenbach, S M Williams, T Woodage, K C Worley, D Wu, S Yang, Q A Yao, J Ye, R F
Yeh, J S Zaveri, M Zhan, G Zhang, Q Zhao, L Zheng, X H Zheng, F N Zhong, W Zhong,
X Zhou, S Zhu, X Zhu, H O Smith, R A Gibbs, E W Myers, G M Rubin, and J C Venter.
The genome sequence of Drosophila melanogaster. Science, 287(5461):2185–95, Mar 2000.
[18] Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant
Arabidopsis thaliana. Nature, 408(6814):796–815, Dec 2000.
[19] M SELA, F H WHITE, Jr, and C B ANFINSEN. Reductive cleavage of disulfide bridges
in ribonuclease. Science, 125(3250):691–2, Apr 1957.
[20] C B ANFINSEN and E HABER. Studies on the reduction and re-formation of protein
disulfide bonds. J Biol Chem, 236:1361–3, May 1961.
[21] J C KENDREW, G BODO, H M DINTZIS, R G PARRISH, H WYCKOFF, and D C
PHILLIPS. A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature, 181(4610):662–6, Mar 1958.
[22] M F PERUTZ, M G ROSSMANN, A F CULLIS, H MUIRHEAD, G WILL, and A C
NORTH. Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5-Å resolution, obtained by x-ray analysis. Nature, 185(4711):416–22, Feb 1960.
[23] M.F. Perutz, J.C. Kendrew, and H.C. Watson. Structure and function of haemoglobin: II. Some
relations between polypeptide chain configuration and amino acid sequence. Journal of
Molecular Biology, 13(3):669–678, October 1965.
[24] C Chothia and A M Lesk. The relation between the divergence of sequence and structure
in proteins. EMBO J, 5(4):823–6, Apr 1986.
[25] C Sander and R Schneider. Database of homology-derived protein structures and the
structural meaning of sequence alignment. Proteins, 9(1):56–68, 1991.
[26] B Rost. Twilight zone of protein sequence alignments. Protein Engineering, 12(2):85–94,
February 1999.
[27] Cyrus Levinthal. How to fold graciously. In Mössbauer Spectroscopy in Biological Systems
Proceedings, number 41 in 67, pages 22–24. Univ. of Illinois Bulletin, 1969.
[28] W KAUZMANN. Some factors in the interpretation of protein denaturation. Adv Protein
Chem, 14:1–63, 1959.
[29] Carol A Rohl, Charlie E M Strauss, Kira M S Misura, and David Baker. Protein structure
prediction using Rosetta. Methods Enzymol, 383:66–93, 2004.
[30] R Bonneau, J Tsai, I Ruczinski, D Chivian, C Rohl, C E Strauss, and D Baker. Rosetta in
CASP4: progress in ab initio protein structure prediction. Proteins, Suppl 5:119–26, 2001.
[31] A M Lesk, L Lo Conte, and T J Hubbard. Assessment of novel fold targets in CASP4:
predictions of three-dimensional structures, secondary structures, and interresidue contacts.
Proteins, Suppl 5:98–118, 2001.
[32] R Bonneau, J Tsai, I Ruczinski, and D Baker. Functional inferences from blind ab initio
protein structure predictions. J Struct Biol, 134(2-3):186–90, 2001.
[33] Michelle R Arkin and James A Wells. Small-molecule inhibitors of protein-protein interactions: progressing towards the dream. Nat Rev Drug Discov, 3(4):301–17, Apr 2004.
[34] C Chothia and J Janin. Principles of protein-protein recognition. Nature, 256(5520):705–8,
Aug 1975.
[35] A A Bogan and K S Thorn. Anatomy of hot spots in protein interfaces. J Mol Biol,
280(1):1–9, Jul 1998.
[36] Brooke N Bullock, Andrea L Jochim, and Paramjit S Arora. Assessing helical protein
interfaces for inhibitor design. J Am Chem Soc, 133(36):14220–3, Sep 2011.
[37] Melissa D Boersma, Holly S Haase, Kimberly J Peterson-Kaufman, Erinna F Lee, Oliver B
Clarke, Peter M Colman, Brian J Smith, W Seth Horne, W Douglas Fairlie, and Samuel H
Gellman. Evaluation of diverse α/β-backbone patterns for functional α-helix mimicry: analogues of the Bim BH3 domain. J Am Chem Soc, 134(1):315–23, Jan 2012.
[38] Laura K Henchey, Swati Kushal, Ramin Dubey, Ross N Chapman, Bogdan Z Olenyuk,
and Paramjit S Arora. Inhibition of hypoxia inducible factor 1-transcription coactivator
interaction by a hydrogen bond surrogate alpha-helix. J Am Chem Soc, 132(3):941–3, Jan
2010.
[39] Brian Kuhlman, Gautam Dantas, Gregory C Ireton, Gabriele Varani, Barry L Stoddard,
and David Baker. Design of a novel globular protein fold with atomic-level accuracy.
Science, 302(5649):1364–8, Nov 2003.
[40] Tanja Kortemme, Lukasz A Joachimiak, Alex N Bullock, Aaron D Schuler, Barry L Stoddard, and David Baker. Computational redesign of protein-protein interaction specificity.
Nat Struct Mol Biol, 11(4):371–9, Apr 2004.
[41] Lin Jiang, Eric A Althoff, Fernando R Clemente, Lindsey Doyle, Daniela Röthlisberger,
Alexandre Zanghellini, Jasmine L Gallaher, Jamie L Betker, Fujie Tanaka, Carlos F Barbas,
3rd, Donald Hilvert, Kendall N Houk, Barry L Stoddard, and David Baker. De novo
computational design of retro-aldol enzymes. Science, 319(5868):1387–91, Mar 2008.
[42] Sarel J Fleishman, Timothy A Whitehead, Damian C Ekiert, Cyrille Dreyfus, Jacob E Corn,
Eva-Maria Strauch, Ian A Wilson, and David Baker. Computational design of proteins
targeting the conserved stem region of influenza hemagglutinin. Science, 332(6031):816–
21, May 2011.
[43] P Douglas Renfrew, Eun Jung Choi, Richard Bonneau, and Brian Kuhlman. Incorporation
of noncanonical amino acids into Rosetta and use in computational protein-peptide interface
design. PLoS One, 7(3):e32637, 2012.
[44] Mike Boxem, Zoltan Maliga, Niels Klitgord, Na Li, Irma Lemmens, Miyeko Mana, Lorenzo
de Lichtervelde, Joram D Mul, Diederik van de Peut, Maxime Devos, Nicolas Simonis,
Muhammed A Yildirim, Murat Cokol, Huey-Ling Kao, Anne-Sophie de Smet, Haidong
Wang, Anne-Lore Schlaitz, Tong Hao, Stuart Milstein, Changyu Fan, Mike Tipsword,
Kevin Drew, Matilde Galli, Kahn Rhrissorrakrai, David Drechsel, Daphne Koller, Frederick P Roth, Lilia M Iakoucheva, A Keith Dunker, Richard Bonneau, Kristin C Gunsalus,
David E Hill, Fabio Piano, Jan Tavernier, Sander van den Heuvel, Anthony A Hyman, and
Marc Vidal. A protein domain-based interactome network for C. elegans early embryogenesis. Cell, 134(3):534–45, Aug 2008.
[45] Katherine H Wang, Anabela L Isidro, Lia Domingues, Haig A Eskandarian, Peter T McKenney, Kevin Drew, Paul Grabowski, Ming-Hsiu Chua, Samantha N Barry, Michelle Guan,
Richard Bonneau, Adriano O Henriques, and Patrick Eichenberger. The coat morphogenetic protein SpoVID is necessary for spore encasement in Bacillus subtilis. Mol Microbiol,
74(3):634–49, Nov 2009.
[46] M M Pentony, P Winters, D Penfold-Brown, K Drew, A Narechania, R DeSalle, R Bonneau,
and M D Purugganan. The plant proteome folding project: structure and positive selection
in plant protein families. Genome Biol Evol, 4(3):360–71, 2012.
[47] Alexander G Baltz, Mathias Munschauer, Björn Schwanhäusser, Alexandra Vasile, Yasuhiro Murakawa, Markus Schueler, Noah Youngs, Duncan Penfold-Brown, Kevin Drew,
Miha Milek, Emanuel Wyler, Richard Bonneau, Matthias Selbach, Christoph Dieterich,
and Markus Landthaler. The mRNA-bound proteome and its global occupancy profile on
protein-coding transcripts. Mol Cell, 46(5):674–90, Jun 2012.
[48] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel
Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13(11):2498–
504, Nov 2003.
[49] I. Avila-Campillo, K. Drew, J. Lin, D. J. Reiss, and R. Bonneau. BioNetBuilder: automatic
integration of biological networks. Bioinformatics, 23:392–393, Feb 2007.
[50] Jay H Konieczka, Kevin Drew, Alex Pine, Kevin Belasco, Sean Davey, Tatiana A
Yatskievych, Richard Bonneau, and Parker B Antin. BioNetBuilder2.0: bringing systems
biology to chicken and other model organisms. BMC Genomics, 10 Suppl 2:S6, 2009.
[51] Christopher S Poultney, Glenn L Butterfoss, Michelle R Gutwein, Kevin Drew, David
Gresham, Kristin C Gunsalus, Dennis E Shasha, and Richard Bonneau. Rational design of temperature-sensitive alleles using computational structure prediction. PLoS One,
6(9):e23947, 2011.
[52] Kevin Drew, Patrick Winters, Glenn L Butterfoss, Viktors Berstis, Keith Uplinger,
Jonathan Armstrong, Michael Riffle, Erik Schweighofer, Bill Bovermann, David R
Goodlett, Trisha N Davis, Dennis Shasha, Lars Malmström, and Richard Bonneau. The
proteome folding project: proteome-scale prediction of structure and function. Genome
Res, 21(11):1981–94, Nov 2011.
[53] DT Jones and JJ Ward. Prediction of disordered regions in proteins from position specific
score matrices. Proteins-Structure Function and Bioinformatics, 53(6):573–578, 2003.
[54] DT Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, September 1999.