X-ray Crystallography I - UCSF Macromolecular Structure Group

X-ray Crystallography I
James Fraser
Macromolecular Interactions BP204
Key take-aways
1. X-ray crystallography results from an ensemble of
Billions and Billions of molecules in the crystal
2. Models in the PDB are often sub-optimal and can
contain errors
3. Intensity of “spots” relates to the electron density
(which relates to the molecules) in the unit cell
4. Positions of “spots” relate to the arrangement of
unit cells in the crystal
5. Every “spot” contains contributions from every
part of the crystal. Every part of the map contains
contributions from every “spot”
Key outcomes
• Understand “Table 1” in X-ray Papers (now
often “Table S1”)
• Understand the basic workflow of
determining a crystal “structure”
• Embrace the beauty and challenge of
disorder at high and low resolution
Today we are going to tackle
crystallography in reverse
• Texts begin with diffraction theory from a series of point atoms
(e.g. Biomolecular Crystallography, Rupp; Principles of Protein X-ray Crystallography, Drenth;
Crystallography Made Crystal Clear, Rhodes)
• Bob teaches mini-course in Spring with this level of detail
• Today - model to reflections; Tomorrow - phasing
What’s PHENIX for: Macromolecular crystallographic structure solution
Adams et al., Acta Cryst. D66, 213-221 (2010).
What is a protein “structure”?
Is it a:
• pretty cartoon...
• space-filling set of spheres...
• picture of the protein in the crystal...
• computational picture of the protein...
• representation of atoms that satisfies experimental constraints...
• PDB formatted text file...
• model!!!
Moreover... a model of
the crystal lattice...
PDB (Protein Data Bank) files are text:
chemistry, sequence, position, certainty
HEADER    HYDROLASE                               10-DEC-06   2O7A
TITLE     T4 LYSOZYME C-TERMINAL FRAGMENT
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: LYSOZYME;
...
REMARK   3  FIT TO DATA USED IN REFINEMENT (NO CUTOFF).
REMARK   3   R VALUE   (WORKING + TEST SET, NO CUTOFF) : NULL
REMARK   3   R VALUE          (WORKING SET, NO CUTOFF) : 0.090
REMARK   3   FREE R VALUE                  (NO CUTOFF) : 0.108
...
ATOM      1  N   VAL A   2     -19.742  -2.254 -19.976  1.00 54.44           N
ATOM      2  CA  VAL A   2     -19.867  -2.152 -18.529  1.00 54.48           C
ATOM      3  C   VAL A   2     -19.073  -0.927 -18.101  1.00 41.86           C
ATOM      4  O   VAL A   2     -19.367   0.178 -18.554  1.00 47.57           O
ATOM      5  CB  VAL A   2     -19.341  -3.411 -17.836  1.00 68.76           C
...
MASTER      287    0    3   10    0    0    0    6 1566    1   22   10
END
The universe of protein structures:
Our knowledge about protein structures is increasing...
• 65,271 protein structures are deposited in PDB (2/15/2010).
• This number is growing by > ~7000 a year
• Growing input from Structural Genomics HT structure determination (>1000 structures a year)
©Robert M. Stroud 2012
How do we tell if a
model is “good”?
• physically (packing, contacts)
• chemically (bond lengths, bond angles,
chirality, planarity, torsions)
• crystallographically (real space fits - B-factors, R-factor)
• statistically (R-free, CC1/2)
Most of these stats appear in Table 1
Physical Checks
Steric clashes
Bad
Good
  Overall clash score (number of bad overlaps per 1000 atoms)
  A clash: disallowed atom pair overlap ≥0.4 Å
MolProbity: all-atom contacts and structure validation for proteins and nucleic
acids. Davis et al, Nucleic Acids Research, 2007, Vol. 35
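As a minimal sketch of the clash score definition above (bad overlaps per 1000 atoms, with a clash being an overlap of at least 0.4 Å), the following assumes the per-pair overlaps have already been computed elsewhere. The function name and the overlap values are hypothetical illustrations, not MolProbity's API or output.

```python
def clashscore(overlaps_angstrom, n_atoms, cutoff=0.4):
    """Number of bad overlaps (>= cutoff, in Å) per 1000 atoms."""
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    n_clashes = sum(1 for overlap in overlaps_angstrom if overlap >= cutoff)
    return 1000.0 * n_clashes / n_atoms


# Hypothetical overlaps (Å) for atom pairs in a 1500-atom model.
overlaps = [0.10, 0.45, 0.62, 0.39, 0.40]
score = clashscore(overlaps, n_atoms=1500)
print(f"clash score = {score:.1f}")
```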
Chemical checks
• Bond lengths and angles
Geometry: global figures
  A priori chemical knowledge is introduced (restraints) to keep the model
chemically correct while fitting it to the experimental data:
$$E_{RESTRAINTS} = E_{BOND} + E_{ANGLE} + E_{DIHEDRAL} + E_{PLANARITY} + E_{NONBONDED} + \dots$$
-  Typically only rmsd for bonds and angles are reported along with RWORK
and RFREE
-  Typical values (resolutions ~1.5-2Å): rmsd(bonds)~0.02Å, rmsd(angles)~2°
o  These values can be smaller at lower resolution (~2.5-3Å), approaching 0
at ~3Å and lower resolution, and they can be larger at higher resolution
(~1.5Å and higher).
Engh and Huber, Acta Cryst, 1991
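The rmsd(bonds) figure reported in Table 1 can be sketched as the root-mean-square deviation of model bond lengths from their ideal (restraint library) values. The numbers below are illustrative placeholders, not Engh and Huber library values, and `rmsd_from_ideal` is a hypothetical helper name.

```python
import math


def rmsd_from_ideal(observed, ideal):
    """Root-mean-square deviation between model values and ideal targets."""
    if len(observed) != len(ideal) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(o - i) ** 2 for o, i in zip(observed, ideal)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative bond lengths in Å (model vs. assumed ideal targets).
observed_bonds = [1.54, 1.31, 1.25, 1.47]
ideal_bonds = [1.52, 1.33, 1.23, 1.46]
rmsd_bonds = rmsd_from_ideal(observed_bonds, ideal_bonds)
print(f"rmsd(bonds) = {rmsd_bonds:.3f} Å")
```

The same function applies to bond angles in degrees; only the units change.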
Chemical checks
• backbone and side chain torsion angles
Ramachandran plot
Rotamers
Rotamers: a set of conformers arising from restricted rotation about one single
bond
χ2
χ1
Typically 1-3% outliers
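Backbone (phi, psi) and side chain (χ1, χ2) angles are all torsion (dihedral) angles defined by four consecutive atoms; phi, for example, uses C(i-1), N, CA, C. A minimal NumPy sketch of that calculation follows. The coordinates are made-up test geometry, and the sign of the result follows this formula's own convention.

```python
import numpy as np


def dihedral_deg(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))


# Made-up geometry: cis (0°), perpendicular (90° magnitude), trans (180° magnitude).
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
perp = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, perp, trans)
```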
Are all outliers bad?
...not if justified by fit to
electron density map
Ramachandran plot: outlier may be good
  Not everything flagged as outlier is actually wrong
...what forces might cause this?
-  Check the map
-  Make sure the map is not biased by the model
...similarly for side
chains
Rotamers: outlier may be good
  Not everything flagged as outlier is actually wrong
-  Check the map
Flagged as rotamer outlier
Correct rotamer
... each outlier should be
explainable by examining the
electron density AND by forces
acting in context of the whole
protein
Phenix offers handy tools
for looking at outliers
PHENIX tools for model validation
outliers in graphs also recenter Coot
What the F are
electron density maps?
Fobs-Fcalc
Density maps can offer a
“model free” view
• 2mFo-DFc (blue)
• mFo-DFc (red, green)
m and D are “de-biasing”
coefficients
Fobs = Observed Amplitude
Fcalc = Model-based Amplitude
Maps are contoured in
units of SIGMA (rmsd)
Typically 1.0 for 2Fo-Fc,
+/-3 for Fo-Fc
Map units
  Electron density map ρ_crystal(r), related to the structure factors {F(s)} by Fourier transform (FT), has some arbitrary units.
  Two ways of bringing a map into some scale:
-  Divide it by standard deviation (map in sigmas)
-  Include reflection F(000) and divide map by the unit cell volume. Model
should be complete to estimate F(000). Map in e/Å3.
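The first scaling option (map in sigmas) can be sketched with NumPy as below. The grid is random numbers standing in for a real map, and `to_sigma_units` is a hypothetical helper; the mean is subtracted explicitly, which changes nothing when F(000) is absent and the mean density is already zero.

```python
import numpy as np


def to_sigma_units(rho):
    """Rescale a density grid to units of its own standard deviation."""
    rho = np.asarray(rho, dtype=float)
    sigma = rho.std()
    if sigma == 0:
        raise ValueError("a flat map has no sigma scale")
    return (rho - rho.mean()) / sigma


rng = np.random.default_rng(0)
toy_map = rng.normal(loc=0.0, scale=0.37, size=(8, 8, 8))  # arbitrary units
sigma_map = to_sigma_units(toy_map)
inside_contour = sigma_map >= 1.0  # voxels inside a 1.0 sigma contour
print(inside_contour.sum(), "voxels above 1 sigma")
```

Because the result is dimensionless, multiplying the input map by any positive constant gives the same sigma-scaled map, which is why contour levels in sigma are comparable across maps on different arbitrary scales.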
to Coot!!!
Density of individual
atoms in high
resolution map
appears like a
sphere or ellipsoid
  Computationally it is very beneficial to approximate the electron density
arising from each atom as a Gaussian function
-  Electron density at the point r of an atom located at position r0 and having
B-factor B and occupancy q:
$$\rho_{atom}(r, r_0, B, q) = q \sum_{k=1}^{5} a_k \left(\frac{4\pi}{b_k + B}\right)^{3/2} \exp\left(-\frac{4\pi^2 |r - r_0|^2}{b_k + B}\right)$$
-  Number of terms in the above formula depends on how accurately we want
to model an atom
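A minimal sketch of the Gaussian atom density above, using a single Gaussian term with made-up coefficients (`A_K`, `B_K` are hypothetical, not tabulated scattering-factor values). It also illustrates why occupancy and B-factor are correlated: a half-occupied atom with a suitably lower B has the same peak height as a fully occupied atom with a higher B, and the two differ only in the shape of the tails.

```python
import numpy as np


def rho_atom(r, r0, B, q, a, b):
    """Sum-of-Gaussians electron density of one atom at point r."""
    r = np.asarray(r, dtype=float)
    r0 = np.asarray(r0, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d2 = float(np.sum((r - r0) ** 2))
    width = b + B
    terms = a * (4.0 * np.pi / width) ** 1.5 * np.exp(-4.0 * np.pi**2 * d2 / width)
    return float(q * terms.sum())


A_K = [6.0]   # hypothetical one-term coefficient (about 6 electrons)
B_K = [10.0]  # hypothetical width term, in Å^2
origin = (0.0, 0.0, 0.0)

# Fully occupied atom with B = 30 Å^2.
peak_full = rho_atom(origin, origin, B=30.0, q=1.0, a=A_K, b=B_K)

# Half-occupied atom with B chosen so that the peak height matches.
q_low = 0.5
B_low = (B_K[0] + 30.0) * q_low ** (2.0 / 3.0) - B_K[0]
peak_partial = rho_atom(origin, origin, B=B_low, q=q_low, a=A_K, b=B_K)

# One Å from the centre the two descriptions are no longer identical.
tail_full = rho_atom((1.0, 0.0, 0.0), origin, B=30.0, q=1.0, a=A_K, b=B_K)
tail_partial = rho_atom((1.0, 0.0, 0.0), origin, B=B_low, q=q_low, a=A_K, b=B_K)
print(peak_full, peak_partial, tail_full, tail_partial)
```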
q and B are hard to separate
(even at very high resolution)
$$\rho_{crystal}(r) = \sum_{i=1}^{N_{atoms}} q_i \sum_{k=1}^{5} a_k \left(\frac{4\pi}{b_k + B_i}\right)^{3/2} \exp\left(-\frac{4\pi^2 |r - r_{0,i}|^2}{b_k + B_i}\right)$$
looks very nice:
However in practice we see densities more like:
What are B-factors?
ATOM      1  N   VAL A   2     -19.742  -2.254 -19.976  1.00 54.44           N
ATOM      2  CA  VAL A   2     -19.867  -2.152 -18.529  1.00 54.48           C
ATOM      3  C   VAL A   2     -19.073  -0.927 -18.101  1.00 41.86           C
ATOM      4  O   VAL A   2     -19.367   0.178 -18.554  1.00 47.57           O
ATOM      5  CB  VAL A   2     -19.341  -3.411 -17.836  1.00 68.76           C
$$F_{calc}(\vec{h}) = \sum_j f_j \exp\left(-\tfrac{1}{4} B_j\, \vec{h}^t \vec{h}\right) \exp\left(2\pi i\, \vec{h}^t \vec{x}_j\right), \qquad (1)$$
i.e. three coordinates $\vec{x}_j = (x_j, y_j, z_j)$ and one isotropic B-factor $B_j$ are refined for each
atom j ($f_j$ is the respective scattering factor, $\vec{h}$ a reciprocal lattice vector). The overall
mean displacement of an atom originates from several sources:
• different conformations in different unit cells (‘internal static disorder’)
• vibration or dynamic transitions within molecules (‘internal dynamic disorder’)
• lattice defects
• lattice vibrations (acoustical phonons)
+restraints, +model errors!
From the variety of these contributions it is clear that an isotropic description of mean
displacements is only a very crude approximation. In contrast to small molecules, [the use]
of more detailed models by introducing more parameters into the refine[ment i]s unfortunately
not supported by the number and quality of the X-ray [data for] macromolecules. The situation
is different if atomic resolution data are [available: due] to the large number of observables
(typically on the order of 30 to 50 [per] non-hydrogen atom) the isotropic model for the shape
of the PD (1 parameter corresponding to the radius) can be upgraded to an anisotropic model
(6 parameters [describing the] orientation and the elongation of an ellipsoid (Fig. 1)).
Figure 1: In the anisotropic case the PD for an atom is approximated by an ellipsoidal
instead of a spherical distribution.
Going anisotropic means 6 parameters instead of 1
The 6 parameters for the anisotropic description of the PD of an atom can be written as a
symmetric matrix Uj which enters the structure factor equation in a way very similar to the
isotropic B-factor:
$$F_{calc}(\vec{h}) = \sum_j f_j \exp\left(-2\pi^2\, \vec{h}^t U_j \vec{h}\right) \exp\left(2\pi i\, \vec{h}^t \vec{x}_j\right),$$
resulting in 6+3=9 instead of 1+3=4 (eq. 1) parameters to be refined [per atom].
When is this decision justified?
The elements of the matrix U are referred to as anisotropic displacement p[arameters (ADPs)].
In order to obtain meaningful results from a refinement of ADP’s fo[r …] in most cases
restraints have to be employed to supplement the expe[rimental …]
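The isotropic and anisotropic damping terms in the two structure factor equations agree when U is a multiple of the identity, with B = 8π²U. The sketch below checks that numerically and counts refined parameters per model. It assumes the reciprocal vector h is given in Cartesian coordinates (1/Å) and U in Å²; the function names and the example vector are hypothetical.

```python
import numpy as np


def debye_waller_iso(B, h):
    """Isotropic damping exp(-B h.h / 4) from equation (1)."""
    h = np.asarray(h, dtype=float)
    return float(np.exp(-0.25 * B * (h @ h)))


def debye_waller_aniso(U, h):
    """Anisotropic damping exp(-2 pi^2 h^t U h)."""
    h = np.asarray(h, dtype=float)
    U = np.asarray(U, dtype=float)
    return float(np.exp(-2.0 * np.pi**2 * (h @ U @ h)))


def n_refined_parameters(n_atoms, anisotropic):
    """3 coordinates plus 1 (isotropic) or 6 (anisotropic) displacement terms."""
    return n_atoms * (9 if anisotropic else 4)


B = 54.44                                   # Å^2, from the ATOM records above
U_equivalent = np.eye(3) * B / (8.0 * np.pi**2)
h = np.array([0.2, 0.1, -0.3])              # hypothetical reciprocal vector, 1/Å

iso = debye_waller_iso(B, h)
aniso = debye_waller_aniso(U_equivalent, h)

# A genuinely anisotropic U damps reflections differently by direction.
U_ellipsoid = np.diag([2.0, 1.0, 0.5]) * B / (8.0 * np.pi**2)
along_x = debye_waller_aniso(U_ellipsoid, [0.3, 0.0, 0.0])
along_z = debye_waller_aniso(U_ellipsoid, [0.0, 0.0, 0.3])
print(iso, aniso, along_x, along_z)
```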
What is q?
J. Mol. Biol. (2002) 320, 783–799, doi:10.1016/S0022-2836(02)00476-X
Structural Basis for Mobility in the 1.1 Å Crystal
Structure of the NG Domain of Thermus aquaticus Ffh
Ursula D. Ramirez, George Minasov, Pamela J. Focia,
Robert M. Stroud, Peter Walter, Peter Kuhn and
Douglas M. Freymann
The NG domain of the prokaryotic signal recognition protein Ffh is a two-domain GTPase that comprises part of the prokaryotic signal recognition
particle (SRP) that functions in co-translational targeting of proteins to
the membrane. The interface between the N and G domains includes two …
Difference Maps:
Fo-Fc
Balanced difference maps: model errors
Error in position
Error in occupancy
Should be green, not blue...
Error in B-factor
Balanced difference maps: model errors
Model anisotropic atom with
isotropic
Add positional error
Balanced difference maps
  Ser residue needs a different rotamer
Fo-Fc maps identify everything ordered that is 'missing'
- Eliminate bias
- Half electron content
- See electrons
Also useful in “dynamic”
crystallography
Refinement is the
process of minimizing
Fo-Fc
...need to balance prior knowledge and data
...an iterative process, difference maps
minimized, and 2Fo-Fc maps improve
(phases... we are coming to this)
Refinement target function
  Structure refinement is a process of changing model parameters in order
to optimize a goal (target) function:
T = F(Experimental data, Model parameters, A priori knowledge)
-  Experimental data – a set of diffraction amplitudes Fobs (and phases, if
available).
-  Model parameters: coordinates, ADP, occupancies, bulk-solvent, …
-  A priori knowledge (restraints or constraints) – additional information that
may be introduced to compensate for the insufficiency of experimental data
(finite resolution, poor data-to-parameters ratio)
  Typically: T = EDATA + w*ERESTRAINTS
-  EDATA relates model to experimental data
-  ERESTRAINTS represents a priori knowledge
-  w is a weight to balance the relative contribution of EDATA and ERESTRAINTS
  A priori knowledge can be imposed in the form of constraints so
T = EDATA
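The role of the weight w can be sketched with a one-parameter toy target: a single bond length x pulled towards a value favoured by the data and towards an ideal value favoured by the restraint. Everything here (the quadratic terms, the numbers, the function names) is an assumption made for illustration, not the target used by any refinement program.

```python
def target(x, x_data, x_ideal, w):
    """Toy target: data term plus weighted restraint term."""
    return (x - x_data) ** 2 + w * (x - x_ideal) ** 2


def refine(x_data, x_ideal, w, x0=0.0, n_iter=200):
    """Gradient-driven minimization of the toy target."""
    step = 0.25 / (1.0 + w)  # keeps the iteration stable for any w >= 0
    x = x0
    for _ in range(n_iter):
        grad = 2.0 * (x - x_data) + 2.0 * w * (x - x_ideal)
        x -= step * grad
    return x


x_data, x_ideal = 1.60, 1.52  # hypothetical bond lengths in Å
data_only = refine(x_data, x_ideal, w=0.0)
balanced = refine(x_data, x_ideal, w=1.0)
restraint_dominated = refine(x_data, x_ideal, w=100.0)
print(data_only, balanced, restraint_dominated)
```

With w = 0 the result follows the data alone; as w grows the result approaches the ideal value, which is the situation at low resolution where the restraints carry most of the information.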
Refinement target optimization methods
  Gradient-driven minimization
  Simulated annealing (SA)
  Grid search (Sample parameter space within known range [XMIN, XMAX])
  Hands & eyes (Via Coot)
[Figures: target function profiles with local minima, a deeper local minimum and the global minimum; the grid search panel marks the solution X between XMIN and XMAX]
How do we tell if a
model is “good”?
• physically (packing, contacts)
• chemically (bond lengths, bond angles,
chirality, planarity, torsions)
• crystallographically (real space fits - B-factors, R-factor)
• statistically (R-free, CC1/2)
Most of these stats appear in Table 1
Hands and Eyes are still
important!
Refinement convergence
Minimization
Both minimization and
SA can fix it
Simulated Annealing
This is beyond the
convergence radius
for minimization
Real-space grid search
This is beyond the
convergence radius for
minimization and SA
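The idea of a convergence radius can be sketched on a made-up one-dimensional target with two minima. Gradient-driven minimization started in the wrong basin stays in the local minimum, while a grid search over the known range finds the global one. The target function and all numbers are hypothetical.

```python
def target(x):
    """Toy target: local minimum near x = +0.96, global minimum near x = -1.04."""
    return (x**2 - 1.0) ** 2 + 0.3 * x


def gradient(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3


def gradient_minimize(x0, step=0.01, n_iter=5000):
    """Plain gradient descent; it only ever moves downhill from x0."""
    x = x0
    for _ in range(n_iter):
        x -= step * gradient(x)
    return x


def grid_search(x_min, x_max, n_points=401):
    """Evaluate the target on a regular grid and keep the best point."""
    xs = [x_min + i * (x_max - x_min) / (n_points - 1) for i in range(n_points)]
    return min(xs, key=target)


x_gd = gradient_minimize(1.5)     # start beyond the convergence radius
x_grid = grid_search(-2.0, 2.0)   # sample the whole known range
print(x_gd, target(x_gd))
print(x_grid, target(x_grid))
```

In practice a grid search is only affordable for a few parameters at a time, which is why it is used locally (for example per side chain in real space) rather than for the whole model.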
  In practice it is helpful to look at {B, map CC, 2mFo-DFc, mFo-DFc}
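[Figure annotation: indicates problem places]
Real-space map correlation coefficient (map CC):
$$CC = \frac{\sum_{grid\ points} \left(\rho_{OBS} - \langle\rho_{OBS}\rangle\right)\left(\rho_{CALC} - \langle\rho_{CALC}\rangle\right)}{\left[\sum_{grid\ points} \left(\rho_{OBS} - \langle\rho_{OBS}\rangle\right)^2 \sum_{grid\ points} \left(\rho_{CALC} - \langle\rho_{CALC}\rangle\right)^2\right]^{1/2}}$$
  Scale independent
  Can be computed for the whole structure (not really interesting – you already
have R-factor) or locally (most interesting; typically computed per residue)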
  Values greater than ~0.8 indicate good correlation
  May give high correlation for weak densities
  Map CC is correlated with B-factor: poorly defined regions typically have low
map CC and high B-factors
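A minimal NumPy sketch of the map correlation coefficient over a set of grid points. The arrays are random stand-ins for density values around one residue, and `map_cc` is a hypothetical helper name. The last lines illustrate the scale independence noted above.

```python
import numpy as np


def map_cc(rho_obs, rho_calc):
    """Correlation coefficient between two maps sampled on the same grid points."""
    a = np.asarray(rho_obs, dtype=float).ravel()
    b = np.asarray(rho_calc, dtype=float).ravel()
    da = a - a.mean()
    db = b - b.mean()
    return float(np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2)))


rng = np.random.default_rng(0)
rho_calc = rng.normal(size=200)                      # stand-in for model density
rho_obs = rho_calc + rng.normal(scale=0.5, size=200)  # noisy "observed" density

cc = map_cc(rho_obs, rho_calc)
cc_rescaled = map_cc(3.0 * rho_obs + 2.0, rho_calc)   # different scale and offset
print(cc, cc_rescaled)
```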
although this emphasizes
local adjustments,
refinement is global
$$F(h,k,l) = \sum_j f_j\, e^{2\pi i (h x_j + k y_j + l z_j)}$$
Every X-ray reflection (h,k,l) has a contributing wave from all atoms.
$$\rho(x,y,z) = \sum_{hkl} F(h,k,l)\, e^{-2\pi i (hx + ky + lz)}$$
or
$$\rho(x,y,z) = \sum_{hkl} |F(h,k,l)|\, e^{-2\pi i (hx + ky + lz) + i\varphi_{hkl}}$$
Every point in the density map has contributions from every reflection.
R-factor
  R-factor formula
$$R = \frac{\sum_{reflections} \left|F_{OBS} - F_{MODEL}\right|}{\sum_{reflections} F_{OBS}}$$
(F_OBS and F_MODEL in R are amplitudes)
$$F_{MODEL} = k_{OVERALL}\, e^{-s U_{CRYSTAL} s^t} \left(F_{CALC\_ATOMS} + k_{SOL}\, e^{-B_{SOL} s^2 / 4}\, F_{MASK}\right)$$
  R-factor values:
-  Expected value for a random model R~59%
-  You can see some model in 2mFo-DFc map, R~30%
-  You can see most of the model in 2mFo-DFc map, R<20%
-  Perfect model R~0%
  Sometimes the R-factor looks very good (you would expect a good model)
but the model-to-map fit is terrible… Overfitting.
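A minimal sketch of the R-factor as a sum over reflections. The amplitudes are made-up numbers, and the model amplitudes are assumed to be already on the scale of the observed ones (overall scale and bulk-solvent terms applied).

```python
def r_factor(f_obs, f_model):
    """R = sum |Fobs - Fmodel| / sum Fobs, for amplitudes on a common scale."""
    if len(f_obs) != len(f_model) or not f_obs:
        raise ValueError("need two non-empty lists of equal length")
    numerator = sum(abs(o - m) for o, m in zip(f_obs, f_model))
    denominator = sum(f_obs)
    return numerator / denominator


# Hypothetical amplitudes for four reflections.
f_obs = [100.0, 50.0, 25.0, 10.0]
f_model = [90.0, 55.0, 20.0, 12.0]
r = r_factor(f_obs, f_model)
print(f"R = {100 * r:.1f}%")
```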
Overfitting (I)
Let’s suppose:
The line (red, blue or green) is the model: y = ax + b (2 parameters: a and b);
the plotted points are the data.
Lots of data – one single correct model. R-factor is good.
Less data – more ambiguity, less certainty: a bunch of models. R-factor may be good too.
Little data – variety of models: from good to completely wrong.
R-factor = 0 for all models (including wrong ones).
Overfitting (II)
Let’s suppose:
The line is the model: y = ax + b (2 parameters: a and b); the points are the data.
Model described using more parameters: y = ax^2 + bx + c
Model described using even more parameters: y = a1 x^n + a2 x^(n-1) + …
Fewer parameters: R-factor is good
More parameters: R-factor is better
Many more parameters: R = 0
  What leads to overfitting?
-  Insufficient amount of data (low resolution, poor completeness)
-  Ignoring data (cutting by resolution, sigma, anisotropy correction)
-  Suboptimal parameterization
-  Excess of imagination
-  Bad weights
Model parameters
  Choice for model parameterization depends on amount of available data
and its resolution
Key resolution limits and corresponding features
Overfitting
  Solution: cross-validation (R-free factor):
-  At the beginning of structure solution split the data into two sets: test set
(~5-10% of randomly selected data), and work set (the rest).
-  From this point on you look at two R-factors: R-work (computed using work
set), and R-free (computed using test set)
Dataset (FOBS) is split into a work set and a test set:
-  Work set reflections are used for everything: model building,
refinement, map calculation, …
-  Test set reflections are never used for any model optimization,
except Rfree factor calculation
  Rationale: the model that fits well ~90% of work set should fit well 10% of
excluded data (test set). Since test set data does not participate in
refinement, Rfree > Rwork. The gap Rfree–Rwork depends on resolution and
ranges from 5-7% (at medium to low resolution) to ~0.5-1% (at ultra-high
resolution)
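The cross-validation idea can be sketched with the straight-line toy model from the overfitting slides: points are split into a work set used for fitting and a test set used only for scoring, and polynomials with more and more parameters are fitted to the work set. The data, noise level and polynomial degrees are made up, and the test fraction is larger than the usual 5-10% only because the toy has ten points.

```python
import numpy as np


def r_factor(y_obs, y_calc):
    """R-like residual: sum |obs - calc| / sum |obs|."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    return float(np.sum(np.abs(y_obs - y_calc)) / np.sum(np.abs(y_obs)))


rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)  # true model is y = 2x + 1

# Set aside a randomly chosen test set once, before any fitting.
test_idx = rng.choice(x.size, size=2, replace=False)
work = np.ones(x.size, dtype=bool)
work[test_idx] = False

results = {}
for degree in (1, 3, 7):  # 2, 4 and 8 parameters; 8 work points
    coeffs = np.polyfit(x[work], y[work], degree)
    y_calc = np.polyval(coeffs, x)
    r_work = r_factor(y[work], y_calc[work])
    r_free = r_factor(y[~work], y_calc[~work])
    results[degree] = (r_work, r_free)
    print(f"degree {degree}: R-work = {r_work:.4f}, R-free = {r_free:.4f}")
```

With as many parameters as work points the fit passes through every work point, so R-work drops to essentially zero while R-free does not: the test points were never used, so they expose the overfitting.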
Why does Rfree work so well?
$$F(h,k,l) = \sum_j f_j\, e^{2\pi i (h x_j + k y_j + l z_j)}$$
Every X-ray reflection (h,k,l) has a contributing wave from all atoms.
$$\rho(x,y,z) = \sum_{hkl} F(h,k,l)\, e^{-2\pi i (hx + ky + lz)}$$
or
$$\rho(x,y,z) = \sum_{hkl} |F(h,k,l)|\, e^{-2\pi i (hx + ky + lz) + i\varphi_{hkl}}$$
Every point in the density map has contributions from every reflection.
What the F are
reflections, structure
factors, amplitudes,
“spots”?
hkl
We rotate the crystal to place a different set of
reflections on the detector
Ewald sphere
construction given:
wavelength
angle
lattice
distance from detector
orientation of lattice
relative to detector
predicts:
which diffracted waves
satisfy Bragg’s law
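Bragg's law, nλ = 2d sin θ, links the lattice spacing d and wavelength λ to the scattering angle 2θ, and the crystal-to-detector distance then sets where the spot lands. The sketch below assumes a flat detector perpendicular to the beam; the function names and numbers are hypothetical, and it does not implement the full Ewald construction (crystal orientation is ignored).

```python
import math


def bragg_two_theta_deg(d_spacing, wavelength, order=1):
    """Scattering angle 2θ in degrees from Bragg's law n·λ = 2·d·sin θ."""
    sin_theta = order * wavelength / (2.0 * d_spacing)
    if sin_theta > 1.0:
        raise ValueError("this spacing cannot diffract at this wavelength")
    return 2.0 * math.degrees(math.asin(sin_theta))


def spot_radius_mm(two_theta_deg, detector_distance_mm):
    """Distance of the spot from the beam centre on a flat, perpendicular detector."""
    return detector_distance_mm * math.tan(math.radians(two_theta_deg))


two_theta = bragg_two_theta_deg(d_spacing=1.0, wavelength=1.0)  # both in Å
radius = spot_radius_mm(two_theta, detector_distance_mm=100.0)
print(f"2θ = {two_theta:.1f}°, spot at {radius:.1f} mm from the beam centre")
```

Higher resolution (smaller d) means larger 2θ, so those spots fall further from the beam centre; moving the detector closer brings them onto it.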
Each reflection is
measured multiple times
• F = sqrt(Intensity)
• SigI = error in Intensity (resulting from
multiple observations)
Where to cut the data?
I/sigma - background
Rmerge - consistency
CC1/2, CC* - effect on
refinement
(Karplus and Diederichs, Science, 2012)
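CC1/2 is the correlation between mean intensities from two random halves of the measurements, and Karplus and Diederichs relate it to CC* = sqrt(2·CC1/2 / (1 + CC1/2)). The sketch below simulates two half datasets as a shared "true" intensity plus independent noise; the distributions and noise level are assumptions for illustration only.

```python
import numpy as np


def cc_half(i_half1, i_half2):
    """Pearson correlation between intensities from two half datasets."""
    return float(np.corrcoef(i_half1, i_half2)[0, 1])


def cc_star(cc12):
    """CC* estimated from CC1/2 (Karplus and Diederichs, 2012)."""
    return float(np.sqrt(2.0 * cc12 / (1.0 + cc12)))


rng = np.random.default_rng(2)
true_i = rng.exponential(scale=100.0, size=500)       # simulated true intensities
half1 = true_i + rng.normal(0.0, 20.0, true_i.size)   # half dataset 1
half2 = true_i + rng.normal(0.0, 20.0, true_i.size)   # half dataset 2

cc12 = cc_half(half1, half2)
print(f"CC1/2 = {cc12:.3f}, CC* = {cc_star(cc12):.3f}")
```

In a real dataset this would be computed per resolution shell, so that the fall-off of CC1/2 with resolution can guide where to cut the data.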
If you have too many
overloads
if you throw out weak
(low res) data
if you randomly miss
data (like Rfree)
if you miss slices of data
(bad strategy) - why you
need a whole dataset
Scattering pattern is the Fourier transform (FT) of the structure
$$F(S) = \sum_j f_j\, e^{2\pi i\, r_j \cdot S}$$
Structure is the ‘inverse’ Fourier transform (FT^{-1}) of the scattering pattern
$$\rho(r) = \sum F(S)\, e^{-2\pi i\, r \cdot S}$$
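The forward and inverse transforms can be sketched in one dimension with three point atoms in a unit cell. The positions, scattering factors and the range of h are made up, and the scattering factors are treated as constants independent of resolution. Moving a single atom changes every reflection with h ≠ 0, and the density rebuilt from all reflections peaks at the heaviest atom.

```python
import numpy as np


def structure_factors(x_frac, f, h_values):
    """F(h) = sum_j f_j exp(2πi h x_j) for a 1D unit cell."""
    x = np.asarray(x_frac, dtype=float)
    f = np.asarray(f, dtype=float)
    h = np.asarray(h_values)[:, None]
    return np.sum(f * np.exp(2j * np.pi * h * x), axis=1)


def density(F, h_values, x_grid):
    """ρ(x) = sum_h F(h) exp(-2πi h x), unnormalised."""
    h = np.asarray(h_values)[:, None]
    return np.real(np.sum(F[:, None] * np.exp(-2j * np.pi * h * x_grid), axis=0))


h_values = np.arange(-20, 21)
atoms_x = [0.2, 0.5, 0.8]    # fractional coordinates
atoms_f = [6.0, 8.0, 16.0]   # stand-ins for C, O, S electron counts

F = structure_factors(atoms_x, atoms_f, h_values)
x_grid = np.arange(200) / 200.0
rho = density(F, h_values, x_grid)

# Shift only the middle atom slightly and recompute every reflection.
F_moved = structure_factors([0.2, 0.513, 0.8], atoms_f, h_values)
changed = np.abs(F_moved - F) > 1e-9
print("reflections changed:", int(changed.sum()), "of", changed.size)
print("highest density at x =", x_grid[np.argmax(rho)])
```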
A crystal only samples the parts of the
transform that satisfy Bragg’s Law
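[Figure: real-space lattice with spacings a and b; its Fourier transform (FT) is a reciprocal lattice with spacing 1/b, and FT^{-1} maps it back]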
$$F(h,k,l) = \sum_j f_j\, e^{2\pi i (h x_j + k y_j + l z_j)}$$
Every X-ray reflection (h,k,l) has a contributing wave from all atoms.
$$\rho(x,y,z) = \sum_{hkl} F(h,k,l)\, e^{-2\pi i (hx + ky + lz)}$$
or
$$\rho(x,y,z) = \sum_{hkl} |F(h,k,l)|\, e^{-2\pi i (hx + ky + lz) + i\varphi_{hkl}}$$
Every point in the density map has contributions from every reflection.
Fourier Transform
Waves have phase too...
• next lecture - and paper... how model phases bias our maps and how to “solve” the phase
problem
$$F(h,k,l) = \sum_j f_j\, e^{2\pi i (h x_j + k y_j + l z_j)}$$
Every X-ray reflection (h,k,l) has a contributing wave from all atoms.
$$\rho(x,y,z) = \sum_{hkl} F(h,k,l)\, e^{-2\pi i (hx + ky + lz)}$$
or
$$\rho(x,y,z) = \sum_{hkl} |F(h,k,l)|\, e^{-2\pi i (hx + ky + lz) + i\varphi_{hkl}}$$
Every point in the density map has contributions from every reflection.
Key take-aways
1. X-ray crystallography results from an ensemble of
Billions and Billions of molecules in the crystal
2. Models in the PDB are often sub-optimal and can
contain errors
3. Intensity of “spots” relates to the electron density
(which relates to the molecules) in the unit cell
4. Positions of “spots” relate to the arrangement of
unit cells in the crystal
5. Every “spot” contains contributions from every
part of the crystal. Every part of the map contains
contributions from every “spot”
Key outcomes
• Understand “Table 1” in X-ray Papers (now
often “Table S1”)
• Understand the basic workflow of
determining a crystal “structure”
• Embrace the beauty and challenge of
disorder at high and low resolution