X-ray Crystallography I - UCSF Macromolecular Structure Group
X-ray Crystallography I
James Fraser
Macromolecular Interactions BP204

Key take-aways
1. X-ray crystallography results from an ensemble of billions and billions of molecules in the crystal
2. Models in the PDB are often sub-optimal and can contain errors
3. The intensity of “spots” relates to the electron density (which relates to the molecules) in the unit cell
4. The positions of “spots” relate to the arrangement of unit cells in the crystal
5. Every “spot” contains contributions from every part of the crystal. Every part of the map contains contributions from every “spot”

Key outcomes
• Understand “Table 1” in X-ray papers (now often “Table S1”)
• Understand the basic workflow of determining a crystal “structure”
• Embrace the beauty and challenge of disorder at high and low resolution

Today we are going to tackle crystallography in reverse
• Texts begin with diffraction theory from a series of point atoms
• Bob teaches a mini-course in Spring with this level of detail
• Example texts: Biomolecular Crystallography, Rupp; Principles of Protein X-ray Crystallography, Drenth; Crystallography Made Crystal Clear, Rhodes
• Today: model to reflections. Tomorrow: phasing

What’s PHENIX for: macromolecular crystallographic structure solution
Adams et al., Acta Cryst. D66, 213-221 (2010).

What is a protein structure?
• What is a protein “structure”? Is it a:
  • pretty cartoon...
  • space-filling set of spheres...
  • picture of the protein in the crystal...
  • computational picture of the protein...
  • representation of atoms that satisfies experimental constraints...
  • PDB formatted text file...
  • model!!!
Moreover... a model of the crystal lattice...

PDB (Protein Data Bank) files are text: chemistry, sequence, position, certainty

    HEADER    HYDROLASE                               10-DEC-06   2O7A
    TITLE     T4 LYSOZYME C-TERMINAL FRAGMENT
    COMPND    MOL_ID: 1;
    COMPND   2 MOLECULE: LYSOZYME;
    ...
    REMARK   3  FIT TO DATA USED IN REFINEMENT (NO CUTOFF).
    REMARK   3   R VALUE   (WORKING + TEST SET, NO CUTOFF) : NULL
    REMARK   3   R VALUE          (WORKING SET, NO CUTOFF) : 0.090
    REMARK   3   FREE R VALUE                  (NO CUTOFF) : 0.108
    ...
    ATOM      1  N   VAL A   2     -19.742  -2.254 -19.976  1.00 54.44           N
    ATOM      2  CA  VAL A   2     -19.867  -2.152 -18.529  1.00 54.48           C
    ATOM      3  C   VAL A   2     -19.073  -0.927 -18.101  1.00 41.86           C
    ATOM      4  O   VAL A   2     -19.367   0.178 -18.554  1.00 47.57           O
    ATOM      5  CB  VAL A   2     -19.341  -3.411 -17.836  1.00 68.76           C
    ...
    MASTER      287    0    3   10    0    0    0    6 1566    1   22   10
    END

The universe of protein structures: our knowledge about protein structures is increasing...
• 65,271 protein structures are deposited in the PDB (2/15/2010)
• This number is growing by >~7000 a year
• Growing input from Structural Genomics HT structure determination (>1000 structures a year)
© Robert M. Stroud 2012

How do we tell if a model is “good”?
• physically (packing, contacts)
• chemically (bond lengths, bond angles, chirality, planarity, torsions)
• crystallographically (real-space fits, B-factors, R-factor)
• statistically (R-free, CC1/2)
Most of these stats appear in Table 1

Physical checks
• Steric clashes (figure: bad vs. good examples)
• Overall clash score: number of bad overlaps per 1000 atoms
• A clash: disallowed atom pair overlap ≥0.4 Å
MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Davis et al., Nucleic Acids Research, 2007, Vol. 35

Chemical checks
• Bond lengths and angles
Geometry: global figures
A priori chemical knowledge is introduced (restraints) to keep the model chemically correct while fitting it to the experimental data:
E_RESTRAINTS = E_BOND + E_ANGLE + E_DIHEDRAL + E_PLANARITY + E_NONBONDED + …
- Typically only the rmsd for bonds and angles are reported along with R_WORK and R_FREE
- Typical values (resolutions ~1.5-2 Å): rmsd(bonds) ~0.02 Å, rmsd(angles) ~2°
  o These values can be smaller at lower resolution (~2.5-3 Å), approaching 0 at ~3 Å and lower resolution, and they can be larger at higher resolution (~1.5 Å and higher).
Engh and Huber, Acta Cryst, 1991

Chemical checks
• Backbone and side chain torsion angles: Ramachandran plot, rotamers
• Rotamers: a set of conformers arising from restricted rotation about one single bond (χ1, χ2)
• Typically 1-3% outliers

Are all outliers bad? ...not if justified by fit to the electron density map

Ramachandran plot: outlier may be good
• Not everything flagged as an outlier is actually wrong ...what forces might cause this?
  - Check the map
  - Make sure the map is not biased by the model
...similarly for side chains

Rotamers: outlier may be good
• Not everything flagged as an outlier is actually wrong
  - Check the map
• (figure labels: flagged as rotamer outlier; correct rotamer)
... each outlier should be explainable by examining the electron density AND by forces acting in the context of the whole protein

Phenix offers handy tools for looking at outliers
PHENIX tools for model validation
• Outliers in graphs also recenter Coot

What the F are electron density maps?
Density maps can offer a “model free” view
• 2mFo-DFc (blue)
• mFo-DFc (red, green)
• m and D are “de-biasing” coefficients
• Fobs = observed amplitude; Fcalc = model-based amplitude
Maps are contoured in units of sigma (rmsd): typically 1.0 for 2Fo-Fc, +/-3 for Fo-Fc

Map units
The electron density map \rho_{crystal}(r) = FT\{F(s)\} has some arbitrary units.
Two ways of bringing a map onto some scale:
- Divide it by its standard deviation (map in sigmas)
- Include reflection F(000) and divide the map by the unit cell volume. The model should be complete to estimate F(000). Map in e/Å^3.

to Coot!!!
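As a rough illustration of the first option on the map-units slide (dividing by the standard deviation), here is a minimal Python sketch. The one-dimensional "map", its values and the helper name `to_sigma_units` are all made up for illustration. The sketch also subtracts the mean, on the assumption that a map computed without F(000) averages to zero anyway.

```python
import math

def to_sigma_units(density):
    """Rescale map values to zero mean and unit standard deviation (rmsd)."""
    n = len(density)
    mean = sum(density) / n
    rmsd = math.sqrt(sum((v - mean) ** 2 for v in density) / n)
    return [(v - mean) / rmsd for v in density]

# Hypothetical 1D "map" in arbitrary units: flat solvent plus one peak.
raw_map = [0.0] * 50
for i, v in zip(range(20, 27), [2.0, 6.0, 12.0, 15.0, 12.0, 6.0, 2.0]):
    raw_map[i] = v

sigma_map = to_sigma_units(raw_map)

# The same map on a different arbitrary scale gives the same sigma values,
# which is why contour levels are quoted in sigma.
rescaled = to_sigma_units([1000.0 * v for v in raw_map])

# Grid points that would sit inside a 1.0 sigma contour.
above_1_sigma = [i for i, v in enumerate(sigma_map) if v >= 1.0]
print(above_1_sigma)
```

Note that the sigma level depends on how much of the cell is flat solvent, so the same peak can sit at a different sigma value in a different crystal form.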
Density of individual atoms in a high-resolution map appears like a sphere or ellipsoid
Computationally it is very beneficial to approximate the electron density arising from each atom as a Gaussian function
- Electron density at the point r of an atom located at position r_0 and having B-factor B and occupancy q:
\[ \rho_{atom}(r, r_0, B, q) = q \sum_{k=1}^{5} a_k \left( \frac{4\pi}{b_k + B} \right)^{3/2} \exp\left( -\frac{4\pi^2 |r - r_0|^2}{b_k + B} \right) \]
- The number of terms in the above formula depends on how accurately we want to model an atom
q and B are hard to separate (even at very high resolution)

The density of the whole crystal,
\[ \rho_{crystal}(r) = \sum_{i=1}^{N_{atoms}} q_i \sum_{k=1}^{5} a_k \left( \frac{4\pi}{b_k + B_i} \right)^{3/2} \exp\left( -\frac{4\pi^2 |r - r_{0,i}|^2}{b_k + B_i} \right) \]
looks very nice (figure). However, in practice we see densities more like: (figure)

What are B-factors?

    ATOM      1  N   VAL A   2     -19.742  -2.254 -19.976  1.00 54.44           N
    ATOM      2  CA  VAL A   2     -19.867  -2.152 -18.529  1.00 54.48           C
    ATOM      3  C   VAL A   2     -19.073  -0.927 -18.101  1.00 41.86           C
    ATOM      4  O   VAL A   2     -19.367   0.178 -18.554  1.00 47.57           O
    ATOM      5  CB  VAL A   2     -19.341  -3.411 -17.836  1.00 68.76           C

\[ F_{calc}(\vec{h}) = \sum_j f_j \exp\left( -\tfrac{1}{4} B_j \, \vec{h}^t \vec{h} \right) \exp\left( 2\pi i \, \vec{h}^t \vec{x}_j \right) \qquad (1) \]
i.e. three coordinates \vec{x}_j = (x_j, y_j, z_j) and one isotropic B-factor B_j are refined for each atom j (f_j is the respective scattering factor, \vec{h} a reciprocal lattice vector).
The overall mean displacement of an atom originates from several sources:
• different conformations in different unit cells ('internal static disorder')
• vibration or dynamic transitions within molecules ('internal dynamic disorder')
• lattice defects
• lattice vibrations (acoustical phonons)
+restraints, +model errors!
From the variety of these contributions it is clear that an isotropic description of mean displacements is only a very crude approximation. In contrast to small molecules, the use of more detailed models by introducing more parameters into the refinement is unfortunately not supported by the number and quality of the X-ray data for macromolecules.
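The Gaussian atom model above can be explored numerically. The following Python sketch evaluates the formula with a single made-up term (the pair a = 6, b = 10 Å^2 is hypothetical, not a tabulated scattering-factor coefficient) and checks two things: raising B lowers and broadens the peak without changing the integrated electron count, and a reduced occupancy q at low B can reproduce the peak height of a fully occupied atom at high B. The second point hints at why q and B are hard to separate, although only the peak height is matched here, not the full shape.

```python
import math

def rho_atom(dist, b_factor, occupancy, terms):
    """Gaussian-sum density at distance `dist` (in Å) from the atom centre.

    `terms` is a list of (a_k, b_k) pairs. Real work would use tabulated
    coefficients; the values used below are invented for illustration.
    """
    total = 0.0
    for a_k, b_k in terms:
        width = b_k + b_factor
        total += a_k * (4.0 * math.pi / width) ** 1.5 * math.exp(
            -4.0 * math.pi ** 2 * dist ** 2 / width)
    return occupancy * total

TOY_TERMS = [(6.0, 10.0)]  # hypothetical: one term, 6 electrons, b = 10 Å^2

def electron_count(b_factor, occupancy, r_max=20.0, steps=20000):
    """Integrate the density over spherical shells (midpoint rule)."""
    dr = r_max / steps
    return sum(
        4.0 * math.pi * ((i + 0.5) * dr) ** 2
        * rho_atom((i + 0.5) * dr, b_factor, occupancy, TOY_TERMS) * dr
        for i in range(steps))

peak_sharp = rho_atom(0.0, 10.0, 1.0, TOY_TERMS)    # B = 10 Å^2, q = 1
peak_blurred = rho_atom(0.0, 50.0, 1.0, TOY_TERMS)  # B = 50 Å^2, q = 1

# Occupancy that makes a B = 10 atom as tall as the fully occupied B = 50 atom.
q_equiv = peak_blurred / peak_sharp
peak_low_q = rho_atom(0.0, 10.0, q_equiv, TOY_TERMS)

print(peak_sharp, peak_blurred, q_equiv)
print(electron_count(10.0, 1.0), electron_count(50.0, 1.0))
```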
The situation is different if atomic resolution data are available. Due to the large number of observables (typically on the order of 30 to 50 per non-hydrogen atom) the isotropic model for the shape of the PD (1 parameter corresponding to the radius) can be upgraded to an anisotropic model (6 parameters describing the orientation and the elongation of an ellipsoid (Fig. 1)).

Figure 1: In the anisotropic case the PD for an atom is approximated by an ellipsoidal instead of a spherical distribution.

Going anisotropic means 6 parameters instead of 1
The 6 parameters for the anisotropic description of the PD of an atom can be written as a symmetric matrix U_j which enters the structure factor equation in a way very similar to the isotropic B-factor:
\[ F_{calc}(\vec{h}) = \sum_j f_j \exp\left( -2\pi^2 \, \vec{h}^t U_j \vec{h} \right) \exp\left( 2\pi i \, \vec{h}^t \vec{x}_j \right), \]
resulting in 6+3=9 instead of 1+3=4 (eq. 1) parameters to be refined per atom. The elements of the matrix U are referred to as anisotropic displacement parameters (ADPs).
When is this decision justified?
In order to obtain meaningful results from a refinement of ADPs, in most cases restraints have to be employed to supplement the experimental data.

What is q?
Ursula D. Ramirez, George Minasov, Pamela J. Focia, Robert M. Stroud, Peter Walter, Peter Kuhn and Douglas M. Freymann. Structural Basis for Mobility in the 1.1 Å Crystal Structure of the NG Domain of Thermus aquaticus Ffh. J. Mol. Biol. (2002) 320, 783–799. doi:10.1016/S0022-2836(02)00476-X
(Department of Molecular Pharmacology and Biological Chemistry, Northwestern University Medical School)
Abstract excerpt: The NG domain of the prokaryotic signal recognition protein Ffh is a two-domain GTPase that comprises part of the prokaryotic signal recognition particle (SRP) that functions in co-translational targeting of proteins to the membrane.
The interface between the N and G domains includes two ...

Difference Maps: Fo-Fc

Balanced difference maps: model errors
• Error in position
• Error in occupancy
Should be green, not blue...
• Error in B-factor

Balanced difference maps: model errors
• Model anisotropic atom with isotropic
• Add positional error

Balanced difference maps
• Ser residue needs a different rotamer

Fo-Fc maps identify everything ordered that is 'missing'
• Eliminate bias
• Half electron content
• See electrons
(© Robert M. Stroud 2012)

Also useful in “dynamic” crystallography

Refinement is the process of minimizing Fo-Fc
...need to balance prior knowledge and data
...an iterative process: difference maps are minimized, and 2Fo-Fc maps improve (phases... we are coming to this)

Refinement target function
Structure refinement is a process of changing model parameters in order to optimize a goal (target) function:
T = F(Experimental data, Model parameters, A priori knowledge)
- Experimental data: a set of diffraction amplitudes Fobs (and phases, if available)
- Model parameters: coordinates, ADP, occupancies, bulk-solvent, …
- A priori knowledge (restraints or constraints): additional information that may be introduced to compensate for the insufficiency of experimental data (finite resolution, poor data-to-parameters ratio)
Typically: T = T_DATA + w*T_RESTRAINTS
- T_DATA (also written E_DATA) relates the model to the experimental data
- T_RESTRAINTS (also written E_RESTRAINTS) represents a priori knowledge
- w is a weight to balance the relative contribution of E_DATA and E_RESTRAINTS
A priori knowledge can also be imposed in the form of constraints, so that T = E_DATA

Refinement target optimization methods
• Gradient-driven minimization (figure: target function profile)
• Simulated annealing (SA) (figure: target function profile with a local minimum and the global minimum)
• Grid search: sample parameter space within a known range [XMIN, XMAX] (figure: target function profile with local minima, a deeper local minimum and the global minimum)
• Hands & eyes (via Coot)

How do we tell if a model is “good”?
• physically (packing, contacts)
• chemically (bond lengths, bond angles, chirality, planarity, torsions)
• crystallographically (real-space fits, B-factors, R-factor)
• statistically (R-free, CC1/2)
Most of these stats appear in Table 1
Hands and eyes are still important!

Refinement convergence
• Minimization: both minimization and SA can fix it
• Simulated annealing: this is beyond the convergence radius for minimization
• Real-space grid search: this is beyond the convergence radius for minimization and SA

In practice it is helpful to look at {B, map CC, 2mFo-DFc, mFo-DFc}
(figure annotations: indicates problem places)

Real-space map correlation coefficient
\[ CC = \frac{\sum_{grid\ points} (\rho_{OBS} - \langle\rho_{OBS}\rangle)(\rho_{CALC} - \langle\rho_{CALC}\rangle)}{\left[ \sum_{grid\ points} (\rho_{OBS} - \langle\rho_{OBS}\rangle)^2 \sum_{grid\ points} (\rho_{CALC} - \langle\rho_{CALC}\rangle)^2 \right]^{1/2}} \]
• Scale independent
• Can be computed for the whole structure (not really interesting, since you already have the R-factor) or locally (most interesting; typically computed per residue)
• Values greater than ~0.8 indicate good correlation
• May give high correlation for weak densities
• Map CC is correlated with B-factor: poorly defined regions typically have low map CC and high B-factors

Although this emphasizes local adjustments, refinement is global
\[ F(h,k,l) = \sum_j f_j \, e^{2\pi i (h x_j + k y_j + l z_j)} \]
Every X-ray reflection (h,k,l) has a contributing wave from all atoms.
\[ \rho(x,y,z) = \sum_{hkl} F(h,k,l) \, e^{-2\pi i (hx + ky + lz)} \]
or
\[ \rho(x,y,z) = \sum_{hkl} |F(h,k,l)| \, e^{-2\pi i (hx + ky + lz) + i\varphi_{hkl}} \]
Every point in the density map has contributions from every reflection.

R-factor
R-factor formula:
\[ R = \frac{\sum_{reflections} |F_{OBS} - F_{MODEL}|}{\sum_{reflections} F_{OBS}} \]
\[ F_{MODEL} = k_{OVERALL} \, e^{-s U_{CRYSTAL} s^t} \left( F_{CALC\_ATOMS} + k_{SOL} \, e^{-B_{SOL} s^2 / 4} F_{MASK} \right) \]
R-factor values:
- Expected value for a random model: R~59%
- You can see some model in the 2mFo-DFc map: R~30%
- You can see most of the model in the 2mFo-DFc map: R<20%
- Perfect model: R~0%
Sometimes the R-factor looks very good (you would expect a good model) but the model-to-map fit is terrible… Overfitting.
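The two statistics on the preceding slides can be written out directly. This Python sketch uses invented amplitudes and a one-dimensional stand-in for map grid points. The function names `r_factor` and `map_cc` are illustrative and do not correspond to any PHENIX call.

```python
import math

def r_factor(f_obs, f_model):
    """R = sum |Fobs - Fmodel| / sum Fobs, summed over reflections."""
    return sum(abs(o - m) for o, m in zip(f_obs, f_model)) / sum(f_obs)

def map_cc(rho_obs, rho_calc):
    """Correlation coefficient between two maps sampled on the same grid."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    num = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    den = math.sqrt(sum((o - mean_o) ** 2 for o in rho_obs)
                    * sum((c - mean_c) ** 2 for c in rho_calc))
    return num / den

# Made-up amplitudes for five reflections.
f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_model = [90.0, 85.0, 55.0, 45.0, 25.0]
r = r_factor(f_obs, f_model)

# Made-up density values along a line of grid points.
rho_obs = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0]
rho_scaled = [3.0 * v + 2.0 for v in rho_obs]         # same shape, other scale
rho_shifted = [0.0, 0.0, 1.0, 4.0, 9.0, 4.0, 1.0]     # peak moved by one point

print(r)
print(map_cc(rho_obs, rho_scaled), map_cc(rho_obs, rho_shifted))
```

The scaled map correlates perfectly with the original, which is the "scale independent" property, while a one-grid-point shift of the peak already pulls the correlation well below the ~0.8 guideline.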
Overfitting (I)
Let’s suppose the model is y = ax + b (2 parameters: a and b), drawn as red, blue or green lines through the data points.
• Lots of data: one single correct model. R-factor is good
• Less data: more ambiguity, less certainty, a bunch of models. R-factor may be good too
• Little data: a variety of models, from good to completely wrong. R-factor = 0 for all models (including wrong ones)

Overfitting (II)
Let’s suppose the same data are described by models with more and more parameters:
• Fewer parameters: y = ax + b (2 parameters: a and b). R-factor is good
• More parameters: y = ax^2 + bx + c. R-factor is better
• Many more parameters: y = a_1 x^n + a_2 x^(n-1) + … R = 0

What leads to overfitting?
- Insufficient amount of data (low resolution, poor completeness)
- Ignoring data (cutting by resolution, sigma, anisotropy correction)
- Suboptimal parameterization
- Excess of imagination
- Bad weights

Model parameters
The choice of model parameterization depends on the amount of available data and its resolution
(figure: key resolution limits and corresponding features)

Overfitting
Solution: cross-validation (R-free factor):
- At the beginning of structure solution split the data into two sets: a test set (~5-10% of randomly selected data) and a work set (the rest).
- From this point on you look at two R-factors: R-work (computed using the work set) and R-free (computed using the test set)
Dataset (FOBS) = work + test
• Work set reflections are used for everything: model building, refinement, map calculation, …
• Test set reflections are never used for any model optimization, except R-free factor calculation
Rationale: a model that fits the ~90% of the data in the work set well should also fit the excluded ~10% (the test set) well. Since test set data do not participate in refinement, Rfree > Rwork. The gap Rfree-Rwork depends on resolution and ranges from 5-7% (at medium to low resolution) to ~1% (at ultra-high resolution, ~0.5 Å)

Why does Rfree work so well?
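The cross-validation argument can be tried on the y = ax + b example from the overfitting slides. This is a deliberately contrived Python sketch: the data, the alternating "noise" and the single held-out point are invented so that the result is deterministic, whereas a real test set is a random 5-10% of reflections. A two-parameter line is compared with a polynomial that has as many parameters as work-set points, and both are scored with an R-factor-like residual on the work and test sets.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b (2 parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    """Polynomial through every work point (one parameter per data point)."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return model

def r_like(model, xs, ys):
    """R-factor-style residual: sum |obs - calc| / sum |obs|."""
    return (sum(abs(y - model(x)) for x, y in zip(xs, ys))
            / sum(abs(y) for y in ys))

# Invented data: the "truth" is y = 2x + 1, with alternating +/-1 errors.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, [1, -1, 1, -1, 1, -1, 1])]

test_idx = {3}  # hypothetical held-out "test set"
work = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
wx, wy = [p[0] for p in work], [p[1] for p in work]
tx, ty = [p[0] for p in test], [p[1] for p in test]

line, poly = fit_line(wx, wy), fit_interpolant(wx, wy)
r_work_line, r_free_line = r_like(line, wx, wy), r_like(line, tx, ty)
r_work_poly, r_free_poly = r_like(poly, wx, wy), r_like(poly, tx, ty)

print("line: R-work %.3f, R-free %.3f" % (r_work_line, r_free_line))
print("poly: R-work %.3f, R-free %.3f" % (r_work_poly, r_free_poly))
```

The over-parameterized model reaches R-work = 0 yet has the worse R-free, which is the signature that the held-out data are meant to expose.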
\[ F(h,k,l) = \sum_j f_j \, e^{2\pi i (h x_j + k y_j + l z_j)} \]
Every X-ray reflection (h,k,l) has a contributing wave from all atoms.
\[ \rho(x,y,z) = \sum_{hkl} F(h,k,l) \, e^{-2\pi i (hx + ky + lz)} \]
or
\[ \rho(x,y,z) = \sum_{hkl} |F(h,k,l)| \, e^{-2\pi i (hx + ky + lz) + i\varphi_{hkl}} \]
Every point in the density map has contributions from every reflection.

What the F are reflections, structure factors, amplitudes, “spots”?
• Reflections are indexed by hkl
• We rotate the crystal to place a different set of reflections on the detector
(© Robert M. Stroud 2012)

Ewald sphere construction
given:
• wavelength
• angle
• lattice
• distance from detector
• orientation of lattice relative to detector
predicts: which diffracted waves satisfy Bragg’s law

Each reflection is measured multiple times
• F = sqrt(Intensity)
• SigI = error in Intensity (resulting from multiple observations)

Where to cut the data?
• I/sigma: background
• Rmerge: consistency
• CC1/2, CC*: effect on refinement (Karplus and Diederichs, Science, 2012)

Why you need a whole dataset: what happens...
• if you have too many overloads
• if you throw out weak (low res) data
• if you randomly miss data (like Rfree)
• if you miss slices of data (bad strategy)

The scattering pattern is the Fourier transform of the structure
\[ F(S) = \sum_j f_j \, e^{2\pi i \, r_j \cdot S} \]
The structure is the 'inverse' Fourier transform of the scattering pattern
\[ \rho(r) = \sum_S F(S) \, e^{-2\pi i \, r \cdot S} \]
(figure: FT and FT^-1 connecting structure and scattering pattern)

A crystal only samples the parts of the transform that satisfy Bragg’s Law
(figure: lattice with spacings a and b; its FT has spacing 1/b; FT^-1 returns the lattice)

Fourier Transform
Waves have phase too...
• Next lecture, and paper: how model phases bias our maps and how to “solve” the phase problem

Key take-aways
1. X-ray crystallography results from an ensemble of billions and billions of molecules in the crystal
2. Models in the PDB are often sub-optimal and can contain errors
3. The intensity of “spots” relates to the electron density (which relates to the molecules) in the unit cell
4. The positions of “spots” relate to the arrangement of unit cells in the crystal
5. Every “spot” contains contributions from every part of the crystal. Every part of the map contains contributions from every “spot”

Key outcomes
• Understand “Table 1” in X-ray papers (now often “Table S1”)
• Understand the basic workflow of determining a crystal “structure”
• Embrace the beauty and challenge of disorder at high and low resolution
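Take-away 5 can be checked numerically on a toy one-dimensional "unit cell". The Python sketch below assumes two point atoms with made-up scattering factors sitting exactly on a 16-point grid, so that the discrete sums reproduce the structure-factor and density formulas from the slides. The 1/N normalisation is this sketch's stand-in for the 1/V factor that the slides leave out.

```python
import cmath
import math

N = 16                                   # grid points along the toy cell
atoms = [(6.0, 2 / N), (8.0, 7 / N)]     # (f_j, fractional x_j), invented

def structure_factors(atom_list, n=N):
    """F(h) = sum_j f_j exp(2*pi*i*h*x_j) for h = 0..n-1."""
    return [sum(f * cmath.exp(2j * math.pi * h * x) for f, x in atom_list)
            for h in range(n)]

def density(F, n=N):
    """rho(x_k) = (1/n) sum_h F(h) exp(-2*pi*i*h*k/n) on the grid."""
    return [sum(F[h] * cmath.exp(-2j * math.pi * h * k / n)
                for h in range(n)).real / n
            for k in range(n)]

F_full = structure_factors(atoms)
rho_full = density(F_full)               # recovers 6 at k = 2 and 8 at k = 7

# Every reflection carries a wave from every atom:
# deleting one atom changes every F(h).
F_one_atom = structure_factors(atoms[:1])
change_in_F = [abs(a - b) for a, b in zip(F_full, F_one_atom)]

# Every map point carries a contribution from every reflection:
# deleting one reflection (h = 3 and its Friedel mate, index N - 3 on this
# grid) changes the density at every grid point.
F_omit = list(F_full)
F_omit[3] = 0
F_omit[N - 3] = 0
change_in_rho = [a - b for a, b in zip(rho_full, density(F_omit))]

print([round(v, 3) for v in rho_full])
print([round(v, 3) for v in change_in_rho])
```

For this particular choice of atoms the omitted reflection perturbs all 16 grid points. With other positions a few points could happen to fall on a node of the omitted wave and change by exactly zero.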