Mass spectrometry based proteomics (2)
Kenny Helsens
[email protected]
Department of Biochemistry, Ghent University
Department of Medical Protein Research, VIB
Ghent, Belgium
EBI Bioinformatics Roadshow - Prague - 7 Sep 2011

The central paradigm
- Primary structure (sequence): …YSFVATAER…
- Secondary structure (structural elements)
- Tertiary structure (3D shape)
- Modifications (dynamic, function), e.g. phosphorylation
- Processing (targeting, activation), e.g. trypsin, platelet activity
Adapted from the NCBI Science Primer, http://www.ncbi.nih.gov/About/primer/genetics_cell.html

2D-PAGE separation of proteins (Est. 1975)
Principle
[Figure: cells undergo cell lysis and protein extraction; the resulting protein mixture (Protein A, Protein B, Protein C, Protein D) is separated by 2D-PAGE on pI and Mr. Label: chemistry toolbox.]

2D-PAGE separation of proteins (Est. 1975)
[Figure: workflow from protein extraction (complex protein mixture) through 2D-PAGE separation (pI against MW) and tryptic digest to MS analysis and, after fragmentation, MS/MS analysis. Spectra are plotted as relative intensity (%) against m/z. Image source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm]

Overall gel-free proteomics workflow
[Figure: protein extraction yields a complex protein mixture; enzymatic digest yields an extremely complex peptide mixture; separation and selection yield less complex peptide fractions; these go to MS analysis and data-dependent MS/MS analyses. Spectra are plotted as relative intensity (%) against m/z. Image source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm]

Going gel-free in the new millennium
• ICAT (Gygi et al., 1999)
• MudPIT (Washburn et al., 2001)
• Accurate Mass Tags for proteome analysis (Conrads et al., 2000)
• Signature Peptides approach for proteomics (Ji et al., 2000)
• AA-based covalent chromatography peptide selection (Wang & Regnier, 2001)
• Affinity-based enrichment of phosphopeptides (Oda et al., 2001)
• ICAT for phosphopeptides (Zhou et al., 2001)
• Reversible biotinylation of Cys-peptides (Spahr et al., 2000)
• COFRADIC (Gevaert et al., 2002)

An overview of the pros and cons
• Massive increase in mixture redundancy (e.g. membrane proteins)
  Con: corresponding increase in mixture complexity (from a few thousand proteins to hundreds of thousands of peptides!)
• Easier separation of peptides instead of proteins
  Con: loss of protein-level information (pI, MW, isoforms)
• Mixture complexity can be reduced by peptide selection (Cys-peptides, Met-peptides, N-terminal peptides, phospho-peptides, …)
  Con: again leading to reduced redundancy of the mixture
• Choice of selection technique, depending on circumstances/analyte
  Con: massive amounts of data generated (10,000 spectra per hour)
• Additional processing information (N-terminal peptides)
  Con: unadapted database search engines (N-terminal processing)

AN INCOMPLETE OVERVIEW OF GEL-FREE TECHNIQUES

MudPIT: that which we call a rose…
[Figure: a column packed with a strong cation exchanger (SCX) followed by reverse-phase resin (RP), coupled to ESI-based MS.]
• Orthogonal, 2D separation of peptides
• 2D analogy: pI = SCX, Mr = RP

But what about the complexity?
e.g., Escherichia coli: 4,349 predicted proteins
if 100% expressed: 109,934 detectable tryptic peptides
if 50% expressed: 54,967 detectable tryptic peptides
Sample complexity increased by one order of magnitude!

A thought experiment seems appropriate
What happens when there are 100,000 peptides present?
How often do we need to repeat an analysis of an identical sample in order to obtain reasonable coverage?
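The thought experiment can be approximated with a few lines of Python. This is only a sketch under a strong simplifying assumption that is not stated on the slides: in every round the instrument picks `capacity` distinct peptides uniformly at random from the `n_peptides` present, independently of earlier rounds. Real data-dependent acquisition favours abundant peptides, so real coverage grows more slowly. The function names are hypothetical.

```python
import math


def expected_coverage(n_peptides: int, capacity: int, rounds: int) -> float:
    """Expected number of distinct peptides seen after `rounds` analyses.

    Assumes (hypothetically) uniform random selection of `capacity`
    distinct peptides per round, independent between rounds.
    """
    # Probability that one given peptide is missed in a single round.
    p_missed_once = 1.0 - capacity / n_peptides
    return n_peptides * (1.0 - p_missed_once ** rounds)


def rounds_for_coverage(n_peptides: int, capacity: int, target_fraction: float) -> int:
    """Smallest number of rounds whose expected coverage reaches the target."""
    if capacity >= n_peptides:
        return 1  # a single round already sees everything
    p_missed_once = 1.0 - capacity / n_peptides
    return math.ceil(math.log(1.0 - target_fraction) / math.log(p_missed_once))


if __name__ == "__main__":
    for capacity in (10_000, 20_000, 50_000):
        needed = rounds_for_coverage(100_000, capacity, 0.95)
        print(f"capacity {capacity}: {needed} rounds for 95% expected coverage")
```

Under this model the expected coverage approaches, but never exactly reaches, the full 100,000 peptides, which is why a target fraction is used instead of complete coverage.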
The explorative aspect
[Figure: coverage (0 to 100,000 peptides; complete coverage at 100,000) plotted against analysis round (0 to 30), with curves for per-round capacities of 10,000, 20,000 and 50,000. Year labels on the plot: 2002, 2006, 2010.]

MS/MS IDENTIFICATION

Peptide sequences and MS/MS spectra
[Figure: fragment ladder for the peptide LENNART, plotted as intensity against m/z. Fragments shown include L, LE, LEN, LENN, LENNA, LENNAR, LENNART and T, RT, ART, NART, NNART, ENNART.]

Peptide fragment fingerprinting (PFF)
[Figure: a protein sequence database is digested in silico into peptide sequences (YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …). In silico MS/MS turns these into theoretical MS/MS spectra (intensity against m/z), which are matched in silico against the experimental MS/MS spectrum. The result is a list of peptide scores: 1) YSFVATAER 34, 2) YSFVSAIR 12, 3) FFLIGGGGK 12.]

Peptide fragment fingerprinting (PFF)
Reference: Jimmy K. Eng et al., Molecular and Cellular Proteomics, 2011, in press

Different types of PFF identification
Spectral comparison: database sequence → theoretical spectrum, compared with the experimental spectrum
Sequential comparison: experimental spectrum → de novo sequence, compared with the database sequence
From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007

The most popular algorithms
• MASCOT (Matrix Science) http://www.matrixscience.com
• SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest
• X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM
• OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/

Overall concept of scores and cut-offs
[Figure: overlapping score distributions of incorrect and correct identifications, divided by a threshold score. Correct identifications below the threshold are false negatives; incorrect identifications above it are false positives.]
Adapted from: www.proteomesoftware.com – Wiki pages

Playing with probabilistic cut-off scores
[Figure: identifications (0% to 100%) and false positives (0% to 6%) at p=0.05, p=0.01, p=0.005 and p=0.0005; stringency increases toward smaller p.]

SEQUEST
• Very well established search engine
• Can be used for MS/MS (PFF) identifications
• Based on a cross-correlation score (includes experimental peak height)
• Published core algorithm (patented, licensed to Thermo Fisher Scientific)
• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr), and score difference between the top two ranks (deltaCn, ΔCn)
• Thresholding is up to the user, and is commonly done per charge state
• Many extensions exist to perform a more automatic validation of results

$$\mathrm{XCorr} = \frac{\mathrm{CrossCorr}}{\mathrm{avg}\left(\mathrm{AutoCorr}_{\mathrm{offset}=-75\ \mathrm{to}\ 75}\right)}$$

$$\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$

SEQUEST: some additional pictures
From: MacCoss et al., Anal. Chem. 2002
From: Peng et al., J. Prot. Res. 2002

Mascot
• Very well established search engine
• Can do MS (PMF) and MS/MS (PFF) identifications
• Based on the MOWSE score
• Unpublished core algorithm (trade secret)
• Predicts an a priori threshold score that identifications need to pass
• From version 2.2, Mascot allows integrated decoy searches
• Provides rank, score, threshold and expectation value per identification
• Customizable confidence level for the threshold score

Mascot: some additional pictures
[Figure: average identity threshold (0 to 40) plotted against log10(number of AA) (6.50 to 8.50), with linear fit y = 8.3761x - 34.089, R² = 0.9985.]
[Figure: identifications (0% to 100%) and false positives (0% to 6%) at p=0.05, p=0.01, p=0.005 and p=0.0005.]

X!Tandem
• A successful open source search engine
• Can be used for MS/MS (PFF) identifications
• Based on a hyperscore ($P_i$ is either 0 or 1):

$$\mathrm{HyperScore} = \left(\sum_{i=0}^{n} I_i \cdot P_i\right) \cdot N_b! \cdot N_y!$$

• Relies on a hypergeometric distribution (hence hyperscore)
• Published core algorithm, and is freely available
• Provides hyperscore and expectancy score (the discriminating one)
• X!Tandem is fast and can handle modifications in an iterative fashion
• Has rapidly gained popularity as (auxiliary) search engine

X!Tandem: some additional pictures
[Figure: histogram of the number of results against hyperscore; the same data as log(# results) against hyperscore; and a straight-line extrapolation of log(# results) against hyperscore up to the significance threshold, annotated E-value = e-8.2.]
Adapted from: Brian Searle, Proteome Software, http://www.proteomesoftware.com/XTandem_edited.pdf

OMSSA
• A successful open source search engine
• Can be used for MS/MS (PFF) identifications
• Relies on a Poisson distribution
• Published core algorithm, and is freely available
• Provides an expectancy score, similar to the BLAST E-value
• OMSSA was recently upgraded to take peak intensity into account
• Got really good marks in a recently published comparative study

COMPARATIVE STUDIES

Combining the output of search algorithms
[Figure: four-way Venn diagram of identifications. Totals per search engine: SEQUEST 3792, Mascot 3229, ProteinSolver 3203, Phenyx 3186. Region counts shown in the diagram: 212 (+4.2%), 486 (+9.6%), 329 (+6.5%), 380 (+7.5%), 179, 40, 168, 348, 501, 1776, 139, 195, 77, 96, 146.]
Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project

POST-IDENTIFICATION VALIDATION

Don't simply trust computers...
Automatic software translation of the caption on the picture below stated that this picture of Japanese Scouts was taken during an International Scouting Jamboree in the Netherlands.
The Netherlands???

Eliminating false positives
Suspect peptide identifications happen. The problem is that finding them requires detailed analysis of a single spectrum and its identifications, amongst thousands of other spectra…

ESTIMATING FALSE DISCOVERY RATES: THE DECOY DATABASE APPROACH

Decoy databases, the latest fashion
Three main types of decoy DBs are used:
- Reversed databases (easy)
  LENNARTMARTENS → SNETRAMTRANNEL
- Shuffled databases (slightly more difficult)
  LENNARTMARTENS → NMERLANATERTTN (for instance)
- Randomized databases (as difficult as you want it to be)
  LENNARTMARTENS → GFVLAEPHSEAITK (for instance)
The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.
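The three decoy types can be sketched in Python as follows. This is a minimal illustration, not the procedure of any particular tool: the function names are hypothetical, the shuffling and randomization use a fixed seed only to keep the example reproducible, and the randomized variant draws residues uniformly from the 20 standard amino acids, which is just one of many possible choices.

```python
import random

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def reversed_decoy(sequence: str) -> str:
    """Reversed decoy: read the sequence back to front."""
    return sequence[::-1]


def shuffled_decoy(sequence: str, seed: int = 0) -> str:
    """Shuffled decoy: same residues, random order."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def randomized_decoy(sequence: str, seed: int = 0) -> str:
    """Randomized decoy: same length, residues drawn uniformly at random.

    Uniform sampling is an assumption made for this sketch; a more
    elaborate generator could follow observed amino acid frequencies.
    """
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)


if __name__ == "__main__":
    target = "LENNARTMARTENS"
    print(reversed_decoy(target))
    print(shuffled_decoy(target))
    print(randomized_decoy(target))
```

Reversal and shuffling preserve the amino acid composition of each entry, whereas a fully randomized sequence does not unless the generator is designed to.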
Estimating the FDR (i)

$$\mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}}$$

FDR is the False Discovery Rate: a metric that indicates what percentage of your identifications are potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should count not only the actual decoy hits, but also the 'hidden' false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.
From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214

Estimating the FDR (ii)

$$\mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}}$$

This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, and can be extended to also take into account the (suspected) false positives in the forward set.
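Both estimators are simple enough to write down directly. The sketch below mirrors the two formulas; the function names and the example counts are made up for illustration and do not come from the slides.

```python
def fdr_doubled_decoys(n_forward_hits: int, n_decoy_hits: int) -> float:
    """FDR (i): 2 * decoy hits / (forward hits + decoy hits)."""
    total_hits = n_forward_hits + n_decoy_hits
    if total_hits == 0:
        raise ValueError("no identifications above the threshold")
    return 2 * n_decoy_hits / total_hits


def fdr_decoy_ratio(n_forward_hits: int, n_decoy_hits: int) -> float:
    """FDR (ii): decoy hits / forward hits."""
    if n_forward_hits == 0:
        raise ValueError("no forward identifications above the threshold")
    return n_decoy_hits / n_forward_hits


if __name__ == "__main__":
    # Hypothetical counts at some score threshold.
    forward, decoy = 950, 50
    print(f"FDR (i):  {fdr_doubled_decoys(forward, decoy):.3f}")
    print(f"FDR (ii): {fdr_decoy_ratio(forward, decoy):.3f}")
```

In practice either function would be evaluated at a series of score thresholds, and the threshold chosen where the estimate drops below the desired FDR.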
See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445
See: Käll et al., JPR 2008, 7(1): 29-34

PROTEIN INFERENCE

Not all peptides are created equal
[Figure: a gene with exons (UTR and CDS) and introns gives rise to several transcripts and translations. The resulting peptides are classed as matching all transcripts, matching a transcript subset, matching exactly 1 translation, or redundant.]

Protein inference: a question of conviction
William of Ockham (c. 1285–1349) is remembered as an influential nominalist but his popular fame as a great logician rests chiefly on the maxim attributed to him and known as Ockham's razor. The term razor (the German "Ockhams Messer" translates to "Occam's knife") refers to distinguishing between two theories either by "shaving away" unnecessary assumptions or cutting apart two similar theories.
The simplest explanation that covers all the facts is usually the best.
http://en.wikipedia.org/wiki/Occam's_razor

Protein inference: a question of conviction
[Figure: the same peptide-to-protein match matrix (peptides a, b, c, d; proteins prot X, prot Y, prot Z; six matches marked x) shown three times, each with a different interpretation:]
- Minimal set (Occam)
- Maximal set (anti-Occam)
- Minimal set with maximal annotation (true Occam?), with prot X (-), prot Y (+), prot Z (0)
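The "minimal set" option is an instance of the set cover problem: find the fewest proteins that together explain every identified peptide. The sketch below uses the standard greedy heuristic, which is an assumption of this example and not the algorithm of any tool named in these slides; it gives a small set but not always the smallest possible one. The function name and the peptide-to-protein matrix are hypothetical, since the cell positions of the matrix in the figure are not recoverable from this transcript.

```python
def greedy_minimal_protein_set(protein_to_peptides: dict[str, set[str]]) -> list[str]:
    """Greedy set cover: repeatedly pick the protein that explains the
    most still-unexplained peptides. Ties are broken alphabetically."""
    unexplained = set().union(*protein_to_peptides.values())
    chosen: list[str] = []
    while unexplained:
        # sorted() makes the tie-break deterministic: max() keeps the first best.
        best = max(
            sorted(protein_to_peptides),
            key=lambda protein: len(protein_to_peptides[protein] & unexplained),
        )
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical matches, NOT the matrix from the slide.
    matches = {
        "prot X": {"a", "b"},
        "prot Y": {"b", "c"},
        "prot Z": {"c", "d"},
    }
    print(greedy_minimal_protein_set(matches))
```

With these made-up matches, prot X and prot Z explain all four peptides, so prot Y is left out of the minimal set even though two peptides match it. The maximal (anti-Occam) set would simply report all three proteins.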
See: Martens and Hermjakob, Molecular BioSystems, 2007

A few algorithms for protein inference
• ProteinProphet
  Nesvizhskii AI et al., Analytical Chemistry, 2003
• MassSieve
  Slotta et al., Proteomics, 2010
  http://www.proteomecommons.org/dev/masssieve
• DBToolkit
  Martens et al., Bioinformatics, 2005
  http://genesis.UGent.be/dbtoolkit
• IDPicker
  Zhang et al., Journal of Proteome Research, 2007
  http://fenchurch.mc.vanderbilt.edu/bumbershoot/idpicker/

DBToolkit protein inference
Minimal set with maximal annotation
[Figure: peptide-to-protein match matrix with peptides a, b, c, d and proteins prot X (-), prot Y (+), prot Z (0); six matches marked x.]

IDPicker parsimonious protein assembly (I)
Initialize
Zhang et al., Journal of Proteome Research, 2007

IDPicker parsimonious protein assembly (II)
Collapse
Zhang et al., Journal of Proteome Research, 2007

IDPicker parsimonious protein assembly (III)
Separate
Zhang et al., Journal of Proteome Research, 2007

IDPicker parsimonious protein assembly (IV)
Reduce
Zhang et al., Journal of Proteome Research, 2007

PUBLIC DATA DISSEMINATION

PRIDE
http://www.ebi.ac.uk/pride/

PRIDE Converter

PRIDE Inspector

Thank you!
Questions?