Mass spectrometry based proteomics (2)

Kenny Helsens
[email protected]
Department of Biochemistry, Ghent University
Department of Medical Protein Research, VIB
Ghent, Belgium
Kenny Helsens [email protected] Mass spectrometry based proteomics (2)
EBI Bioinformatics Roadshow - Prague - 7 Sep 2011
The central paradigm
- Primary structure (sequence): …YSFVATAER…
- Secondary structure (structural elements)
- Tertiary structure (3D shape)
- Modifications (dynamic, function): phosphorylation
- Processing (targeting, activation): trypsin, platelet activity
Adapted from the NCBI Science Primer, http://www.ncbi.nih.gov/About/primer/genetics_cell.html
2D-PAGE separation of proteins (Est. 1975)
[Figure: principle of 2D-PAGE. Cells undergo cell lysis and protein extraction, giving a protein mixture (Protein A, Protein B, Protein C, Protein D) that 2D-PAGE separates by pI and Mr. Additional label: Chemistry toolbox.]
2D-PAGE separation of proteins (Est. 1975)
[Figure: 2D-PAGE workflow. Protein extraction gives a complex protein mixture; 2D-PAGE separation (pI versus MW); tryptic digest; MS analysis (spectrum, % intensity versus m/z); fragmentation; MS/MS analysis (spectrum, % intensity versus m/z). Image source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm]
Overall gel-free proteomics workflow
[Figure: gel-free workflow. Protein extraction gives a complex protein mixture; enzymatic digest gives an extremely complex peptide mixture; separation and selection give less complex peptide fractions; MS analysis is followed by data-dependent MS/MS analyses (spectra, % intensity versus m/z). Image source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm]
Going gel-free in the new millennium
•  ICAT (Gygi et al., 1999)
•  MudPIT (Washburn et al., 2001)
•  Accurate Mass Tags for proteome analysis (Conrads et al., 2000)
•  Signature Peptides approach for proteomics (Ji et al., 2000)
•  AA-based covalent chromatography peptide selection (Wang & Regnier, 2001)
•  Affinity-based enrichment of phosphopeptides (Oda et al., 2001)
•  ICAT for phosphopeptides (Zhou et al., 2001)
•  Reversible biotinylation of Cys-peptides (Spahr et al., 2000)
•  COFRADIC (Gevaert et al., 2002)
An overview of the pros and cons
•  Massive increase in mixture redundancy (e.g. membrane proteins)
→  Corresponding increase in mixture complexity (from a few thousand proteins to hundreds of thousands of peptides!)
•  Easier separation of peptides instead of proteins
→  Loss of protein-level information (pI, MW, isoforms)
•  Mixture complexity can be reduced by peptide selection (Cys-peptides, Met-peptides, N-terminal peptides, phosphopeptides, …)
→  Again leading to reduced redundancy of the mixture
•  Choice of selection technique, depending on circumstances/analyte
→  Massive amounts of data generated (10,000 spectra per hour)
•  Additional processing information (N-terminal peptides)
→  Unadapted database search engines (N-terminal processing)
AN INCOMPLETE OVERVIEW
OF GEL-FREE TECHNIQUES
MudPIT: that which we call a rose…
Strong cation exchanger (SCX)
Reverse-phase resin (RP)
ESI-based MS
•  Orthogonal, 2D separation of peptides
•  2D-PAGE analogue: pI = SCX, Mr = RP
But what about the complexity?
e.g., Escherichia coli
4,349 predicted proteins
if 100% expressed
109,934 detectable tryptic peptides
if 50% expressed
54,967 detectable tryptic peptides
Sample complexity increased one order of magnitude!
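The peptide counts on this slide come from digesting every predicted protein in silico. The following is a minimal sketch of such a digest, assuming the common rule that trypsin cleaves after K or R unless the next residue is P, with no missed cleavages. The length window used to call a peptide "detectable" (`min_len`, `max_len`) is a hypothetical choice; the slide does not state which criteria were used.

```python
import re

def tryptic_digest(protein, min_len=6, max_len=30):
    """Return tryptic peptides of a protein sequence.

    Assumed rule: cleave after K or R, but not before P; no missed cleavages.
    The length window is a hypothetical stand-in for "detectable".
    """
    # Zero-width split points: preceded by K/R and not followed by P.
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if min_len <= len(p) <= max_len]

# Toy sequence built from peptides shown elsewhere in these slides.
print(tryptic_digest("MKYSFVATAERHETSINGKPLLR"))
```

For scale, 109,934 peptides from 4,349 proteins is roughly 25 detectable peptides per protein.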
A thought experiment seems appropriate
What happens when there are 100,000 peptides present?
How often do we need to repeat an analysis of an identical sample in order to obtain reasonable coverage?
The explorative aspect
[Figure: coverage (0 to 100,000 peptides, complete coverage at 100,000) versus analysis round (0 to 30), with curves for capacities of 10,000, 20,000 and 50,000, and year labels 2002, 2006 and 2010.]
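The thought experiment can be put into numbers with a simple model. This sketch assumes that every round picks `capacity` peptides uniformly at random from the `total` present, independently of earlier rounds. That is an idealisation (real instruments favour abundant peptides), and the function names are hypothetical.

```python
def expected_coverage(total, capacity, rounds):
    """Expected number of distinct peptides seen after `rounds` analyses.

    Assumes uniform random selection of `capacity` peptides per round,
    so a given peptide is missed in one round with probability 1 - capacity/total.
    """
    missed = (1 - capacity / total) ** rounds
    return total * (1 - missed)

def rounds_needed(total, capacity, target_fraction):
    """Smallest number of rounds whose expected coverage reaches the target."""
    rounds = 0
    while expected_coverage(total, capacity, rounds) < target_fraction * total:
        rounds += 1
    return rounds

for capacity in (10_000, 20_000, 50_000):
    print(capacity, rounds_needed(100_000, capacity, 0.9))
```

Under these assumptions, 90% expected coverage of 100,000 peptides takes 22 rounds at a capacity of 10,000 and 4 rounds at a capacity of 50,000.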
MS/MS IDENTIFICATION
Peptide sequences and MS/MS spectra
[Figure: MS/MS spectrum (intensity versus m/z) of the peptide LENNART, with peaks labelled by its prefix fragments (L, LE, LEN, LENN, LENNA, LENNAR, LENNART) and suffix fragments (T, RT, ART, NART, NNART, ENNART, LENNART).]
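The fragment ladder for LENNART can be computed directly from the sequence. The sketch below assumes singly charged b ions (prefixes) and y ions (suffixes) and standard monoisotopic residue masses; the slide itself does not name the ion types, so that mapping is an assumption, and `fragment_ions` is a hypothetical helper.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565

def fragment_ions(peptide):
    """Return ([(prefix, b m/z)], [(suffix, y m/z)]) for charge 1+."""
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        prefix, suffix = peptide[:i], peptide[i:]
        # b ion: residues of the prefix plus one proton.
        b_ions.append((prefix, sum(MONO[a] for a in prefix) + PROTON))
        # y ion: residues of the suffix plus water plus one proton.
        y_ions.append((suffix, sum(MONO[a] for a in suffix) + WATER + PROTON))
    return b_ions, y_ions

b_ions, y_ions = fragment_ions("LENNART")
for name, mz in b_ions + y_ions:
    print(f"{name:8s} {mz:10.4f}")
```

The sketch leaves out the full-length sequence, which the figure shows at the end of both ladders.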
Peptide fragment fingerprinting (PFF)
[Figure: peptide fragment fingerprinting. A protein sequence database is digested in silico into peptide sequences (YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …). Theoretical MS/MS spectra (Int versus m/z) are generated in silico for these peptides and matched in silico against the experimental MS/MS spectrum. The result is a ranked list of candidates, 1) YSFVATAER, 2) YSFVSAIR, 3) FFLIGGGGK, with peptide scores 34, 12 and 12.]
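The matching step can be illustrated with the simplest possible score, a shared peak count. This is a sketch only: the search engines discussed in these slides use more elaborate scoring, and the m/z values, tolerance and peptide names below are made up for illustration.

```python
def shared_peak_count(theoretical, experimental, tol=0.5):
    """Count theoretical m/z values that have an experimental peak within tol."""
    return sum(
        any(abs(t - e) <= tol for e in experimental)
        for t in theoretical
    )

def rank_candidates(candidates, experimental, tol=0.5):
    """Score every candidate peptide and return (score, peptide), best first."""
    scored = [
        (shared_peak_count(mzs, experimental, tol), peptide)
        for peptide, mzs in candidates.items()
    ]
    return sorted(scored, reverse=True)

# Hypothetical experimental peak list and theoretical fragment m/z values.
experimental = [114.1, 243.1, 357.2, 500.0]
candidates = {
    "PEPA": [114.09, 243.13, 357.18],
    "PEPB": [114.09, 300.0, 410.0],
}
print(rank_candidates(candidates, experimental))
```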
Peptide fragment fingerprinting (PFF)
Reference: Jimmy K. Eng et al., Molecular and Cellular Proteomics, 2011, in press
Different types of PFF identification
Spectral comparison: database sequence → theoretical spectrum → compare with experimental spectrum
Sequential comparison: experimental spectrum → de novo sequence → compare with database sequence
From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007
The most popular algorithms
•  MASCOT (Matrix Science) http://www.matrixscience.com
•  SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest
•  X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM
•  OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/
Overall concept of scores and cut-offs
[Figure: overlapping score distributions of incorrect and correct identifications, divided by a threshold score. Correct identifications below the threshold are false negatives; incorrect identifications above it are false positives.]
Adapted from: www.proteomesoftware.com – Wiki pages
Playing with probabilistic cut-off scores
[Figure: identifications (0% to 100%) and false positives (0% to 6%) at cut-offs of increasing stringency: p=0.05, p=0.01, p=0.005, p=0.0005.]
SEQUEST
•  Very well established search engine
•  Can be used for MS/MS (PFF) identifications
•  Based on a cross-correlation score (includes experimental peak height)
•  Published core algorithm (patented, licensed to Thermo Fisher Scientific)
•  Provides preliminary (Sp) score, rank, cross-correlation score (XCorr), and score difference between the top two ranks (deltaCn, ΔCn)
•  Thresholding is up to the user, and is commonly done per charge state
•  Many extensions exist to perform a more automatic validation of results

$$\mathrm{XCorr} = \frac{\mathrm{CrossCorr}}{\mathrm{avg}\left(\mathrm{AutoCorr}_{\mathrm{offset}=-75 \ldots 75}\right)}$$

$$\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$
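The deltaCn definition and the per-charge-state thresholding mentioned on this slide can be sketched as follows. The cut-off values in `XCORR_MIN` and `min_delta_cn` are hypothetical and serve only to show the mechanism; since thresholding is up to the user, suitable values have to be chosen per data set.

```python
def delta_cn(xcorrs):
    """Relative XCorr difference between the two best candidates of one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2:
        raise ValueError("deltaCn needs at least two candidate peptides")
    return (ranked[0] - ranked[1]) / ranked[0]

# Hypothetical XCorr cut-offs per precursor charge state (illustration only).
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.75}

def accept(xcorrs, charge, min_delta_cn=0.1):
    """Accept the top hit if both XCorr and deltaCn pass their cut-offs."""
    return max(xcorrs) >= XCORR_MIN[charge] and delta_cn(xcorrs) >= min_delta_cn

candidate_xcorrs = [3.0, 2.4, 1.0]
print(delta_cn(candidate_xcorrs))
print(accept(candidate_xcorrs, charge=2), accept(candidate_xcorrs, charge=3))
```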
SEQUEST: some additional pictures
From: MacCoss et al., Anal. Chem. 2002
From: Peng et al., J. Prot. Res. 2002
Mascot
•  Very well established search engine
•  Can do MS (PMF) and MS/MS (PFF) identifications
•  Based on the MOWSE score
•  Unpublished core algorithm (trade secret)
•  Predicts an a priori threshold score that identifications need to pass
•  From version 2.2, Mascot allows integrated decoy searches
•  Provides rank, score, threshold and expectation value per identification
•  Customizable confidence level for the threshold score
Mascot: some additional pictures
[Figure: average identity threshold (0 to 40) versus log10(number of AA) (6.50 to 8.50), with linear fit y = 8.3761x - 34.089, R² = 0.9985.]
[Figure: identifications (0% to 100%) and false positives (0% to 6%) at p=0.05, p=0.01, p=0.005 and p=0.0005.]
X!Tandem
•  A successful open source search engine
•  Can be used for MS/MS (PFF) identifications
•  Based on a hyperscore ($P_i$ is either 0 or 1):

$$\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$$

•  Relies on a hypergeometric distribution (hence hyperscore)
•  Published core algorithm, and is freely available
•  Provides hyperscore and expectancy score (the discriminating one)
•  X!Tandem is fast and can handle modifications in an iterative fashion
•  Has rapidly gained popularity as (auxiliary) search engine
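The hyperscore formula on this slide translates almost literally into code. In this sketch, `intensities` holds the peak intensities I_i, `matched` holds the 0/1 flags P_i, and `n_b` and `n_y` are the numbers of matched b and y ions; the argument names and the example values are assumptions made for illustration.

```python
from math import factorial

def hyperscore(intensities, matched, n_b, n_y):
    """HyperScore = (sum of I_i * P_i) * N_b! * N_y!"""
    dot_product = sum(i * p for i, p in zip(intensities, matched))
    return dot_product * factorial(n_b) * factorial(n_y)

# Three peaks, of which the first and third match a theoretical fragment.
print(hyperscore([10, 20, 30], [1, 0, 1], n_b=2, n_y=3))  # 40 * 2! * 3! = 480
```

The factorial terms grow quickly, which is why score distributions are usually plotted on a log scale.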
X!Tandem: some additional pictures
[Figure: # results and log(# results) versus hyperscore, with the significance threshold marked and an example annotated E-value = e-8.2.]
Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf
OMSSA
•  A successful open source search engine
•  Can be used for MS/MS (PFF) identifications
•  Relies on a Poisson distribution
•  Published core algorithm, and is freely available
•  Provides an expectancy score, similar to the BLAST E-value
•  OMSSA was recently upgraded to take peak intensity into account
•  Got really good marks in a recently published comparative study
COMPARATIVE STUDIES
Combining the output of search algorithms
[Figure: four-way Venn diagram combining the identifications of SEQUEST (3792), Mascot (3229), ProteinSolver (3203) and Phenyx (3186). Region counts as extracted: 212 (+4.2%), 486 (+9.6%), 179, 40, 329 (+6.5%), 168, 380 (+7.5%), 348, 501, 1776, 139, 195, 77, 96, 146.]
Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project
POST-IDENTIFICATION VALIDATION
Don’t simply trust computers...
Automatic software translation of the caption on the picture below
stated that this picture of Japanese Scouts was taken during an
International Scouting Jamboree in the Netherlands
The Netherlands???
Eliminating false positives
Suspect peptide identifications happen.
The problem is that finding them requires
detailed analysis of a single spectrum and
its identifications, amongst thousands of
other spectra…
ESTIMATING FALSE DISCOVERY RATES
THE DECOY DATABASE APPROACH
Decoy databases, the latest fashion
Three main types of decoy DBs are used:
- Reversed databases (easy)
  LENNARTMARTENS → SNETRAMTRANNEL
- Shuffled databases (slightly more difficult)
  LENNARTMARTENS → NMERLANATERTTN (for instance)
- Randomized databases (as difficult as you want it to be)
  LENNARTMARTENS → GFVLAEPHSEAITK (for instance)
The concept is that each peptide identified from the decoy database is an incorrect identification. By counting the number of decoy hits, we can estimate the number of false positives in the original database.
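The three decoy types can be sketched in a few lines. The function names are hypothetical, the fixed seeds only make the example reproducible, and the uniform amino acid alphabet in `random_decoy` is the simplest possible choice (a more "difficult" randomization would, for example, follow the amino acid frequencies of the target database).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def reversed_decoy(sequence):
    """Reversed database entry: the sequence read back to front."""
    return sequence[::-1]

def shuffled_decoy(sequence, seed=0):
    """Shuffled database entry: same residues, random order."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

def random_decoy(length, seed=0):
    """Randomized database entry: residues drawn uniformly (an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

target = "LENNARTMARTENS"
print(reversed_decoy(target))   # SNETRAMTRANNEL
print(shuffled_decoy(target))
print(random_decoy(len(target)))
```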
Estimating the FDR (i)
$$\mathrm{FDR} = \frac{2 \times \mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits} + \mathrm{nbr\_decoy\_hits}}$$

FDR is the False Discovery Rate: a metric that indicates what percentage of your identifications is potentially incorrect. Note that we multiply the number of decoy hits by 2, because we should not only count the actual decoy hits, but also the 'hidden' false positives that are present in the forward identifications. The assumption here is that we expect one forward false positive hit per decoy false positive hit, hence the doubling term.
From: Elias and Gygi, Nature Methods 2007, 4(3): 207-214
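As a worked example of this estimate, the sketch below applies the formula to made-up hit counts. The function name and the numbers are hypothetical.

```python
def fdr_doubled(n_forward, n_decoy):
    """FDR = 2 * decoy hits / (forward hits + decoy hits)."""
    total = n_forward + n_decoy
    if total == 0:
        return 0.0
    return 2 * n_decoy / total

# 980 forward hits and 20 decoy hits above the score threshold:
# 2 * 20 / (980 + 20) = 0.04, i.e. an estimated FDR of 4%.
print(fdr_doubled(980, 20))
```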
Estimating the FDR (ii)
$$\mathrm{FDR} = \frac{\mathrm{nbr\_decoy\_hits}}{\mathrm{nbr\_forward\_hits}}$$

This metric was proposed by Storey and Tibshirani for genomics data, and further investigated by Lukas Käll for proteomics. It provides a more accurate (and simpler!) estimate of the FDR, and can be extended to also take into account the (suspected) false positives in the forward set.
See: Storey and Tibshirani, PNAS 2003, 100(16): 9440-9445
See: Käll et al., JPR 2008, 7(1): 29-34
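In practice this ratio is often used to pick a score cut-off. The sketch below assumes a list of `(score, is_decoy)` hits from a search against forward and decoy sequences. It walks down the ranked list and keeps the lowest score at which the estimated FDR still meets the target. The helper names and example data are hypothetical, and tied scores are not treated specially.

```python
def fdr_ratio(n_forward, n_decoy):
    """FDR = decoy hits / forward hits."""
    return n_decoy / n_forward if n_forward else 0.0

def score_cutoff(hits, max_fdr=0.01):
    """Lowest score cut-off whose estimated FDR is <= max_fdr, or None."""
    best = None
    n_forward = n_decoy = 0
    for score, is_decoy in sorted(hits, reverse=True):
        if is_decoy:
            n_decoy += 1
        else:
            n_forward += 1
        if n_forward and fdr_ratio(n_forward, n_decoy) <= max_fdr:
            best = score
    return best

hits = [(50, False), (45, False), (40, False), (35, True), (30, False), (25, False)]
print(score_cutoff(hits, max_fdr=0.10))  # 40: below this, the decoy hit pushes FDR up
print(score_cutoff(hits, max_fdr=0.25))  # 25: 1 decoy / 5 forward = 0.20
```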
PROTEIN INFERENCE
Not all peptides are created equal
[Figure: a gene (exons 1a, 1b, 2, 3, …) gives rise to several transcripts and their translations. The resulting peptides fall into three classes: matching all transcripts, matching a transcript subset, or matching exactly 1 translation. Legend: intron, exon UTR, exon CDS, redundant peptide.]
Protein inference: a question of conviction
William of Ockham (c. 1285–1349) is remembered as an influential nominalist but his popular fame as a great logician rests chiefly on the maxim attributed to him and known as Ockham's razor. The term razor (the German "Ockhams Messer" translates to "Occam's knife") refers to distinguishing between two theories either by "shaving away" unnecessary assumptions or cutting apart two similar theories.
The simplest explanation that covers all the facts is usually the best.
http://en.wikipedia.org/wiki/Occam's_razor
Protein inference: a question of conviction
[Figure: three peptide-to-protein assignment tables for peptides a, b, c, d and proteins prot X, prot Y, prot Z: the minimal set (Occam), the maximal set (anti-Occam), and the minimal set with maximal annotation (true Occam?), in which the proteins are marked prot X (-), prot Y (+) and prot Z (0).]
See: Martens and Hermjakob, Molecular BioSystems, 2007
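The Occam-style minimal set can be approximated with a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. This is a sketch under stated assumptions. The peptide-to-protein mapping below is hypothetical (the assignments in the figure are not recoverable from the extracted text), and a greedy cover is a heuristic that does not guarantee the smallest possible set.

```python
def minimal_protein_set(protein_to_peptides):
    """Greedy set cover: proteins needed to explain all identified peptides."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Ties are broken alphabetically because max() keeps the first maximum.
        best = max(
            sorted(protein_to_peptides),
            key=lambda protein: len(protein_to_peptides[protein] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

# Hypothetical mapping of proteins to their identified peptides.
mapping = {
    "prot X": {"a", "b"},
    "prot Y": {"b", "c", "d"},
    "prot Z": {"d"},
}
print(minimal_protein_set(mapping))  # ['prot Y', 'prot X']
```

The anti-Occam maximal set would simply be every protein with at least one matching peptide, here all three.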
A few algorithms for protein inference
•  ProteinProphet: Nesvizhskii AI et al., Analytical Chemistry, 2003
•  MassSieve: Slotta et al., Proteomics, 2010, http://www.proteomecommons.org/dev/masssieve
•  DBToolkit: Martens et al., Bioinformatics, 2005, http://genesis.UGent.be/dbtoolkit
•  IDPicker: Zhang et al., Journal of Proteome Research, 2007, http://fenchurch.mc.vanderbilt.edu/bumbershoot/idpicker/
DBToolkit protein inference
[Figure: minimal set with maximal annotation, shown as a peptide-to-protein table for peptides a, b, c, d and proteins prot X (-), prot Y (+) and prot Z (0).]
IDPicker parsimonious protein assembly
(I) Initialize
(II) Collapse
(III) Separate
(IV) Reduce
Zhang et al., Journal of Proteome Research, 2007
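Of the four steps, "Collapse" is the easiest to illustrate. The sketch below assumes that collapsing means merging proteins that are connected to exactly the same set of peptides into one group; this is my reading of the step, not code from IDPicker, and the function name and example data are hypothetical.

```python
from collections import defaultdict

def collapse_proteins(protein_to_peptides):
    """Group proteins that are supported by an identical set of peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    # Key each group by its sorted member names for a stable result.
    return {tuple(sorted(names)): set(peptides) for peptides, names in groups.items()}

mapping = {
    "P1": {"a", "b"},
    "P2": {"a", "b"},  # indistinguishable from P1 on peptide evidence
    "P3": {"c"},
}
print(collapse_proteins(mapping))
```

The collapsed groups could then be passed on to the later steps: splitting into connected components and reducing each one to a parsimonious set.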
PUBLIC DATA DISSEMINATION
PRIDE
http://www.ebi.ac.uk/pride/
PRIDE Converter
PRIDE Inspector
Thank you!
Questions?