PPT - SimulTOF

Comments

Transcription

PPT - SimulTOF
Classification of Proteins and Biomarkers to Tissue Types Using Publicly Available Gene Chip Data
BG- Medicine, Waltham, Massachusetts USA (now at Virgin Instruments Corporation Sudbury, MA)
Introduction
Many proteomics applications have as their goal the identification of protein biomarkers
that can be used to monitor disease status, starting from plasma or tissue biopsies. In
either case, the samples often consist of proteins originally expressed by many cell types,
which may be difficult to separate from one another. The credibility of a biomarker is
strengthened if a case can be made that the biomarker is a member of a class of proteins
derived from a specific cell type in the diseased tissue. A method is described by which
publicly available gene chip data can be used to classify proteins detected by proteomics
according to cells, tissues, and organs that are relevant to the disease in question.
70

50
Example heterogeneous tissue: atherosclerotic plaque
o Known constituents:
 endothelial cells
 smooth muscle
 infiltrating lymphocytes
 plasma
Choose datasets from GEO web site
o Check proteomics data to ensure that the gene profile datasets express the
most abundant proteins
o Novartis dataset (Su et al.)
o White blood cell study (Jeffrey et al.)
Data mining
o Align proteins using gene symbol, saving the largest expression value per
tissue
o Normalize by column (% expression attributed to each gene for tissue
type)
K-means clustering - 49
o Prepare a table of gene vs. % expression
o Perform K means clustering (60 clusters) and PCA (12 components)
90
o Color PCA plots based on K means cluster
80 to K means cluster and PCA
Annotate master gene table according
70
coefficients
60
50
Caveats
40
 Using proteomics data to accomplish
30this would be superior
 Such data needs to be shotgun, and20
retain abundance statistics
10
o Couldn’t find any
0 would have been preferable
 Gene chip data on more relevant tissues
o No decent aorta endothelial dataset
K-means clustering - 52
40
30
20
10
0
Table 2.
K clusters 41-60 (most tissue-specific);
K-means clustering - 50
16 tissues in columns;
90
K clusters in rows.
80
70
Compare deduced profile name to pattern
60
50
at bottom of Fig, 2B.
40
o Some gene datasets don’t merge well with others (apples vs. oranges)

90
80
70
60
50
40
Tissue30
Gene
20
10
0
Table
Profiles
Table 1.
1. Tissue Gene
Profiles
tissue
adipocytes
B cells
90
bone marrow cd33 myeloid
80
70
bone marrow cd34
60
50
endothelial_cd105
40
30
erythroid early cd71
20
10
liver
0
liver_fetal
macrophage
macrophage stimulated 90
80
monocytes cd14
70
60
neutrophil
50
neutrophil stimulated 40
30
20
smooth muscle
10
TH1 cells
0
TH2 cells
T...
TH
2
c
T el
s
ne mo H1 ls
ce
ut
r o oth
ph m lls
u
il
st scl
im e
ul
m
n ate
m
ac on e ut d
ro o c r op
ph yt
hi
ag es
l
c
e
st d1 4
im
m ula
ac
t
ro ed
ph
ag
e
er
liv
yt
l
hr
e
o i ive r r
en d e _fe
ta
do arl
l
bo
th y
ne bo eli cd7
m ne al_c 1
ar m
ro a r d10
ro
w
5
cd w
3 3 cd
m 34
ye
lo
i
B d
ad ce
ip lls
oc
yt
es

60
ref
order K
Su et al.K-means clustering
16 -58
55
Jeffrey et al.
15 59
Su et al.
14 22
Su et al.
13 46
Su et al.
12
7
Su et al.
11 48
Su et al.
9 60
Su et al.
10 57
K-means clustering - 58
Jeffrey et al.
8 52
Jeffrey et al.
7 55
Su et al.
6 33
Jeffrey et al.
5 47
Jeffrey et al.
4 51
Su et al.
3 56
Jeffrey et al.
2 54
Jeffrey
et al.
50 cd3...
sm... Neu
MF_s live...1eryt...
ColorK-means
key:
>50%,
K-means
clustering
- 49
clustering
- 51 >30%,
>20%, >10% of intensity
in indicated tissue.
30
K # type
MF MF_s Neu_s Neu TH1 TH2 B20 smooth cd105 liver liver_fetal adipo mono cd33 erythro cd34
60 71 liver
1.22 1.06 1.23 0.78 1.04 0.97 10
0.87
1.01 0.61 73.88
10.97 3.17 0.71 1.01 0.61 0.87
0
59 33 B
2.01 1.85 1.23 1.02 2.08 2.76 73.13
1.00 1.45 2.30
2.48 2.22 1.08 1.42 1.10 2.87
clustering
K-means
- 53 1.84 1.47 1.30
clustering
- 54 1.11- 520.70 1.04
58 60 adipo
2.26 1.75clustering
1.20 0.93
6.15 1.20 K-means
3.84K-means
3.23
71.32
0.67
57 29 liver_f
1.30 1.37 0.85 0.81 1.20 0.95 1.02
1.07 0.71 12.24
70.64 3.17 1.04 1.25 1.15 1.23
90
56 26 smooth
2.18 2.94 1.38 1.01 1.11 1.04 80
0.82 69.51 1.41 1.92
3.89 9.59 0.55 0.89 0.57 1.19
55 55 MF_stim
8.54 64.30 3.14 2.69 4.77 1.80 70
2.05
2.40 0.68 1.23
1.49 2.39 0.91 1.61 0.80 1.20
54 27 TH1_hi
1.26 2.45 1.30 1.01 60.67 17.92 60
3.51
1.01 0.86 1.77
2.43 2.17 0.88 0.88 0.65 1.21
50
53 40 smooth_adipo
1.87 1.65 1.19 0.71 1.32 0.92 40
0.99 39.24 0.97 2.60
4.25 41.28 0.58 0.76 0.58 1.08
52 94 MF
36.87 42.55 1.23 0.98 1.79 1.47 30
1.65
1.74 0.74 1.69
1.83 2.67 1.34 1.69 0.68 1.07
51 70 Neu_stim
3.75 6.05 53.22 12.25 3.85 3.10 20
3.00
1.38 0.64 1.45
1.60 2.05 2.37 3.17 0.74 1.38
10
50 26 TH2
2.96 2.26 1.67 1.07 22.77 49.54 2.52
1.31 1.10 2.02
3.34 3.01 0.99 1.40 1.15 2.88
0
49 60 liver_b
1.93 1.80clustering
1.45 1.02
1.80 1.11 38.45
38.84
4.27
0.85
K-means
clustering
K-means
- 56 1.74 1.87 1.26
K-means
clustering
- 57 1.18- 551.27 1.17
48 35 erythro
1.75 1.55 1.16 1.08 2.09 1.61 1.30
1.58 12.15 1.77
16.88 2.80 1.08 1.22 48.67 3.31
47 69 Neu
3.20 2.88 36.08 35.87 2.64 2.04 90
2.77
0.92 0.64 1.61
1.48 2.62 2.44 3.20 0.51 1.09
46 19 CD34
1.86 1.98 2.49 2.68 3.78 2.75 80
3.33
1.75 6.24 2.90
4.92 3.22 2.79 9.13 1.40 48.77
70
45 4 liver_neu
0.94 0.87 32.80 15.75 1.22 0.71 60
0.78
2.33 0.57 34.47
4.35 2.10 0.85 0.81 0.68 0.78
44 80 liver
4.66 3.86 2.47 1.87 3.23 3.33 50
2.56
2.65 1.81 46.36
11.18 7.51 2.08 2.44 1.52 2.47
43 112 MF_un
41.97 23.72 1.89 1.69 3.47 3.47 40
2.54
2.31 1.07 2.50
3.01 4.68 2.02 2.60 0.93 2.14
30
42 47 MF_stim_neu_stim 9.50 32.70 27.87 11.56 4.05 2.59 20
2.15
2.04 0.37 0.64
0.81 1.27 1.47 2.05 0.24 0.67
41 156 T
5.73 3.70 1.57 0.95 32.88 29.23 10
3.57
1.70 3.10 1.37
3.57 2.10 0.94 1.77 3.04 4.78
0
K-means clustering - 59
K-means
clustering
- 60
K-means
clustering
- 58
Fig. 2A
90 70
80 60
70
60 50
50 40
40 30
30
20 20
10 10
0
Expanded
view of tissue
profile for
cluster K60
B
TH2 sm... Neu MF_s live... eryt... cd3...
order: as in Fig 2A-C. K: most specific K cluster in other figures and tables
0
B
TH2
MF_s
live... live...
eryt... cd3...
T... sm...
sm...Neu
Neu
MF_s
eryt... B
cd3...
TH
2
c
T el
ne smo H1 ls
ce
ut
r o oth
ph m lls
u
il
st scl
im e
ul
a
m mo ne ted
ac
no utr
ro
c
ph yt oph
e
il
ag
s
c
e
st d1 4
im
m ula
ac
t
ro ed
ph
ag
e
er
liv
yt
l
hr
e
o i ive r r
d
_
en
f
e
do arl eta
l
bo
th y
ne bo eli cd7
a
n
1
l
e
m
a r m _cd
ro a r
1
ro 0 5
w
cd w
3 3 cd
m 34
ye
lo
i
B d
ad ce
ip lls
oc
yt
es

Fig. 2A
Expanded view of tissue
profile for cluster K1
Methods

Deduced tissues that dominate each of
the 60 K clusters.
# indicates the number of genes in the cluster.
K #
1 391
2 595
3 399
4 879
5 333
6 355
7 192
8 372
Series1 9 223
10 283
11 414
12 459
13 433
14 89
15 74
16 636
17 212
18 153
19 294
20 188
21 385
22 25
23 340
24 160
25 66
26 135
27 206
28 385
29 342
30 173
31 126
32 155
33 23
34 51
35 246
36 70
37 72
38 123
39 147
40
7
41 156
42 47
43 112
44 80
45
4
46 19
47 69
48 35
49 60
50 26
51 70
52 94
Series1
53 40
54 27
55 55
56 26
B 57TH2
29
58 60
59 33
60 71
type
common
MF_T_B_common
Neu_stim_common
liver_f_adipo_comm
MF_not_cd105_com
T_B_common
liver_f_adipo_comm
T_common
B_common
blood_common
blood
liver_f
adipo_liver
MF_cd33_common
erythro_liver_f
T_MF_B
MF_common
endo5
liver_b_adipo
B
K-means clustering - 50
B
CD33
MF_unstim_T
MF_neu
MF_smooth
TH1
MF_B
T
MF_T
Neu_stimK-means clustering - 53
Neu_unstim
MF_stim_lo
mono
Neu_both
MF_both
smooth
liver_f
B
K-means clustering - 56
adipo
CD105
T
MF_stim_neu_stim
MF_un
liver
liver_neu
CD34
Neu
erythro K-means clustering - 59
liver_b
TH2
Neu_stim
MF
smooth_adipo
TH1_hi
MF_stim
smooth
sm... Neu MF_s live... eryt... cd3...
liver_f
adipo
B
liver
Tissue profiles for each
of the 60 K clusters.
800
700
Fig. 3.
342
340
294
300
246
212
192
206
74
147 156
112
72
66
69
23
1
4
7
10
13
16
19
22
25
28
31
34
60
70
40
55
29
33
4
0
37
40
43
46
49
52
55
58
K-means clustering
Fig. 4A-B.
PCA plots colored
by K cluster. Each
pole tends to be
enriched in a
particular tissue.
Other tissues
dominate (as listed
in Table 4) when
other PCA
dimensions are
plotted.
K-means clustering - 51
K-means clustering - 27
K-means clustering - 28
K-means clustering - 29
K-means clustering - 30
K-means clustering - 31
K-means clustering - 32
K-means clustering - 34
K-means clustering - 35
K-means clustering - 54
K-means clustering - 33
K-means clustering - 36
40
35
30
25
20
15
10
5
0
Neu MF_s live... eryt... cd3...
B
TH2 sm...
Neu MF_s live... eryt... cd3...
B
TH2 sm...
Neu MF_s live... eryt... cd3...
B
K-means clustering - 57
Fig. 4C.
K-means clustering - 60
K-means clustering - 49
K-means clustering - 50
K-means clustering - 51
90
80
70
60
50
40
30
20
10
0
K-means clustering - 52
K-means clustering - 53
K-means clustering - 54
90
80
70
60
50
40
30
20
10
0
K-means clustering - 55
K-means clustering - 56
B
TH2 sm...
K-means clustering - 59
K-means clustering - 57
Neu MF_s live... eryt... cd3...
K-means clustering - 60
90
80
70
60
50
40
30
20
10
0
sm...
385
333
126
40
35
30
25
20
15
10
5
0
T...
414
391 399
100
K-means clustering - 26
K-means clustering - 58
400
433
223
K-means clustering - 25
90
80
70
60
50
40
30
20
10
0
Scree Plot (Eigenvalues)
Principal Eigenvalue Eigenvalue Cumulative
Component
(%)
Eigenvalue (%)
PC (1)
163.785
25.4
25.4
PC (2)
101.818
15.8
41.2
PC (3)
85.870
13.3
54.5
PC (4)
68.733
10.7
65.1
PC (5)
52.557
8.1
73.3
PC (6)
37.328
5.787
79.077
PC (7)
27.99
4.339
83.416
PC (8)
26.54
4.114
87.531
PC (9)
23.279
3.609
91.14
PC (10)
14.7
2.279
93.418
PC (11)
13.598
2.108
95.526
PC (12)
11.684
1.811
97.338
500
200
40
35
30
25
20
15
10
5
0
sm...
600
Histogram of gene
occupancy of K
clusters.
40
35
30
25
20
15
10
5
0
T...
Annotated Scree plot showing PCA dimension,
and relationship between tissues and PCA dimension.
Fig. 2B.
Neu MF_s live... eryt... cd3...
Exploded view:
PCA4 vs. PCA8, 60 K
clusters, 10 left on
extremes
639 remaining of
11324 genes.
B
macrophage->
Histogram of the
expression profile of a gene (PRTN3)
in a 79 tissue dataset (Su et al.) using
the GEO site at NCBI.
MF
Liver, adipo, smooth
B, T cells
<-Liver
Macrophage->
TH2 sm...
Neu MF_s live... eryt... cd3...
B
TH2 sm...
Neu MF_s live... eryt... cd3...
B
lo
liver
T
neutrophil
liver
B
fetal liver
smooth
MF_stim
MF; CD34
neu_unstim
CD34; liver;mono
TH1






Can assign many genes (proteins) to specific tissues
Most housekeeping proteins have a particular uneven distribution
Each PCA dimension maps to a particular combination of tissues
o (~ as expected for principal component analysis)
Each K cluster consists of proteins with a similar tissue distribution
This same process works based solely on proteomics data
The input datasets need not be pure
o The process works even when input data consists of
replicates that have unintended residual variability
References
K53
smooth_adipo
60
K56
smooth
40

Andersen, J. S.; Wilkinson, C. J.; Mayor, T.; Mortensen, P.; Nigg, E. A.; Mann, M., (2003)
Proteomic characterization of the human centrosome by protein correlation profiling. Nature
426, 570-4.

Jeffrey KL, Brummer T, Rolph MS, Liu SM, Callejas NA, Grumont RJ, Gillieron C, Mackay F,
Grey S, Camps M, Rommel C, Gerondakis SD, Mackay CR (2006) Positive regulation of
immune cell function and inflammatory responses by phosphatase PAC-1., Nat Immunol. 7:27483.

Parker KC, Walsh R, Salajegheh M, Amato A, Krastins B, Sarracino D, Greenberg SA (2009)
Characterization of human skeletal muscle biopsy samples using shotgun proteomics. J Proteome
Res. Apr 21. (somewhat similar methodology applied to muscle biopsy)

Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M,
Kreiman G, Cooke MP, Walker JR, Hogenesch JB. (2004) A gene atlas of the mouse and human
protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 101:6062-7.
K39
adipo
20
K46
CD34
0
-20
K43
MF_un
K55
MF_s
K57
liver_f
-40
K49
liver
-60
-40
-30
Acknowledgments
K44
liver
-20
K50
TH2
-10
0
10
20
PCA 8
<- stimulated macrophages
B
hi
MF
MF
MF
adipocyte
T
liver
adipocyte
CD34
smooth
neu_stim
erythro
TH2
Conclusions
<- T cells
Fig. 1.
Assign proteins to the cell type in which they are most abundant
Download gene chip data for relevant tissues
Perform PCA and K means clustering
Annotate results
liver adipocyte ->
Overview




Table 4.
Table 3.
liver adipocyte ->
Kenneth C. Parker
Poster Number MPB 064
CD34->
30





Spotfire for graphics and statistical software
Microsoft for Access and Excel
NCBI for the GEO datasets
BG-Medicine for letting me study protein informatics
Virgin Instruments for paying my way here

Similar documents