Gene 276 (2001) 89–99
www.elsevier.com/locate/gene
Analysis of codon usage diversity of bacterial genes with a self-organizing
map (SOM): characterization of horizontally transferred genes with
emphasis on the E. coli O157 genome
Shigehiko Kanaya a,b,c, Makoto Kinouchi a,b, Takashi Abe a,d, Yoshihiro Kudo e, Yuko Yamada e,
Tatsuya Nishi d, Hirotada Mori b,c, Toshimichi Ikemura f,*
a Department of Bio-System Engineering, Faculty of Engineering, Yamagata University, Yonezawa, Yamagata-ken 992-8510, Japan
b CREST JST (Japan Science and Technology), Tsukuba, Japan
c Research and Education Center for Genetic Information, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara-ken 630-0101, Japan
d Xanagen Inc., Sakado, Takatsu-ku, Kawasaki, Kanagawa-ken 213-0012, Japan
e Department of Biochemistry, Jichi Medical School, Kawachi-gun, Tochigi-ken 329-0498, Japan
f Division of Evolutionary Genetics, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411-8540, Japan
Received 14 April 2001; received in revised form 12 June 2001; accepted 10 August 2001
Received by G. Bernardi
Abstract
With increases in the amounts of available DNA sequence data, it has become increasingly important to develop tools for comprehensive
systematic analysis and comparison of species-specific characteristics of protein-coding sequences for a wide variety of genomes. In the
present study, we used a novel neural-network algorithm, a self-organizing map (SOM), to efficiently and comprehensively analyze codon
usage in approximately 60,000 genes from 29 bacterial species simultaneously. This SOM makes it possible to cluster and visualize genes of
individual species separately at a much higher resolution than can be obtained with principal component analysis. The organization of the
SOM can be explained by the genome G + C% and tRNA compositions of the individual species. We used SOM to examine codon usage
heterogeneity in the E. coli O157 genome, which contains ‘O157-unique segments’ (O-islands), and showed that SOM is a powerful tool for
characterization of horizontally transferred genes. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Codon usage; Self-organizing map; E. coli O157; Horizontally transferred gene
1. Introduction
With progress in the genome projects, vast amounts of
nucleotide sequence data are now available. Multivariate
analysis methods such as factorial correspondence analysis
and principal component analysis (PCA) have been used
to systematically study heterogeneous codon usage in a
wide variety of species (Grantham et al., 1980; Medigue
et al., 1991; Sharp and Matassi, 1994; Pouwels and Leunissen, 1994; Andersson and Sharp, 1996; Kanaya et al.,
1996a; Guerdoux-Jamet et al., 1997; Kunst et al., 1997).
To characterize species-specific heterogeneity in codon
usage, we previously developed a measure denoted as Z1
that is based on the widest range of the axis obtained by
PCA of codon usage patterns (Kanaya et al., 1996a,b, 1999, 2001; Nakayama et al., 1999, 2000). Conventional multivariate analysis methods such as PCA are useful for analyzing codon usage within one or a small number of species,
but their resolving power is rather poor when a
large number of genes from many species are analyzed
simultaneously. Here, we introduce a novel neural-network
algorithm with high resolving power, a self-organizing map
(SOM), and use it to analyze codon usage in approximately
60,000 genes from 29 bacterial species simultaneously.
SOM neural networks were proposed by von der
Malsburg (1973) and Kohonen (1982) and make it possible
to visualize high-dimensional systems (reviewed in Kohonen et al., 1996). This method can be used to identify categories from raw data with high resolving power and to trace
the factors reflected in individual categories.
Abbreviations: O-islands, O157-unique segments; PCA, principal
component analysis; SOM, self-organizing map
* Corresponding author. Tel.: +81-559-81-6788; fax: +81-559-81-6794.
E-mail address: [email protected] (T. Ikemura).
In bacterial genomes, codon usage varies within and
between species. Our group and others have shown that,
in unicellular organisms, choice among synonymous codons
for highly expressed genes is typically dependent on levels
0378-1119/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0378-1119(01)00673-4
of isoaccepting tRNAs (Ikemura, 1981a,b, 1982; Dong et
al., 1996; Percudani et al., 1997; Kanaya et al., 1999, 2001)
and therefore the extent of codon bias for each gene is
associated with the level of protein production (Ikemura,
1981a,b, 1982, 1985a,b; Gouy and Gautier, 1982; Medigue
et al., 1991; reviewed in Andersson and Kurland, 1990;
Kunisawa, 1992; Sharp and Matassi, 1994). Codon usage
diversity is also affected by G + C% of the genome
(Bernardi and Bernardi, 1985; Muto and Osawa, 1987;
Sueoka, 1992; Osawa, 1995). These characteristics of
codon usage heterogeneity could explain the organization
of the SOM obtained in this study.
Genes introduced through horizontal transfer from
distantly related organisms are known to retain the sequence
characteristics of the donor genome and can be distinguished from those of the acceptor genome (Jeltsch and
Pingoud, 1996; Lawrence and Ochman, 1998). The present
study showed that SOM is an efficient tool for characterizing horizontally transferred genes and predicting the donor/
acceptor relationship with respect to the transferred genes.
We applied this method to characterize codon usage heterogeneity in the E. coli O157 genome, which contains the
unique segments including O-islands (Perna et al., 2001)
that are absent in E. coli K12.
2. Materials and methods
To exclude the effects of gene size, amino acid composition, and codon box number, the codon frequency of the kth
gene for the v(m)th codon, $x_{kv(m)}$, was calculated as
$$x_{kv(m)} = f_{kv(m)} \Big/ \left[ \sum_{v=1}^{M(m)} f_{kv(m)} \Big/ M(m) \right] \qquad (1)$$
where $f_{kv(m)}$ denotes the v(m)th synonymous codon number
for the mth amino acid, and $M(m)$ denotes the codon box
number. The codon usage pattern for the kth gene is
described by a vector $x_k$ consisting of $x_{kv(m)}$.
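The normalization in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `codon_usage_vector` is hypothetical, the standard genetic code is assumed, M(m) is taken to be the number of synonymous codons of amino acid m, single-codon amino acids (Met, Trp) and stop codons are left out, and an amino acid absent from the gene contributes zeros.

```python
from collections import Counter

# Standard genetic code with bases ordered T, C, A, G ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Synonymous codon families; assumption: only amino acids with M(m) > 1 are kept.
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)
FAMILIES = {aa: codons for aa, codons in FAMILIES.items() if len(codons) > 1}


def codon_usage_vector(cds):
    """Return {codon: x_kv(m)} for one coding sequence, following Eq. (1)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    x = {}
    for aa, codons in FAMILIES.items():
        total = sum(counts[c] for c in codons)
        mean = total / len(codons)          # sum_v f_kv(m) / M(m)
        for c in codons:
            # Assumption: an amino acid absent from the gene yields zeros.
            x[c] = counts[c] / mean if total else 0.0
    return x
```

With this scaling, the values within one synonymous family sum to M(m) whenever the amino acid occurs in the gene, so a value of 1 corresponds to uniform use of that codon.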
The SOM is a neural-network algorithm that implements
a characteristic nonlinear projection from the high-dimensional space of input signals onto a low-dimensional array
of weights (reviewed in Kohonen et al., 1996). The weights
($w_{ij}$) in the codon frequency space are arranged in a two-dimensional lattice denoted by i (= 0, 1, …, I − 1) and j
(= 0, 1, …, J − 1). The learning process of the SOM algorithm in the present study is independent of the order of
input vectors (Abe et al., 1999). In the original method,
the initial weight vectors $w_{ij}$ are set to random values
(Kohonen, 1990; Kohonen et al., 1996), but in the present
method the vectors are initialized by PCA, which is a statistical method that performs linear mapping to extract optimal
features from an input distribution in the mean squared error
sense and can be used by self-organizing neural networks to
form unsupervised neural preprocessing modules for classification problems (Kohonen et al., 1996). Hence, the initial
weight vectors ($w_{ij}$) are set based on the widest scale of the
gene distribution in the codon frequency space with PCA.
Weights in the first dimension are arranged into 200 lattices
(I = 200) corresponding to the width of five times the standard deviation ($5\sigma_1$) of the first principal component. The
second dimension (J) is defined by the nearest integer
greater than $I\sigma_2/\sigma_1$. The weight vector on the ijth lattice
is represented as follows:
$$w_{ij} = x_{av} + 5\sigma_1 \left[ b_1 (i - I/2)/I + b_2 (j - J/2)/J \right] \qquad (2)$$
Here, $x_{av}$ is the average vector for codon usage patterns; $b_1$
and $b_2$ are eigenvectors for the first and second principal
components. This PCA-based initialization constitutes Step 1. In Step 2, the Euclidean distances between the
input vector $x_k$ and the weight vector $w_{ij}$ are calculated, and
$x_k$ is classified into the weight vector (called $w_{i'j'}$) with the
smallest distance among them. After classifying all input
vectors into the weight vectors, an updating process is
done according to Step 3.
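The PCA-based initialization of Eq. (2) and the classification in Step 2 might look as follows in NumPy. This is a sketch under assumptions rather than the published implementation: the function names are hypothetical, σ1 and σ2 are taken as the square roots of the two largest eigenvalues of the sample covariance matrix, and "nearest integer greater than Iσ2/σ1" is read as floor(Iσ2/σ1) + 1.

```python
import numpy as np


def init_weights_pca(X, I=200):
    """Initial lattice of weight vectors w_ij following Eq. (2).

    X has shape (genes, codons); the result has shape (I, J, codons).
    """
    x_av = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(evals)[::-1]               # largest variance first
    s1, s2 = np.sqrt(evals[order[:2]])            # sigma_1, sigma_2
    b1, b2 = evecs[:, order[0]], evecs[:, order[1]]
    J = int(np.floor(I * s2 / s1)) + 1            # assumption: strictly greater
    i = np.arange(I)[:, None, None]
    j = np.arange(J)[None, :, None]
    return x_av + 5 * s1 * (b1 * (i - I / 2) / I + b2 * (j - J / 2) / J)


def classify(X, W):
    """Step 2: lattice indices (i', j') of the nearest weight vector per gene."""
    I, J, D = W.shape
    flat = W.reshape(I * J, D)
    # Squared Euclidean distances via |x|^2 - 2 x.w + |w|^2.
    d2 = (X ** 2).sum(1)[:, None] - 2 * X @ flat.T + (flat ** 2).sum(1)[None, :]
    return np.stack(np.unravel_index(d2.argmin(axis=1), (I, J)), axis=1)
```

For a data set of the size used here, the distance matrix would probably have to be computed in chunks of genes to keep memory use manageable; the sketch ignores this.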
In Step 3, the ijth weight vector is updated with
$$w_{ij}^{(new)} = w_{ij} + \alpha(r) \left( \sum_{x_k \in S_{ij}} x_k / N_{ij} - w_{ij} \right) \qquad (3)$$
Here, the components of set $S_{ij}$ are input vectors classified
into $w_{i'j'}$ satisfying $i - \beta(r) \le i' \le i + \beta(r)$ and
$j - \beta(r) \le j' \le j + \beta(r)$. The two parameters $\alpha(r)$
($0 < \alpha(r) < 1$) and $\beta(r)$ are learning coefficients for the
rth cycle, and $N_{ij}$ is the number of components of $S_{ij}$. In
the present study, $\alpha(r)$ and $\beta(r)$ are set by
$$\alpha(r) = \max\{0.01,\ \alpha(1)(1 - r/T)\} \qquad (4)$$
$$\beta(r) = \max\{0,\ \beta(1) - r\} \qquad (5)$$
where $\alpha(1)$ and $\beta(1)$ are the initial values for the T-cycle of
the learning process. In the present study, we selected 100
for T, 0.5 for $\alpha(1)$, and I/4 for $\beta(1)$. The learning process is
monitored by the total distance between $x_k$ and the nearest
weight vector $w_{i'j'}$, represented as
$$Q(r) = \sum_{k=1}^{N} \left\{ \| x_k - w_{i'j'} \|^2 \right\} \qquad (6)$$
where N is the total number of genes analyzed.
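One learning cycle (Steps 2 and 3, with the schedules of Eqs. (4) and (5) and the monitor Q(r) of Eq. (6)) can be sketched as below. The function names are hypothetical, β(r) is truncated to an integer, and lattice points whose neighbourhood S_ij is empty are left unchanged; the text does not specify either detail, so both are assumptions.

```python
import numpy as np


def alpha(r, alpha1=0.5, T=100):
    """Learning coefficient of Eq. (4)."""
    return max(0.01, alpha1 * (1 - r / T))


def beta(r, beta1):
    """Neighbourhood half-width of Eq. (5)."""
    return max(0, beta1 - r)


def batch_step(X, W, r, beta1, alpha1=0.5, T=100):
    """One batch-learning cycle; returns the updated weights and Q(r)."""
    I, J, D = W.shape
    flat = W.reshape(I * J, D)
    d2 = ((X[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)                       # Step 2: nearest lattice point
    q = d2[np.arange(len(X)), idx].sum()          # Eq. (6)
    bi, bj = idx // J, idx % J
    a, b = alpha(r, alpha1, T), int(beta(r, beta1))
    W_new = W.copy()
    for i in range(I):
        for j in range(J):
            # S_ij: genes classified within +/- beta(r) of (i, j) on both axes.
            in_s = (np.abs(bi - i) <= b) & (np.abs(bj - j) <= b)
            if in_s.any():                        # assumption: skip empty S_ij
                W_new[i, j] += a * (X[in_s].mean(axis=0) - W[i, j])  # Eq. (3)
    return W_new, q
```

Because every weight is updated from the classification of all genes at once, the result of a cycle does not depend on the order in which genes are presented, which is the property noted above for the present algorithm. Calling `batch_step` for r = 1, …, T with `beta1 = I // 4` would reproduce the schedule described in the text.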
3. Results and discussion
3.1. Species-specific codon usage visualized by SOM
SOM was constructed with 59,122 genes, each containing
at least 100 codons, from the 29 bacterial species whose
complete genomic sequences have been determined: Aquifex aeolicus (Deckert et al., 1998), Archaeoglobus fulgidus
(Klenk et al., 1997), Aeropyrum pernix (Kawarabayasi et al.,
1999), Bacillus subtilis (Kunst et al., 1997), Bacillus halodurans (Takami et al., 2000), Borrelia burgdorferi (Fraser et
al., 1997), Buchnera sp. (Shigenobu et al., 2000), Campylobacter jejuni (Parkhill et al., 2000), Chlamydia trachomatis
(Stephens et al., 1998), Chlamydia pneumoniae (Parkhill
et al., 2000), Deinococcus radiodurans (White et al., 1999),
E. coli (Blattner et al., 1997), Haemophilus influenzae
(Fleischmann et al., 1995), Halobacterium sp. (Ng et al.,
2000), Helicobacter pylori (Tomb et al., 1997), Methanococcus jannaschii (Bult et al., 1996), Methanobacterium
thermoautotrophicum (Smith et al., 1997), Mycobacterium
tuberculosis (Cole et al., 1998), Neisseria meningitidis
(Tettelin et al., 2000), Pseudomonas aeruginosa (Stover et
al., 2000), Pyrococcus abyssi (Heilig, unpublished data;
GenBank Accession number, AL096836), Pyrococcus horikoshii (Kawarabayasi et al., 1998), Rickettsia prowazekii
(Andersson et al., 1998), Synechocystis sp. (Kaneko et al.,
1996), Thermotoga maritima (Nelson et al., 1999), Treponema pallidum (Fraser et al., 1998), Ureaplasma urealyticum (Glass et al., 2000), Vibrio cholerae (Heidelberg et al.,
2000), and Xylella fastidiosa (Simpson et al., 2000).
As the first step to obtain the initial weight vectors (see
Section 2) codon usage for these 59,122 genes was analyzed
by PCA. After the learning process of the 100th cycle,
codon usage of the genes was effectively reflected in the
weight vectors. The learning process was monitored by
Q(r) in Eq. (6). Total error by Eq. (6) decreased from
$Q(1) = 1.13 \times 10^{6}$ at the initial cycle to $Q(100) = 3.39 \times 10^{5}$
at the 100th cycle. Comparison of gene classification into lattice points by the initial vectors (Fig. 1a) with
the classification by the final vectors (Fig. 1b) showed
clearly that genes within a single species were much more
tightly clustered with the final vectors. Lattices that include
genes from a single species are indicated in color, and those
including genes of more than one species are indicated in
black. The resolving power of conventional PCA, depicted in Fig. 1a,
is evidently too poor for comprehensive analysis and
comparison of a large number of genes from multiple
genomes.
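The coloring rule used in Fig. 1 (a lattice point is colored when all of its genes come from one species and black otherwise) amounts to a simple grouping step. The sketch below assumes hypothetical inputs: a list of lattice indices, one per gene, and a parallel list of species abbreviations.

```python
from collections import defaultdict


def lattice_colors(cells, species):
    """Map each occupied lattice point to a species name or to 'mixed'."""
    members = defaultdict(set)
    for cell, sp in zip(cells, species):
        members[cell].add(sp)
    return {cell: next(iter(sps)) if len(sps) == 1 else "mixed"
            for cell, sps in members.items()}


# Hypothetical toy input: five genes on three lattice points.
cells = [(0, 0), (0, 0), (1, 2), (1, 2), (3, 3)]
species = ["Ecol", "Ecol", "Ecol", "Vcho", "Bsub"]
print(lattice_colors(cells, species))
```

The fraction of occupied lattice points that are not "mixed" gives one rough way to compare the initial and final maps numerically.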
3.2. SOM organization with respect to genome G + C% and
taxonomic relationships
Analysis of the raw data and raw vectors revealed that the
clustered genes have very similar patterns of codon usage.
Neighboring weight vectors in SOM tend to be similar, and
distantly separated weight vectors tend to be different.
Strongly biased weights are also known to localize to the
edge of the SOM, but weakly biased weights tend to localize
in the center. This suggests that species with strong codon
biases are located at the edge of this map. In fact, genes of
U. urealyticum (genome G + C%, 25.5%), Buchnera sp.
(26.4%), R. prowazekii (29.0%), C. jejuni (30.5%), and M.
jannaschii (31.4%) were distributed on the left side of the
SOM, and those of P. aeruginosa (66.6%), D. radiodurans
(66.6%), and Halobacterium sp. (66.6%) were distributed
on the right side (Fig. 1b). Genome G + C% increases from
left to right and thus is reflected mainly in the horizontal
axis. Fig. 2 illustrates the configuration of individual species
whose genome G + C% values are listed.
On the vertical axis, genes for Archaea (M. jannaschii, P.
horikoshii, P. abyssi, A. pernix, M. thermoautotrophicum,
A. fulgidus, and Halobacterium sp.) were distributed at the
bottom, and those for γ- and β-Proteobacteria (Buchnera
sp., H. influenzae, X. fastidiosa, V. cholerae, E. coli, P.
aeruginosa, and N. meningitidis) were distributed in the
upper part. This shows that the SOM broadly reflects the
taxonomic relationships. Two thermophilic bacteria, T.
maritima and A. aeolicus, were located close to Archaea,
indicating that the thermophilic bacteria have codon usage
patterns similar to those of Archaea. This may be related to
the observation that each of these thermophilic bacteria
contains a large number of genes that are most similar to
those of thermophilic Archaea (Ochman et al., 2000).
3.3. SOM organization and tRNA gene number
The relative proportions of isoaccepting tRNAs in cells
are important factors that influence synonymous codon
choice in genes of unicellular organisms; codon usage in
highly expressed genes is typically dependent on tRNA
content (Ikemura, 1981a,b, 1982; Kanaya et al., 1999).
Cellular tRNA contents are known to be related to copy
numbers of tRNA genes (Ikemura, 1981a,b; Dong et al.,
1996; Duret, 2000; Kanaya et al., 1999). To investigate
SOM classification from the viewpoint of the levels of
isoaccepting tRNAs, we examined tRNA genes of individual species (Table 1). The results shown in Table 1 and
those of a previous study (Kanaya et al., 1999) indicate that
increases in tRNA genes may occur in two ways. One is
multiplication of the gene encoding the tRNA with one of the
anticodons specific for each amino acid, which results in
a clear difference in the tRNA levels between the major
and minor isoaccepting tRNAs. This phenomenon has been
observed for bacteria with large genomes such as N. meningitidis, P. aeruginosa, E. coli, V. cholerae, B. halodurans,
B. subtilis, and H. influenzae. Species in which multiplication of tRNA genes for more than ten anticodon types is
observed are indicated by blue dots in Fig. 2. These species are located primarily in the right and upper
zones of the SOM shown in Fig. 2. In the second mechanism, there are single copies of tRNA genes encoding a
variety of anticodons specific for one amino acid; the
species which have more than 40 anticodon types are indicated by red dots in Fig. 2. This is observed often for
bacteria with high G + C% genomes. M. tuberculosis, D.
radiodurans, T. maritima, and most of the Archaea belong
to this category. The remaining bacterial species, which are
not indicated by colored dots, are located primarily in the
left and upper zones of the SOM in Fig. 2. Bacteria with
low G + C% genomes tend to have small numbers of
isoacceptor species and, therefore, belong to this residual
class. It is notable that Buchnera sp. and R. prowazekii,
which are taxonomically distant from each other but
possess almost identical isoacceptor sets, are located near
each other in the SOM. A similar observation was also made
for P. horikoshii, P. abyssi, T. maritima, and A. fulgidus.
Collectively, these findings support the view that species-specific codon usage in bacterial genomes is determined
primarily by genome G + C% and compositions of isoaccepting tRNAs and that SOM organization is reflective of
these two factors.
Fig. 1. Gene classification by (a) initial weights and (b) final weights: A. aeolicus (abbreviated as Aaeo), A. fulgidus (Aful), A. pernix (Aper), B. subtilis (Bsub),
B. halodurans (Bhal), B. burgdorferi (Bbur), Buchnera sp. (Buch), C. jejuni (Cjej), C. trachomatis and C. pneumoniae (Chla), D. radiodurans (Drad), E. coli
(Ecol), H. influenzae (Hinf), Halobacterium sp. (Halo), H. pylori (Hpyl), M. jannaschii (Mjan), M. thermoautotrophicum (Mthe), M. tuberculosis (Mtub), N.
meningitidis (Nmen), P. aeruginosa (Paer), P. abyssi (Paby), P. horikoshii (Phor), R. prowazekii (Rpro), Synechocystis sp. (Syne), T. maritima (Tmar), T.
pallidum (Tpal), U. urealyticum (Uure), V. cholerae (Vcho), and X. fastidiosa (Xfas). Archaea, eubacteria, and the two thermophilic bacteria are denoted by yellow,
green, and blue letters, respectively. The configuration of bacterial species in (b) is depicted in Fig. 2. These SOM results are available on the Xanagen Inc. website (URL http://www.xanagen.com).
Fig. 2. Configuration of bacterial species in SOM. Genome G + C% values are indicated with black letters. Abbreviations for bacterial species correspond to those in
Fig. 1b. Some species have adapted independently to a common lifestyle: archaeal and bacterial hyperthermophiles (Aravind et al., 1998), for example, and the
intracellular pathogens Rickettsia and Chlamydia (Wolf et al., 1999). Species within the respective groups show similar codon usage and tRNA compositions
and are located close to each other in the SOM.
3.4. Donor/acceptor relationship in horizontal gene transfer
Foreign genes can be identified by their atypical nucleotide compositions and codon usage patterns (Lawrence and
Ochman, 1998). Parasites and hosts often have coding strategies that can be distinguished (Grantham et al., 1980), and
foreign-type genes such as genes of transposons, plasmids,
and viruses often have codon usage patterns that are quite different
from those of the hosts (Medigue et al., 1991). Therefore,
codon usage data have been used to identify which genes
were transferred horizontally from other genomes (Lawrence and Ochman, 1997, 1998; Nakayama et al., 1999,
2000; Kunisawa et al., 1998). In Fig. 1b, a black lattice
within a species-specific colored territory indicates the
presence of genes with codon usage that is atypical in its
own genome but similar to that of the species represented by
the color. This invasion of the alien genes into species-specific colored territories in the SOM may provide information concerning the donor/acceptor relationship in horizontal gene transfer. Table 2 lists the numbers of genes
present in the territories of different species in the SOM
shown in Fig. 1b. For example, 273 E. coli genes are located
in the V. cholerae territory, and 116 V. cholerae genes are
located in the E. coli territory. A relatively large number of
B. subtilis genes have codon usage patterns that differ from
the intrinsic pattern, and many genes from various eubac-
teria have patterns similar to that of B. subtilis. These findings may provide fundamental knowledge for studies of the
donor/acceptor relationship in horizontal gene transfer. This
possibility was tested for E. coli O157 and is described in
the following section.
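As a reading aid for Table 2, the number of genes with atypical codon usage (AGN) appears to equal the total gene number (GN) minus the number of genes found in the species' own territory. The short check below uses four rows as they can be read from the table (own-territory counts taken from the diagonal); the tuple layout and the function name are illustrative assumptions.

```python
# (species, GN, genes in own territory, AGN), as read from Table 2.
ROWS = [
    ("A. fulgidus", 2028, 1602, 426),
    ("A. pernix", 2611, 2038, 573),
    ("D. radiodurans", 2927, 1672, 1255),
    ("Halobacterium sp.", 1761, 1529, 232),
]


def atypical_fraction(gn, own):
    """Fraction of a species' genes that fall outside its own territory."""
    return (gn - own) / gn


for name, gn, own, agn in ROWS:
    # The tabulated AGN matches GN minus the own-territory count.
    assert gn - own == agn
    print(f"{name}: {agn} of {gn} genes ({atypical_fraction(gn, own):.1%}) atypical")
```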
3.5. Codon usage heterogeneity in the E. coli O157 genome
E. coli O157:H7 is a global threat to public health and has
been implicated in many outbreaks of hemorrhagic colitis.
The severity of symptoms, the lack of effective treatments,
and the potential for large-scale outbreaks from contaminated foods have propelled intensive research into this
organism (Su and Brandt, 1995). The genomic sequence
has been determined by two groups (Perna et al., 2001;
Hayashi et al., 2001). The size of the E. coli O157:H7
genome is much larger than that of E. coli K12; approximately 4 Mb of sequence is homologous between E. coli
K12 and E. coli O157, but sequences such as O-islands are
present only in E. coli O157. We examined codon usage
heterogeneity in approximately 1000 O157-specific genes
with the SOM and Z1 parameter analyses. The Z1 parameter
is an index of the heterogeneity of codon usage (Kanaya et
al., 1996a, 1999, 2001). A large, positive Z1 value indicates
high adaptation of codon usage to the translation system,
and a negative Z1 represents low adaptation; foreign genes
Table 1
Number of tRNA genes for individual anticodons a
a E. coli O157 genome sequences reported by Perna et al. (2001) and Hayashi et al. (2001) are abbreviated as O157P and O157H, respectively. C. trachomatis and C. pneumoniae are abbreviated as Ctra and
Cpne, respectively. Abbreviations for other species correspond to those in Fig. 1. uua and uca correspond to anticodons for selenocysteine.
Table 2
The number of genes in their own territory and of genes with atypical codon usage that invaded territories of other species in the SOM in Fig. 1b a
Archaea
Eubacteria
Aful Aper Drad Halo Mjan Mthe Paby Phor Aaeo Tmar Bbur Bhal Bsub Buch Cjej Cpne Ctra Ecol O157P Hinf Hpyl Mtub Nmen Paer Rpro Syne Tpal Uure Vcho Xfas
A. fulgidus
A. pernix
D. radiodurans
Halobacterium sp.
M. jannaschii
M. thermoautotrophicum
P. abyssi
P. horikoshii
A. aeolicus
T. maritima
B. burgdorferi
B. halodurans
B. subtilis
Buchnera sp.
C. jejuni
Chlamydia
E. coli
H. influenzae
H. pylori
M. tuberculosis
N. meningitidis
P. aeruginosa
R. prowazekii
Synechocystis sp.
T. pallidum
U. urealyticum
V. cholerae
X. fastidiosa
GN
AGN
1602 112
6
8
18 45 19 15 34
3
1
1
3
6
35 2038 94 36
6 26 30 54
5
5
3
2
1
7
9
5
4 1672 21 155
2
3
3
2 22
32 271 1
14 2
1
38 67
31 216 91 1529 57 337 169 121 86
8 9
4
8
19 2
1
1
2
11
1 30
3 1063
7 21
5
5 64
44 157 7
40 2
1
28 51
84 33
1
1137 16 13
5 18
29 11
5 874 326 79
5
1
1
92 79
5
1 28 59 499 1324 63 48 14
42 83
12 23 10
15 35
11
20 29 31 1191 114
2
1
118 29
3
2
30
8 12 12 1428
4 81
6 12
1 70
1 24
1
2
4 10
395
6 101 4
29 4
12 21
3
7
1
1
6 12
2 1 2938 410
1 66 58 131 143
10 22 246 12 22
7
2 24
4 93 181 1572 7
15 13 14 111 176
5
1
2
8 10 400
35 4
1
1
3
1
12
15
1
32
7 15 10 1149 7
5
5
1
9
5
1
3
3
2 17
4
6 17
63 84 3
17 789 702
26 59
13
3 49
3032 3324
1
56 23 3
19 26 18
37 44
2
6 36
1
1
1
6
1
3
10 91
2
4
19 31
4
2 330 114 45
1
4
1
57
18 254 4
12 4
1
27 74
25
2
1
4
1
5
32 46
1
3 201 18 91
1
1
4
2 3
9 89 2
8
3
12
9
7
1
8
8 28 61
99 12
3
6 17
1
2
6
1
4
3
1
1 30
35 208
8 5
2
36 55
31 62
1 11
5
1
1 21
12 19 1
6 1
1
17 46
5
2
6 20
8
1
1
1
3
2
1
48 44
4
6 273 298
13
3
1
6 12
36 53
2028 2611 2927 1761 1542 1644 1692 2001 1499 1684 772 3541 3627 523 1491 968 833 3907 4587
426 573 1255 232 479 507 818 677 308 256 382 603 2055 123 342 179 131 875 1263
a
1
1
2
131
55 174
5 56
30 29
1
2
17
11
15
3
40
10
1
55
10
47
17
19
24 96
13
8 218
16
8
16
1
8
1293 10
5 1033 43
6 44 2535
1
28
14 241
37
1
7 34
4
1
4 78
20
44 23
3
2
24
1572 1371 3676
279 338 1141
4
50 1
242 2
50 5
25 27
6
3
6
57 6
13
3
49 116 1
7
67
12
7 58
15
2 6
70
32
16
9
8
16
40 350 2
1223
11 1
5 4210
12
570
31
4 3
6
45 1
2
14
39
2
68
20
1678 5255 773
455 1045 203
1 9
17 26
65 10
5 1
57 5
1
16
1
2
11
50
66
3
9
66
3
19
14
127
6
17
6
2286
5
3
44
9
2909
623
11
1
4
24
3
15
15
165
2
3
35
6
4
9
6
11
21
2
1
6
540
1
3
11
1
481
19
33
919 565
379 84
3
5
16
1
12
8
22
54
7
16
19
9
2
5
177
60
1
3
68
116
61
12
17
7
8
18
45
15
2
2495
68
3236
741
9
8
60
56
1
3
32
59
10
12
73
59
48
2
19
37
1
130
1310
2045
735
SOM territory
GN and AGN represent total gene numbers analyzed and the numbers of genes with atypical codon usage, respectively. Abbreviations of species correspond to those in Table 1 and Fig. 1.
Fig. 3. Histogram analysis of Z1 values of E. coli K12 and O157 genes.
Ribosomal protein genes (Rp) are shown as an example of genes
highly adapted to the translation system.
tend to have negative Z1 values regardless of the gene
expression level (Kanaya et al., 1996a, 1999, 2001). In the
following analysis, parameters for calculation of Z1 values
were obtained using E. coli K12 genes with more than 100
codons. Then, Z1 values of E. coli K12 and E. coli O157
genes thus calculated were compared (Fig. 3). E. coli O157
had a larger number of genes with negative Z1 values, and
O157-specific genes tended mostly to have negative Z1
values. This supports the notion that many O157-specific
genes were transferred horizontally into the E. coli genome
(Perna et al., 2001; Hayashi et al., 2001). The distribution of
Z1 values for E. coli O157 genes is shown in Fig. 4. Of
particular interest are genes located in five O-islands
(OI#8, OI#84, OI#106, OI#115, and OI#148 as described by
Perna et al., 2001) that have negative Z1 values, and thus,
codon usage patterns very different from those of E. coli
K12 genes. The O157 genes were then mapped to the SOM
territories of individual species in Fig. 1b and are presented
in Fig. 4. These data and those in Table 2 indicate that a
large number of genes were transferred horizontally to E.
coli O157 from V. cholerae or closely related species.
To test the feasibility of using SOM for clarifying the
donor/acceptor relationship in horizontal gene transfer, the
O157-specific genes localized to the Vibrio territory on the
SOM in Fig. 1b were examined in detail. Among approximately 1000 O157-specific genes, more than 50 genes were
found to be present in the V. cholerae territory on the SOM.
Then, we selected 23 genes that have positive Z1 values
calculated using the parameters for V. cholerae genes.
These genes should have codon usage patterns typical of
those found in V. cholerae and should fit well with the translation
system of V. cholerae. A BLASTP search showed that seven
out of the 23 O157-specific genes have homologous genes in
Vibrio genomes (Table 3), suggesting that these seven genes
originated from Vibrio or closely related species. This finding that about one-third of the O157-specific genes selected
solely with SOM and Z1 parameter analyses have homologs
in the Vibrio genomes supports the feasibility of the present
strategy. It is possible that the homologs of the remaining
Fig. 4. Z1 distribution and gene classification by similarity of codon usage across the E. coli O157 genome. In the upper part, the average Z1 values for a window
of 11 genes are plotted with a step size of one gene. In the lower part, O157 genes classified into individual species-specific territories in the SOM of Fig. 1b are
listed by bars in the row with the species name.
Table 3
O157-specific genes located in the V. cholerae territory in SOM and with positive Z1 values a

                  Homologs in Vibrio or related species
ID      Z1     Function                              Species
Z1538   2.88   Fimbriae                              Yersinia pestis
Z5334   1.56                                         (none)
Z2099   0.99   Shigatoxin 2                          Bacteriophage 933W
Z1442   0.98   Antitermination protein N             Bacteriophage 933W
Z5415   0.97                                         (none)
Z0414   0.97   Hypothetical protein                  (Lactococus lactis)
Z2152   0.95   DNA damage inducible protein I        Serratia marcescens
Z2083   0.88   DNA damage inducible protein I        Serratia marcescens
Z2239   0.66   Outer membrane protein                V. cholerae
Z0895   0.63   Methylaspartate mutase                (Clostridium cochlearium)
Z4385   0.53   Ferric vibriobactin ABC transporter   V. cholerae
Z3616   0.52   Hypothetical protein                  Bacteriophage APSE-1
Z5088   0.43   Transposase orfAB subunitA            V. cholerae
Z2568   0.30   Shigatoxin 2                          Bacteriophage 933W
Z3132   0.29   Hypothetical protein                  Yersinia enterocolitica
Z2165   0.26   NADH oxidase                          V. cholerae
eae     0.24   Invasin                               Yersinia pseudotuberculosis
chuT    0.19   Heme transport                        V. cholerae
Z3159   0.19   Heme receptor                         V. mimicus
Z4383   0.08   Achromobactin transport system        Erwinia chrysanthemi
Z2053   0.04   Accessory colonization factor         V. cholerae
Z1494   0.01   Shigatoxin 2                          Bacteriophage 933W
Z5523   0.01   MFS transporter                       P. aeruginosa

a ID corresponds to the ID number for E. coli O157 sequence (Perna et al., 2001) registered in GenBank. The function of only two sequences, eae and chuT,
was assigned in the database sequence. Protein sequences of Vibrio or related species including phages, which show significant homology to the O157-specific
sequences, are listed. In the case where no homologs were found, this is noted as (none), and in the case where the homologs were found only in the species
taxonomically distant from Vibrio, the species name is given in parentheses.
two-thirds of the predicted genes may be found in the Vibrio
or related genomes with gene sequencing or molecular
biological methods such as Southern blotting hybridization.
In other words, the present methods for analyzing codon
usage heterogeneity within a species (Z1 parameter) and
between species (SOM classification) can be used together
as a powerful tool to assess possible donor genomes for
horizontal gene transfer. Furthermore, this strategy may
provide a key to predicting horizontally transferred genes
that have been lost from the present-day genomes of the
donor species or are at least absent in the currently
sequenced genomes.
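The smoothed Z1 profile described for Fig. 4 (average over a window of 11 genes, step size of one gene) is a plain moving average along the gene order. The sketch below assumes a hypothetical array of per-gene Z1 values in chromosomal order and treats the genome as linear, so windows do not wrap around the origin; the text does not say how the ends were handled.

```python
import numpy as np


def window_means(z1, window=11):
    """Mean Z1 for every run of `window` consecutive genes (step of one gene)."""
    z1 = np.asarray(z1, dtype=float)
    kernel = np.ones(window) / window
    # 'valid' keeps only windows that lie fully inside the gene list.
    return np.convolve(z1, kernel, mode="valid")
```

Runs of consecutive windows with negative means would then mark candidate regions, such as the O-islands discussed above, whose codon usage departs from that of E. coli K12.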
References
Abe, T., Kanaya, S., Kinouchi, M., Kudo, Y., Mori, H., Matsuda, H.,
Carlos, D.C., Ikemura, T., 1999. Gene classification method based on
batch-learning SOM. In: Asai, K., Miyano, S., Takagi, T. (Eds.),
Genome Informatics Series No. 10. Universal Academy Press, Tokyo,
pp. 314–315.
Andersson, S.G., Kurland, C.G., 1990. Codon preferences in free-living
microorganisms. Microbiol. Rev. 54, 198–210.
Andersson, S.G.E., Sharp, P.M., 1996. Codon usage in the Mycobacterium
tuberculosis complex. Microbiology 142, 915–925.
Andersson, S.G.E., Zomorodipour, A., Andersson, J.O., Sicheritz-Ponten,
T., Alsmark, U.C.M., Podowski, R.M., Naslund, A.K., Eriksson, A.,
Winkler, H.H., Kurland, C.G., 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133–140.
Aravind, L., Tatusov, R.L., Wolf, Y.I., Walker, D.R., Koonin, E.V., 1998.
Evidence for massive gene exchange between archaeal and bacterial
hyperthermophiles. Trends Genet. 14, 442–444.
Bernardi, G., Bernardi, G., 1985. Codon usage and genome composition. J.
Mol. Evol. 22, 363–365.
Blattner, F.R., Plunkett III, G., Bloch, C.A., et al., 1997. The complete
genome sequence of Escherichia coli K-12. Science 277, 1453–1462.
Bult, C.J., White, O., Olsen, G.J., et al., 1996. Complete genome sequence
of the methanogenic Archaeon, Methanococcus jannaschii. Science
273, 1058–1073.
Cole, S.T., Brosch, R., Parkhill, J., et al., 1998. Deciphering the biology of
Mycobacterium tuberculosis from the complete genome sequence.
Nature 393, 537–544.
Deckert, G., Warren, P.V., Gaasterland, T., et al., 1998. The complete
genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature
392, 353–358.
Dong, H., Nilsson, L., Kurland, C.G., 1996. Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J.
Mol. Biol. 260, 649–663.
Duret, L., 2000. tRNA gene number and codon usage in the C. elegans
genome are co-adapted for the optimal translation of highly expressed
genes. Trends Genet. 16, 287–289.
Fleischmann, R.D., Adams, M.D., White, O., et al., 1995. Whole-genome
random sequencing and assembly of Haemophilus influenzae Rd.
Science 269, 496–512.
Fraser, C.M., Casjens, S., Huang, W.M., et al., 1997. Genomic sequence of
a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580–
586.
Fraser, C.M., Norris, S.J., Weinstock, G.M., et al., 1998. Complete genome
sequence of Treponema pallidum, the syphilis spirochete. Science 281,
375–388.
Glass, J.I., Lefkowitz, E.J., Glass, J.S., Cheryl, R., et al., 2000. The
complete sequence of the mucosal pathogen Ureaplasma urealyticum.
Nature 407, 757–762.
Gouy, M., Gautier, C., 1982. Codon usage in bacteria: correlation with gene
expressivity. Nucleic Acids Res. 10, 7055–7074.
Grantham, R., Gautier, C., Gouy, M., Mercier, R., Pave, A., 1980. Codon
catalog usage and the genome hypothesis. Nucleic Acids Res. 8, r49–
r62.
Guerdoux-Jamet, P., Henaut, A., Nitschke, P., Risler, J., Danchin, A., 1997.
Using codon usage to predict genes origin: is the Escherichia coli outer
membrane a patchwork of products from different genomes? DNA Res.
4, 257–265.
Hayashi, T., Makino, K., Ohnishi, M., Kurokawa, K., et al., 2001. Complete
genome sequence of enterohemorrhagic Escherichia coli O157:H7 and
genomic comparison with a laboratory strain K-12. DNA Res. 8, 11–22.
Heidelberg, J.F., Eisen, J.A., Nelson, W.C., Clayton, R.A., et al., 2000.
DNA sequence of both chromosomes of the cholera pathogen Vibrio
cholerae. Nature 406, 477–483.
Ikemura, T., 1981a. Correlation between the abundance of Escherichia coli
transfer RNAs and the occurrence of the respective codons in its protein
genes. J. Mol. Biol. 146, 1–21.
Ikemura, T., 1981b. Correlation between the abundance of Escherichia coli
transfer RNAs and the occurrence of the respective codons in its protein
genes: a proposal for a synonymous codon choice that is optimal for the
E. coli translational system. J. Mol. Biol. 151, 389–409.
Ikemura, T., 1982. Correlation between the abundance of yeast transfer
RNAs and the occurrence of the respective codons in protein genes:
differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer
RNAs. J. Mol. Biol. 158, 573–597.
Ikemura, T., 1985a. Codon usage and tRNA content in unicellular and
multicellular organisms. Mol. Biol. Evol. 2, 13–34.
Ikemura, T., 1985b. Codon usage, tRNA content, and rate of synonymous
substitution. In: Ohta, T., Aoki, K. (Eds.), Population Genetics and
Molecular Evolution. Japan Scientific Societies Press, Tokyo, pp.
385–406.
Jeltsch, A., Pingoud, A., 1996. Horizontal gene transfer contributes to the
wide distribution and evolution of type II restriction-modification
systems. J. Mol. Evol. 42, 91–92.
Kanaya, S., Kudo, Y., Nakamura, Y., Ikemura, T., 1996a. Detection of
genes in Escherichia coli sequences determined by genome projects
and prediction of protein production levels, based on multivariate diversity in codon usage. Comput. Appl. Biosci. 12, 213–225.
Kanaya, S., Kudo, Y., Suzuki, S., Ikemura, T., 1996b. Systematization of
species-specific diversity of genes in codon usage: comparison of the
diversity among bacteria and prediction of the protein production levels
in cells. In: Akutsu, T., et al. (Eds.), Genome Informatics Series No. 7.
Universal Academy Press, Tokyo, pp. 61–71.
Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T., 1999. Studies of codon
usage and tRNA genes of 18 unicellular organisms and quantification of
Bacillus subtilis tRNAs: gene expression level and species-specific
diversity of codon usage based on multivariate analysis. Gene 238,
143–155.
Kanaya, S., Yamada, Y., Kinouchi, M., Kudo, Y., Ikemura, T., 2001. Codon
usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and CG-dinucleotide usage as assessed
by multivariate analysis. J. Mol. Evol. in press.
Kaneko, T., Sato, S., Kotani, H., et al., 1996. Sequence analysis of the
genome of the unicellular cyanobacterium Synechocystis sp. strain
PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3, 109–136.
Kawarabayasi, Y., Sawada, M., Horikawa, H., et al., 1998. Complete
sequence and gene organization of the genome of a hyper-thermophilic
Archaebacterium, Pyrococcus horikoshii OT3. DNA Res. 5, 55–76.
Kawarabayasi, Y., Hino, Y., Horikawa, H., et al., 1999. Complete genome
sequence of an aerobic hyper-thermophilic Crenarchaeon, Aeropyrum
pernix K1. DNA Res. 6, 83–101.
Klenk, H., Clayton, R.A., Tomb, J., et al., 1997. The complete genome
sequence of the hyperthermophilic, sulphate-reducing archaeon
Archaeoglobus fulgidus. Nature 390, 364–370.
Kohonen, T., 1982. Self-organized formation of topologically correct
feature maps. Biol. Cybern. 43, 59–69.
Kohonen, T., 1990. The self-organizing map. Proc. IEEE 78, 1464–1480.
Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J., 1996. Engineering
applications of the self-organizing map. Proc. IEEE 84, 1358–1384.
Kunisawa, T., 1992. Synonymous codon preferences in bacteriophage T4: a
distinctive use of transfer RNAs from T4 and from its host Escherichia
coli. J. Theor. Biol. 159, 287–298.
Kunisawa, T., Kanaya, S., Kutter, E., 1998. Comparison of synonymous
codon distribution patterns of bacteriophage and host genomes. DNA
Res. 5, 319–326.
Kunst, F., Ogasawara, N., Moszer, I., et al., 1997. The complete genome
sequence of the Gram-positive bacterium Bacillus subtilis. Nature 390,
249–256.
Lawrence, J.G., Ochman, H., 1997. Amelioration of bacterial genomes:
rates of change and exchange. J. Mol. Evol. 44, 383–397.
Lawrence, J.G., Ochman, H., 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95, 9413–9417.
Medigue, C., Rouxel, T., Vigier, P., Henaut, A., Danchin, A., 1991.
Evidence for horizontal gene transfer in Escherichia coli speciation.
J. Mol. Biol. 222, 851–856.
Muto, A., Osawa, S., 1987. The guanine and cytosine content of genomic
DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84, 166–
169.
Nakayama, K., Kanaya, S., Ohnishi, M., Terawaki, Y., Hayashi, T., 1999.
The complete nucleotide sequence of φCTX, a cytotoxin-converting
phage of Pseudomonas aeruginosa: implications for phage evolution
and horizontal gene transfer via bacteriophages. Mol. Microbiol. 31,
399–419.
Nakayama, K., Takashima, K., Ishihara, H., Shinomiya, T., Kageyama, M.,
Kanaya, S., Ohnishi, M., Murata, T., Mori, H., Hayashi, T., 2000. The
R-type pyocin of Pseudomonas aeruginosa is related to P2 phage, and
the F-type is related to lambda phage. Mol. Microbiol. 38, 213–231.
Nelson, K.E., Clayton, R.A., Gill, S.R., Gwinn, M., et al., 1999. Evidence for
lateral gene transfer between Archaea and Bacteria from genome
sequence of Thermotoga maritima. Nature 399, 323–329.
Ng, W.V., Kennedy, S.P., Mahairas, G.G., Berquist, B., Pan, M., et al.,
2000. Genome sequence of Halobacterium
species NRC-1. Proc. Natl. Acad. Sci. USA 97, 12176–12181.
Ochman, H., Lawrence, J.G., Groisman, E.A., 2000. Lateral gene transfer
and the nature of bacterial innovation. Nature 405, 299–304.
Osawa, S., 1995. Evolution of the Genetic Code. Oxford University Press,
Oxford.
Parkhill, J., Wren, B.W., Mungall, K., Ketley, J.M., et al., 2000. The
genome sequence of the food-borne pathogen Campylobacter jejuni
reveals hypervariable sequences. Nature 403, 665–668.
Percudani, R., Pavesi, A., Ottonello, S., 1997. Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J. Mol.
Biol. 268, 322–330.
Perna, N.T., Plunkett III, G., Burland, V., Mau, B., et al., 2001. Genome
sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature
409, 529–533.
Pouwels, P.H., Leunissen, J.A.M., 1994. Divergence in codon usage of
Lactobacillus species. Nucleic Acids Res. 22, 929–936.
Sharp, P.M., Matassi, G., 1994. Codon usage and genome evolution. Curr.
Opin. Genet. Dev. 4, 851–860.
Shigenobu, S., Watanabe, H., Hattori, M., Sakaki, Y., Ishikawa, H., 2000.
Genome sequence of the endocellular bacterial symbiont of aphids
Buchnera sp. APS. Nature 407, 81–86.
Simpson, A.J.G., Reinach, F.C., Arruda, P., Abreu, F.A., Acencio, M., et al., 2000.
The genome sequence of the plant pathogen Xylella fastidiosa. Nature
406, 151–157.
Smith, D.R., Doucette-Stamm, L.A., Deloughery, C., et al., 1997.
Complete genome sequence of Methanobacterium thermoautotrophicum ΔH: functional analysis and comparative genomics. J. Bacteriol.
179, 7135–7155.
Stephens, R.S., Kalman, S., Lammel, C., et al., 1998. Genome sequence of
an obligate intracellular pathogen of humans: Chlamydia trachomatis.
Science 282, 754–759.
Stover, C.K., Pham, X.Q., Erwin, A.L., Mizoguchi, S.D., et al., 2000.
Complete genome sequence of Pseudomonas aeruginosa PAO1, an
opportunistic pathogen. Nature 406, 959–964.
Su, C., Brandt, L.J., 1995. Escherichia coli O157:H7 infection in humans.
Ann. Intern. Med. 123, 698–714.
Sueoka, N., 1992. Directional mutation pressure, selective constraints, and
genetic equilibria. J. Mol. Evol. 34, 95–114.
Takami, H., Nakasone, K., Takaki, Y., Maeno, G., et al., 2000. Complete
genome sequence of the alkaliphilic bacterium Bacillus halodurans and
genomic sequence comparison with Bacillus subtilis. Nucleic Acids
Res. 28, 4317–4331.
Tettelin, H., Saunders, N.J., Heidelberg, J., Jeffries, A.C., et al., 2000.
Complete genome sequence of Neisseria meningitidis Serogroup B
strain MC58. Science 287, 1809–1815.
Tomb, J., White, O., Kerlavage, A.R., et al., 1997. The complete genome
sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539–
547.
von der Malsburg, C., 1973. Self-organization of orientation sensitive cells
in the striate cortex. Kybernetik 14, 85–100.
White, O., Eisen, J.A., Heidelberg, J.F., Hickey, E.K., et al., 1999. Genome
sequence of the radioresistant bacterium Deinococcus radiodurans R1.
Science 286, 1571–1577.
Wolf, Y.I., Aravind, L., Koonin, E.V., 1999. Rickettsiae and Chlamydiae:
evidence of horizontal gene transfer and gene exchange. Trends Genet.
15, 173–175.