Analysis of codon usage diversity of bacterial genes with a self
Gene 276 (2001) 89–99
www.elsevier.com/locate/gene

Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome

Shigehiko Kanaya a,b,c, Makoto Kinouchi a,b, Takashi Abe a,d, Yoshihiro Kudo e, Yuko Yamada e, Tatsuya Nishi d, Hirotada Mori b,c, Toshimichi Ikemura f,*

a Department of Bio-System Engineering, Faculty of Engineering, Yamagata University, Yonezawa, Yamagata-ken 992-8510, Japan
b CREST JST (Japan Science and Technology), Tsukuba, Japan
c Research and Education Center for Genetic Information, Nara Institute of Science and Technology, 8916-5, Takayama, Ikoma, Nara-ken 630-0101, Japan
d Xanagen Inc., Sakado, Takatsu-ku, Kawasaki, Kanagawa-ken 213-0012, Japan
e Department of Biochemistry, Jichi Medical School, Kawachi-gun, Tochigi-ken 329-0498, Japan
f Division of Evolutionary Genetics, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411-8540, Japan

Received 14 April 2001; received in revised form 12 June 2001; accepted 10 August 2001
Received by G. Bernardi

Abstract

With increases in the amounts of available DNA sequence data, it has become increasingly important to develop tools for comprehensive systematic analysis and comparison of species-specific characteristics of protein-coding sequences for a wide variety of genomes. In the present study, we used a novel neural-network algorithm, a self-organizing map (SOM), to efficiently and comprehensively analyze codon usage in approximately 60,000 genes from 29 bacterial species simultaneously. This SOM makes it possible to cluster and visualize genes of individual species separately at a much higher resolution than can be obtained with principal component analysis. The organization of the SOM can be explained by the genome G + C% and tRNA compositions of the individual species. We used SOM to examine codon usage heterogeneity in the E.
coli O157 genome, which contains ‘O157-unique segments’ (O-islands), and showed that SOM is a powerful tool for characterization of horizontally transferred genes. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Codon usage; Self-organizing map; E. coli O157; Horizontally transferred gene

Abbreviations: O-islands, O157-unique segments; PCA, principal component analysis; SOM, self-organizing map
* Corresponding author. Tel.: +81-559-81-6788; fax: +81-559-81-6794. E-mail address: [email protected] (T. Ikemura).

1. Introduction

With progress in the genome projects, a vast amount of nucleotide sequence data is now available. Multivariate analysis methods such as factor correspondence analysis and principal component analysis (PCA) have been used to systematically study heterogeneous codon usage in a wide variety of species (Grantham et al., 1980; Medigue et al., 1991; Sharp and Matassi, 1994; Pouwels and Leunissen, 1994; Andersson and Sharp, 1996; Kanaya et al., 1996a; Guerdoux-Jamet et al., 1997; Kunst et al., 1997). To characterize species-specific heterogeneity in codon usage, we previously developed a measure denoted as Z1 that is based on the widest range of the axis obtained by PCA of codon usage patterns (Kanaya et al., 1996a,b, 1999, 2001; Nakayama et al., 1999, 2000). Conventional multivariate analysis methods such as PCA are useful for analyzing codon usage within one or a small number of species, but their resolving power is rather poor when a large number of genes from many species are analyzed simultaneously. Here, we introduce a novel neural-network algorithm with high resolving power, a self-organizing map (SOM), and use it to analyze codon usage in approximately 60,000 genes from 29 bacterial species simultaneously. SOM neural networks were proposed by von der Malsburg (1973) and Kohonen (1982) and make it possible to visualize high-dimensional systems (reviewed in Kohonen et al., 1996).
This method can be used to identify categories from raw data with high resolving power and to trace factors reflected in individual categories.

In bacterial genomes, codon usage varies within and between species. Our group and others have shown that, in unicellular organisms, choice among synonymous codons for highly expressed genes is typically dependent on levels of isoaccepting tRNAs (Ikemura, 1981a,b, 1982; Dong et al., 1996; Percudani et al., 1997; Kanaya et al., 1999, 2001) and therefore the extent of codon bias for each gene is associated with the level of protein production (Ikemura, 1981a,b, 1982, 1985a,b; Gouy and Gautier, 1982; Medigue et al., 1991; reviewed in Andersson and Kurland, 1990; Kunisawa, 1992; Sharp and Matassi, 1994). Codon usage diversity is also affected by G + C% of the genome (Bernardi and Bernardi, 1985; Muto and Osawa, 1987; Sueoka, 1992; Osawa, 1995). These characteristics of codon usage heterogeneity could explain the organization of the SOM obtained in this study.

Genes introduced through horizontal transfer from distantly related organisms are known to retain the sequence characteristics of the donor genome and can be distinguished from those of the acceptor genome (Jeltsch and Pingoud, 1996; Lawrence and Ochman, 1998). The present study showed that SOM is an efficient tool for characterizing horizontally transferred genes and predicting the donor/acceptor relationship with respect to the transferred genes. We applied this method to characterize codon usage heterogeneity in the E. coli O157 genome, which contains unique segments, including O-islands (Perna et al., 2001), that are absent in E. coli K12.

0378-1119/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0378-1119(01)00673-4

2.
Materials and methods

To exclude the effects of gene size, amino acid composition, and codon box number, the codon frequency of the kth gene for the v(m)th codon, $x_{kv(m)}$, was calculated as

$$x_{kv(m)} = f_{kv(m)} \Big/ \left[ \sum_{v=1}^{M(m)} f_{kv(m)} \Big/ M(m) \right] \qquad (1)$$

where $f_{kv(m)}$ denotes the number of occurrences of the v(m)th synonymous codon for the mth amino acid, and $M(m)$ denotes the codon box number (the number of synonymous codons summed over for that amino acid). The codon usage pattern for the kth gene is described by a vector $\mathbf{x}_k$ consisting of the $x_{kv(m)}$.

The SOM is a neural-network algorithm that implements a characteristic nonlinear projection from the high-dimensional space of input signals onto a low-dimensional array of weights (reviewed in Kohonen et al., 1996). The weights ($\mathbf{w}_{ij}$) in the codon frequency space are arranged in a two-dimensional lattice denoted by $i$ (= 0, 1, …, $I - 1$) and $j$ (= 0, 1, …, $J - 1$). The learning process of the SOM algorithm in the present study is independent of the order of input vectors (Abe et al., 1999). In the original method, the initial weight vectors $\mathbf{w}_{ij}$ are set to random values (Kohonen, 1990; Kohonen et al., 1996), but in the present method the vectors are initialized by PCA, which is a statistical method that performs linear mapping to extract optimal features from an input distribution in the mean squared error sense and can be used by self-organizing neural networks to form unsupervised neural preprocessing modules for classification problems (Kohonen et al., 1996). Hence, the initial weight vectors ($\mathbf{w}_{ij}$) are set based on the widest scale of the gene distribution in the codon frequency space with PCA. Weights in the first dimension are arranged into 200 lattices ($I = 200$) corresponding to the width of five times the standard deviation ($5\sigma_1$) of the first principal component. The second dimension ($J$) is defined by the nearest integer greater than $I\sigma_2/\sigma_1$.
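As an illustration of the codon frequency in Eq. (1), the following minimal Python sketch computes $x_{kv(m)}$ for a single coding sequence. The function name `codon_frequencies`, the three-amino-acid table `SYNONYMOUS`, and the choice to skip amino acids that do not occur in the gene are assumptions made for this example; the text does not specify how absent amino acids are handled.

```python
from collections import Counter

# Illustrative subset of the standard genetic code (an assumption for this
# sketch); a real analysis would list every amino acid with M(m) > 1.
SYNONYMOUS = {
    "Phe": ["TTT", "TTC"],
    "Lys": ["AAA", "AAG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
}

def codon_frequencies(cds, table=SYNONYMOUS):
    """Return {codon: x} with x = f / (sum of f over the codon box / M(m))."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    x = {}
    for amino_acid, box in table.items():
        total = sum(counts[c] for c in box)
        if total == 0:
            continue  # amino acid absent from this gene: left undefined here
        mean_count = total / len(box)         # sum_v f / M(m)
        for codon in box:
            x[codon] = counts[codon] / mean_count
    return x

if __name__ == "__main__":
    print(codon_frequencies("TTTTTTTTCAAAAAGGCTGCTGCCGCA"))
```

With this normalization a codon used exactly as often as the average codon in its box scores 1, and the values within one box sum to $M(m)$, which is what removes the dependence on gene length and amino acid composition.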
The weight vector on the ijth lattice is represented as follows:

$$\mathbf{w}_{ij} = \mathbf{x}_{av} + 5\sigma_1 \left[ \mathbf{b}_1 (i - I/2)/I + \mathbf{b}_2 (j - J/2)/J \right] \qquad (2)$$

Here, $\mathbf{x}_{av}$ is the average vector for codon usage patterns; $\mathbf{b}_1$ and $\mathbf{b}_2$ are the eigenvectors for the first and second principal components. In Step 2, the Euclidean distances between the input vector $\mathbf{x}_k$ and the weight vectors $\mathbf{w}_{ij}$ are calculated, and $\mathbf{x}_k$ is classified into the weight vector (called $\mathbf{w}_{i'j'}$) with the smallest distance among them. After classifying all input vectors into the weight vectors, an updating process is done according to Step 3. In Step 3, the ijth weight vector is updated with

$$\mathbf{w}_{ij}^{(new)} = \mathbf{w}_{ij} + \alpha(r) \left( \sum_{\mathbf{x}_k \in S_{ij}} \mathbf{x}_k / N_{ij} - \mathbf{w}_{ij} \right) \qquad (3)$$

Here, the components of the set $S_{ij}$ are the input vectors classified into $\mathbf{w}_{i'j'}$ satisfying $i - \beta(r) \le i' \le i + \beta(r)$ and $j - \beta(r) \le j' \le j + \beta(r)$. The two parameters $\alpha(r)$ ($0 < \alpha(r) < 1$) and $\beta(r)$ are learning coefficients for the rth cycle, and $N_{ij}$ is the number of components of $S_{ij}$. In the present study, $\alpha(r)$ and $\beta(r)$ are set by

$$\alpha(r) = \max\{0.01,\ \alpha(1)(1 - r/T)\} \qquad (4)$$

$$\beta(r) = \max\{0,\ \beta(1) - r\} \qquad (5)$$

where $\alpha(1)$ and $\beta(1)$ are the initial values for the T-cycle of the learning process. In the present study, we selected 100 for $T$, 0.5 for $\alpha(1)$, and $I/4$ for $\beta(1)$. The learning process is monitored by the total distance between $\mathbf{x}_k$ and the nearest weight vector $\mathbf{w}_{i'j'}$, represented as

$$Q(r) = \sum_{k=1}^{N} \left\{ \left\| \mathbf{x}_k - \mathbf{w}_{i'j'} \right\|^2 \right\} \qquad (6)$$

where $N$ is the total number of genes analyzed.

3. Results and discussion

3.1. Species-specific codon usage visualized by SOM

SOM was constructed with 59,122 genes, each containing at least 100 codons, from the 29 bacterial species whose complete genomic sequences have been determined: Aquifex aeolicus (Deckert et al., 1998), Archaeoglobus fulgidus (Klenk et al., 1997), Aeropyrum pernix (Kawarabayasi et al., 1999), Bacillus subtilis (Kunst et al., 1997), Bacillus halodurans (Takami et al., 2000), Borrelia burgdorferi (Fraser et al., 1997), Buchnera sp.
(Shigenobu et al., 2000), Campylobacter jejuni (Parkhill et al., 2000), Chlamydia trachomatis (Stephens et al., 1998), Chlamydia pneumoniae (Parkhill et al., 2000), Deinococcus radiodurans (White et al., 1999), E. coli (Blattner et al., 1997), Haemophilus influenzae (Fleischmann et al., 1995), Halobacterium sp. (Ng et al., 2000), Helicobacter pylori (Tomb et al., 1997), Methanococcus jannaschii (Bult et al., 1996), Methanobacterium thermoautotrophicum (Smith et al., 1997), Mycobacterium tuberculosis (Cole et al., 1998), Neisseria meningitidis (Tettelin et al., 2000), Pseudomonas aeruginosa (Stover et al., 2000), Pyrococcus abyssi (Heilig, unpublished data; GenBank Accession number, AL096836), Pyrococcus horikoshii (Kawarabayasi et al., 1998), Rickettsia prowazekii (Andersson et al., 1998), Synechocystis sp. (Kaneko et al., 1996), Thermotoga maritima (Nelson et al., 1999), Treponema pallidum (Fraser et al., 1998), Ureaplasma urealyticum (Glass et al., 2000), Vibrio cholerae (Heidelberg et al., 2000), and Xylella fastidiosa (Simpson et al., 2000).

As the first step, to obtain the initial weight vectors (see Section 2), codon usage for these 59,122 genes was analyzed by PCA. After the learning process of the 100th cycle, codon usage of the genes was effectively reflected in the weight vectors. The learning process was monitored by Q(r) in Eq. (6). Total error by Eq. (6) decreased from $Q(1) = 1.13 \times 10^6$ at the initial cycle to $Q(100) = 3.39 \times 10^5$ at the 100th cycle. Comparison of gene classification into lattice points by the initial vectors (Fig. 1a) with the classification by the final vectors (Fig. 1b) showed clearly that genes within a single species were much more tightly clustered with the final vectors. Lattices that include genes from a single species are indicated in color, and those including genes of more than one species are indicated in black.
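The learning procedure of Section 2 (PCA initialization by Eq. (2), classification in Step 2, the batch update of Eq. (3) with the schedules of Eqs. (4) and (5), and monitoring by Eq. (6)) can be sketched in NumPy roughly as follows. This is a toy-scale illustration, not the authors' program: the function names `pca_init` and `train_batch_som` are hypothetical, "nearest integer greater than" is read as floor + 1, lattice points whose set $S_{ij}$ is empty are left unchanged (the text does not say what happens in that case), and the dense distance matrix used here would be far too large for 59,122 genes on a 200-wide lattice.

```python
import numpy as np

def pca_init(X, I):
    """Eq. (2): lay the initial weights on the plane of the first two PCs."""
    x_av = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X - x_av, rowvar=False))
    order = np.argsort(evals)[::-1][:2]            # two largest eigenvalues
    s1, s2 = np.sqrt(evals[order])
    b1, b2 = evecs[:, order[0]], evecs[:, order[1]]
    J = int(np.floor(I * s2 / s1)) + 1             # assumed reading of the rule
    i = np.arange(I)[:, None, None]
    j = np.arange(J)[None, :, None]
    return x_av + 5 * s1 * (b1 * (i - I / 2) / I + b2 * (j - J / 2) / J)

def train_batch_som(X, I=20, T=100, alpha1=0.5):
    """Return (weights of shape (I, J, D), list of Q(r) for r = 1..T)."""
    W = pca_init(X, I)
    I, J, D = W.shape
    beta1 = I // 4
    history = []
    for r in range(1, T + 1):
        # Step 2: classify every input vector into its nearest weight vector.
        d2 = ((X[:, None, :] - W.reshape(-1, D)[None, :, :]) ** 2).sum(axis=2)
        win = d2.argmin(axis=1)
        history.append(float(d2[np.arange(len(X)), win].sum()))  # Q(r), Eq. (6)
        wi, wj = np.divmod(win, J)
        alpha = max(0.01, alpha1 * (1 - r / T))                  # Eq. (4)
        beta = max(0, beta1 - r)                                 # Eq. (5)
        # Step 3: batch update, Eq. (3); all updates use the old weights.
        W_new = W.copy()
        for i in range(I):
            for j in range(J):
                in_S = (np.abs(wi - i) <= beta) & (np.abs(wj - j) <= beta)
                if in_S.any():
                    W_new[i, j] = W[i, j] + alpha * (X[in_S].mean(axis=0) - W[i, j])
        W = W_new
    return W, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 6)) * np.array([3, 2, 1, 1, 1, 1])
    W, Q = train_batch_som(X, I=10, T=30)
    print(W.shape, Q[0], Q[-1])
```

Because every weight is updated from the complete classification of a cycle rather than one input at a time, the result does not depend on the order in which genes are presented, which is the property attributed to the method of Abe et al. (1999) in Section 2.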
It is apparent that the resolving power of conventional PCA, depicted in Fig. 1a, is poor and that PCA is of little use for comprehensive analysis and comparison of a large number of genes from multiple genomes.

3.2. SOM organization with respect to genome G + C% and taxonomic relationships

Analysis of the raw data and raw vectors revealed that the clustered genes have very similar patterns of codon usage. Neighboring weight vectors in SOM tend to be similar, and distantly separated weight vectors tend to be different. Strongly biased weights are also known to localize to the edge of the SOM, whereas weakly biased weights tend to localize in the center. This suggests that species with strong codon biases are located at the edge of this map. In fact, genes of U. urealyticum (genome G + C%, 25.5%), Buchnera sp. (26.4%), R. prowazekii (29.0%), C. jejuni (30.5%), and M. jannaschii (31.4%) were distributed on the left side of the SOM, and those of P. aeruginosa (66.6%), D. radiodurans (66.6%), and Halobacterium sp. (66.6%) were distributed on the right side (Fig. 1b). Genome G + C% increases from left to right and thus is reflected mainly in the horizontal axis. Fig. 2 illustrates the configuration of individual species, with their genome G + C% values listed. On the vertical axis, genes for Archaea (M. jannaschii, P. horikoshii, P. abyssi, A. pernix, M. thermoautotrophicum, A. fulgidus, and Halobacterium sp.) were distributed at the bottom, and those for γ- and β-Proteobacteria (Buchnera sp., H. influenzae, X. fastidiosa, V. cholerae, E. coli, P. aeruginosa, and N. meningitidis) were distributed in the upper part. This shows that the SOM globally reflects the taxonomic relationships. Two thermophilic bacteria, T. maritima and A. aeolicus, were located close to Archaea, indicating that the thermophilic bacteria have codon usage patterns similar to those of Archaea.
This may be related to the observation that each of these thermophilic bacteria contains a large number of genes that are most similar to those of thermophilic Archaea (Ochman et al., 2000).

3.3. SOM organization and tRNA gene number

The relative proportions of isoaccepting tRNAs in cells are important factors that influence synonymous codon choice in genes of unicellular organisms; codon usage in highly expressed genes is typically dependent on tRNA content (Ikemura, 1981a,b, 1982; Kanaya et al., 1999). Cellular tRNA contents are known to be related to copy numbers of tRNA genes (Ikemura, 1981a,b; Dong et al., 1996; Duret, 2000; Kanaya et al., 1999). To investigate SOM classification from the viewpoint of the levels of isoaccepting tRNAs, we examined tRNA genes of individual species (Table 1). The results shown in Table 1 and those of a previous study (Kanaya et al., 1999) indicate that increases in tRNA genes may occur in two ways. One is multiplication of the gene encoding the tRNA with one of the anticodons specific for each amino acid, which results in a clear difference in tRNA levels between the major and minor isoaccepting tRNAs. This phenomenon has been observed for bacteria with large genomes such as N. meningitidis, P. aeruginosa, E. coli, V. cholerae, B. halodurans, B. subtilis, and H. influenzae. Species in which multiplication of tRNA genes is observed for more than ten anticodon types are indicated by blue dots in Fig. 2. These species are located primarily in the right and upper zones of the SOM shown in Fig. 2. In the second mechanism, there are single copies of tRNA genes encoding a variety of anticodons specific for one amino acid; the species that have more than 40 anticodon types are indicated by red dots in Fig. 2. This is often observed for bacteria with high G + C% genomes. M. tuberculosis, D. radiodurans, T. maritima, and most of the Archaea belong to this category.
The remaining bacterial species, which are not indicated by colored dots, are located primarily in the left and upper zones of the SOM in Fig. 2. Bacteria with low G + C% genomes tend to have small numbers of isoacceptor species and, therefore, belong to this residual class. It is notable that Buchnera sp. and R. prowazekii, which are taxonomically distant from each other but possess almost identical isoacceptor sets, are located close to each other in the SOM. A similar observation was also made for P. horikoshii, P. abyssi, T. maritima, and A. fulgidus. Collectively, these findings support the view that species-specific codon usage in bacterial genomes is determined primarily by genome G + C% and compositions of isoaccepting tRNAs and that SOM organization is reflective of these two factors.

Fig. 1. Gene classification by (a) initial weights and (b) final weights: A. aeolicus (abbreviated as Aaeo), A. fulgidus (Aful), A. pernix (Aper), B. subtilis (Bsub), B. halodurans (Bhal), B. burgdorferi (Bbur), Buchnera sp. (Buch), C. jejuni (Cjej), C. trachomatis and C. pneumoniae (Chla), D. radiodurans (Drad), E. coli (Ecol), H. influenzae (Hinf), Halobacterium sp. (Halo), H. pylori (Hpyl), M. jannaschii (Mjan), M. thermoautotrophicum (Mthe), M. tuberculosis (Mtub), N. meningitidis (Nmen), P. aeruginosa (Paer), P. abyssi (Paby), P. horikoshii (Phor), R. prowazekii (Rpro), Synechocystis sp. (Syne), T. maritima (Tmar), T. pallidum (Tpal), U. urealyticum (Uure), V. cholerae (Vcho), and X. fastidiosa (Xfas). Archaea, eubacteria and two thermophilic bacteria are denoted by yellow, green, and blue letters, respectively. The configuration of bacterial species in (b) is depicted in Fig. 2. These SOM results are available on the Xanagen Inc. website (URL http://www.xanagen.com).

Fig. 2. Configuration of bacterial species in SOM. Genome G + C% values are indicated in black letters.
Abbreviations for bacterial species correspond to those in Fig. 1b.

Some species have adapted independently to a common lifestyle: archaeal and bacterial hyperthermophiles (Aravind et al., 1998), for example, and the intracellular pathogens Rickettsia and Chlamydia (Wolf et al., 1999). Species within the respective groups show similar codon usage and tRNA compositions and are located close to each other in the SOM.

3.4. Donor/acceptor relationship in horizontal gene transfer

Foreign genes can be identified by their atypical nucleotide compositions and codon usage patterns (Lawrence and Ochman, 1998). Parasites and hosts often have coding strategies that can be distinguished (Grantham et al., 1980), and foreign-type genes such as genes of transposons, plasmids, and viruses often have codon usage patterns that are quite different from those of the hosts (Medigue et al., 1991). Therefore, codon usage data have been used to identify which genes were transferred horizontally from other genomes (Lawrence and Ochman, 1997, 1998; Nakayama et al., 1999, 2000; Kunisawa et al., 1998). In Fig. 1b, a black lattice within a species-specific colored territory indicates the presence of genes with codon usage that is atypical in its own genome but similar to that of the species represented by the color. This invasion of alien genes into species-specific colored territories in the SOM may provide information concerning the donor/acceptor relationship in horizontal gene transfer. Table 2 lists the numbers of genes present in the territories of different species in the SOM shown in Fig. 1b. For example, 273 E. coli genes are located in the V. cholerae territory, and 116 V. cholerae genes are located in the E. coli territory. A relatively large number of B. subtilis genes have codon usage patterns that differ from the intrinsic pattern, and many genes from various eubacteria have patterns similar to that of B. subtilis.
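The kind of cross-tabulation reported in Table 2 can be sketched as follows. The inputs are assumptions for illustration: `gene_species` maps each gene to the genome it comes from, and `gene_territory` maps each gene to the species whose SOM territory contains the lattice point the gene was classified into (or `None` if the lattice point belongs to no territory). How a territory is delimited on the map is not reproduced here, and counting AGN as every gene that lands in another species' territory is this sketch's reading of "genes with atypical codon usage", not a rule stated in the text.

```python
from collections import Counter, defaultdict

def territory_table(gene_species, gene_territory):
    """Tabulate, per source genome, how many genes fall in each territory."""
    counts = defaultdict(Counter)
    for gene, own in gene_species.items():
        counts[own][gene_territory.get(gene)] += 1
    summary = {}
    for own, row in counts.items():
        gn = sum(row.values())                      # total genes analyzed
        agn = sum(n for terr, n in row.items()      # genes in a foreign territory
                  if terr is not None and terr != own)
        summary[own] = {"GN": gn, "AGN": agn, "by_territory": dict(row)}
    return summary

if __name__ == "__main__":
    species = {"g1": "Ecol", "g2": "Ecol", "g3": "Ecol", "g4": "Vcho"}
    territory = {"g1": "Ecol", "g2": "Vcho", "g3": "Ecol", "g4": "Ecol"}
    for name, row in territory_table(species, territory).items():
        print(name, row)
```

Reading a row of such a table gives the candidate donors for one genome (where its atypical genes landed), while reading a column gives the genomes whose atypical genes resemble a given species.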
These findings may provide fundamental knowledge for studies of the donor/acceptor relationship in horizontal gene transfer. This possibility was tested for E. coli O157 and is described in the following section.

3.5. Codon usage heterogeneity in the E. coli O157 genome

E. coli O157:H7 is a global threat to public health and has been implicated in many outbreaks of hemorrhagic colitis. The severity of symptoms, the lack of effective treatments, and the potential for large-scale outbreaks from contaminated foods have propelled intensive research into this organism (Su and Brandt, 1995). The genomic sequence has been determined by two groups (Perna et al., 2001; Hayashi et al., 2001). The E. coli O157:H7 genome is much larger than that of E. coli K12; approximately 4 Mb of sequence is homologous between E. coli K12 and E. coli O157, but sequences such as O-islands are present only in E. coli O157. We examined codon usage heterogeneity in approximately 1000 O157-specific genes with the SOM and Z1 parameter analyses. The Z1 parameter is an index of the heterogeneity of codon usage (Kanaya et al., 1996a, 1999, 2001). A large, positive Z1 value indicates high adaptation of codon usage to the translation system, and a negative Z1 represents low adaptation; foreign genes

Table 1 Number of tRNA genes for individual anticodons a
a E. coli O157 genome sequences reported by Perna et al. (2001) and Hayashi et al. (2001) are abbreviated as O157P and O157H, respectively. C. trachomatis and C. pneumoniae are abbreviated as Ctra and Cpne, respectively. Abbreviations for other species correspond to those in Fig. 1. uua and uca correspond to anticodons for selenocysteine.

Table 2 The number of genes in their own territory and of genes with atypical codon usage that invaded territories of other species in the SOM in Fig.
1b a Archaea Eubacteria Aful Aper Drad Halo Mjan Mthe Paby Phor Aaeo Tmar Bbur Bhal Bsub Buch Cjej Cpne Ctra Ecol O157P Hinf Hpyl Mtub Nmen Paer Rpro Syne Tpal Uure Vcho Xfas A. fulgidus A. pernix D. radiodurans Halobacterium sp. M. jannaschii M. thermoautotrophicum P. abyssi P. horikoshii A. aeolicus T. maritima B. burgdorferi B. halodurans B. subtilis Buchnera sp. C. jejuni Chlamydia E. coli H. influenzae H. pylori M. tuberculosis N. meningitidis P. aeruginosa R. prowazekii Synechocystis sp. T. pallidum U. urealyticum V. cholerae X. fastidiosa GN AGN 1602 112 6 8 18 45 19 15 34 3 1 1 3 6 35 2038 94 36 6 26 30 54 5 5 3 2 1 7 9 5 4 1672 21 155 2 3 3 2 22 32 271 1 14 2 1 38 67 31 216 91 1529 57 337 169 121 86 8 9 4 8 19 2 1 1 2 11 1 30 3 1063 7 21 5 5 64 44 157 7 40 2 1 28 51 84 33 1 1137 16 13 5 18 29 11 5 874 326 79 5 1 1 92 79 5 1 28 59 499 1324 63 48 14 42 83 12 23 10 15 35 11 20 29 31 1191 114 2 1 118 29 3 2 30 8 12 12 1428 4 81 6 12 1 70 1 24 1 2 4 10 395 6 101 4 29 4 12 21 3 7 1 1 6 12 2 1 2938 410 1 66 58 131 143 10 22 246 12 22 7 2 24 4 93 181 1572 7 15 13 14 111 176 5 1 2 8 10 400 35 4 1 1 3 1 12 15 1 32 7 15 10 1149 7 5 5 1 9 5 1 3 3 2 17 4 6 17 63 84 3 17 789 702 26 59 13 3 49 3032 3324 1 56 23 3 19 26 18 37 44 2 6 36 1 1 1 6 1 3 10 91 2 4 19 31 4 2 330 114 45 1 4 1 57 18 254 4 12 4 1 27 74 25 2 1 4 1 5 32 46 1 3 201 18 91 1 1 4 2 3 9 89 2 8 3 12 9 7 1 8 8 28 61 99 12 3 6 17 1 2 6 1 4 3 1 1 30 35 208 8 5 2 36 55 31 62 1 11 5 1 1 21 12 19 1 6 1 1 17 46 5 2 6 20 8 1 1 1 3 2 1 48 44 4 6 273 298 13 3 1 6 12 36 53 2028 2611 2927 1761 1542 1644 1692 2001 1499 1684 772 3541 3627 523 1491 968 833 3907 4587 426 573 1255 232 479 507 818 677 308 256 382 603 2055 123 342 179 131 875 1263 a 1 1 2 131 55 174 5 56 30 29 1 2 17 11 15 3 40 10 1 55 10 47 17 19 24 96 13 8 218 16 8 16 1 8 1293 10 5 1033 43 6 44 2535 1 28 14 241 37 1 7 34 4 1 4 78 20 44 23 3 2 24 1572 1371 3676 279 338 1141 4 50 1 242 2 50 5 25 27 6 3 6 57 6 13 3 49 116 1 7 67 12 7 58 15 2 6 70 32 16 9 8 16 
40 350 2 1223 11 1 5 4210 12 570 31 4 3 6 45 1 2 14 39 2 68 20 1678 5255 773 455 1045 203 1 9 17 26 65 10 5 1 57 5 1 16 1 2 11 50 66 3 9 66 3 19 14 127 6 17 6 2286 5 3 44 9 2909 623 11 1 4 24 3 15 15 165 2 3 35 6 4 9 6 11 21 2 1 6 540 1 3 11 1 481 19 33 919 565 379 84 3 5 16 1 12 8 22 54 7 16 19 9 2 5 177 60 1 3 68 116 61 12 17 7 8 18 45 15 2 2495 68 3236 741 9 8 60 56 1 3 32 59 10 12 73 59 48 2 19 37 1 130 1310 2045 735
SOM territory
GN and AGN represent total gene numbers analyzed and the numbers of genes with atypical codon usage, respectively. Abbreviations of species correspond to those in Table 1 and Fig. 1.

Fig. 3. Histogram analysis of Z1 values of E. coli K12 and O157 genes. Ribosomal protein genes (Rp) are shown as an example of genes highly adapted to the translation system.

tend to have negative Z1 values regardless of the gene expression level (Kanaya et al., 1996a, 1999, 2001). In the following analysis, parameters for calculation of Z1 values were obtained using E. coli K12 genes with more than 100 codons. Then, Z1 values of E. coli K12 and E. coli O157 genes thus calculated were compared (Fig. 3). E. coli O157 had a larger number of genes with negative Z1 values, and O157-specific genes tended mostly to have negative Z1 values. This supports the notion that many O157-specific genes were transferred horizontally into the E. coli genome (Perna et al., 2001; Hayashi et al., 2001). The distribution of Z1 values for E. coli O157 genes is shown in Fig. 4. Of particular interest are genes located in five O-islands (OI#8, OI#84, OI#106, OI#115, and OI#148 as described by Perna et al., 2001) that have negative Z1 values and thus codon usage patterns very different from those of E. coli K12 genes. The O157 genes were then mapped to the SOM territories of individual species in Fig. 1b and are presented in Fig. 4.
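The upper panel of Fig. 4 plots Z1 averaged over a window of 11 genes moved along the genome one gene at a time. A minimal sketch of that smoothing step is given below; the function name `windowed_mean` is hypothetical, the Z1 values themselves are taken as given (their calculation is described in the cited Kanaya et al. papers, not here), and the sketch assumes no wrap-around at the ends of the gene list, which the caption does not specify.

```python
def windowed_mean(values, window=11):
    """Mean of each run of `window` consecutive values, step size one."""
    if window < 1 or window > len(values):
        return []
    running = sum(values[:window])
    means = [running / window]
    for k in range(window, len(values)):
        running += values[k] - values[k - window]   # slide the window by one gene
        means.append(running / window)
    return means

if __name__ == "__main__":
    z1_along_genome = [0.4, 0.6, -0.2, -1.1, -0.9, -1.3, 0.5, 0.7]  # made-up values
    print(windowed_mean(z1_along_genome, window=3))
```

Runs of consecutive windows with a negative mean are the kind of feature that would flag a segment such as an O-island, since a single gene with negative Z1 is averaged out while a cluster of them is not.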
These data and those in Table 2 indicate that a large number of genes were transferred horizontally to E. coli O157 from V. cholerae or closely related species. To test the feasibility of using SOM for clarifying the donor/acceptor relationship in horizontal gene transfer, the O157-specific genes localized to the Vibrio territory on the SOM in Fig. 1b were examined in detail. Among approximately 1000 O157-specific genes, more than 50 genes were found to be present in the V. cholerae territory on the SOM. We then selected 23 genes that have positive Z1 values calculated using the parameters for V. cholerae genes. These genes should have codon usage patterns typical of those found in V. cholerae, that is, patterns that fit well with the translation system of V. cholerae. A BLASTP search showed that seven out of the 23 O157-specific genes have homologous genes in Vibrio genomes (Table 3), suggesting that these seven genes originated from Vibrio or closely related species. This finding that about one-third of the O157-specific genes selected solely with SOM and Z1 parameter analyses have homologs in the Vibrio genomes supports the feasibility of the present strategy. It is possible that the homologs of the remaining

Fig. 4. Z1 distribution and gene classification by similarity of codon usage across the E. coli O157 genome. In the upper part, the average Z1 values for a window of 11 genes are plotted with a step size of one gene. In the lower part, O157 genes classified into individual species-specific territories in the SOM of Fig. 1b are listed by bars in the row with the species name.

Table 3 O157-specific genes located in the V.
cholerae territory in SOM and with positive Z1 values a

                 Homologs in Vibrio or related species
ID      Z1     Species                         Function
Z1538   2.88   Yersinia pestis                 Fimbriae
Z5334   1.56   (none)
Z2099   0.99   Bacteriophage 933W              Shigatoxin 2
Z1442   0.98   Bacteriophage 933W              Antitermination protein N
Z5415   0.97   (none)
Z0414   0.97   (Lactococcus lactis)            Hypothetical protein
Z2152   0.95   Serratia marcescens             DNA damage inducible protein I
Z2083   0.88   Serratia marcescens             DNA damage inducible protein I
Z2239   0.66   V. cholerae                     Outer membrane protein
Z0895   0.63   (Clostridium cochlearium)       Methylaspartate mutase
Z4385   0.53   V. cholerae                     Ferric vibriobactin ABC transporter
Z3616   0.52   Bacteriophage APSE-1            Hypothetical protein
Z5088   0.43   V. cholerae                     Transposase orfAB subunitA
Z2568   0.30   Bacteriophage 933W              Shigatoxin 2
Z3132   0.29   Yersinia enterocolitica         Hypothetical protein
Z2165   0.26   V. cholerae                     NADH oxidase
eae     0.24   Yersinia pseudotuberculosis     Invasin
chuT    0.19   V. cholerae                     Heme transport
Z3159   0.19   V. mimicus                      Heme receptor
Z4383   0.08   Erwinia chrysanthemi            Achromobactin transport system
Z2053   0.04   V. cholerae                     Accessory colonization factor
Z1494   0.01   Bacteriophage 933W              Shigatoxin 2
Z5523   0.01   P. aeruginosa                   MFS transporter

a ID corresponds to the ID number for the E. coli O157 sequence (Perna et al., 2001) registered in GenBank. The function of only two sequences, eae and chuT, was assigned in the database sequence. Protein sequences of Vibrio or related species, including phages, that show significant homology to the O157-specific sequences are listed. Where no homologs were found, this is noted as (none); where homologs were found only in species taxonomically distant from Vibrio, the species name is given in parentheses.

two-thirds of the predicted genes may be found in the Vibrio or related genomes with gene sequencing or molecular biological methods such as Southern blotting hybridization.
In other words, the present methods for analyzing codon usage heterogeneity within a species (Z1 parameter) and between species (SOM classification) can be used together as a powerful tool to assess possible donor genomes for horizontal gene transfer. Furthermore, this strategy may provide a key to predicting horizontally transferred genes that have been lost from the present-day genomes of the donor species or are at least absent in the currently sequenced genomes.

References

Abe, T., Kanaya, S., Kinouchi, M., Kudo, Y., Mori, H., Matsuda, H., Carlos, D.C., Ikemura, T., 1999. Gene classification method based on batch-learning SOM. In: Asai, K., Miyano, S., Takagi, T. (Eds.), Genome Informatics Series No. 10. Universal Academy Press, Tokyo, pp. 314–315.
Andersson, S.G., Kurland, C.G., 1990. Codon preferences in free-living microorganisms. Microbiol. Rev. 54, 198–210.
Andersson, S.G.E., Sharp, P.M., 1996. Codon usage in the Mycobacterium tuberculosis complex. Microbiology 142, 915–925.
Andersson, S.G.E., Zomorodipour, A., Andersson, J.O., Sicheritz-Ponten, T., Alsmark, U.C.M., Podowski, R.M., Naslund, A.K., Eriksson, A., Winkler, H.H., Kurland, C.G., 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133–140.
Aravind, L., Tatusov, R.L., Wolf, Y.I., Walker, D.R., Koonin, E.V., 1998. Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet. 14, 442–444.
Bernardi, G., Bernardi, G., 1985. Codon usage and genome composition. J. Mol. Evol. 22, 363–365.
Blattner, F.R., Plunkett III, G., Bloch, C.A., et al., 1997. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462.
Bult, C.J., White, O., Olsen, G.J., et al., 1996. Complete genome sequence of the methanogenic Archaeon, Methanococcus jannaschii. Science 273, 1058–1073.
Cole, S.T., Brosch, R., Parkhill, J., et al., 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.
Nature 393, 537–544.
Deckert, G., Warren, P.V., Gaasterland, T., et al., 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358.
Dong, H., Nilsson, L., Kurland, C.G., 1996. Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260, 649–663.
Duret, L., 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for the optimal translation of highly expressed genes. Trends Genet. 16, 287–289.
Fleischmann, R.D., Adams, M.D., White, O., et al., 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512.
Fraser, C.M., Casjens, S., Huang, W.M., et al., 1997. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580–586.
Fraser, C.M., Norris, S.J., Weinstock, G.M., et al., 1998. Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281, 375–388.
Glass, J.I., Lefkowitz, E.J., Glass, J.S., Cheryl, R., et al., 2000. The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407, 757–762.
Gouy, M., Gautier, C., 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10, 7055–7074.
Grantham, R., Gautier, C., Gouy, M., Mercier, R., Pave, A., 1980. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8, r49–r62.
Guerdoux-Jamet, P., Henaut, A., Nitschke, P., Risler, J., Danchin, A., 1997. Using codon usage to predict genes origin: is the Escherichia coli outer membrane a patchwork of products from different genomes? DNA Res. 4, 257–265.
Hayashi, T., Makino, K., Ohnishi, M., Kurokawa, K., et al., 2001. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 8, 11–22.
Heidelberg, J.F., Eisen, J.A., Nelson, W.C., Clayton, R.A., et al., 2000.
DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406, 477–483. Ikemura, T., 1981a. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 146, 1–21. Ikemura, T., 1981b. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409. Ikemura, T., 1982. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes: differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J. Mol. Biol. 158, 573–597. Ikemura, T., 1985a. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13–34. Ikemura, T., 1985b. Codon usage, tRNA content, and rate of synonymous substitution. In: Ohta, T., Aoki, K. (Eds.), Population Genetics and Molecular Evolution. Japan Scientific Societies Press, Tokyo, pp. 385–406. Jeltsch, A., Pingoud, A., 1996. Horizontal gene transfer contributes to the wide distribution and evolution of type II restriction-modification systems. J. Mol. Evol. 42, 91–92. Kanaya, S., Kudo, Y., Nakamura, Y., Ikemura, T., 1996a. Detection of genes in Escherichia coli sequences determined by genome projects and prediction of protein production levels, based on multivariate diversity in codon usage. Comput. Appl. Biosci. 12, 213–225. Kanaya, S., Kudo, Y., Suzuki, S., Ikemura, T., 1996b. Systematization of species-specific diversity of genes in codon usage: comparison of the diversity among bacteria and prediction of the protein production levels in cells. In: Akutsu, T. et al. (Ed.), Genome Informatics Series No. 7. Universal Academy Press, Tokyo, pp. 61–71. 
Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T., 1999. Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143–155. Kanaya, S., Yamada, Y., Kinouchi, M., Kudo, Y., Ikemura, T., 2001. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. in press. Kaneko, T., Sato, S., Kotani, H., et al., 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3, 109–136. Kawarabayasi, Y., Sawada, M., Horikawa, H., et al., 1998. Complete sequence and gene organization of the genome of a hyper-thermophilic Archaebacterium, Pyrococcus horikoshii OT3. DNA Res. 5, 55–76. Kawarabayasi, Y., Hino, Y., Horikawa, H., et al., 1999. Complete genome sequence of an aerobic hyper-thermophilic Crenarchaeon, Aeropyrum pernix K1. DNA Res. 6, 83–101. Klenk, H., Clayton, R.A., Tomb, J., et al., 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364–370. Kohonen, T., 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69. Kohonen, T., 1990. The self-organizing map. Proc. IEEE 78, 1464–1480. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J., 1996. Engineering applications of the self-organizing map. Proc. IEEE 84, 1358–1384. Kunisawa, T., 1992. Synonymous codon preferences in bacteriophage T4: a distinctive use of transfer RNAs from T4 and from its host Escherichia coli. J. Theor. Biol. 159, 287–298. Kunisawa, K., Kanaya, S., Kutter, E., 1998. Comparison of synonymous codon distribution patterns of bacteriophage and host genomes. DNA Res. 
5, 319–326. Kunst, F., Ogasawara, N., Moszer, I., et al., 1997. The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 390, 249–256. Lawrence, J.G., Ochman, H., 1997. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44, 383–397. Lawrence, J.G., Ochman, H., 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95, 9413–9417. Medigue, C., Rouxel, T., Vigier, P., Henaut, A., Danchin, A., 1991. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222, 851–856. Muto, A., Osawa, S., 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84, 166– 169. Nakayama, K., Kanaya, S., Ohnishi, M., Terawaki, Y., Hayashi, T., 1999. The complete nucleotide sequence of fCTX, a cytotoxin-converting phage of Pseudomonas aeruginosa: implications for phage evolution and horizontal gene transfer via bacteriophages. Mol. Microbiol. 31, 399–419. Nakayama, K., Takashima, K., Ishihara, H., Shinomiya, T., Kageyama, M., Kanaya, S., Ohnishi, M., Murata, T., Mori, H., Hayashi, T., 2000. The R-type pyocin of Pseudomonas aeruginosa is related to P2 phage, and the F-type is related to lambda phage. Mol. Microbiol. 38, 213–231. Nelson, K.E., Clayton, R.A., Gill, S.R., Gwinn, M., 1999. Evidence for lateral gene transfer between Archaea and bacteria from genome. Sequence of Thermotoga maritima. Nature 399, 323–329. Ng, W.V., Kennedy, S.P., Mahairas, G.G., Berquistc, B., Pan, M., et al., 2000. From the cover genetics genome sequence of Halobacterium species NRC-1. Proc. Natl. Acad. Sci. USA 97, 12176–12181. Ochman, H., Lawrence, J.G., Groisman, E.A., 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304. Osawa, S., 1995. Evolution of the Genetic Code. Oxford University Press, Oxford. Parkhill, J., Wren, B.W., Mungall, K., Ketley, J.M., et al., 2000. 
The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 403, 665–668. Percudani, R., Pavesi, A., Ottonello, S., 1997. Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J. Mol. Biol. 268, 322–330. Perna, N.T., Plunkett III, G., Burland, V., Mau, B., et al., 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157: H7. Nature 409, 529–533. Pouwels, P.H., Leunissen, J.A.M., 1994. Divergence in codon usage of Lactobacillus species. Nucleic Acids Res. 22, 929–936. Sharp, P.M., Matassi, G., 1994. Codon usage and genome evolution. Curr. Opin. Genet. Dev. 4, 851–860. Shigenobu, S., Watanabe, H., Hattori, M., Sakaki, Y., Ishikawa, H., 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407, 81–86. Simpson, J.G., Reinach, F.C., Arruda, P., Abreu, A., Acenicio, M., 2000. The genome sequence of the plant pathogen Xylella fastidiosa. Nature 406, 151–157. S. Kanaya et al. / Gene 276 (2001) 89–99 Smith, D.R., Douchette-Stamm, L.A., Deloughery, C., et al., 1997. Complete genome sequence of Methanobacterium thermoautotrophicum DH: functional analysis and comparative genomics. J. Bacteriol. 179, 7135–7155. Stephens, R.S., Kalman, S., Lammel, C., et al., 1998. Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282, 754–759. Stover, C.K., Pham, X.Q., Erwin, A.L., Mizoguchi, S.D., et al., 2000. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406, 959–964. Su, C., Brandt, L.J., 1995. Escherichia coli O157:H7 infection in humans. Ann. Intern. Med. 123, 698–714. Sueoka, N., 1992. Directional mutation pressure, selective constraints, and genetic equilibria. J. Mol. Evol. 3, 95–114. Takami, H., Nakasone, K., Takaki, Y., Maeno, G., et al., 2000. 
Complete genome sequence of the alkaliphilic bacterium Bacillus halodurans and 99 genomic sequence comparison with Bacillus subtilis. Nucleic Acids Res. 28, 4317–4331. Tettelin, H., Saunders, N.J., Heidelberg, J., Jeffries, A.C., et al., 2000. Complete genome sequence of Neisseria meningitidis Serogroup B strain MC58. Science 287, 1809–1815. Tomb, J., White, O., Kerlavage, A.R., et al., 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539– 547. von der Malsburg, C., 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85–100. White, O., Eisen, J.A., Heidelberg, J.F., Hickey, E.K., et al., 1999. Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 286, 1571–1577. Wolf, Y.I., Aravind, L., Koonin, E.V., 1999. Rickettsiae and Chlamydiae evidence of horizontal gene transfer and gene exchange. Trends Genet. 15, 173–175.