Promoter Elements in Drosophila melanogaster
Copyright © 1995 by the Genetics Society of America

Promoter Elements in Drosophila melanogaster Revealed by Sequence Analysis

Irina R. Arkhipova

Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138

Manuscript received October 7, 1994
Accepted for publication December 10, 1994

ABSTRACT

A Drosophila Promoter Database containing 252 independent Drosophila melanogaster promoter entries has been compiled. The database and its subsets have been searched for overrepresented sequences. The analysis reveals that the proximal promoter region displays the most dramatic nucleotide sequence irregularities and exhibits a tripartite structure, consisting of TATA at -25/-30 bp, initiator (Inr) at ±5 bp and a novel class of downstream elements at +20/+30 bp from the RNA start site. These latter elements are also strand-specific. However, they differ from TATA and Inr in several aspects: (1) they are represented not by a single, but by multiple sequences, (2) they are shorter, (3) their position is less strictly fixed with respect to the RNA start site, (4) they emerge as a characteristic feature of Drosophila promoters and (5) some of them are strongly overrepresented in the TATA-less, but not TATA-containing, subset. About one-half of known Drosophila promoters can be classified as TATA-less. The overall sequence organization of the promoter region is characterized by an extended region with an increase in GC content and a decrease in A, which contains a number of binding sites for Drosophila transcription factors.

EUKARYOTIC promoters have long been an object of intensive study. The fundamental processes that result in spatially and temporally regulated patterns of gene expression are gradually being uncovered.
Much progress has been achieved in identification and functional characterization of trans-acting protein factors, both basal and regulatory, which can act coordinately to provide proper levels of expression for every gene in a given cell at a given time (reviewed in MCKNIGHT and YAMAMOTO 1992; CONAWAY and CONAWAY 1993; ZAWEL and REINBERG 1993; BURATOWSKI 1994; TJIAN and MANIATIS 1994). There are, however, numerous unanswered questions and problems that remain controversial. Substantial progress has been achieved in identification of the trans-acting factors that carry out the transcription process. Characterization of the cis-acting promoter sequences, which contribute to the proper binding of the trans-acting factors or otherwise participate in transcription, has attracted less attention in recent years. Some of these questions can be answered, at least in part, by analyzing the nucleotide sequences in large promoter data sets aligned with respect to the RNA start site. The sequences that can be found with a high degree of probability at a particular position in such a data set (or its subsets) represent naturally occurring promoter elements selected by evolution to perform different position-specific functions.

Address for correspondence: Department of Molecular and Cellular Biology, Harvard University, 7 Divinity Ave., Cambridge, MA 02138-2092. E-mail: [email protected]
Genetics 139: 1359-1369 (March, 1995)

Previous studies of this kind dealt with eukaryotic promoters in general (BUCHER and TRIFONOV 1986; BUCHER 1990). The data sets used were divided only into invertebrate and vertebrate subsets to keep them large enough for statistical analysis. In these diverse data sets, those overrepresented elements that are most conserved between species, such as the TATA box and the cap-site consensus or initiator (Inr), could be identified unambiguously, and the CAAT and GC boxes were found at elevated frequencies in the upstream regions of vertebrate promoters.
Analysis of promoter data sets from a single species is attractive because some promoter elements, and the corresponding protein factors, may not in fact be as conserved between species as the TATA box and the Inr. The TATA box is the best characterized basal eukaryotic promoter element. It is located at -25/-30 bp from the RNA start site and is recognized by the TATA-binding protein (TBP), which induces DNA bending and participates in transcription by all three RNA polymerases (reviewed in HERNANDEZ 1993). Nevertheless, a significant proportion of promoters does not possess any TATA box, and although TBP was shown to participate in transcription from some TATA-less promoters in vitro (PUGH and TJIAN 1991; ZHOU et al. 1992), the basis of its interaction with DNA in this case is not clear and is thought to be assisted by TBP-associated factors (TAFs) that form a multiprotein complex with TBP. The second well-known element is the cap-site sequence or initiator (Inr) (SMALE and BALTIMORE 1989; reviewed in WEIS and REINBERG 1992; GILL 1994; SMALE 1994), which is present in the immediate vicinity of the RNA start site. Its importance in transcription, as well as its functional distinction from the TATA box in some aspects of transcriptional regulation, have been experimentally demonstrated (JARRELL and MESELSON 1991; MACK et al. 1993; USHEVA and SHENK 1994). In Drosophila, the cap-site element was initially noticed in a number of different promoters (SNYDER et al. 1982; ARKHIPOVA et al. 1986; CHERBAS et al. 1986; HULTMARK et al. 1986), and the TCAGT pentamer (and its cognates A /,CAGT and TCATT) was recently identified in a statistical study of the -25/+25 region of 112 arthropod promoters (CHERBAS and CHERBAS 1993). The RNA start site homologies are much less pronounced in vertebrates (BUCHER 1990) and all but disappear in humans (PENOTTI 1990).
Deletion and mutation studies of several Drosophila promoters indicate the existence of a novel third class of proximal promoter elements that are located 20-30 bp downstream from the RNA start site. Elements of this kind have been found in many Drosophila retrotransposons (ARKHIPOVA and ILYIN 1991; JARRELL and MESELSON 1991; MINCHIOTTI and DI NOCERA 1991; MIZROKHI and MAZO 1990; MCLEAN et al. 1993) and, interestingly, in a number of developmentally regulated genes that do not have TATA boxes (BIGGIN and TJIAN 1988; PERKINS et al. 1988; SOELLER et al. 1988; THUMMEL 1989). Elements in this region were shown to be essential for transcription in vivo and/or in vitro and in some cases associated with binding of nuclear proteins. Limited sequence homologies have been noticed between the downstream regions of these promoters (ARKHIPOVA and ILYIN 1991; MINCHIOTTI and DI NOCERA 1991; MCLEAN et al. 1993), but they were relatively short and somewhat variable in location; at least two types of downstream elements could be distinguished with no similarity whatsoever. Thus, the sequence identity of these elements remained questionable. It was also unclear whether they can be identified in a minority or in a substantial fraction of Drosophila promoters. Therefore, it was of interest to find out how widespread the downstream elements are, what their consensus sequence(s) is and whether they represent an essential component of RNA polymerase II (pol II) promoters in Drosophila. This could be achieved by a statistical analysis of promoter sequences. To this end, I have compiled and analyzed a Drosophila Promoter Database (DPD). It currently consists of 252 independent D. melanogaster entries and is the largest database of those available for a single species.
The analysis presented here reveals a tripartite structure of Drosophila promoters, extends the consensus sequences for the Drosophila TATA and Inr elements, shows that various types of specifically localized downstream elements can be found in a significant fraction of Drosophila promoters and demonstrates that some of the downstream elements are characteristic for the TATA-less subset. Overall, there is a significant increase in GC content toward the RNA start site. Neither the GGGCGG nor the CAAT motifs are overrepresented at any position in Drosophila promoters.

MATERIALS AND METHODS

Database: For initial analysis, 85 independent D. melanogaster entries were extracted from the Eukaryotic Promoter Database (EPD) release 34 (BUCHER 1993), which exists as part of the EMBL database and is also available at the Johns Hopkins University Gopher server. (Inspection of release 39 as of June 1994 showed that only four Drosophila entries were added.) Sequence analysis was limited to 100 bp downstream and 500 bp upstream from the RNA start site to minimize the influence of the coding regions and to retain ≥75% of promoter sequences in the most 5' extreme part (close to -500) of the data set. In some cases, the mouse and human subsets of the same EPD release, containing 216 and 148 independent entries, respectively, were also analyzed. To make the database more representative, it has been expanded approximately threefold (to 252 entries) by additional D. melanogaster sequences from GenBank release 78, for which the information about the location of the RNA start site was available. FlyBase (1993) was used to obtain specific information about individual genes. The criteria for including an entry in the database were essentially the same as in the EPD (BUCHER 1993), and the precision of the RNA start site location was estimated to be ±1-3 bp in most promoters.
Each entry contained at least 100 bp of upstream sequence. Retrotransposon promoters, reviewed in ARKHIPOVA and ILYIN (1992), were excluded from the present analysis, as they would have created a strong initial bias toward promoters with downstream elements. Increasing the number of entries has made it possible to divide the database into two subsets, in accordance with the presence or absence of the TATA box. The promoters having at least part of an AT-rich sequence (≥4 bp) falling within the -25/-30 interval were regarded as TATA-containing. Such a subdivision is not unambiguous, since in many cases an AT-rich sequence may deviate to a significant extent from the consensus TATAAA, and an experimental proof of its functional significance is generally lacking. Nevertheless, a comparative analysis of such subsets can at least reveal biases in occurrence of particular sequence elements. Overall, 129 promoters were regarded as TATA-containing and 123 as TATA-less.

Methods: The sequences were aligned with respect to the RNA start site. The entire database, as well as its subsets, has been extensively searched for overrepresented sequences. As a rule, only the sense strand was analyzed to reveal any strand-specificity. The most conserved elements, like the TATA box and Inr, are readily detected by programs that identify the most frequently occurring sequence elements in an aligned data set (WATERMAN and JONES 1990). However, no sequence irregularities were detected in the downstream region by these methods. Therefore, in most cases I used the analysis of positional distribution of 4^n individual n-mers ("words," where 1 ≤ n ≤ 5), which might be called "word profiling." With a few exceptions for specific words (MOUNT et al. 1992; CHERBAS and CHERBAS 1993), it has not been routinely used. This approach seems most attractive in cases where multiple, not single, types of elements can be expected to reside in a particular region.
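The word-profiling idea described above can be illustrated with a short sketch. It is not the GCG-based procedure used in this study: the function name `word_profile`, the `(sequence, tss_index)` input format and the toy sequences are assumptions introduced only for illustration. Each sense-strand sequence is scanned for a given word, the position of the first base of every occurrence is expressed relative to the RNA start site (position 0), and absolute counts are accumulated in bins of fixed width.

```python
from collections import Counter


def word_profile(promoters, word, bin_size=5):
    """Count occurrences of `word` per positional bin on the sense strand.

    `promoters` is a list of (sequence, tss_index) pairs, where tss_index
    is the 0-based index of the RNA start site within the sequence
    (a hypothetical input format chosen for this sketch).
    Returns {bin_start: count}, with bin_start relative to the start site.
    """
    profile = Counter()
    k = len(word)
    for seq, tss in promoters:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] == word:
                position = i - tss  # first base of the word, start site = 0
                bin_start = (position // bin_size) * bin_size
                profile[bin_start] += 1
    return dict(sorted(profile.items()))


# Toy, made-up sequences: a TATA word near -30 and TCAGT at the start site.
toy = [
    ("G" * 7 + "TATAAA" + "G" * 22 + "TCAGT", 37),
    ("G" * 9 + "TATAAA" + "G" * 20 + "TCAGT", 37),
]
print(word_profile(toy, "TATA"))  # {-30: 2}
```

Overlapping occurrences are counted separately, and a mislocated start site in one entry only shifts that entry's counts into neighbouring bins, which is why such entries add to the background rather than to a peak.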
Any entries with improperly localized RNA start sites should not greatly influence the results as they will not contribute to the peak, only to the background noise. The data sets were analyzed with the aid of the GCG sequence analysis software package (Genetics Computer Group 1991) running on a SUN Sparcstation. The occurrence frequencies for all 340 possible 1-4mers and for selected 5mers were determined for every position and plotted in bins of 2 or 5 bp against their position with respect to the RNA start site. Gapped n-mers were not included in this study. The frequencies were plotted in absolute numbers rather than percentages or fractions. A slight underestimation of the most 5' extreme words resulting from reduction of the local sample size did not influence the conclusions. The profiles for each word were visually examined, and those for n > 2 were also subjected to statistical analysis. These latter could be divided into two groups: those for which the maximal occurrence frequency differed >4 SEs from the average occurrence frequency were considered "interesting," and the others, which displayed a fairly uniform positional distribution and constituted the majority (usually >3/4 of words), were regarded as "uninteresting." This criterion, although rather arbitrary, was nevertheless suitable as an empirical cutoff value for identifying localized elements of possible functional significance (see RESULTS). Similarly positioned interesting words were grouped together, and from their comparison the consensus sequences could be deduced. Some profiles are given not for the entire -500/+100 interval, but only for the regions displaying sequence nonrandomness. Profiles for words and/or intervals not presented in this paper are available upon request. Evaluation of the information content was performed by the exact method (SCHNEIDER et al. 1986).
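The empirical cutoff for "interesting" words described in this section can be sketched as follows. The text does not specify how the SE was computed, so this sketch assumes, purely for illustration, that it is the sample standard deviation of a word's binned counts along the promoter region; the function name `is_interesting` and the example counts are likewise hypothetical.

```python
from statistics import mean, stdev


def is_interesting(bin_counts, threshold=4.0):
    """Flag a word whose peak bin stands out from its own background.

    ASSUMPTION: "SE" is taken here as the sample standard deviation of
    the binned counts; the source does not define it more precisely.
    """
    if len(bin_counts) < 2:
        return False  # a spread cannot be estimated from a single bin
    average = mean(bin_counts)
    spread = stdev(bin_counts)
    if spread == 0:
        return False  # perfectly uniform profile
    return (max(bin_counts) - average) / spread > threshold


# Made-up profiles: one sharp peak versus a nearly uniform distribution.
peaked = [5] * 99 + [50]
flat = [5, 6] * 50
print(is_interesting(peaked), is_interesting(flat))  # True False
```

Because each word is compared only with its own positional background, a rare word and a common word are judged on the same relative scale.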
The information content was calculated using the multinomial distribution formula; the numbers of occurrences at each position were added in bins of 5, and the overall occurrence frequencies were taken as observed in the entire promoter data set.

RESULTS

Analysis of the D. melanogaster promoter data set reveals several regions with highly nonrandom nucleotide sequence distribution, some of which are well known and some are not. It should be noted that all the interesting words fall into several specifically positioned groups, as described below. The highest concentration of locally overrepresented words is observed in the proximal promoter region.

Complex structure of the proximal promoter region: The region in the vicinity of the RNA start site (± a few dozen bp), which is usually the site of interaction with basal transcription machinery (RNA pol II and associated factors), differs significantly from the rest of the sequence. The TATA box and to some extent the Inr regions can be noted even at the mononucleotide level (Figure 1). The doublet analysis appears more informative (Figure 2): for many of the doublets, this region is characterized by several sharp drops and/or rises in sequence composition. The main points of sequence irregularity are located in the intervals -30/-20, -5/+5 and +20/+35. The pattern for doublets not shown in Figure 2 displays much less sequence nonrandomness and is closer to uniformity; TC and AG also have a visible peak at the RNA start site. The picture is different from that for mammalian promoters, which do not seem to possess a specifically localized downstream region of highly nonrandom sequence composition (not shown). Thus, locally overrepresented downstream elements detectable by this analysis are a characteristic feature of Drosophila promoters.

The -25/-30 region: About one-half of Drosophila promoters do not contain a recognizable TATA box at the appropriate location.
In the TATA-containing subset, the TATA box produces a dominating peak and is strictly strand-specific (Figure 1, A; Figure 2, TA, AT, AA; Figure 3, TATA, ATAA). Comparison of overlapping interesting triplets (TAT, ATA, TAA, AAA, GTA and AAG) and tetramers (TATA, ATAA, TAAA and AAAA and, to a lesser extent, ATAT, GTAT, CTAT, AAAG, AAGC and AAGG) yields a consensus sequence */G/,TATAAAG/.,,”/, . No locally overrepresented words can be detected at this position in the TATA-less subset. Therefore, there are no specific TATA box substitutes that are characteristic for this region.

The RNA start site: A typical initiator element can be found in approximately one-third of Drosophila promoters. Examples of overrepresentation at the RNA start site are shown in Figure 3 (TCAG, TCAT). An analogous picture is observed for CAGT, CATT, ATCA, GTCA, TTCA, AGTT, AGTC and GTTG, listed in the order of descending frequency at the RNA start site (not shown). Thus, the consensus strand-specific pentamer reported by CHERBAS and CHERBAS (1993) can be extended to (T/A/G)TCA(G/T)T(T/C)G. As in the above study, no obvious correlation is found between the presence of the Inr consensus and the TATA box. The RNA start site in TATA-less promoters tends to be enriched in T residues, most frequently organized in short runs of T (Figure 1, T; Figure 2, TT).

Downstream elements: The third strand-specific region of sequence heterogeneity, which is located downstream from the RNA start site and has not been identified in previous statistical studies, differs from the previous two in several aspects. In contrast to the TATA box and Inr, it does not contain a single predominant sequence element, but several types of elements from which a single consensus cannot be deduced. The location of the downstream elements is less strictly fixed with respect to the RNA start site, with the major peaks appearing in the interval +20/+35.
Local overrepresentation of specific words is best seen at the level of 3- and 4mer analysis. The interesting downstream triplets are ACA, AAC, TCG and GTG (not shown). Most of the interesting downstream tetramers are shown in Figure 4. The most prominent type is represented by AACA, ACAA or ACAG. The second type most frequently occurs as TCGA, and its preferred location is slightly closer to the RNA start site (around the position +20). Both ACGT and ACGC can also be classified as interesting.

FIGURE 1. Distribution of individual bases along the promoter region. The number of occurrences on the sense strand is plotted in bins of 5 against their position with respect to the RNA start site (position 0). +, TATA-containing promoters; ○, TATA-less promoters. Although not all of the profiles display notable differences between subsets, plotting the data for both subsets in a single graph makes the common and specific features more evident.

The CGTG tetramer differs from the previous ones in its distribution between two subsets: its overrepresentation at +25/+30 is clearly biased toward the TATA-less subset (Figure 5). ACGY and CTCG display the same bias in distribution between subsets (Figure 5). In individual promoters, one type of downstream element may sometimes be repeated or occur in combination with other types. No obvious regularity in spacing was found between various downstream elements and the Inr, the RNA start site or the TATA box; the imprecision of the RNA start site determination could have partially obscured this. For some words, the distribution appears to be bimodal.
The bimodality is also observed in the information content profiles (see below). The shortness and multiplicity of the downstream elements make it difficult to estimate the percentage of promoters containing such elements, and their functional significance may differ in individual promoters. Judging by the degree of overrepresentation, a rough estimate can be made that more than one-half of Drosophila promoters contain them.

Long-range promoter organization: A wide area in the vicinity of the RNA start site (from -150 to +50 bp), which may be called a "GC-hill," displays a significant increase in GC content (and a corresponding decrease in A, mainly AA, content), represented by dinucleotides CG and GC (and to a lesser extent by CC and GG) (Figures 1 and 2). This pattern exists for both TATA+ and TATA- subsets. Mammalian promoters also exhibit a gradual increase in GC frequency toward the RNA start site, but in them the CC and GG doublets make a significant contribution as well (not shown). In Drosophila promoters, there is no obvious underrepresentation of CG doublets compared with GC (in agreement with data for genomic sequences, ASHBURNER 1989). Again, this is in contrast to mammalian promoters, which contain approximately twice as much GC as CG (not shown). This difference is probably connected with the lack of cytosine methylation in Drosophila.

Transcription factor binding sites: Of the remaining interesting words that are not located in the basal promoter region, the recognition sequence for the Drosophila-specific GAGA factor (BIGGIN and TJIAN 1988) is the most notable one. Its concentration is significantly increased on both strands in the wide area roughly corresponding to the GC-mountain, with the sites of local overrepresentation in the -SO/-120 region (Figure 6).
The binding sites for the transcription factor zeste, which have been frequently found in the vicinity of Drosophila promoters (BENSON and PIRROTTA 1988), are somewhat similar to GAGA in their distribution pattern, although less abundant (not shown). The CAAT motif, which is prominent in the -80/-100 region of mammalian promoters (BUCHER 1990), is not overrepresented at any position in Drosophila promoters (not shown).

FIGURE 2. Selected doublet profiles. The data are presented as in Figure 1.

FIGURE 3. TATA-box and initiator (Inr) elements. Distribution profiles of selected tetramers are given for the entire database. Panels: TATA, ATAA, TCAG, TCAT.

(Tetramer distribution profile panels: GTGY, YGTG, ACGY.)
The majority of GC-containing words can be found as locally overrepresented in the proximal region of the GC-hill, displaying an overall increase in the occurrence frequencies toward the RNA start site (excluding the drops at the main points of sequence heterogeneity in the proximal promoter region described above, see Figure 1). In mammals, a strong contributor to this increase is the recognition sequence for the transcription factor Sp1 (GGGCGG; in both orientations, see BUCHER 1990). However, this sequence occurs very rarely in the entire Drosophila promoter data set and is not locally overrepresented (not shown).

Base composition: Information regarding base composition of the -500/+100 interval of Drosophila promoters is given in Table 1. The overall sequence composition of D. melanogaster promoters is 41.4% GC, compared with 50.6% GC in mice and 55.4% GC in humans. While G and C are approximately equal, A slightly predominates over T. This bias is to a large extent introduced by the sequence TATAAA (Figure 1, A). The A > T composition bias is not observed for the TATA-less subset, which, on the contrary, has an increase in T content at the RNA start site (Figure 1, T). While the overall promoter GC content is close to that of the main band D. melanogaster DNA (43%, ASHBURNER 1989), it is unevenly distributed along the promoter region. The nonuniformity of promoter base composition is particularly evident in the proximal region (a sharp local rise in A and a drop in G and C against the broad A decrease and G + C increase, with more or less uniform overall T).

Information content analysis: A conventional information content analysis that quantifies the entropy, or uncertainty, reduction at each position and reflects the degree of deviation from randomness (SCHNEIDER et al. 1986) presents certain difficulties.
To estimate the information content for every position of each promoter element, a gapped alignment is needed because of variable spacing between separate promoter elements. The downstream elements, however, are especially difficult to align due to their multiplicity, shortness and variable location.

FIGURE 5. Downstream elements characteristic for TATA-less promoters. (A) TATA-less promoters; (B) TATA-containing promoters. Panels: CGTG, ACGY, CTCG.

A qualitative, rather than quantitative, overall information content profile along the promoter region can be obtained without gapped alignment by adding the frequencies for each position in bins of 5, as in the word profiling analysis. At the singlet level such analysis is not extremely informative and reveals mainly the TATA box and the Inr, with the downstream elements being less visible (not shown; see Figure 1). Analysis of the doublet information content (BERG and VON HIPPEL 1987) is more promising. Figure 7 represents the profile of information content distribution along the promoter region. In addition to the TATA box and Inr at -20/-30 and 0/+5 against an overall more or less random background, a prominent maximum appears in the +20/+30 region, with the intensity comparable to that of the TATA box and Inr.
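A binned information content profile of the kind described above can be approximated with the following sketch. It is a simplification and not the exact small-sample method of SCHNEIDER et al. (1986) used in this study: it computes the relative entropy, in bits, between the word frequencies observed in one bin and the overall frequencies in the whole data set, with no finite-sample correction. The function name `bin_information` and the example counts are assumptions made for illustration.

```python
from math import log2


def bin_information(bin_counts, background_counts):
    """Relative entropy (bits) of word usage in one bin vs. the background.

    `bin_counts` and `background_counts` map each word (a base or a
    doublet) to its number of occurrences in the bin and in the entire
    data set. SIMPLIFICATION: no small-sample correction is applied.
    """
    n_bin = sum(bin_counts.values())
    n_total = sum(background_counts.values())
    if n_bin == 0 or n_total == 0:
        return 0.0
    info = 0.0
    for word, count in bin_counts.items():
        if count == 0:
            continue  # a word absent from the bin contributes nothing
        p = count / n_bin                          # frequency in the bin
        q = background_counts[word] / n_total      # overall frequency
        info += p * log2(p / q)
    return info


# Made-up singlet counts: a bin containing only A against a uniform background.
background = {"A": 25, "C": 25, "G": 25, "T": 25}
print(bin_information({"A": 10}, background))  # 2.0
```

A bin whose composition matches the background scores 0 bits, so peaks in such a profile mark positions where word usage departs from the data set as a whole. Every word counted in a bin must also be present in the background counts.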
Five possible profiles, which differed slightly in relative intensities of each element, were obtained by shifting the 5-bp bins by 1 bp (not shown); in three of them, the upstream (TATA) and downstream (+20/+30) elements appear as bimodal (with a 10-bp interval), and in the remaining two the bimodality is not resolved, creating a tripartite structure.

FIGURE 6. Distribution profiles for GAGA binding sites. (A) TATA-less promoters; (B) TATA-containing promoters. Panels: GAGA, TCTC.

DISCUSSION

This paper presents an overview of nucleotide sequence organization of a large set of promoters from a single species (D. melanogaster) and reveals a number of interesting features, both common and specific compared with other species. It should be emphasized that this kind of analysis identifies those elements that are present in a significant fraction of promoters, not necessarily in all of them, and individual promoters may vary substantially in their properties. Although conservation is indicative of function, the functional significance of such elements should be established experimentally in any particular case. The biological relevance of the approach used is that DNA-protein recognition can be strongly influenced by neighboring bases and often depends on relatively short words that can exhibit local overrepresentation against their own background but may not be readily detectable. Indeed, among the majority of uniformly distributed words, a few that are interesting can easily be distinguished, and these are concentrated in several specific regions. Moreover, some of the words exhibit local overrepresentation only in the TATA-less subset and are uniformly distributed in the TATA-containing subset.
An overall impression from analysis of the nucleotide sequences is that most promoters are composed of multiple sequence elements acting together to achieve proper levels of transcription. The strict strand-specificity of the three major proximal promoter elements in Drosophila (TATA, Inr and the downstream elements) implies that together with bound proteins they participate in guiding the RNA polymerase to transcribe in a proper direction. The TATA box, despite its indisputable importance and representation in all eukaryotic organisms, is absent from a significant fraction of Drosophila promoters (about one-half). Estimates of this proportion have changed with time, since most of the D. melanogaster entries from the EPD subset, for historical reasons, represented promoters of structural and strongly inducible genes that, as a rule, contained good TATA boxes. Promoters of regulatory genes, on the contrary, less often possess good TATA boxes, and the shift of interests of researchers has resulted in a change of database composition. Thus, the subdivision of promoters into TATA-containing and TATA-less ones discriminates to some extent between structural and developmental genes, although there are a lot of exceptions.

TABLE 1
Base composition of the D. melanogaster promoter data set and its subsets as compared to that of mammalian promoters and the -25/+25 interval of arthropod promoters

Source                    No. of sequences   Total length   %A     %T     %G     %C     Reference
D. melanogaster           248                130,906        29.8   28.8   20.7   20.7   This study
D. melanogaster TATA+     126                68,779         30.3   28.5   20.6   20.6   This study
D. melanogaster TATA-     122                62,127         29.2   29.2   20.8   20.8   This study
D. melanogaster -25/+25   112                5,600          29.0   25.0   23.0   23.0   CHERBAS and CHERBAS (1993)
M. musculus               148                88,800         25.5   23.9   25.4   25.2   This study
Homo sapiens              216                110,716        22.7   21.9   27.5   27.9   This study

FIGURE 7. Information content (I) of the D. melanogaster promoter data set. The average value (in bits per position) was calculated for the positions -499/-495, -494/-490, ..., +95/+100 and plotted against the respective positions (horizontal axis: position, bp).

The Inr is represented by a single type of element in Drosophila and a number of other arthropods that have been examined (CHERBAS and CHERBAS 1993; this study). However, there are several types of Inr elements in mammalian promoters, with several corresponding factors (reviewed in WEIS and REINBERG 1992; SMALE 1994), and it is probably even more variable in humans than in other mammals (PENOTTI 1990). Experiments with a randomized Inr region yielded the G/A/TT/CAG/TTG sequence for Drosophila and a loose YYAN(T/A)YY consensus for mammals (JAVAHERY et al. 1994; PURNELL et al. 1994). The downstream elements, which in this study have for the first time emerged at the nucleotide sequence level as integral components of Drosophila promoters, seem to be more diverse than other promoter elements.

Some interesting parallels can be drawn between yeast and Drosophila. A survey of mononucleotide composition for 95 yeast promoters (MAICAS and FRIESEN 1990) revealed a constant level of G and C throughout the region -100/+50, while there is a transition from the T peak centered at -20 to the A peak centered at 0. This transition was named "the locator," as it was found to influence the location of RNA start sites. The doublet composition has not been reported. Visual inspection of the promoters listed in MAICAS and FRIESEN (1990) reveals that the T-rich peak is mainly created by numerous sequences resembling the Inr, such as TTATT, TCTTT or TCATT; the A-rich peak is largely composed of words like AAAC, AACA, ACAA, AAAG etc., which are listed above among the downstream elements of Drosophila.
This raises an interesting possibility that the RNA start site in Saccharomyces cerevisiae may actually correspond in sequence requirements to some downstream elements in Drosophila. If so, the terms "initiator" and "downstream element" may refer to the same element in different organisms. This is not totally unexpected, given the well-known far upstream (-40/-120) location of the yeast TATA box (STRUHL 1987) and the recent finding that the essential TSM-1 gene, a yeast TAFII150 analogue, is able to bind the promoter DNA sequence-specifically (VERRIJZER et al. 1994).

Unlike Drosophila promoters, the mammalian subsets exhibit little if any specifically localized overrepresentation in the downstream region at the triplet level. The distribution of tetramers is more nonrandom, but there is greater variety in the number and position of overrepresented words than in Drosophila (not shown). Some sequence nonrandomness in the +30 region can be observed at the doublet level. The TATA box dominates overwhelmingly in mammalian promoters, with some additional sequence heterogeneity further downstream. Given the complexity of mammalian genomes, the current promoter database for these species is not large or diverse enough to reveal downstream elements with a degree of overrepresentation similar to that observed in Drosophila. Such elements are either absent from a significant fraction of mammalian promoters, or their multiplicity and degree of scattering are greater than in Drosophila. Their existence has been experimentally demonstrated for several mammalian genes (reviewed in ARKHIPOVA and ILYIN 1992). The availability of more representative databases, which is expected to result from genome sequencing projects in the near future, will resolve this issue.

The downstream region of Drosophila promoters has attracted a great deal of attention in recent years.
As mentioned in the Introduction, it has been described as a transcriptionally important element both in vitro and in vivo, a binding site for nuclear factors (BIGGIN and TJIAN 1988; PERKINS et al. 1988; ARKHIPOVA and ILYIN 1991) and a site of RNA pol II pausing during transcription (LEE et al. 1992; RASMUSSEN and LIS 1993). Experiments involving mutagenesis of the downstream region were somewhat contradictory: 3'- or internal deletions or substitutions in most cases, but not in all, resulted in reduction of the promoter strength or complete inactivation of the promoter (PERKINS et al. 1988; SOELLER et al. 1988; MIZROKHI and MAZO 1990; ARKHIPOVA and ILYIN 1991; JARRELL and MESELSON 1991; FRIDELL and SEARLES 1992; CONTURSI et al. 1993; MCLEAN et al. 1993). Diverse results led to differing conclusions regarding the importance of the downstream region, ranging from total unimportance to absolute dependence on downstream elements. The truth most probably resides somewhere in between: promoters may differ with respect to the importance of their downstream elements, and the contribution from other elements should play an important role. The TATA box is able to act cooperatively with the Inr (O'SHEA-GREENFIELD and SMALE 1992) or the downstream elements (FRIDELL and SEARLES 1992). Due to the multiplicity and shortness of these elements, deletions and substitutions could often result in replacement or creation of a novel element instead of disruption of the old one.

Sequence-specific downstream (up to +40 bp) contacts of a 150-kDa component of the Drosophila TFIID complex were reported for hsp70, hsp26 and histone H4 promoters (PURNELL and GILMOUR 1993; PURNELL et al. 1994; SYPES and GILMOUR 1994) and, recently, for a heterologous AdML promoter using cloned and purified Drosophila TAFII150 (VERRIJZER et al. 1994). However, the identity of sequences involved in specific binding remained obscure.
The downstream elements identified here are likely to represent such sequences, and this study can provide clues to their functional identification in different Drosophila promoters.

An interesting question is whether multiple downstream elements located at similar positions are recognized by a single protein or by various proteins. The latter possibility seems intriguing in light of the existing differences between the TATA-containing and TATA-less subsets (Figure 5). The CGTG element, which is strongly overrepresented in the TATA-less subset, is also predominant in LINE-like retrotransposons of Drosophila (MINCHIOTTI and DI NOCERA 1991; MCLEAN et al. 1993). These retrotransposons have a completely internal pol II promoter located in the vicinity of the RNA start site, with no upstream sequences of their own, since the first nucleotide of the element should at the same time be the first transcribed nucleotide (MIZROKHI et al. 1988). It is reasonable that promoters that cannot have a TATA box by definition are able to compensate for its absence by the downstream elements. Transcription of the human LINE-1 retrotransposon also depends on the downstream elements (SWERGOLD 1990; MINAKAMI et al. 1992).

The emergence of various downstream elements as widespread components of Drosophila promoter regions raises intriguing questions with regard to their significance. For instance, their multiplicity could be invoked as a possible means of generating basal promoter variability, in combination with the presence or absence of other promoter elements. In this respect, it is worth noting that alternatively expressed promoters of the same gene (A n y, Adh, hb etc.) often have different downstream elements and sometimes also differ with respect to the presence of TATA and Inr.

Although sequence irregularities in the more remote promoter regions are less pronounced than in the proximal region, they nevertheless exist.
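Positional irregularities of this kind can be examined by tabulating, for a chosen word, how many promoters carry it in each positional bin. The sketch below is a minimal illustration under stated assumptions: sequences are aligned so that index 0 corresponds to position `first_pos`, coordinates are treated as consecutive integers, and each promoter is counted at most once per bin. The function name, the default bin size and any example word are hypothetical and are not taken from the analysis in this study.

```python
from collections import Counter


def word_position_histogram(seqs, word, first_pos=-500, bin_size=50):
    """Count promoters that contain `word` starting within each positional bin.

    Keys of the returned dict are the first position of each bin
    (assumed coordinate system: index 0 of every sequence is `first_pos`).
    Overlapping occurrences are found, but a promoter contributes at
    most one count to any given bin.
    """
    word = word.upper()
    hist = Counter()
    for seq in seqs:
        seq = seq.upper()
        bins_hit = set()
        i = seq.find(word)
        while i != -1:
            bins_hit.add(i // bin_size)
            i = seq.find(word, i + 1)  # step by 1 to allow overlaps
        for b in bins_hit:
            hist[first_pos + b * bin_size] += 1
    return dict(sorted(hist.items()))
```

Applying the same function separately to two subsets of promoters (for example, TATA-containing and TATA-less) and comparing the counts per bin would give a rough view of subset-specific positional preferences for a word.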
The pattern of distribution of the GAGA factor binding sites is consistent with its possible role in displacement and/or restructuring of nucleosomes in the vicinity of the RNA start site (TSUKIYAMA et al. 1994). The similarly located increase in GC-containing words could play an analogous role. In mammals, this increase is mainly exemplified by a significant number of GC-rich promoters containing Sp1 transcription factor binding sites; however, this is not the case in Drosophila. Although the gene for the Drosophila Sp1 analogue has been cloned and its product shown to interact with the mammalian binding site (WIMMER et al. 1993), its naturally occurring recognition sequence might differ from the mammalian consensus. Alternatively, it may represent a highly specialized transcription factor with binding sites in very few promoters, since it is known to regulate expression of genes involved in head development. The absence of specific CAAT-box localization also suggests a different mode of action of the Drosophila C/EBP counterpart (FALB and MANIATIS 1992).

Finally, the expansion of species-specific promoter databases and further searches for promoter elements may make possible further subdivisions into different promoter subsets and the identification of subset-specific sequences. Accumulation of such data may also be helpful in localization of potential promoters in genomic sequences and in more profound understanding of the mechanisms of transcriptional regulation and molecular evolution of promoters.

I would like to express my deep gratitude to M. MESELSON for his support and encouragement throughout the course of this work and critical reading of the manuscript. I also thank M. WATERMAN and M. EGGERT for the program RTIDE. Special thanks are due to S. POKROVSKY for help with the information content analysis. This research was supported by the National Institutes of Health grant GM22274 to M. MESELSON.

LITERATURE CITED

ARKHIPOVA, I. R., and Y. V.
ILYIN, 1991 Properties of promoter regions of mdg1 Drosophila retrotransposon indicate that it belongs to a specific class of promoters. EMBO J. 10: 1169-1177.
ARKHIPOVA, I. R., and Y. V. ILYIN, 1992 Control of transcription of Drosophila retrotransposons. BioEssays 14: 161-168.
ARKHIPOVA, I. R., A. M. MAZO, V. A. CHERKASOVA, T. V. GORELOVA, N. G. SCHUPPE et al., 1986 The steps of reverse transcription of Drosophila mobile dispersed genetic elements and U3-R-U5 structure of their LTRs. Cell 44: 555-563.
ASHBURNER, M., 1989 Drosophila: A Laboratory Handbook. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
BENSON, M., and V. PIRROTTA, 1988 The Drosophila zeste protein binds cooperatively to sites in many gene regulatory regions: implications for transvection and gene regulation. EMBO J. 7: 3907-3915.
BERG, O. G., and P. H. VON HIPPEL, 1987 Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193: 723-750.
BIGGIN, M., and R. TJIAN, 1988 Transcription factors that activate the Ultrabithorax promoter in developmentally staged extracts. Cell 53: 699-711.
BUCHER, P., 1990 Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212: 563-578.
BUCHER, P., 1993 The Eukaryotic Promoter Database EPD. EMBL nucleotide sequence data library release 34, Postfach 10.2209, D-6900 Heidelberg.
BUCHER, P., and E. N. TRIFONOV, 1986 Compilation and analysis of eukaryotic POL II promoter sequences. Nucleic Acids Res. 14: 10009-10026.
BURATOWSKI, S., 1994 The basics of basal transcription by RNA polymerase II. Cell 77: 1-3.
CHERBAS, L., and P. CHERBAS, 1993 The arthropod initiator: the capsite consensus plays an important role in transcription. Insect Biochem. Mol. Biol. 23: 81-90.
CHERBAS, L., R. A. SCHULZ, M. M. KOEHLER, C. SAVAKIS and P. CHERBAS, 1986 Structure of the Eip28/29 gene, an ecdysone-inducible gene from Drosophila. J. Mol. Biol. 189: 617-631.
CONAWAY, R. C., and J. W. CONAWAY, 1993 General initiation factors for RNA polymerase II. Annu. Rev. Biochem. 62: 161-190.
CONTURSI, C., G. MINCHIOTTI and P. P. DI NOCERA, 1993 Functional dissection of two promoters that control sense and antisense transcription of Drosophila melanogaster F elements. J. Mol. Biol. 234: 988-997.
FALB, D., and T. MANIATIS, 1992 A conserved regulatory unit implicated in tissue-specific gene expression in Drosophila and man. Genes Dev. 6: 454-465.
Flybase consortium, 1993 Flybase, a database of genetic and molecular data for Drosophila. The Genetics Society of America, Rockville, MD.
FRIDELL, Y.-W., and L. L. SEARLES, 1992 In vivo transcriptional analysis of the TATA-less promoter of the Drosophila melanogaster vermilion gene. Mol. Cell. Biol. 12: 4571-4577.
Genetics Computer Group, 1991 Program manual for the GCG package, Version 7, April 1991, Madison, WI.
GILL, G., 1994 Taking the initiative. Curr. Biol. 4: 374-376.
HERNANDEZ, N., 1993 TBP, a universal eukaryotic transcription factor? Genes Dev. 7: 1291-1308.
HULTMARK, D., R. KLEMENZ and W. R. GEHRING, 1986 Translational and transcriptional control elements in the untranslated leader of the heat-shock gene hsp22. Cell 44: 429-438.
JARRELL, K. A., and M. MESELSON, 1991 Drosophila retrotransposon promoter includes an essential sequence at the initiation site and requires a downstream sequence for full activity. Proc. Natl. Acad. Sci. USA 88: 102-104.
JAVAHERY, R., A. KHACHI, K. LO, B. ZENZIE-GREGORY and S. T. SMALE, 1994 DNA sequence requirements for transcriptional initiator activity in mammalian cells. Mol. Cell. Biol. 14: 116-127.
LEE, H., K. W. KRAUS, M. F. WOLFNER and J. T. LIS, 1992 DNA sequence requirements for generating paused polymerase at the start of hsp70. Genes Dev. 6: 284-295.
MAICAS, E., and J. D. FRIESEN, 1990 A sequence pattern that occurs at the transcription initiation region of yeast RNA polymerase II promoters. Nucleic Acids Res. 18: 3387-3393.
MACK, D. H., J. VARTIKAR, J. M. PIPAS and L. A. LAIMINS, 1993 Specific repression of TATA-mediated but not initiator-mediated transcription by wild-type p53. Nature 363: 281-283.
MCKNIGHT, S. L., and K. R. YAMAMOTO, 1992 Transcriptional regulation. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
MCLEAN, C., A. BUCHETON and D. J. FINNEGAN, 1993 The 5'-untranslated region of the I factor, a long interspersed nuclear element-like retrotransposon of Drosophila melanogaster, contains an internal promoter and sequences that regulate expression. Mol. Cell. Biol. 13: 1042-1050.
MINAKAMI, R., K. KUROSE, K. ETOH, Y. FURUHATA, M. HATTORI et al., 1992 Identification of an internal cis-element essential for the human L1 transcription and a nuclear factor(s) binding to the element. Nucleic Acids Res. 20: 3139-3145.
MINCHIOTTI, G., and P. P. DI NOCERA, 1991 Convergent transcription initiates from oppositely oriented promoters within the 5' end regions of Drosophila melanogaster F elements. Mol. Cell. Biol. 11: 5171-5180.
MIZROKHI, L. J., and A. M. MAZO, 1990 Evidence for horizontal transmission of the mobile element jockey between distant Drosophila species. Proc. Natl. Acad. Sci. USA 87: 9216-9220.
MIZROKHI, L. J., S. G. GEORGIEVA and Y. V. ILYIN, 1988 Jockey, a mobile Drosophila element similar to mammalian LINEs, is transcribed from the internal promoter by RNA polymerase II. Cell 54: 685-691.
MOUNT, S. M., C. BURKS, G. HERTZ, G. D. STORMO, O. WHITE et al., 1992 Splicing signals in Drosophila: intron size, information content, and consensus sequences. Nucleic Acids Res. 20: 4255-4262.
O'SHEA-GREENFIELD, A., and S. T. SMALE, 1992 Roles of TATA and initiator elements in determining the start site location and direction of RNA polymerase II transcription. J. Biol. Chem. 267: 1391-1402.
PENOTTI, F., 1990 Human DNA TATA boxes and transcription initiation sites: a statistical study. J. Mol. Biol. 213: 37-52.
PERKINS, K. K., G. M. DAILEY and R. TJIAN, 1988 In vitro analysis of the Antennapedia P2 promoter: identification of a new Drosophila transcription factor. Genes Dev. 2: 1615-1626.
PUGH, B. F., and R. TJIAN, 1991 Transcription from a TATA-less promoter requires a multisubunit TFIID complex. Genes Dev. 5: 1935-1945.
PURNELL, B. A., and D. S. GILMOUR, 1993 Contribution of sequences downstream of the TATA element to a protein-DNA complex containing the TATA-binding protein. Mol. Cell. Biol. 13: 2593-2603.
PURNELL, B. A., P. A. EMANUEL and D. S. GILMOUR, 1994 TFIID sequence recognition of the initiator and sequences farther downstream in Drosophila class II genes. Genes Dev. 8: 830-842.
RASMUSSEN, E. B., and J. T. LIS, 1993 In vivo transcriptional pausing and cap formation on three Drosophila heat shock genes. Proc. Natl. Acad. Sci. USA 90: 7923-7927.
SCHNEIDER, T. D., G. D. STORMO, L. GOLD and A. EHRENFEUCHT, 1986 Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188: 415-431.
SMALE, S. T., 1994 Core promoter architecture for eukaryotic protein-coding genes, pp. 63-81 in Transcription: Mechanisms and Regulation, edited by R. C. CONAWAY and J. W. CONAWAY. Raven Press, New York.
SMALE, S. T., and D. BALTIMORE, 1989 The "initiator" as a transcriptional control element. Cell 57: 103-113.
SNYDER, M., M. HUNKAPILLER, D. YUEN, D. SILVERT, J. FRISTROM et al., 1982 Cuticle protein genes of Drosophila: structure, organization and evolution of four clustered genes. Cell 29: 1027-1040.
SOELLER, W., S. J. POOLE and T. KORNBERG, 1988 In vitro transcription of the Drosophila engrailed gene. Genes Dev. 2: 68-81.
STRUHL, K., 1987 Promoters, activator proteins, and the mechanism of transcriptional initiation in yeast. Cell 49: 295-297.
SWERGOLD, G., 1990 Identification, characterization, and cell specificity of a human LINE-1 promoter. Mol. Cell. Biol. 10: 6718-6729.
SYPES, M. A., and D. S. GILMOUR, 1994 Protein/DNA crosslinking of a TFIID complex reveals novel interactions downstream of the transcription start. Nucleic Acids Res. 22: 807-814.
THUMMEL, C. S., 1989 The Drosophila E74 promoter contains essential sequences downstream from the start site of transcription. Genes Dev. 3: 782-792.
TJIAN, R., and T. MANIATIS, 1994 Transcriptional activation: a complex puzzle with few easy pieces. Cell 77: 5-8.
TSUKIYAMA, T., P. B. BECKER and C. WU, 1994 ATP-dependent nucleosome disruption at a heat-shock promoter mediated by binding of GAGA transcription factor. Nature 367: 525-532.
USHEVA, A., and T. SHENK, 1994 TATA-binding protein-independent initiation: YY1, TFIIB, and RNA polymerase II direct basal transcription on supercoiled template DNA. Cell 76: 1115-1121.
VERRIJZER, C. P., K. YOKOMORI, J.-L. CHEN and R. TJIAN, 1994 Drosophila TAFII150: similarity to yeast gene TSM-1 and specific binding to core promoter DNA. Science 264: 933-941.
WATERMAN, M. S., and R. JONES, 1990 Consensus methods for DNA and protein sequence alignment. Methods Enzymol. 183: 221-237.
WEIS, L., and D. REINBERG, 1992 Transcription by RNA polymerase II: initiator-directed formation of transcription-competent complexes. FASEB J. 6: 3300-3309.
WIMMER, E. A., H. JACKLE, C. PFEIFLE and S. M. COHEN, 1993 A Drosophila homologue of human Sp1 is a head-specific segmentation gene. Nature 366: 690-694.
ZAWEL, L., and D. REINBERG, 1993 Initiation of transcription by RNA polymerase II: a multi-step process. Prog. Nucleic Acid Res. Mol. Biol. 44: 67-108.
ZHOU, Q., P. M. LIEBERMAN, T. G. BOYER and A. J. BERK, 1992 Holo-TFIID supports transcriptional activation by diverse activators and from a TATA-less promoter. Genes Dev. 6: 1964-1974.

Communicating editor: V. G. FINNERTY