Copyright © 1995 by the Genetics Society of America

Promoter Elements in Drosophila melanogaster Revealed by Sequence Analysis
Irina R. Arkhipova
Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138
Manuscript received October 7, 1994
Accepted for publication December 10, 1994
ABSTRACT
A Drosophila Promoter Database containing 252 independent Drosophila melanogaster promoter entries has been compiled. The database and its subsets have been searched for overrepresented sequences. The analysis reveals that the proximal promoter region displays the most dramatic nucleotide sequence irregularities and exhibits a tripartite structure, consisting of TATA at -25/-30 bp, initiator (Inr) at ±5 bp and a novel class of downstream elements at +20/+30 bp from the RNA start site. These latter elements are also strand-specific. However, they differ from TATA and Inr in several aspects: (1) they are represented not by a single, but by multiple sequences, (2) they are shorter, (3) their position is less strictly fixed with respect to the RNA start site, (4) they emerge as a characteristic feature of Drosophila promoters and (5) some of them are strongly overrepresented in the TATA-less, but not TATA-containing, subset. About one-half of known Drosophila promoters can be classified as TATA-less. The overall sequence organization of the promoter region is characterized by an extended region with an increase in GC content and a decrease in A, which contains a number of binding sites for Drosophila transcription factors.
EUKARYOTIC promoters have long been an object of intensive study. The fundamental processes that result in spatially and temporally regulated patterns of gene expression are gradually being uncovered. Much progress has been achieved in identification and functional characterization of trans-acting protein factors, both basal and regulatory, which can act coordinately to provide proper levels of expression for every gene in a given cell at a given time (reviewed in MCKNIGHT and YAMAMOTO 1992; CONAWAY and CONAWAY 1993; ZAWEL and REINBERG 1993; BURATOWSKI 1994; TJIAN and MANIATIS 1994).
There are, however, numerous unanswered questions and problems that remain controversial. Substantial progress has been achieved in identification of the trans-acting factors that carry out the transcription process. Characterization of the cis-acting promoter sequences, which contribute to the proper binding of the trans-acting factors or otherwise participate in transcription, has attracted less attention in recent years.
Some of these questions can be answered, at least in
part, by analyzing the nucleotide sequences in large
promoter data sets aligned with respect to the RNA
start site. The sequences that can be found with a high
degree of probability at a particular position in such a
data set (or its subsets) represent naturally occurring
promoter elements selected by evolution to perform
different position-specific functions.
Address for correspondence: Department of Molecular and Cellular Biology, Harvard University, 7 Divinity Ave., Cambridge, MA 02138-2092. E-mail: [email protected]
Genetics 139: 1359-1369 (March, 1995)
Previous studies of this kind dealt with eukaryotic promoters in general (BUCHER and TRIFONOV 1986; BUCHER 1990). The data sets used were divided only into invertebrate and vertebrate subsets to keep them large enough for statistical analysis. In these diverse data sets, those overrepresented elements that are most conserved between species, such as the TATA box and the cap-site consensus or initiator (Inr), could be identified unambiguously, and the CAAT- and GC-boxes were found at elevated frequencies in the upstream regions of vertebrate promoters.
Analysis of promoter data sets from a single species
is attractive because some promoter elements, and the
corresponding protein factors, may not in fact be as conserved between species as the TATA box and the Inr.
The TATA box is the best characterized basal eukaryotic promoter element. It is located at -25/-30 bp from the RNA start site and is recognized by the TATA-binding protein (TBP), which induces DNA bending and participates in transcription by all three RNA polymerases (reviewed in HERNANDEZ 1993). Nevertheless, a significant proportion of promoters do not possess any TATA box, and although TBP was shown to participate in transcription from some TATA-less promoters in vitro (PUGH and TJIAN 1991; ZHOU et al. 1992), the basis of its interaction with DNA in this case is not clear and is thought to be assisted by TBP-associated factors (TAFs) that form a multiprotein complex with TBP.
The second well-known element is the cap-site sequence or initiator (Inr) (SMALE and BALTIMORE 1989; reviewed in WEIS and REINBERG 1992; GILL 1994; SMALE 1994), which is present in the immediate vicinity of the
RNA start site. Its importance in transcription, as well as its functional distinction from the TATA box in some aspects of transcriptional regulation, has been experimentally demonstrated (JARRELL and MESELSON 1991; MACK et al. 1993; USHEVA and SHENK 1994). In Drosophila, the cap-site element was initially noticed in a number of different promoters (SNYDER et al. 1982; ARKHIPOVA et al. 1986; CHERBAS et al. 1986; HULTMARK et al. 1986), and the TCAGT pentamer (and its cognates A/,CAGT and TCATT) was recently identified in a statistical study of the -25/+25 region of 112 arthropod promoters (CHERBAS and CHERBAS 1993). The RNA start site homologies are much less pronounced in vertebrates (BUCHER 1990) and all but disappear in humans (PENOTTI 1990).
Deletion and mutation studies of several Drosophila promoters indicate the existence of a novel third class of proximal promoter elements that are located 20-30 bp downstream from the RNA start site. Elements of this kind have been found in many Drosophila retrotransposons (ARKHIPOVA and ILYIN 1991; JARRELL and MESELSON 1991; MINCHIOTTI and DI NOCERA 1991; MIZROKHI and MAZO 1990; MCLEAN et al. 1993) and, interestingly, in a number of developmentally regulated genes that do not have TATA boxes (BIGGIN and TJIAN 1988; PERKINS et al. 1988; SOELLER et al. 1988; THUMMEL 1989). Elements in this region were shown to be essential for transcription in vivo and/or in vitro and in some cases associated with binding of nuclear proteins. Limited sequence homologies have been noticed between the downstream regions of these promoters (ARKHIPOVA and ILYIN 1991; MINCHIOTTI and DI NOCERA 1991; MCLEAN et al. 1993), but they were relatively short and somewhat variable in location; at least two types of downstream elements could be distinguished with no similarity whatsoever. Thus, the sequence identity of these elements remained questionable. It was also unclear whether they can be identified in a minority or in a substantial fraction of Drosophila promoters.
Therefore, it was of interest to find out how widespread the downstream elements are, what their consensus sequence(s) are and whether they represent an essential component of RNA polymerase II (pol II) promoters in Drosophila. This could be achieved by a statistical analysis of promoter sequences.
To this end, I have compiled and analyzed a Drosophila Promoter Database (DPD). It currently consists of 252 independent D. melanogaster entries and is the largest database of those available for a single species. The analysis presented here reveals a tripartite structure of Drosophila promoters, extends the consensus sequences for the Drosophila TATA and Inr elements, shows that various types of specifically localized downstream elements can be found in a significant fraction of Drosophila promoters and demonstrates that some of the downstream elements are characteristic for the TATA-less subset. Overall, there is a significant increase in GC content toward the RNA start site. Neither the GGGCGG nor the CAAT-motifs are overrepresented at any position in Drosophila promoters.
MATERIALS AND METHODS
Database: For initial analysis, 85 independent D. melanogaster entries were extracted from the Eukaryotic Promoter Database (EPD) release 34 (BUCHER 1993), which exists as part of the EMBL database and is also available at the Johns Hopkins University Gopher server. (Inspection of release 39 as of June 1994 showed that only four Drosophila entries were added.) Sequence analysis was limited to 100 bp downstream and 500 bp upstream from the RNA start site to minimize the influence of the coding regions and to retain ≥75% of promoter sequences in the most 5′-extreme part (close to -500) of the data set. In some cases, the mouse and human subsets of the same EPD release, containing 216 and 148 independent entries, respectively, were also analyzed.
To make the database more representative, it has been expanded approximately threefold (to 252 entries) by additional D. melanogaster sequences from GenBank release 78, for which the information about the location of the RNA start site was available. FlyBase (1993) was used to obtain specific information about individual genes. The criteria for including an entry in the database were essentially the same as in the EPD (BUCHER 1993), and the precision of the RNA start site location was estimated to be ±1-3 bp in most promoters. Each entry contained at least 100 bp of upstream sequence. Retrotransposon promoters, reviewed in ARKHIPOVA and ILYIN (1992), were excluded from the present analysis, as they would have created a strong initial bias toward promoters with downstream elements.
Increasing the number of entries has made it possible to divide the database into two subsets, in accordance with the presence or absence of the TATA box. The promoters having at least part of an AT-rich sequence (≥4 bp) falling within the -25/-30 interval were regarded as TATA-containing. Such a subdivision is not unambiguous, since in many cases an AT-rich sequence may deviate to a significant extent from the consensus TATAAA, and an experimental proof of its functional significance is generally lacking. Nevertheless, a comparative analysis of such subsets can at least reveal biases in occurrence of particular sequence elements. Overall, 129 promoters were regarded as TATA-containing and 123 as TATA-less.
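The subdivision rule can be illustrated with a minimal sketch. It assumes a literal reading of the criterion, namely an uninterrupted run of at least 4 A/T bases that overlaps the -30/-25 interval. The function name, the `offset` argument and the example sequences are hypothetical and are not part of the original analysis, which relied on the GCG package.

```python
import re


def has_tata_like(seq, offset, min_len=4, window=(-30, -25)):
    """Return True if an A/T run of >= min_len bp overlaps the window.

    `offset` is the coordinate of the first base of `seq` relative to the
    RNA start site (an assumed input layout, not taken from the paper).
    """
    pattern = r"[AT]{%d,}" % min_len
    for match in re.finditer(pattern, seq.upper()):
        first = offset + match.start()      # coordinate of first run base
        last = offset + match.end() - 1     # coordinate of last run base
        # Two closed intervals overlap if each starts before the other ends.
        if first <= window[1] and last >= window[0]:
            return True
    return False


print(has_tata_like("GCGCGTATAAAGGCGC", -35))    # AT-rich run spans -30..-25
print(has_tata_like("ATATATGCGCGCGCGCGC", -50))  # AT-rich run lies upstream
```

As noted in the text, such a rule cannot establish function; it only sorts promoters into subsets for comparison.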
Methods: The sequences were aligned with respect to the
RNA start site. The entire database, as well as its subsets, has
been extensively searched for overrepresented sequences. As
a rule, only the sense strand was analyzed to reveal any strand specificity.
The most conserved elements, like the TATA box and Inr, are readily detected by programs that identify the most frequently occurring sequence elements in an aligned data set (WATERMAN and JONES 1990). However, no sequence irregularities were detected in the downstream region by these methods. Therefore, in most cases I used the analysis of positional distribution of 4^n individual n-mers ("words," where 1 ≤ n ≤ 5), which might be called "word profiling." With a few exceptions for specific words (MOUNT et al. 1992; CHERBAS and CHERBAS 1993), it has not been routinely used. This approach seems most attractive in cases where multiple, not single, types of elements can be expected to reside in a particular region. Any entries with improperly localized RNA start
sites should not greatly influence the results as they will not
contribute to the peak, only to the background noise.
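Word profiling as described here can be sketched in a few lines. The sketch assumes that each sequence is stored together with the coordinate of its first base relative to the RNA start site and that a word is assigned to the position of its first base; both conventions, and all names below, are hypothetical illustrations rather than the procedure actually run in GCG.

```python
from collections import Counter
from itertools import product


def all_words(max_n=4):
    """Enumerate every n-mer over ACGT for 1 <= n <= max_n."""
    return ["".join(p) for n in range(1, max_n + 1)
            for p in product("ACGT", repeat=n)]


def word_profile(seqs, word, start=-500, bin_size=5):
    """Return {bin start coordinate: count} for `word`.

    `seqs` is a list of (sequence, offset) pairs; `offset` is the coordinate
    of the first base relative to the RNA start site (assumed layout).
    """
    profile = Counter()
    for seq, offset in seqs:
        pos = seq.find(word)
        while pos != -1:
            coord = offset + pos
            bin_start = start + ((coord - start) // bin_size) * bin_size
            profile[bin_start] += 1
            pos = seq.find(word, pos + 1)  # step by 1 so overlaps are counted
    return dict(profile)


example = [("AATATAAAG", -30), ("GGTATAGG", -30)]
print(len(all_words(4)))              # 340 words of length 1 to 4
print(word_profile(example, "TATA"))  # {-30: 2}
```

A misplaced start site shifts a word into a neighbouring bin, so it adds to the background rather than to the peak, which is the robustness argument made above.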
The data sets were analyzed with the aid of the GCG sequence analysis software package (Genetics Computer Group 1991) running on a SUN Sparcstation. The occurrence frequencies for all 340 possible 1-4mers and for selected 5mers were determined for every position and plotted in bins of 2 or 5 bp against their position with respect to the RNA start site. Gapped n-mers were not included in this study. The frequencies were plotted in absolute numbers rather than percentages or fractions. A slight underestimation of the most 5′-extreme words resulting from reduction of the local sample size did not influence the conclusions.
The profiles for each word were visually examined, and those for n > 2 were also subjected to statistical analysis. These latter could be divided into two groups: those for which the maximal occurrence frequency differed >4 SEs from the average occurrence frequency were considered "interesting," and the others, which displayed a fairly uniform positional distribution and constituted the majority (usually >3/4 of words), were regarded as "uninteresting." This criterion, although rather arbitrary, was nevertheless suitable as an empirical cutoff value for identifying localized elements of possible functional significance (see RESULTS). Similarly positioned interesting words were grouped together, and from their comparison the consensus sequences could be deduced. Some profiles are given not for the entire -500/+100 interval, but only for the regions displaying sequence nonrandomness. Profiles for words and/or intervals not presented in this paper are available upon request.
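The cutoff can be expressed as a short test on a binned profile. The text does not state how the SE was computed, so the sketch below makes a loudly labelled assumption: the spread is taken as the sample standard deviation of the binned counts. The function name and the example profiles are hypothetical.

```python
from statistics import mean, stdev


def is_interesting(counts, threshold=4.0):
    """Flag a profile whose maximum lies > threshold spreads above the mean.

    ASSUMPTION: the spread is the sample standard deviation of the binned
    counts; the paper says only "SEs" and does not define the estimator.
    """
    average = mean(counts)
    spread = stdev(counts)
    if spread == 0:          # perfectly flat profile, nothing stands out
        return False
    return (max(counts) - average) / spread > threshold


flat = [5, 6] * 15           # uniform-looking profile, 30 bins
peaked = [5] * 29 + [40]     # one sharply overrepresented bin
print(is_interesting(flat), is_interesting(peaked))
```

Because the peak itself inflates the spread, this form of the test is conservative for short profiles; a different estimator would shift the cutoff but not the idea.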
Evaluation of the information content was performed by the exact method (SCHNEIDER et al. 1986). The information content was calculated using the multinomial distribution formula; the numbers of occurrences at each position were added in bins of 5, and the overall occurrence frequencies were taken as observed in the entire promoter data set.
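The binned calculation can be sketched as follows. The sketch assumes the plain relative-entropy form, I = Σ f·log2(f/p), with the background frequencies p taken from the entire data set; it deliberately omits the small-sample correction of the exact method (SCHNEIDER et al. 1986). The function name and input layout are hypothetical.

```python
import math


def information_content(bin_counts, background):
    """Relative entropy (bits) of one bin against background frequencies.

    `bin_counts` maps each word (here a base) to its count in the bin;
    `background` maps the same words to overall frequencies.
    No small-sample correction is applied (a simplification).
    """
    total = sum(bin_counts.values())
    info = 0.0
    for word, count in bin_counts.items():
        if count == 0:
            continue                     # 0 * log(0) is taken as 0
        freq = count / total
        info += freq * math.log2(freq / background[word])
    return info


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content({"A": 25, "C": 25, "G": 25, "T": 25}, uniform))
print(information_content({"A": 100, "C": 0, "G": 0, "T": 0}, uniform))
```

A bin that matches the background scores 0 bits, and a bin occupied by a single base scores 2 bits against a uniform background.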
RESULTS
Analysis of the D. melanogaster promoter data set reveals several regions with highly nonrandom nucleotide sequence distribution, some of which are well known and some are not. It should be noted that all the interesting words fall into several specifically positioned groups, as described below. The highest concentration of locally overrepresented words is observed in the proximal promoter region.
Complex structure of the proximal promoter region: The region in the vicinity of the RNA start site (± a few dozen bp), which is usually the site of interaction with basal transcription machinery (RNA pol II and associated factors), differs significantly from the rest of the sequence. The TATA box and to some extent the Inr regions can be noted even at the mononucleotide level (Figure 1). The doublet analysis appears more informative (Figure 2): for many of the doublets, this region is characterized by several sharp drops and/or rises in sequence composition. The main points of sequence irregularity are located in the intervals -30/-20, -5/+5, and +20/+35. The pattern for doublets not shown in Figure 2 displays much less sequence nonrandomness and is closer to uniformity; TC and AG also have a visible peak at the RNA start site. The picture is different from that for mammalian promoters, which do not seem to possess a specifically localized downstream region of highly nonrandom sequence composition (not shown). Thus, locally overrepresented downstream elements detectable by this analysis are a characteristic feature of Drosophila promoters.
The -25/-30 region: About one-half of Drosophila promoters do not contain a recognizable TATA box at the appropriate location. In the TATA-containing subset, the TATA box produces a dominating peak and is strictly strand-specific (Figure 1, A; Figure 2, TA, AT, AA; Figure 3, TATA, ATAA). Comparison of overlapping interesting triplets (TAT, ATA, TAA, AAA, GTA and AAG) and tetramers (TATA, ATAA, TAAA and AAAA and, to a lesser extent, ATAT, GTAT, CTAT, AAAG, AAGC and AAGG) yields a consensus sequence */G/,TATAAAG/.,,”/,. No locally overrepresented words can be detected at this position in the TATA-less subset. Therefore, there are no specific TATA box substitutes that are characteristic for this region.
The RNA start site: A typical initiator element can be found in approximately one-third of Drosophila promoters. Examples of overrepresentation at the RNA start site are shown in Figure 3 (TCAG, TCAT). An analogous picture is observed for CAGT, CATT, ATCA, GTCA, TTCA, AGTT, AGTC and GTTG, listed in the order of descending frequency at the RNA start site (not shown). Thus, the consensus strand-specific pentamer reported by CHERBAS and CHERBAS (1993) can be extended to T/A/G TCA G/T T T/C G. As in the above study, no obvious correlation is found between the presence of the Inr consensus and the TATA box. The RNA start site in TATA-less promoters tends to be enriched in T residues, most frequently organized in short runs of T (Figure 1, T; Figure 2, TT).
Downstream elements: The third strand-specific region of sequence heterogeneity, which is located downstream from the RNA start site and has not been identified in previous statistical studies, differs from the previous two in several aspects. In contrast to the TATA box and Inr, it does not contain a single predominant sequence element, but several types of elements from which a single consensus cannot be deduced. The location of the downstream elements is less strictly fixed with respect to the RNA start site, with the major peaks appearing in the interval +20/+35.
Local overrepresentation of specific words is best seen at the level of 3- and 4-mer analysis. The interesting downstream triplets are ACA, AAC, TCG and GTG (not shown). Most of the interesting downstream tetramers are shown in Figure 4. The most prominent type is represented by AACA, ACAA or ACAG. The second type most frequently occurs as TCGA, and its preferred location is slightly closer to the RNA start site (around position +20). Both ACGT and ACGC can also be
FIGURE 1.-Distribution of individual bases along the promoter region. The number of occurrences on the sense strand is plotted in bins of 5 against their position with respect to the RNA start site (position 0). +, TATA-containing promoters; ○, TATA-less promoters. Although not all of the profiles display notable differences between subsets, plotting the data for both subsets in a single graph makes the common and specific features more evident.
classified as interesting. The CGTG tetramer differs from the previous ones in its distribution between the two subsets: its overrepresentation at +25/+30 is clearly biased toward the TATA-less subset (Figure 5). ACGY and CTCG display the same bias in distribution between subsets (Figure 5).
In individual promoters, one type of downstream element may sometimes be repeated or occur in combination with other types. No obvious regularity in spacing was found between various downstream elements and the Inr, the RNA start site or the TATA box; the imprecision of the RNA start site determination could have partially obscured this. For some words, the distribution appears to be bimodal. The bimodality is also observed in the information content profiles (see below).
The shortness and multiplicity of the downstream elements make it difficult to estimate the percentage of promoters containing such elements, and their functional significance may differ in individual promoters. Judging by the degree of overrepresentation, a rough estimate can be made that more than one-half of Drosophila promoters contain them.
Long-range promoter organization: A wide area in the vicinity of the RNA start site (from -150 to +50 bp), which may be called a "GC-hill," displays a significant increase in GC content (and a corresponding decrease in A, mainly AA, content), represented by dinucleotides CG and GC (and to a lesser extent by CC and GG) (Figures 1 and 2). This pattern exists for both TATA+ and TATA- subsets. It differs from that for mammalian promoters, which also exhibit a gradual increase in GC frequency toward the RNA start site but in which the CC and GG doublets make a significant contribution as well (not shown).
In Drosophila promoters, there is no obvious underrepresentation of CG doublets compared with GC (in agreement with data for genomic sequences, ASHBURNER 1989). Again, this is in contrast to mammalian promoters that contain approximately twice as much GC as CG (not shown). This difference is probably connected with the lack of cytosine methylation in Drosophila.
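The GC versus CG comparison reduces to counting two overlapping doublets on the sense strand. A minimal sketch follows; the function names and the toy sequences are hypothetical, and a ratio near 1 would correspond to the Drosophila pattern while a ratio near 2 would correspond to the mammalian pattern described above.

```python
def count_doublet(seq, doublet):
    """Count overlapping occurrences of a 2-base word on one strand."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == doublet)


def gc_to_cg_ratio(seqs):
    """Ratio of GC to CG doublet counts pooled over all sequences."""
    gc = sum(count_doublet(s, "GC") for s in seqs)
    cg = sum(count_doublet(s, "CG") for s in seqs)
    return gc / cg if cg else float("inf")


print(gc_to_cg_ratio(["GCGCG"]))      # 2 GC and 2 CG doublets
print(gc_to_cg_ratio(["GCAGCTGCG"]))  # 3 GC doublets, 1 CG doublet
```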
Transcription factor binding sites: Of the remaining interesting words that are not located in the basal promoter region, the recognition sequence for the Drosophila-specific GAGA factor (BIGGIN and TJIAN 1988) is the most notable one. Its concentration is significantly increased on both strands in the wide area roughly corresponding to the GC-hill, with the sites of local overrepresentation in the -SO/-120 region (Figure 6). The binding sites for the transcription factor zeste, which have been frequently found in the vicinity of Drosophila promoters (BENSON and PIRROTTA 1988), are somewhat similar to GAGA in their distribution pattern, although less abundant (not shown).
The CAAT-motif, which is prominent in the -80/
FIGURE 2.-Selected doublet profiles. The data are presented as in Figure 1.
FIGURE 3.-TATA-box and initiator (Inr) elements. Distribution profiles of selected tetramers are given for the entire database.
[FIGURE 4: distribution profiles of downstream tetramers; caption not recovered.]
-100 region of mammalian promoters (BUCHER 1990), is not overrepresented at any position in Drosophila promoters (not shown).
The majority of GC-containing words can be found as locally overrepresented in the proximal region of the GC-hill, displaying an overall increase in the occurrence frequencies toward the RNA start site (excluding the drops at the main points of sequence heterogeneity in the proximal promoter region described above, see Figure 1). In mammals, a strong contributor to this increase is the recognition sequence for the transcription factor Sp1 (GGGCGG; in both orientations, see BUCHER 1990). However, this sequence occurs very rarely in the entire Drosophila promoter data set and is not locally overrepresented (not shown).
Base composition: Information regarding base composition of the -500/+100 interval of Drosophila promoters is given in Table 1. The overall sequence composition of D. melanogaster promoters is 41.4% GC, compared with 50.6% GC in mice and 55.4% GC in humans. While G and C are approximately equal, A slightly predominates over T. This bias is to a large extent introduced by the sequence TATAAA (Figure 1, A). The A > T composition bias is not observed for the TATA-less subset, which, on the contrary, has an increase in T content at the RNA start site (Figure 1, T). While the overall promoter GC content is close to that of the main band D. melanogaster DNA (43%, ASHBURNER 1989), it is unevenly distributed along the promoter region. The nonuniformity of promoter base composition is particularly evident in the proximal region (a sharp local rise in A and a drop in G and C against the broad A decrease and G + C increase, with more or less uniform overall T).
Information content analysis: A conventional information content analysis that quantifies the entropy, or uncertainty, reduction at each position and reflects the degree of deviation from randomness (SCHNEIDER et al. 1986) presents certain difficulties. To estimate the information content for every position of each promoter element, a gapped alignment is needed because of variable spacing between separate promoter elements. The downstream elements, however, are especially difficult to align due to their multiplicity, shortness and variable location. A qualitative, rather than quantitative, overall information content profile along the promoter region can be obtained without gapped alignment by adding the frequencies for each position in bins of 5, as in the word profiling analysis. At the singlet level such analysis is not extremely informative and reveals mainly the TATA box and the Inr, with the downstream elements being less visible (not shown; see Figure 1).

FIGURE 5.-Downstream elements characteristic for TATA-less promoters. (A) TATA-less promoters; (B) TATA-containing promoters.
Analysis of the doublet information content (BERG and VON HIPPEL 1987) is more promising. Figure 7 represents the profile of information content distribution along the promoter region. In addition to the TATA box and Inr at -20/-30 and 0/+5 against an overall more or less random background, a prominent maximum appears in the +20/+30 region, with the intensity comparable to that of the TATA box and Inr. Five possible profiles, which differed slightly in relative intensities of each element, were obtained by shifting the 5-bp bins by 1 bp (not shown); in three of them, the upstream (TATA) and downstream (+20/+30) elements appear as bimodal (with a 10-bp interval), and in the remaining two the bimodality is not resolved, creating a tripartite structure.
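The five alternative profiles arise because a 5-bp bin frame can start at any of five positions. A minimal sketch of this frame shifting is given below; it sums arbitrary per-position values, the function names are hypothetical, and incomplete bins at the ends are simply dropped (an assumption, since the handling of edge bins is not described).

```python
def binned_sums(values, bin_size=5, shift=0):
    """Sum per-position values in consecutive bins starting at `shift`.

    Trailing positions that do not fill a whole bin are ignored
    (assumed behaviour; the paper does not describe edge handling).
    """
    last_start = len(values) - bin_size
    return [sum(values[i:i + bin_size])
            for i in range(shift, last_start + 1, bin_size)]


def shifted_profiles(values, bin_size=5):
    """One binned profile per possible 1-bp shift of the bin frame."""
    return [binned_sums(values, bin_size, s) for s in range(bin_size)]


values = list(range(12))               # toy per-position scores
print(binned_sums(values))             # frame starting at index 0
print(binned_sums(values, shift=1))    # same data, frame shifted by 1 bp
print(len(shifted_profiles(values)))   # five frames for 5-bp bins
```

A feature narrower than the bin can fall into one bin in some frames and be split across two in others, which is one way the bimodality could be resolved in only three of the five profiles.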
DISCUSSION
This paper presents an overview of nucleotide sequence organization of a large set of promoters from a single species (D. melanogaster) and reveals a number of interesting features, both common and specific compared with other species. It should be emphasized that this kind of analysis identifies those elements that are present in a significant fraction of promoters, not necessarily in all of them, and individual promoters may vary substantially in their properties. Although conservation is indicative of function, the functional significance of such elements should be established experimentally in any particular case.
The biological relevance of the approach used is that
FIGURE 6.-Distribution profiles for GAGA binding sites. (A) TATA-less promoters; (B) TATA-containing promoters.
DNA-protein recognition can be strongly influenced by neighboring bases and often depends on relatively short words that can exhibit local overrepresentation against their own background but may not be readily detectable. Indeed, among the majority of uniformly distributed words, a few that are interesting can easily be distinguished, and these are concentrated in several specific regions. Moreover, some of the words exhibit local overrepresentation only in the TATA-less subset and are uniformly distributed in the TATA-containing subset.
An overall impression from analysis of the nucleotide sequences is that most promoters are composed of multiple sequence elements acting together to achieve proper levels of transcription. The strict strand-specificity of the three major proximal promoter elements in Drosophila (TATA, Inr and the downstream elements) implies that together with bound proteins they participate in guiding the RNA polymerase to transcribe in a proper direction.
The TATA box, despite its indisputable importance and representation in all eukaryotic organisms, is absent from a significant fraction of Drosophila promoters (about one-half). Estimates of this proportion have changed with time, since most of the D. melanogaster entries from the EPD subset, for historical reasons, represented promoters of structural and strongly inducible genes that, as a rule, contained good TATA boxes. Promoters of regulatory genes, on the contrary, less often possess good TATA boxes, and the shift of interests of researchers has resulted in a change of database composition. Thus, the subdivision of promoters into TATA-containing and TATA-less ones discriminates to some extent between structural and developmental genes, although there are a lot of exceptions.
The Inr is represented by a single type of element in
TABLE 1
Base composition of the D. melanogaster promoter data set and its subsets as compared to that of mammalian promoters and the -25/+25 interval of arthropod promoters

Source                     No. of sequences   Total length   %A     %T     %G     %C     Reference
D. melanogaster            248                130,906        29.8   28.8   20.7   20.7   This study
D. melanogaster TATA+      126                68,779         30.3   28.5   20.6   20.6   This study
D. melanogaster TATA-      122                62,127         29.2   29.2   20.8   20.8   This study
D. melanogaster -25/+25    112                5,600          29.0   25.0   23.0   23.0   CHERBAS and CHERBAS (1993)
M. musculus                148                88,800         25.5   23.9   25.4   25.2   This study
Homo sapiens               216                110,716        22.7   21.9   27.5   27.9   This study
FIGURE 7.-Information content (I) of the D. melanogaster promoter data set. The average value (in bits per position) was calculated for the positions -499/-495, -494/-490, . . . , +95/+100 and plotted against the respective positions.
Drosophila and a number of other arthropods that have been examined (CHERBAS and CHERBAS 1993; this study). However, there are several types of Inr elements in mammalian promoters, with several corresponding factors (reviewed in WEIS and REINBERG 1992; SMALE 1994), and it is probably even more variable in humans than in other mammals (PENOTTI 1990). Experiments with a randomized Inr region yielded the G/A/T T/C A G/T T G sequence for Drosophila and a loose YYAN(T/A)YY consensus for mammals (JAVAHERY et al. 1994; PURNELL et al. 1994).
The downstream elements, which in this study have
for the first time emerged at the nucleotide sequence
level as integral components of Drosophila promoters,
seem to be morediverse than other promoter elements.
Some interesting parallels can be drawn between yeast and Drosophila. A survey of mononucleotide composition for 95 yeast promoters (MAICAS and FRIESEN 1990) revealed a constant level of G and C throughout the region -100/+50, while there is a transition from the T peak centered at -20 to the A peak centered at 0. This transition was named “the locator,” as it was found to influence the location of RNA start sites. The doublet composition has not been reported. Visual inspection of the promoters listed in MAICAS and FRIESEN (1990) reveals that the T-rich peak is mainly created by numerous sequences resembling the Inr, such as TTATT, TCTTT or TCATT; the A-rich peak is largely composed of words like AAAC, AACA, ACAA, AAAG, etc., which are listed above among the downstream elements of Drosophila. This raises the interesting possibility that the RNA start site in Saccharomyces cerevisiae may actually correspond in sequence requirements to some downstream elements in Drosophila. If so, the terms “initiator” and “downstream element” may refer to the same element in different organisms. This is not totally unexpected, given the well-known far upstream (-40/-120) location of the yeast TATA box (STRUHL 1987) and the recent finding that the essential TSM-1 gene, a yeast TAFII150 analogue, is able to bind the promoter DNA sequence-specifically (VERRIJZER et al. 1994).
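The kind of mononucleotide survey referred to above can be sketched in a few lines. The example is a hedged illustration only: it assumes equal-length sequences aligned on the RNA start site, treats coordinates as plain integer offsets counted from a user-supplied `start`, and uses the hypothetical function names `composition_profile` and `peak_position`.

```python
from collections import Counter


def composition_profile(seqs, start=-100):
    """Map each aligned coordinate to the fractions of A, C, G and T observed there.

    Assumption: coordinates are consecutive integers beginning at `start`.
    """
    profile = {}
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in "ACGT")
        profile[start + i] = {
            b: (counts[b] / total if total else 0.0) for b in "ACGT"
        }
    return profile


def peak_position(profile, base):
    """Coordinate at which the given base reaches its highest frequency."""
    return max(profile, key=lambda pos: profile[pos][base])
```

Comparing `peak_position(profile, "T")` with `peak_position(profile, "A")` gives a crude view of a T-to-A transition such as the one described for yeast; smoothing over neighboring positions would be needed for real data.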
Unlike Drosophila promoters, the mammalian subsets exhibit little if any specifically localized overrepresentation in the downstream region at the triplet level. The distribution of tetramers is more nonrandom, but there is greater variety in the number and position of overrepresented words than in Drosophila (not shown). Some sequence nonrandomness in the +30 region can be observed at the doublet level. The TATA box dominates overwhelmingly in mammalian promoters, with some additional sequence heterogeneity farther downstream. Given the complexity of mammalian genomes, the current promoter database for these species is not large or diverse enough to reveal downstream elements with a degree of overrepresentation similar to that observed in Drosophila. Such elements are either absent from a significant fraction of mammalian promoters, or their multiplicity and degree of scattering are greater than in Drosophila. Their existence has been experimentally demonstrated for several mammalian genes (reviewed in ARKHIPOVA and ILYIN 1992). The availability of more representative databases, which is expected to result from genome sequencing projects in the near future, will resolve this issue.
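Localized overrepresentation of doublets, triplets or tetramers can be explored by counting words according to the window in which they start. The sketch below is not the statistic used in this study: it computes a crude ratio of the count in one window to the mean count over all windows, and the names `kmer_window_counts` and `enrichment` are hypothetical.

```python
from collections import Counter


def kmer_window_counts(seqs, k, window, offset):
    """Count k-mers by the window in which they start.

    offset is the coordinate assigned to the first base of every sequence
    (for example -500); windows are labeled by their first coordinate.
    """
    counts = {}
    for s in seqs:
        for i in range(len(s) - k + 1):
            win_start = offset + (i // window) * window
            counts.setdefault(win_start, Counter())[s[i:i + k]] += 1
    return counts


def enrichment(counts, word, win_start):
    """Observed count in one window divided by the mean count over all windows."""
    mean = sum(c[word] for c in counts.values()) / len(counts)
    return counts[win_start][word] / mean if mean else 0.0
```

Note that the last window may contain fewer start positions than the others, so its counts are not strictly comparable; a proper analysis would also need a significance estimate.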
The downstream region of Drosophila promoters has
attracted a great deal of attention in recent years. As
mentioned in the Introduction, it has been described
as a transcriptionally important element both in vitro
and in vivo, a binding site for nuclear factors (BIGGIN and TJIAN 1988; PERKINS et al. 1988; ARKHIPOVA and ILYIN 1991) and a site of RNA pol II pausing during transcription (LEE et al. 1992; RASMUSSEN and LIS 1993). Experiments involving mutagenesis of the downstream region were somewhat contradictory: 3'- or internal deletions or substitutions in most cases resulted, but in some cases did not, in reduction of the promoter strength or complete inactivation of the promoter (PERKINS et al. 1988; SOELLER et al. 1988; MIZROKHI and MAZO 1990; ARKHIPOVA and ILYIN 1991; JARRELL and MESELSON 1991; FRIDELL and SEARLES 1992; CONTURSI et al. 1993; MCLEAN et al. 1993). Diverse results led to differing conclusions regarding the importance of the downstream region, ranging from total unimportance to absolute dependence on downstream elements. The truth most probably resides somewhere in between: promoters may differ with respect to the importance of their downstream elements, and the contribution from other elements should play an important role. The TATA box is able to act cooperatively with the Inr (O'SHEA-GREENFIELD and SMALE 1992) or the downstream elements (FRIDELL and SEARLES 1992). Due to the multiplicity and shortness of these elements, deletions and
substitutions could often result in replacement or creation of a novel element instead of disruption of the
old one.
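The point about short, multiple elements can be made concrete with a small check of which words are present before and after a mutation. This is a hypothetical illustration: the word list contains only sequences named in this Discussion (AAAC, AACA, ACAA, AAAG and CGTG) rather than the full set of downstream elements, and the function names `elements_in` and `compare_mutation` are invented for the example.

```python
# Illustrative subset of downstream words mentioned in the text (not the full list).
DOWNSTREAM_WORDS = ("AAAC", "AACA", "ACAA", "AAAG", "CGTG")


def elements_in(seq, words=DOWNSTREAM_WORDS):
    """Return the set of (word, 0-based start) pairs found in seq."""
    hits = set()
    for word in words:
        start = seq.find(word)
        while start != -1:
            hits.add((word, start))
            start = seq.find(word, start + 1)
    return hits


def compare_mutation(wild_type, mutant):
    """Report which elements a mutation destroys and which it creates."""
    before = elements_in(wild_type)
    after = elements_in(mutant)
    return {"lost": before - after, "gained": after - before}
```

For example, a single C-to-G substitution that turns GGAAACGG into GGAAAGGG removes AAAC but creates AAAG at the same position, so the mutant still carries a downstream word.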
Sequence-specific downstream (up to +40 bp) contacts of a 150-kDa component of the Drosophila TFIID complex were reported for hsp70, hsp26 and histone H4 promoters (PURNELL and GILMOUR 1993; PURNELL et al. 1994; SYPES and GILMOUR 1994) and, recently, for a heterologous AdML promoter using cloned and purified Drosophila TAFII150 (VERRIJZER et al. 1994). However, the identity of sequences involved in specific binding remained obscure. The downstream elements identified here are likely to represent such sequences, and this study can provide clues to their functional identification in different Drosophila promoters.
An interesting question is whether multiple downstream elements located at similar positions are recognized by a single protein or by several. The latter possibility seems intriguing in light of the existing differences between the TATA-containing and TATA-less subsets (Figure 5). The CGTG element, which is strongly overrepresented in the TATA-less subset, is also predominant in LINE-like retrotransposons of Drosophila (MINCHIOTTI and DI NOCERA 1991; MCLEAN et al. 1993). These retrotransposons have a completely internal pol II promoter located in the vicinity of the RNA start site, with no upstream sequences of their own, since the first nucleotide of the element should at the same time be the first transcribed nucleotide (MIZROKHI et al. 1988). It is reasonable that promoters that by definition cannot have a TATA box are able to compensate for its absence by the downstream elements. Transcription of the human LINE-1 retrotransposon also depends on the downstream elements (SWERGOLD 1990; MINAKAMI et al. 1992).
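A comparison between TATA-containing and TATA-less subsets of the kind discussed here can be sketched as follows. The windows are assumptions made for illustration only: TATA is sought at coordinates -35 to -20 and the downstream word at +20 to +30, coordinate 0 is taken to be the base at `tss_index`, and the function names `has_word` and `subset_frequency` are hypothetical. The criteria actually used to define the subsets in this study are not reproduced.

```python
def has_word(seq, word, lo, hi, tss_index):
    """True if word starts at a coordinate in [lo, hi]; coordinate 0 is seq[tss_index]."""
    for pos in range(lo, hi + 1):
        i = tss_index + pos
        if i >= 0 and seq[i:i + len(word)] == word:
            return True
    return False


def subset_frequency(seqs, tss_index, word="CGTG",
                     region=(20, 30), tata_region=(-35, -20)):
    """Fraction of promoters carrying `word` in `region`, split by TATA presence.

    Assumption: a promoter counts as TATA-containing if the exact string
    TATA starts anywhere inside tata_region.
    """
    groups = {"TATA": [], "TATA-less": []}
    for s in seqs:
        key = "TATA" if has_word(s, "TATA", *tata_region, tss_index) else "TATA-less"
        groups[key].append(has_word(s, word, *region, tss_index))
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}
```

A markedly higher fraction in the TATA-less group than in the TATA group would be the pattern described above for CGTG, although a real analysis would need a test of significance.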
The emergence of various downstream elements as
widespread components of Drosophila promoter regions raises intriguing questions with regard to their
significance. For instance, their multiplicity could be
invoked as a possible means of generating the basal
promoter variability, in combination with the presence
or absence of other promoter elements. In this respect, it is worth noting that alternatively expressed promoters of the same gene (Amy, Adh, hb, etc.) often have different downstream elements and sometimes also differ with respect to the presence of TATA and Inr.
Although sequence irregularities in the more remote
promoter regions are less pronounced than in the
proximal region, they nevertheless exist. The pattern of distribution of the GAGA factor binding sites is consistent
with its possible role in displacement and/or restructuring of nucleosomes in the vicinity of the RNA start
site (TSUKIYAMA et al. 1994). The similarly located increase in GC-containing words could play an analogous role. In mammals, this increase is mainly exemplified by a significant number of GC-rich promoters containing Sp1 transcription factor binding sites; however, this is not the case in Drosophila. Although the gene for the Drosophila Sp1 analogue has been cloned and its product shown to interact with the mammalian binding site (WIMMER et al. 1993), its naturally occurring recognition sequence might differ from the mammalian consensus. Alternatively, it may represent a highly specialized transcription factor with binding sites in very few promoters, since it is known to regulate expression of genes involved in head development. The absence of specific CAAT-box localization also suggests a different mode of action of the Drosophila C/EBP counterpart (FALB and MANIATIS 1992).
Finally, the expansion of species-specific promoter
databases and further searches for promoter elements
may make possible further subdivisions into different
promoter subsets and the identification of subset-specific sequences. Accumulation of such data may also be
helpful in localization of potential promoters in genomic sequences and in more profound understanding
of the mechanisms of transcriptional regulation and
molecular evolution of promoters.
I would like to express my deep gratitude to M. MESELSON for his support and encouragement throughout the course of this work and critical reading of the manuscript. I also thank M. WATERMAN and M. EGGERT for the program RTIDE. Special thanks are due to S. POKROVSKY for help with the information content analysis. This research was supported by the National Institutes of Health grant GM22274 to M. MESELSON.
LITERATURE CITED
ARKHIPOVA, I. R., and Y. V. ILYIN, 1991 Properties of promoter regions of mdg1 Drosophila retrotransposon indicate that it belongs to a specific class of promoters. EMBO J. 10: 1169-1177.
ARKHIPOVA, I. R., and Y. V. ILYIN, 1992 Control of transcription of Drosophila retrotransposons. BioEssays 14: 161-168.
ARKHIPOVA, I. R., A. M. MAZO, V. A. CHERKASOVA, T. V. GORELOVA, N. G. SCHUPPE et al. 1986 The steps of reverse transcription of Drosophila mobile dispersed genetic elements and U3-R-U5 structure of their LTRs. Cell 44: 555-563.
ASHBURNER, M., 1989 Drosophila: A Laboratory Handbook. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
BENSON, M., and V. PIRROTTA, 1988 The Drosophila zeste protein binds cooperatively to sites in many gene regulatory regions: implications for transvection and gene regulation. EMBO J. 7: 3907-3915.
BERG, O. G., and P. H. VON HIPPEL, 1987 Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193: 723-750.
BIGGIN, M., and R. TJIAN, 1988 Transcription factors that activate the Ultrabithorax promoter in developmentally staged extracts. Cell 53: 699-711.
BUCHER, P., 1990 Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212: 563-578.
BUCHER, P., 1993 The Eukaryotic Promoter Database EPD. EMBL nucleotide sequence data library release 34, Postfach 10.2209, D-6900 Heidelberg.
BUCHER, P., and E. N. TRIFONOV, 1986 Compilation and analysis of eukaryotic POL II promoter sequences. Nucleic Acids Res. 14: 10009-10026.
BURATOWSKI, S., 1994 The basics of basal transcription by RNA polymerase II. Cell 77: 1-3.
CHERBAS, L., and P. CHERBAS, 1993 The arthropod initiator: the capsite consensus plays an important role in transcription. Insect Biochem. Mol. Biol. 23: 81-90.
CHERBAS, L., R. A. SCHULZ, M. M. KOEHLER, C. SAVAKIS and P. CHERBAS, 1986 Structure of the Eip28/29 gene, an ecdysone-inducible gene from Drosophila. J. Mol. Biol. 189: 617-631.
CONAWAY, R. C., and J. W. CONAWAY, 1993 General initiation factors for RNA polymerase II. Annu. Rev. Biochem. 62: 161-190.
CONTURSI, C., G. MINCHIOTTI and P. P. DI NOCERA, 1993 Functional dissection of two promoters that control sense and antisense transcription of Drosophila melanogaster F elements. J. Mol. Biol. 234: 988-997.
FALB, D., and T. MANIATIS, 1992 A conserved regulatory unit implicated in tissue-specific gene expression in Drosophila and man. Genes Dev. 6: 454-465.
Flybase consortium, 1993 Flybase, a database of genetic and molecular data for Drosophila. The Genetics Society of America, Rockville, MD.
FRIDELL, Y.-W., and L. L. SEARLES, 1992 In vivo transcriptional analysis of the TATA-less promoter of the Drosophila melanogaster vermilion gene. Mol. Cell. Biol. 12: 4571-4577.
Genetics Computer Group, 1991 Program manual for the GCG package, Version 7, April 1991, Madison, WI.
GILL, G., 1994 Taking the initiative. Curr. Biol. 4: 374-376.
HERNANDEZ, N., 1993 TBP, a universal eukaryotic transcription factor? Genes Dev. 7: 1291-1308.
HULTMARK, D., R. KLEMENZ and W. R. GEHRING, 1986 Translational and transcriptional control elements in the untranslated leader of the heat-shock gene hsp22. Cell 44: 429-438.
JARRELL, K. A., and M. MESELSON, 1991 Drosophila retrotransposon promoter includes an essential sequence at the initiation site and requires a downstream sequence for full activity. Proc. Natl. Acad. Sci. USA 88: 102-104.
JAVAHERY, R., A. KHACHI, K. LO, B. ZENZIE-GREGORY and S. T. SMALE, 1994 DNA sequence requirements for transcriptional initiator activity in mammalian cells. Mol. Cell. Biol. 14: 116-127.
LEE, H., K. W. KRAUS, M. F. WOLFNER and J. T. LIS, 1992 DNA sequence requirements for generating paused polymerase at the start of hsp70. Genes Dev. 6: 284-295.
MAICAS, E., and J. D. FRIESEN, 1990 A sequence pattern that occurs at the transcription initiation region of yeast RNA polymerase II promoters. Nucleic Acids Res. 18: 3387-3393.
MACK, D. H., J. VARTIKAR, J. M. PIPAS and L. A. LAIMINS, 1993 Specific repression of TATA-mediated but not initiator-mediated transcription by wild-type p53. Nature 363: 281-283.
MCKNIGHT, S. L., and K. R. YAMAMOTO, 1992 Transcriptional regulation. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
MCLEAN, C., A. BUCHETON and D. J. FINNEGAN, 1993 The 5'-untranslated region of the I factor, a long interspersed nuclear element-like retrotransposon of Drosophila melanogaster, contains an internal promoter and sequences that regulate expression. Mol. Cell. Biol. 13: 1042-1050.
MINAKAMI, R., K. KUROSE, K. ETOH, Y. FURUHATA, M. HATTORI et al. 1992 Identification of an internal cis-element essential for the human L1 transcription and a nuclear factor(s) binding to the element. Nucleic Acids Res. 20: 3139-3145.
MINCHIOTTI, G., and P. P. DI NOCERA, 1991 Convergent transcription initiates from oppositely oriented promoters within the 5' end regions of Drosophila melanogaster F elements. Mol. Cell. Biol. 11: 5171-5180.
MIZROKHI, L. J., and A. M. MAZO, 1990 Evidence for horizontal transmission of the mobile element jockey between distant Drosophila species. Proc. Natl. Acad. Sci. USA 87: 9216-9220.
MIZROKHI, L. J., S. G. GEORGIEVA and Y. V. ILYIN, 1988 Jockey, a mobile Drosophila element similar to mammalian LINEs, is transcribed from the internal promoter by RNA polymerase II. Cell 54: 685-691.
MOUNT, S. M., C. BURKS, G. HERTZ, G. D. STORMO, O. WHITE et al. 1992 Splicing signals in Drosophila: intron size, information content, and consensus sequences. Nucleic Acids Res. 20: 4255-4262.
O'SHEA-GREENFIELD, A., and S. T. SMALE, 1992 Roles of TATA and initiator elements in determining the start site location and direction of RNA polymerase II transcription. J. Biol. Chem. 267: 1391-1402.
PENOTTI, F., 1990 Human DNA TATA boxes and transcription initiation sites: a statistical study. J. Mol. Biol. 213: 37-52.
PERKINS, K. K., G. M. DAILEY and R. TJIAN, 1988 In vitro analysis of the Antennapedia P2 promoter: identification of a new Drosophila transcription factor. Genes Dev. 2: 1615-1626.
PUGH, B. F., and R. TJIAN, 1991 Transcription from a TATA-less promoter requires a multisubunit TFIID complex. Genes Dev. 5: 1935-1945.
PURNELL, B. A., and D. S. GILMOUR, 1993 Contribution of sequences downstream of the TATA element to a protein-DNA complex containing the TATA-binding protein. Mol. Cell. Biol. 13: 2593-2603.
PURNELL, B. A., P. A. EMANUEL and D. S. GILMOUR, 1994 TFIID sequence recognition of the initiator and sequences farther downstream in Drosophila class II genes. Genes Dev. 8: 830-842.
RASMUSSEN, E. B., and J. T. LIS, 1993 In vivo transcriptional pausing and cap formation on three Drosophila heat shock genes. Proc. Natl. Acad. Sci. USA 90: 7923-7927.
SCHNEIDER, T. D., G. D. STORMO, L. GOLD and A. EHRENFEUCHT, 1986 Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188: 415-431.
SMALE, S. T., 1994 Core promoter architecture for eukaryotic protein-coding genes, pp. 63-81 in Transcription: Mechanisms and Regulation, edited by R. C. CONAWAY and J. W. CONAWAY. Raven Press, New York.
SMALE, S. T., and D. BALTIMORE, 1989 The “initiator” as a transcriptional control element. Cell 57: 103-113.
SNYDER, M., M. HUNKAPILLER, D. YUEN, D. SILVERT, J. FRISTROM et al. 1982 Cuticle protein genes of Drosophila: structure, organization and evolution of four clustered genes. Cell 29: 1027-1040.
SOELLER, W., S. J. POOLE and T. KORNBERG, 1988 In vitro transcription of the Drosophila engrailed gene. Genes Dev. 2: 68-81.
STRUHL, K., 1987 Promoters, activator proteins, and the mechanism of transcriptional initiation in yeast. Cell 49: 295-297.
SWERGOLD, G., 1990 Identification, characterization, and cell specificity of a human LINE-1 promoter. Mol. Cell. Biol. 10: 6718-6729.
SYPES, M. A., and D. S. GILMOUR, 1994 Protein/DNA crosslinking of a TFIID complex reveals novel interactions downstream of the transcription start. Nucleic Acids Res. 22: 807-814.
THUMMEL, C. S., 1989 The Drosophila E74 promoter contains essential sequences downstream from the start site of transcription. Genes Dev. 3: 782-792.
TJIAN, R., and T. MANIATIS, 1994 Transcriptional activation: a complex puzzle with few easy pieces. Cell 77: 5-8.
TSUKIYAMA, T., P. B. BECKER and C. WU, 1994 ATP-dependent nucleosome disruption at a heat-shock promoter mediated by binding of GAGA transcription factor. Nature 367: 525-532.
USHEVA, A., and T. SHENK, 1994 TATA-binding protein-independent initiation: YY1, TFIIB, and RNA polymerase II direct basal transcription on supercoiled template DNA. Cell 76: 1115-1121.
VERRIJZER, C. P., K. YOKOMORI, J.-L. CHEN and R. TJIAN, 1994 Drosophila TAFII150: similarity to yeast gene TSM-1 and specific binding to core promoter DNA. Science 264: 933-941.
WATERMAN, M. S., and R. JONES, 1990 Consensus methods for DNA and protein sequence alignment. Methods Enzymol. 183: 221-237.
WEIS, L., and D. REINBERG, 1992 Transcription by RNA polymerase II: initiator-directed formation of transcription-competent complexes. FASEB J. 6: 3300-3309.
WIMMER, E. A., H. JACKLE, C. PFEIFLE and S. M. COHEN, 1993 A Drosophila homologue of human Sp1 is a head-specific segmentation gene. Nature 366: 690-694.
ZAWEL, L., and D. REINBERG, 1993 Initiation of transcription by RNA polymerase II: a multi-step process. Prog. Nucleic Acid Res. Mol. Biol. 44: 67-108.
ZHOU, Q., P. M. LIEBERMAN, T. G. BOYER and A. J. BERK, 1992 Holo-TFIID supports transcriptional activation by diverse activators and from a TATA-less promoter. Genes Dev. 6: 1964-1974.
Communicating editor: V. G. FINNERTY