University of California
Los Angeles
A quantitative study of nucleosome free regions
in yeast by segmental semi-Markov Model using
tiling microarrays
A thesis submitted in partial satisfaction
of the requirements for the degree
Master of Science in Statistics
by
Wei Xie
2008
© Copyright by
Wei Xie
2008
The thesis of Wei Xie is approved.
Qing Zhou
Yingnian Wu
Michael Grunstein
Ker-Chau Li, Committee Chair
University of California, Los Angeles
2008
To my father, mother, brother, fiancée and others
who provide their love and support
all through my graduate studies
Table of Contents
1 Introduction
2 Methods
  2.1 ChIP-chip assay and data pre-processing
  2.2 Overview of SSMM
  2.3 Design of SSMM
3 Data collection and validation
4 Histone occupancy at different chromosome features
5 NFR identification by SSMM
6 Factors of nucleosome depletion: transcriptional activity versus DNA affinity for histones
7 Discussions
A Supplementary Materials
  A.1 Signal of TFBS vs. signal of NFR
  A.2 Algorithm of Segmental Semi-Markov Model
    A.2.1 Segmental model fitting and emission probability calculation
    A.2.2 Algorithms
    A.2.3 Parameter estimation
  A.3 Validation of the raw data
  A.4 Compare absolute depletion and relative depletion in NFRs
  A.5 Distributions and lengths of NFRs with different DoND
  A.6 Nucleosome depletion forces: DNA affinity for histones and transcriptional activity
  A.7 A subset of intergenic and genic NFRs with high DoND
References
List of Figures
1.1 Chromatin structure and nucleosomes.
1.2 ChIP-chip: Chromatin Immunoprecipitation coupled by microarray.
1.3 A schematic representation of this study.
2.1 Four states in segmental semi-Markov model used in this study to model the histone occupancy surrounding the NFRs.
2.2 Schematic organization of segmental semi-Markov model.
3.1 Comparison between the nucleosome occupancy data in this study and published data.
5.1 Feature quantification of NFRs based on trapezoid pattern.
5.2 Comparison of linker region in Lee et al. and NFR from this study.
5.3 Effects of DoND on distributions and lengths of NFRs.
5.4 The distributions of distances and lengths of NFRs within promoter regions.
6.1 Factors of nucleosome depletion: transcriptional activity versus DNA affinity for histones.
6.2 Histones are depleted from the promoter of gene GAC1 prior to its activation.
6.3 Histones are depleted from the promoter of gene YMR279C prior to its activation.
A.1 Comparison of TF binding signal and nucleosome occupancy signal.
A.2 Distribution of state durations after convergence
A.3 Correlations between the nucleosome occupancy data in this study and the data from Lee et al.
A.4 Absolute depletion vs. relative depletion
A.5 Locations of NFRs vs. absolute depletion
A.6 Locations of NFRs vs. relative depletion
A.7 Different intergenic region vs. NFR absolute/relative depletion
A.8 TATA box vs. NFR absolute/relative depletion
A.9 TF binding sites vs. NFR absolute/relative depletion
List of Tables
4.1 Nucleosome occupancies at chromosome features vs. intergenic regions
A.1 Correlations between our data and the data by Bernstein et al.
A.2 Correlations between our data and the data by Lee et al.
A.3 Correlations between our data and the data by Pokholok et al.
A.4 Correlation matrix of DoND and average Pol II binding
A.5 Compare the effects of DNA affinity for histones and transcriptional activity (absolute depletion)
A.6 Compare the effects of DNA affinity for histones and transcriptional activity (relative depletion)
A.7 Distribution of 145 NFRs with high DoND.
A.8 25 genic NFRs in verified ORFs
A.9 33 genic NFRs in un-verified ORFs.
Acknowledgments
I am greatly indebted to Dr. Ker-Chau Li, my advisor in the Department of Statistics,
who is an expert in computational biology and bioinformatics. He has given me
unlimited support and guidance throughout my study and research in Statistics and
supervised my thesis project. I also want to thank Dr. Michael Grunstein, my
advisor in the Molecular Biology Institute, who is a pioneering scientist in yeast
genetics. His constant encouragement and support have given me tremendous
motivation during my study at UCLA. I am extremely fortunate to be able to
work with and learn from these two outstanding scientists.
I also want to express my gratitude to Dr. Sun Wei, my good friend, collaborator,
and co-first author of our manuscript based on this project, who now has his
own lab at the University of North Carolina, Chapel Hill. He is a professional
statistician and an enthusiastic scientist. This project could not have been done
without his persistent hard work. It has been a wonderful experience and a great
pleasure to work with him.
I want to thank my other committee members, Qing Zhou and Yingnian Wu,
for the knowledge and skills that I learned from their classes, as well as their
comments and discussions on my thesis. My appreciation also goes to Dr. Chris
Lee, Dr. Steve Horvath, Dr. David Elashoff and Dr. Mark Hansen, whose courses
sparked my deep interest in bioinformatics, programming and genomic work.
I also want to thank Feng Xu, who is always willing to help others and has
kindly provided part of the data for this project.
Lastly, I thank my parents, my brother and my fiancée Di, to whom I dedicate
both this thesis and my love.
Abstract of the Thesis
A quantitative study of nucleosome free regions
in yeast by segmental semi-Markov Model using
tiling microarrays
by
Wei Xie
Master of Science in Statistics
University of California, Los Angeles, 2008
Professor Ker-Chau Li, Chair
DNA, the fundamental molecule carrying genetic information, is packed into
chromatin inside the cell nucleus in a highly organized manner.
Nucleosomes, the basic units of chromatin, are not uniformly distributed along
the chromosomes, and many genomic loci are depleted of nucleosomes. Nucleosome
free regions (NFRs) play an important role in many biological processes,
including gene regulation. As the resolution of tiling arrays increases, we expect
to extract increasingly subtle quantitative properties of NFRs, such
as their lengths and degrees of nucleosome depletion. Because these quantities
are likely to vary from one NFR to another, a genome-wide portrait of
each individual NFR may help shed light on the dynamic aspects of chromatin
restructuring and gene regulation. Although previous studies have examined
the consensus pattern of nucleosome depletion in promoter regions by a curve
averaging method, the quantitative characterization of each individual NFR,
despite its importance, is lost through averaging. In this study, we present
nucleosome occupancy data for the whole yeast genome at 4-bp resolution and
develop an efficient algorithm to identify each individual “quantitative NFR”
at the whole-genome scale. Our results show that the majority of the NFRs are
located in intergenic regions/promoters and are about 400-600 bp long, which
is approximately the length of DNA wrapped around two to three nucleosomes
plus linkers. Our quantitative NFR results enable an investigation of the relative
impacts of the transcription machinery and DNA sequence in evicting histones from
NFRs. We show that while both factors have significant overall effects, their
specific contributions vary across different subtypes of NFRs. The emphasis of
our approach on the variation rather than the consensus of NFRs sets the stage for
exploring many subtler dynamic aspects of chromatin biology.
CHAPTER 1
Introduction
DNA is the fundamental molecule that encodes the genetic information of every
organism, including humans [WC53]. The regulation of DNA-based gene activity
is a central question that has been extensively pursued. However, DNA does not
exist in naked form in the cellular environment; rather, it is highly organized
into an extremely compact structure called chromatin [KL99]. The nucleosome, the
building block of chromatin, is a critical regulator of many biological processes,
such as transcription, DNA repair, and DNA replication [KL99] (Figure 1.1).
In many cases, the presence of nucleosomes hinders the access
of the transcriptional machinery to the underlying DNA; conversely, nucleosome
free regions (NFRs) allow easier access of transcription regulators to DNA
sequences [BLH04, LSR04, PHL05, YLD05, LTB07]. This underlies the importance
of localizing nucleosomes genome-wide, a goal that has been attained using
chromatin immunoprecipitation coupled with microarrays (ChIP-chip)
(Figure 1.2) [BL04]. Briefly, chromatin is fragmented and DNA fragments occupied
by histones are enriched by histone-specific antibodies. Together with a background
control of genomic DNA that is not subjected to antibody enrichment, these
DNA fragments are fluorescently labelled and hybridized to a DNA microarray,
where the complementary hybridization between the designed probes and the sample
DNA signals the genomic regions where histones bind [NLH06].
Figure 1.1: Chromatin structure and nucleosomes (adapted from Molecular Cell
Biology, Lodish et al. [LBZ00]). A single DNA molecule is wound around histone
octamers to form strings of closely packed nucleosomes. Nucleosomes further
fold to form a 30-nm chromatin fiber, which is attached to a flexible protein
scaffold, resulting in long loops of chromatin extending from the scaffold.
Figure 1.2: ChIP-chip: Chromatin Immunoprecipitation coupled with microarray
(adapted from Buck et al. [BL04]). Chromatin Immunoprecipitation (ChIP)
is performed using a specific antibody; the DNA crosslinked with the target
proteins is then extracted and purified. Immunoprecipitated DNA is amplified and
hybridized to microarrays together with the input DNA. Raw intensity is then
extracted and analyzed for each spot, representing the relative enrichment at each
genomic locus.
Genome-wide histone occupancy has been reported by several groups at gene
resolution [BLH04, LSR04] or at a resolution of 260 bp [PHL05] for the entire
yeast genome. Higher-resolution mapping (20 bp) was reported by Yuan et
al. [YLD05] for 3% of the yeast genome (chromosome III and 223 additional
regulatory regions). Recently, Lee et al. [LTB07] presented a complete
high-resolution (4 bp) map of nucleosome occupancy in yeast. In mammals,
nucleosomes have been mapped in human cells for a portion of the genome (3692
promoters) [OSL07]. Despite the success of these studies, several fundamental
questions regarding the nature of NFRs remain unanswered. First, it is not clear
whether NFRs occur exclusively in promoter regions. NFRs in non-promoter
regions (including coding regions) may have unknown functions. Second, it has
been controversial whether histones are depleted only from active gene promoters.
Several studies have suggested the existence of transcription-independent
NFRs at individual genes [FSH93, LG92, MCS00, SMS05]. Finally, although
both the transcriptional machinery and DNA sequence have been shown to be
involved in histone eviction [MCS00, SMS05, SFC06], the relationship between
these two factors remains an open question. They may have distinct effects
in different types of NFRs.
To investigate the above issues, it is important to bring out the dynamic
aspects of NFRs. Because of the complex interplay between differential gene
regulation and chromatin restructuring, the lengths of NFRs are likely to vary
from one NFR to another. Likewise, the degree of nucleosome depletion
(DoND) is likely to vary between NFRs as well. However, while many previous
studies have described nucleosome occupancy in quantitative terms, only static
ensemble properties of NFRs have been described. For instance, in the studies
of [YLD05, LTB07], average nucleosome occupancies in promoter regions were
reported by aligning the raw data (by start codon or transcription start site
(TSS)) and averaging across all genes. The average nucleosome occupancy curve
does reveal an interesting pattern shared by many NFRs. At the same time, however,
the characteristics specific to each individual NFR are lost through
averaging. In fact, as shown later in this study, the NFRs in promoter regions do
vary in their lengths and in their distances to the corresponding start codons/TSSs.
Furthermore, we also identified NFRs in coding regions; see Results and Discussion.
In order to identify and quantitatively characterize each individual NFR across
the whole genome, an automatic “NFR calling” algorithm that can separate NFR
patterns from a noisy background is required. Currently, the major existing
algorithms for the analysis of tiling array data are adapted from those
initially designed for detecting transcription factor binding sites (TFBSs)
[CBN04, LML05, KBZ05, JW05, JLM06]. They are inadequate for detecting
NFRs. First, because most TFBSs are sparsely distributed across the genome,
many TFBS identification algorithms [LML05, JLM06] are designed under the
assumption that the vast majority of the array data are background noise.
This assumption becomes problematic for exploring epigenetic events that
are often abundant, e.g., nucleosome occupancy and a variety of histone
modifications [PHL05]. Second, the signal of TF binding is typically short and tends
to have a sharp “peak” (Supplementary Figure A.1(a)). In contrast, the pattern
from nucleosome occupancy or histone modification can be much
longer, often with a “segmental” shape (Supplementary Figure A.1(b)). Third,
the binding of a TF typically requires only a qualitative description (i.e.,
presence/absence), while quantitative shape parameters of the signal patterns are
essential for the characterization of NFRs and other epigenetic marks.
To our knowledge, there are only two published algorithms specifically
designed for detecting nucleosomes [YLD05, OSL07]. Yuan et al. [YLD05] employed
a Hidden Markov Model (HMM) to detect positioned/delocalized nucleosomes.
The method infers only the positions of nucleosomes and does not provide
quantitative properties of the NFRs such as DoND. Lee et al. [LTB07] adapted an
extension of the HMM used by Yuan et al. [YLD05] to analyze their high-resolution
nucleosome occupancy data. Alternatively, Ozsolak et al. [OSL07] proposed
a two-step procedure to detect positioned nucleosomes, consisting mainly of (1)
smoothing the raw probe-level data by wavelet decomposition and (2) decomposing
the entire chromosome into “peaks” and “troughs” by an edge-detection
technique. How much the data should be smoothed in the first step appears to
dictate the entire procedure of nucleosome positioning, and this amount could be
difficult to decide in practice.
In this study, we developed an algorithm for capturing complex signal patterns
in data from high density arrays. Our algorithm, employed to detect NFRs in
this study, is based on a segmental (hidden) semi-Markov model (SSMM), which
is an extension of HMM (see section 2.2-2.3 for overviews and A.2 for details).
In addition to being able to identify desired patterns (e.g., NFR patterns) and
capture the quantitative features of such patterns, this SSMM-based algorithm
also enjoys more flexible model assumptions and higher efficiency compared to
the regular HMM. Figure 1.3 is a schematic representation of our study, showing
how the SSMM is used in characterizing genome-wide NFRs and in exploring the
driving forces of nucleosome depletion.
[Figure 1.3 box labels: Nucleosome Occupation Data by ChIP-tiling array; Detect locations of nucleosome free regions, estimate their sizes, degree of nucleosome depletion; Transcriptional activity (RNA Polymerase II Binding by ChIP-tiling array); DNA affinity estimations from Segal et al.; Driving forces of nucleosome depletion: transcriptional activity versus DNA affinity to histones]
Figure 1.3: A schematic representation of this study.
CHAPTER 2
Methods
2.1 ChIP-chip assay and data pre-processing
Three sets of ChIP-chip data were used in this study. The first set of ChIP-chip
data, for histone H3, was published previously [XZZ07]. Briefly, Chromatin
Immunoprecipitation (ChIP) was performed using an antibody against histone H3 (a
kind gift from Dr. Alain Verreault [XZZ07]), and the DNA crosslinked with
nucleosomes was then extracted and purified. Immunoprecipitated DNA was amplified
and hybridized to the Affymetrix Saccharomyces cerevisiae Tiling 1.0R Array
to map nucleosome occupancy along the chromosomes at 4-bp resolution.
Raw intensities were computed by the Two-Sample Analysis method using
Affymetrix Tiling Analysis Software v1.1. The tiling array carries 2,635,714
25-mer oligonucleotide probes. For the majority (91.5%) of adjacent probe pairs,
the probes are offset by 4 bp (i.e., they overlap by 21 bp). Fewer than 1% of
neighboring probes are separated by gaps longer than 20 bp. The entire yeast
genome except the centromeres is well represented on the arrays.
To validate the above results, we further generated a new histone H3 ChIP-chip
data set (two biological replicates) with a commercial H3 antibody (Abcam
ab1791) using Affymetrix tiling arrays. The data obtained from the two antibodies
are highly consistent (see Results). The average of all four replicates was
used in our study.
In order to quantify the transcriptional activity at each genomic locus, we also
measured genome-wide RNA Polymerase II binding (using antibody 8WG16, Upstate)
by ChIP-chip, following the same procedure as for histone H3.
2.2 Overview of SSMM
The segmental semi-Markov model (SSMM) is an extension of the hidden Markov
model (HMM). The SSMM generalizes the standard HMM in two major ways.
First, the SSMM uses an explicit state length density instead of the implicit
geometric (exponentially decaying) density [Rab89]. For a standard HMM with
transition probability $a_{ii}$ from state $s_i$ to itself, the probability that
exactly $d$ consecutive observations are emitted from state $s_i$ is
$a_{ii}^{d-1}(1 - a_{ii})$. This probability decreases exponentially as
the length $d$ increases, which makes long segments highly improbable. For example, if
$a_{ii} = 0.9$ and there is one probe per 4 bp, then 500 bp (a typical NFR length,
see the results section) corresponds to 125 probes; the probability of any particular
NFR length beyond 500 bp is smaller than 2e-7, and the probability that an NFR
exceeds 500 bp at all is $0.9^{125}$, or about 2e-6. Thus the
adoption of an explicit state length density is especially important for high density
tiling array data. The drawback of an explicit state length density is that there are
more parameters to estimate, which may lead to over-fitting. However, for tiling
arrays with millions of probes, over-fitting is unlikely to be a problem. Second,
the SSMM employs a segmental model to calculate the emission probability, so that
dependency is allowed among all observations within one segment [ODK96]. This
is desirable in analyzing tiling array data because the dependency assumption
is more realistic. Furthermore, the segmental model can provide quantitative
outputs characterizing the shapes of signal patterns. Both HMM and SSMM
have been applied in speech recognition [Rab89, ODK96]. HMM has been
introduced to computational biology for sequence alignment and gene detection
[KMH94, DEK98, Edd98], as well as for identifying ChIP-based TF binding regions
[LML05, JW05] and NFRs [YLD05]. However, despite its flexibility in handling
high density data, to the best of our knowledge, SSMM has not been used in
computational biology before. One possible reason is its heavy computational
burden. The algorithm we introduce in this study incorporates several
modifications of the regular SSMM, which greatly improve the computational efficiency.
We also designed the hidden states and the likelihood evaluation scheme to
fit the purpose of NFR identification.
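The duration argument above can be checked numerically. The following Python sketch is only an illustration and is not part of the thesis software; the function names are hypothetical. It evaluates the implicit geometric duration distribution of a standard HMM for $a_{ii} = 0.9$ and one probe per 4 bp.

```python
def geometric_duration_pmf(a_ii: float, d: int) -> float:
    """P(a state lasts exactly d observations) in a standard HMM."""
    return a_ii ** (d - 1) * (1.0 - a_ii)


def geometric_duration_tail(a_ii: float, d: int) -> float:
    """P(a state lasts more than d observations), which equals a_ii ** d."""
    return a_ii ** d


a_ii = 0.9                 # self-transition probability used in the text
probe_spacing_bp = 4       # one probe per 4 bp
nfr_length_bp = 500        # typical NFR length
d = nfr_length_bp // probe_spacing_bp  # 125 probes

# Probability of the shortest duration that is longer than 500 bp.
pmf_next = geometric_duration_pmf(a_ii, d + 1)
# Probability that the duration exceeds 500 bp at all.
tail = geometric_duration_tail(a_ii, d)

print(f"probes spanning 500 bp: {d}")
print(f"P(duration = {d + 1} probes) = {pmf_next:.3g}")
print(f"P(duration > {d} probes) = {tail:.3g}")
```

Under these assumptions every individual duration beyond 125 probes has probability below 2e-7, and all of them together account for roughly 2e-6, which is why a geometric duration model is a poor fit for NFR-sized segments.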
2.3 Design of SSMM
Our SSMM is designed to capture two types of NFR patterns from high density
tiling array data: triangle (Figure 2.1 (a)) and trapezoid (Figure 2.1 (b)) patterns.
There are four states in our SSMM (Figure 2.1). States 1 and 2 are horizontal lines
with high and low degrees of nucleosome occupancy, respectively; they model the
signals from nucleosome occupied regions (NORs) and NFRs, respectively. States 3
and 4 are lines with negative and positive slope, which model the transitions from NOR to
NFR and from NFR to NOR, respectively. States 1 and 2 have the same shape
but different state duration probabilities and transition probabilities. A triangle
pattern corresponds to the path 1 → 3 → 4 → 1, and a trapezoid pattern to the
path 1 → 3 → 2 → 4 → 1 (Figure 2.1 (c)).
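The two paths above imply a small set of allowed transitions between states. As a hedged illustration (the data structure and function names are hypothetical and are not taken from the thesis software), the following Python sketch encodes those transitions and labels triangle and trapezoid patterns in a sequence of segment states.

```python
# Allowed transitions implied by the two paths 1->3->4->1 and 1->3->2->4->1.
ALLOWED = {1: {3}, 3: {2, 4}, 2: {4}, 4: {1}}

TRIANGLE = [1, 3, 4, 1]
TRAPEZOID = [1, 3, 2, 4, 1]


def is_valid_path(path):
    """Check that every consecutive pair of segment states is an allowed transition."""
    return all(b in ALLOWED[a] for a, b in zip(path, path[1:]))


def call_patterns(path):
    """Scan a sequence of segment states and label each NFR pattern found."""
    patterns = []
    i = 0
    while i < len(path) - 1:
        if path[i:i + 4] == TRIANGLE:
            patterns.append("triangle")
            i += 3  # the closing state 1 may open the next pattern
        elif path[i:i + 5] == TRAPEZOID:
            patterns.append("trapezoid")
            i += 4
        else:
            i += 1
    return patterns


example = [1, 3, 4, 1, 3, 2, 4, 1]
print(is_valid_path(example), call_patterns(example))
```

Here each entry of the path is the state of one segment, not of one probe; a single state 1 segment can close one pattern and open the next.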
Figure 2.1: Four states in segmental semi-Markov model. (a) A triangle pattern
representing a type of NFR observed on tiling arrays. (b) A trapezoid pattern
representing another type of NFR observed on tiling arrays. (c) The allowed
transitions between any two of the four states.
To fit the SSMM, we organized the data into three hierarchical
levels: probe, bin, and segment (Figure 2.2). First, we merged the probes within a
50-bp window into a “bin”. “Segments” were then constructed from one or
several bins. Each segment is emitted from one of the four states, and its emission
probability is calculated based on linear model fitting. We organized the data into
bins before segments for two reasons. First, forcing all the probes in one bin
to share a single underlying state greatly reduces the computational burden.
Second, a discrete-time SSMM assumes equally spaced observations, whereas
in our data the gaps between adjacent probes are not constant.
Grouping probes into bins ensures that the distances between most adjacent bins are
constant. Other possible solutions include modeling the transition probability
between adjacent probes as a function of their distance [NGR98] or implementing
a continuous-time SSMM. Both alternatives, however, would significantly
increase the algorithm complexity and computation time.
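The binning step can be sketched as follows. This Python example is a minimal illustration under one assumption that the text does not spell out: bins are taken as fixed, non-overlapping 50-bp windows on the chromosome coordinate. The function name and the toy probe positions are hypothetical.

```python
from collections import defaultdict


def bin_probes(positions, signals, bin_size=50):
    """Group probes into fixed-width bins by genomic position.

    Returns a list of (bin_start, [(position, signal), ...]) sorted by position.
    Bins that contain no probes are simply absent from the result.
    """
    bins = defaultdict(list)
    for pos, sig in zip(positions, signals):
        bins[pos // bin_size].append((pos, sig))
    return [(idx * bin_size, probes) for idx, probes in sorted(bins.items())]


# Toy probe layout: regular 4-bp spacing followed by a larger gap.
positions = list(range(0, 100, 4)) + [130, 134]
signals = [1.0] * len(positions)

for start, probes in bin_probes(positions, signals):
    print(start, len(probes))
```

With 4-bp spacing a 50-bp bin holds about a dozen probes, and the irregular gap only changes how many probes fall in one bin, not the spacing between bins.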
The bin size was determined empirically based on the following considerations. (1)
Each bin should be long enough to include enough probes for linear model fitting.
Long bins also help filter out noise and reduce the computational burden. (2) The
signals will be over-smoothed if the bins are too long. A 50-bp bin covers
10-12 probes on average. This provides enough data points for model fitting while
avoiding over-smoothing, given that the lengths of NFRs typically range from several
hundred to a few thousand base pairs (see Results).
To compute the emission probability of one segment, we first fitted the segmental
model (a simple linear model in our case) and then calculated the emission
probability based on the residuals. In order to obtain a continuous prediction of
nucleosome occupancy, we require the fitted line to be continuous across segment
boundaries. More details regarding the emission probability calculation are given
in the supplementary materials.
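As a rough illustration of a residual-based emission score, the Python sketch below fits a least-squares line to one segment and evaluates a Gaussian log-likelihood of the residuals. The Gaussian error model with a fixed standard deviation `sigma` is an assumption made here for illustration, the continuity constraint between neighboring segments is omitted, and all function names are hypothetical; the actual calculation is the one described in the supplementary materials.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def gaussian_log_likelihood(residuals, sigma):
    """Log-likelihood of residuals under independent N(0, sigma^2) errors (assumed)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - rss / (2 * sigma ** 2)


def segment_log_emission(x, y, sigma, horizontal=False):
    """Score one segment under a horizontal state (1 or 2) or a sloped state (3 or 4)."""
    if horizontal:
        slope, intercept = 0.0, sum(y) / len(y)
    else:
        slope, intercept = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return gaussian_log_likelihood(residuals, sigma)


# Toy segment: occupancy falling linearly, as in a NOR-to-NFR transition.
x = list(range(0, 50, 4))
y = [1.0 - 0.02 * xi for xi in x]
print(segment_log_emission(x, y, sigma=0.1),
      segment_log_emission(x, y, sigma=0.1, horizontal=True))
```

On this toy segment the sloped model scores higher than the horizontal one, which is the kind of contrast that lets the decoder prefer state 3 over states 1 and 2 on a falling signal.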
Several parameters of the SSMM need to be estimated: the transition
probabilities from state 3 to states 2 and 4 (the other transition probabilities are
fixed at 0 or 1), and the probability distributions of the state durations. These
parameters are estimated by an iterative procedure. With the parameters initialized
according to uniform distributions, we first identify the most likely path by the
Viterbi algorithm [Rab89, DEK98], then estimate the parameters based on the most
likely path, and iterate until the parameter estimates converge. The Viterbi
algorithm for SSMM differs from the one for HMM because, in order to determine the
best path ending at time $k$ in state $i$, we need to choose the duration of state
$i$ in addition to the previous state $j$. Given the “best path”, the
parameters are estimated as follows. We estimate the transition probabilities by
the corresponding proportions of transitions. The duration probabilities are
estimated by the proportions of observed durations. For example, to estimate
$P(\text{duration of state } i = k)$, we first take all the duration lengths of state $i$ and
then calculate the proportion of these lengths that equal $k$. For our
data, it takes 7 iterations for the SSMM to converge according to the convergence
rule that the maximum change in the transition probabilities and the maximum change
in the duration probabilities are both smaller than 1e-5. The model fit at
convergence is the final output of the SSMM. Details of these algorithms and the
parameter estimates at convergence are given in the supplementary materials. We have also
tried different initial parameter values. For example, we initialized the state duration
distributions by different normal distributions, or initialized the lengths of states 1
and 2 by the empirical distributions of the lengths of chromosome features and
intergenic regions, respectively. All these different initial values result in almost
identical final outputs. Therefore it is unlikely that our algorithm is trapped in
a local optimum. We have implemented our algorithm in an R package, ss.hmm,
which can be downloaded at http://www.bios.unc.edu/~wsun/software.htm.
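The re-estimation step can be sketched in a few lines. The Python example below is an illustration only and does not reproduce the ss.hmm implementation; the function names and the toy path are hypothetical. It takes a decoded path of per-bin states, collapses it into segments, and estimates the state 3 transition probabilities and the duration distributions as simple proportions. The Viterbi decoding itself is not shown.

```python
from collections import Counter
from itertools import groupby


def run_lengths(bin_states):
    """Collapse a per-bin state path into (state, duration in bins) segments.

    The model has no self-transitions between segments, so consecutive bins
    with the same state always belong to one segment.
    """
    return [(state, sum(1 for _ in group)) for state, group in groupby(bin_states)]


def estimate_parameters(bin_states):
    """Estimate P(3->2), P(3->4) and duration distributions as proportions."""
    segments = run_lengths(bin_states)
    trans = Counter((a, b) for (a, _), (b, _) in zip(segments, segments[1:]))
    from_3 = trans[(3, 2)] + trans[(3, 4)]
    p_3_to_2 = trans[(3, 2)] / from_3 if from_3 else 0.0
    durations = {}
    for state in (1, 2, 3, 4):
        counts = Counter(d for s, d in segments if s == state)
        total = sum(counts.values())
        durations[state] = {d: c / total for d, c in counts.items()} if total else {}
    return {"p_3_to_2": p_3_to_2, "p_3_to_4": 1.0 - p_3_to_2 if from_3 else 0.0,
            "durations": durations}


def max_abs_change(old, new):
    """Largest absolute change between two duration distributions of one state."""
    keys = set(old) | set(new)
    return max((abs(old.get(k, 0.0) - new.get(k, 0.0)) for k in keys), default=0.0)


# Toy decoded path: one triangle pattern followed by one trapezoid pattern.
path = [1, 1, 1, 3, 3, 4, 4, 1, 1, 3, 2, 2, 2, 4, 1, 1, 1]
params = estimate_parameters(path)
print(params["p_3_to_2"], params["durations"][1])
```

In an iterative fit, `max_abs_change` would be compared against the 1e-5 threshold from the convergence rule above after each round of decoding and re-estimation.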
[Figure 2.2 panel labels: hidden states based on Viterbi algorithm; Segment; Bin; Probe]
Figure 2.2: Schematic organization of our segmental semi-Markov model.
(a) The yellow solid line represents the observed data and the green dashed
line indicates the model fit by SSMM. The state of each bin is labeled by the
numbers 1, 2, 3, and 4, as shown at the bottom. A magnified seven-bin segment
between the two vertical dotted lines is shown in (b). A single 50-bp bin
from the segment in (b), containing 11 probes, is shown in (c). It can be seen that the
distances between adjacent probes vary.
CHAPTER 3
Data collection and validation
We isolated nucleosomal DNA by chromatin immunoprecipitation (ChIP) using
anti-H3 antibodies, and hybridized the isolated DNA and genomic DNA to the
Affymetrix Saccharomyces cerevisiae Tiling 1.0R Array at 4-bp resolution
(see Methods). Two biological replicates were performed for each of the two
different H3 antibodies employed. Our tiling array data are highly reproducible:
array signals obtained from the two different H3 antibodies showed strong correlation
(R = 0.82, across ~2.6 million probes). In the following analysis, we used the
average of the four replicates.
We first validated our data using previously published whole-genome
nucleosome occupancy data. A comparison between coarse-grained versions of our
data and previous work showed high consistency at different resolutions, with
Pearson correlations of 0.66 (Bernstein et al. [BLH04], ~6000 probes), 0.77 (Lee
et al. [LSR04], ~12000 probes), and 0.62 (Pokholok et al. [PHL05], ~41000
probes), respectively (Supplementary Tables A.1-A.3).
We also compared our data to recent high-resolution genome-wide
nucleosome occupancy data in yeast [LTB07]. Despite the overall consistency, a major
difference between these two data sets comes from the protocol for isolating
nucleosomal DNA. In our study, as in many previous studies [BLH04, LSR04,
PHL05], the nucleosomal DNA is isolated by shearing chromatin randomly into
short fragments (average size 500 bp in our study) using sonication and
immunoprecipitating histones and associated DNA with anti-H3 antibody. Alternatively,
micrococcal nuclease can be used to digest the internucleosomal linker DNA,
retaining only the nucleosomes. Yuan et al. [YLD05] and Lee et al. [LTB07]
used this micrococcal nuclease approach to study nucleosome occupancy in
~3% of the yeast genome and in the whole yeast genome, respectively. Compared
with micrococcal nuclease treatment, the resolution of the anti-H3 ChIP approach
is limited by the sonication step; our data are therefore a smoothed version of
the data from Lee et al. [LTB07]. This can be demonstrated by comparing
coarse-grained versions of the two datasets using averaging windows of different
sizes. The overall correlation across the entire genome, as well as the median correlation
across chromosomes, increases substantially as the window size increases. For
example, the overall correlation increases from 0.48 to 0.75 as the window size
increases from 50bp to 500bp (Supplementary Figure A.3).
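The coarse-graining comparison can be illustrated with a small Python sketch. The data below are synthetic toy values chosen only to show the mechanics, windows are counted in observations rather than base pairs, and the function names are hypothetical; none of this reproduces the actual comparison with the data of Lee et al.

```python
import math


def window_average(values, window):
    """Average consecutive non-overlapping windows; a trailing partial window is dropped."""
    n = len(values) // window
    return [sum(values[i * window:(i + 1) * window]) / window for i in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_window(a, b, windows):
    """Correlation between two signals after averaging at each window size."""
    return {w: pearson(window_average(a, w), window_average(b, w)) for w in windows}


# Synthetic example: a slow trend, and the same trend plus fine-scale alternation.
trend = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
fine = [t + (0.5 if i % 2 == 0 else -0.5) for i, t in enumerate(trend)]

print(correlation_by_window(fine, trend, windows=[1, 2]))
```

In this toy case the fine-scale alternation cancels within each two-observation window, so the correlation rises from about 0.71 to 1, mirroring in miniature how averaging over larger windows brings a high-resolution signal closer to a smoothed one.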
Importantly, although micrococcal nuclease (MN) treatment is preferred when
examining nucleosome positioning because it retains the single-nucleosome signal,
sonication-based data allow better detection of NFRs: the signal in the linker
regions between nucleosomes is much more smoothed, so NFR calls are less
complicated by linkers (see Figure 3.1 for examples). The smoothing is not a problem for
NFR detection because NFRs are considerably longer than linker regions, as
exemplified in Figure 3.1. In conclusion, our data are of high quality and,
together with our SSMM, support a better quantification of NFRs than previous work,
yielding both the positions and the DoND of NFRs.
Figure 3.1: Comparison between the nucleosome occupancy data in this study
and the data from Lee et al. [LTB07]. Two representative NFRs are shown in
(a) and (b). In both cases, trimmed data from Lee et al. are shown in the upper
panels. The rectangles at the bottom of the upper panels indicate the results from
the HMM nucleosome calling algorithm [YLD05] as reported by Lee et al. [LTB07].
Dark/light green rectangles represent localized/delocalized nucleosome calls,
respectively; white spaces represent linker calls. The lower panels show the data
from this study. The green broken lines are the SSMM output. Vertical broken
lines show the start and end of the triangle/trapezoid patterns and the boundaries
of NFRs. The absolute depletion (A) and relative depletion (R) are also
annotated. The names of the ORFs in the region and their transcription directions are
displayed at the bottom.
CHAPTER 4
Histone occupancy at different chromosome features
Previous studies have reported relatively low nucleosome levels in some chromosome features such as promoters and enhancers [BLH04, LSR04, YLD05, KL99,
SFC06]. Here, we carried out a systematic study of the nucleosome occupancy
level at all annotated chromosome features (Table 4.1). Chromosome features are
defined by the Saccharomyces Genome Database (SGD) [CAB98] and include ORF
(Open Reading Frame), ARS (Autonomously Replicating Sequence), rRNA, tRNA,
snRNA (small nuclear RNA), snoRNA (small nucleolar RNA), long terminal repeats,
telomeric elements, introns and transposons. The comparison between each
chromosome feature and intergenic regions was carried out by Student's t-test.
One technical problem is that the test is biased by the strong dependency between
signals from adjacent probes, which partially overlap. Thus, for each feature, we
randomly selected one probe from each instance of the feature and one probe from
each intergenic region, forming two groups for the t-test. This was repeated 50 times,
and the median t-statistic and the median and quartiles of the corresponding
p-values are reported in Table 4.1.
Our study reveals that ORFs, transposons, rRNA and telomeres (except the telomeric
repeats, where there is no histone binding [WGZ92]) have a higher degree of
nucleosome occupancy (see footnote 1) than intergenic regions. tRNA and snoRNA have
significantly lower histone levels than intergenic regions, presumably due to their
intense transcriptional activity. Interestingly, introns and ARS also have low nucleosome
occupancy.
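The subsampling scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in this study: the function names `t_statistic` and `subsampled_t`, the input layout (one list of probe log ratios per feature instance or intergenic region) and the fixed random seed are all hypothetical. The sketch returns quartiles of the t-statistics only; converting them to p-values would require a t-distribution function that is omitted here.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Pooled-variance (Student) two-sample t-statistic for samples a and b."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled * (1.0 / na + 1.0 / nb))


def subsampled_t(feature_instances, intergenic_regions, n_rep=50, seed=0):
    """Draw one probe per instance/region, compute t, repeat n_rep times.

    Both arguments are lists of lists of log ratios (hypothetical layout).
    Returns (first quartile, median, third quartile) of the t-statistics.
    """
    rng = random.Random(seed)
    t_values = []
    for _ in range(n_rep):
        # One randomly chosen probe per instance avoids the dependency
        # between partially overlapping adjacent probes.
        a = [rng.choice(probes) for probes in feature_instances]
        b = [rng.choice(probes) for probes in intergenic_regions]
        t_values.append(t_statistic(a, b))
    q25, median, q75 = statistics.quantiles(t_values, n=4)
    return q25, median, q75
```

A negative median t-statistic then indicates lower occupancy in the feature than in intergenic regions, matching the sign convention of Table 4.1.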
Table 4.1: Nucleosome occupancies at chromosome features vs. intergenic regions.
n is the total number of instances of a chromosome feature. n_median is the median
number of probes selected. Across repetitions, the number of probes selected
may differ because different instances of one feature may overlap; thus n_median
may be smaller than n. t_median is the median of the t-statistics. p_median, p_25%, and
p_75% are respectively the median, first quartile, and third quartile of the t-test
p-values. The features are ordered by the medians of the t-statistics.
Feature                          n     n_median  t_median  p_median  p_25%     p_75%
tRNA                             299   275       -43.63    1.9e-147  1.9e-149  7.5e-146
snoRNA                           75    75        -5.09     2.5e-06   1.4e-06   4.9e-06
intron                           367   334       -2.66     0.0081    0.0051    0.018
ARS                              248   248       -1.52     0.13      0.084     0.19
ncRNA                            9     8         -1.35     0.22      0.18      0.27
telomeric repeat                 31    31        -0.65     0.52      0.35      0.79
snRNA                            6     6         -0.34     0.75      0.49      0.83
X element combinatorial repeats  28    28        5.45      8.4e-06   2.8e-07   4.3e-05
ARS consensus sequence           66    32        6.14      7.7e-07   4.8e-07   1e-06
telomere                         32    32        8.84      4.2e-10   1.2e-11   3.6e-09
pseudogene                       21    21        10.22     1.8e-09   5.8e-10   7.7e-09
X element core sequence          32    32        10.97     1.7e-12   2.3e-14   2.6e-11
Y' element                       19    19        12.62     1.3e-10   3.2e-13   3.6e-08
rRNA                             27    25        13.02     8.6e-13   1.7e-14   4.6e-11
retrotransposon                  50    50        20.06     1.3e-26   1.6e-29   1.2e-24
transposable element gene        89    89        24.14     2.3e-44   1.2e-47   3.7e-41
ORF                              6604  6574      49.97     0         0         0
Footnote 1: We simply measure the degree of nucleosome occupancy by the log ratio of the
ChIP-enriched signals to the signals from the genomic control.
CHAPTER 5
NFR identification by SSMM
Initial inspection of the tiling array data revealed two types of NFR patterns:
triangle (Figure 2.1 (a)) and trapezoid (Figure 2.1 (b)) patterns. We designed the
four hidden states of our SSMM and appropriate transition probabilities to capture both patterns. Our SSMM algorithm can be understood as a curve-fitting
algorithm designed to capture specific patterns (see Methods section
for details). SSMM fitting enables us to derive four quantitative features of each
NFR: (1) the location, (2) the range (length), (3) the absolute DoND level (“absolute depletion”), and (4) the relative DoND level compared with its neighboring
regions (“relative depletion”). The triangle pattern is just a special case of the
trapezoid pattern in which the bottom shrinks to one point, so we only describe
how to obtain the quantitative features for the trapezoid pattern here, as illustrated in Figure 5.1. First, the location of an NFR is defined as the mid-point
of its bottom. Second, the range of an NFR is defined as the horizontal distance
between the mid-points of its two sloping sides. The “absolute depletion” (A for
short) is defined as the signal level (log ratio) at the bottom. The “relative depletion” (R for short) measures the signal decrease in the NFR compared
with its neighborhood, and is defined as the difference between the lower signal level
of the two neighboring regions and the signal level at the bottom.
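The four quantities can be computed directly from a fitted trapezoid, following the formulas given in the caption of Figure 5.1. The sketch below is illustrative only: the `Trapezoid` container and the function `nfr_features` are hypothetical names, with fields mirroring the labels pLeft, pBottom1, pBottom2, pRight, xLeft, xRight and xBottom of Figure 5.1.

```python
from dataclasses import dataclass


@dataclass
class Trapezoid:
    # Hypothetical container; field names mirror the labels of Figure 5.1.
    p_left: float     # position where the left edge starts
    p_bottom1: float  # position where the bottom starts
    p_bottom2: float  # position where the bottom ends
    p_right: float    # position where the right edge ends
    x_left: float     # signal level of the left neighboring region
    x_right: float    # signal level of the right neighboring region
    x_bottom: float   # signal level at the bottom


def nfr_features(t):
    """Return location, range, length, absolute and relative depletion."""
    start = 0.5 * (t.p_left + t.p_bottom1)   # mid-point of the left edge
    end = 0.5 * (t.p_right + t.p_bottom2)    # mid-point of the right edge
    return {
        "location": 0.5 * (t.p_bottom1 + t.p_bottom2),
        "range": (start, end),
        "length": end - start,
        "absolute_depletion": t.x_bottom,
        "relative_depletion": min(t.x_left, t.x_right) - t.x_bottom,
    }
```

A triangle pattern is handled by the same function by setting `p_bottom1` equal to `p_bottom2`.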
Our initial model fitting led to the identification of 9593 NFRs in total, among
which 35% are trapezoid patterns and 65% are triangle patterns. In the following
analysis, we further selected NFRs based on absolute (A) or relative (R) DoND. A
diagnostic scatter plot of absolute and relative depletion (Supplementary Figure
A.4) shows that a simple linear relation R = −A captures the main
pattern of the scatter plot well. Thus, for simplicity, we used a joint cutoff,
R > α and A < −α for a cutoff α > 0, as the primary criterion for selecting NFRs.
We obtained highly similar results
based on cutoffs using only relative depletion or absolute depletion (data not
shown unless otherwise described).
[Figure 5.1 plot: y-axis, Signal (levels xLeft, xRight, xBottom); x-axis, Chromosome Position (points pLeft, pBottom1, pBottom2, pRight).]
Figure 5.1: Feature quantification of NFRs based on the trapezoid pattern. The
triangle pattern is just a special case of the trapezoid pattern, with pBottom1 = pBottom2.
Four quantities can be estimated as follows: location, 0.5*(pBottom1 + pBottom2);
“absolute depletion”, xBottom; “relative depletion”, min(xLeft, xRight) − xBottom;
range, [0.5*(pLeft + pBottom1), 0.5*(pRight + pBottom2)].
Figure 3.1 exemplifies how NFRs are identified by SSMM based on our tiling
array data. The lower left panel shows an NFR located in the shared promoter
region of two highly active genes, HHT2 and HHF2, encoding histones H3 and H4
respectively. The lower right panel shows another NFR in the promoter region
of RPS17B, again an active gene, encoding a ribosomal protein [HJW98]. For
comparison, the upper left and right panels show the raw data and HMM
calls from Lee et al. [LTB07]. The HMM outputs do not distinguish
the NFRs from the linker regions and do not provide DoND information for NFRs.
Both Yuan et al. [YLD05] and Lee et al. [LTB07] referred to all the regions
that are not occupied by nucleosomes (well-positioned or delocalized) as linker
DNA. We took a closer look at the probe intensity (the log ratio of nucleosome
occupancy) within the linker DNA of Lee et al. Interestingly, we found that the
positions with lower probe intensities are more likely to fall into the NFRs that
we identified (Figure 5.2). For example, among the 314,457 linker probes with
probe intensity higher than -1.0 (in log ratio), only 35.5% reside in NFRs, while
among the 259,494 probes with nucleosome occupancy signal lower than -1.0, 64%
reside in NFRs. This agrees well with the expectation that NFRs tend to have
much lower nucleosome occupancy than inter-nucleosomal linker regions.
Effects of DoND on the distributions of NFRs
The heterogeneity of nucleosome depletions in different chromosome features
highlights the importance of DoND. The quantification of DoND for each NFR
allowed us to closely examine the distributions of NFRs according to different
DoND cutoffs. Nucleosomes are often depleted in promoter and intergenic regions [BLH04, LSR04, SFC06]. Our results also confirmed these observations.
We further showed that DoND is a critical determining factor for the distribution of NFRs. When using various cutoffs based on both R and A, we found that
NFRs with higher DoND are more likely located in the intergenic regions or upstream of coding regions (Figure 5.3). This is also true when using either relative
depletion or absolute depletion as cutoffs (Supplementary Figures A.5-A.6). In
addition, those regions that are both intergenic and upstream of coding regions
are more likely to contain NFRs (Chi-square test p-value < 5e−10 for any DoND
cutoff from 0.2 to 1.0).
[Figure 5.2 plot: "Probe intensity in linker regions"; density curves for non-NFR and NFR probes; x-axis, probe intensity (log(ratio)); y-axis, Density.]
Figure 5.2: Comparison of probe intensity (raw data of nucleosome occupancy,
i.e., log(ratio) from Lee et al. [LTB07]) between linker probes that fall within NFRs
and those that do not. Whether a probe is in a linker DNA region is inferred by the HMM in the
work of Lee et al. [LTB07]. Whether a region is an NFR is inferred by our SSMM
algorithm in this study.
NFRs with higher DoND are preferentially located in divergent intergenic
regions, in which the neighboring genes share the same 5’ upstream sequences. As the
DoND increases, the proportion of NFRs located in divergent regions increases;
the proportion for convergent regions, in which the neighboring genes share the same 3’
downstream sequences, decreases; the proportion for tandem intergenic regions,
in which the neighboring genes are transcribed in the same direction, is roughly
constant (Supplementary Figure A.7).
Our data also showed that the occurrences of NFRs are related to DNA properties. Furthermore, these relationships are strengthened as DoND increases.
For instance, the proportion of NFRs within TATA box-containing promoters
increases as the DoND increases (Supplementary Figure A.8), which is consistent
with a prediction by Segal et al. [SFC06] using a computational model. Previous
work has shown that TFBSs are over-represented in nucleosome-depleted promoters [BLH04]. We further demonstrated that the proportion of NFRs harboring
TFBSs increases as DoND increases (Supplementary Figure A.9).
[Figure 5.3 plot: left y-axis, Proportion of NFR Patterns; right y-axis, Total Number of NFR Patterns; x-axis, Absolute Depletion (A) / −Relative Depletion (−R); series: intergenic regions, upstream of ORFs, either.]
Figure 5.3: The proportion of NFR patterns located at intergenic regions, 500bp
upstream of ORFs, and either region given different cutoffs of DoND. Specifically,
the cutoff α (α < 0) indicates that the absolute depletion is smaller than α and the
relative depletion is bigger than −α. The dashed line indicates the total number of
NFR patterns at different cutoffs (corresponding to the axis on the right side).
Effects of DoND on the lengths of NFRs
All previous studies of the likely locations and lengths of NFRs have been
carried out only at the ensemble level, by averaging a large number of nucleosome
occupancy curves in gene promoter regions. Yuan et al. [YLD05] reported
a consensus NFR of ∼150bp at 200bp upstream of ORF start codons. Lee
et al. [LTB07] further identified a more coherent relation between the location of
the consensus NFR and the TSS [DHG06]. However, although the specific locations
and lengths of NFRs should vary from gene to gene, neither paper provided this
important information. In our data, with the NFRs characterized individually,
we can examine the location and length of each NFR in more detail rather
than merely conveying the “average” pattern. This is reported next.
We found a total of 3448 NFRs located in promoter regions of 3447 distinct
ORFs (within 500bp upstream of the ORF). Among them, we obtained the TSS locations for 2601 ORFs [DHG06] and used them to examine the relative locations
of the corresponding NFRs. Consistent with the aforementioned literature, the
centers of these NFRs were found to be located about 100-200bp upstream of TSSs.
More interestingly, NFRs with higher DoND are located further away from TSSs
(i.e., shifted in the 5’ direction) (Figure 5.4(a)). We found that the start points (5’) of
NFRs are located 350-450bp upstream of TSSs, and the end points (3’) of NFRs are
located primarily at 100bp downstream of TSSs. The distributions of the NFR (5’/3’)
boundaries also shift in the 5’ direction as DoND increases (Figure 5.4(b)). Similar
patterns (but with more variation) are observed when we use the locations of
the start codons of the ORFs instead of TSSs to study the relative locations of
NFRs (data not shown).
We next examined the lengths of NFRs. The majority of the NFRs within
promoter regions were found to have lengths around 500bp, and the peak of the
distribution becomes even more pronounced when DoND increases (Figure 5.4(c)).
The length distribution for all NFRs (including those outside of the promoter
regions) is similar (data not shown). We conclude that NFRs have a typical length
of 400-600bp, which is approximately the length of DNA wrapping around 2-3
nucleosomes plus linkers. Geometrically speaking, because of the variation in the
locations and lengths of individual curves, most individual NFRs have to be much
longer than the consensus pattern of nucleosome depletion derived by curve
averaging (cf. the aforementioned results of [YLD05, LTB07]).
[Figure 5.4 plot: three density panels. (a) x-axis, NFR center relative to TSS; (b) x-axis, NFR boundaries relative to TSS, with separate curves for the 5' boundary and the 3' boundary; (c) x-axis, Length of NFR. Curves correspond to the cutoffs A < 0 & R > 0, A < −0.2 & R > 0.2, A < −0.4 & R > 0.4, and A < −0.6 & R > 0.6.]
Figure 5.4: The distributions of locations and lengths of NFRs within promoter
regions (500bp upstream of ORFs) according to different cutoffs of DoND (A:
absolute depletion, R: relative depletion). Here we use transcription start sites
(TSSs) as references for the NFR locations. Supplementary Figure 16 shows similar plots using ORFs as reference locations. (a) The distributions of NFR centers
relative to TSSs. (b) The distributions of NFR boundaries relative to TSSs. The
color codes are the same as in (a). (c) The distributions of NFR lengths.
CHAPTER 6
Factors of nucleosome depletion: transcriptional
activity versus DNA affinity for histones
One of the long-standing puzzles in chromatin studies has been the mechanism
by which histones are evicted from NFRs. At least two major driving forces have
been reported: the transcriptional activity [BLH04, LSR04, PHL05, YLD05] and
the DNA affinity for histones [FSH93, LG92, MCS00, SMS05, SFC06]. However,
the relative effects of these two factors and their relationship remain unclear. We
seek to address these questions based on the NFRs identified by our SSMM. It
is worth emphasizing that the linear regression and correlation analyses below
provide estimates of effect sizes but are not equivalent to
causal inference.
We first performed a genome-wide RNA Polymerase II (Pol II) binding ChIP-chip
experiment to measure transcriptional activity (see Methods). We then
examined the Pol II binding level around each NFR. We found that the DoND
of NFRs correlates best with the Pol II binding level in the neighboring regions
of NFRs rather than within the NFRs themselves (Supplementary Table A.4). This
is presumably because NFRs typically occur at gene promoters, as shown before, while the strongest Pol II binding occurs at the neighboring coding regions.
Based on this observation, we averaged the Pol II binding signals within 1kb
upstream and within 1kb downstream of an NFR, and used the higher of the two
averages to represent local
transcriptional activity, considering the bidirectionality of NFRs. Using 500bp
upstream/downstream regions yielded similar results (data not shown). We computed the DNA affinity for histones within an NFR based on published data
[SFC06], using the average DNA affinity (measured as nucleosome occupancy
probability) across all the nucleotides in the NFR.
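The local transcriptional activity measure described above can be sketched as follows. This is an illustrative sketch rather than the code used in this study: the function name `local_pol2_activity`, its arguments, and the handling of flanks that contain no probes are assumptions.

```python
def local_pol2_activity(positions, signal, nfr_start, nfr_end, flank=1000):
    """Higher of the mean Pol II signals in the two flanks of an NFR.

    positions, signal: probe coordinates (bp) and Pol II log ratios
    (hypothetical input layout). flank: flank width in bp (1kb in the text).
    Returns None when neither flank contains a probe (an assumption).
    """
    upstream = [s for p, s in zip(positions, signal)
                if nfr_start - flank <= p < nfr_start]
    downstream = [s for p, s in zip(positions, signal)
                  if nfr_end < p <= nfr_end + flank]
    # Average each non-empty flank separately, then keep the higher mean,
    # since an NFR may serve a gene on either side.
    means = [sum(v) / len(v) for v in (upstream, downstream) if v]
    return max(means) if means else None
```

Setting `flank=500` corresponds to the alternative 500bp window mentioned in the text.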
We found no strong correlation between DNA affinity for histones and Pol
II binding in any of the cases we considered (correlation -0.088 – 0), indicating that
the effect of DNA affinity on transcriptional activity, if any, is undetectable from
the data. The scatter plots of the two factors did not show any obvious non-linear
relation either (data not shown).
Initial examination of all 9593 NFRs by an additive multiple regression model
confirmed that both factors have significant effects on nucleosome depletion in
NFRs, while the effect of transcriptional activity is dominant overall. For example, if we use absolute depletion to measure DoND, 12.6% of the variance of
DoND in total can be explained by DNA affinity for histones and transcriptional activity together. The majority (90.4%) of the explained variance is attributed to
transcriptional activity, only about 7.6% is attributed to DNA affinity for histones,
and less than 2% is shared, i.e., can be attributed to either factor (Figure 6.1, Supplementary
Table A.5). Similar conclusions can be drawn using relative depletion to measure
DoND (Supplementary Table A.6).
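One common way to split the explained variance of a two-predictor regression into a part unique to each factor and a shared part is sketched below. The text does not spell out the exact decomposition used, so this scheme (unique part = drop in R² when the factor is removed) and the names `pearson` and `partition_r2` are assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partition_r2(y, dna, pol2):
    """Split R^2 of the model y ~ dna + pol2 into unique and shared parts."""
    r_yd, r_yp, r_dp = pearson(y, dna), pearson(y, pol2), pearson(dna, pol2)
    # R^2 of a two-predictor least-squares fit with intercept,
    # written in terms of the pairwise correlations.
    total = (r_yd ** 2 + r_yp ** 2 - 2 * r_yd * r_yp * r_dp) / (1 - r_dp ** 2)
    unique_dna = total - r_yp ** 2    # lost when DNA affinity is dropped
    unique_pol2 = total - r_yd ** 2   # lost when Pol II binding is dropped
    shared = total - unique_dna - unique_pol2
    return {"total": total, "dna": unique_dna,
            "pol2": unique_pol2, "shared": shared}
```

Under this scheme the shared part can be slightly negative when the two factors are weakly correlated, so it should be read as a bookkeeping term rather than a variance in its own right.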
We further examined the effects of these two factors in different NFR subgroups: NFRs located in intergenic regions or 500bp upstream of ORFs (Inter/Up) or in other genomic regions (Others); NFRs located in convergent (Converge), tandem (Tandem), and divergent (Diverge) intergenic regions; NFRs located in TATA-containing or TATA-less promoters (500bp upstream of ORFs);
TFBS-containing NFRs or TFBS-less NFRs (Figure 6.1, Supplementary Tables
A.5-A.6).
[Figure 6.1 plot: upper panel, stacked R² (R2_DNA, R2_PolII, R2_both) per NFR group; lower panel, Correlation per NFR group; groups: All, Inter/Up, Others, Converge, Tandem, Diverge, TATA, TATA-less, TFBS, TFBS-less.]
Figure 6.1: The effects of two factors affecting nucleosome
occupancy. The upper panel shows the total R² (percentage of variance) that can
be explained by DNA affinity for histones and transcriptional activity (Pol
II binding), which is further divided into three parts: the part explained by DNA
affinity for histones (R2_DNA), by Pol II binding (R2_PolII), or by both (R2_both).
Part of the variance can be explained by both factors since a weak correlation does
exist between them, as shown in the lower panel.
Several interesting results were revealed when we compared the contributions
of Pol II binding and DNA affinity. (1) While both Pol II and DNA affinity affect
intergenic/promoter regions as expected, Pol II has an overall larger effect than
DNA affinity for histones, which indicates that Pol II is a major determining
factor of nucleosome depletion in intergenic/promoter regions. (2) Pol II binding
explains more variance in regions with higher transcriptional activity, such as divergent and tandem intergenic regions. In contrast, DNA affinity has a larger effect
in regions with less transcriptional activity, such as convergent intergenic regions.
For example, when relative depletion was used as the measurement of DoND,
DNA affinity explains 9%, 24.4%, and 78.7% of the total variance explained by
the two factors in divergent, tandem, and convergent intergenic regions respectively (Supplementary Table A.6). (3) The effect of DNA affinity increases dramatically
in NFRs with a TATA box or TFBS. For example, the contribution of DNA affinity is 28 times higher in TFBS-containing than in TFBS-less NFRs,
and 6 times higher in NFRs within TATA-containing promoters than within TATA-less
promoters (using absolute depletion, see Figure 6.1 and Supplementary Table
A.5). In contrast, the effect of Pol II binding remains similar (less than 2-fold change)
regardless of the presence of a TFBS or TATA box (Supplementary Table A.5).
Using relative depletion yields similar results (Supplementary Table A.6).
Interestingly, among those TFBS- or TATA box-containing genes whose nucleosome depletion is dominated by DNA sequence properties, a number are stress response genes.
For example, GAC1, which is repressed in rich medium and induced upon the diauxic transition when glucose is limited [PEP99], is depleted of nucleosomes
at its promoter (Figure 6.2). YMR279C, a gene that is activated upon heat stress
[STK03], also loses histones from its promoter despite the fact that it is not transcribed (Figure 6.3). The promoter regions of both genes have been predicted
to contain sequences that are poorly bound by nucleosomes [SFC06]. It is
known that TATA box-containing genes are highly enriched in stress-response
genes [BZP04]. This indicates that the DNA sequences in the promoters of these
genes could have evolved to strongly disfavor histone binding. Therefore
the histones may be pre-cleared from these promoters prior to the entry of the transcription machinery, presumably to allow the rapid binding of TBP (TATA-box
binding protein) under environmental stress. A similar mechanism could apply to
TFBS-containing genes. The low nucleosome occupancy on TATA boxes has been
reported to be encoded by intrinsic DNA sequence features [SFC06, IAZ06].
Histone depletion around TFBSs has also been reported by several labs
[BLH04, YLD05, SFC06, LTB07]. However, it has remained
unclear whether the low histone occupancy in those TATA box- or TFBS-containing
genes is associated with active Pol II. Our study addresses this question by showing that low histone occupancy persists even after the effect of transcription
is accounted for.
[Figure 6.2 plot: three tracks along chr15 (bp), approximately 668000-672000, covering genes GAC1 and SYC1: Pol II Binding (log ratio), DNA Affinity to Histone, and Nucleosome Occupancy (log ratio).]
Figure 6.2: Histones are depleted from the promoter of gene GAC1 prior to its
activation. This figure shows the RNA polymerase II binding (log ratio from
our ChIP-chip results), DNA affinity for histone (posterior probability of histone
binding from Segal et al. [SFC06]), and nucleosome occupancy (log ratio from
our ChIP-chip results) around gene GAC1 in rich medium.
[Figure 6.3 plot: three tracks along chr13 (bp), approximately 825000-827500, covering genes YMR279C and CAT8: Pol II Binding (log ratio), DNA Affinity to Histone, and Nucleosome Occupancy (log ratio).]
Figure 6.3: Histones are depleted from the promoter of gene YMR279C prior to
its activation. This figure shows the RNA polymerase II binding (log ratio from
our ChIP-chip results), DNA affinity for histone (posterior probability of histone
binding from Segal et al. [SFC06]), and nucleosome occupancy (log ratio from
our ChIP-chip results) around gene YMR279C.
CHAPTER 7
Discussions
We have proposed an algorithm to detect segmental patterns using high resolution tiling array data. In contrast to many algorithms designed to detect the
binding of TFs, our SSMM algorithm is especially useful for characterizing segmental features, which are commonly observed in epigenetic studies. We have applied
this algorithm to a genome-wide nucleosome occupancy data set and identified all
NFRs across the entire yeast genome at a high resolution of 4bp. The location and
DoND (Figure 5.1) were quantified for each NFR. We showed that DoND, as
measured by SSMM, greatly affects the distributions of NFRs.
We also studied the relative impacts of the transcription machinery and the DNA sequence in evicting histones from NFRs. The DoND measure we introduced plays
a key role in formulating this biological question mathematically. A genome-wide
RNA Polymerase II (Pol II) binding ChIP-chip experiment was used to measure
transcriptional activity. We showed that Pol II and DNA play distinct roles in
different types of NFRs. Our study is a novel example of a genome-wide investigation that combines Pol II binding, genetic code and epigenetic information to
address biological questions.
It is interesting to take a close look at the NFRs obtained with various cutoffs of the DoND
measures. For example, with a stringent cutoff (R > 0.4 and A < −0.4), we obtained 1863 NFRs. We found that although the majority of these NFRs are within
intergenic regions or 500bp upstream of coding regions, there are still 145 NFRs
(7.8%) located in other regions (Supplementary Table A.7). Among them, 52 are
associated with tRNAs, which can be explained by the high transcription rate
of tRNAs. Sixteen of them fall into ARS regions. This raises the interesting possibility that NFRs are involved in DNA replication. We also found
58 NFRs located in 58 distinct ORFs (12 uncharacterized, 21 dubious, and 25
verified ORFs, Supplementary Tables A.8-A.9). One explanation for the presence
of NFRs in these genic regions is that these NFRs may harbor regulatory regions for
neighboring genes that are located more than 500bp away. Alternatively, these
regions could impact the genes in which they reside. For example, they may be
cryptic transcription start sites within coding regions, allowing transcription to
initiate there under certain conditions [KLW03, LGC07]. The functional roles of
these NFRs within genic regions are worthy of further study.
Our NFR calling algorithm is based on SSMM, an extension of HMM. Compared to HMM, SSMM is more appropriate for high resolution tiling array data
because of its flexibility in adjusting for state duration and emission probability
distribution. In addition, our SSMM algorithm is computationally more efficient
than standard HMM/SSMM because we group data into bins first, and estimate
parameters by the dynamic programming algorithm (Viterbi algorithm) instead
of the EM algorithm (Baum-Welch algorithm) [Rab89] (see Methods for details).
As a curve fitting method, our SSMM algorithm has at least two advantages:
first, it utilizes the transition matrix to restrict the shape of the detected patterns; second, it chooses “changing points (knots)” in a simple and automatic
manner. Our algorithm is flexible enough to allow the choice of different segmental models depending on the research interests, which makes it general enough to be
adapted to many other types of high density tiling array data, for instance DNA
replication data [WFB04, WES04, WBI05] and chromosome translocation data
[TME07, GKB07, KIM07].
APPENDIX A
Supplementary Materials
A.1 Signal of TFBS vs. signal of NFR
This figure compares the signal patterns of TFBS and NFR from tiling array
data.
[Figure A.1 plot: two scatter panels, (a) and (b), showing GCN5 Binding Signal and Nucleosome Occupancy against chromosome location (Chr5 and Chr11, kb).]
Figure A.1: Comparison of TF binding signal [PHL05] and nucleosome occupancy signal (this study).
A.2 Algorithm of Segmental Semi-Markov Model
A.2.1 Segmental model fitting and emission probability calculation
The Segmental Semi-Markov Model (SSMM) has been used in speech recognition
[Rab89, ODK96]. We modified the conventional SSMM algorithm mainly in two
aspects:
1. We organized the input data in a hierarchical structure: probe → bin →
segment, so that we do not require the time points to be strictly evenly
spaced.
2. We required the continuity of consecutive segmental models.
We denote the parameters of a segmental semi-Markov model as $\Lambda = \{\pi, A, L, D, d_i(\cdot), e(\cdot)\}$,
which include 6 components:
1. $\pi$: the initial probability, $\pi(i) = P(q_1 = s_i)$, where $1 \le i \le I$, $I$ is the total
number of states, $q_t$ indicates the state at time $t$, and $s_i$ indicates the $i$th
possible state.
2. $A = \{a_{ij}\}$: the transition probability, $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$.
3. $L$: length of each bin.
4. $D$: the maximum duration.
5. $d_i(\cdot)$: the density of duration of state $i$.
6. $e(\cdot)$: emission probability.
We denote the total number of bins as $Z = \lceil n/L \rceil$, where $n$ is the total
number of data points (probes). We use $\{t_{s_1}, \ldots, t_{s_Z}\}$ to indicate the start of each
bin and $\{t_{e_1}, \ldots, t_{e_Z}\}$ to indicate the end of each bin. We use seg(i, p, q) to
indicate a segment from bin p to bin q (including p and q) with underlying state i.
Without the requirement of continuity, previous work on SSMM defined the emission
probability of one segment seg(i, p, q) as $e_i(p, q) = P(x_{s_p}, x_{s_p+1}, \ldots, x_{e_q} \mid q(p, q) = s_i)$,
where $q(p, q)$ are the states from bin p to bin q. In this study, we use a linear
model as the segment model, and require the linear model for segment seg(i, p,
q) to pass through the predicted end point of the previous segment seg(j, o, p − 1). Thus we
define the emission probability conditional on the previous segment as well:
$e_{ji}(o, p, q) = P(x_{s_p}, x_{s_p+1}, \ldots, x_{e_q} \mid q(p, q) = s_i, q(o, p-1) = s_j)$, where $1 \le o \le p - 1$.
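The probe → bin → segment indexing can be sketched as follows. The sketch assumes that each bin holds L consecutive probes (with a possibly shorter last bin) and uses 0-based probe indices with 1-based bin numbers; the function names `make_bins` and `segment_probe_range` are hypothetical and are not taken from the ss.hmm package.

```python
import math


def make_bins(n, L):
    """Group n probes into Z = ceil(n / L) bins of L consecutive probes.

    Returns two lists with the 0-based index of the first and last probe
    of each bin; the last bin may hold fewer than L probes.
    """
    Z = math.ceil(n / L)
    starts = [z * L for z in range(Z)]
    ends = [min((z + 1) * L, n) - 1 for z in range(Z)]
    return starts, ends


def segment_probe_range(starts, ends, p, q):
    """Probe index range covered by a segment from bin p to bin q.

    Bins are numbered from 1 and both p and q are included, as in seg(i, p, q).
    """
    return starts[p - 1], ends[q - 1]
```

Because segments are defined on bins while the probe times themselves are kept, the probes do not need to be evenly spaced.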
Our SSMM model for NFR detection is illustrated by Figure 2.1-2.2 in the
main text. We use linear model as segmental model and we list below additional
details of segmental model fitting and emission probability calculation specifically
designed for NFR detection.
1. Given the end point of the previous segment, the linear models for states 1 and
2, which are horizontal lines, are already determined. Thus there is no need for
model fitting.
2. Given the end point of the previous segment, denoted by $(t_{prev}, x_{prev})$, the
linear model for states 3 and 4 is $(x_w - x_{prev}) = b(t_w - t_{prev})$, where $w$
indexes the observations in the current segment of state 3 or 4. The
coefficient $b$ can be estimated by the least squares method. Specifically,
$$b = \sum_w (x_w - x_{prev})(t_w - t_{prev}) \Big/ \sum_w (t_w - t_{prev})^2.$$
3. We denote the residuals as $r_w = x_w - \hat{x}_w$, where $\hat{x}_w$ is the fitted value from
the segmental (linear) model. The emission probability (likelihood) of the
segment is calculated by assuming the residuals follow a normal distribution with mean 0 and a maximum likelihood estimate of the variance
$\hat{\sigma}^2 = \sum_w r_w^2 / n_{seg}$, where $n_{seg}$ is the number of observations in the segment.
4. The segmental models of state 3 and 4 have one more parameter than state
1 or 2, the slope. Penalized likelihood should be calculated, e.g., AIC or
BIC. In this study, we used BIC.
5. The triangle/trapezoid patterns with very small slopes on the two edges
can be frequently caused by data noise and are not of interest to us. They
may also cause over-fitting. Thus if the absolute value of an estimated slope
is smaller than 0.001, we forced it to be -0.001 or 0.001 in order to calculate
the emission probabilities for state 3 or 4 respectively.
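The fitting steps listed above can be sketched as follows. This is a minimal illustration, not the ss.hmm implementation: the function names, the `direction` argument (−1 for a segment forced to −0.001, +1 for a segment forced to 0.001), the parameter counts passed to `bic`, and the small variance floor that guards against a perfect fit are all assumptions.

```python
import math

MIN_SLOPE = 0.001  # slope threshold quoted in item 5


def fit_sloped_segment(t, x, t_prev, x_prev, direction):
    """Least-squares slope of a line forced through (t_prev, x_prev).

    direction is -1 or +1 (hypothetical argument): the sign given to a
    slope whose absolute value falls below MIN_SLOPE.
    """
    num = sum((xw - x_prev) * (tw - t_prev) for tw, xw in zip(t, x))
    den = sum((tw - t_prev) ** 2 for tw in t)
    b = num / den
    if abs(b) < MIN_SLOPE:
        b = direction * MIN_SLOPE
    fitted = [x_prev + b * (tw - t_prev) for tw in t]
    return b, fitted


def gaussian_loglik(x, fitted):
    """Log-likelihood of normal residuals at the MLE of the variance."""
    n = len(x)
    sigma2 = sum((xw - fw) ** 2 for xw, fw in zip(x, fitted)) / n
    sigma2 = max(sigma2, 1e-12)  # assumed floor to avoid log(0)
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; smaller values are preferred."""
    return -2 * loglik + n_params * math.log(n_obs)
```

For a horizontal segment (state 1 or 2) the fitted values are simply `x_prev` for every observation, and the same `gaussian_loglik` and `bic` functions apply with one parameter fewer.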
A.2.2 Algorithms
Analogous to the algorithms for a regular HMM, we present the following four
algorithms for SSMM: Viterbi, forward, backward, and posterior probability. The
Viterbi algorithm finds the most likely complete path, while the forward and
backward algorithms together give the posterior probability that a given bin is
emitted from each state.
The Viterbi algorithm is favored in our model fitting for the following reasons.
One of the challenges in our model fitting is that estimation of the emission probability
$e_{ji}(o, p, q)$ between bins p and q requires knowledge of the end point of the previous
segmental model from bin o to p − 1. This is easily obtained in the Viterbi
algorithm: when we calculate the emission probability of one segment, the
most likely path through the previous segments is already known. In the forward
and backward algorithms, however, we only know that the previous segment runs from bin o
to p − 1 with state $s_j$, i.e., seg(j, o, p − 1). Calculating the
end point of seg(j, o, p − 1) requires knowledge of where the segment before bin
o ends, which is unknown. This can be solved by relaxing the
continuity requirement of the final model fit: the starting point of seg(j, o, p − 1) is
left free, which allows its end point to be calculated and used as the
start point of the segment seg(i, p, q). Even so, another limitation of the
forward-backward algorithm is that it takes much more computation time than the
Viterbi algorithm, which is critical for high resolution tiling array data analysis.
For instance, the Viterbi algorithm is 34 times faster than the forward-backward
algorithm for the model fitting of a 3000-probe segment (on a 2GHz Intel Core
Duo MacBook Pro, 1GB RAM). Given that it takes around 1 day for the Viterbi
algorithm to finish all the model fitting and parameter estimation for the entire
genome, approximately one month would be needed for the forward-backward algorithm.
The following algorithms are implemented in an R package ss.hmm, which can
be freely downloaded at http://www.bios.unc.edu/~wsun/software.htm. In order
to avoid underflow, we carried out all the calculations in log scale. A function
logsumexp(v) is used during the calculation:
$$\mathrm{logsumexp}(v) = \log\left(\sum_{i=1}^{k} \exp(v_i)\right) \tag{A.1}$$
where $v = \{v_1, v_2, ..., v_k\}$ is a vector.
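As an illustration of (A.1), the sketch below evaluates logsumexp in a numerically stable way by factoring out the largest element before exponentiating. This is a common technique and an assumption on our part; the text does not state how ss.hmm evaluates the function. The input is assumed to be a non-empty list of log-scale values.

```python
import math


def logsumexp(v):
    """Return log(sum(exp(v_i))) without underflow for very negative v_i."""
    m = max(v)
    if m == float("-inf"):
        # All terms are log(0); the sum is 0 and its log is -inf.
        return float("-inf")
    return m + math.log(sum(math.exp(x - m) for x in v))
```

Subtracting the maximum guarantees that at least one exponent equals zero, so the sum inside the logarithm never underflows to zero.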
A.2.2.1
Viterbi
Input
$X = \{x_1, x_2, ..., x_n\}$, $T = \{t_1, t_2, ..., t_n\}$ and parameters $\Lambda = \{\pi, A, L, D, d_i(\cdot), e(\cdot)\}$,
where X are the observations and T are the corresponding times.
Output
path(t): the most probable path along time T.
Intermediate Variables
p(k, i): the maximum probability that state i ends at bin k; log.p(k, i) =
log(p(k, i)).
dura(k, i): the duration of state i that ends at bin k.
prev(k, i): the state preceding state i, where state i ends at bin k.
Algorithm
1. Calculate intermediate variables
For the first bin, k = 1,
$$p(1, i) = \pi_i \, d_i(1) \, e_i(1, 1) \tag{A.2}$$
$$\mathrm{log.p}(1, i) = \log(\pi_i) + \log(d_i(1)) + \log(e_i(1, 1)) \tag{A.3}$$
$$\mathrm{dura}(1, i) = 1 \tag{A.4}$$
For $k \ge 2$, let d denote the duration of state i, $1 \le d \le \min(k - 1, D)$.
For a specific duration d and a previous state j, the start point of the previous
segment is $k' = k - d - \mathrm{dura}(k - d, j) + 1$:
$$p(k, i, d, j) = p(k - d, j) \, a_{ji} \, d_i(d) \, e_{ji}(k', k - d + 1, k) \tag{A.5}$$
$$\log(p(k, i, d, j)) = \log(p(k - d, j)) + \log(a_{ji}) + \log(d_i(d)) + \log(e_{ji}(k', k - d + 1, k)) \tag{A.6}$$
If $k \le D$, it is possible that state i begins from the first time point; then
$$p(k, i, d = k, j = \mathrm{NULL}) = \pi_i \, d_i(k) \, e_i(1, k) \tag{A.7}$$
$$\log(p(k, i, d = k, j = \mathrm{NULL})) = \log(\pi_i) + \log(d_i(k)) + \log(e_i(1, k)) \tag{A.8}$$
Then we can calculate the best path ending at time k in state i by
$$p(k, i) = \max_{d,j} \, p(k, i, d, j) \tag{A.9}$$
$$\mathrm{log.p}(k, i) = \max_{d,j} \, \log(p(k, i, d, j)) \tag{A.10}$$
$$\mathrm{dura}(k, i) = \arg\max_{d} \, \log(p(k, i, d, j)) \tag{A.11}$$
$$\mathrm{prev}(k, i) = \arg\max_{j} \, \log(p(k, i, d, j)) \tag{A.12}$$
2. Trace back the best path
path(Z) = which.max(log.p(Z, ))    (A.13)
then find the previous segment, which corresponds to state prev(Z, path(Z))
and ends at time Z − dura(Z, path(Z)). Repeat this step recursively to recover
the entire path.
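The recursion and trace-back above can be sketched in a simplified form. The sketch below is not the ss.hmm implementation: it assumes that the emission term of a segment depends only on its own state and boundaries, i.e., a hypothetical `log_emit(i, start, end)` replaces $e_{ji}(k', k - d + 1, k)$, so the continuity constraint with the previous segment is dropped. All function and argument names are illustrative; bins are numbered from 1 and states from 0.

```python
NEG_INF = float("-inf")


def segmental_viterbi(n_bins, n_states, log_pi, log_a, log_dur, log_emit, max_dur):
    """Most likely segmentation as a list of (state, start_bin, end_bin).

    log_pi[i]: log initial probability; log_a[j][i]: log transition j -> i;
    log_dur(i, d): log duration probability; log_emit(i, start, end): log
    emission of bins start..end (inclusive) under state i.
    """
    # best[k][i]: best log-probability of bins 1..k with state i ending at k
    best = [[NEG_INF] * n_states for _ in range(n_bins + 1)]
    back = [[None] * n_states for _ in range(n_bins + 1)]  # (duration, prev state)
    for k in range(1, n_bins + 1):
        for i in range(n_states):
            for d in range(1, min(k, max_dur) + 1):
                start = k - d + 1
                seg = log_dur(i, d) + log_emit(i, start, k)
                if start == 1:
                    # State i begins at the first bin, as in (A.7)-(A.8).
                    cands = [(log_pi[i] + seg, None)]
                else:
                    # A previous state j ends at bin k - d, as in (A.5)-(A.6).
                    cands = [(best[k - d][j] + log_a[j][i] + seg, j)
                             for j in range(n_states)]
                for score, j in cands:
                    if score > best[k][i]:
                        best[k][i] = score
                        back[k][i] = (d, j)
    # Trace back from the best final state, as in (A.13).
    state = max(range(n_states), key=lambda i: best[n_bins][i])
    if best[n_bins][state] == NEG_INF:
        raise ValueError("no path with non-zero probability")
    segments, k = [], n_bins
    while k > 0:
        d, prev = back[k][state]
        segments.append((state, k - d + 1, k))
        k -= d
        state = prev
    segments.reverse()
    return segments
```

Self-transitions are excluded by setting `log_a[i][i]` to minus infinity, which mirrors a transition matrix with a zero diagonal.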
A.2.2.2
Forward
Input
$X = \{x_1, x_2, ..., x_n\}$, $T = \{t_1, t_2, ..., t_n\}$ and parameters $\Lambda = \{\pi, A, L, D, d_i(\cdot), e(\cdot)\}$.
Output
The forward probabilities for state i from bin p to q: $f(i, p, q) = P(x_1, ..., x_{e_q}, q(p, q) = s_i \mid \Lambda)$, where $1 \le p \le q \le Z$.
Algorithm
Initialization
p = 1, q = {1, ..., min(Z, D)}:
$$f(i, 1, q) = \pi(i) \, d_i(q) \, e(i, 1, q) \tag{A.14}$$
$$\log(f(i, 1, q)) = \log(\pi(i)) + \log(d_i(q)) + \log(e(i, 1, q)) \tag{A.15}$$
Recursion
p = {2, ..., Z}, and for each p, q = {p, ..., min(p + D − 1, Z)}, o = {max(1, p − D), ..., p − 1}.
$$f(i, p, q) = \left[ \sum_{j \ne i} \left[ \sum_{o} f(j, o, p - 1) \, e_{ji}(o, p, q) \right] a_{ji} \right] d_i(q - p + 1) \tag{A.16}$$
$$\log(f(i, p, q)) = \mathrm{logsumexp}_{j \ne i} \left[ \mathrm{logsumexp}_{o} \left[ \log(f(j, o, p - 1)) + \log(e_{ji}(o, p, q)) \right] + \log(a_{ji}) \right] + \log(d_i(q - p + 1)) \tag{A.17}$$
A.2.2.3
Backward
Input
$X = \{x_1, x_2, ..., x_n\}$, $T = \{t_1, t_2, ..., t_n\}$ and parameters $\Lambda = \{\pi, A, L, D, d_i(\cdot), e(\cdot)\}$.
Output
The backward probabilities for state i from bin p to q: $b(i, p, q) = P(x_{s_q + 1}, ..., x_n \mid q(p, q) = s_i, \Lambda)$, where $1 \le p \le q \le Z$.
Algorithm
Initialization
$$b(i, p, Z) = 1 \tag{A.18}$$
$$\log(b(i, p, Z)) = 0 \tag{A.19}$$
Recursion
q = {Z − 1, ..., 1}, and for each q, p = {max(1, q − D + 1), ..., q}, r = {q + 1, ..., min(q + D, Z)}.
$$b(i, p, q) = \sum_{j \ne i} \left[ a_{ij} \left[ \sum_{r} e_{ij}(p, q + 1, r) \, d_j(r - q) \, b(j, q + 1, r) \right] \right] \tag{A.20}$$
$$\log(b(i, p, q)) = \mathrm{logsumexp}_{j \ne i} \left[ \log(a_{ij}) + \mathrm{logsumexp}_{r} \left[ \log(e_{ij}(p, q + 1, r)) + \log(d_j(r - q)) + \log(b(j, q + 1, r)) \right] \right] \tag{A.21}$$
A.2.2.4
Posterior Probability
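Calculate the posterior probability based on the forward and backward algorithms.
Input
Forward probabilities {f(i, u, v)} and backward probabilities {b(i, u, v)}, where i
($1 \le i \le m$) indicates the state and u and v ($1 \le u \le v \le Z$) indicate the bins.
Output
$p_i(k)$: posterior probability $P(q(k) = s_i \mid X, \Lambda)$, where q(k) indicates the state of the
k-th bin.
Algorithm
$$p_i(k) = P(q(k) = s_i \mid X, \Lambda) = \frac{P(q(k) = s_i, X \mid \Lambda)}{P(X \mid \Lambda)} \propto P(q(k) = s_i, X \mid \Lambda) = \sum_{u} \sum_{v} f(i, u, v) \, b(i, u, v) \tag{A.22}$$
where $\max(k - D + 1, 1) \le u \le k$ and $k \le v \le \min(u + D - 1, Z)$.
$$\log(p_i(k)) = \log(P(q(k) = s_i \mid X, \Lambda)) = \mathrm{logsumexp}_{u} \left[ \mathrm{logsumexp}_{v} \left[ \log(f(i, u, v)) + \log(b(i, u, v)) \right] \right] \tag{A.23}$$
As in (A.22), the last expression holds up to the normalizing term $-\log P(X \mid \Lambda)$, which is the same for all states.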
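As a sketch of (A.22) and (A.23), the function below sums the log forward and backward terms over all segments (u, v) of state i that cover bin k, and then normalizes across states. The normalization step and the dictionary layout keyed by (i, u, v) are our own assumptions for illustration; they are not taken from the ss.hmm package. States are numbered from 0 and bins from 1, and at least one state is assumed to have non-zero probability at bin k.

```python
import math

NEG_INF = float("-inf")


def logsumexp(v):
    """Stable log(sum(exp(v_i))); returns -inf for an all -inf input."""
    m = max(v)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in v))


def log_posterior(log_f, log_b, n_states, k, max_dur, n_bins):
    """Log posterior of each state at bin k from log forward/backward tables.

    log_f and log_b are dicts mapping (i, u, v) to log f(i, u, v) and
    log b(i, u, v); missing keys are treated as probability zero.
    """
    unnorm = []
    for i in range(n_states):
        terms = [
            log_f[(i, u, v)] + log_b[(i, u, v)]
            for u in range(max(k - max_dur + 1, 1), k + 1)
            for v in range(k, min(u + max_dur - 1, n_bins) + 1)
            if (i, u, v) in log_f and (i, u, v) in log_b
        ]
        unnorm.append(logsumexp(terms) if terms else NEG_INF)
    # Summing the joint terms over states gives log P(X | Lambda).
    total = logsumexp(unnorm)
    return [x - total for x in unnorm]
```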
Our SSMM model for NFR detection is illustrated by Figures 3 and 4 in the main
text. Additional details of the emission probability calculation, specifically designed
for the application to nucleosome free region (NFR) detection, are listed below.
1. The segmental models of state 3 and 4 have one more parameter than state
1 or 2, the slope. Penalized likelihood should be calculated, e.g., AIC or
BIC. In this study, we used BIC.
2. The triangle/trapezoid patterns with “flat” slopes on the two edges are not
of interest to us, as they could simply be caused by array noise. They may
also cause over-fitting. Thus if the absolute value of an estimated slope is
smaller than 0.001, we forced it to be -0.001 or 0.001 in order to calculate
the emission probabilities for state 3 or 4 respectively.
A.2.3
Parameter estimation
We need to estimate the transition probabilities from state 3 to state 2/4 (other
transition probabilities are fixed as 0 or 1), and the probability distributions of
state durations (see main text Figure 3 for a description of the states). For an HMM,
parameters are usually estimated by the Baum-Welch algorithm (an EM algorithm)
[Rab89, DEK98]. However, as explained in section 1.2, this EM algorithm cannot
be applied to our SSMM because we require the continuity of the fitted curve.
Thus we use the Viterbi algorithm [Rab89, DEK98] for parameter estimation. With
one set of initial parameters, we generate the most likely path, which is used
to update the parameter estimates, and we iterate until the parameter estimates
converge. The most likely path at convergence is our curve fitting result.
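One update step of this iteration can be sketched as follows. Given the segments of the current most likely path, the transition probabilities out of state 3 and the duration distributions are re-estimated by simple counting. The use of raw empirical frequencies (without smoothing) and the function name are assumptions made for illustration; the text does not specify these details.

```python
from collections import Counter


def reestimate(segments):
    """Update parameters from a decoded path.

    segments: ordered list of (state, start_bin, end_bin) with states 1..4.
    Returns (trans, dur_pmf): trans maps the next state (2 or 4) to the
    estimated probability of leaving state 3 for it; dur_pmf maps each state
    to an empirical distribution over durations measured in bins.
    """
    out_of_3 = Counter()
    durations = {}
    for idx, (state, start, end) in enumerate(segments):
        durations.setdefault(state, Counter())[end - start + 1] += 1
        if state == 3 and idx + 1 < len(segments):
            out_of_3[segments[idx + 1][0]] += 1
    total = sum(out_of_3.values())
    trans = {s: c / total for s, c in out_of_3.items()} if total else {}
    dur_pmf = {
        s: {d: c / sum(cnt.values()) for d, c in cnt.items()}
        for s, cnt in durations.items()
    }
    return trans, dur_pmf
```

Alternating this update with a new Viterbi decoding, until the estimates stop changing, gives the hard-assignment scheme described above.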
The difference between the algorithm we used and the Baum-Welch algorithm is
analogous to the difference between (hard) K-means clustering and soft K-means
(EM) clustering. For hard K-means, each point is assigned to the most likely
cluster, while for soft K-means, posterior probabilities of cluster memberships are
estimated. Similarly, in our SSMM algorithm, we assume each bin is emitted from
the most likely state, while in the Baum-Welch algorithm, posterior probabilities of
the underlying states are used. The forward-backward algorithm can be implemented
if we do not require the continuity of the fitted curve; however, it takes much more
time than the Viterbi algorithm (see Supplementary Materials 2.2 for details).
We do not wish to make restrictions on the distribution functions of duration
or transition probabilities. We start with the uniform distributions. The initial
transition matrix is:
$$\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0.5 & 0 & 0.5 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$
where the number in the i-th row and j-th column is the transition probability from
state i to j: $a_{ij}$. The duration of each state is counted by the number of “bins”
(each “bin” covers 50bp). The initial distributions of durations for state 1 to 4
are uniform(6,100), uniform(3,30), uniform(3,50), and uniform(3,50) respectively.
The only restriction here is the ranges. We do not allow very short durations in
order to avoid over-fitting. The maximum durations are set to be large enough
to cover all possible durations.
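The initial parameters described above can be written down directly. In the sketch below we assume that uniform(a, b) denotes a discrete uniform distribution over the integer bin counts a, ..., b inclusive; the variable and function names are illustrative only.

```python
BIN_SIZE_BP = 50  # each bin covers 50bp

# Row i, column j holds the transition probability from state i+1 to state j+1.
INITIAL_TRANSITIONS = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
]

# Allowed duration ranges (in bins) for states 1 to 4.
INITIAL_DURATION_RANGES = {1: (6, 100), 2: (3, 30), 3: (3, 50), 4: (3, 50)}


def uniform_duration(lo, hi):
    """Discrete uniform distribution over durations lo..hi (inclusive)."""
    p = 1.0 / (hi - lo + 1)
    return {d: p for d in range(lo, hi + 1)}


INITIAL_DURATIONS = {
    state: uniform_duration(lo, hi)
    for state, (lo, hi) in INITIAL_DURATION_RANGES.items()
}
```

Under this reading, the duration of state 1 ranges from 300bp to 5000bp, and the durations of states 3 and 4 range from 150bp to 2500bp.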
At convergence, the transition probabilities are $a_{32} = 0.37$ and $a_{34} = 0.63$.
Figure A.2 shows the distribution of state durations at convergence of parameter estimation, where the X-axis is the duration in base pairs and the Y-axis is the
frequency.
[Figure: four histogram panels (duration of state 1, duration of state 2, duration of state 3, duration of state 4); y-axis: Frequency]
Figure A.2: Distribution of state durations after convergence
A.3
Validation of the raw data
First we compared our nucleosome occupancy data with other published genome-wide data. Because our data has the highest resolution, we calculate Pearson’s
correlation between each other dataset and the corresponding coarse-grained version
of our data. Tables A.1-A.3 list the correlations between our data and data from
Bernstein et al. [BLH04], Lee et al. [LSR04], and Pokholok et al. [PHL05] respectively. In each table, rows correspond to the mean or median versions of our
data, and columns correspond to measurements in the other dataset.
We compare our data with the data from Lee et al. [LTB07] in more detail, as
both datasets cover the whole yeast genome at a high resolution of 4 bp. Figure A.3
shows the overall correlation between the two datasets and the individual correlations for each of the 16 chromosomes, which suggest high consistency between
the two datasets.

Table A.1: Correlations between our data and the data by Bernstein et al.
Bernstein et al. [BLH04] studied the nucleosome occupancy in ∼6000 intergenic/promoter regions in yeast genome for both H2B and H3. The lengths of these intergenic regions range from 60bp to 1594bp with median 370bp. The data was downloaded from website: http://www.broad.harvard.edu/chembio/lab schreiber/index.html. The intergenic region annotation was downloaded from SGD (genome-ftp.stanford.edu).

          H3      H2B     Average of H3 and H2B
Mean      0.68    0.56    0.66
Median    0.67    0.56    0.66

Table A.2: Correlations between our data and the data by Lee et al.
Lee et al. [LSR04] examined the nucleosome occupancy for both H3 and H4 in ∼12000
intergenic regions and ORFs, of which the lengths vary from 51bp to 12280bp with
median 611bp. The data was downloaded from GEO GSE4727.

          MycH4   H3      Average of MycH4 and H3
Mean      0.75    0.71    0.78
Median    0.74    0.70    0.77

Table A.3: Correlations between our data and the data by Pokholok et al.
Pokholok et al. [PHL05] performed ChIP-chip experiments for H3 and H4 using 60-mer Agilent DNA microarrays, which have ∼41000 probes covering 85% of the yeast
genome. Data was downloaded from http://web.wi.mit.edu/young/index.html.

          H4      H3      Average of H4 and H3
Mean      0.68    0.48    0.62
Median    0.68    0.48    0.62

[Figure: x-axis: Window Size (50 to 1000); y-axis: Correlation]
Figure A.3: Correlations between the nucleosome occupancy data in this study
and the data from Lee et al.
We averaged the data across non-overlapping windows of a given size and then calculated
Pearson correlations using the average values. The orange dots are overall correlations
across the whole genome. Each boxplot illustrates the distribution of the 16 correlations
corresponding to the 16 chromosomes.
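The window-averaging comparison can be sketched as below. We assume here that the window size is given as a number of consecutive probes (not base pairs), that the two datasets are already aligned probe by probe, and that a trailing incomplete window is dropped; none of these choices is stated in the text, and the function names are illustrative.

```python
import math


def window_means(values, window):
    """Average values over consecutive non-overlapping windows."""
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values) - window + 1, window)
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def windowed_correlation(ours, theirs, window):
    """Correlation between two aligned signals after window averaging."""
    return pearson(window_means(ours, window), window_means(theirs, window))
```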
A.4
Compare absolute depletion and relative depletion in
NFRs
Figure A.4: Absolute depletion vs. relative depletion
R denotes relative depletion and A denotes absolute depletion. The red line is R = −A.
This scatter plot shows that the two measures of nucleosome depletion are
well correlated. We shall use R = −A > α as the primary cutoff criterion for selecting
NFRs for further investigation.
A.5
Distributions and lengths of NFRs with different DoND
We systematically examined the proportion of NFRs in intergenic or promoter
regions (defined as 500bp upstream of coding regions). Information on intergenic
regions and ORF start positions was downloaded from SGD [CAB98]. The total
length of intergenic regions is about 2.88Mb, accounting for about 23.9% of the
12.07Mb yeast genome. The 500bp upstream regions of 6604 ORFs occupy about 2.82Mb
of DNA sequence, accounting for about 23.3% of the yeast genome. About 1.83
Mb (15.2%) of DNA sequence is both intergenic and within 500bp upstream. As
expected, NFRs with higher DoND are more likely to be located in intergenic or promoter regions (Figures A.5 and A.6). The enrichment of NFRs in intergenic regions
and promoter regions is highly significant (Chi-square test p-value < 1e−80 for any
cutoff of absolute depletion and/or relative depletion from -0.2/0.2 to -1.0/1.0).
In addition, those regions that are both intergenic and upstream regions are
more likely to contain NFRs (Chi-square test p-value < 5e−10 for any cutoff from
-0.2/0.2 to -1.0/1.0).
[Figure: series for intergenic regions, upstream of ORFs, and either; x-axis: Absolute Depletion; left y-axis: Proportion of NFR Patterns; right y-axis: Total Number of NFR Patterns]
Figure A.5: Locations of NFRs vs. absolute depletion
The proportion of NFR patterns located in intergenic regions, 500bp upstream of ORFs,
and either intergenic or 500bp upstream region according to different cutoffs of absolute
depletion. The dashed line indicates the total number of NFR patterns at different cutoffs
of absolute depletion (corresponding to the axis at the right side).
[Figure: series for intergenic regions, upstream of ORFs, and either; x-axis: Relative Depletion; left y-axis: Proportion of NFR Patterns; right y-axis: Total Number of NFR Patterns]
Figure A.6: Locations of NFRs vs. relative depletion
The proportion of NFR patterns located in intergenic regions, 500bp upstream of ORFs,
and either intergenic or 500bp upstream region according to different cutoffs of relative
depletion. The dashed line indicates the total number of NFR patterns at different cutoffs
of relative depletion (corresponding to the axis at the right side).
We further divided the intergenic regions into three classes based on the
strands of two neighboring chromosome features:
• Divergent: intergenic regions containing 5’ regions of both neighboring
genes.
• Tandem: intergenic regions containing 5’ regions of only one neighboring
gene.
• Convergent: intergenic regions containing none of the 5’ regions of neighboring genes.
Among all the 6640 intergenic regions, 1617 (24.4%) are divergent, 3087 (46.5%)
are tandem, and 1599 (24.1%) are convergent. We excluded 337 (5%) intergenic regions in which at least one of the adjacent chromosomal features lacks
a transcription orientation, e.g., ARS. The total lengths of divergent, tandem, and
convergent intergenic regions are respectively 0.47Mb (17% of all the intergenic
regions), 1.42Mb (51%), and 0.87Mb (31%).
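The three classes can be derived from the strands of the two flanking features. The sketch below assumes the usual yeast convention that a Watson (W) feature is transcribed toward higher coordinates and a Crick (C) feature toward lower coordinates, and that a feature without orientation is passed as `None`; the function name is hypothetical.

```python
def classify_intergenic(left_strand, right_strand):
    """Classify an intergenic region from its flanking features.

    left_strand: strand ('W' or 'C') of the feature at lower coordinates.
    right_strand: strand ('W' or 'C') of the feature at higher coordinates.
    Returns 'divergent', 'tandem', 'convergent', or None when a flanking
    feature has no transcription orientation (e.g., ARS).
    """
    valid = ("W", "C")
    if left_strand not in valid or right_strand not in valid:
        return None
    if left_strand == "C" and right_strand == "W":
        # Both genes are transcribed away from the region: two 5' ends.
        return "divergent"
    if left_strand == "W" and right_strand == "C":
        # Both genes are transcribed toward the region: no 5' end.
        return "convergent"
    return "tandem"  # Same strand: exactly one 5' end faces the region.
```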
[Figure: series for divergent, tandem, and convergent intergenic regions; x-axis: Absolute Depletion / −Relative Depletion; left y-axis: Proportion of NFR Patterns; right y-axis: Total Number of NFR Patterns]
Figure A.7: Different intergenic region vs. NFR absolute/relative depletion
The proportion of NFR patterns located at convergent, divergent and tandem intergenic regions according to different cutoffs of relative depletion and absolute depletion.
Specifically, the cutoff α (α < 0) indicates the absolute depletion is smaller than α
and the relative depletion is bigger than −α. The dashed line indicates the total number of NFR patterns within the three types of intergenic regions at different cutoffs
(corresponding to the axis at the right side).
We also partitioned the promoter regions of 6604 ORFs based on whether
they contain a TATA box [BZP04]. After excluding 933 promoters with data not
available, 1090 (19.2%) of the remaining promoters are classified as TATA box-containing promoters and 4581 (80.8%) are TATA-less promoters. The proportion
of NFRs located at the different promoter regions was examined, and we
found that NFRs with heavy nucleosome depletion are more likely to be in TATA-containing promoters (Figure A.8).
[Figure: series for promoters with TATA box; x-axis: Absolute Depletion / −Relative Depletion; left y-axis: Proportion of NFR Patterns; right y-axis: Total Number of NFR Patterns]
Figure A.8: TATA box vs. NFR absolute/relative depletion
The proportion of NFR patterns located in 500 bp upstream promoters with TATA box
according to different cutoffs of relative depletion and absolute depletion. Specifically,
the cutoff α (α < 0) indicates the absolute depletion is smaller than α and the relative
depletion is bigger than −α. The dashed line indicates the total number of NFR patterns
located in 500 bp upstream promoters at different cutoffs (corresponding to the axis at
the right side).
A previous result has shown that TF binding sites are over-represented in nucleosome-depleted promoters [BLH04]. We further examined the co-occurrence of TF binding sites and NFRs on a genome-wide scale. The TF binding sites were previously
inferred [MWG06] based on a genome-wide TF binding study [HGL04]. The data
includes 4312 binding sites with lengths ranging from 4bp to 22bp. If the midpoint of a binding site is within the range of an NFR pattern, we say the NFR
pattern harbors the binding site. We showed that the proportion of NFR patterns harboring TF binding sites increases as the degree of nucleosome depletion
increases (Figure A.9).
[Figure: series for NFR patterns harboring TF binding site(s); x-axis: Absolute Depletion / −Relative Depletion; left y-axis: Proportion of NFR Patterns; right y-axis: Total Number of NFR Patterns]
Figure A.9: TF binding sites vs. NFR absolute/relative depletion
The proportion of NFR patterns harboring TF binding site(s) according to different
cutoffs of relative depletion and absolute depletion. Specifically, the cutoff α (α < 0)
indicates the absolute depletion is smaller than α and the relative depletion is bigger
than −α. The dashed line indicates the total number of NFR patterns at different cutoffs
(corresponding to the axis at the right side).
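The midpoint rule used to decide whether an NFR pattern harbors a TF binding site can be sketched as follows. We assume that all coordinates refer to the same chromosome, that intervals are given as (start, end) with start ≤ end, and that the NFR boundaries are inclusive; the function names are illustrative.

```python
def harbors(nfr_start, nfr_end, site_start, site_end):
    """True if the midpoint of the binding site lies within the NFR."""
    midpoint = (site_start + site_end) / 2.0
    return nfr_start <= midpoint <= nfr_end


def fraction_harboring(nfrs, sites):
    """Proportion of NFRs that harbor at least one binding site.

    nfrs and sites are lists of (start, end) pairs on one chromosome.
    """
    if not nfrs:
        return 0.0
    hits = sum(
        1 for (s, e) in nfrs
        if any(harbors(s, e, a, b) for (a, b) in sites)
    )
    return hits / len(nfrs)
```

Because binding sites are short (4bp to 22bp), using the midpoint avoids counting a site that only overlaps the edge of an NFR by one or two base pairs.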
A.6
Nucleosome depletion forces: DNA affinity for histones and transcriptional activity
Table A.4: Correlation matrix of DoND and average Pol II binding
Variables: Abs.D, absolute depletion; Rel.D, relative depletion; polNfr, average polymerase II binding within NFR; polPro(1000), average polymerase II binding 1000bp
upstream of NFR; polAfr(1000), average polymerase II binding 1000bp downstream of
NFR; polAdj(1000), max(polPro, polAfr).

              Abs.D    Rel.D    polNfr   polPro(1k)   polAfr(1k)   polAdj(1k)
Abs.D         1.000   -0.853   -0.086    -0.188       -0.196       -0.342
Rel.D        -0.853    1.000    0.042     0.159        0.170        0.282
polNfr       -0.086    0.042    1.000     0.779        0.749        0.824
polPro(1k)   -0.188    0.159    0.779     1.000        0.518        0.813
polAfr(1k)   -0.196    0.170    0.749     0.518        1.000        0.812
polAdj(1k)   -0.342    0.282    0.824     0.813        0.812        1.000
Table A.5: Compare the effects of DNA affinity for histones and transcriptional
activity (absolute depletion)
We compared the effects of DNA affinity for histones and transcriptional activity by
the following three linear models:
(1) Absolute depletion ∼ DNA,
(2) Absolute depletion ∼ Pol II,
(3) Absolute depletion ∼ DNA + Pol II,
where Pol II signal was calculated as the maximum of the two averages across 1000bp
up- and down-stream of each NFR.
Column “N” is the number of NFRs. Use $R_1^2$, $R_2^2$, and $R_3^2$ to denote the $R^2$ of model
(1), (2), and (3) respectively. $R^2_{Total} = R_3^2$, which is the total proportion of variance
explained by DNA affinity for histones or transcriptional activity. $R^2_{DNA} = R_3^2 - R_2^2$,
which is the $R^2$ explained solely by DNA affinity for histones. $R^2_{PolII} = R_3^2 - R_1^2$, which
is the $R^2$ explained solely by Polymerase II binding signal. $R^2_{Both} = R^2_{Total} - R^2_{PolII} - R^2_{DNA}$
is the $R^2$ explained by both Polymerase II signal and DNA affinity for histones.
$R^2_{Both}$ is not zero because there is correlation between DNA affinity and Polymerase II
binding [NNK04]. $P_{PolII}$ is the ANOVA p-value comparing model (3) against model
(1). $P_{DNA}$ is the ANOVA p-value comparing model (3) against model (2).

              N      R2_Total   R2_DNA           R2_PolII         R2_Both         P_PolII   P_DNA
All NFRs      9593   0.1263     0.0096 (7.6%)    0.1142 (90.4%)   0.0024 (1.9%)   3e-258    1e-24
Inter./Up.    4386   0.1911     0.0373 (19.5%)   0.1459 (76.3%)   0.0079 (4.1%)   5e-160    7e-45
Others        5207   0.0122     0.0052 (42.6%)   0.0067 (54.9%)   3e-04 (2.5%)    3e-09     2e-07

Convergent    516    0.0837     0.0302 (36.1%)   0.0499 (59.6%)   0.0036 (4.3%)   2e-07     5e-05
Tandem        2042   0.1955     0.0472 (24.1%)   0.1429 (73.1%)   0.0053 (2.7%)   2e-74     4e-27
Divergent     1125   0.2483     0.0179 (7.2%)    0.2159 (87.0%)   0.0146 (5.9%)   2e-63     3e-07

TATA          612    0.2542     0.0567 (22.3%)   0.1965 (77.3%)   9e-04 (0.4%)    8e-33     2e-11
TATA-less     2570   0.2082     0.0084 (4.0%)    0.1941 (93.2%)   0.0057 (2.7%)   2e-124    2e-07

TFBS          1116   0.198      0.0874 (44.1%)   0.0961 (48.5%)   0.0146 (7.4%)   3e-29     8e-27
TFBS-less     8477   0.0808     0.0032 (4.0%)    0.077 (95.3%)    6e-04 (0.7%)    3e-150    5e-08
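The decomposition of $R^2$ described in the caption of Table A.5 can be sketched as a small helper that takes the $R^2$ values of the three fitted linear models. The function name and the example values in the usage note are hypothetical; the model fitting itself is not shown.

```python
def partition_r2(r2_dna, r2_pol, r2_full):
    """Split the R^2 of the full model into unique and shared parts.

    r2_dna:  R^2 of model (1), depletion ~ DNA
    r2_pol:  R^2 of model (2), depletion ~ Pol II
    r2_full: R^2 of model (3), depletion ~ DNA + Pol II
    """
    total = r2_full
    dna_only = r2_full - r2_pol   # explained solely by DNA affinity
    pol_only = r2_full - r2_dna   # explained solely by Pol II binding
    both = total - dna_only - pol_only  # shared by the two predictors
    return {
        "total": total,
        "dna_only": dna_only,
        "pol_only": pol_only,
        "both": both,
    }
```

For instance, with hypothetical values `partition_r2(0.30, 0.50, 0.70)`, the unique contributions are 0.20 for DNA affinity and 0.40 for Pol II, and the shared part is 0.10. The shared part equals $R_1^2 + R_2^2 - R_3^2$.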
Table A.6 is the same as Table A.5, except relative depletion is used instead
of absolute depletion.
Table A.6: Compare the effects of DNA affinity for histones and transcriptional
activity (relative depletion)

              N      R2_Total   R2_DNA           R2_PolII         R2_Both         P_PolII   P_DNA
All NFRs      9593   0.0887     0.009 (10.1%)    0.0779 (87.8%)   0.0019 (2.1%)   5e-173    3e-22
Inter./Up.    4386   0.1203     0.0293 (24.4%)   0.0857 (71.2%)   0.0054 (4.5%)   1e-90     5e-33
Others        5207   0.0062     0.005 (80.6%)    0.001 (16.1%)    1e-04 (1.6%)    0.02      3e-07

Convergent    516    0.0428     0.0337 (78.7%)   0.0076 (17.8%)   0.0015 (3.5%)   0.04      3e-05
Tandem        2042   0.1081     0.0264 (24.4%)   0.0787 (72.8%)   0.003 (2.8%)    2e-39     1e-14
Divergent     1125   0.1731     0.0156 (9.0%)    0.1465 (84.6%)   0.011 (6.4%)    1e-41     5e-06

TATA          612    0.1749     0.0423 (24.2%)   0.132 (75.5%)    7e-04 (0.4%)    2e-21     3e-08
TATA-less     2570   0.1204     0.007 (5.8%)     0.1095 (90.9%)   0.0039 (3.2%)   2e-67     6e-06

TFBS          1116   0.1483     0.0668 (45.0%)   0.0707 (47.7%)   0.0109 (7.3%)   5e-21     5e-20
TFBS-less     8477   0.0464     0.0029 (6.2%)    0.0431 (92.9%)   4e-04 (0.9%)    2e-83     4e-07
A.7
A subset of intergenic and genic NFRs with high DoND
Table A.7: Distribution of 145 NFRs with high DoND (R > 0.4 and A < −0.4)
These 145 NFRs are not located in intergenic regions or within 500bp upstream of coding regions.

Feature                    Number of NFRs
ORF                        58
tRNA                       52
ARS                        16
long terminal repeat       9
Y’ element                 4
intron                     2
rRNA                       2
X element core sequence    1
snRNA                      1
Table A.8: 25 genic NFRs in verified ORFs.

ORF        Symbol   Chr   Start     End
YLL042C    ATG10    12    52589     52086
YPL111W    CAR1     16    339943    340944
YIR030C    DCG1     9     412767    412033
YPR166C    MRP2     16    876625    876278
YFL003C    MSH4     6     137152    134516
YHR091C    MSR1     8     286772    284841
YAR002W    NUP60    1     152259    153878
YKR003W    OSH6     11    445024    446370
YDL232W    OST4     4     38488     38598
YLR148W    PEP3     12    434642    437398
YFR034C    PHO4     6     225946    225008
YOR361C    PRT1     15    1017650   1015359
YGR170W    PSD2     7     837147    840563
YOR348C    PUT4     15    988779    986896
YOR210W    RPB10    15    738321    738533
YLR141W    RRN5     12    423684    424775
YOL110W    SHR5     15    109176    109889
YOL122C    SMF1     15    91419     89692
YDR308C    SRB7     4     1078445   1078023
YBR150C    TBS1     2     544487    541203
YNL070W    TOM7     14    493367    493549
YER093C    TSC11    5     347608    343316
YLR024C    UBR2     12    193282    187664
YAL002W    VPS8     1     143709    147533
YAR035W    YAT1     1     190187    192250
Table A.9: 33 genic NFRs in un-verified ORFs. Dubious ORFs are over-represented among the 58 ORFs (hypergeometric p-value = 1e-5), whereas uncharacterized ORFs are not (p-value = 0.55), based on the SGD [CAB98] annotations, implying
that some of the dubious ORFs may not be real coding genes.

Type              ORF          Symbol   Chr   Start     End       Strand
Dubious           YAR060C      NA       1     217483    217148    C
Dubious           YBL048W      NA       2     127302    127613    W
Dubious           YBR209W      NA       2     642578    642895    W
Dubious           YDR010C      NA       4     465380    465048    C
Dubious           YDR215C      NA       4     894498    894115    C
Dubious           YDR274C      NA       4     1011956   1011585   C
Dubious           YDR278C      NA       4     1017314   1016997   C
Dubious           YGR107W      NA       7     702671    703120    W
Dubious           YHL041W      NA       8     17390     17839     W
Dubious           YHR070C-A    NA       8     236514    236104    C
Dubious           YHR212C      NA       8     538094    537759    C
Dubious           YIL054W      NA       9     254541    254858    W
Dubious           YKL102C      NA       11    248011    247706    C
Dubious           YML089C      NA       13    91409     91041     C
Dubious           YML122C      NA       13    26419     26039     C
Dubious           YNL285W      NA       14    96173     96544     W
Dubious           YOR029W      NA       15    384600    384935    W
Dubious           YOR050C      NA       15    424619    424272    C
Dubious           YOR343C      NA       15    968471    968145    C
Dubious           YPR014C      NA       16    587515    587186    C
Dubious           YPR064W      NA       16    678948    679367    W
Uncharacterized   YAR064W      NA       1     220189    220488    W
Uncharacterized   YBL044W      NA       2     136001    136369    W
Uncharacterized   YER077C      NA       5     316596    314530    C
Uncharacterized   YFR032C-B    NA       6     223961    223698    C
Uncharacterized   YGL176C      NA       7     173085    171421    C
Uncharacterized   YGR068C      NA       7     627088    625328    C
Uncharacterized   YHR202W      NA       8     502388    504196    W
Uncharacterized   YHR213W-B    NA       8     540800    541099    W
Uncharacterized   YJR003C      NA       10    442468    440909    C
Uncharacterized   YMR196W      NA       13    655075    658341    W
Uncharacterized   YOR268C      NA       15    825931    825533    C
Uncharacterized   YPR159C-A    NA       16    860411    860310    C
References
[BL04]
M.J. Buck and J.D. Lieb. “ChIP-chip: considerations for the design,
analysis, and application of genome-wide chromatin immunoprecipitation experiments.” Genomics, 83:349–360, Mar 2004.
[BLH04]
B. E. Bernstein, C. L. Liu, E. L. Humphrey, E. O. Perlstein, and S. L.
Schreiber. “Global nucleosome occupancy in yeast.” Genome Biol,
5(9):R62, 2004.
[BZP04]
A. D. Basehoar, S. J. Zanton, and B. F. Pugh. “Identification
and distinct regulation of yeast TATA box-containing genes.” Cell,
116(5):699–709, 2004.
[CAB98] J M Cherry, C Adler, C Ball, S A Chervitz, S S Dwight, E T Hester,
Y Jia, G Juvik, T Roe, M Schroeder, S Weng, and D Botstein. “SGD:
Saccharomyces Genome Database.” Nucleic Acids Res, 26(1):73–79,
Jan 1998.
[CBN04] Simon Cawley, Stefan Bekiranov, Huck H Ng, Philipp Kapranov,
Edward A Sekinger, Dione Kampa, Antonio Piccolboni, Victor
Sementchenko, Jill Cheng, Alan J Williams, Raymond Wheeler,
Brant Wong, Jorg Drenkow, Mark Yamanaka, Sandeep Patel, Shane
Brubaker, Hari Tammana, Gregg Helt, Kevin Struhl, and Thomas R
Gingeras. “Unbiased mapping of transcription factor binding sites
along human chromosomes 21 and 22 points to widespread regulation
of noncoding RNAs.” Cell, 116(4):499–509, Feb 2004.
[DEK98] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins
and Nucleic Acids. Cambridge University Press, 1998.
[DHG06] L. David, W. Huber, M. Granovskaia, J. Toedling, C.J. Palm,
L. Bofkin, T. Jones, R.W. Davis, and L.M. Steinmetz. “A high-resolution map of transcription in the yeast genome.” Proc. Natl.
Acad. Sci. U.S.A., 103:5320–5325, Apr 2006.
[Edd98]
SR Eddy. “Profile hidden Markov models.” Bioinformatics,
14(9):755–763, 1998.
[FSH93]
K D Fascher, J Schmitz, and W Horz. “Structural and functional
requirements for the chromatin transition at the PHO5 promoter
in Saccharomyces cerevisiae upon PHO5 activation.” J Mol Biol,
231(3):658–667, Jun 1993.
[GKB07] S M Gribble, D Kalaitzopoulos, D C Burford, E Prigmore, R R Selzer,
B L Ng, N S W Matthews, K M Porter, R Curley, S J Lindsay, J Baptista, T A Richmond, and N P Carter. “Ultra-high resolution array
painting facilitates breakpoint sequencing.” J Med Genet, 44(1):51–
58, 2007.
[HGL04]
Christopher T Harbison, D Benjamin Gordon, Tong Ihn Lee, Nicola J
Rinaldi, Kenzie D Macisaac, Timothy W Danford, Nancy M Hannett,
Jean-Bosco Tagne, David B Reynolds, Jane Yoo, Ezra G Jennings,
Julia Zeitlinger, Dmitry K Pokholok, Manolis Kellis, P Alex Rolfe,
Ken T Takusagawa, Eric S Lander, David K Gifford, Ernest Fraenkel,
and Richard A Young. “Transcriptional regulatory code of a eukaryotic genome.” Nature, 431(7004):99–104, Sep 2004.
[HJW98] F. C. Holstege, E. G. Jennings, J. J. Wyrick, T. I. Lee, C. J. Hengartner, M. R. Green, T. R. Golub, E. S. Lander, and R. A. Young.
“Dissecting the regulatory circuitry of a eukaryotic genome.” Cell,
95(5):717–28, 1998.
[IAZ06]
I. P. Ioshikhes, I. Albert, S. J. Zanton, and B. F. Pugh. “Nucleosome positions predicted through comparative genomics.” Nat Genet,
38(10):1210–5, 2006.
[JLM06]
W Evan Johnson, Wei Li, Clifford A Meyer, Raphael Gottardo, Jason S Carroll, Myles Brown, and X Shirley Liu. “Model-based analysis of tiling-arrays for ChIP-chip.” Proc Natl Acad Sci U S A,
103(33):12457–12462, Aug 2006. Evaluation Studies.
[JW05]
Hongkai Ji and Wing Hung Wong. “TileMap: create chromosomal
map of tiling array hybridizations.” Bioinformatics, 21(18):3629–
3636, Sep 2005.
[KBZ05]
Tae Hoon Kim, Leah O Barrera, Ming Zheng, Chunxu Qu, Michael A
Singer, Todd A Richmond, Yingnian Wu, Roland D Green, and Bing
Ren. “A high-resolution map of active promoters in the human
genome.” Nature, 436(7052):876–880, Aug 2005.
[KIM07]
AC Karcanias, K Ichimura, MJ Mitchell, CA Sargent, and NA Affara.
“Analysis of sex chromosome abnormalities using X and Y chromosome DNA Tiling path arrays.” J Med Genet, 2007.
[KL99]
R. D. Kornberg and Y. Lorch. “Twenty-five years of the nucleosome,
fundamental particle of the eukaryote chromosome.” Cell, 98(3):285–
94, 1999.
[KLW03] Craig D Kaplan, Lisa Laprade, and Fred Winston. “Transcription
elongation factors repress transcription initiation from cryptic sites.”
Science, 301(5636):1096–1099, Aug 2003.
[KMH94] Anders Krogh, I. Saira Mian, and David Haussler. “A hidden Markov
model that finds genes in E.coli DNA.” Nucl. Acids Res., 22(22):4768–
4778, 1994.
[LBZ00]
Harvey Lodish, Arnold Berk, Lawrence S. Zipursky, Paul Matsudaira,
David Baltimore, and James Darnell. Molecular Cell Biology. W. H.
Freeman and Company, 2000.
[LG92]
M S Lee and W T Garrard. “Uncoupling gene activity from chromatin structure: promoter mutations can inactivate transcription of
the yeast HSP82 gene without eliminating nucleosome-free regions.”
Proc Natl Acad Sci U S A, 89(19):9166–9170, Oct 1992.
[LGC07]
Bing Li, Madelaine Gogol, Mike Carey, Samantha G Pattenden, Chris
Seidel, and Jerry L Workman. “Infrequently transcribed long genes depend on the Set2/Rpd3S pathway for accurate transcription.” Genes
Dev, 21(11):1422–1430, Jun 2007.
[LML05]
Wei Li, Clifford A Meyer, and X Shirley Liu. “A hidden Markov
model for analyzing ChIP-chip experiments on genome tiling arrays
and its application to p53 binding sequences.” Bioinformatics, 21
Suppl 1:274–282, Jun 2005.
[LSR04]
C. K. Lee, Y. Shibata, B. Rao, B. D. Strahl, and J. D. Lieb. “Evidence
for nucleosome depletion at active regulatory regions genome-wide.”
Nat Genet, 36(8):900–5, 2004.
[LTB07]
William Lee, Desiree Tillo, Nicolas Bray, Randall H Morse, Ronald W
Davis, Timothy R Hughes, and Corey Nislow. “A high-resolution atlas
of nucleosome occupancy in yeast.” Nat Genet, 39(10):1235–1244, Oct
2007.
[MCS00] X Mai, S Chou, and K Struhl. “Preferential accessibility of the yeast
his3 promoter is determined by a general property of the DNA sequence, not by specific elements.” Mol Cell Biol, 20(18):6668–6676,
Sep 2000.
[MWG06] Kenzie D MacIsaac, Ting Wang, D Benjamin Gordon, David K Gifford, Gary D Stormo, and Ernest Fraenkel. “An improved map of
conserved regulatory sites for Saccharomyces cerevisiae.” BMC Bioinformatics, 7:113, 2006.
[NGR98] MA Newton, MN Gould, CA Reznikoff, and JD Haag. “On the statistical analysis of allelic-loss data.” Stat Med, 17:1425–45, 1998.
[NLH06]
N. Nègre, S. Lavrov, J. Hennetin, M. Bellis, and G. Cavalli. “Mapping the distribution of chromatin proteins by ChIP on chip.” Meth.
Enzymol., 410:316–341, 2006.
[NNK04] Chris J. Nachtsheim, John Neter, and Michael H. Kutner. Applied
linear statistical models. McGraw-Hill, 2004.
[ODK96] M. Ostendorf, V.V. Digalakis, and O.A. Kimball. “From HMM’s
to segment models: a unified view of stochastic modeling for speech
recognition.” IEEE Transactions on Speech and Audio Processing,
4(5):360–378, 1996.
[OSL07]
F. Ozsolak, J.S. Song, X.S. Liu, and D.E. Fisher. “High-throughput
mapping of the chromatin structure of human promoters.” Nat.
Biotechnol., 25:244–248, Feb 2007.
[PEP99]
J.L. Parrou, B. Enjalbert, L. Plourde, A. Bauche, B. Gonzalez, and
J. François. “Dynamic responses of reserve carbohydrate metabolism
under carbon and nitrogen limitations in Saccharomyces cerevisiae.”
Yeast, 15:191–203, Feb 1999.
[PHL05]
Dmitry K Pokholok, Christopher T Harbison, Stuart Levine, Megan
Cole, Nancy M Hannett, Tong Ihn Lee, George W Bell, Kimberly
Walker, P Alex Rolfe, Elizabeth Herbolsheimer, Julia Zeitlinger, Fran
Lewitter, David K Gifford, and Richard A Young. “Genome-wide
map of nucleosome acetylation and methylation in yeast.” Cell,
122(4):517–527, Aug 2005.
[Rab89]
L.R. Rabiner. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77(2):257–
286, Feb 1989.
[SFC06]
E. Segal, Y. Fondufe-Mittendorf, L. Chen, A. Thastrom, Y. Field, I. K.
Moore, J. P. Wang, and J. Widom. “A genomic code for nucleosome
positioning.” Nature, 442(7104):772–8, 2006.
[SMS05]
Edward A Sekinger, Zarmik Moqtaderi, and Kevin Struhl. “Intrinsic
histone-DNA interactions and low nucleosome density are important
for preferential accessibility of promoter regions in yeast.” Mol Cell,
18(6):735–748, Jun 2005.
[STK03]
K. Sakaki, K. Tashiro, S. Kuhara, and K. Mihara. “Response of genes
associated with mitochondrial function to mild heat stress in yeast
Saccharomyces cerevisiae.” J. Biochem., 134:373–384, Sep 2003.
[TME07] Andreas Tzschach, Corinna Menzel, Fikret Erdogan, Marei Schubert,
Maria Hoeltzenbein, Gotthold Barbi, Christine Petzenhauser, Hans-Hilger Ropers, Reinhard Ullmann, and Vera Kalscheuer. “Characterization of a 16 Mb interstitial chromosome 7q21 deletion by tiling path
array CGH.” Am J Med Genet A, 143(4):333–337, 2007.
[WBI05]
Kathryn Woodfine, David M Beare, Koichi Ichimura, Silvana Debernardi, Andrew J Mungall, Heike Fiegler, V Peter Collins, Nigel P
Carter, and Ian Dunham. “Replication timing of human chromosome
6.” Cell Cycle, 4(1):172–176, 2005.
[WC53]
J.D. Watson and F.H. Crick. “Molecular structure of nucleic
acids; a structure for deoxyribose nucleic acid.” Nature, 171:737–738,
Apr 1953.
[WES04] Eric J White, Olof Emanuelsson, David Scalzo, Thomas Royce, Steven
Kosak, Edward J Oakeley, Sherman Weissman, Mark Gerstein, Mark
Groudine, Michael Snyder, and Dirk Schubeler. “DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states.” Proc Natl Acad Sci U S A, 101(51):17771–
17776, 2004.
[WFB04] Kathryn Woodfine, Heike Fiegler, David M Beare, John E Collins,
Owen T McCann, Bryan D Young, Silvana Debernardi, Richard Mott,
Ian Dunham, and Nigel P Carter. “Replication timing of the human
genome.” Hum Mol Genet, 13(2):191–202, 2004.
[WGZ92] J H Wright, D E Gottschling, and V A Zakian. “Saccharomyces telomeres assume a non-nucleosomal chromatin structure.” Genes Dev,
6(2):197–210, Feb 1992.
[XZZ07]
F. Xu, Q. Zhang, K. Zhang, W. Xie, and M. Grunstein. “Sir2 deacetylates histone H3 lysine 56 to regulate telomeric heterochromatin structure in yeast.” Mol. Cell, 27:890–900, Sep 2007.
[YLD05]
Guo-Cheng Yuan, Yuen-Jong Liu, Michael F Dion, Michael D Slack,
Lani F Wu, Steven J Altschuler, and Oliver J Rando. “Genome-scale identification of nucleosome positions in S. cerevisiae.” Science,
309(5734):626–630, Jul 2005.