University of California
Los Angeles

A quantitative study of nucleosome free regions in yeast by segmental semi-Markov Model using tiling microarrays

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Statistics

by Wei Xie

2008

© Copyright by Wei Xie 2008

The thesis of Wei Xie is approved.
Qing Zhou
Yingnian Wu
Michael Grunstein
Ker-Chau Li, Committee Chair
University of California, Los Angeles
2008

To my father, mother, brother, fiancée and others who provide their love and support all through my graduate studies

Table of Contents
1 Introduction
2 Methods
  2.1 ChIP-chip assay and data pre-processing
  2.2 Overview of SSMM
  2.3 Design of SSMM
3 Data collection and validation
4 Histone occupancy at different chromosome features
5 NFR identification by SSMM
6 Factors of nucleosome depletion: transcriptional activity versus DNA affinity for histones
7 Discussions
A Supplementary Materials
  A.1 Signal of TFBS vs. signal of NFR
  A.2 Algorithm of Segmental Semi-Markov Model
    A.2.1 Segmental model fitting and emission probability calculation
    A.2.2 Algorithms
    A.2.3 Parameter estimation
  A.3 Validation of the raw data
  A.4 Compare absolute depletion and relative depletion in NFRs
  A.5 Distributions and lengths of NFRs with different DoND
  A.6 Nucleosome depletion forces: DNA affinity for histones and transcriptional activity
  A.7 A subset of intergenic and genic NFRs with high DoND
References

List of Figures
1.1 Chromatin structure and nucleosomes.
1.2 ChIP-chip: Chromatin Immunoprecipitation coupled by microarray.
1.3 A schematic representation of this study.
2.1 Four states in segmental semi-Markov model used in this study to model the histone occupancy surrounding the NFRs.
2.2 Schematic organization of segmental semi-Markov model.
3.1 Comparison between the nucleosome occupancy data in this study and published data.
5.1 Feature quantification of NFRs based on trapezoid pattern.
5.2 Comparison of linker region in Lee et al. and NFR from this study
5.3 Effects of DoND on distributions and lengths of NFRs.
5.4 The distributions of distances and lengths of NFRs within promoter regions.
6.1 Histones are depleted from the promoter of gene GAC1 prior to its activation.
6.2 Histones are depleted from the promoter of gene YMR279C prior to its activation.
6.3 Factors of nucleosome depletion: transcriptional activity versus DNA affinity for histones.
A.1 Comparison of TF binding signal and Nucleosome occupancy signal.
A.2 Distribution of state durations after convergence
A.3 Correlations between the nucleosome occupancy data in this study and the data from Lee et al.
A.4 Absolute depletion vs. relative depletion
A.5 Locations of NFRs vs. absolute depletion
A.6 Locations of NFRs vs. relative depletion
A.7 Different intergenic region vs. NFR absolute/relative depletion
A.8 TATA box vs. NFR absolute/relative depletion
A.9 TF binding sites vs. NFR absolute/relative depletion

List of Tables
4.1 Nucleosome occupancies at chromosome features vs. intergenic regions
A.1 Correlations between our data and the data by Bernstein et al.
A.2 Correlations between our data and the data by Lee et al.
A.3 Correlations between our data and the data by Pokholok et al.
A.4 Correlation matrix of DoND and average Pol II binding
A.5 Compare the effects of DNA affinity for histones and transcriptional activity (absolute depletion)
A.6 Compare the effects of DNA affinity for histones and transcriptional activity (relative depletion)
A.7 Distribution of 145 NFRs with high DoND.
A.8 25 genic NFRs in verified ORFs
A.9 33 genic NFRs in un-verified ORFs.

Acknowledgments

I am greatly indebted to Dr. Ker-Chau Li, my advisor in the Department of Statistics, who is an expert in computational biology and bioinformatics. He has given me unlimited support and guidance throughout my study and research in Statistics and supervised my thesis project. I also want to thank Dr. Michael Grunstein, my advisor in the Molecular Biology Institute, who is a pioneering scientist in yeast genetics.
It is his constant encouragement and support that gave me tremendous motivation during my study at UCLA. I am extremely fortunate to be able to work with and learn from these two outstanding scientists. I also want to express my gratitude to Dr. Sun Wei, my good friend, collaborator, and co-first author of our manuscript based on this project, who now has his own lab at the University of North Carolina, Chapel Hill. He is a professional statistician and an enthusiastic scientist. This project could not have been done without his persistent hard work. It has been a wonderful experience and a great pleasure to work with him. I want to thank my other committee members, Qing Zhou and Yingnian Wu, for the knowledge and skills that I learned from their classes, as well as for their comments and discussions on my thesis. My appreciation also goes to Dr. Chris Lee, Dr. Steve Horvath, Dr. David Elashoff and Dr. Mark Hansen, whose courses sparked my deep interest in bioinformatics, programming and genomic work. I also want to thank Feng Xu, who is always willing to help others and has kindly provided part of the data for this project. Lastly, I thank my parents, my brother and my fiancée Di, to whom I dedicate both this thesis and my love.

Abstract of the Thesis

A quantitative study of nucleosome free regions in yeast by segmental semi-Markov Model using tiling microarrays

by Wei Xie
Master of Science in Statistics
University of California, Los Angeles, 2008
Professor Ker-Chau Li, Chair

DNA, the fundamental molecule carrying genetic information, is packed into chromatin inside the nuclei of cells in a highly organized manner. Nucleosomes, the basic units of chromatin, are not uniformly distributed along the chromosomes, and many genomic loci are depleted of nucleosomes. Nucleosome free regions (NFRs) play an important role in many biological processes including gene regulation.
As the resolution of tiling arrays increases, we expect to extract increasingly subtle quantitative properties of NFRs, such as their lengths and degree of nucleosome depletion. Because these quantities are likely to vary from one NFR to another, a genome-wide portrait of each individual NFR may help shed light on the dynamic aspects of chromatin restructuring and gene regulation. Although previous studies have examined the consensus pattern of nucleosome depletion in promoter regions by a curve-averaging method, the quantitative characterization of each individual NFR, despite its importance, is lost in the averaging. In this study, we presented nucleosome occupancy data for the whole yeast genome at 4-bp resolution and developed an efficient algorithm to identify each individual "quantitative NFR" at the whole-genome scale. Our results showed that the majority of the NFRs are located in intergenic regions/promoters and are about 400-600 bp long, which is approximately the length of DNA wrapped around two to three nucleosomes plus linkers. Our quantitative NFR results enable an investigation of the relative impacts of the transcription machinery and DNA sequence in evicting histones from NFRs. We showed that while both factors have significant overall effects, their specific contributions vary across different subtypes of NFRs. The emphasis of our approach on the variation rather than the consensus of NFRs sets the stage for exploring many subtler dynamic aspects of chromatin biology.

CHAPTER 1
Introduction

DNA is the fundamental molecule that encodes the genetic information of every organism, including humans [WC53]. The regulation of DNA-based gene activity is a central question that has been extensively pursued. However, DNA does not exist in naked form in the cellular environment; rather, it is highly organized into an extremely compact structure, chromatin [KL99].
The nucleosome, the building block of chromatin, is a critical regulator of many biological processes, such as transcription, DNA repair, and DNA replication [KL99] (Figure 1.1). In many cases the presence of nucleosomes hinders the access of the transcriptional machinery to the underlying DNA; conversely, nucleosome free regions (NFRs) allow easier access of transcription regulators to DNA sequences [BLH04, LSR04, PHL05, YLD05, LTB07]. This underscores the importance of localizing nucleosomes genome-wide, a goal that has been attained using chromatin immunoprecipitation coupled with microarrays (ChIP-chip) (Figure 1.2) [BL04]. Briefly, chromatin is fragmented and DNA occupied by histones is enriched with histone-specific antibodies. Together with background control genomic DNA that is not subjected to antibody enrichment, these DNA fragments are fluorescently labelled and hybridized to a DNA microarray, where complementary hybridization between the designed probes and the sample DNA signals the genomic regions where histones bind [NLH06].

Figure 1.1: Chromatin structure and nucleosomes (adapted from Molecular Cell Biology, Lodish et al. [LBZ00]). A single DNA molecule is wound around histone octamers to form strings of closely packed nucleosomes. Nucleosomes further fold to form a 30-nm chromatin fiber, which is attached to a flexible protein scaffold, resulting in long loops of chromatin extending from the scaffold.

Genome-wide histone occupancy has been reported by several groups at gene resolution [BLH04, LSR04] or at a resolution of 260 bp [PHL05] for the entire yeast genome. Higher-resolution mapping (20 bp) was reported by Yuan et al. [YLD05] for 3% of the yeast genome (chromosome III and 223 additional regulatory regions). Recently, Lee et al. [LTB07] presented a complete high-resolution (4 bp) map of nucleosome occupancy in yeast.
In mammals, nucleosomes have been mapped in human cells for a portion of the genome (3692 promoters) [OSL07].

Despite the successes of these studies, several fundamental questions regarding the nature of NFRs remain open. First, it is not clear whether NFRs occur exclusively at promoter regions. NFRs in non-promoter regions (including coding regions) may have unknown functions. Second, it has been controversial whether histones are depleted only from active gene promoters. Several studies have suggested the existence of transcription-independent NFRs at individual genes [FSH93, LG92, MCS00, SMS05]. Finally, although both the transcriptional machinery and DNA sequence have been shown to be involved in histone eviction [MCS00, SMS05, SFC06], the relationship between these two factors remains an intriguing question. They may have distinct effects in different types of NFRs.

Figure 1.2: ChIP-chip: chromatin immunoprecipitation coupled with microarray (adapted from Buck et al. [BL04]). Chromatin immunoprecipitation (ChIP) is performed using a specific antibody, then the DNA crosslinked with target proteins is extracted and purified. Immunoprecipitated DNA is amplified and hybridized to microarrays together with the input DNA. Raw intensity is then extracted and analyzed for each spot, representing the relative enrichment at each genomic locus.

To investigate the above issues, it is important to bring out the dynamic aspects of NFRs. Because of the complex interplay between differential gene regulation and chromatin restructuring, the lengths of NFRs are likely to vary from one NFR to another. Likewise, the degree of nucleosome depletion (DoND) in each NFR is likely to vary as well. However, while many previous studies have described nucleosome occupancy in quantitative terms, only static ensemble properties of NFRs have been described.
For instance, in [YLD05, LTB07], average nucleosome occupancies in promoter regions were reported by aligning the raw data (by start codon or transcription start site (TSS)) and averaging across all genes. The average nucleosome occupancy curve does reveal an interesting pattern shared by many NFRs. At the same time, however, the characteristics specific to each individual NFR are lost in the averaging. In fact, as shown later in this study, the NFRs in promoter regions do vary in their lengths and in their distances to the corresponding start codons/TSSs. Furthermore, we also identified NFRs in coding regions; see Results and Discussion.

In order to identify and quantitatively characterize each individual NFR across the whole genome, an automatic "NFR calling" algorithm that can separate the NFR pattern from a noisy background is required. Currently, the major existing algorithms for the analysis of tiling array data are adapted from those initially designed for detecting transcription factor binding sites (TFBSs) [CBN04, LML05, KBZ05, JW05, JLM06]. They are inadequate for detecting NFRs. First, because most TFBSs are sparsely distributed across the genome, many TFBS identification algorithms [LML05, JLM06] are designed under the assumption that the vast majority of the array data are background noise. However, this assumption becomes problematic for exploring epigenetic events that are often abundant, e.g., nucleosome occupancy and a variety of histone modifications [PHL05]. Second, the signal of TF binding is typically short and tends to have a sharp "peak" (Supplementary Figure A.1(a)). In contrast, the pattern from nucleosome occupancy or histone modification can be much longer, often with a "segmental" shape (Supplementary Figure A.1(b)). Third, the binding of a TF typically only requires a qualitative description (i.e.
presence/absence), while quantitative shape parameters of the signal patterns are essential for the characterization of NFRs and other epigenetic marks.

To our knowledge, there are only two published algorithms specifically designed for detecting nucleosomes [YLD05, OSL07]. Yuan et al. [YLD05] employed a hidden Markov model (HMM) to detect positioned/delocalized nucleosomes. The method infers only the positions of nucleosomes and does not provide quantitative properties of the NFRs such as DoND. Lee et al. [LTB07] adapted an extension of the HMM used by Yuan et al. [YLD05] to analyze their high-resolution nucleosome occupancy data. Alternatively, Ozsolak et al. [OSL07] proposed a two-step procedure to detect positioned nucleosomes, consisting mainly of (1) smoothing the raw probe-level data by wavelet decomposition and (2) decomposing the entire chromosome into "peaks" and "troughs" by an edge-detection technique. How much the data should be smoothed in the first step appears to dictate the entire nucleosome positioning procedure, and this could be difficult to decide in practice.

In this study, we developed an algorithm for capturing complex signal patterns in data from high-density arrays. Our algorithm, employed here to detect NFRs, is based on a segmental (hidden) semi-Markov model (SSMM), which is an extension of the HMM (see Sections 2.2 and 2.3 for an overview and Section A.2 for details). In addition to being able to identify desired patterns (e.g., NFR patterns) and capture the quantitative features of such patterns, this SSMM-based algorithm also has more flexible model assumptions and higher efficiency than the regular HMM. Figure 1.3 is a schematic representation of our study, showing how the SSMM is used in characterizing genome-wide NFRs and in exploring the driving forces of nucleosome depletion.
[Figure 1.3 flowchart. Boxes: nucleosome occupancy data by ChIP-tiling array; detect locations of nucleosome free regions and estimate their sizes and degree of nucleosome depletion; transcriptional activity (RNA Polymerase II binding by ChIP-tiling array); DNA affinity estimations from Segal et al.; driving forces of nucleosome depletion: transcriptional activity versus DNA affinity to histones.]

Figure 1.3: A schematic representation of this study.

CHAPTER 2
Methods

2.1 ChIP-chip assay and data pre-processing

Three sets of ChIP-chip data were used in this study. The first set, ChIP-chip data for histone H3, was published previously [XZZ07]. Briefly, chromatin immunoprecipitation (ChIP) was performed using an antibody against histone H3 (a kind gift from Dr. Alain Verreault [XZZ07]), and the DNA crosslinked with nucleosomes was then extracted and purified. Immunoprecipitated DNA was amplified and hybridized to the Affymetrix Saccharomyces cerevisiae Tiling 1.0R Array to map nucleosome occupancy along the chromosomes at 4-bp resolution. Raw intensities were computed by the Two-Sample Analysis method using Affymetrix Tiling Analysis Software v1.1. The tiling array features 2,635,714 oligo probes (25-mer), with the majority (91.5%) of adjacent probes spaced 4 bp apart (i.e., overlapping by 21 bp). Fewer than 1% of neighboring probes are separated by gaps longer than 20 bp. The entire yeast genome except the centromeres is well represented on the arrays.

To validate the above results, we further generated a new histone H3 ChIP-chip data set (two biological repeats) with a commercial H3 antibody (Abcam ab1791) using Affymetrix tiling arrays. The data obtained from the two antibodies are highly consistent (see Results). The average of all four repeats was used in our study.

In order to quantify the transcriptional activity at each genomic locus, we also measured genome-wide RNA Polymerase II binding (using antibody 8WG16, Upstate) by ChIP-chip, following the same procedure as for histone H3.
2.2 Overview of SSMM

The segmental semi-Markov model (SSMM) is an extension of the hidden Markov model (HMM). It generalizes the standard HMM in two major ways. First, the SSMM uses an explicit state duration (length) density instead of the implicit geometric, exponentially decaying density of the HMM [Rab89]. For a standard HMM with transition probability a_ii from state s_i to itself, the probability that exactly d consecutive observations are emitted from state s_i is (a_ii)^(d-1) (1 - a_ii). This probability decreases exponentially as the length d increases, which makes long segments highly improbable. For example, if a_ii = 0.9 and there is one probe per 4 bp, the probability that an NFR is longer than 500 bp (125 probes; a typical NFR length, see the Results) is (0.9)^125, which is smaller than 2e-6. Thus the adoption of an explicit state duration density is especially important for high-density tiling array data. The drawback of an explicit state duration density is that there are more parameters to estimate, which may lead to over-fitting. However, for tiling arrays with millions of probes, over-fitting is unlikely to be a problem.

Second, the SSMM employs a segmental model to calculate the emission probability, so that dependency is allowed among all observations within one segment [ODK96]. This is desirable in analyzing tiling array data because the dependency assumption is more realistic. Furthermore, the segmental model can provide quantitative outputs characterizing the shapes of signal patterns.

Both HMM and SSMM have been applied in speech recognition [Rab89, ODK96]. HMM has been introduced to computational biology for sequence alignment and gene detection [KMH94, DEK98, Edd98], as well as for identifying ChIP-based TF binding regions [LML05, JW05] and NFRs [YLD05]. However, despite its flexibility in handling high-density data, to the best of our knowledge the SSMM has not been used in computational biology before. One possible reason is its heavy computational burden.
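To make the duration argument in this section concrete, the following Python sketch evaluates the geometric duration distribution implied by a standard HMM. It is an illustration only: the function names are hypothetical, and the conversion of 500 bp to 125 probes assumes exactly one probe every 4 bp. Under these assumptions the probability of a run longer than 125 probes is about 1.9e-6, and the probability of a run of exactly 126 probes is about 1.9e-7; either way, long segments are vanishingly rare under a geometric duration.

```python
def geometric_duration_pmf(a_ii: float, d: int) -> float:
    """P(exactly d consecutive observations) for self-transition probability a_ii."""
    if d < 1:
        raise ValueError("duration must be at least 1")
    return a_ii ** (d - 1) * (1.0 - a_ii)


def geometric_duration_tail(a_ii: float, d: int) -> float:
    """P(more than d consecutive observations), which simplifies to a_ii ** d."""
    if d < 0:
        raise ValueError("duration must be non-negative")
    return a_ii ** d


a_ii = 0.9          # self-transition probability used in the text's example
probes = 500 // 4   # assumption: one probe per 4 bp, so 500 bp is 125 probes

# Probability of a run of exactly 126 probes (about 1.9e-07).
print(f"{geometric_duration_pmf(a_ii, probes + 1):.1e}")
# Probability of a run longer than 125 probes (about 1.9e-06).
print(f"{geometric_duration_tail(a_ii, probes):.1e}")
```

An explicit duration distribution, as used in the SSMM, replaces this fixed geometric shape with probabilities estimated from the data.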
The algorithm we introduce in this study incorporates several modifications of the regular SSMM, which greatly improve the computational efficiency. We also designed the hidden states and the likelihood evaluation scheme specifically for the purpose of NFR identification.

2.3 Design of SSMM

Our SSMM is designed to capture two types of NFR patterns from high-density tiling array data: triangle (Figure 2.1 (a)) and trapezoid (Figure 2.1 (b)) patterns. There are four states in our SSMM (Figure 2.1). States 1 and 2 are horizontal lines with high and low degrees of nucleosome occupancy, which model signals from nucleosome occupied regions (NORs) and NFRs respectively. States 3 and 4 are lines with negative and positive slopes, which model the transition from NOR to NFR and from NFR to NOR respectively. States 1 and 2 have the same shape, but different state duration probabilities and transition probabilities. A triangle pattern corresponds to the path 1 → 3 → 4 → 1 and a trapezoid pattern to the path 1 → 3 → 2 → 4 → 1 (Figure 2.1 (c)).

Figure 2.1: Four states in segmental semi-Markov model. (a) A triangle pattern representing a type of NFR observed on tiling arrays. (b) A trapezoid pattern representing another type of NFR observed on tiling arrays. (c) The allowed transitions between any two of the four states.

In order to fit the SSMM, we organized the data into three hierarchical levels: probe, bin, and segment (Figure 2.2). First, we merged probes within a 50-bp window into a "bin". Then "segments" were constructed from one or several "bins". Each segment is emitted from one of the four states, and its emission probability is calculated based on linear model fitting. We organized the data into bins before segments for the following two reasons. First, it greatly reduces the computational burden by forcing all the probes in one bin to share a single underlying state. Second, discrete-time SSMM assumes equally spaced observations.
However, in our data, the gaps between adjacent probes are not constant. Grouping probes into bins ensures that the distances between most adjacent bins are constant. Other possible solutions include modeling the transition probability between adjacent probes as a function of their distance [NGR98], or implementing a continuous-time SSMM. Both methods, however, would significantly increase the algorithm complexity and computation time.

The bin size was determined empirically based on the following considerations. (1) Each bin should be long enough to include enough probes for linear model fitting. Long bins also help filter out noise and reduce the computational burden. (2) The signals will be over-smoothed if bins are too long. A 50-bp bin covers 10-12 probes on average. This allows enough data points for model fitting while avoiding over-smoothing, given that the lengths of NFRs typically range from several hundred to a few thousand base pairs (see Results).

To compute the emission probability of one segment, we first fitted the segmental model (a simple linear model in our case) and then calculated the emission probability based on the residuals. In order to obtain a continuous prediction of nucleosome occupancy, we require the fitted line to be continuous. More details regarding the emission probability calculation are given in the Supplementary Materials.

Several parameters of the SSMM need to be estimated: the transition probabilities from state 3 to states 2 and 4 (all other transition probabilities are fixed at 0 or 1), and the probability distributions of the state durations. These parameters are estimated by an iterative procedure. With the parameters initialized according to uniform distributions, we first identify the most likely path by the Viterbi algorithm [Rab89, DEK98], then estimate the parameters based on the most likely path, and iterate until the parameter estimates converge.
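The constrained transition structure described in this section can be written down explicitly. The sketch below is an illustration in Python, not the authors' ss.hmm implementation; the names `transition_matrix`, `is_valid_path` and `estimate_p32`, and the example state sequence, are hypothetical. It encodes the single free transition parameter, the probability of moving from state 3 to state 2 (with the complement going to state 4), and estimates it from a decoded sequence of segment states as the proportion of 3 → 2 transitions among all transitions out of state 3.

```python
from typing import Dict, Sequence


def transition_matrix(p32: float) -> Dict[int, Dict[int, float]]:
    """Allowed transitions between the four states; only 3 -> 2 / 3 -> 4 is free."""
    if not 0.0 <= p32 <= 1.0:
        raise ValueError("p32 must be a probability")
    return {
        1: {3: 1.0},                  # occupied region -> downward slope
        2: {4: 1.0},                  # NFR bottom -> upward slope
        3: {2: p32, 4: 1.0 - p32},    # trapezoid (via 2) or triangle (direct to 4)
        4: {1: 1.0},                  # upward slope -> occupied region
    }


def is_valid_path(path: Sequence[int]) -> bool:
    """Check that every consecutive pair of segment states is an allowed transition."""
    allowed = transition_matrix(0.5)
    return all(b in allowed.get(a, {}) for a, b in zip(path, path[1:]))


def estimate_p32(path: Sequence[int]) -> float:
    """Proportion of transitions out of state 3 that go to state 2."""
    successors = [b for a, b in zip(path, path[1:]) if a == 3]
    if not successors:
        raise ValueError("path contains no transition out of state 3")
    return sum(1 for b in successors if b == 2) / len(successors)


# Hypothetical decoded path: trapezoid, triangle, trapezoid.
decoded = [1, 3, 2, 4, 1, 3, 4, 1, 3, 2, 4, 1]
print(is_valid_path(decoded))   # True
print(estimate_p32(decoded))    # two of three exits from state 3 go to state 2
```

Because self-transitions are excluded, how long the model stays in each state is governed entirely by the explicit duration distributions rather than by this matrix.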
The Viterbi algorithm for the SSMM differs from the one for the HMM because, in order to determine the best path ending at time k in state i, we need to choose not only the previous state j but also the duration of state i. Given the "best path", the parameters are estimated as follows. We estimate the transition probabilities by the corresponding proportions of transitions. The duration probabilities are estimated by the proportions of observed durations. For example, to estimate P(duration of state i = k), we first take all the duration lengths of state i, and then calculate the proportion of the duration lengths that are equal to k. For our data, it takes 7 iterations for the SSMM to converge according to the convergence rule: the maximum change of the transition probabilities and the maximum change of the duration probabilities are both smaller than 1e-5. The model fit at convergence is the final output of the SSMM. Details of these algorithms and the parameter estimates at convergence are shown in the Supplementary Materials.

We have also tried different initial parameter values. For example, we initialized the state duration distributions with different normal distributions, or initialized the lengths of states 1 and 2 with the empirical distributions of the lengths of chromosome features and intergenic regions respectively. All these different initial values result in almost identical final outputs. Therefore it is highly unlikely that our algorithm is trapped in a local optimum.

We have implemented our algorithm in an R package, ss.hmm, which can be downloaded at http://www.bios.unc.edu/~wsun/software.htm.
Figure 2.2: Schematic organization of our segmental semi-Markov model. (a) The yellow solid line represents the observed data and the green dashed line indicates the model fit by SSMM. The state of each bin, inferred by the Viterbi algorithm, is labeled by the number 1, 2, 3, or 4 as shown at the bottom. A magnified seven-bin-long segment between the two vertical dotted lines is shown in (b). A single 50-bp bin from the segment in (b), containing 11 probes, is shown in (c). It can be seen that the distances between adjacent probes vary.

CHAPTER 3
Data collection and validation

We isolated nucleosomal DNA by chromatin immunoprecipitation (ChIP) using anti-H3 antibodies, and hybridized the isolated DNA and genomic DNA to the Affymetrix Saccharomyces cerevisiae Tiling 1.0R Array at 4-bp resolution (see Methods). Two biological replicates were performed for each of the two different H3 antibodies employed. Our tiling array data are highly reproducible, as the array signals obtained from the two different H3 antibodies showed strong correlation (R = 0.82, across ~2.6 million probes). In the following analysis, we used the average of the four replicates.

We first validated our data using previously published whole-genome nucleosome occupancy data. A comparison between coarse-grained versions of our data and previous work showed high consistency at different resolutions, with Pearson correlations of 0.66 (Bernstein et al. [BLH04], ~6000 probes), 0.77 (Lee et al. [LSR04], ~12000 probes), and 0.62 (Pokholok et al. [PHL05], ~41000 probes) respectively (Supplementary Tables A.1-A.3). We also compared our data to recent high-resolution genome-wide nucleosome occupancy data in yeast [LTB07]. Despite the overall consistency, a major difference between these two data sets comes from the protocol for isolating nucleosomal DNA.
In our study, as well as in many previous studies [BLH04, LSR04, PHL05], the nucleosomal DNA is isolated by shearing chromatin randomly into short fragments (average size 500 bp in our study) using sonication and immunoprecipitating histones and associated DNA with an anti-H3 antibody. Alternatively, micrococcal nuclease can be used to digest the internucleosomal linker DNA, retaining only the nucleosomes. Yuan et al. [YLD05] and Lee et al. [LTB07] used this micrococcal nuclease approach to study nucleosome occupancy in ~3% of the yeast genome and in the whole yeast genome respectively. Compared with micrococcal nuclease treatment, the resolution of the anti-H3 ChIP approach is limited by the sonication step; our data are therefore a smoothed version of the data from Lee et al. [LTB07]. This can be demonstrated by comparing coarse-grained versions of the two data sets using averaging windows of different sizes. The overall correlation across the entire genome, as well as the median correlation across chromosomes, increases significantly as the window size increases. For example, the overall correlation increases from 0.48 to 0.75 as the window size increases from 50 bp to 500 bp (Supplementary Figure A.3).

Importantly, although micrococcal nuclease (MN) treatment is preferred for examining nucleosome positioning because it retains the single-nucleosome signal, sonication-based data allow better detection of NFRs because they are less complicated by the linker regions between nucleosomes, where the signal is much more smoothed (see Figure 3.1 for examples). This smoothing is not a problem for NFR detection because NFRs are long relative to linker regions, as exemplified in Figure 3.1. In conclusion, our data are of high quality, which allows a better quantification of NFRs than previous work, with both the positions and the DoND of NFRs provided by our SSMM.
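The coarse-graining comparison described in this chapter can be sketched as follows. This is a minimal illustration in plain Python, assuming two signals already aligned on the same probes; the function names, the synthetic signals, and the use of non-overlapping windows measured in probes rather than base pairs are assumptions, not the procedure actually used for Supplementary Figure A.3. It shows why averaging over larger windows raises the correlation between two noisy measurements of the same slowly varying signal.

```python
import math
import random
from typing import List, Sequence


def window_average(signal: Sequence[float], window: int) -> List[float]:
    """Average consecutive, non-overlapping windows of `window` observations."""
    if window < 1:
        raise ValueError("window must be at least 1")
    n_windows = len(signal) // window
    if n_windows == 0:
        raise ValueError("signal is shorter than one window")
    return [sum(signal[i * window:(i + 1) * window]) / window
            for i in range(n_windows)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coarse_grained_correlation(x: Sequence[float], y: Sequence[float],
                               window: int) -> float:
    """Correlation between the window-averaged versions of x and y."""
    return pearson(window_average(x, window), window_average(y, window))


# Synthetic example: one shared slow signal plus independent noise per data set.
random.seed(0)
n = 20000
shared = [math.sin(i / 200.0) for i in range(n)]
x = [s + random.gauss(0.0, 1.0) for s in shared]
y = [s + random.gauss(0.0, 1.0) for s in shared]

for window in (1, 12, 125):
    print(window, round(coarse_grained_correlation(x, y, window), 3))
```

In this toy setting the correlation rises with the window size because averaging suppresses the independent noise faster than it attenuates the shared signal.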
Figure 3.1: Comparison between the nucleosome occupancy data in this study and the data from Lee et al. [LTB07]. Two representative NFRs are shown in (a) and (b). In both cases, trimmed data from Lee et al. are shown in the upper panels. The rectangles at the bottom of the upper panels indicate the results from the HMM nucleosome calling algorithm [YLD05] as reported by Lee et al. [LTB07]. Dark and light green rectangles represent localized and delocalized nucleosome calls respectively; white spaces represent linker calls. The lower panels show the data from this study. The green broken lines are the SSMM output. Vertical broken lines show the start and end of the triangle/trapezoid patterns and the boundaries of NFRs. The absolute depletion (A) and relative depletion (R) are also annotated. The names of the ORFs in the region and their transcription directions are displayed at the bottom.

CHAPTER 4
Histone occupancy at different chromosome features

Previous studies have reported relatively low nucleosome levels in some chromosome features such as promoters and enhancers [BLH04, LSR04, YLD05, KL99, SFC06]. Here, we carried out a systematic study of the nucleosome occupancy level on all annotated chromosome features (Table 4.1). Chromosome features are defined by the Saccharomyces Genome Database (SGD) [CAB98] and include ORF (open reading frame), ARS (autonomously replicating sequence), rRNA, tRNA, snRNA (small nuclear RNA), snoRNA (small nucleolar RNA), long terminal repeats, telomeric elements, introns, and transposons. The comparison between chromosome features and intergenic regions was carried out by Student's t-test. One technical problem is that the test is biased by the high dependency between signals from adjacent probes due to their partial overlap.
Thus, for each feature, we randomly selected one probe from each instance of the feature and one probe from each intergenic region, forming two groups for the t-test. This was repeated 50 times, and the median of the t-statistics together with the median and quartiles of the corresponding p-values are reported in Table 4.1. Our study reveals that ORF, transposon, rRNA, and telomere (except the telomeric repeats, where there is no histone binding [WGZ92]) have a higher degree of nucleosome occupancy (footnote 1) than intergenic regions. tRNA and snoRNA have significantly lower histone levels than intergenic regions, presumably due to their intense transcriptional activity. Interestingly, introns and ARS also have low nucleosome occupancy.

Table 4.1: Nucleosome occupancies at chromosome features vs. intergenic regions. n is the total number of instances of a chromosome feature. n_median is the median number of probes selected; across the repeated random selections the number of probes selected may differ because different instances of one feature may overlap, so n_median may be smaller than n. t_median is the median of the t-statistics. p_median, p_25%, and p_75% are respectively the median, 1st quartile, and 3rd quartile of the t-test p-values. The features are ordered by the medians of the t-statistics.
Feature                          n     n_median  t_median  p_median  p_25%     p_75%
tRNA                             299   275       -43.63    1.9e-147  1.9e-149  7.5e-146
snoRNA                           75    75        -5.09     2.5e-06   1.4e-06   4.9e-06
intron                           367   334       -2.66     0.0081    0.0051    0.018
ARS                              248   248       -1.52     0.13      0.084     0.19
ncRNA                            9     8         -1.35     0.22      0.18      0.27
telomeric repeat                 31    31        -0.65     0.52      0.35      0.79
snRNA                            6     6         -0.34     0.75      0.49      0.83
X element combinatorial repeats  28    28        5.45      8.4e-06   2.8e-07   4.3e-05
ARS consensus sequence           66    32        6.14      7.7e-07   4.8e-07   1e-06
telomere                         32    32        8.84      4.2e-10   1.2e-11   3.6e-09
pseudogene                       21    21        10.22     1.8e-09   5.8e-10   7.7e-09
X element core sequence          32    32        10.97     1.7e-12   2.3e-14   2.6e-11
Y' element                       19    19        12.62     1.3e-10   3.2e-13   3.6e-08
rRNA                             27    25        13.02     8.6e-13   1.7e-14   4.6e-11
retrotransposon                  50    50        20.06     1.3e-26   1.6e-29   1.2e-24
transposable element gene        89    89        24.14     2.3e-44   1.2e-47   3.7e-41
ORF                              6604  6574      49.97     0         0         0

Footnote 1: We simply measure the degree of nucleosome occupancy by the log ratio of the ChIP-enriched signals versus the signals from the genomic control.

CHAPTER 5

NFR identification by SSMM

Initial inspection of the tiling array data revealed two types of NFR patterns: triangle (Figure 2.1 (a)) and trapezoid (Figure 2.1 (b)) patterns. We designed the four hidden states of our SSMM and the corresponding transition probabilities to capture both patterns. Our SSMM algorithm can be understood as a curve-fitting algorithm designed to capture specific patterns (see Methods section for details). SSMM fitting enables us to derive four quantitative features of each NFR: (1) the location, (2) the range (length), (3) the absolute DoND level ("absolute depletion"), and (4) the relative DoND level compared with its neighboring regions ("relative depletion"). The triangle pattern is a special case of the trapezoid pattern in which the bottom shrinks to a single point, so we only describe how to obtain the quantitative features for the trapezoid pattern, as illustrated in Figure 5.1. First, the location of a NFR is defined as the mid-point of its bottom.
Second, the range of a NFR is defined as the horizontal distance between the mid-points of its two opposite sides. The "absolute depletion" (A for short) is defined as the signal level (log ratio) at the bottom. The "relative depletion" (R for short) measures the signal decrease in the NFR compared with its neighborhood; it is defined as the difference between the lower signal level of the two neighboring regions and the signal level at the bottom.

Figure 5.1: Feature quantification of NFRs based on the trapezoid pattern. [Schematic: signal vs. chromosome position, with signal levels xLeft, xBottom, xRight and positions pLeft, pBottom1, pBottom2, pRight.] The triangle pattern is a special case of the trapezoid pattern, with pBottom1 = pBottom2. Four quantities can be estimated as follows: location, 0.5*(pBottom1 + pBottom2); "absolute depletion", xBottom; "relative depletion", min(xLeft, xRight) - xBottom; range, [0.5(pLeft + pBottom1), 0.5(pRight + pBottom2)].

Our initial model fitting led to the identification of 9593 NFRs in total, among which 35% are trapezoid patterns and 65% are triangle patterns. In the following analysis, we further selected NFRs based on absolute (A) or relative (R) DoND. A diagnostic scatter plot of absolute and relative depletion (Supplementary Figure A.4) shows that a simple linear relation R = −A captures the main pattern of the scatter plot well. Thus, for simplicity, we used a joint cutoff α (α < 0), requiring A < α and R > −α, as the primary criterion for selecting NFRs (see the caption of Figure 5.3). We obtained highly similar results with cutoffs based on relative depletion only or absolute depletion only (data not shown unless otherwise described). Figure 3.1 exemplifies how NFRs are identified by SSMM based on our tiling array data. The lower left panel shows an NFR located in the shared promoter region of two highly active genes, HHT2 and HHF2, which encode histones H3 and H4 respectively. The lower right panel shows another NFR in the promoter region of RPS17B, again an active gene, which encodes a ribosomal protein [HJW98].
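The feature definitions of Figure 5.1 and the DoND cutoff can be written out as a short sketch. It assumes that the SSMM fit has already produced the four breakpoint positions and three signal levels of one trapezoid; the `Trapezoid` container and the helper names are hypothetical, and the cutoff rule follows the convention stated in the caption of Figure 5.3 (A < α and R > −α, with α < 0).

```python
from dataclasses import dataclass


@dataclass
class Trapezoid:
    """Fitted trapezoid for one NFR (hypothetical container, names follow Figure 5.1)."""
    p_left: float     # position where the left edge starts
    p_bottom1: float  # position where the bottom starts
    p_bottom2: float  # position where the bottom ends (== p_bottom1 for a triangle)
    p_right: float    # position where the right edge ends
    x_left: float     # signal level (log ratio) of the left neighboring region
    x_bottom: float   # signal level at the bottom
    x_right: float    # signal level of the right neighboring region


def nfr_features(t):
    """Location, range, absolute depletion (A) and relative depletion (R)."""
    start = 0.5 * (t.p_left + t.p_bottom1)   # mid-point of the left side
    end = 0.5 * (t.p_right + t.p_bottom2)    # mid-point of the right side
    return {
        "location": 0.5 * (t.p_bottom1 + t.p_bottom2),
        "range": (start, end),
        "length": end - start,
        "absolute": t.x_bottom,
        "relative": min(t.x_left, t.x_right) - t.x_bottom,
    }


def passes_cutoff(features, alpha):
    """Joint DoND cutoff with alpha < 0: A < alpha and R > -alpha."""
    return features["absolute"] < alpha and features["relative"] > -alpha


# Toy trapezoid (values invented for illustration only).
example = Trapezoid(100, 300, 500, 700, x_left=0.2, x_bottom=-0.8, x_right=0.5)
feats = nfr_features(example)
print(feats, passes_cutoff(feats, alpha=-0.4))
```

A triangle pattern is handled by the same code by setting `p_bottom1` equal to `p_bottom2`.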
For comparison, in the upper left and right panels, we show the raw data and HMM calls from Lee et al. [LTB07]. The HMM outputs do not distinguish the NFRs from the linker regions and do not provide DoND information for NFRs. Both Yuan et al. [YLD05] and Lee et al. [LTB07] referred to all the regions that are not occupied by nucleosomes (well-positioned or delocalized) as linker DNA. We took a closer look at the probe intensity (the log ratio of nucleosome occupancy) within the linker DNA of Lee et al. Interestingly, we found that the positions with lower probe intensities are more likely to fall into the NFRs that we identified (Figure 5.2). For example, among the 314,457 linker probes with probe intensity higher than -1.0 (in log ratio), only 35.5% reside in NFRs, while among the 259,494 probes with nucleosome occupancy signal lower than -1.0, 64% reside in NFRs. This agrees well with the expectation that NFRs tend to have much lower nucleosome occupancy than inter-nucleosomal linker regions.

Effects of DoND on the distributions of NFRs

The heterogeneity of nucleosome depletion across chromosome features highlights the importance of DoND. The quantification of DoND for each NFR allowed us to closely examine the distributions of NFRs under different DoND cutoffs. Nucleosomes are often depleted in promoter and intergenic regions [BLH04, LSR04, SFC06], and our results confirmed these observations. We further showed that DoND is a critical determining factor for the distribution of NFRs. When using various cutoffs based on both R and A, we found that NFRs with higher DoND are more likely to be located in intergenic regions or upstream of coding regions (Figure 5.3). This is also true when using either relative depletion or absolute depletion alone as the cutoff (Supplementary Figures A.5-A.6).
In addition, those regions that are both intergenic and upstream of coding regions are more likely to contain NFRs (Chi-square test p-value < 5e−10 for any DoND cutoff from 0.2 to 1.0). NFRs with higher DoND are preferentially located in divergent intergenic regions, in which the neighboring genes share the same 5' upstream sequences. As the DoND increases, the proportion of NFRs located in divergent regions increases; the proportion for convergent regions, in which the neighboring genes share the same 3' downstream sequences, decreases; and the proportion for tandem intergenic regions, in which the neighboring genes are transcribed in the same direction, is roughly constant (Supplementary Figure A.7).

Figure 5.2: Comparison of probe intensity (raw data of nucleosome occupancy, i.e., log(ratio) from Lee et al. [LTB07]) between linker probes within NFRs and linker probes outside NFRs. [Density plot titled "Probe intensity in linker regions": x-axis, probe intensity (log(ratio)); y-axis, density; one curve for NFR and one for non-NFR probes.] Whether a probe is in a linker DNA region is inferred by the HMM in the work of Lee et al. [LTB07]. Whether a region is a NFR is inferred by our SSMM algorithm in this study.

Our data also showed that the occurrences of NFRs are related to DNA properties. Furthermore, these relationships are strengthened as DoND increases.
For instance, the proportion of NFRs within TATA box-containing promoters increases as the DoND increases (Supplementary Figure A.8), which is consistent with a prediction made by Segal et al. [SFC06] using a computational model. Previous work has shown that TFBSs are over-represented in nucleosome-depleted promoters [BLH04]. We further demonstrated that the proportion of NFRs harboring TFBSs increases as DoND increases (Supplementary Figure A.9).

Figure 5.3: The proportion of NFR patterns located in intergenic regions, within 500bp upstream of ORFs, or in either region, given different cutoffs of DoND. Specifically, a cutoff α (α < 0) indicates that the absolute depletion is smaller than α and the relative depletion is larger than −α. The dashed line indicates the total number of NFR patterns at each cutoff (corresponding to the axis on the right side). [Plot: x-axis, Absolute Depletion (A) / −Relative Depletion (−R), from −1.0 to 0.0; left y-axis, proportion of NFR patterns; right y-axis, total number of NFR patterns; count labels on the dashed line, in decreasing order: 7442, 5766, 3431, 2519, 1863, 1332, 931, 674, 511, 372, 295.]

Effects of DoND on the lengths of NFRs

All previous studies of the likely locations and lengths of NFRs have been carried out only at the ensemble level, by averaging a large number of nucleosome occupancy curves in gene promoter regions. Yuan et al. [YLD05] reported a consensus NFR of ~150bp located 200bp upstream of ORF start codons. Lee
However, although the specific locations and lengths of NFRs should vary from gene to gene, neither paper provided such important information. In our data, with the NFRs characterized individually, we can easily examine the location and length of each NFR in more details rather than merely conveying the “average” pattern. This is reported next. We found a total of 3448 NFRs located in promoter regions of 3447 distinct ORFs (within 500 upstream of ORF). Among them, we obtained the TSS locations for 2601 ORFs [DHG06] and used them to examine the relative locations of the corresponding NFRs. Consistent with the aforementioned literature, the centers of these NFRs were found to locate about 100-200bp upstream of TSSs. More interestingly, NFRs with higher DoND are located further away from TSSs (i.e., shifted to 5’ direction) (Figure 5.4(a)). We found that the start points (5’) of NFRs are located 350-450 upstream of TSSs, and the end points (3’) of NFRs are located primarily at 100bp downstream of TSSs. The distribution of NFR (5’/3’) boundaries also shift to 5’ direction as DoND increases (Figure 5.4(b)). Similar patterns (but with more variations) are observed when we used the locations of the start codons of the ORFs instead of TSSs to study the relative locations of NFRs (data not shown). We next examined the lengths of NFRs. The majority of the NFRs within promoter regions were found to have lengths around 500bps and the peak of the distribution becomes even more significant when DoND increases (Figure 5.4(c)). The length distribution for all NFRs (including those outside of the promoter regions) is similar (data not shown). We conclude that NFRs have a typical length of 400-600 bps, which is approximately the length of DNA wrapping around 2-3 nucleosomes plus linkers. 
Geometrically speaking, because of the variation in the 25 Density 0.0000 0.000 −800 −600 −400 −200 0 NFR center relative to TSS 200 A<0&R>0 A < −0.2 & R > 0.2 A < −0.4 & R > 0.4 A < −0.6 & R > 0.6 0.0020 5' boundary 3' boundary 0.0010 0.004 (c) 0.002 Density A<0&R>0 A < −0.2 & R > 0.2 A < −0.4 & R > 0.4 A < −0.6 & R > 0.6 0.000 Density (b) 0.002 0.004 (a) −1000 −600 −200 0 200 NFR boundaries relative to TSS 0 500 1000 1500 Length of NFR Figure 5.4: The distributions of locations and lengths of NFRs within promoter regions (500bp upstream of ORFs) according to different cutoffs of DoND (A: absolute depletion, R: relative depletion). Here we use transcription start sites (TSSs) as references of the NFR locations. Supplementary Figure 16 shows similar plots using ORFs as reference locations. (a) The distributions of NFR centers relative to TSSs. (b) The distribution of NFR boundaries relative to TSSs. The color codes are the same as in (a). (c) The distribution of NFR lengths. locations and lengths of individual curves, most individual NFRs have to be much longer than the length of the consensus pattern of nucleosome depletion which is derived by curve averaging (c.f. aforementioned results of [YLD05, LTB07]). 26 CHAPTER 6 Factors of nucleosome depletion: transcriptional activity versus DNA affinity for histones One of the long-standing puzzles in chromatin studies has been the mechanism by which histones are evicted from NFRs. At least two major driving forces have been reported: the transcriptional activity [BLH04, LSR04, PHL05, YLD05] and the DNA affinity for histones [FSH93, LG92, MCS00, SMS05, SFC06]. However, the relative effects of these two factors and their relationship remain unclear. We seek to address these questions based on the NFRs identified by our SSMM. It is worth to emphasize that the linear regression and correlation analysis below, which provide estimations of effect sizes, however are not simply equivalent to causal inference. 
We first performed a genome-wide RNA Polymerase II (Pol II) binding ChIP-chip experiment to measure transcriptional activity (see Methods). We then examined the Pol II binding level around each NFR. We found that the DoND of NFRs correlates best with the Pol II binding level in the regions neighboring the NFRs rather than within the NFRs themselves (Supplementary Table A.4). This is presumably because NFRs typically occur at gene promoters, as shown before, while the strongest Pol II binding occurs in the neighboring coding regions. Based on this observation, we averaged the Pol II binding signals within 1kb upstream and within 1kb downstream of a NFR, and used the higher of the two averages to represent local transcriptional activity, considering the bidirectionality of NFRs. Using 500bp upstream/downstream regions yielded similar results (data not shown). We computed the DNA affinity for histones within a NFR from published data [SFC06], using the average DNA affinity (measured as nucleosome occupancy probability) across all the nucleotides in the NFR. We found no strong correlation between DNA affinity for histones and Pol II binding in any of the cases we considered (correlations between −0.088 and 0), indicating that the effect of DNA affinity on transcriptional activity, if any, is undetectable from the data. The scatter plots of the two factors did not show any obvious non-linear relation either (data not shown). Initial examination of all 9593 NFRs by an additive multiple regression model confirmed that both factors have significant effects on nucleosome depletion in NFRs, while the effect of transcriptional activity is dominant overall. For example, if we use absolute depletion to measure DoND, 12.6% of the total variance of DoND can be explained by DNA affinity for histones and transcriptional activity together.
The majority (90.4%) of the explained variance is attributed to transcriptional activity, only about 7.6% is attributed to DNA affinity for histones, and less than 2% is shared between the two factors (Figure 6.1, Supplementary Table A.5). Similar conclusions can be drawn using relative depletion to measure DoND (Supplementary Table A.6). We further examined the effects of these two factors in different NFR subgroups: NFRs located in intergenic regions or 500bp upstream of ORFs (Inter/Up) or in other genomic regions (Others); NFRs located in convergent (Converge), tandem (Tandem), and divergent (Diverge) intergenic regions; NFRs located in TATA-containing or TATA-less promoters (500bp upstream of ORFs); and TFBS-containing or TFBS-less NFRs (Figure 6.1, Supplementary Tables A.5-A.6).

Figure 6.1: The effects of the two factors affecting nucleosome occupancy. [Two panels over the NFR groups All, Inter/Up, Others, Converge, Tandem, Diverge, TATA, TATA-less, TFBS, TFBS-less; upper panel y-axis, R2; lower panel y-axis, correlation.] The upper panel shows the total R2 (percentage of variance) that can be explained by DNA affinity for histones and transcriptional activity (Pol II binding), which is further divided into three parts: the part explained by DNA affinity for histones (R2_DNA), the part explained by Pol II binding (R2_PolII), and the part explained by both (R2_both). Part of the variance can be explained by both factors because a weak correlation exists between them, as shown in the lower panel.

Several interesting results were revealed when we compared the contributions of Pol II binding and DNA affinity. (1) While both Pol II and DNA affinity affect intergenic/promoter regions as expected, Pol II has an overall larger effect than DNA affinity for histones, which indicates that Pol II is a major determining factor of nucleosome depletion in intergenic/promoter regions.
(2) Pol II binding explains more variance in regions with higher transcriptional activity, such as divergent and tandem intergenic regions. In contrast, DNA affinity has a larger effect in regions with less transcriptional activity, such as convergent intergenic regions. For example, when relative depletion was used as the measurement of DoND, DNA affinity accounts for 9%, 24.4%, and 78.7% of the total variance explained by the two factors in divergent, tandem, and convergent intergenic regions respectively (Supplementary Table A.6). (3) The DNA affinity effect increases dramatically in NFRs that contain a TATA box or a TFBS. For example, the DNA affinity contribution is 28 times larger in TFBS-containing NFRs than in TFBS-less NFRs, and 6 times larger in NFRs within TATA-containing promoters than in NFRs within TATA-less promoters (using absolute depletion, see Figure 6.1 and Supplementary Table A.5). In contrast, the Pol II binding effect remains similar (less than a 2-fold change) regardless of the presence of a TFBS or TATA box (Supplementary Table A.5). Using relative depletion yields similar results (Supplementary Table A.6). Interestingly, among the TFBS- or TATA box-containing genes whose nucleosome depletion is dominated by DNA sequence properties, a number are stress response genes. For example, GAC1, which is repressed in rich medium and induced upon diauxic transition when glucose is limited [PEP99], is depleted of nucleosomes at its promoter (Figure 6.2). YMR279C, a gene that is activated upon heat stress [STK03], also loses histones from its promoter despite the fact that it is not transcribed (Figure 6.3). The promoter regions of both genes have been predicted to contain sequences that are poorly bound by nucleosomes [SFC06]. It is known that TATA box-containing genes are highly enriched for stress-response genes [BZP04]. This indicates that the DNA sequences in the promoters of these genes could have evolved to have strong repulsion against histones.
Therefore, the histones may be pre-cleared from these promoters prior to the entry of the transcription machinery, presumably to allow the rapid binding of TBP (TATA-box binding protein) under environmental stress. A similar mechanism could apply to TFBS-containing genes. The low nucleosome occupancy on the TATA box has been reported to be encoded by intrinsic DNA sequence features [SFC06, IAZ06]. Histone depletion around TFBSs has also been reported by several labs [BLH04, YLD05, SFC06, LTB07]. However, it has remained unclear whether the low histone occupancy in TATA-box or TFBS-containing genes is associated with active Pol II. Our study addresses this question by showing that low histone occupancy still exists even after excluding the transcription effect.

Figure 6.2: Histones are depleted from the promoter of gene GAC1 prior to its activation. This figure shows the RNA polymerase II binding (log ratio from our ChIP-chip results), the DNA affinity for histones (posterior probability of histone binding from Segal et al. [SFC06]), and the nucleosome occupancy (log ratio from our ChIP-chip results) around gene GAC1 in rich medium. [Three tracks over chr15, 668000-672000 bp; annotated genes: GAC1, SYC1.]

Figure 6.3: Histones are depleted from the promoter of gene YMR279C prior to its activation. This figure shows the RNA polymerase II binding (log ratio from our ChIP-chip results), the DNA affinity for histones (posterior probability of histone binding from Segal et al. [SFC06]), and the nucleosome occupancy (log ratio from our ChIP-chip results) around gene YMR279C. [Three tracks over chr13, 825000-827500 bp; annotated genes: YMR279C, CAT8.]
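The split of explained variance into R2_DNA, R2_PolII and R2_both used in this chapter (Figure 6.1) can be illustrated with a small sketch. The exact partitioning procedure is not spelled out in the text, so the code below assumes a standard commonality-style decomposition for an additive two-predictor linear model: the unique share of a factor is the drop in R2 when that factor is removed, and the shared part is the remainder. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partition_r2(dond, dna_affinity, pol2):
    """Partition R2 of dond ~ dna_affinity + pol2 (assumed commonality-style split).

    Uses the closed-form R2 of a two-predictor least-squares fit with intercept.
    The shared part can be negative when the predictors act as suppressors.
    """
    r_dna = pearson(dond, dna_affinity)
    r_pol = pearson(dond, pol2)
    r_12 = pearson(dna_affinity, pol2)
    r2_full = (r_dna ** 2 + r_pol ** 2 - 2 * r_dna * r_pol * r_12) / (1 - r_12 ** 2)
    r2_dna = r2_full - r_pol ** 2   # unique to DNA affinity
    r2_pol = r2_full - r_dna ** 2   # unique to Pol II binding
    r2_both = r2_full - r2_dna - r2_pol
    return {"total": r2_full, "R2_DNA": r2_dna, "R2_PolII": r2_pol, "R2_both": r2_both}


# Toy data (invented): two uncorrelated predictors, response dominated by the second.
dna = [1, 1, -1, -1]
pol = [1, -1, 1, -1]
dond = [3 * p + 1 * d for p, d in zip(pol, dna)]
print(partition_r2(dond, dna, pol))
```

With uncorrelated predictors the shared part vanishes; a weak correlation between the two factors, as reported in the lower panel of Figure 6.1, is what produces a small non-zero shared component.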
CHAPTER 7

Discussions

We have proposed an algorithm to detect segmental patterns using high resolution tiling array data. In contrast to many algorithms designed to detect the binding of TFs, our SSMM algorithm is especially useful for characterizing segmental features, which are commonly observed in epigenetic studies. We have applied this algorithm to genome-wide nucleosome occupancy data and identified NFRs across the entire yeast genome at a resolution of 4bp. The location and DoND (Figure 5.1) were quantified for each NFR. We showed that DoND, as measured by SSMM, greatly affects the distributions of NFRs. We also studied the relative impacts of the transcription machinery and DNA sequence in evicting histones from NFRs. The DoND measure we introduced plays a key role in formulating this biological question mathematically. A genome-wide RNA Polymerase II (Pol II) binding ChIP-chip experiment was used to measure transcriptional activity. We showed that Pol II and DNA play distinct roles in different types of NFRs. Our study is a novel example of a genome-wide investigation that combines Pol II binding, genetic code and epigenetic information to address biological questions. It is interesting to take a close look at the NFRs under various cutoffs of the DoND measures. For example, with a stringent cutoff (R > 0.4 and A < −0.4), we obtained 1863 NFRs. We found that although the majority of these NFRs are within intergenic regions or 500bp upstream of coding regions, there are still 145 NFRs (7.8%) located in other regions (Supplementary Table A.7). Among them, 52 are associated with tRNAs, which can be explained by the high transcription rate of tRNAs. Another 16 fall into ARS regions, which raises the interesting possibility that NFRs are involved in DNA replication. We also found 58 NFRs located in 58 distinct ORFs (12 uncharacterized, 21 dubious, and 25 verified ORFs, Supplementary Table A.8-A.9).
One explanation for the presence of NFRs in these genic regions is that these NFRs may harbor regulatory regions for neighboring genes that are located more than 500bp away. Alternatively, these regions could affect the genes in which they reside. For example, they may be cryptic transcription start sites within coding regions that allow transcription to initiate there under certain conditions [KLW03, LGC07]. The functional roles of these NFRs within genic regions are worthy of further study. Our NFR calling algorithm is based on SSMM, an extension of HMM. Compared to HMM, SSMM is more appropriate for high resolution tiling array data because of its flexibility in modeling state duration and the emission probability distribution. In addition, our SSMM algorithm is computationally more efficient than standard HMM/SSMM because we first group the data into bins and estimate parameters by the dynamic programming algorithm (Viterbi algorithm) instead of the EM algorithm (Baum-Welch algorithm) [Rab89] (see Methods for details). As a curve fitting method, our SSMM algorithm has at least two advantages: first, it uses the transition matrix to restrict the shape of the detected patterns; second, it chooses "changing points (knots)" in a simple and automatic manner. Our algorithm allows different segmental models to be chosen depending on the research interest, which makes it general enough to be adapted to many other types of high density tiling array data, for instance DNA replication data [WFB04, WES04, WBI05] and chromosome translocation data [TME07, GKB07, KIM07].

APPENDIX A

Supplementary Materials

A.1 Signal of TFBS vs. signal of NFR

This figure compares the signal patterns of TFBS and NFR from tiling array data.
Figure A.1: Comparison of TF binding signal [PHL05] and nucleosome occupancy signal (this study). [Two scatter-plot panels, (a) and (b); y-axes: GCN5 binding signal and nucleosome occupancy; x-axes: chromosome location (Chr5, kb) and chromosome location (Chr11, kb).]

A.2 Algorithm of Segmental Semi-Markov Model

A.2.1 Segmental model fitting and emission probability calculation

The Segmental Semi-Markov Model (SSMM) has been used in speech recognition [Rab89, ODK96]. We modified the conventional SSMM algorithm mainly in two aspects:

1. We organized the input data in a hierarchical structure, probe → bin → segment, so that we do not require the time points to be strictly evenly spaced.

2. We required the continuity of consecutive segmental models.

We denote the parameters of a segmental semi-Markov model as $\Lambda = \{\pi, A, L, D, d_i(\cdot), e(\cdot)\}$, which include 6 components:

1.
$\pi$: the initial probability, $\pi(i) = P(q_1 = s_i)$, where $1 \le i \le I$, $I$ is the total number of states, $q_t$ indicates the state at time $t$, and $s_i$ indicates the $i$th possible state.

2. $A = \{a_{ij}\}$: the transition probability, $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$.

3. $L$: the length of each bin.

4. $D$: the maximum duration.

5. $d_i(\cdot)$: the density of the duration of state $i$.

6. $e(\cdot)$: the emission probability.

We denote the total number of bins as $Z = \mathrm{ceil}(n/L)$, where $n$ is the total number of data points (probes). We use $\{t_{s_1}, \ldots, t_{s_Z}\}$ to indicate the start of each bin and $\{t_{e_1}, \ldots, t_{e_Z}\}$ to indicate the end of each bin. We use seg(i, p, q) to indicate a segment from bin $p$ to bin $q$ (including $p$ and $q$) with underlying state $i$. Without the requirement of continuity, previous works on SSMM defined the emission probability of one segment seg(i, p, q) as

\[ e_i(p, q) = P(x_{s_p}, x_{s_p + 1}, \ldots, x_{e_q} \mid q(p, q) = s_i), \]

where $q(p, q)$ are the states from bin $p$ to bin $q$. In this study, we use a linear model as the segment model, and require the linear model for segment seg(i, p, q) to pass through the predicted end point of the previous segment seg(j, o, p − 1). Thus we define the emission probability conditional on the previous segment as well:

\[ e_{ji}(o, p, q) = P(x_{s_p}, x_{s_p + 1}, \ldots, x_{e_q} \mid q(p, q) = s_i, \; q(o, p-1) = s_j), \]

where $1 \le o \le p - 1$.

Our SSMM model for NFR detection is illustrated by Figures 2.1-2.2 in the main text. We use a linear model as the segmental model, and we list below additional details of segmental model fitting and emission probability calculation specifically designed for NFR detection.

1. Given the end point of the previous segment, the linear models for states 1 and 2, which are horizontal lines, are already determined. Thus there is no need for model fitting.

2. Given the end point of the previous segment, denoted by $(t_{prev}, x_{prev})$, the linear model for states 3 and 4 is $(x_w - x_{prev}) = b\,(t_w - t_{prev})$, where $w$ indexes the observations in the current segment of state 3 or 4. The coefficient $b$ can be estimated by the least squares method.
Specifically,

\[ b = \frac{\sum_w (x_w - x_{prev})(t_w - t_{prev})}{\sum_w (t_w - t_{prev})^2}. \]

3. We denote the residuals as $r_w = x_w - \hat{x}_w$, where $\hat{x}_w$ is the fitted value from the segmental (linear) model. The emission probability (likelihood) of the segment is calculated by assuming that the residuals follow a normal distribution with mean 0 and the maximum likelihood estimate of the variance $\hat{\sigma}^2 = \sum_w r_w^2 / n_{seg}$, where $n_{seg}$ is the number of observations in the segment.

4. The segmental models of states 3 and 4 have one more parameter, the slope, than those of states 1 and 2. A penalized likelihood should therefore be calculated, e.g., AIC or BIC. In this study, we used BIC.

5. Triangle/trapezoid patterns with very small slopes on the two edges are frequently caused by data noise and are not of interest to us. They may also cause over-fitting. Thus, if the absolute value of an estimated slope is smaller than 0.001, we force it to be -0.001 or 0.001 in order to calculate the emission probabilities for state 3 or 4 respectively.

A.2.2 Algorithms

Analogous to the algorithms for a regular HMM, we present the following four algorithms for SSMM: Viterbi, forward, backward, and posterior probability. The Viterbi algorithm finds the most likely complete path, while the forward and backward algorithms together give the posterior probability of the state from which each bin is emitted. The Viterbi algorithm is favored in our model fitting for the following reasons. One of the challenges in our model fitting is that estimating the emission probability $e_{ji}(o, p, q)$ between bins $p$ and $q$ requires knowledge of the end point of the previous segmental model, from bin $o$ to $p - 1$. This is easily obtained in the Viterbi algorithm, since when we calculate the emission probability of one segment, the most likely path through the previous segments is already known. However, in the forward and backward algorithms, we only know that the previous segment runs from bin $o$ to $p - 1$ and corresponds to state $s_j$, i.e., seg(j, o, p-1). The calculation of the
The calculation of the 39 end point of seg(j, o, p-1) requires knowledge of where the segment before bin o ends, which is unknown. This however can be solved without requiring the continuity of final model fitting. The starting point of seg(j, o, p-1) can be set to be free allowing calculation of its end point, which can be used as the start point of the segment seg(i, p, q). Even though, another limitation of the forward-backward algorithm is that it takes much more computation time than Viterbi algorithm, which is critical for high resolution tiling array data analysis. For instance, the Viterbi algorithm is 34 times faster than forward-backward algorithm for the model fitting of a 3000-probe segment(on a 2GHz Intel Core Duo MacBook Pro, 1GB RAM). Given that it takes around 1 day for the Viterbi algorithm to finish all the model fitting and parameter estimations for the entire genome, approximately one month is needed for forward-backward algorithm. The following algorithms are implemented in a R package ss.hmm, which can be freely downloaded at http://www.bios.unc.edu/∼wsun/software.htm. In order to avoid underflow, we carried out all the calculations in log scale. A function logsumexp(v) is used during the calculation: logsumexp(v) = log k X ! exp(vi ) (A.1) i=1 where v = {v1 , v2 , ..., vk } is a vector. A.2.2.1 Viterbi Input X = {x1 , x2 , ..., xn }, T = {t1 , t2 , ..., tn } and parameters Λ = {π, A, L, D, di (·), e(·)}, where X are observations and T are the corresponding time. Ouput path(t): the most probable path along time T . 40 Intermediate Variables p(k, i): the maximum probability that state i ends at bin k, log.p(k, i) = log(p(k, i)). dura(k, i): the duration of state i that ends at bin k. prev(k, i): the state that is before the state i, which ends at bin k. Algorithm 1. 
Calculate intermediate variables For the first bin, k = 1, p(1, i) = πi di (1)e(i, 1, 1) log.p(1, i) = log(πi ) + log(di (1)) + log(ei (1, 1)) dura(1, i) = 1 (A.2) (A.3) (A.4) For k > 2, use d to indicate the duration of state i, 1 6 d 6 min(k − 1, D), for a specific duration d and a previous state j, the start point of previous segment is k 0 = k − d − dura(k − d, j) + 1: p(k, i, d, j) = p(k − d, j)aji di (d)eji (k 0 , k − d + 1, k) (A.5) log(p(k, i, d, j)) = log(p(k − d, j)) + log(aji ) + log(di (d)) + log(eji (k 0 , k − d + 1, k)) (A.6) If k 6 D, it is possible that state i begins from the first time point, then p(k, i, d = k, j = N U LL) = πi di (k)e(i, 1, k) (A.7) log(p(t, i, d = k, j = N U LL)) = log(πi ) + log(di (k)) + log(ei (1, k)) (A.8) 41 Then we can calculate the best path ended at time k, state i by p(k, i) = max p(k, i, d, j) (A.9) log.p(k, i) = max log(p(k, i, d, j)) (A.10) d,j d,j dura(k, i) = argmaxd log(p(k, i, d, j)) (A.11) prev(k, i) = argmaxj log(p(k, i, d, j)) (A.12) 2. Trace back the best path path(Z) = which.max(log.p(Z, )) (A.13) then find the previous segment that corresponds to state prev(Z, path(Z)) and ends at time Z − dura(Z, path(Z)). Keep recurring to find the entire path. A.2.2.2 Forward Input X = {x1 , x2 , ..., xn }, T = {t1 , t2 , ..., tn } and parameters Λ = {π, A, L, D, di (·), e(·)}. Ouput The forward probabilities for state i from bin p to q: f (i, p, q) = P (x1 , ..., xeq , q(p, q) = si |Λ), where 1 6 p 6 q 6 Z Algorithm Initialization p = 1, q = {1, ..., min(Z, D)}: f (i, 1, q) = π(i)di (q)e(i, 1, q) log(f (i, 1, q)) = log(π(i)) + log(di (q)) + log(e(i, 1, q)) 42 (A.14) (A.15) Recursion p = {2, ..., Z}, and for each p, q = {p, ..., min(p + D − 1, Z)}, o = {max(1, p − D), ..., p − 1}. 
"" f (i, p, q) = # X X j6=i o # f (j, o, p − 1)eji (o, p, q) aji di (q − p + 1) (A.16) log(f (i, p, q)) = logsumexpj6=i [logsumexpo [log(f (j, o, p − 1)) + log(eji (o, p, q))] + log(aji )] + log(di (q − p + 1)) A.2.2.3 (A.17) Backward Input X = {x1 , x2 , ..., xn }, T = {t1 , t2 , ..., tn } and parameters Λ = {π, A, L, D, di (·), e(·)} Ouput The backward probabilities for state i from bin p to q: b(i, p, q) = P (xsq +1 , ..., xn |q(p, q) = si , Λ), where 1 6 p 6 q 6 Z Algorithm Initialization b(i, p, Z) = 1 (A.18) log(b(i, p, Z)) = 0 (A.19) Recursion q = {Z − 1, ..., 1}, and for each q, p = {max(1, q − D + 1), ..., q}, r = {q + 43 1, ..., min(q + D, Z)}. " b(i, p, q) = X " aij ## X eij (p, q + 1, r)dj (r − q)b(j, q + 1, r) (A.20) r j6=i log(b(i, p, q)) = logsumexpj6=i [log(aij ) + logsumexpr [log(eij (p, q + 1, r)) + log(dj (r − q)) + log(b(j, q + 1, r))]] A.2.2.4 (A.21) Posterior Probability Calculate the posterior probability based on forward and backward algorithm. Input forward probability {f (i, u, v)} and backward probability {b(i, u, v)}, where i (1 6 i 6 m) indicates the state and u and v (1 6 u 6 v 6 Z) indicate the bins. Ouput pi (k): posterior probability P (q(k) = si |X, Λ), where q(k) indicates state of the k-th bin. Algorithm pi (k) = P (q(k) = si |X, Λ) = P (q(k) = si , X|Λ) P (X|Λ) ∼ P (q(k) = si , X|Λ) XX = f (i, u, v)b(i, u, v) u (A.22) v where max(t − D + 1, 1) 6 u 6 k and k 6 v 6 min(u + D − 1, Z). log(pi (k)) = log(P (q(k) = si |X, Λ) = logsumexpu [logsumexpv [log(f (i, u, v)) + log(b(i, u, v))]] (A.23) 44 Our SSMM model for NFR detection is illustrated by Figure 3, 4 in the main text. Additional details of emission probability calculation specifically designed for the application in nucleosome free region (NFR) detection are listed below. 1. The segmental models of state 3 and 4 have one more parameter than state 1 or 2, the slope. Penalized likelihood should be calculated, e.g., AIC or BIC. In this study, we used BIC. 2. 
The triangle/trapezoid patterns with ”flat” slopes on the two edges are not of interest to us as they could be simply caused by array noise. They may also cause over-fitting. Thus if the absolute value of an estimated slope is smaller than 0.001, we forced it to be -0.001 or 0.001 in order to calculate the emission probabilities for state 3 or 4 respectively. A.2.3 Parameter estimation We need to estimate the transition probabilities from state 3 to state 2/4 (other transition probabilities are fixed as 0 or 1), and the probability distributions of state durations (See main text Figure 3 for description of the states). For HMM, parameters are usually estimated by Baum-Welch algorithm (an EM algorithm) [Rab89, DEK98]. However, as explained in section 1.2, this EM algorithm cannot be applied to our SSMM because we require the continuity of the fitted curve. Thus we use Viterbi algorithm [Rab89, DEK98] for parameter estimation. With one set of initial parameters, we can generate the most likely path, which is used to update the parameter estimations, and iterate until the parameter estimations converge. The most likely path at convergence is our curve fitting result. The difference between the algorithm we used and Baum-Welch algorithm is analogous to the difference between (hard) K-mean clustering and soft K-mean 45 (EM) clustering. For hard K-mean, one point is assigned to the most likely cluster, while for soft K-mean, posterior probabilities of cluster memberships are estimated. Similarly, in our SSMM algorithm, we assume one bin is emitted from the most likely state, while in Baum-Welch algorithm, posterior probabilities of underlying states are used. The Forward-backward algorithm can be implemented if we do not require the continuity of the fitted curve; however it takes much more time than using Viterbi algorithm (see Supplementary Materials 2.2 for details). 
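The update step of this Viterbi-style training can be sketched as follows. This is only an illustration under stated assumptions, not the ss.hmm implementation: the function name `reestimate_from_path` is hypothetical, the decoded path is assumed to be given as one state label per bin, and durations are simply tabulated as raw counts (the text does not say whether any smoothing is applied).

```python
from collections import Counter, defaultdict
from itertools import groupby


def reestimate_from_path(path):
    """Hard-assignment update from one decoded path (one state label per bin).

    Hypothetical helper: returns estimated transition probabilities and
    per-state duration counts tabulated from the path.
    """
    # Collapse the per-bin path into (state, duration-in-bins) segments.
    segments = [(state, sum(1 for _ in run)) for state, run in groupby(path)]

    durations = defaultdict(Counter)    # durations[state][d] = number of segments
    transitions = defaultdict(Counter)  # transitions[i][j] = number of i -> j moves
    for state, d in segments:
        durations[state][d] += 1
    for (i, _), (j, _) in zip(segments, segments[1:]):
        transitions[i][j] += 1

    # Normalise each row of transition counts into relative frequencies.
    trans_prob = {
        i: {j: n / sum(row.values()) for j, n in row.items()}
        for i, row in transitions.items()
    }
    return trans_prob, durations


# Toy decoded path; the state order 1 -> 3 -> (2 ->) 4 -> 1 follows the
# transition structure described in the text.
example_path = ([1] * 6 + [3] * 3 + [2] * 4 + [4] * 3 +
                [1] * 6 + [3] * 3 + [4] * 3)
a, dur = reestimate_from_path(example_path)
print(a[3])    # relative frequencies of 3 -> 2 and 3 -> 4
print(dur[3])  # duration counts for state 3
```

In a full training loop these estimates would be fed back into the next Viterbi pass until they stop changing; that outer loop is omitted here because it depends on the segmental emission model.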
We do not wish to place restrictions on the distribution functions of the durations or on the transition probabilities, so we start with uniform distributions. The initial transition matrix is:

\[ \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0.5 & 0 & 0.5 \\ 1 & 0 & 0 & 0 \end{pmatrix} \]

where the number in the $i$-th row and $j$-th column is the transition probability from state $i$ to $j$: $a_{ij}$. The duration of each state is counted in "bins" (each bin covers 50bp). The initial distributions of durations for states 1 to 4 are uniform(6,100), uniform(3,30), uniform(3,50), and uniform(3,50), respectively. The only restriction here is the ranges. We do not allow very short durations in order to avoid over-fitting. The maximum durations are set large enough to cover all possible durations. At convergence, the transition probabilities are $a_{32} = 0.37$ and $a_{34} = 0.63$. Figure A.2 shows the distribution of state durations at convergence of parameter estimation, where the X-axis is the duration in base pairs and the Y-axis is the frequency.

[Figure A.2 plot: four histograms, one panel each for the duration of states 1 to 4; x-axis: duration; y-axis: frequency.]
Figure A.2: Distribution of state durations after convergence

A.3 Validation of the raw data

First we compared our nucleosome occupancy data with other published genome-wide data. Because our data has the highest resolution, we calculate Pearson's correlation between each other dataset and the corresponding coarse-grained version of our data. Tables A.1-A.3 list the correlations between our data and the data from Bernstein et al. [BLH04], Lee et al. [LSR04], and Pokholok et al. [PHL05], respectively. In each table, rows correspond to the mean or median versions of our data, and columns correspond to measurements in the other dataset. We compare our data with the data from Lee et al. [LTB07] in more detail, as both datasets cover the whole yeast genome at a high resolution of 4 bp.
Figure A.3 shows the overall correlation between the two datasets and the individual correlations for each of the 16 chromosomes, which suggests high consistency between the two datasets.

Table A.1: Correlations between our data and the data by Bernstein et al. Bernstein et al. [BLH04] studied the nucleosome occupancy in ~6000 intergenic/promoter regions in the yeast genome for both H2B and H3. The lengths of these intergenic regions range from 60bp to 1594bp with median 370bp. The data was downloaded from the website http://www.broad.harvard.edu/chembio/lab_schreiber/index.html. The intergenic region annotation was downloaded from SGD (genome-ftp.stanford.edu).

          H3      H2B     Average (H3, H2B)
Mean      0.68    0.56    0.66
Median    0.67    0.56    0.66

Table A.2: Correlations between our data and the data by Lee et al. Lee et al. [LSR04] examined the nucleosome occupancy for both H3 and H4 in ~12000 intergenic regions and ORFs, of which the lengths vary from 51bp to 12280bp with median 611bp. The data was downloaded from GEO GSE4727.

          MycH4   H3      Average (MycH4, H3)
Mean      0.75    0.71    0.78
Median    0.74    0.70    0.77

Table A.3: Correlations between our data and the data by Pokholok et al. Pokholok et al. [PHL05] performed ChIP-chip experiments for H3 and H4 using 60mer Agilent DNA microarrays, which have ~41000 probes covering 85% of the yeast genome. Data was downloaded from http://web.wi.mit.edu/young/index.html.

          H4      H3      Average (H4, H3)
Mean      0.68    0.48    0.62
Median    0.68    0.48    0.62

[Figure A.3 plot: x-axis: window size (50 to 1000); y-axis: correlation (0.5 to 0.8).]
Figure A.3: Correlations between the nucleosome occupancy data in this study and the data from Lee et al. We averaged the data across non-overlapping windows of a given size and then calculated Pearson correlations using the average values. The orange dots are overall correlations across the whole genome. Each boxplot illustrates the distribution of the 16 correlations corresponding to the 16 chromosomes.
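The windowed comparison described in the caption can be sketched as below. This is a minimal illustration, not the code used for the thesis: the function names are hypothetical, windows are assumed to be indexed by integer division of the genomic position, and only windows covered by both tracks are compared.

```python
from math import sqrt


def window_means(positions, values, window):
    """Average `values` over non-overlapping windows of `window` bp.

    Returns a dict mapping window index (position // window) to the mean.
    """
    sums, counts = {}, {}
    for pos, val in zip(positions, values):
        w = pos // window
        sums[w] = sums.get(w, 0.0) + val
        counts[w] = counts.get(w, 0) + 1
    return {w: sums[w] / counts[w] for w in sums}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def windowed_correlation(pos_a, val_a, pos_b, val_b, window):
    """Correlate two tracks after averaging each over the same windows."""
    a = window_means(pos_a, val_a, window)
    b = window_means(pos_b, val_b, window)
    shared = sorted(set(a) & set(b))  # windows present in both tracks
    return pearson([a[w] for w in shared], [b[w] for w in shared])
```

Calling `windowed_correlation` once per chromosome and once on the concatenated genome, for each window size, would give the kind of per-chromosome and overall values summarized in the figure.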
A.4 Compare absolute depletion and relative depletion in NFRs

Figure A.4: Absolute depletion vs. relative depletion. R denotes relative depletion and A denotes absolute depletion. The red line is R = −A. This scatter plot shows that the two measures of nucleosome depletion are well correlated. We shall use R = −A > α as the primary cutoff criterion for selecting NFRs for further investigation.

A.5 Distributions and lengths of NFRs with different DoND

We systematically examined the proportion of NFRs in intergenic or promoter regions (defined as 500bp upstream of coding regions). Information on intergenic regions and ORF start positions was downloaded from SGD [CAB98]. The total length of intergenic regions is about 2.88Mb, accounting for about 23.9% of the 12.07Mb yeast genome. The 500bp upstream regions of 6604 ORFs occupy about 2.82Mb of DNA sequence, accounting for about 23.3% of the yeast genome. About 1.83Mb (15.2%) of DNA sequence is both intergenic and within 500bp upstream of an ORF. As expected, NFRs with higher DoND are more likely to be located in intergenic or promoter regions (Figures A.5, A.6). The enrichment of NFRs in intergenic regions or promoter regions is highly significant (Chi-square test p-value < 1e−80 for any cutoff of absolute depletion and/or relative depletion from -0.2/0.2 to -1.0/1.0). In addition, regions that are both intergenic and upstream are more likely to contain NFRs (Chi-square test p-value < 5e−10 for any cutoff from -0.2/0.2 to -1.0/1.0).

[Figure A.5 plot: x-axis: absolute depletion (−1.0 to 0.0); left y-axis: proportion of NFR patterns in intergenic regions, upstream of ORFs, or either; right y-axis: total number of NFR patterns.]
Figure A.5: Locations of NFRs vs. absolute depletion. The proportion of NFR patterns located in intergenic regions, 500bp upstream of ORFs, and either intergenic or 500bp upstream regions according to different cutoffs of absolute depletion. The dashed line indicates the total number of NFR patterns at different cutoffs of absolute depletion (corresponding to the axis on the right side).

[Figure A.6 plot: x-axis: relative depletion (0.0 to 1.0); left y-axis: proportion of NFR patterns in intergenic regions, upstream of ORFs, or either; right y-axis: total number of NFR patterns.]
Figure A.6: Locations of NFRs vs. relative depletion. The proportion of NFR patterns located in intergenic regions, 500bp upstream of ORFs, and either intergenic or 500bp upstream regions according to different cutoffs of relative depletion. The dashed line indicates the total number of NFR patterns at different cutoffs of relative depletion (corresponding to the axis on the right side).

We further divided the intergenic regions into three classes based on the strands of the two neighboring chromosome features:

• Divergent: intergenic regions containing the 5' regions of both neighboring genes.
• Tandem: intergenic regions containing the 5' region of only one neighboring gene.
• Convergent: intergenic regions containing none of the 5' regions of the neighboring genes.

Among all the 6640 intergenic regions, 1617 (24.4%) are divergent, 3087 (46.5%) are tandem, and 1599 (24.1%) are convergent. We excluded 337 (5%) intergenic regions in which at least one of the adjacent chromosomal features lacks transcription orientation, e.g., ARS.
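The three classes above follow directly from the strands of the two flanking features. The sketch below is illustrative only: the function name is hypothetical, and it assumes SGD-style strand codes, 'W' (Watson, 5' end at the lower coordinate) and 'C' (Crick, 5' end at the higher coordinate), with any other value meaning the feature has no transcription orientation.

```python
def classify_intergenic(left_strand, right_strand):
    """Classify an intergenic region from the strands of its flanking features.

    `left_strand` belongs to the feature at lower coordinates, `right_strand`
    to the feature at higher coordinates. Anything other than 'W' or 'C'
    (e.g. None for an ARS) is treated as lacking orientation.
    """
    if left_strand not in ("W", "C") or right_strand not in ("W", "C"):
        return "excluded"
    # A Crick-strand gene on the left and a Watson-strand gene on the right
    # each have their 5' end facing the intergenic region.
    five_prime_ends = (left_strand == "C") + (right_strand == "W")
    return {2: "divergent", 1: "tandem", 0: "convergent"}[five_prime_ends]


for pair in [("C", "W"), ("W", "W"), ("W", "C"), (None, "W")]:
    print(pair, classify_intergenic(*pair))
```

Counting the 5' ends that face the region, rather than enumerating strand pairs, keeps the rule aligned with the wording of the definitions.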
The total lengths of divergent, tandem, and convergent intergenic regions are, respectively, 0.47Mb (17% of all the intergenic regions), 1.42Mb (51%), and 0.87Mb (31%).

[Figure A.7 plot: x-axis: absolute depletion / −relative depletion (−1.0 to 0.0); left y-axis: proportion of NFR patterns in convergent, tandem, and divergent intergenic regions; right y-axis: total number of NFR patterns.]
Figure A.7: Different intergenic regions vs. NFR absolute/relative depletion. The proportion of NFR patterns located in convergent, divergent, and tandem intergenic regions according to different cutoffs of relative depletion and absolute depletion. Specifically, the cutoff α (α < 0) indicates that the absolute depletion is smaller than α and the relative depletion is bigger than −α. The dashed line indicates the total number of NFR patterns within the three types of intergenic regions at different cutoffs (corresponding to the axis on the right side).

We also partitioned the promoter regions of the 6604 ORFs based on whether they contain a TATA box [BZP04]. After excluding 933 promoters for which data is not available, 1090 (19.2%) of the remaining promoters are classified as TATA box-containing promoters and 4581 (80.8%) as TATA-less promoters. We examined the proportion of NFRs located in the different promoter classes and found that NFRs with heavy nucleosome depletion are more likely to be in TATA-containing promoters (Figure A.8).

[Figure A.8 plot: x-axis: absolute depletion / −relative depletion (−1.5 to 0.0); left y-axis: proportion of NFR patterns in promoters with a TATA box; right y-axis: total number of NFR patterns.]
Figure A.8: TATA box vs. NFR absolute/relative depletion. The proportion of NFR patterns located in 500bp upstream promoters with a TATA box according to different cutoffs of relative depletion and absolute depletion. Specifically, the cutoff α (α < 0) indicates that the absolute depletion is smaller than α and the relative depletion is bigger than −α. The dashed line indicates the total number of NFR patterns located in 500bp upstream promoters at different cutoffs (corresponding to the axis on the right side).

Previous results have shown that TF binding sites are over-represented in nucleosome-depleted promoters [BLH04]. We further examined the co-occurrence of TF binding sites and NFRs on a genome-wide scale. The TF binding sites were previously inferred [MWG06] based on a genome-wide TF binding study [HGL04]. The data includes 4312 binding sites with lengths ranging from 4bp to 22bp. If the midpoint of a binding site is within the range of an NFR pattern, we say the NFR pattern harbors the binding site. We showed that the proportion of NFR patterns harboring TF binding sites increases as the degree of nucleosome depletion increases (Figure A.9).

[Figure A.9 plot: x-axis: absolute depletion / −relative depletion (−1.5 to 0.0); left y-axis: proportion of NFR patterns harboring TF binding site(s); right y-axis: total number of NFR patterns.]
Figure A.9: TF binding sites vs. NFR absolute/relative depletion. The proportion of NFR patterns harboring TF binding site(s) according to different cutoffs of relative depletion and absolute depletion. Specifically, the cutoff α (α < 0) indicates that the absolute depletion is smaller than α and the relative depletion is bigger than −α.
The dashed line indicates the total number of NFR patterns at different cutoffs (corresponding to the axis on the right side).

A.6 Nucleosome depletion forces: DNA affinity for histones and transcriptional activity

Table A.4: Correlation matrix of DoND and average Pol II binding. Variables: Abs.D, absolute depletion; Rel.D, relative depletion; polNfr, average polymerase II binding within the NFR; polPro(1k), average polymerase II binding 1000bp upstream of the NFR; polAfr(1k), average polymerase II binding 1000bp downstream of the NFR; polAdj(1k), max(polPro, polAfr).

              Abs.D    Rel.D    polNfr   polPro(1k)  polAfr(1k)  polAdj(1k)
Abs.D         1.000   -0.853   -0.086   -0.188      -0.196      -0.342
Rel.D        -0.853    1.000    0.042    0.159       0.170       0.282
polNfr       -0.086    0.042    1.000    0.779       0.749       0.824
polPro(1k)   -0.188    0.159    0.779    1.000       0.518       0.813
polAfr(1k)   -0.196    0.170    0.749    0.518       1.000       0.812
polAdj(1k)   -0.342    0.282    0.824    0.813       0.812       1.000

Table A.5: Compare the effects of DNA affinity for histones and transcriptional activity (absolute depletion). We compared the effects of DNA affinity for histones and transcriptional activity using the following three linear models: (1) Absolute depletion ∼ DNA, (2) Absolute depletion ∼ Pol II, (3) Absolute depletion ∼ DNA + Pol II, where the Pol II signal was calculated as the maximum of the two averages across 1000bp up- and down-stream of each NFR. Column "N" is the number of NFRs. Let $R_1^2$, $R_2^2$, and $R_3^2$ denote the $R^2$ of models (1), (2), and (3), respectively. $R_{Total}^2 = R_3^2$ is the total proportion of variance explained by DNA affinity for histones or transcriptional activity. $R_{DNA}^2 = R_3^2 - R_2^2$ is the $R^2$ explained solely by DNA affinity for histones. $R_{PolII}^2 = R_3^2 - R_1^2$ is the $R^2$ explained solely by the Polymerase II binding signal. $R_{Both}^2 = R_{Total}^2 - R_{PolII}^2 - R_{DNA}^2$ is the $R^2$ explained by both the Polymerase II signal and DNA affinity for histones. $R_{Both}^2$ is not zero because there is correlation between DNA affinity and Polymerase II binding [NNK04].
$P_{PolII}$ is the ANOVA p-value comparing model (3) against model (1). $P_{DNA}$ is the ANOVA p-value comparing model (3) against model (2).

              N      R^2_Total  R^2_DNA         R^2_PolII       R^2_Both       P_PolII   P_DNA
All NFRs      9593   0.1263     0.0096(7.6%)    0.1142(90.4%)   0.0024(1.9%)   3e-258    1e-24
Inter./Up.    4386   0.1911     0.0373(19.5%)   0.1459(76.3%)   0.0079(4.1%)   5e-160    7e-45
Others        5207   0.0122     0.0052(42.6%)   0.0067(54.9%)   3e-04(2.5%)    3e-09     2e-07

Convergent    516    0.0837     0.0302(36.1%)   0.0499(59.6%)   0.0036(4.3%)   2e-07     5e-05
Tandem        2042   0.1955     0.0472(24.1%)   0.1429(73.1%)   0.0053(2.7%)   2e-74     4e-27
Divergent     1125   0.2483     0.0179(7.2%)    0.2159(87.0%)   0.0146(5.9%)   2e-63     3e-07

TATA          612    0.2542     0.0567(22.3%)   0.1965(77.3%)   9e-04(0.4%)    8e-33     2e-11
TATA-less     2570   0.2082     0.0084(4.0%)    0.1941(93.2%)   0.0057(2.7%)   2e-124    2e-07

TFBS          1116   0.198      0.0874(44.1%)   0.0961(48.5%)   0.0146(7.4%)   3e-29     8e-27
TFBS-less     8477   0.0808     0.0032(4.0%)    0.077(95.3%)    6e-04(0.7%)    3e-150    5e-08

Table A.6 is the same as Table A.5, except that relative depletion is used instead of absolute depletion.

Table A.6: Compare the effects of DNA affinity for histones and transcriptional activity (relative depletion)

              N      R^2_Total  R^2_DNA         R^2_PolII       R^2_Both       P_PolII   P_DNA
All NFRs      9593   0.0887     0.009(10.1%)    0.0779(87.8%)   0.0019(2.1%)   5e-173    3e-22
Inter./Up.    4386   0.1203     0.0293(24.4%)   0.0857(71.2%)   0.0054(4.5%)   1e-90     5e-33
Others        5207   0.0062     0.005(80.6%)    0.001(16.1%)    1e-04(1.6%)    0.02      3e-07

Convergent    516    0.0428     0.0337(78.7%)   0.0076(17.8%)   0.0015(3.5%)   0.04      3e-05
Tandem        2042   0.1081     0.0264(24.4%)   0.0787(72.8%)   0.003(2.8%)    2e-39     1e-14
Divergent     1125   0.1731     0.0156(9.0%)    0.1465(84.6%)   0.011(6.4%)    1e-41     5e-06

TATA          612    0.1749     0.0423(24.2%)   0.132(75.5%)    7e-04(0.4%)    2e-21     3e-08
TATA-less     2570   0.1204     0.007(5.8%)     0.1095(90.9%)   0.0039(3.2%)   2e-67     6e-06

TFBS          1116   0.1483     0.0668(45.0%)   0.0707(47.7%)   0.0109(7.3%)   5e-21     5e-20
TFBS-less     8477   0.0464     0.0029(6.2%)    0.0431(92.9%)   4e-04(0.9%)    2e-83     4e-07

A.7 A subset of intergenic and genic NFRs with high DoND

Table A.7: Distribution of 145 NFRs with high DoND (R > 0.4 and A < −0.4). These 145 NFRs are not located in intergenic regions or within 500bp upstream of coding regions.

Feature                   Number of NFRs
ORF                       58
tRNA                      52
ARS                       16
long terminal repeat      9
Y' element                4
intron                    2
rRNA                      2
X element core sequence   1
snRNA                     1

Table A.8: 25 genic NFRs in verified ORFs.

ORF       Symbol  Chr  Start     End
YLL042C   ATG10   12   52589     52086
YPL111W   CAR1    16   339943    340944
YIR030C   DCG1    9    412767    412033
YPR166C   MRP2    16   876625    876278
YFL003C   MSH4    6    137152    134516
YHR091C   MSR1    8    286772    284841
YAR002W   NUP60   1    152259    153878
YKR003W   OSH6    11   445024    446370
YDL232W   OST4    4    38488     38598
YLR148W   PEP3    12   434642    437398
YFR034C   PHO4    6    225946    225008
YOR361C   PRT1    15   1017650   1015359
YGR170W   PSD2    7    837147    840563
YOR348C   PUT4    15   988779    986896
YOR210W   RPB10   15   738321    738533
YLR141W   RRN5    12   423684    424775
YOL110W   SHR5    15   109176    109889
YOL122C   SMF1    15   91419     89692
YDR308C   SRB7    4    1078445   1078023
YBR150C   TBS1    2    544487    541203
YNL070W   TOM7    14   493367    493549
YER093C   TSC11   5    347608    343316
YLR024C   UBR2    12   193282    187664
YAL002W   VPS8    1    143709    147533
YAR035W   YAT1    1    190187    192250

Table A.9: 33 genic NFRs in un-verified ORFs.
The dubious ORFs are over-represented in the 58 ORFs (hypergeometric p-value = 1e-5) but not for uncharacterized ORFs (p-value = 0.55) based on the SGD [CAB98] annotations, implying that some of the dubious ORFs may not be real coding genes.

Type             ORF         Symbol  Chr  Start     End       Strand
Dubious          YAR060C     NA      1    217483    217148    C
Dubious          YBL048W     NA      2    127302    127613    W
Dubious          YBR209W     NA      2    642578    642895    W
Dubious          YDR010C     NA      4    465380    465048    C
Dubious          YDR215C     NA      4    894498    894115    C
Dubious          YDR274C     NA      4    1011956   1011585   C
Dubious          YDR278C     NA      4    1017314   1016997   C
Dubious          YGR107W     NA      7    702671    703120    W
Dubious          YHL041W     NA      8    17390     17839     W
Dubious          YHR070C-A   NA      8    236514    236104    C
Dubious          YHR212C     NA      8    538094    537759    C
Dubious          YIL054W     NA      9    254541    254858    W
Dubious          YKL102C     NA      11   248011    247706    C
Dubious          YML089C     NA      13   91409     91041     C
Dubious          YML122C     NA      13   26419     26039     C
Dubious          YNL285W     NA      14   96173     96544     W
Dubious          YOR029W     NA      15   384600    384935    W
Dubious          YOR050C     NA      15   424619    424272    C
Dubious          YOR343C     NA      15   968471    968145    C
Dubious          YPR014C     NA      16   587515    587186    C
Dubious          YPR064W     NA      16   678948    679367    W
Uncharacterized  YAR064W     NA      1    220189    220488    W
Uncharacterized  YBL044W     NA      2    136001    136369    W
Uncharacterized  YER077C     NA      5    316596    314530    C
Uncharacterized  YFR032C-B   NA      6    223961    223698    C
Uncharacterized  YGL176C     NA      7    173085    171421    C
Uncharacterized  YGR068C     NA      7    627088    625328    C
Uncharacterized  YHR202W     NA      8    502388    504196    W
Uncharacterized  YHR213W-B   NA      8    540800    541099    W
Uncharacterized  YJR003C     NA      10   442468    440909    C
Uncharacterized  YMR196W     NA      13   655075    658341    W
Uncharacterized  YOR268C     NA      15   825931    825533    C
Uncharacterized  YPR159C-A   NA      16   860411    860310    C

References

[BL04] M.J. Buck and J.D. Lieb. “ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments.” Genomics, 83:349–360, Mar 2004.
[BLH04] B. E. Bernstein, C. L. Liu, E. L. Humphrey, E. O. Perlstein, and S. L. Schreiber. “Global nucleosome occupancy in yeast.” Genome Biol, 5(9):R62, 2004.
[BZP04] A. D. Basehoar, S. J. Zanton, and B. F. Pugh. “Identification and distinct regulation of yeast TATA box-containing genes.” Cell, 116(5):699–709, 2004.
[CAB98] J M Cherry, C Adler, C Ball, S A Chervitz, S S Dwight, E T Hester, Y Jia, G Juvik, T Roe, M Schroeder, S Weng, and D Botstein. “SGD: Saccharomyces Genome Database.” Nucleic Acids Res, 26(1):73–79, Jan 1998.
[CBN04] Simon Cawley, Stefan Bekiranov, Huck H Ng, Philipp Kapranov, Edward A Sekinger, Dione Kampa, Antonio Piccolboni, Victor Sementchenko, Jill Cheng, Alan J Williams, Raymond Wheeler, Brant Wong, Jorg Drenkow, Mark Yamanaka, Sandeep Patel, Shane Brubaker, Hari Tammana, Gregg Helt, Kevin Struhl, and Thomas R Gingeras. “Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs.” Cell, 116(4):499–509, Feb 2004.
[DEK98] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
[DHG06] L. David, W. Huber, M. Granovskaia, J. Toedling, C.J. Palm, L. Bofkin, T. Jones, R.W. Davis, and L.M. Steinmetz. “A high-resolution map of transcription in the yeast genome.” Proc. Natl. Acad. Sci. U.S.A., 103:5320–5325, Apr 2006.
[Edd98] SR Eddy. “Profile hidden Markov models.” Bioinformatics, 14(9):755–763, 1998.
[FSH93] K D Fascher, J Schmitz, and W Horz. “Structural and functional requirements for the chromatin transition at the PHO5 promoter in Saccharomyces cerevisiae upon PHO5 activation.” J Mol Biol, 231(3):658–667, Jun 1993.
[GKB07] S M Gribble, D Kalaitzopoulos, D C Burford, E Prigmore, R R Selzer, B L Ng, N S W Matthews, K M Porter, R Curley, S J Lindsay, J Baptista, T A Richmond, and N P Carter. “Ultra-high resolution array painting facilitates breakpoint sequencing.” J Med Genet, 44(1):51–58, 2007.
[HGL04] Christopher T Harbison, D Benjamin Gordon, Tong Ihn Lee, Nicola J Rinaldi, Kenzie D Macisaac, Timothy W Danford, Nancy M Hannett, Jean-Bosco Tagne, David B Reynolds, Jane Yoo, Ezra G Jennings, Julia Zeitlinger, Dmitry K Pokholok, Manolis Kellis, P Alex Rolfe, Ken T Takusagawa, Eric S Lander, David K Gifford, Ernest Fraenkel, and Richard A Young. “Transcriptional regulatory code of a eukaryotic genome.” Nature, 431(7004):99–104, Sep 2004.
[HJW98] F. C. Holstege, E. G. Jennings, J. J. Wyrick, T. I. Lee, C. J. Hengartner, M. R. Green, T. R. Golub, E. S. Lander, and R. A. Young. “Dissecting the regulatory circuitry of a eukaryotic genome.” Cell, 95(5):717–28, 1998.
[IAZ06] I. P. Ioshikhes, I. Albert, S. J. Zanton, and B. F. Pugh. “Nucleosome positions predicted through comparative genomics.” Nat Genet, 38(10):1210–5, 2006.
[JLM06] W Evan Johnson, Wei Li, Clifford A Meyer, Raphael Gottardo, Jason S Carroll, Myles Brown, and X Shirley Liu. “Model-based analysis of tiling-arrays for ChIP-chip.” Proc Natl Acad Sci U S A, 103(33):12457–12462, Aug 2006. Evaluation Studies.
[JW05] Hongkai Ji and Wing Hung Wong. “TileMap: create chromosomal map of tiling array hybridizations.” Bioinformatics, 21(18):3629–3636, Sep 2005.
[KBZ05] Tae Hoon Kim, Leah O Barrera, Ming Zheng, Chunxu Qu, Michael A Singer, Todd A Richmond, Yingnian Wu, Roland D Green, and Bing Ren. “A high-resolution map of active promoters in the human genome.” Nature, 436(7052):876–880, Aug 2005.
[KIM07] AC Karcanias, K Ichimura, MJ Mitchell, CA Sargent, and NA Affara. “Analysis of sex chromosome abnormalities using X and Y chromosome DNA Tiling path arrays.” J Med Genet, 2007.
[KL99] R. D. Kornberg and Y. Lorch. “Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome.” Cell, 98(3):285–94, 1999.
[KLW03] Craig D Kaplan, Lisa Laprade, and Fred Winston. “Transcription elongation factors repress transcription initiation from cryptic sites.” Science, 301(5636):1096–1099, Aug 2003.
[KMH94] Anders Krogh, I. Saira Mian, and David Haussler. “A hidden Markov model that finds genes in E.coli DNA.” Nucl. Acids Res., 22(22):4768–4778, 1994.
[LBZ00] Harvey Lodish, Arnold Berk, Lawrence S. Zipursky, Paul Matsudaira, David Baltimore, and James Darnell. Molecular Cell Biology. W. H. Freeman and Company, 2000.
[LG92] M S Lee and W T Garrard. “Uncoupling gene activity from chromatin structure: promoter mutations can inactivate transcription of the yeast HSP82 gene without eliminating nucleosome-free regions.” Proc Natl Acad Sci U S A, 89(19):9166–9170, Oct 1992.
[LGC07] Bing Li, Madelaine Gogol, Mike Carey, Samantha G Pattenden, Chris Seidel, and Jerry L Workman. “Infrequently transcribed long genes depend on the Set2/Rpd3S pathway for accurate transcription.” Genes Dev, 21(11):1422–1430, Jun 2007.
[LML05] Wei Li, Clifford A Meyer, and X Shirley Liu. “A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences.” Bioinformatics, 21 Suppl 1:274–282, Jun 2005.
[LSR04] C. K. Lee, Y. Shibata, B. Rao, B. D. Strahl, and J. D. Lieb. “Evidence for nucleosome depletion at active regulatory regions genome-wide.” Nat Genet, 36(8):900–5, 2004.
[LTB07] William Lee, Desiree Tillo, Nicolas Bray, Randall H Morse, Ronald W Davis, Timothy R Hughes, and Corey Nislow. “A high-resolution atlas of nucleosome occupancy in yeast.” Nat Genet, 39(10):1235–1244, Oct 2007.
[MCS00] X Mai, S Chou, and K Struhl. “Preferential accessibility of the yeast his3 promoter is determined by a general property of the DNA sequence, not by specific elements.” Mol Cell Biol, 20(18):6668–6676, Sep 2000.
[MWG06] Kenzie D MacIsaac, Ting Wang, D Benjamin Gordon, David K Gifford, Gary D Stormo, and Ernest Fraenkel. “An improved map of conserved regulatory sites for Saccharomyces cerevisiae.” BMC Bioinformatics, 7:113, 2006.
[NGR98] MA Newton, MN Gould, CA Reznikoff, and JD Haag. “On the statistical analysis of allelic-loss data.” Stat Med, 17:1425–45, 1998.
[NLH06] N. Nègre, S. Lavrov, J. Hennetin, M. Bellis, and G. Cavalli. “Mapping the distribution of chromatin proteins by ChIP on chip.” Meth. Enzymol., 410:316–341, 2006.
[NNK04] Chris J. Nachtsheim, John Neter, and Michael H. Kutner. Applied linear statistical models. McGraw-Hill, 2004.
[ODK96] M. Ostendorf, V.V. Digalakis, and O.A. Kimball. “From HMM’s to segment models: a unified view of stochastic modeling for speech recognition.” IEEE Transactions on Speech and Audio Processing, 4(5):360–378, 1996.
[OSL07] F. Ozsolak, J.S. Song, X.S. Liu, and D.E. Fisher. “High-throughput mapping of the chromatin structure of human promoters.” Nat. Biotechnol., 25:244–248, Feb 2007.
[PEP99] J.L. Parrou, B. Enjalbert, L. Plourde, A. Bauche, B. Gonzalez, and J. François. “Dynamic responses of reserve carbohydrate metabolism under carbon and nitrogen limitations in Saccharomyces cerevisiae.” Yeast, 15:191–203, Feb 1999.
[PHL05] Dmitry K Pokholok, Christopher T Harbison, Stuart Levine, Megan Cole, Nancy M Hannett, Tong Ihn Lee, George W Bell, Kimberly Walker, P Alex Rolfe, Elizabeth Herbolsheimer, Julia Zeitlinger, Fran Lewitter, David K Gifford, and Richard A Young. “Genome-wide map of nucleosome acetylation and methylation in yeast.” Cell, 122(4):517–527, Aug 2005.
[Rab89] L.R. Rabiner. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77(2):257–286, Feb 1989.
[SFC06] E. Segal, Y. Fondufe-Mittendorf, L. Chen, A. Thastrom, Y. Field, I. K. Moore, J. P. Wang, and J. Widom. “A genomic code for nucleosome positioning.” Nature, 442(7104):772–8, 2006.
[SMS05] Edward A Sekinger, Zarmik Moqtaderi, and Kevin Struhl. “Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast.” Mol Cell, 18(6):735–748, Jun 2005.
[STK03] K. Sakaki, K. Tashiro, S. Kuhara, and K. Mihara. “Response of genes associated with mitochondrial function to mild heat stress in yeast Saccharomyces cerevisiae.” J. Biochem., 134:373–384, Sep 2003.
[TME07] Andreas Tzschach, Corinna Menzel, Fikret Erdogan, Marei Schubert, Maria Hoeltzenbein, Gotthold Barbi, Christine Petzenhauser, Hans-Hilger Ropers, Reinhard Ullmann, and Vera Kalscheuer. “Characterization of a 16 Mb interstitial chromosome 7q21 deletion by tiling path array CGH.” Am J Med Genet A, 143(4):333–337, 2007.
[WBI05] Kathryn Woodfine, David M Beare, Koichi Ichimura, Silvana Debernardi, Andrew J Mungall, Heike Fiegler, V Peter Collins, Nigel P Carter, and Ian Dunham. “Replication timing of human chromosome 6.” Cell Cycle, 4(1):172–176, 2005.
[WC53] J.D. Watson and F.H. Crick. “Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid.” Nature, 171:737–738, Apr 1953.
[WES04] Eric J White, Olof Emanuelsson, David Scalzo, Thomas Royce, Steven Kosak, Edward J Oakeley, Sherman Weissman, Mark Gerstein, Mark Groudine, Michael Snyder, and Dirk Schubeler. “DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states.” Proc Natl Acad Sci U S A, 101(51):17771–17776, 2004.
[WFB04] Kathryn Woodfine, Heike Fiegler, David M Beare, John E Collins, Owen T McCann, Bryan D Young, Silvana Debernardi, Richard Mott, Ian Dunham, and Nigel P Carter. “Replication timing of the human genome.” Hum Mol Genet, 13(2):191–202, 2004.
[WGZ92] J H Wright, D E Gottschling, and V A Zakian. “Saccharomyces telomeres assume a non-nucleosomal chromatin structure.” Genes Dev, 6(2):197–210, Feb 1992.
[XZZ07] F. Xu, Q. Zhang, K. Zhang, W. Xie, and M. Grunstein. “Sir2 deacetylates histone H3 lysine 56 to regulate telomeric heterochromatin structure in yeast.” Mol. Cell, 27:890–900, Sep 2007.
[YLD05] Guo-Cheng Yuan, Yuen-Jong Liu, Michael F Dion, Michael D Slack, Lani F Wu, Steven J Altschuler, and Oliver J Rando. “Genome-scale identification of nucleosome positions in S. cerevisiae.” Science, 309(5734):626–630, Jul 2005.