Analysis of Array CGH Data for the Estimation of Genetic Tumor Progression

by Laura Toloşi

Supervisors: Dr. Jörg Rahnenführer, Prof. Dr. Thomas Lengauer

A thesis submitted in conformity with the requirements for the degree of Master of Science
Computer Science Department, Saarland University
June 2006

Abstract

In cancer research, prediction of time to death or relapse is important for meaningful tumor classification and for selecting appropriate therapies. The accumulation of genetic alterations during tumor progression can be used to assess the genetic status of the tumor. ArrayCGH technology measures genomic amplifications and deletions with a resolution high enough to detect copy number changes of single genes.

We propose an automated method for analyzing the accumulation of mutations in cancer, based on statistical analysis of arrayCGH data. The method consists of four steps: arrayCGH smoothing, aberrations detection, consensus analysis and estimation of oncogenetic tree models. For the second and third steps, we propose new algorithmic solutions. First, we use the adaptive weights smoothing-based algorithm GLAD to identify regions of constant copy number. Then, in order to select regions of gain and loss, we fit robust normal distributions to the smoothed Log2 Ratios of each CGH array and choose appropriate significance cutoffs. The consensus analysis step consists of an automated selection of recurrent aberrant regions when multiple CGH experiments on the same tumor type are available. We propose to associate a p-value with each measured genomic position and to select the regions where the p-value is sufficiently small.
The aberrant regions computed by our method can be further used to estimate evolutionary trees, which model the dependencies between genetic mutations and can help to predict tumor progression stages and survival times.

We applied our method to two arrayCGH data sets obtained from prostate cancer and glioblastoma patients, respectively. The results confirm previous knowledge on the genetic mutations specific to these types of cancer, but also bring out new regions, often reducing to single genes, due to the high resolution of arrayCGH measurements. An oncogenetic tree mixture model fitted to the Prostate Cancer data set shows two distinct evolutionary patterns discriminating between two different cell lines. Moreover, when used as clustering features, the genetic mutations that our algorithm outputs separate arrays representing four different cell lines well, indicating that we extract meaningful information.

I hereby declare that this thesis is entirely my own work except where otherwise indicated. I have used only the resources given in the list of references.

Laura Toloşi
June 22, 2006

Acknowledgments

While carrying out this work, I learned that the most enjoyable and fruitful moments were the discussions with the people around me. Their professional enthusiasm taught me to pursue my work with passion and optimism. I am grateful to my supervisor Dr. Jörg Rahnenführer for his ideas and continuous guidance, to Prof. Dr. Thomas Lengauer for his support and energy, and to the whole group for creating a great working environment. I thank Konstantin Halachev for lifting my spirits and always encouraging me.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Motivation
  1.3 Related Work
  1.4 Contribution
2 ArrayCGH Data
  2.1 Genetic Mutations in Cancer Genesis
  2.2 ArrayCGH Technology
  2.3 Statistical analysis of arrayCGH data
    2.3.1 ArrayCGH smoothing
    2.3.2 Aberrations detection
    2.3.3 Region selection
    2.3.4 ArrayCGH analysis algorithm
3 Genetic Tumor Progression
  3.1 Oncogenetic Trees
    3.1.1 Formal model
  3.2 Genetic Progression Score
4 Application to Cancer Data
  4.1 Array CGH data sets
  4.2 Implementation
    4.2.1 Implementation steps
  4.3 Results for analysis of prostate cancer data
    4.3.1 Genetic mutations in individual arrays
    4.3.2 Consensus analysis
    4.3.3 Oncogenetic trees
  4.4 Results for analysis of glioblastomas data
    4.4.1 Genetic mutations in individual arrays
    4.4.2 Consensus analysis
    4.4.3 Oncogenetic trees
  4.5 Validation
5 Conclusions and future work
A Prostate Cancer
B Glioblastomas

Chapter 1 Introduction

1.1 Problem Statement

In recent years, much research in the fields of Molecular Biology and Bioinformatics has focused on finding good indicators of cancer staging. These so-called biomarkers [23], used together with traditional medical measures, could significantly improve patient care if their meaning is correctly stated and assessed. ArrayCGH data is one such biomarker, which provides information about the genetic amplifications and deletions occurring during cancer progression. These aberrations are specific to each type of cancer.

Previous results show that the order in which the amplifications and deletions occur follows preferred patterns that characterize cancer evolution. Mixtures of oncogenetic trees have been proposed as a statistical model of mutation accumulation patterns and can be used as predictors of tumor progression status [1]. Based on the tree mixture model, each tumor sample is assigned a genetic pattern and a genetic progression score, which gives a prediction of survival time and can help in selecting suitable therapies.

Our first goal was to develop an automated method to determine the chromosomal gains and losses specific to different types of tumors, based on statistical inference on arrayCGH data. The second objective was to apply the method to Prostate Cancer and Glioblastomas data sets, then estimate and analyze the corresponding oncogenetic trees.

1.2 Motivation

Important problems frequently met in cancer treatment, such as prediction of survival time, choice of treatment and subtype prediction, are currently addressed based almost exclusively on the anatomic characteristics of the disease. Much effort is spent on including a large series of biomarkers in standard clinical practice, which will potentially improve the quality of treatment. The technical means that support cancer treatment are also subject to continuous development.
Comparative Genomic Hybridization (CGH) is an experimental technique used to measure certain types of genetic mutations associated with cancer onset and progression. During the past decade, CGH technology has evolved to overcome resolution obstacles, currently allowing the detection of refined mutations, single gene copy number changes or even smaller target areas. However, CGH data are too voluminous and too noisy to be processed directly and fully by medical experts, and thus automated methods are needed to extract the relevant information.

1.3 Related Work

Many recent publications address the problem of detecting chromosomal aberrations in different types of cancer. Most of them describe the analysis of CGH experiments, which identify low resolution mutations ([8], [15], [22]). The number of publicly available experimental arrayCGH data sets has been rapidly increasing during the past year, as has the number of publications which present methods and results of arrayCGH analysis. Typically, smoothing methods are described ([25], [26], [28]), with applications to single arrays or small sets of experiments. ArrayCGH smoothing algorithms deal with the problem of removing the noise from arrayCGH data and detecting regions of gains and losses (see [21] for a comparison study of the methods).

Among the publications that propose tumor stage and evolution models, we cite Beerenwinkel et al. ([30, 1]), who describe oncogenetic tree mixtures as statistical models of the accumulation of genetic mutations during cancer progression. The tree models are specific to each type of tumor and can be used to predict time to death, subtype or stage of the disease, or a suitable treatment. We use oncogenetic tree models in our work and we give a detailed description in Chapter 3.

1.4 Contribution

We propose an automated method for analyzing the accumulation of mutations in cancer, based on statistical analysis of arrayCGH data.
The algorithm consists of the following steps: arrayCGH smoothing, aberrations detection, consensus analysis and oncogenetic tree models estimation. ArrayCGH smoothing solves the problem of determining regions of constant copy number. The aberrations detection is a procedure for identifying chromosomal gains and losses in individual experiments. The consensus analysis step addresses the selection of recurrent mutated regions in a set of arrayCGH experiments. The oncogenetic tree models estimation is a statistical learning method that builds a model of tumor evolution patterns in terms of mutation accumulation.

Our contributions to the method are the second and third steps. Aberration detection methods have been proposed before in the literature ([25], [3]); we describe here an alternative approach that is robust and data adaptive. To our knowledge, an automated consensus analysis algorithm has not been published yet.

We applied our method to Prostate Cancer and Glioblastomas arrayCGH data sets. We discovered amplifications and deletions, often reducing to single genes, that narrowed down the regions identified by CGH or that are not yet reported in the literature.

Chapter 2 ArrayCGH Data

An important goal in biomedical research is to identify biological indicators that characterize the stage of tumors well. The process of accumulation of genetic mutations gives information about the evolution of most cancer types. Array CGH technology is used to measure these mutations, typically amplifications and deletions.

This chapter is an overview of the array CGH technology and data analysis. The first section gives an introduction to the biological mechanisms involved in cancer onset and progression, with focus on the genetic mutations measured by arrayCGH experiments. Technical details on how arrayCGH data is obtained from tumor tissue of diagnosed patients are given in Section 2.2.
An automated method of statistical analysis of array CGH data is proposed in the last section of this chapter. The analysis consists of several consecutive steps: smoothing of the CGH data, aberrations detection and region selection. Each of these steps will be presented in detail, with emphasis on the last two, which are our contribution to the overall method.

2.1 Genetic Mutations in Cancer Genesis

This section explains several basic biology notions about cancer that are necessary for understanding the data and the proposed analysis methodology. All aspects of cancer discussed in this section refer to human tumors, and are mainly based on [20].

Basic notions

The complete genetic information of an organism is contained in its genome. The genetic material is organized as DNA molecules, contained in the nucleus of each cell. The DNA is a double-stranded helix formed of two paired sequences of nucleotides of the following four types: adenine (A), thymine (T), cytosine (C) and guanine (G) (see Figure 2.1). The main force promoting the formation of this helix is complementary base pairing: adenines form hydrogen bonds with thymines and cytosines form hydrogen bonds with guanines. The human DNA is 3 billion base pairs long.

Figure 2.1: DNA fragment.

Figure 2.2: Condensed structure of a chromosome. The DNA strand has several structural layers.

The DNA is packed in macromolecules called chromosomes. In humans, each cell contains 23 pairs of chromosomes (see Figure 2.3). The DNA strand of a chromosome has a highly compressed structure, as shown in Figure 2.2.

The DNA sequence contains genetic specifications of all biological processes of a cell. Of particular interest are the genes, specific subsequences of DNA that are 1000-10000 base pairs long. Genes are transcribed into a single-stranded nucleic acid called RNA, which can be further translated into proteins, the macromolecules that perform most of the biological functions in the cell.
The transcription of genes into RNA and further into proteins is regulated by the needs of the organism.

The genetic information is identical in all cells of the human organism. At each cell division, the genome is duplicated into identical halves, via a process called mitosis. Often, the DNA replication fails to produce identical copies, which results in mutated daughter cells. The regulatory mechanisms of the cell, however, detect these mutations and, under normal conditions, the cell goes into apoptosis, or programmed cell death. Healthy functioning requires a permanent balance between cell proliferation and cell death, to comply with the needs of the organism.

Figure 2.3: The 23 pairs of chromosomes in humans. The picture shows a normal male karyotype. A normal female karyotype has a second X chromosome instead of the Y chromosome.

Cancer

Occasionally, the controls that regulate cell multiplication fail to function normally. A cell in which this occurs begins to divide in an unregulated fashion, without regard to the body's need for further cells of its type. The descendants of such a cell inherit the incapacity to respond to regulation, which may lead to a mass of cells able to expand indefinitely. This mass is called a tumor. Its existence may not have serious health consequences if it remains local, but it might lead to the death of the patient if it spreads to other tissue types. The disease associated with this malfunction is called cancer.

Cancer is caused by DNA mutations which involve genes that normally regulate cell multiplication. Two classes of genes play a key role in cancer induction: proto-oncogenes and tumor-suppressor genes. The proto-oncogenes encode proteins that promote cell proliferation. In cancer, mutations of these genes lead to their abnormal, increased activity. One of the mechanisms that produce these mutations is the localized reduplication, or amplification, of chromosome segments that include proto-oncogenes.
As a consequence, all genes located within amplified segments have increased copy number in the tumor tissue, leading to overexpression of their encoded proteins. Figure 2.4 a) shows a schema of a chromosomal gain mutation.

Figure 2.4: a) Chromosomal gain mutation. A segment of chromosome 10 is inserted twice during mitosis; b) Chromosomal loss mutation. A segment of chromosome 5 is deleted during mitosis.

However, mutations in proto-oncogenes cannot induce an accelerated division of the cells by themselves, unless they are complemented by mutations in the genes that promote apoptosis, called tumor-suppressor genes. Deletions of chromosome segments that contain such genes (Figure 2.4 b)) lead to a decrease in their copy number and thus to underexpression of the encoded proteins.

Since the DNA contains many genes of both types described above, a cell must undergo multiple mutations until it becomes cancerous. Direct observations of DNA from tumor tissue in different stages also show that mutations accumulate over time. Moreover, it seems that in most cancer types, mutations arise in specific, preferred orders. Therefore, if chromosomal mutations are determined accurately over a large enough number of tumor samples in different stages, a statistical analysis can provide models of evolution which characterize particular types of cancer. These models can improve prediction of tumor stage and survival time and can help in therapy selection.

2.2 ArrayCGH Technology

This section is an overview of the technology that measures DNA copy number gains and losses and maps these aberrations to the genomic sequence. For reasons that will become clear below, it is called Comparative Genomic Hybridization, in short CGH.

CGH

The CGH procedure was first developed and described by Kallioniemi et al. in 1992 (see [19]) and it has been continuously improved since.
The basic strategy of the technique is to use genomic DNA from cancer cells labelled with one fluorochrome and genomic DNA from healthy reference cells labelled with a second fluorochrome and then allow them to competitively hybridize to immobilized target DNA. The two fluorochromes are chosen to emit easily distinguishable wavelengths, typically red and green. In regions where there are no amplifications or deletions in the cancer genome, binding of both samples will be equal, which will result in a perceived yellow fluorescence. However, where there are losses in the copy number in the cancer DNA, the color with which the normal DNA was labelled will predominate. Similarly, in regions with DNA copy number gain, the color with which the tumor DNA was labelled will be apparent [24].

Figure 2.5: Chromosomal CGH technology. The main steps.

Chromosomal-CGH

Initially, the CGH procedure used intact chromosomes as target for hybridization, which were conveniently immobilized during the metaphase of cell mitosis. The chromosomes were scanned and ratios of the two fluorescent intensities were measured by quantitative image analysis. Figure 2.5 shows a schema of the chromosomal-CGH technique.

However, chromosomal CGH cannot detect aberrations smaller than 3-10 mega base pairs, due to the highly condensed structure of chromosomes. For the same reason, it also fails to determine precisely the endpoints of the altered regions. Higher resolution analysis is needed to detect single gene copy number changes and to identify much smaller target areas for these changes, so that single causative genes can ultimately be identified.

Figure 2.6: Array CGH technology. The main steps.

Microarray-based CGH

A significant improvement in CGH technology was announced in 1997 by Solinas-Toldo et al. with a technique they described as matrix-CGH [32]. In the subsequent years, several refining developments of matrix-CGH overcame the resolution limit problem ([33], [27]).
Probes that map to evenly spaced loci along the entire length of the genome were printed onto glass slides, called microarrays. The microarrays were then used as targets for hybridization instead of chromosomes, allowing a high resolution of the measurements. This improved technology is called array CGH (Fig. 2.6). Depending on the spacing and length of the clones, array CGH can measure copy number changes of single genes or of even smaller regions.

Several types of probes can be used to produce microarrays. Widely used are bacterial artificial chromosomes (BACs). Each BAC clone consists of a small 100-200 Kilobase (Kb) segment of DNA, grown in bacteria and immobilized onto slides according to its genomic location. Tumor and control tissues labelled with different fluorochromes are hybridized on the microarrays as in chromosomal CGH. The microarrays are scanned to produce two separate images that show intensities for the two wavelengths. At each spot, the ratio of the intensities of the two fluorochromes gives a measure of the abundance of the corresponding gene in the tumor tissue.

However, the quality of microarray images varies considerably, as the measured intensity of a spot includes a contribution of non-specific hybridization and other chemicals on the glass. Suboptimal experimental conditions may strongly affect the spot quantification, and clean images as in Figure 2.6 are rather exceptional. Therefore, specific image analysis algorithms are used to extract corrected fluorescent intensities.

Several further transformations of the intensity values are needed before the copy number changes can be analyzed and interpreted. This adjustment is called normalization (see [4]), and it consists of three steps:

- Transformation to normality refers to the adjustment of the intensity ratios such that they approximate a Gaussian distribution. A normal variation of the data is a prerequisite of many types of statistical analysis. There is a general agreement that a log transformation of arrayCGH data provides a good approximation of a Gaussian distribution. If $x$ and $y$ denote the corrected fluorescent intensities of a spot, the transformed values $\log_2(x/y)$, or, equivalently, $\log_2(x) - \log_2(y)$, are approximately normally distributed.

- Centralization removes biases from the data. Several sources, including variation within and among arrays, unequal dye incorporation or poor scanning quality, introduce uneven bias along the microarray. Among the proposed methods for centralization, LOESS normalization has been widely used [8].

- Re-scaling is a final step that may be applied to ensure that data from different hybridizations have equal variances. This step is usually omitted since the variances may differ not only due to error in the experiments, but also due to treatment effects, which should not be altered.

Figure 2.7 shows an example of a normalized CGH array. In what follows, we will refer to the normalized logarithm of the ratios of dye intensities as Log2 Ratios, or, in a more relaxed notation, as copy numbers.

Figure 2.7: Chromosome 1 from a CGH experiment. Each point on the plot corresponds to a gene. The x-axis represents the position of the gene on the chromosome, in Kilo base pairs. The y-axis shows the normalized Log2 Ratio of the fluorescent intensities of the spot corresponding to the gene. A Log2 Ratio close to 0 indicates a normal abundance of the gene in the tumor tissue, whereas a level significantly greater (or smaller) than 0 indicates a gain (or loss) in copy number.

2.3 Statistical analysis of arrayCGH data

2.3.1 ArrayCGH smoothing

Array CGH data can reveal chromosomal deletions and amplifications in tumor tissues. We expect changes in copy number to cover multiple consecutive genes, since usually segments of chromosomes are affected.
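The Log2 Ratios defined in Section 2.2 are the input to the smoothing step. As a minimal sketch, assuming two lists of corrected, strictly positive spot intensities (tumor and reference), the transformation could look as follows. The function names are hypothetical, and median centering is used here only as a simple stand-in for the LOESS-based centralization mentioned above; it is not the procedure used in this work.

```python
import math
from statistics import median


def log2_ratios(tumor, reference):
    """Return log2(x / y) per spot, computed as log2(x) - log2(y).

    Assumes both sequences hold corrected, strictly positive intensities
    for the same spots in the same order.
    """
    return [math.log2(x) - math.log2(y) for x, y in zip(tumor, reference)]


def median_center(values):
    """Subtract the median so that the bulk of the spots sits at 0.

    Illustrative assumption: a global shift, not intensity-dependent
    LOESS centralization.
    """
    m = median(values)
    return [v - m for v in values]
```

With equal intensities in both channels the ratio is 0; a spot twice as bright in the tumor channel gives a Log2 Ratio of 1, and a spot half as bright gives -1.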
However, array CGH data is noisy, which means that a smoothing method should be applied to increase the reliability of detecting changes. Classical regression methods are not very suitable for this purpose, as they tend to blur the sudden changes in copy number and round the segments between jumps, instead of flattening them. The reason is that most regression models are continuous functions, while our purpose is to detect discontinuities.

The problem reduces to fitting a piecewise constant function to the data (see Fig. 2.8), and it consists of two subproblems: finding the breakpoints that separate segments with homogeneous Log2 Ratios, and estimating the copy number for each such segment. The first subproblem is more difficult, and there have been many efficient approaches to solve it. The next section is a summary of the most frequently used algorithms for this purpose. A comparative analysis can be found in [21].

Figure 2.8: Piecewise constant regression. Chromosome 1 from an arrayCGH experiment exhibits deletions and amplifications. Each point on the plot corresponds to a gene. The x-axis represents the position of the gene on the chromosome, in Kilo base pairs. The y-axis shows the normalized Log2 Ratio of the fluorescent intensities of the spot corresponding to the gene. The fitted red line indicates regions of constant copy number.

ArrayCGH smoothing algorithms overview

The most common approach is to model the data as a partition into segments, with unknown boundaries and unknown heights, which have to be estimated from the observations. An optimization criterion must involve the quality of the fit, but should also penalize the number of discontinuities, to avoid overfitting.

Jong et al. [18] use a stochastic genetic algorithm to maximize a likelihood with a penalty term containing the number of breakpoints. Fridlyand et al.
[11] use a Hidden Markov model (HMM) in which the underlying copy numbers are hidden states with associated transition probabilities. The HMM is fitted to the observed data using the Forward-Backward and Baum-Welch methods. A penalized sum of squares method has been modified by using the L1 norm instead of the L2 norm, which gives sharper boundaries between segments (Eilers et al. [10]). The optimization problem is then solved using a quantile regression idea. Hsu et al. [14] propose to fit wavelets to the data and shrink the coefficients. In the flat parts most of the higher frequency coefficients will become zero, but near the jumps they will be retained, which indicates the positions of the jumps. One can also systematically traverse the data series and introduce or remove breakpoints iteratively, maximizing a likelihood statistic. This method is called Circular Binary Segmentation, and it is proposed by Olshen et al. in [26].

Hupé et al. [16] use adaptive weights smoothing to detect the breakpoints. The method is called GLAD, and we used this algorithm in our analysis. One of the reasons that favored this particular choice is that, unlike most of the other approaches, it does not assume equal distances between consecutive probes, but uses their real position on the chromosome. This additional information is very important for an accurate alignment of the smoothing functions of multiple CGH arrays, given that Log2 Ratio intensities are frequently not available for arbitrary positions (due to the quality of the image spots, for example). The next subsection is a detailed presentation of the GLAD algorithm.

The GLAD algorithm

The GLAD (Gain and Loss Analysis of DNA) algorithm was proposed by Hupé et al. (2004) [16]. It solves a piecewise constant regression problem by using a local maximum likelihood approach.
The method is divided into two main steps: detecting the breakpoints (positions where the underlying copy number changes) and estimating the status of each segment. We discuss the algorithm in detail, as presented in the cited publication.

Formal model

Each arrayCGH experiment can be formally represented as a series of $N$ independent observations $(X_1, Y_1), \ldots, (X_N, Y_N)$, where each $X_i$ represents the position on the chromosome and each $Y_i$ is the corresponding measured Log2 Ratio. The locations are sorted increasingly: $X_1 < \ldots < X_i < \ldots < X_N$. The underlying statistical model assumed to have produced this data is a piecewise constant function with additive Gaussian noise, meaning that the random variable $Y_i$ depends on the location $X_i$ via a parameter $\theta$ and an additive Gaussian noise $\epsilon_i$ as follows:

\[ Y_i = \theta(X_i) + \epsilon_i \tag{2.1} \]

In this context, $\theta$ is the copy number that needs to be estimated and the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. The function $\theta$ is piecewise constant, which is formally expressed as:

\[ \theta(x) = \sum_{m=1}^{M} a_m \mathbf{1}_{I_m}(x) \tag{2.2} \]

The number $M$ of disjoint segments $I_1, \ldots, I_M$, the limits of the segments and the values $a_m$ are unknown. They will be estimated from the data, such that a local likelihood statistic is maximized.

The detection of the breakpoints uses the Adaptive Weights Smoothing (AWS) procedure proposed by Polzehl and Spokoiny (2002) [28]. This is an iterative algorithm that finds around each location $X_i$ a maximum neighborhood in which the copy number can be assumed constant and then fits a local maximum likelihood value at $X_i$. The locality is forced by assigning weights $w_{ij}$ ($0 \le w_{ij} \le 1$) to all other observations $X_j$ and readapting these weights to the data such that eventually the observations that lie within the same constant segment receive much larger weights than the others. The algorithm is a two-stage iterative process: readapting the weights and reestimating the parameters.
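The model of Formulas 2.1 and 2.2 can be made concrete by simulating from it. The sketch below is illustrative only: the function name, the half-open segment convention and the example levels are assumptions, not part of GLAD or of the data analyzed in this work.

```python
import random


def simulate_profile(positions, segments, sigma, seed=0):
    """Draw Y_i = theta(X_i) + eps_i with eps_i i.i.d. N(0, sigma^2).

    `segments` is a list of (start, end, level) triples describing the
    disjoint half-open intervals I_m = [start, end) and their values a_m.
    """
    rng = random.Random(seed)

    def theta(x):
        # Piecewise constant function of Formula 2.2.
        for start, end, level in segments:
            if start <= x < end:
                return level
        raise ValueError(f"position {x} is not covered by any segment")

    return [theta(x) + rng.gauss(0.0, sigma) for x in positions]
```

For example, `simulate_profile(range(200), [(0, 100, 0.0), (100, 200, 0.58)], sigma=0.1)` produces a profile with one breakpoint; the level 0.58 is roughly $\log_2(3/2)$, the value that a single-copy gain would give in an idealized sample.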
The readapting of the weights is done via a location penalty kernel Kl that punishes distant observations (w.r.t. position on the chromosome) and a statistical penalty kernel Ks that punishes two different local models. Kernels are widely used in local regression models, with the purpose of assigning higher weights to close observations than to distant ones and therefore forcing the prediction at a certain data point to depend only on neighboring observations. By adding a second kernel for statistical penalty, the AWS procedure decreases the large weights of neighboring observations to Xi that do not also have close response values to Yi . Formally, the kernels are non-increasing symmetrical functions that fulfill Kl (0) = Ks (0) = 1, controlled by two parameters: a geometric growth rate a that enlarges the neighborhood around the observation at Xi with every iteration, and λ, that adjusts the magnitude of the statistical penalty. A memory parameter η stabilizes the procedure by involving the old weights in the computation of the new ones. The reestimation of the parameters resumes to maximizing a weighted likelihood statistic of the form: N X p(Yj , θ) 0 (2.3) L(Wi , θ, θ ) = wij log p(Yj , θ0 ) j=1 where θ0 is an arbitrary point in the parameter space, Wi = diag{wi1 , ..., wiN } and p(·, θ) is the probability distribution of the response variables for a given parameter θ. Thus, the MLE for θ is given by: θ̂i := θ̂(Xi ) = argsupL(Wi , θ, θ0 ) (2.4) θ∈Θ Intuitively, the estimate at observation Xi is given by such θi for which the likelihood that the Y values in the neighborhood of Xi are approximated by the constant θ is maximized. Given the Gaussian model of the data (Formula 2.2), one can prove that the MLE 14 estimate coincides with the weighted least squares estimator by replacing p(y, θ) = √ 1 e− 2πσ (y−θ)2 2σ 2 in Formula 2.3: θ̂i = argsup θ∈Θ N X wij (log p(Yj , θ) − log p(Yj , θ0 )) j=1 N 1 X wij ((Yj − θ0 )2 − (Yj − θ)2 ) 2 θ∈Θ 2σ j=1 ! ! 
N N X X − = argsup wij θ2 + 2 wij Yj θ + = argsup θ∈Θ PN = j=1 j=1 wij Yj PN j=1 wij , j=1 ∀θ0 ∈ Θ N X !! wij (θ02 − 2Yj θ0 ) j=1 (2.5) The AWS procedure requires an estimate for the standard deviance σ of the noise. A robust estimator is given in [16] by the formula: σ̂ = IQR(Z1 , ..., ZN −1 ) √ IQR(N (0, 1)) × 2 where Zi = Yi+1 − Yi and IQR is the interquartile range (i.e. the difference between the first and the third quartile of the given sample). Initially, all weights wij are set to 1 and the parameter estimates for all observations are given by the maximum likelihood constant function fitted to the weighted data, which is in fact the mean of all observations. The AWS algorithm Below we give the AWS algorithm, as presented in [28]. Input: a set of N observations (Xi , Yi )i=1..N . Output: smoothing values (θ̂i )i = 1..N . Parameters: penalty kernels Kl and Ks , memory parameter η, initial bandwidth h(1) , growth rate a, maximal bandwidth h∗ and statistical penalty control λ. We discuss the possible values and the meaning of the parameters after the algorithm is presented. 15 AWS procedure (1) Initialization: Calculate the global MLE θ̂(0) of θ: N 1 X Yi N i=1 θ̂(0) = (0) (0) = θ̂(0) and define Wi For every i = 1, ..., N , set θ̂i k = 1. as the unit matrix. Set (2) Iteration: for every i = 1, ..., N : (a) Calculate the adaptive weights: For every point Xj , calculate the penalties: (k) lij = | ρ(Xi , Xj )/h(k) |2 , (k) (k−1) sij = λ−1 [L(Wik−1 , θ̂i = λ−1 · 1 2σ̂ 2 (k−1) · (θ̂i (k−1) , θ̂j (k−1) − θ̂j (k−1) ) + L(Wjk−1 , θ̂j ) (k−1) PN k=1 (wik (k−1) , θ̂i (k−1) − wjk )]/2 (k−1) )(Yk − θ̂i (k−1) +θ̂j 2 where ρ(x, x0 ) is a metric in the input space and h(k) controls the size of the neighborhood of each Xi . 
Calculate:

\[ \tilde w_{ij}^{(k)} = K_l\big(l_{ij}^{(k)}\big) K_s\big(s_{ij}^{(k)}\big) \]

and define the weight $w_{ij}^{(k)}$ as:

\[ w_{ij}^{(k)} = \eta\, w_{ij}^{(k-1)} + (1 - \eta)\, \tilde w_{ij}^{(k)} \]

Denote by $W_i^{(k)}$ the diagonal matrix $W_i^{(k)} = \mathrm{diag}\{w_{i1}^{(k)}, \ldots, w_{iN}^{(k)}\}$.

(b) Estimation: Calculate the new local MLE $\hat\theta_i^{(k)}$ of $\theta_i$:

\[ \hat\theta_i^{(k)} = \frac{\sum_{j=1}^{N} w_{ij}^{(k)} Y_j}{\sum_{j=1}^{N} w_{ij}^{(k)}} \]

(3) Stopping: Stop if $a h^{(k)} > h^*$, otherwise increase $k$ by 1, set $h^{(k)} = a h^{(k-1)}$ and continue with step 2.

The AWS procedure provides an estimate $\hat\theta_i$ for each observation at $X_i$. The breakpoints are the positions $X_i$ such that $\hat\theta_i \notin [\hat\theta_{i+1} - \varepsilon, \hat\theta_{i+1} + \varepsilon]$. In the default case, $\varepsilon = 10^{-2}$. The final step of the regression is to choose a maximum likelihood constant fit within each segment, which is the mean of the $Y_i$ values, given the assumed normality of the data.

Parameters discussion

The AWS procedure involves several parameters that can be tuned in order to make the method more or less sensitive to discontinuities. We present them below, with the same notation as in [16].

(a) Kernels $K_s$ and $K_l$. By default, the kernels are exponential functions: $K_l(u) = K_s(u) = e^{-|u|}$. In order to decrease the computational complexity of the method, the statistical penalty kernel can be chosen to be the triangle function: $K_s(u) = (1 + u)_+ \mathbf{1}_{(-\infty, 0]}(u) + (1 - u)_+ \mathbf{1}_{(0, \infty)}(u)$, that is, $K_s(u) = (1 - |u|)_+$.

(b) Memory parameter $\eta$. The value $\eta \in (0, 1)$ can be viewed as the memory parameter of the algorithm. The larger the value of $\eta$, the more stable the method is across iterations. However, a large $\eta$ decreases the sensitivity to local changes.

(c) Bandwidth parameters $h^{(1)}$, $a$ and $h^*$. The initial bandwidth $h^{(1)}$ should be taken as small as possible, so that the initial neighborhood comprises only one observation. The parameter $a$ controls the growth rate of the local neighborhoods for each observation $X_i$. It should be selected such that at each iteration, the number of data points within a distance of $h^{(k)}$ from $X_i$ grows geometrically with a factor $a_{grow}$.
The maximal bandwidth $h^*$ can be taken very large, but since it is involved in the termination condition of the algorithm, a smaller value reduces the computational cost. Data-driven optimal stopping can be decided via cross-validation, for example.

(d) Parameter $\lambda$. This is the most important parameter of the method, because it scales the statistical penalty $s_{ij}$ and thus directly controls the sensitivity of the method to local changes. Small values of $\lambda$ lead to overpenalization, which may result in unstable predictions in homogeneous regions (potentially too many breakpoints). On the other hand, large values of $\lambda$ oversmooth the data (less sensitivity to changes). Theoretical arguments are given in [28] to support the choice $\lambda = t_{0.985}(\chi^2_1)$, the 0.985 quantile of a chi-squared distribution with 1 degree of freedom.

2.3.2 Aberrations detection

Array CGH smoothing methods such as AWS do not completely solve the problem of detecting the gained and lost DNA segments. There are regions with an estimated Log2 Ratio very close to 0 which we might not want to call aberrations, since the deviation might still be due to noise.

Figure 2.9: Histogram of smoothed Log2 Ratios from an arrayCGH experiment along the whole genome (x-axis: LogRatio, y-axis: Frequency). The distribution is approximately Gaussian with mean zero.

The problem has been addressed in several publications, typically by choosing appropriate cutoffs $c$ and calling a region aberrant if $|\text{Log2 Ratio}| > c$. A very popular cutoff is $c = 0.2$, or, as proposed by Nakao et al. in [25], $c = 0.225$. Other, more adaptive cutoffs take into account the genome-wide standard deviation $\sigma$ of the Log2 Ratios. The cutoff is then chosen as, for example, $c = 1.3\sigma$. Hupé et al. [16] cluster the smoothed Log2 Ratios, take the cluster of maximum cardinality as the normal level, and assign all others to gains or losses. We developed our own method to determine the thresholds.
Typically, we select the segments with values significantly different from zero. Therefore, the problem reduces to assessing significance for each homogeneous region. We also decided to compute separate cutoffs for each array. The reason behind this decision is that each array is a stand-alone experiment, and thus the errors introduced might have different magnitudes. In order to get an intuition on the distribution of the GLAD estimates, we analyzed the corresponding histograms, which in the majority of the cases had the shape of Gaussians centered around zero (see Figure 2.9). This observation agrees with the expectation that genes with a normal copy number are more frequent than genes in gained (or lost) regions when the entire genome is considered.

Figure 2.10: QQ-plot of the smoothed Log2 Ratios from an arrayCGH experiment along the whole genome. The x-axis shows theoretical quantiles of a standard normal distribution and the y-axis shows sample quantiles. The red line is determined by the 25% and the 75% percentiles and corresponds to a robust normal.

Figure 2.11: Robust normal fitted to the smoothed Log2 Ratios (the red curve). It passes through the first and the third quartiles of the data. The blue vertical lines are the loss (left) and gain (right) cutoffs.

We gain more information about the underlying distribution by analyzing quantile-quantile plots. For the purpose of comparing a sample distribution against a Gaussian, theoretical quantiles of a standard normal are plotted against sample quantiles (see Figure 2.10). If the sample is Gaussian, then the plotted pairs should follow a linear trend. As observed in almost all arrays, the middle section of the data (in the order statistic) can clearly be considered normal, while the tails deviate significantly. The red line in Figure 2.10 is defined by two points in the plane of theoretical versus sample quantiles, which correspond to the first (0.25) and the third (0.75) quartiles.
The quartiles are values that divide a distribution, or a sample from a distribution, into four equal parts. The Gaussian distribution with the same first and third quartiles as the given sample is unique and is called a robust normal, because it does not depend on the first and the last 25% of the data (in the order statistic). Figure 2.11 shows this normal distribution fitted to the data. The cutoffs that determine gains and losses in copy number are placed two standard deviations to the left and to the right of the mean of the fitted robust normal. Figure 2.11 shows these cutoffs (the blue lines).

The algorithm presented above for computing thresholds that separate amplified or deleted regions from normal regions is based on observations and heuristics. No theoretical arguments to assess its performance on data sets other than the ones we studied are available yet. However, the method is highly intuitive and robust, and it worked well on the Prostate Cancer and Brain Tumor data sets (see Chapter 4 for validation results).

Each region in the genome with constant copy number is assigned a status: lost, gained or normal. A natural assignment of -1, 1 or 0 to the genes in the corresponding regions reflects their status. We thus obtain arrays with values in {-1, 0, 1} and size equal to that of the initial arrays. Moreover, the probes that were not available on the initial arrays (for example spots that could not be resolved by the image analysis) can now be assigned a status. This is done by first locating them within a segment of constant copy number (such a segment exists and is unique, since the segments define a partition of the genome) and then assigning the respective status to the gene. In what follows, we will refer to these {-1, 0, 1} arrays as status arrays, or gain/loss arrays.

2.3.3 Region selection

Each cancer type has specific amplifications and deletions that can be determined via a consensus analysis of multiple CGH arrays.
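The threshold rule of Section 2.3.2 (a robust normal through the first and third quartiles, with cutoffs two standard deviations from its mean) can be illustrated with a small sketch. This is a minimal Python illustration under stated assumptions, not the R code used in this work: the function names `robust_normal_cutoffs` and `status_array`, the toy values, and the "inclusive" quartile convention are all assumptions made for the example.

```python
from statistics import NormalDist, quantiles


def robust_normal_cutoffs(smoothed, n_sd=2.0):
    """Return (loss_cutoff, gain_cutoff) from a robust normal fit.

    The robust normal is the Gaussian that shares the first and third
    quartiles with the sample; the cutoffs lie n_sd standard deviations
    to the left and right of its mean.
    """
    q1, _, q3 = quantiles(smoothed, n=4, method="inclusive")
    z75 = NormalDist().inv_cdf(0.75)      # third quartile of N(0, 1), about 0.6745
    mu = (q1 + q3) / 2.0                  # a Gaussian is symmetric around its mean
    sigma = (q3 - q1) / (2.0 * z75)       # IQR(sample) / IQR(N(0, 1))
    return mu - n_sd * sigma, mu + n_sd * sigma


def status_array(smoothed, loss_cutoff, gain_cutoff):
    """Map smoothed Log2 Ratios to -1 (loss), 1 (gain) or 0 (normal)."""
    return [-1 if t < loss_cutoff else 1 if t > gain_cutoff else 0
            for t in smoothed]


# Toy smoothed values (hypothetical): mostly near zero, one loss, one gain.
smoothed = [-0.9, -0.05, -0.05, 0.0, 0.0, 0.0, 0.05, 0.05, 0.8]
loss_cutoff, gain_cutoff = robust_normal_cutoffs(smoothed)
print(loss_cutoff, gain_cutoff)
print(status_array(smoothed, loss_cutoff, gain_cutoff))
```

Because only the two quartiles enter the fit, the extreme values -0.9 and 0.8 do not influence the cutoffs, which is the sense in which the fitted normal is robust.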
We have created an automated method that selects highly recurrent alterations across many samples of the same tumor type. The problem that needs to be solved is that of deciding how many arrays must show a region as amplified (or, respectively, deleted) for it to be selected among the representative mutations of the disease. A second issue is that of determining the boundaries of such regions when the samples do not agree on a clear start or end position. This disagreement can also be due to biological causes, not only to our method of selecting altered segments. To our knowledge, there is no proposed automated method for selecting highly recurrent regions across arrayCGH samples. However, from a biological point of view, all mutations are interesting for further analysis, even those that occur in a small fraction of the arrays. These regions may show interesting particularities of the disease that may be correlated with individual genotypic or phenotypic characteristics.

Our solution to the region detection problem is to consider the number of arrays in which a certain gene is simultaneously gained (or lost, respectively) as a random variable $Z$, and then to associate a p-value with each observation of this variable. The p-values can be estimated by analyzing the distribution of $Z$ along the genome. In what follows, we describe only the algorithm that computes p-values for the gains; the procedure is similar for losses.

Formally, let $n$ be the number of arrays in the data set. We associate the random variable $Z_i$ with each array $i$, representing the status of a random gene in the array: $Z_i = 1$ if the gene is gained and $Z_i = 0$ if the status is normal. We can estimate the distribution of each $Z_i$. Define $p_i = \Pr(Z_i = 1)$, which implies $\Pr(Z_i = 0) = 1 - p_i$. We estimate $p_i$ by the frequency of gains along the genome in array $i$:

\[ \hat p_i = \frac{\text{number of gained genes}}{\text{total number of genes}} \]

Figure 2.12: P-values measuring the significance of each gene in terms of the number of arrays in which it is amplified (example). The x-axis shows the number of arrays. The y-axis shows the associated p-value. The blue horizontal line is the significance cutoff.

Observe that $Z = Z_1 + Z_2 + \ldots + Z_n$. Since all $Z_i$ are Bernoulli distributed, $Z_i \sim \mathrm{Bernoulli}(\hat p_i)$, and assuming that they are independent, for any given $k \in \{0, \ldots, n\}$ the probability $\Pr(Z = k)$ can be computed as:

\[ \Pr(Z = k) = \Pr(Z_1 + \ldots + Z_n = k) = \sum_{\substack{a_1, \ldots, a_n \in \{0,1\} \\ \sum_{i=1}^{n} a_i = k}} \; \prod_{i=1}^{n} \hat p_i^{\,a_i} (1 - \hat p_i)^{1 - a_i} \]

For efficiency, a recursion can be used. Denote by:

\[ P_k(l) = \Pr(Z_1 + \ldots + Z_l = k), \qquad l \in \{1, \ldots, n\},\; k \in \{0, \ldots, n\} \]

Then, the following recurrence holds:

\[ P_k(l) = P_{k-1}(l - 1) \cdot \hat p_l + P_k(l - 1) \cdot (1 - \hat p_l), \qquad \forall l \in \{2, \ldots, n\},\; \forall k \in \{1, \ldots, n\} \]

with starting values for $k = 0$ and $l = 1$:

\[ P_0(l) = \prod_{i=1}^{l} (1 - \hat p_i), \qquad \forall l \in \{1, \ldots, n\} \tag{2.6} \]

\[ P_k(1) = \begin{cases} 1 - \hat p_1 & \text{if } k = 0, \\ \hat p_1 & \text{if } k = 1, \\ 0 & \text{if } k > 1 \end{cases} \qquad \forall k \in \{0, \ldots, n\} \tag{2.7} \]

This gives $\Pr(Z = k) = \Pr(Z_1 + \ldots + Z_n = k) = P_k(n)$ for all $k \in \{0, \ldots, n\}$. The p-value associated with observation $k$ of variable $Z$ is:

\[ \text{p-value}(k) = \Pr(Z \geq k) = \sum_{i=k}^{n} \Pr(Z = i) \]

An occurrence of $k$ simultaneous gains at a certain position is unlikely to happen by chance if the corresponding p-value is small. The p-value cutoff is a parameter of the method. A very small cutoff results in a small number of highly recurrent selected regions, potentially leaving out relevant mutations. However, if the cutoff is large, many regions are selected, which could contain and therefore mask highly relevant ones.

Given the property of the mutations to be locally constant, the selection method outputs contiguous regions that are significantly amplified across the arrays. In what follows, we will refer to the selected mutated regions (both gains and losses) as genetic events. We say that an event has occurred in a sample if there is at least one gene gained (lost, respectively) in the corresponding gained (lost) region. Formally, let $V = \{1, \ldots, l\}$ be the set of the $l$ events output by the method.
With each gain/loss array $s$ we can associate a binary array $p_s \in \{0, 1\}^l$ of length $l$, such that:

\[ p_s[i] = \begin{cases} 1 & \text{if event } i \text{ has occurred in } s \\ 0 & \text{if event } i \text{ has not occurred in } s \end{cases} \tag{2.8} \]

We will refer to these binary arrays as genetic patterns.

2.3.4 ArrayCGH analysis algorithm

We summarize our arrayCGH analysis method below.

Input: a set of $n$ CGH arrays $(X, Y_k)_{1 \leq k \leq n}$, where $X = (X_1, \ldots, X_N)$ gives the genome positions of the microarray spots and $Y_k = (Y_k^1, \ldots, Y_k^N)$ are the Log2 Ratios of the $k$-th experiment.

Output: a list of genetic events $E = (e_1, \ldots, e_l)$ and a corresponding list of genetic patterns $P = (p_1, \ldots, p_n)$, $p_i \in \{0, 1\}^l$. Each genetic event is specified by its chromosome number, start and end position and type of mutation.

Parameters: smoothing parameters $\alpha$ (discussed in the previous section), p-value cutoff $\beta$.

ArrayCGH analysis algorithm

1. ArrayCGH smoothing
   For each array $k \in \{1, \ldots, n\}$:
   Apply glad($\alpha$) to compute the smoothed Log2 Ratios $\Theta_k = (\theta_k^1, \ldots, \theta_k^N)$.

2. Aberrations detection
   For each array $k \in \{1, \ldots, n\}$:
   a) Compute the first and the third quartiles of the smoothed Log2 Ratios $\Theta_k$.
   b) Compute the mean $\mu_k$ and the standard deviation $\sigma_k$ of a robust Gaussian distribution with the same first and third quartiles as obtained in a).
   c) Compute the gain cutoff $g_k$ and the loss cutoff $l_k$:
      \[ g_k = \mu_k + 2\sigma_k, \qquad l_k = \mu_k - 2\sigma_k \]
   d) Compute the status arrays $s_k = (s_k^1, \ldots, s_k^N)$:
      \[ s_k^i = \begin{cases} -1, & \theta_k^i < l_k; \\ 1, & \theta_k^i > g_k; \\ 0, & \text{otherwise,} \end{cases} \qquad i \in \{1, \ldots, N\} \]

3. Region selection
   Compute the p-values associated with gains and losses, $p_{gain}$ and $p_{loss}$, as suggested in (2.6).
   For all positions $j \in \{1, \ldots, N\}$:
   If $p_{gain}(|\{k : s_k^j = 1,\; 1 \leq k \leq n\}|) < \beta$ then select position $j$ as gain.
   If $p_{loss}(|\{k : s_k^j = -1,\; 1 \leq k \leq n\}|) < \beta$ then select position $j$ as loss.
   Output the list of regions of contiguous loss or gain $E = (e_1, \ldots, e_l)$.
   Compute the genetic patterns $P = (p_1, \ldots, p_n)$ as in (2.8).
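Step 3 of the algorithm can be made concrete with a short sketch: the recursion for the distribution of the number of gained arrays, the tail p-values, the selection of contiguous regions and the resulting genetic patterns. The code below is a minimal Python illustration, not the implementation used in this work. The function names, the toy status arrays and the cutoff value are hypothetical, and for brevity the sketch treats all positions as lying on a single chromosome, so regions are not split at chromosome boundaries.

```python
def count_distribution(p_hat):
    """Pr(Z = k), k = 0..n, for a sum of independent Bernoulli(p_hat[i])."""
    dist = [1.0]                              # zero arrays: Z = 0 with probability 1
    for p in p_hat:                           # add one array at a time (the recursion)
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)        # the new array is not gained
            new[k + 1] += mass * p            # the new array is gained
        dist = new
    return dist


def p_values(p_hat):
    """p_value[k] = Pr(Z >= k) for k = 0..n."""
    dist = count_distribution(p_hat)
    out, tail = [0.0] * len(dist), 0.0
    for k in range(len(dist) - 1, -1, -1):
        tail += dist[k]
        out[k] = tail
    return out


def select_regions(status, beta, sign=1):
    """Contiguous index ranges (start, end), inclusive, whose count of arrays
    with the given status (1 = gain, -1 = loss) has a p-value below beta."""
    n_pos = len(status[0])
    p_hat = [sum(1 for s in arr if s == sign) / n_pos for arr in status]
    pv = p_values(p_hat)
    counts = [sum(1 for arr in status if arr[j] == sign) for j in range(n_pos)]
    selected = [pv[c] < beta for c in counts]
    regions, start = [], None
    for j, flag in enumerate(selected + [False]):   # sentinel closes the last region
        if flag and start is None:
            start = j
        elif not flag and start is not None:
            regions.append((start, j - 1))
            start = None
    return regions


def genetic_patterns(status, regions, sign=1):
    """Binary pattern per array: 1 if at least one position in the region is hit."""
    return [[int(any(arr[j] == sign for j in range(a, b + 1))) for a, b in regions]
            for arr in status]


# Toy status arrays (hypothetical): 4 arrays, 10 positions.
status = [
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
gain_regions = select_regions(status, beta=0.2)
print(gain_regions)
print(genetic_patterns(status, gain_regions))
```

The iterative update in `count_distribution` costs on the order of n squared operations, compared with the exponential number of terms in the direct sum over all 0/1 assignments.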
Chapter 3

Genetic Tumor Progression

The problem of choosing appropriate treatments for cancer patients is currently addressed based almost exclusively on clinical measurements, such as the age and sex of the patient, the tumor volume, the lymph node spread, and the presence or absence of metastasis. The stage of the tumor and the time until the death of a patient or until a relapse can be better determined if additional biological markers are analyzed. Genetic mutations such as chromosomal amplifications and deletions accumulate during cancer progression in preferred orders, allowing a more precise stage prediction. These mutations are now measurable with higher precision via arrayCGH technology. The accumulation process can be modeled using oncogenetic tree mixtures, as proposed by Beerenwinkel et al. [30].

In what follows, we introduce the formal statistical model underlying the oncogenetic trees. A measure of the progression status of a tumor, computed based on the tree mixture model, is presented in Section 3.2.

3.1 Oncogenetic Trees

Oncogenetic trees model the order in which permanent genetic changes occur during cancer evolution. Typically, the genetic events targeted are amplifications and deletions. They are represented as the nodes of the trees, and each edge in the tree is labeled with the conditional probability that the child event occurs given that the parent event has occurred. The topology of the tree mixture and the model parameters are learned from a set of genetic patterns, which can be the result of arrayCGH data analysis, as in our case.

3.1.1 Formal model

Let $V = \{1, \ldots, l\}$ be a set of genetic events, to which we artificially add a null event that occurs with probability 1. Consider also a set of $n$ genetic patterns which describe subsets of observed events. Each genetic pattern is modified by adding a 1 before the first position, to indicate the occurrence of the null event.
The set of all $n$ observed patterns can then be represented by a binary matrix of dimension $n \times (l + 1)$:

\[ X = (x_{ij})_{\substack{1 \leq i \leq n \\ 1 \leq j \leq l}} \quad \text{with} \quad x_{ij} = \begin{cases} 1, & \text{if the } j\text{-th genetic event occurs in the } i\text{-th pattern;} \\ 0, & \text{otherwise.} \end{cases} \]

Each event $j$ has an associated binary random variable $Z_j$ that indicates the occurrence of the event in a genetic pattern. Therefore, column $j$ of the pattern matrix is a set of $n$ observations of the variable $Z_j$.

Formally, an oncogenetic tree $T = (V, E, r, p)$ has the genetic events as vertices, with the null event $r$ as root. The edges in $E$ are labeled with conditional probabilities $p : E \to [0, 1]$, such that for an edge $e = (u, v) \in E$, $p(e) = \Pr(Z_v = 1 \mid Z_u = 1)$ is the conditional probability of event $v$ given event $u$. Oncogenetic trees estimate the joint distribution of the events based on the observed genetic patterns.

An oncogenetic tree induces a probability distribution over the set of all possible patterns $\Omega = \{0, 1\}^l$. Let $x$ be a pattern and $S \subseteq V$ the set of events present in $x$. If there exists a subset $E' \subseteq E$ such that $S$ is the set of all vertices reachable from $r$ in the induced subtree $T_x = (V[E'], E')$, then $x$ can be generated by $T$ with the positive probability:

\[ \Pr(x \mid T) = \prod_{e \in E'} p(e) \cdot \prod_{e \in E \cap (S \times (V \setminus S))} (1 - p(e)). \]

Otherwise, if there is no such edge subset $E'$, pattern $x$ cannot be generated by $T$ and $\Pr(x \mid T) = 0$.

Learning oncogenetic trees

An algorithm for fitting a tree that approximates the multivariate distribution of the patterns is presented by Desper et al. in [7]. The tree is constructed as a maximum weight branching in a complete graph on $l + 1$ vertices, using Edmonds' algorithm in $O(|V||E|)$ time. The weight of an edge $e = (u, v)$ is given by:

\[ w(e) = \log \left( \frac{\Pr(u)}{\Pr(u) + \Pr(v)} \cdot \frac{\Pr(u, v)}{\Pr(u)\Pr(v)} \right) = \log \Pr(u, v) - \log(\Pr(u) + \Pr(v)) - \log \Pr(v) \]

where $\Pr(u)$ is the marginal probability of event $u$ and $\Pr(u, v)$ is the joint probability of events $u$ and $v$.
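Both quantities defined above, the pattern likelihood $\Pr(x \mid T)$ and the edge weight $w(e)$, are easy to compute once a tree is fixed. The following minimal Python sketch illustrates them under stated assumptions: the tree is stored as a `parent` dictionary with the null event as vertex 0, the edge probabilities are indexed by the child vertex, and the three-event toy tree and its probabilities are hypothetical.

```python
import math


def tree_likelihood(parent, prob, present):
    """Pr(x | T) for the pattern whose set of occurred events is `present`.

    parent[v] is the parent of event v (0 is the null event / root) and
    prob[v] is the conditional probability on the edge (parent[v], v).
    """
    s = set(present) | {0}                 # the null event always occurs
    lik = 1.0
    for v, u in parent.items():
        if v in s:
            if u not in s:
                return 0.0                 # v occurred without its parent
            lik *= prob[v]                 # edge used to reach v
        elif u in s:
            lik *= 1.0 - prob[v]           # edge leaving S that was not taken
    return lik


def edge_weight(pr_u, pr_v, pr_uv):
    """w(e) for e = (u, v), from marginal and joint event probabilities."""
    return math.log(pr_uv) - math.log(pr_u + pr_v) - math.log(pr_v)


# Toy tree (hypothetical): 0 -> 1 -> 2 and 0 -> 3.
parent = {1: 0, 2: 1, 3: 0}
prob = {1: 0.8, 2: 0.5, 3: 0.3}

print(tree_likelihood(parent, prob, {1, 3}))   # events 1 and 3, but not 2
print(tree_likelihood(parent, prob, {2}))      # event 2 without its parent 1
print(edge_weight(0.5, 0.25, 0.2))
```

A pattern in which a child event occurs without its parent receives likelihood zero, which is exactly the limitation of a single tree that motivates the mixture models.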
Intuitively, the weights reflect the desirability of having event $v$ as a direct descendant of $u$ in the tree. In practice, the joint probabilities are not known, therefore they have to be estimated from the set of observations. For a large data set, the algorithm reconstructs the correct oncogenetic tree with high probability. The estimated model is not necessarily the maximum likelihood one (quantitative statements about the quality of the approximation are given in Desper et al. [7]).

Oncogenetic tree mixtures

In a real-life scenario, the assumption that all observable patterns are generated by a single tree topology is too restrictive in a probabilistic sense. In order to avoid the situation in which an oncogenetic tree estimated from a set of patterns does not generate all of them with positive probability, tree mixtures have been proposed by Beerenwinkel et al. [1]. Typically, the first tree in the mixture is called the noise component and has a star topology, which allows all patterns to have a positive likelihood. Figure 3.1 shows an example of an oncogenetic tree mixture.

Formally, we define a $K$-oncogenetic tree mixture $\mathcal{M}$ as a collection of $K$ oncogenetic trees $T_k = (V, E_k, r, p_k)$ that induces a mixed distribution on the pattern space:

\[ \mathcal{M} = \sum_{k=1}^{K} \alpha_k T_k \]

with $\alpha_k \in [0, 1]$ and $\sum_{k=1}^{K} \alpha_k = 1$. Consequently, the likelihood of a pattern $x$ in the mixture model is given by:

\[ \Pr(x \mid \mathcal{M}) = \sum_{k=1}^{K} \alpha_k \Pr(x \mid T_k) \]

Learning oncogenetic tree mixtures

Given the number $K$ of trees, the tree mixture has to be reconstructed based on the observed set of patterns $X$. This translates into estimating the parameters $\alpha_k$ and the trees $T_k$. Assume that, for each pattern, we know the tree component that generated it. In this case, we can use the algorithm described above $K$ times to reconstruct all trees from the corresponding subsets of patterns. But these assignments, the so-called responsibilities of the components for the patterns, are not known, therefore they have to be estimated from the data.
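Given current mixture weights and the likelihoods of one pattern under each tree, the mixture likelihood and the responsibilities follow directly from Bayes' rule. The sketch below is a minimal Python illustration under stated assumptions: the function names are hypothetical, the weights 0.08 and 0.92 are taken from the example of Figure 3.1, and the per-tree likelihoods 0.01 and 0.12 are made-up values for a single pattern.

```python
def mixture_likelihood(alphas, tree_liks):
    """Pr(x | M) = sum_k alpha_k * Pr(x | T_k)."""
    return sum(a * lik for a, lik in zip(alphas, tree_liks))


def responsibilities(alphas, tree_liks):
    """Posterior probability that each tree component generated the pattern."""
    total = mixture_likelihood(alphas, tree_liks)
    if total == 0.0:
        raise ValueError("pattern has zero likelihood under every component")
    return [a * lik / total for a, lik in zip(alphas, tree_liks)]


alphas = [0.08, 0.92]          # star (noise) component, nontrivial component
tree_liks = [0.01, 0.12]       # hypothetical Pr(x | T_1), Pr(x | T_2)
print(mixture_likelihood(alphas, tree_liks))
print(responsibilities(alphas, tree_liks))
```

Since the star component assigns positive likelihood to every pattern, the denominator is positive whenever that component has a positive weight, so the guard only matters for mixtures without a noise component.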
This procedure results in an EM-like algorithm (Dempster et al. [6]).

Figure 3.1: Oncogenetic tree mixture example. 8% of the samples are best explained by independence of the events. With probability 0.92, a genetic pattern is generated by the nontrivial tree component. The edges of the trees are labeled with the conditional probabilities between the events, the confidence intervals and the bootstrap sample counts.

We present below the main steps of the algorithm, as given in [30]:

1. Guess initial responsibilities: Run the $(K-1)$-means clustering algorithm on all patterns and set the responsibilities according to the clustering results.
2. Maximization-like step: Estimate the star tree $T_1$ and the other components $T_2, \ldots, T_K$ with Edmonds' algorithm from all events weighted with their responsibilities. Compute the mixture parameters from the sums of responsibilities.
3. Expectation step: Compute the new responsibilities of all patterns from their likelihoods with respect to the tree components.

The only parameter that still needs to be estimated is the optimum number of trees $K$. This can be done via cross-validation, by trying out several values for $K$ and then choosing the simplest model within one standard deviation of the one that achieves the maximum mean log-likelihood.

The presented algorithm differs from the traditional EM algorithm in that the maximization step does not provide an ML estimate. Moreover, convergence to a local maximum of the log-likelihood is not guaranteed. The detailed algorithm is given below:

EM-like algorithm for learning K-oncogenetic tree mixtures

1. INPUT
   • Patterns of events $X = (x_{ij})_{1 \leq i \leq N,\; 1 \leq j \leq l}$
   • Number of oncogenetic trees $K \geq 2$

2. OUTPUT
   • $K$-oncogenetic tree mixture model $\sum_{k=1}^{K} \alpha_k T_k$

3. PROCEDURE

   1. Guess initial responsibilities:
      (a) Run the $(K-1)$-means clustering algorithm.
      (b) Set the responsibilities
          \[ \gamma_{ik} = \begin{cases} \frac{1}{2}, & \text{if } x_i \text{ is in cluster } k - 1; \\ \frac{1}{2(K-1)}, & \text{else.} \end{cases} \]

   2. M-like step. Update the model parameters:
      Set $N_k = \sum_{i=1}^{N} \gamma_{ik}$ for all $k = 1, \ldots, K$.
      Let $T_1$ be a star with edge weights
          \[ \beta = \frac{1}{l N_1} \sum_{j=1}^{l} \sum_{i=1}^{N} \gamma_{i1} x_{ij} \]
      For $k = 2, \ldots, K$:
      (a) For all pairs of events $(u, v)$, $1 \leq u, v \leq l$, estimate their joint probabilities
          \[ p_k(u, v) = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_{iu} x_{iv}. \]
      (b) Compute the maximum weight branching $T_k$ from the complete digraph with weights $w$ derived from $p_k$.
      (c) Compute the mixture parameter $\alpha_k = \frac{N_k}{N}$.

   3. E-step. Compute the responsibilities:
          \[ \gamma_{ik} = \frac{\alpha_k \Pr(x_i \mid T_k)}{\sum_{m=1}^{K} \alpha_m \Pr(x_i \mid T_m)} \]

   4. Iterate steps 2 and 3 until convergence.

Model stability

Beerenwinkel et al. [30] propose to use bootstrap analysis (Hastie, Tibshirani and Friedman, 2001) for measuring the stability of the trees, i.e. for measuring the dependence of the topology on sampling effects. The task is reduced to single oncogenetic trees. Given the responsibilities $\gamma$ computed with the EM algorithm, resampling with replacement is carried out for each pattern $x_i$ with probability $\gamma_{ik}$. From the bootstrap sample of size $N$, an oncogenetic tree is reconstructed. The procedure is repeated sufficiently many times. As a test statistic, the relative count of each edge $e \in E$ in the bootstrap trees is considered. The second, nontrivial component in Figure 3.1 shows strong evidence of the events 8q13,24+ and 14q12,24 as initial events, and, for example, of the succession 1q21-23+ → 18q21,23-.

3.2 Genetic Progression Score

Various clinical and histological markers have been proposed and used for determining the progression status of human tumors. The main applications are the prediction of survival time and the selection of a suitable therapy for every patient. Scores that measure the progression status of tumor samples can be computed from their associated genetic patterns. Many naive scores assume independence and cumulative effects of the genetic events, assumptions that do not hold in general.
The genetic progression score (GPS) proposed in [1] integrates dependencies between events by using oncogenetic tree mixture models. In what follows, we formally present the GPS and briefly discuss its predictive power.

Time stamps are added to the oncogenetic trees in order to express the accumulation of genetic events not only in terms of preferred order, but also in terms of the time intervals at which subsequent events occur. Given this supplementary information, we can estimate the "age" of a tumor and also give a prediction of the survival time.

A timed oncogenetic tree can be obtained by assuming independent Poisson processes for the occurrence of events on the tree edges and for the sampling time of the tumor (i.e. the time from onset until the tumor is analyzed). Denote by $T_i$ the waiting time of event $i$, representing the difference between the occurrence time of the parent event $pa(i)$ and that of the event $i$ itself. Assume that $T_i$ is exponentially distributed with parameter $\lambda_i$ and let the sampling time of the tumor $T_s$ be exponentially distributed with parameter $\lambda_s$. We want to label all edges $(pa(i), i)$ in the tree with the estimated value of the waiting time $T_i$. Observe that, if $X$ and $Y$ are independent and exponentially distributed with parameters $\lambda$ and $\mu$ respectively, the following relations hold:

\[ E[X] = \frac{1}{\lambda} \tag{3.1} \]

\[ \Pr(X \geq t) = \int_t^{\infty} \lambda e^{-\lambda x}\, dx = e^{-\lambda t}, \qquad \forall t > 0 \tag{3.2} \]

\[
\begin{aligned}
\Pr(X \geq \alpha + \beta \mid X \geq \beta) &= \frac{\Pr(X \geq \alpha + \beta \wedge X \geq \beta)}{\Pr(X \geq \beta)} \\
&= \frac{\Pr(X \geq \alpha + \beta)}{\Pr(X \geq \beta)} \\
&\overset{(3.2)}{=} \frac{e^{-\lambda(\alpha + \beta)}}{e^{-\lambda\beta}} = e^{-\lambda\alpha} \\
&= \Pr(X \geq \alpha), \qquad \forall \alpha, \beta > 0
\end{aligned}
\tag{3.3}
\]

and

\[
\begin{aligned}
\Pr(X \geq Y) &= \int_0^{\infty} \left( \int_y^{\infty} \lambda e^{-\lambda x}\, dx \right) \mu e^{-\mu y}\, dy \\
&= \mu \int_0^{\infty} e^{-(\lambda + \mu) y}\, dy \\
&= \frac{\mu}{\lambda + \mu}
\end{aligned}
\tag{3.4}
\]

The relations above are used for computing the expected waiting time $E[T_i]$. Denote by $T_{[i]}$ the cumulative time until the occurrence of event $i$.
Then

\[
\begin{aligned}
p_i = \Pr(Z_i = 1 \mid Z_{pa(i)} = 1) &= \Pr(T_s \geq T_i + T_{[pa(i)]} \mid T_s \geq T_{[pa(i)]}) \\
&\overset{(3.3)}{=} \Pr(T_s \geq T_i) \\
&\overset{(3.4)}{=} \frac{\lambda_i}{\lambda_i + \lambda_s}
\end{aligned}
\tag{3.5}
\]

Therefore,

\[ E[T_i] \overset{(3.1)}{=} \frac{1}{\lambda_i} \overset{(3.5)}{=} \frac{1 - p_i}{p_i} \cdot \frac{1}{\lambda_s} = \frac{1 - p_i}{p_i}\, E[T_s] \tag{3.6} \]

The tumor age at the time of sampling is not known, in general. Thus, the parameter $\lambda_s$ cannot be estimated from the data and the expected waiting time $E[T_i]$ cannot be scaled to the true time scale of the oncogenetic process. We define unitless waiting times $E[T_i]$ by normalizing the mean sampling time to $E[T_s] = \frac{1}{\lambda_s} = 1$. The expected times $E[T_i]$ can therefore be explicitly calculated from the oncogenetic trees using formula (3.6).

In what follows, we show how to extend the computations to estimating the waiting time for a genetic pattern $x = (x_1, \ldots, x_l)$. Intuitively, waiting times accumulate when traversing the tree from the root towards the leaves. However, in general, the expectation of the resulting random variable cannot be computed in closed form. Therefore, the proposed solution is to simulate the waiting process along the tree $n_{sim}$ times (typically $n_{sim} \geq 10^6$) by drawing observations from the variables $T_i \sim \exp(\lambda_s p_i / (1 - p_i))$ on all tree edges. For each simulation, a natural consistency rule is applied to filter out the cases in which there exist events $i$ and $j$ such that $x_i = 1$ and $x_j = 0$, but the observed waiting time for event $i$ is larger than the waiting time for event $j$. The observed waiting time for an event $i$ is the sum of all simulated waiting times of the events lying on the path from $i$ to the root.

Figure 3.2: Estimating waiting times for patterns in timed oncogenetic trees. For the pattern $x$ in which all events occurred, i.e. $x = (1, 1, 1, 1)$, first observations $t_i$ are drawn from exponential distributions with parameters $\lambda_s p_i / (1 - p_i)$ and a cumulative waiting time is computed as $t = \max(t_1 + t_3, t_2)$. In general, waiting times of subsequent events are added and the maximum of the cumulative times of events in different subtrees is chosen. In cases of inconsistency, a NULL waiting time is returned.

For all consistent simulations, the cumulative rule summarized in Figure 3.2 is applied to obtain an observation of the waiting time for $x$. The expected waiting time of the pattern is finally estimated as the average of all observed waiting times. We refer to this unitless waiting time as the GPS of the pattern. Thus, the GPS reflects the progression of tumor development along the oncogenetic tree model of genetic aberrations. For the sample $x$, we define:

\[ \mathrm{GPS}(x) = E_{\mathcal{M}}(T_x), \]

where $T_x$ denotes the waiting time until pattern $x$ and the expectation is taken with respect to the distribution induced by the underlying oncogenetic tree mixture model $\mathcal{M}$. Figure 3.3 shows the nontrivial oncogenetic tree from Figure 3.1, annotated with waiting times.

Figure 3.3: Example of a timed oncogenetic tree. Each edge is annotated with the expected waiting time until the occurrence of the child, once the parent has occurred. For this example, the mean sampling time was assumed to be $E[T_s] = 1/\lambda_s = 100$, i.e. $\lambda_s = 0.01$.

In [30], the GPS was computed for tumor samples from Prostate Cancer and Glioblastomas. In both cases, the results showed that the GPS has prognostic value, i.e. it can be used to differentiate patient subgroups with respect to the expected clinical outcome. For example, in the prostate cancer case, the patients with GPS < 1 have a significantly longer time to PSA (prostate specific antigen) recurrence than the ones with GPS > 1. The prostate specific antigen is a substance produced by the prostate that may be found in an increased amount in the blood of patients who have prostate cancer, and it is widely used as an indicator of the presence of the disease. Moreover, the GPS can improve performance over established histopathological parameters, such as the Gleason score in the prostate cancer case. The Gleason score is a measure of tumor aggressiveness that ranges between 2 (the least aggressive) and 10 (the most aggressive).
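The simulation-based estimate of the GPS described above can be sketched for a single timed tree as follows. This is a minimal Python illustration of the procedure as stated in the text (draw exponential waiting times on all edges, discard inconsistent simulations, average the cumulative waiting times), not the implementation of [30]. The function names, the toy tree and the small number of simulations are assumptions, and all edge probabilities are assumed to lie strictly between 0 and 1.

```python
import random


def cumulative_times(parent, waits):
    """Occurrence time of every event: sum of waiting times on the path to the root."""
    cum = {0: 0.0}                                   # the null event occurs at time 0

    def visit(v):
        if v not in cum:
            cum[v] = visit(parent[v]) + waits[v]
        return cum[v]

    for v in parent:
        visit(v)
    return cum


def estimate_gps(parent, prob, present, n_sim=20000, lambda_s=1.0, seed=0):
    """Average waiting time of a pattern over all consistent simulations."""
    rng = random.Random(seed)
    present = set(present)
    absent = set(parent) - present
    rates = {v: lambda_s * p / (1.0 - p) for v, p in prob.items()}
    total, kept = 0.0, 0
    for _ in range(n_sim):
        waits = {v: rng.expovariate(rates[v]) for v in parent}
        cum = cumulative_times(parent, waits)
        latest_present = max((cum[v] for v in present), default=0.0)
        earliest_absent = min((cum[v] for v in absent), default=float("inf"))
        if latest_present > earliest_absent:
            continue                                 # inconsistent: an absent event came first
        total += latest_present                      # maximum over the occurred events
        kept += 1
    return total / kept if kept else None            # None plays the role of NULL


# Toy tree (hypothetical): 0 -> 1 -> 3 and 0 -> 2, as in the layout of Figure 3.2.
parent = {1: 0, 2: 0, 3: 1}
prob = {1: 0.8, 2: 0.5, 3: 0.6}
print(estimate_gps(parent, prob, {1, 2, 3}, n_sim=2000))
print(estimate_gps(parent, prob, {1}, n_sim=2000))
```

For a tree with a single event of probability 0.5 and the pattern in which that event occurred, the estimate should be close to (1 - 0.5) / 0.5 = 1, in agreement with formula (3.6) for a unit mean sampling time.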
For the largest group of tumors, with an average Gleason score of 7, experiments showed that the GPS further identifies subgroups with different prognosis with respect to time to relapse after surgery.

Chapter 4

Application to Cancer Data

We tested our algorithm for the analysis of arrayCGH data on two array CGH data sets, representing experiments on prostate and glioblastoma tumors. This chapter presents the results of the two applications, starting with the description of the input data and the technical details of the implementation, followed by the presentation, interpretation and validation of the results.

4.1 Array CGH data sets

Prostate Cancer

The prostate cancer data set was made available by the Department of Urology and Pediatric Urology of the University of Saarland, via a collaboration with Prof. Dr. Bernd Wullich and Dr. Jörn Kamradt. It consists of 17 array CGH experiments on tumors belonging to 4 prostate cancer cell lines: PC3, DU145, LNCaP and CWR22. Cell lines are populations of cells derived from a single ancestor cell and grown in the laboratory. In cancer research, a parent cell from a tumor tissue is cultivated in vitro or in vivo (typically implanted and grown in mice). The resulting cells are genetically identical due to the common ancestor, therefore arrayCGH experiments on cell lines have high quality.

The PC3 cell line was generated from a skeletal (bone) metastasis and DU145 from a brain metastasis. Both represent very advanced forms of cancer and are highly resistant to therapies. The LNCaP cells originate from a lymph node metastatic lesion of human prostatic carcinoma and have been widely used in the study of prostate cancer. CWR22 derives from a primary prostate tumor with bone metastasis and a Gleason score of 9, on a scale of tumor aggressiveness that ranges from 2 (the least aggressive) to 10 (the most aggressive).
It is shown in the literature that prostate tumors are characterized by an increased level of male-specific hormones called androgens, and that they evolve from an androgen-dependent early stage, which responds well to antiandrogenic treatments, to advanced stages which are androgen-independent. Cell lines PC3 and DU145 are androgen-independent, while LNCaP and CWR22 are androgen-dependent. Due to the different metastasis locations and stages of evolution of the 4 cell lines, they are expected to exhibit different patterns of accumulation of genetic mutations. The array CGH analysis in Section 4.3 will show these differences.

Glioblastomas

The brain tumor data set was published by Markus Bredel et al. in 2005 and is available at the Stanford Microarray Database (http://smd.stanford.edu/cgi-bin/publication/viewPublication.pl?pub_no=182). The description of the experiment parameters can be found in [4]. Out of the 54 arrays, we selected the 26 that refer to glioblastomas, a type of malignant brain tumor that grows rapidly and is fatal in most cases. It is known that the cells of this type of tumor are genetically very different from healthy cells. The resolution of the experiments is approximately 40,000 genes, of which 36,000 were mapped to a genome position. Normalized Log2 Ratios, locuslink IDs and gene names are available for each spot on the array. Section 4.4 presents the results of our statistical analysis of the array CGH experiments on the glioblastomas data set.

4.2 Implementation

We implemented our method in R (version 2.1.1), a software environment for statistical computing and graphics (http://www.r-project.org) that is often used in Bioinformatics applications. R is a high-level programming language, suitable for our purposes since it allows easy and optimized computations with large matrices. Moreover, a comprehensive collection of packages implementing most of the frequently used statistical learning algorithms is available. Below we describe the main steps of the application, together with the packages and functions used (see Chapter 2, Section 2.3.4 for the algorithm).

4.2.1 Implementation steps

1. Input

Each input CGH array is a data frame object that keeps, for each spot on the experimental chip, the following information:
• Chromosome: the chromosome to which the spot is mapped
• PosOrder: an index which gives the relative order of the spots on the corresponding chromosome
• PosBase: the base position on the specified chromosome at which the corresponding sequence begins
• LogRatio: the normalized log2 ratio of the two fluorescence intensities, as output by the array CGH experiment
These fields are mandatory. Additionally, if available, the gene name, locuslink ID or other gene identifiers are useful for a quick interpretation of the results.

2. Array CGH smoothing

The array CGH smoothing step is necessary for noise reduction and amounts to fitting a piecewise constant function to the Log2 Ratios. We used the package GLAD (http://www.bioconductor.org/packages/bioc/stable/src/contrib/html/GLAD.html) which, apart from the main function glad that implements the piecewise constant regression, contains several plotting facilities for a convenient visualization of the results, e.g. cytogenetic bands annotation. Cytogenetic bands are segments of the chromosomes that show different fluorescent intensities as a result of staining. The function glad adapts the more general function aws from the R package AWS (http://cran.r-project.org/src/contrib/Descriptions/aws.html), which implements a local polynomial adaptive weights smoothing method, to the particular setting of array CGH analysis. We used glad with the default parameter settings:

    glad(profileCGH, smoothfunc="lawsglad", model="Gaussian", lkern="Exponential", qlambda=0.985, ...)

• profileCGH: the input array CGH data in the format required above;
• smoothfunc: chooses the piecewise regression method used, in the default case adaptive weights smoothing (aws);
• model: determines the underlying assumption about the distribution of the Log2 Ratios, in the default case Gaussian;
• lkern: chooses the location kernel for the aws function (see the Kl parameter in the parameter discussion section on page 17);
• qlambda: stochastic penalty that tunes the sensitivity of the method to local changes (see parameter λ on page 17).

The function glad calls aws with the following parameters:

    aws(y, x = NULL, p = 0, sigma2 = NULL, qlambda = NULL, eta = 0.5, lkern = "Triangle", hinit = NULL, hincr = NULL, hmax = 10, ...)

• y: the observed Log2 Ratios;
• x: the chromosomal locations of the observations;
• p: degree of the local polynomial used in the regression model; p = 0 for a local constant fit;
• sigma2: estimate of the variance of the model; if NULL, it is estimated from the data;
• qlambda: stochastic penalty, same as in glad;
• eta: memory parameter used to stabilize the procedure, with default value 0.5 (see parameter η on page 17);
• lkern: location kernel, same as in glad, but with a different default value (see parameter Kl on page 17);
• hinit: initial bandwidth for the location penalty (see parameter h(0) on page 17); glad sets this parameter to a default value of 1;
• hincr: factor by which the bandwidth is increased between iterations (see parameter a on page 17); its default value when called by glad is 1.2;
• hmax: maximal bandwidth to be used, set to 10 by default; it determines the number of iterations and is used as the stopping rule (see parameter h∗ on page 17).
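To make the interplay of hinit, hincr and hmax concrete, the following minimal sketch (in Python, rather than the R used for the implementation) lists the bandwidths visited by the iteration. It assumes that the bandwidth starts at hinit, is multiplied by hincr after every iteration, and that the procedure stops as soon as the bandwidth exceeds hmax. This follows the description of h(0), a and h∗ given above; it is not taken from the aws source code, and the function name bandwidth_schedule is hypothetical.

```python
def bandwidth_schedule(hinit=1.0, hincr=1.2, hmax=10.0):
    """List the bandwidths h(0), h(1), ... that do not exceed hmax.

    Assumption: geometric growth h(k+1) = hincr * h(k), and the
    iteration stops once the bandwidth is larger than hmax.
    """
    if hinit <= 0 or hincr <= 1:
        raise ValueError("need hinit > 0 and hincr > 1")
    bandwidths = []
    h = hinit
    while h <= hmax:
        bandwidths.append(h)
        h *= hincr
    return bandwidths


# Values quoted above for the call made by glad.
schedule = bandwidth_schedule(hinit=1.0, hincr=1.2, hmax=10.0)
n_iterations = len(schedule)
```

Under these assumptions, the values quoted above (hinit = 1, hincr = 1.2, hmax = 10) give 13 bandwidths, the largest being about 8.92. A larger hmax therefore means more iterations and smoothing over longer stretches of the chromosome.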
The output of the glad function is a data frame object that contains the complete input information and a new column, Smoothing, which holds the fitted values.

Missing values handling

Array CGH data contain many not available (NA) Log2 Ratio values due to unresolved spot images on the experimental chips. The missing values are nevertheless sparse, and a prediction of their values is useful for a coherent alignment of the arrays in the consensus analysis. We use the smoothing function as a predictor of missing values in a natural way: we assign to a not available Log2 Ratio the value of the neighboring observations, following the local constant assumption.

3. Aberrations detection

We used hist from the graphics package to visualize histograms of the smoothed Log2 Ratios for each array separately; these suggested normal distributions. We also used the package stats, which contains the function qqnorm for producing quantile-quantile plots and the function qqline, which fits a robust normal to the Log2 Ratios (a line through the quartiles of the qq-plot) and from which we obtain its mean and standard deviation. The gain and loss cutoffs, placed at 2 standard deviations from the mean, determine the status of each observation spot to be either 1 (gain), 0 (normal) or -1 (loss).

4. Region selection

The region selection algorithm amounts to aligning all the status arrays and computing a p-value for each observation. We implemented the dynamic programming scheme suggested in Section 2.3.3, which requires quadratic running time in the number of arrays. A cutoff of 0.01 was used to filter out all chromosomal regions with a larger p-value. The remaining regions are called genetic events. They are stored in a data frame object containing the following information:
• Chromosome: the chromosome on which the region is located
• Start: the starting position of the region, in basepairs
• End: the ending position of the region, in basepairs
• Status: the type of mutation, either 1 (gain) or -1 (loss)

5. Oncogenetic trees estimation

To estimate oncogenetic trees from the selected genetic events, we used the mtreemix software package (http://mtreemix.bioinf.mpi-sb.mpg.de) developed by Niko Beerenwinkel et al. [2].

Input files format

Two input files with the same name and different extensions are required: a profile (.prf) and a pattern (.pat) file. The profile file contains a listing of the genetic event labels, one label per line. For an easy interpretation, these labels should refer to the established nomenclature of chromosome regions, such as cytogenetic bands for large regions or gene names in the case of narrow regions. The pattern file contains a binary data matrix preceded by its dimensions (number of rows, number of columns) in the first line. Each column corresponds to a genetic event, starting with the null event added artificially as explained in Section 3.1.1. Each row represents the genetic pattern of an array, starting with a 1 in the null event column and followed by a space-separated list of zeroes and ones.

Applications

• mtreemix_bootstrap: fits a mixture model and analyzes its stability using bootstrap sampling. Among the parameters required are the profile and pattern files, the number of trees in the mixture model, the parameter of the exponential sampling times and the minimum conditional probability to include an edge in the model.
• mtreemix_time: adds waiting time estimates to the mixture model. The parameters required include the ones listed above; in addition, the number of simulations used for time estimation must be specified.
• mtreemix_select: carries out model selection, optimizing the number of trees in the mixture model. The model selection method can be either cross-validation, modified Bayes Information Criterion (BIC) or standard BIC [13].

The trees, the bootstrap stability values and the estimated waiting times can be visualized using the treeify function.
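The two input files described above can be illustrated with a short sketch. It is written in Python for illustration only and builds the file contents as strings, following nothing but the textual description given here: one label per line in the profile, and in the pattern file a first line with the dimensions followed by one row per array that starts with a 1 for the null event. The function name mtreemix_inputs is hypothetical, the space between the two dimensions is an assumption, and whether the null event needs its own label line in the profile is not stated in the text, so the sketch leaves that choice to the caller.

```python
def mtreemix_inputs(event_labels, patterns):
    """Build the text of the profile (.prf) and pattern (.pat) files.

    event_labels: list of event names, e.g. ["8q+", "8p22-"]
    patterns:     one list of 0/1 values per array, aligned with event_labels
    Assumption: the layout follows the textual description only.
    """
    for row in patterns:
        if len(row) != len(event_labels):
            raise ValueError("each pattern needs one value per event")
        if any(v not in (0, 1) for v in row):
            raise ValueError("patterns must be binary")

    # Profile file: one label per line.
    profile_text = "\n".join(event_labels) + "\n"

    # Pattern file: dimensions first, then one row per array; every row
    # starts with a 1 for the artificially added null event.
    n_rows = len(patterns)
    n_cols = len(event_labels) + 1
    lines = [f"{n_rows} {n_cols}"]
    for row in patterns:
        lines.append(" ".join(str(v) for v in [1] + list(row)))
    pattern_text = "\n".join(lines) + "\n"
    return profile_text, pattern_text


# Three toy arrays and two events (invented values).
prf, pat = mtreemix_inputs(["8q+", "8p22-"], [[1, 0], [0, 1], [1, 1]])
```

For the toy input, the pattern text starts with the line "3 3" (three arrays, two events plus the null event), followed by the rows "1 1 0", "1 0 1" and "1 1 1".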
4.3 Results for analysis of prostate cancer data

This section presents the results of our arrayCGH analysis applied to the Prostate Cancer data set, compares them to previous experiments and gives an interpretation of the estimated oncogenetic trees.

4.3.1 Genetic mutations in individual arrays

The analysis of individual arrays supports the assumption that the smoothed Log2 Ratios follow normal distributions around 0, although the quantile-quantile plots show that the tails of these distributions deviate. Figures A.1 to A.17 in Appendix A show the histograms of the smoothed Log2 Ratios together with the fitted robust normals, the corresponding qq-plots, the gain/loss cutoffs and the resulting mutations on the first chromosome of each of the 17 arrays.

Figure 4.1: p-values associated to gains and losses in the Prostate data set. Regions with p-values smaller than 0.01 are called significant. The figures show that at least 5 gains or 4 losses should be observed at a certain position in order to select it.

Our algorithm successfully chooses gain and loss cutoffs that separate the tails of the distributions in a fully data-adaptive way. However, since our method searches for regions with Log2 Ratios significantly different from 0, it cannot identify low-magnitude amplifications or deletions which might appear repeatedly in a large fraction of the arrays. For instance, experiments HV3 20 68CGH (Fig. A.4), HV3 20 60CGH (Fig. A.7), HV3 20 58CGH (Fig. A.15), HV3 20 70CGH (Fig. A.16) and HV3 20 61CGH (Fig. A.17) show a low-magnitude loss on the short arm of chromosome 1, between 1p12 and 1p34, which is not output by our method. From a biological point of view, these regions may be interesting to analyze, and our method can be adjusted to be more sensitive to low amplifications or deletions by shrinking the cutoffs (e.g. to 1.5 times the standard deviation of the robust normals).
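The effect of the cutoff multiplier can be sketched as follows. The sketch is in Python rather than R and makes two assumptions that should be read as such: the robust normal is taken to be the normal distribution whose first and third quartiles match those of the data (which is what the line drawn by qqline through the quartiles corresponds to), and the quartiles are computed with linear interpolation. The function names and the toy values are hypothetical.

```python
from statistics import NormalDist, quantiles


def robust_normal(values):
    """Mean and sd of the normal whose quartiles match those of `values`."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    z75 = NormalDist().inv_cdf(0.75)          # about 0.6745
    return (q1 + q3) / 2.0, (q3 - q1) / (2.0 * z75)


def call_status(values, k=2.0):
    """Return 1 (gain), 0 (normal) or -1 (loss) for each smoothed value.

    k is the cutoff multiplier: a spot is aberrant if it lies more than
    k robust standard deviations away from the robust mean.
    """
    mean, sd = robust_normal(values)
    upper, lower = mean + k * sd, mean - k * sd
    return [1 if v > upper else -1 if v < lower else 0 for v in values]


# Invented smoothed Log2 Ratios: two strong and two weak aberrations.
smoothed = [-1.0, -0.12, -0.05, 0.0, 0.0, 0.0, 0.05, 0.12, 1.0]
status_2sd = call_status(smoothed, k=2.0)
status_15sd = call_status(smoothed, k=1.5)
```

On the toy data, the cutoffs at 2 standard deviations flag only the two values of magnitude 1.0, while the cutoffs at 1.5 standard deviations additionally flag the two values of magnitude 0.12. This is the gain in sensitivity discussed above.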
Shrinking the cutoffs, however, might lead to an undesired masking of narrow regions of high gain or loss which lie within broader low-magnitude regions. In our experiments, we focus on identifying rather narrow, highly specific gained or lost regions, as they make the search for single genes involved in tumor genesis easier. The same reason also justifies the use of high resolution CGH array technology.

4.3.2 Consensus analysis

The p-value cutoff of 0.01 selects the regions where at least 5 out of 17 arrays show a gain in copy number and those where at least 4 out of 17 show a loss. Figure 4.1 shows the p-value curves for gains and losses in the Prostate Cancer data set. Our method outputs 19 genetic events, listed in Table 4.1. We compared our results with previously published studies on Prostate Cancer genetic mutations [22], which were typically the outcome of CGH experiments.

Chromosome 1. Amplifications of the entire long arm (1q) were indicated previously. Our method detects narrower gains at 1q21-q23 and 1q24, due to the higher resolution of array CGH data. It also shows a clear loss at 1q25-q32, a mutation that was not detected by chromosomal CGH experiments.

Chromosome 4. Previous results show a loss in the long arm, from 4q31 to the end, a mutation also detected by our analysis. An additional loss was identified at 4q22.

Chromosome 5. Amplifications of the entire short arm 5p were identified by chromosomal CGH. Array CGH data show a much narrower gain at 5p12.

Chromosome 7. Gains located on chromosome 7 are frequently reported in the literature, either of the entire chromosome, or only of the short arm (7p) or of regions of the long arm. The array CGH analysis shows that 5 out of the 17 arrays we used have an amplification of the SEMA3C gene, positioned within the 7q21 band. Overexpression of this gene has already been correlated with ovarian and breast cancer [12] and malignant gliomas [31]. It is believed to play a role in cancer metastasis.

Chromosome 8. Gains of the long arm 8q and losses of the short arm 8p are two of the most referenced mutations in prostate cancer. Our method confirms the amplifications of the entire 8q arm, but narrows down the loss in the short arm to 8p22.

Chromosome 10. Amplifications located at 10q are known from CGH experiments. Our analysis of CGH arrays outputs a much narrower region, located at 10q22.

Chromosome 12. Our analysis locates a gain in copy number of the KRAS gene, located in band 12p12.1, a well-known oncogene involved in many types of cancer.

Chromosome 14. Chromosomal CGH reports amplifications along the 14q arm, which are confirmed by our analysis. Three narrow regions are identified by our analysis, one of which reduces to a few genes.

Chromosome 18. Among the frequently referenced mutations in prostate cancer is the loss of the long arm of chromosome 18. Our method identifies the same region.

Chromosome 19. 5 out of the 17 CGH arrays show a loss of a narrow region located at 19q13.4. This region contains several genes that encode zinc finger proteins (ZNF701, ZNF137). These proteins are involved in DNA transcription processes. Many genes encoding zinc finger proteins appear in the literature as candidate tumor suppressor genes [34].

Figure 4.2 shows on two separate consensus plots the amplifications and deletions detected by our method. For a more detailed view, see Figures A.18 to A.30 in Appendix A.
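The dependence of the selection threshold on the p-value cutoff can be illustrated with a small dynamic program. Section 2.3.3 is not reproduced in this chapter, so the following Python sketch is only one plausible reading of the consensus step and not the exact scheme used: it assumes that the arrays are independent and that array i shows a gain at a randomly chosen position with a probability q_i (for instance its fraction of gained spots), and it computes the probability of observing at least k gains among the n arrays. Like the scheme mentioned in Section 4.2.1, it runs in time quadratic in the number of arrays. The function names and the gain rates are hypothetical.

```python
def tail_probabilities(gain_rates):
    """P(at least k arrays show a gain), for k = 0..n, under independence.

    gain_rates[i] is the assumed probability that array i shows a gain
    at a randomly chosen position.
    """
    # dist[j] = probability that exactly j of the arrays seen so far gain.
    dist = [1.0]
    for q in gain_rates:
        new = [0.0] * (len(dist) + 1)
        for j, p in enumerate(dist):
            new[j] += p * (1.0 - q)      # this array shows no gain
            new[j + 1] += p * q          # this array shows a gain
        dist = new
    # Accumulate from the right to obtain the upper tails.
    tails, running = [], 0.0
    for p in reversed(dist):
        running += p
        tails.append(running)
    return tails[::-1]


def min_count_for_cutoff(gain_rates, alpha=0.01):
    """Smallest k whose tail probability falls below alpha, or None."""
    for k, p in enumerate(tail_probabilities(gain_rates)):
        if p < alpha:
            return k
    return None


rates = [0.1] * 17                        # hypothetical per-array gain rates
k_min = min_count_for_cutoff(rates, alpha=0.01)
```

With a hypothetical gain rate of 0.1 in each of 17 arrays, the smallest count that passes the 0.01 cutoff is 6. The thresholds of 5 gains and 4 losses reported above depend on quantities that are not listed in this chapter, so the sketch is not expected to reproduce them.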
Chromosome  Start Position  End Position  Mutation  Label
1           140608453       158830654     gain      1q21-q23+
1           168325828       171032956     gain      1q24+
1           171062336       196730612     loss      1q25-q32-
2           43742879        53975994      loss      2p16-p21-
2           162306337       162426906     loss      PSMD14-
4           89900841        99866357      loss      4q22-
4           147757519       191599721     loss      4q31-q35-
5           29115522        43704902      gain      5p12+
7           79984333        79984333      gain      SEMA3C+
8           68008046        142189219     gain      8q+
8           185394          11602966      loss      8p22-
10          73973114        81580145      gain      10q22+
12          25249834        25259666      gain      12p11.1+
12          27004735        29616341      gain      KRAS+
14          27243148        34770818      gain      14q12-q21+
14          45903485        71980491      gain      14q22-q31+
14          88854803        89570828      gain      CALM1,RPS6KA5+
18          37913083        76104136      loss      18q-
19          57707639        57792000      loss      ZNF701,ZNF137-

Table 4.1: Summary of amplifications and deletions identified in the Prostate Cancer data set. For each mutation, the chromosome location, the starting and the ending position in basepairs, the type of mutation and the label (referring to cytogenetic bands) are given.

Figure 4.2: Mutations in Prostate Cancer arrayCGH data set. The gains (top figure) and the losses (bottom figure) are aligned along the genome. The 24 chromosomes are annotated on the top axis of each plot. The black stripes indicate the locations of either gain or loss, in each of the 17 arrays.

4.3.3 Oncogenetic trees

We associated genetic patterns to the 17 prostate cancer CGH arrays, corresponding to the 19 genetic events presented in the previous section (Table 4.1). We then estimated an oncogenetic tree mixture model with two components based on this set of genetic patterns (see Figure 4.3). The weight of the star component in the mixture is 0.76, so as many as 76% (13 out of 17) of the genetic patterns are most likely generated by the model that assumes independence of the events (noise component). We explain this poor mixture model by the heterogeneity of the genetic patterns and the relatively small number of arrays compared to the number of events.
The LNCaP and CWR cell lines are not well represented (only 2 samples of each), so their specific dependencies do not have strong enough support to be captured by the tree models and they are considered noise. We decided to leave out the LNCaP and CWR cell lines and estimate a tree mixture model based only on the better represented PC3 and DU145 cell lines. The mixture model again has two components (see Figure 4.4), but a higher proportion of the arrays (8 out of 13) is now explained by the nontrivial tree component. The stronger dependencies between the events within this restricted data set are explained by the increased similarity between the arrays.

The analysis of the nontrivial component shows that events 8q+ and 14q22-q31+ appear early in the course of the disease. The literature reports gain of 8q as one of the mutations associated with the onset of malignancy in prostate cancer, and therefore present in a large fraction of the advanced cases. The relative counts of bootstrap samples (687 and 685 out of 1000, respectively) strongly support 8q+ and 14q22-q31+ as initial events.

The topology of the tree indicates two distinct evolutionary pathways. The left subtree characterizes PC3 cell line arrays, while the right subtree better describes DU145 arrays. The events located in the subtree rooted at 1q25-q32- appear predominantly in PC3 cell lines, while the events in the subtree rooted at 1q21-q23+ are found mostly in DU145 cell lines (Table 4.2). From data originating from two different cell lines, the tree model therefore learns two different progression pathways of prostate cancer.

Bootstrap counts show strong dependencies between events 1q24+ and 8p22-, 1q21-q23+ and 4q31-q35-, and 1q24+ and KRAS+. However, in general, the confidence intervals are rather large and the relative bootstrap counts small.
This poor quality of the tree mixture model is a consequence of having too many features (genetic events) and not enough examples (arrays) to learn from. It is not a particular problem of oncogenetic trees, but of most statistical learning models.

Figure 4.3: Oncogenetic tree mixture model for Prostate Cancer. The top box contains the star-shaped noise component and the bottom box contains the nontrivial tree component. Each edge in the model is annotated with the conditional probability, the associated confidence interval and the bootstrap samples count as a measure of stability. The weight of each component in the mixture is given in the top left corner of the box.

Figure 4.4: Oncogenetic tree mixture model based on arrays from the PC3 and DU145 cell lines. The tree topology separates the two cell lines into two evolutionary pathways. The subtree rooted at 1q25-q32- characterizes the PC3 cell lines, while the subtree rooted at 1q21-q23+ contains events predominant in the DU145 cell line.

Table 4.2: Genetic patterns of arrays from PC3 and DU145 cell lines. Each row corresponds to a genetic event and each column to an array. The genetic events are grouped relative to their position in the oncogenetic tree mixture model: the first two events are initial events, the second group forms the left subtree and is highly characteristic for PC3 arrays, while the third group belongs to the right subtree and is representative for the DU145 arrays.

Events: 8q+, 14q22-31+, 1q25-32-, 1q24+, 4q22-, 10q22+, 8p22-, KRAS+, CALM1,RPS6KA5+, 1q21-23+, 4q31-35-, 5p12+, 14q12-21+, 18q-, ZNF701,ZNF137-, SEMA3C+, 12p11.1+

Patterns (one array per line, followed by its cell line):
0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1  PC3
0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1  PC3
0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0  PC3
0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0  PC3
1 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1  PC3
0 0 0 0 0 0 0 1 1 1 0 1 0 1 1 1 1  PC3
1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1  DU145
1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1  DU145
1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1  DU145
1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 1  DU145
1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1  DU145
1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1  DU145
0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0  PC3

4.4 Results for analysis of glioblastomas data

This section presents the results of our arrayCGH analysis of the Glioblastomas data set, compares them to previous experiments and gives an interpretation of the estimated oncogenetic trees.

4.4.1 Genetic mutations in individual arrays

The analysis of the individual arrays shows substantial differences between the Prostate Cancer data set and the Glioblastomas data set. The histograms of the smoothed Log2 Ratios indicate that a large proportion of them accumulate very close to zero, with a high peak around the mean and steep slopes towards the tails. As a consequence, the amplification and deletion cutoffs are rather close to the mean. However, the qq-plots show the same normal distribution trend with deviating tails (see Figures B.1 to B.26 in Appendix B).

Several reasons explain the differences between the distributions of the smoothed Log2 Ratios from the two data sets. In the case of Prostate Cancer, the cell lines used for the experiments contain only tumor cells, in (approximately) the same stage of progression. The Glioblastomas experiments use tumor tissue taken directly from diagnosed patients, which contains cancerous cells in different stages of progression, potentially mixed with healthy cells. With a certain probability, healthy cells labeled as cancerous hybridize against the reference tissue, influencing the ratio of fluorescent intensities. Therefore, the amplifications and deletions measured by arrayCGH might have a lower magnitude compared to the true levels, as measured in pure tumor cells. Another reason that explains the steep slopes of the histograms is the low resolution of mutations.
In general, the Glioblastomas show narrower and less coherent amplified or deleted regions, compared to the Prostate Cancer arrays. This may be due to the impure tissue, as explained before, or to biologically different evolution patterns of the two diseases.

4.4.2 Consensus analysis

A p-value cutoff of 0.01 determines the selection of regions where at least 7 out of the 26 arrays have a mutation of a fixed type (either gain or loss). Figure 4.5 shows the p-value curves for gains and losses in the Glioblastomas data set.

Figure 4.5: p-values associated to gains and losses in the Glioblastomas data set. Regions with p-values smaller than 0.01 are called significant. The figures show that at least 7 gains (or losses) should be observed at a certain position in order to select it.

Our region selection method outputs 34 genetic events, of which 13 are mapped to chromosome 22. Figure 4.6 highlights these regions on a consensus plot. The oversegmentation in this case has both advantages and drawbacks. The positive aspect is that narrow regions serve well the goal of identifying single causative genes. It is not clear, though, whether learning a prediction model for tumor progression from a set of many small regions would result in a more powerful model than one learned from a few larger, compact segments. In the case of oncogenetic trees, too many events may lead to poor tree mixtures, as shown in the previous section. Biological considerations should also be involved in the selection decision. This is a typical feature selection problem and, in a general setting, one seeks to optimize an objective function. In our case, however, the goal is to build tumor progression models that can predict tumor stage and survival time. We do not propose a solution to the region selection problem here; it is subject to future work. In order to continue our analysis with the estimation of oncogenetic tree mixtures, we decided to select two broad regions from the losses of chromosome 22.
They are listed in Table 4.3, together with all other mutations found by our analysis. In what follows, we comment on these results and compare them with previously published studies on Glioblastomas, which are based on CGH experiments (see [17], [15] for summaries of aberrations in Glioblastomas).

Figure 4.6: Losses on chromosome 22. Some of the regions are separated by very small intervals.

Chromosome 1. Frequent gains of 1q are reported in the literature. Our method outputs a recurrent gain of a much more restricted region, at 1q43-q44. Losses at the terminal position of the short arm 1p were identified by chromosomal CGH and confirmed by our method, but two particular genes, MAD2L2 and GGPS1, were found to be lost in a high number of patients. MAD2L2 is involved in cell division and has been associated with cancer genesis.

Chromosome 5. Our analysis detected a recurrent gain (7 out of 26) of the PRLR (prolactin receptor) gene, located at 5p12. This gene is involved in anti-apoptosis processes, so it may play a role in tumor development.

Chromosome 6. Losses of the long arm were detected by chromosomal CGH. Our method identifies a single gene (KIAA0408) within the 6q23 cytogenetic band, lost in 7 out of 26 patients. The function of this gene is still unknown.

Chromosome 7. Our analysis confirms the amplifications of the whole of chromosome 7 reported by CGH experiments.

Chromosome 8. 10 out of 26 patients have amplifications located at the terminal positions (8q24.3) of the long arm. Much broader regions were reported before.

Chromosome 9. CGH experiments find gains of the long arm 9q and losses of the short arm 9p. We detect recurrent deletions of a segment of the short arm, 9p21-p24.

Chromosome 10. Partial or full deletions of chromosome 10 in almost all patients confirm the CGH results and identify this mutation as a good genetic marker of glioma oncogenesis.

Chromosome 11. We detect an amplification of gene UCP3 in 8 patients. The overexpression of this gene is related to muscle wasting during cancer [5].

Chromosome 13. Deletions of the long arm in a large fraction of patients indicate the presence of tumor suppressor genes and identify this mutation as one of the most common in glioma progression.

Chromosome 16. We locate a recurrent gain of the gene FLJ37464 at 16q22 in 9 patients. No previous references relate this gene to cancer.

Chromosome 18. CGH experiments show sparse gains and losses along chromosome 18. Our method identifies a deletion of the TYMS (thymidylate synthetase) gene in 10 patients. It is known to be involved in DNA repair processes and therefore to play a role in tumor genesis.

Chromosome 19. We find amplifications along the entire chromosome in almost all patients. A particular recurrent loss of the CACNA1A (calcium channel) gene at 19p13.1 is also identified. This gene is not directly connected to cancer in the literature.

Chromosome 20. Gains of the entire chromosome are typical mutations in CGH experiments. Our method identifies amplified regions that span almost the whole chromosome.

Chromosome 22. Recurrent losses of the long arm were frequently detected by CGH experiments. We identify multiple lost regions, one of which consists of only the neurofibromin 2 gene (NF2), a known tumor suppressor gene involved in meningiomas.
Chromosome  Start Position  End Position  Mutation  Label
1           236978915       245410192     gain      1q43-q44+
1           11668803        11668803      loss      MAD2L2-
1           231817793       231817793     loss      GGPS1-
5           35099984        35099984      gain      PRLR+
6           127803428       127803428     loss      KIAA0408-
7           288195          158320342     gain      7+
8           143950776       146248629     gain      8q24.3+
9           111040          21957751      loss      9p21-p24-
9           27099285        27099285      loss      TEK-
10          170642          135256286     loss      10-
11          73388984        73388984      gain      UCP3+
13          19105879        114098116     loss      13q-
16          65587028        65587028      gain      FLJ37464+
18          586997          647650        loss      TYMS-
19          232045          63778638      gain      19+
19          13179114        13179114      loss      CACNA1A-
20          71251           1491567       gain      20p13+
20          5934878         7811630       gain      20p12.3+
20          9024931         62357564      gain      20p12.2-qter+
21          44256633        44256633      loss      TMEM1-
22          15446202        18085619      loss      22q11.21-
22          28324118        41872032      loss      22q12.2-q13.2-
23          108585154       108585154     loss      NXT2-

Table 4.3: Summary of amplifications and deletions identified in the Glioblastomas data set. For each mutation, the chromosome location, the starting and the ending position in basepairs, the type of mutation and the label (referring to cytogenetic bands) are given.

Figure 4.7 shows on two separate consensus plots the amplifications and deletions detected by our method. For a more detailed view, see Figures B.27 to B.44 in Appendix B.

Figure 4.7: Mutations in Glioblastomas arrayCGH data set. The gains (top figure) and the losses (bottom figure) are aligned along the genome. The 24 chromosomes are annotated on the top axis of each plot. The black stripes indicate the locations of either gain or loss, in each of the 26 arrays.

4.4.3 Oncogenetic trees

We estimated an oncogenetic tree mixture with two components based on the 26 genetic patterns and 24 events presented in the previous section. The trees are shown in Figure 4.8. The nontrivial component of the mixture has weight 0.12, so only 3 patterns are likely to be generated by the corresponding tree.
A mixture of three tree components does not capture the dependencies between the events any better, as only 7 patterns are explained by the nontrivial topologies. In fact, cross-validation selects the tree mixture with only one component (the star-shaped tree) as the optimum model, so independence of the events is the most likely assumption given the observed patterns.

We believe that we do not obtain meaningful oncogenetic tree models because of the large number of events compared to the number of observed patterns. Since aggressive tumor types such as Glioblastomas show a large number of mutations, and high resolution arrayCGH data can capture deletions and amplifications down to single genes, it is necessary to develop a strategy for selecting the key events which mark the stages of tumor progression for different types of cancer. It may also be, however, that not all types of cancer follow a specific order of mutation accumulation, which would contradict current beliefs.

Regardless of the quality of the tree mixture obtained, we expect mutations 10-, 7+, 19+ and 13q- to appear in the early stages of evolution of Glioblastomas. Much research focuses on identifying candidate genes for tumor onset within these regions.

Figure 4.8: Oncogenetic tree mixture model for Glioblastomas. The top box contains the star-shaped noise component and the bottom box contains the nontrivial tree component. Each edge in the model is annotated with the transition probability, the associated confidence interval and the bootstrap samples count as a measure of stability. The weight of each component in the mixture is given in the top left corner of the box.

4.5 Validation

The goal of our arrayCGH data analysis method is the identification of genetic markers (chromosomal amplifications and deletions) that characterize specific types of tumors. Our algorithm consists of several steps of data filtering based on statistical decisions.
We believe that the information we extract characterizes the stage of tumors well, while the dimension reduction is substantial, from approximately 30000 Log2 Ratios (the resolution of the CGH arrays) to 20 genetic events. To support this claim, we validated our method based on the following scenario: we clustered the 17 Prostate Cancer arrays (see Section 4.1) represented in four different feature spaces, expecting arrays from the same cell line to group together. The biological reasons that support our expectation have been presented in Section 4.1. The feature spaces are described below.

1. Initial Log2 Ratios. The arrays are represented in the 30000-dimensional space of the initial Log2 Ratios of intensities.
2. 50 highest variance genes. We select the 50 genes that have the largest variance across all arrays and use them as features.
3. Status arrays. The -1 (loss), 0 (normal level), 1 (gain) information is used to represent the arrays in a Euclidean space with 30000 dimensions.
4. Genetic patterns. The genetic patterns are represented in the space of the 19 genetic events.

In each of the settings above, we used an agglomerative hierarchical clustering method with Euclidean distance between the data points and average-linkage distance between clusters [13]. The method starts with each data point assigned to a separate cluster and then iteratively joins the pair of closest clusters, until a single cluster is obtained. The result of hierarchical clustering can be visualized by means of dendrograms; see Figure 4.9 for the clustering results in each of the four settings.

It becomes clear from the clustering dendrograms that the genetic events selected by our method best separate the four cell lines. The initial Log2 Ratios do not discriminate between the PC3 and DU145 cell lines, and only a slight improvement is achieved when selecting the 50 highest variance genes. However, when clustering the status arrays, the CWR and the DU145 cell lines form separate groups.
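The clustering procedure described above (Euclidean distance between points, average linkage between clusters, repeated merging of the two closest clusters) can be sketched in a few lines. The following is a naive Python illustration on invented binary patterns; it is not the code used to produce Figure 4.9, and the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points):
    """Agglomerative clustering; returns the list of merges.

    Each merge is (members_a, members_b, distance), where the members
    are tuples of indices into `points` and the distance is the mean of
    all pairwise distances between the two clusters.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = sum(euclidean(points[i], points[j]) for i in a for j in b)
            d /= len(a) * len(b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Four invented genetic patterns: arrays 0 and 1 are similar, as are 2 and 3.
patterns = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
merges = average_linkage(patterns)
```

On the toy patterns, the first two merges join arrays 0 and 1 and arrays 2 and 3 at distance 1, and the final merge joins the two pairs at the average of the four cross-pair distances. A dendrogram simply draws these merges at the heights given by their distances.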
When using the genetic events as features, the cell lines separate very well, in the sense that a partition into 4 clusters groups the CWR and LNCaP cell lines together in one cluster and all DU145 arrays in a second cluster, while the PC3 arrays are split into 2 groups. A partition into 5 clusters further separates the CWR and LNCaP cell lines.

We also computed the correlations between the mutual distances of cell lines (represented in each of the four feature spaces described) and ideal distances, defined as zero between arrays of the same cell line and one between arrays of different cell lines. The correlation coefficient increases from 0.27 for the Log2 Ratios representation to 0.75 for the genetic patterns representation. The correlation coefficients are shown in Figure 4.10. The results show that the information we extract from the arrayCGH data captures the genetic differences between the cell lines.

Figure 4.9: Dendrograms of the hierarchical clustering of the Prostate Cancer arrays in different feature spaces: a) the initial Log2 Ratios; b) the 50 highest variance genes; c) the status arrays; d) the genetic patterns. The labels indicate cell lines.

Figure 4.10: Correlation coefficients between the mutual distances of cell lines (represented in 4 feature spaces) and true distances.

Chapter 5
Conclusions and future work

We presented an automated method for the statistical analysis of arrayCGH data which detects genetic amplifications and deletions in different types of cancer. Our contribution to the method consists of an algorithm for determining appropriate gain and loss cutoffs in individual CGH arrays and a method for selecting mutated regions when a large collection of experiments is available. The genetic events computed by our method can be further used for learning oncogenetic evolutionary models, which can help to predict tumor stages and survival time. We applied our algorithm to Prostate Cancer and Glioblastoma data sets.
As the results show, the high resolution of arrayCGH data allows a more precise characterization of cancer mutations. The regions obtained often reduce to single genes, which can be further investigated as candidates for cancer onset and progression. The evolutionary tree models estimated for the Prostate Cancer data set show two different progression patterns, characterizing two different cell lines. Therefore, the events we select capture the characteristics of different disease subtypes. This conclusion is also confirmed by the clustering results for the four cell lines.

We can further improve our algorithm by adding a step of automated parameter optimization, to adapt to the particularities of each data set. This is not a difficult task, and established methods such as cross-validation are expected to work well. The region selection, however, raises the more complicated problem of reducing the genetic mutations to a small set of highly representative key events. The optimization of mutation sets needs to consider statistical as well as biological criteria.

For further assessment, we need to test our method on larger arrayCGH data sets, which are already publicly available. We can also test the prediction accuracy of the oncogenetic tree mixture models if the arrayCGH experiments contain additional clinical information, such as the tumor stage as predicted by traditional methods or the survival time of the patients.

Appendix A Prostate Cancer

Individual analysis

Figures A.1 to A.17 show histograms, qq-plots of the smoothed Log2 Ratios and the first chromosome of each array from the Prostate Cancer arrayCGH data set, as output by our method. Each figure is labeled with the ID of the arrayCGH experiment.
Figure A.1: HV3 7 18CGH
Figure A.2: HV3 7 13CGH
Figure A.3: HV3 7 06CGH
Figure A.4: HV3 20 68CGH
Figure A.5: HV3 20 66CGH
Figure A.6: HV3 20 63CGH
Figure A.7: HV3 20 60CGH
Figure A.8: HV3 20 67CGH
Figure A.9: HV3 20 53CGH
Figure A.10: HV3 7 25CGH
Figure A.11: HV3 7 19CGH
Figure A.12: HV3 7 09CGH
Figure A.13: HV3 20 64CGH
Figure A.14: HV3 20 59CGH
Figure A.15: HV3 20 58CGH
Figure A.16: HV3 20 70CGH
Figure A.17: HV3 20 61CGH

Consensus plots

Figures A.18 to A.30 are consensus plots of the chromosomes that contain the detected aberrations; they show the individual mutations of all arrays and highlight the selected regions. The x-axis of each plot shows the position on the corresponding chromosome and is annotated with the cytogenetic bands, represented by light blue vertical lines. The y-axis aligns the 17 arrays. The black stripes indicate the regions of either gain or loss, for each array individually. The regions with a low p-value are highlighted by green vertical stripes.

Figure A.18: Gains on chromosome 1. 8 arrays indicate gains of 1q21-q23 and 7 show gains of the 1q24 region.
Figure A.19: Losses on chromosome 1. There is strong evidence of a loss of 1q25-q32.
Figure A.20: Losses on chromosome 2. Narrow regions at 2p16-p21 and 2q24 are lost.
Figure A.21: Losses on chromosome 4. There are recurrent losses on the long arm 4q, but regions 4q22 and 4q31-q35 appear to have a higher frequency.
Figure A.22: Gains on chromosome 5. There are repeated gains at 5p12.
Figure A.23: Gains on chromosome 7. Two arrays show an amplification of the whole chromosome and one only of the long arm. A very narrow region within the 7q21 band which contains gene SEMA3C is lost in 5 arrays.
Figure A.24: Gains on chromosome 8. Gains along the long arm are identified in 6 arrays.
Figure A.25: Losses on chromosome 8. There are deletions of the 8p22 band in 4 arrays.
Figure A.26: Gains on chromosome 10.
Amplifications at 10q22 are frequent.
Figure A.27: Gains on chromosome 12. 2 arrays have amplifications of the entire chromosome, while a narrow region 12p11.1 appears to be gained in 6 of them.
Figure A.28: Gains on chromosome 14. Multiple gains are located along the 14q arm of the chromosome. 3 regions are indicated as significant: 14q12-q21, 14q22-q31 and a short region at 14q32.
Figure A.29: Losses on chromosome 18. A large part of the long arm is deleted.
Figure A.30: Losses on chromosome 19. A region which contains several zinc-finger protein encoding genes is lost in 5 arrays.

Appendix B Glioblastomas

Individual analysis

Figures B.1 to B.26 show histograms and qq-plots of the smoothed Log2 Ratios of each array from the Glioblastomas arrayCGH data set, as output by our method. Each figure is labeled with the ID of the arrayCGH experiment.

Figure B.1: bredel51551
Figure B.2: bredel51550
Figure B.3: bredel51552
Figure B.4: bredel51557
Figure B.5: bredel51559
Figure B.6: bredel51516
Figure B.7: bredel51544
Figure B.8: bredel51554
Figure B.9: bredel51565
Figure B.10: bredel51563
Figure B.11: bredel51558
Figure B.12: bredel51564
Figure B.13: bredel51518
Figure B.14: bredel51531
Figure B.15: bredel51566
Figure B.16: bredel51546
Figure B.17: bredel51530
Figure B.18: bredel51556
Figure B.19: bredel51511
Figure B.20: bredel51529
Figure B.21: bredel51528
Figure B.22: bredel51515
Figure B.23: bredel51549
Figure B.24: bredel51548
Figure B.25: bredel51536
Figure B.26: bredel51540

Consensus plots

Figures B.27 to B.44 are consensus plots of the chromosomes that contain the detected aberrations; they show the individual mutations of all arrays and highlight the selected regions. The x-axis of each plot shows the position on the corresponding chromosome and is annotated with the cytogenetic bands, represented by light blue vertical lines. The y-axis aligns the 26 arrays.
The light grey points indicate gene locations and also correspond to spots on the arrayCGH chips. The black points mark the locations of either gain or loss, for each array individually. The regions with a low p-value are highlighted by green vertical stripes.

Figure B.27: Chromosome 1 gains
Figure B.28: Chromosome 1 losses
Figure B.29: Chromosome 5 gains
Figure B.30: Chromosome 6 losses
Figure B.31: Chromosome 7 gains
Figure B.32: Chromosome 8 gains
Figure B.33: Chromosome 9 losses
Figure B.34: Chromosome 10 losses
Figure B.35: Chromosome 11 gains
Figure B.36: Chromosome 13 losses
Figure B.37: Chromosome 16 gains
Figure B.38: Chromosome 18 losses
Figure B.39: Chromosome 19 gains
Figure B.40: Chromosome 19 losses
Figure B.41: Chromosome 20 gains
Figure B.42: Chromosome 21 losses
Figure B.43: Chromosome 22 losses
Figure B.44: Chromosome 23 losses

Bibliography

[1] Beerenwinkel,N., Rahnenführer,J., Däumer,M., Hoffmann,D., Kaiser,R., Selbig,J., Lengauer,T. (2005) Learning multiple evolutionary pathways from cross-sectional data. Journal of Computational Biology, 12, 584-598
[2] Beerenwinkel,N., Rahnenführer,J., Kaiser,R., Hoffmann,D., Selbig,J., Lengauer,T. (2005) Mtreemix: a software package for learning and using mixture models of mutagenetic trees. Bioinformatics, 21(9), 2106-2107
[3] Bilke,S., Chen,Q.-R., Whiteford,C.C., Khan,J. (2005) Detection of low level genomic alterations by comparative genomic hybridization based on cDNA micro-arrays. Bioinformatics, 21(7), 1138-1145
[4] Bredel,M., Bredel,C., Juric,D., Harsh,G.R., Vogel,H., Recht,L.D., Sikic,B.I. (2005) High-Resolution Genome-Wide Mapping of Genetic Alterations in Human Glial Brain Tumors. Cancer Research, 65(10), 4088-4096
[5] Busquets,S., Garcia-Martinez,C., Olivan,M., Barreiro,E., Lopez-Soriano,F.J., Argiles,J.M. (2005) Overexpression of UCP3 in both murine and human myotubes is linked with the activation of proteolytic systems: a role in muscle wasting?
Biochimica et Biophysica Acta, 1760, 253-258
[6] Dempster,A., Laird,N.M., Rubin,D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38
[7] Desper,A., Jiang,F., Kallioniemi,O.P. (1999) Inferring tree models for oncogenesis from comparative genome hybridization data. Journal of Computational Biology, 6(1), 37-51
[8] Dudoit,S., Yang,Y.H., Callow,M.J., Speed,T.P. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-139
[9] Eilers,P.H.C., de Menezes,R.X. (2005) Quantile smoothing of array CGH data. Bioinformatics, 21, 1146-1153
[10] Eilers,P.H.C., de Menezes,R.X. (2005) Quantile smoothing of array CGH data. Bioinformatics, 21, 1146-1153
[11] Fridlyand,J., Snijders,A.M., Pinkel,D., Albertson,D.G., Jain,A.N. (2004) Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 90(1), 132-153
[12] Galani,E., Sgouros,J., Petropoulou,C., Janinis,J., Aravantinos,G., Dionysiou-Asteriou,D., Skarlos,D., Gonos,E. (2002) Correlation of MDR-1, nm23-H1 and H Sema E gene expression with histopathological findings and clinical outcome in ovarian and breast cancer patients. Anticancer Research, 22, 2275-2280
[13] Hastie,T., Tibshirani,R., Friedman,J. (2001) The Elements of Statistical Learning. Springer-Verlag, 193-222
[14] Hsu,L., Self,S.G., Grove,D., Randolph,T., Wang,K., Delrow,J.J., Loo,L., Porter,P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6, 211-226
[15] Huhn,S.L., Mohapatra,G., Bollen,A., Lamborn,K., Prados,M.D., Feuerstein,B.G. (1999) Chromosomal Abnormalities in Glioblastoma Multiforme by Comparative Genomic Hybridization: Correlation with Radiation Treatment Outcome. Clinical Cancer Research, 5, 1435-1443
[16] Hupé,P., Stransky,N., Thiery,J.P., Radvanyi,F., Barillot,E.
(2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20, 3413-3422
[17] Inda,M.M., Fan,X., Munoz,J., Perot,C., Fauvet,D., Danglot,G., Palacio,A., Madero,P., Zazpe,I., Portillo,E., Tunon,T., Martinez-Penuela,H.M., Alfaro,J., Eiras,J., Bernheim,A., Castresana,J.S. (2003) Chromosomal Abnormalities in Human Glioblastomas: Gain in Chromosome 7p Correlating With Loss in Chromosome 10q. Molecular Carcinogenesis, 36, 6-14
[18] Jong,K., Marchiori,E., van der Vaart,A., Ylstra,B., Meijer,G., Weiss,M. (2003) Chromosomal breakpoint detection in human cancer. Lecture Notes in Computer Science, Springer-Verlag, Berlin, vol. 2611, 54-65
[19] Kallioniemi,A., Kallioniemi,O.P., Sudar,D., Rutovitz,D., Gray,J.W., Waldman,F., Pinkel,D. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258, 818-821
[20] Lodish,H., Berk,A., Zipursky,L.S., Matsudaira,P., Baltimore,D., Darnell,J. (October 1999) Molecular Cell Biology. W. H. Freeman
[21] Lai,W.R., Johnson,M.D., Kucherlapati,R., Park,P.J. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763-3770
[22] Lensch,R., Götz,C., Andres,C., Bex,A., Lehmann,J., Zwergel,T., Unteregger,G., Kamradt,J., Stoeckle,M., Wullich,B. (2002) Comprehensive genotypic analysis of human prostate cancer cell lines and sublines derived from metastases after orthotopic implantation in nude mice. International Journal of Oncology, 21, 695-706
[23] Ludwig,J.A., Weinstein,J.N. (2005) Biomarkers in cancer staging, prognosis and treatment selection. Nature Reviews Cancer, 5, 845-856
[24] Molinaro,A.M., van der Laan,M.J., Moore,D.H. (2002) Comparative genomic hybridization array analysis. U.C. Berkeley Division of Biostatistics Working Paper Series
[25] Nakao,K., Mehta,K.R., Fridlyand,J., Moore,D.H., Jain,A.N., Lafuente,A., Wiencke,J.W., Terdiman,J.P., Waldman,F.M.
(2004) High-resolution analysis of DNA copy number alterations in colorectal cancer by array-based comparative genomic hybridization. Carcinogenesis, 25, 1345-1357
[26] Olshen,A.B., Venkatraman,E.S., Lucito,R., Wigler,M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5, 557-572
[27] Pinkel,D., Segraves,R., Sudar,D., Clark,S., Poole,I., Kowbel,D., Collins,C., Kuo,W.-L., Chen,C., Zhai,Y., Dairkee,S.H., Ljung,B.-M., Gray,J.W., Albertson,D.G. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization microarrays. Nat. Genet., 20, 207-211
[28] Polzehl,J., Spokoiny,V. (2002) Local likelihood smoothing by adaptive weights smoothing. WIAS-Preprint 787
[29] Pollack,J.R., Perou,C.M., Alizadeh,A.A., Eisen,M.B., Pergamenschikov,A., Williams,C.F., Jeffrey,S.S., Botstein,D., Brown,P.O. (1999) Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet., 23(1), 41-46
[30] Rahnenführer,J., Beerenwinkel,N., Schultz,W.A., Hartmann,C., von Deimling,A., Wullich,B., Lengauer,T. (2005) Estimating cancer survival and clinical outcome based on genetic tumor progression scores. Bioinformatics, 21, 2438-2446
[31] Rieger,J., Wick,W., Weller,M. (2003) Human malignant glioma cells express semaphorins and their receptors, neuropilins and plexins. Glia, 42(4), 379-389
[32] Solinas-Toldo,S., Lampel,S., Stilgenbauer,S., Nickolenko,J., Benner,A., Döhner,H., Cremer,T., Lichter,P. (1997) Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer, 20, 399-407
[33] Tinker,N.A., Robert,L.S., Butler,G., Harris,L.J. (2003) Data Pre-Processing Issues in Microarray Analysis. A Practical Approach to Microarray Data Analysis, 47-65
[34] Tommerup,N., Vissing,H. (1995) Isolation and Fine Mapping of 16 Novel Human Zinc Finger-Encoding cDNAs Identify Putative Candidate Genes for Developmental and Malignant Disorders.
Genomics, 27(2), 259-264