Analysis of Array CGH Data for the Estimation of
Genetic Tumor Progression
by
Laura Toloşi
Supervisors:
Dr. Jörg Rahnenführer
Prof. Dr. Thomas Lengauer
A thesis submitted in conformity with the requirements
for the degree of Master of Science
Computer Science Department
Saarland University
June 2006
Abstract
Analysis of ArrayCGH Data for the Estimation of
Genetic Tumor Progression
Laura Toloşi
Master of Science
Department of Computer Science
Saarland University
2006
In cancer research, prediction of time to death or relapse is important for a meaningful
tumor classification and for selecting appropriate therapies. The accumulation of genetic alterations during tumor progression can be used to assess the genetic status of
the tumor. ArrayCGH technology is used to measure genomic amplifications and deletions,
with a resolution high enough to detect copy number changes down to single genes.
We propose an automated method for analyzing the accumulation of cancer mutations, based on
statistical analysis of arrayCGH data. The method consists of four steps: arrayCGH
smoothing, aberrations detection, consensus analysis and oncogenetic tree models estimation. For the second and third steps, we propose new algorithmic solutions. First, we use
the adaptive weights smoothing-based algorithm GLAD for identifying regions of constant
copy number. Then, in order to select regions of gain and loss, we fit robust normals to the
smoothed Log2 Ratios of each CGH array and choose appropriate significance cutoffs. The
consensus analysis step consists of an automated selection of recurrent aberrant regions when
multiple CGH experiments on the same tumor type are available. We propose to associate
p-values to each measured genomic position and to select the regions where the p-value is
sufficiently small.
The aberrant regions computed by our method can be further used to estimate evolutionary
trees, which model the dependencies between genetic mutations and can help to predict tumor progression stages and survival times.
We applied our method to two arrayCGH data sets obtained from prostate cancer and
glioblastoma patients, respectively. The results confirm previous knowledge on the genetic
mutations specific to these types of cancer, but also bring out new regions, often reducing
to single genes, due to the high resolution of arrayCGH measurements. An oncogenetic tree
mixture model fitted to the Prostate Cancer data set shows two distinct evolutionary patterns
discriminating between two different cell lines. Moreover, when used as clustering features,
the genetic mutations output by our algorithm clearly separate arrays representing four different cell
lines, which indicates that we extract meaningful information.
I hereby declare that this thesis is entirely my own work except where otherwise indicated. I have used only the resources given in the list of references.
Laura Toloşi
June 22, 2006
Acknowledgments
While carrying out this work, I learned that the most enjoyable and fruitful moments
were the discussions with the people around me. Their professional enthusiasm taught
me to pursue my work with passion and optimism. I am grateful to my supervisor Dr.
Jörg Rahnenführer for his ideas and continuous guidance, Prof. Dr. Thomas Lengauer for
his support and energy and the whole group for creating a great working environment.
I thank Konstantin Halachev for lifting my spirit up and always encouraging me.
Contents
1 Introduction
  1.1 Problem Statement
  1.2 Motivation
  1.3 Related Work
  1.4 Contribution
2 ArrayCGH Data
  2.1 Genetic Mutations in Cancer Genesis
  2.2 ArrayCGH Technology
  2.3 Statistical analysis of arrayCGH data
    2.3.1 ArrayCGH smoothing
    2.3.2 Aberrations detection
    2.3.3 Region selection
    2.3.4 ArrayCGH analysis algorithm
3 Genetic Tumor Progression
  3.1 Oncogenetic Trees
    3.1.1 Formal model
  3.2 Genetic Progression Score
4 Application to Cancer Data
  4.1 Array CGH data sets
  4.2 Implementation
    4.2.1 Implementation steps
  4.3 Results for analysis of prostate cancer data
    4.3.1 Genetic mutations in individual arrays
    4.3.2 Consensus analysis
    4.3.3 Oncogenetic trees
  4.4 Results for analysis of glioblastomas data
    4.4.1 Genetic mutations in individual arrays
    4.4.2 Consensus analysis
    4.4.3 Oncogenetic trees
  4.5 Validation
5 Conclusions and future work
A Prostate Cancer
B Glioblastomas
Chapter 1
Introduction
1.1 Problem Statement
In recent years, much research in the fields of Molecular Biology and Bioinformatics has focused on finding good indicators of cancer staging. These so-called
biomarkers [23], used together with traditional medical measures, could significantly
improve patient care, provided that their meaning is correctly stated and assessed.
ArrayCGH data is one such biomarker, which provides information about the genetic
amplifications and deletions occurring during cancer progression. These aberrations
are specific to each type of cancer.
Previous results show that the order in which the amplifications and deletions occur
follows preferred patterns that characterize cancer evolution. Mixtures of oncogenetic
trees have been proposed as a statistical model of mutation accumulation patterns and
can be used as predictors of tumor progression status [1]. Based on the tree mixture
model, each tumor sample is assigned a genetic pattern and a genetic progression score,
which gives a prediction of survival time and can help in selecting suitable therapies.
Our first goal was to develop an automated method to determine the chromosomal
gains and losses specific to different types of tumor, based on statistical inference on
arrayCGH data. The second objective was to apply the method to Prostate Cancer
and Glioblastomas data sets, then estimate and analyze the corresponding oncogenetic
trees.
1.2 Motivation
Important problems frequently met in cancer treatment, such as prediction of survival time,
choice of treatment and subtype prediction, are currently addressed based almost exclusively
on the anatomic characteristics of the disease. Much effort is spent on including a large
series of biomarkers in standard clinical practice, which could improve the
quality of treatment.
The technical means that support cancer treatment are also subject to continuous
development. Comparative Genomic Hybridization (CGH) is an experimental technique used to measure certain types of genetic mutations associated with cancer onset
and progression. During the past decade, CGH technology has evolved to overcome
resolution obstacles, currently allowing the detection of refined mutations, single gene
copy number changes or even smaller target areas.
However, CGH data contains too much information, and too much additional noise,
to be processed directly and fully by medical experts, and thus automated methods
are needed to extract the relevant information.
1.3 Related Work
Many recent publications address the problem of detecting chromosomal aberrations
in different types of cancer. Most of them describe the analysis of CGH experiments,
which identify low resolution mutations ([8], [15], [22]).
The number of publicly available experimental arrayCGH data sets has been rapidly
increasing during the past year, and so has the number of publications which present
methods and results of arrayCGH analysis. Typically, smoothing methods are described ([25], [26], [28]), with applications to single arrays or small sets of experiments.
ArrayCGH smoothing algorithms deal with the problem of removing the noise from
arrayCGH data and detecting regions of gains and losses (see [21] for a comparison
study of the methods).
Among the publications that propose tumor stage and evolution models, we cite Beerenwinkel et al. ([30], [1]), who describe oncogenetic tree mixtures as statistical models of
the accumulation of genetic mutations during cancer progression. The tree models are specific for each type of tumor and can be used to predict time to death, subtype or
stage of the disease, or a suitable treatment. We use oncogenetic tree models in our
work and we give a detailed description in Chapter 3.
1.4 Contribution
We propose an automated method for analyzing the accumulation of cancer mutations, based
on statistical analysis of arrayCGH data. The algorithm consists of the following steps:
arrayCGH smoothing, aberrations detection, consensus analysis and oncogenetic tree
models estimation. ArrayCGH smoothing solves the problem of determining regions of
constant copy number. The aberrations detection step identifies chromosomal
gains and losses in individual experiments. The consensus analysis step addresses
the selection of recurrent mutated regions in a set of arrayCGH experiments. The oncogenetic tree models estimation is a statistical learning method that builds a model of
tumor evolution patterns in terms of mutations accumulation. Our contributions to the
method are the second and the third steps. Aberration detection methods have been
proposed before in the literature ([25], [3]); we describe here an alternative approach
that is robust and data adaptive. To our knowledge, an automated consensus analysis
algorithm has not been published yet.
We applied our method to Prostate Cancer and Glioblastomas arrayCGH data sets. We
discovered amplifications and deletions, often reducing to single genes, that narrow
down the regions identified by CGH or that have not yet been reported in the literature.
Chapter 2
ArrayCGH Data
An important goal in biomedical research is to identify biological indicators that characterize the stage of tumors well. The process of accumulation of genetic mutations
gives information about the evolution of most cancer types. Array CGH technology is used
to measure these mutations, typically amplifications and deletions. This chapter is an
overview of the array CGH technology and data analysis.
The first section gives an introduction to the biological mechanisms involved in cancer
onset and progression, with focus on the genetic mutations measured by arrayCGH
experiments. Technical details on how arrayCGH data is obtained from tumor tissue
of diagnosed patients are given in Section 2.2. An automated method of statistical
analysis of array CGH data is proposed in the last section of this chapter. The analysis
consists of several consecutive steps: smoothing of the CGH data, aberrations detection
and region selection. Each of these steps will be presented in detail, with emphasis on
the last two, which are our contribution to the overall method.
2.1 Genetic Mutations in Cancer Genesis
This section explains several basic biological notions about cancer that are necessary for
understanding the data and the proposed analysis methodology. All aspects of cancer
discussed in this section refer to human tumors, and are mainly based on [20].
Basic notions
The complete genetic information of an organism is contained in its genome. The
genetic material is organized as DNA molecules, contained in the nucleus of each cell.
The DNA is a double-stranded helix formed of two paired sequences of nucleotides
of the following four types: adenine (A), thymine (T), cytosine (C) and guanine (G) (see
Figure 2.1). The main force promoting the formation of this helix is complementary
base pairing: adenines form hydrogen bonds with thymines and cytosines form hydrogen bonds with guanines. The human DNA is 3 billion base pairs long.
Figure 2.1: DNA fragment.
Figure 2.2: Condensed structure of a chromosome. The DNA strand has several structural layers.
The DNA is packed in macromolecules called chromosomes. In humans, each cell
contains 23 pairs of chromosomes (see Figure 2.3). The DNA strand of a chromosome
has a highly compressed structure, as shown in Figure 2.2.
The DNA sequence contains genetic specifications of all biological processes of a cell. Of
particular interest are the genes, specific subsequences of DNA 1,000-10,000 base pairs long.
Genes are transcribed into a single-stranded nucleic acid called RNA, which can
be further translated into proteins, the macromolecules that perform most of the biological functions in the cell. The transcription of genes into RNA and the translation of RNA into
proteins are regulated by the needs of the organism.
The genetic information is identical in all cells of the human organism. At each cell
division, the genome is duplicated into identical halves, via a process called mitosis.
Often, the DNA replication fails to produce identical copies, which results in mutated
daughter cells. The regulatory mechanisms of the cell, however, detect these mutations
and, under normal conditions, the cell undergoes apoptosis, or programmed cell death.
Healthy functioning requires a permanent balance between cell proliferation and cell
death, to comply with the needs of the organism.
Figure 2.3: The 23 pairs of chromosomes in human. The picture shows a normal male karyotype. A normal female karyotype has a second X chromosome instead of the Y chromosome.
Cancer
Occasionally, the controls that regulate cell multiplication fail to function normally.
A cell in which this occurs begins to divide in an unregulated fashion, without regard
to the body’s need for further cells of its type. The descendants of such a cell inherit
the incapacity to respond to regulation, which may lead to a mass of cells able to
expand indefinitely. This mass is called a tumor. Its existence may not have serious
health consequences if it remains local, but it might lead to the death of the patient if
it spreads to other tissue types. The disease associated with this malfunction is called
cancer.
Cancer is caused by DNA mutations which involve genes that normally regulate cell
multiplication. Two classes of genes play a key role in cancer induction: proto-oncogenes and tumor-suppressor genes. The proto-oncogenes encode proteins that
promote cell proliferation. In cancer, mutations of these genes lead to their abnormal,
increased activity. One of the mechanisms that produce these mutations is the localized
reduplication, or amplification of chromosome segments that include proto-oncogenes.
As a consequence, all genes located within amplified segments have increased copy
number in the tumor tissue, leading to overexpression of their encoded proteins. Figure 2.4 a) shows a schema of a chromosomal gain mutation.
Figure 2.4: a) Chromosomal gain mutation. A segment of chromosome 10 is inserted twice
during mitosis; b) Chromosomal loss mutation. A segment of chromosome 5 is deleted during
mitosis.
However, mutations in proto-oncogenes cannot induce an accelerated division of the
cells by themselves, unless they are complemented by mutations in the genes that promote apoptosis, called tumor-suppressor genes. Deletions of segments (Figure 2.4 b))
of chromosomes which contain such genes lead to a decrease in their copy number and
thus to underexpression of the encoded proteins.
Since the DNA contains many genes of both types described above, a cell must undergo
multiple mutations until it becomes cancerous. Direct observations of DNA from tumor
tissue in different stages also show that mutations accumulate over time. Moreover, it
seems that in most of the cancer types, mutations arise in specific, preferred orders.
Therefore, if chromosomal mutations are determined accurately over a large enough
number of tumor samples in different stages, a statistical analysis can provide
models of evolution which characterize particular types of cancer. These models can
improve prediction of tumor stage and survival time, and can help in therapy selection.
2.2 ArrayCGH Technology
This section is an overview of the technology that measures DNA copy number gains
and losses and maps these aberrations to the genomic sequence. For reasons that will
become clear below, it is called Comparative Genomic Hybridization, in short CGH.
Figure 2.5: Chromosomal CGH technology. The main steps.
CGH
The CGH procedure was first developed and described by Kallioniemi et al. in 1992
(see [19]) and it has been continuously improved since. The basic strategy of the technique is to label genomic DNA from cancer cells with one fluorochrome and
genomic DNA from healthy reference cells with a second fluorochrome, and
then allow them to competitively hybridize to immobilized target DNA. The two fluorochromes are chosen to emit easily distinguishable wavelengths, typically
red and green. In regions where there are no amplifications or deletions in the cancer
genome, binding of both samples will be equal, which will result in a perceived yellow
fluorescence. However, where there are losses in the copy number in the cancer DNA, the
color with which the normal DNA was labelled will predominate. Similarly, in regions
with DNA copy number gain, the color with which the tumor DNA was labelled will
predominate [24].
Chromosomal-CGH
Initially, the CGH procedure used intact chromosomes as target for hybridization,
which were conveniently immobilized during the metaphase of cell mitosis. The chromosomes were scanned and ratios of the two fluorescent intensities were measured by
quantitative image analysis. Figure 2.5 shows a schema of the chromosomal-CGH technique.
Figure 2.6: Array CGH technology. The main steps.
However, chromosomal CGH cannot detect aberrations smaller than 3-10 megabase pairs, due to the highly condensed structure of chromosomes. For the same reason, it
also fails to determine precisely the endpoints of the altered regions. Higher resolution
analysis is needed to detect single gene copy number changes and to identify much smaller target areas for these changes, so that single causative genes can
ultimately be identified.
Microarray-based CGH
A significant improvement in CGH technology was announced in 1997 by Solinas-Toldo et al. with a technique they described as matrix-CGH [32]. In the subsequent
years, several refinements of matrix-CGH overcame the resolution limit ([33], [27]). Probes that map to evenly spaced loci along the entire length of the
genome were printed onto glass slides, called microarrays. The microarrays were then
used as targets for hybridization instead of chromosomes, allowing a high resolution of
the measurements. This improved technology is called array CGH (Fig. 2.6). Depending on the spacing and length of the clones, array CGH can measure copy
number changes of single genes or even smaller regions.
Several types of probes can be used to produce microarrays. Bacterial
artificial chromosomes (BACs) are widely used. Each BAC clone consists of a small 100-200 Kilobase
(Kb) segment of DNA, grown in bacteria and immobilized onto slides according to its
genomic location.
Tumor and control tissues labelled with different fluorochromes are hybridized on the
microarrays as in chromosomal CGH. The microarrays are scanned to produce two
separate images that show intensities for the two wavelengths. At each spot, the ratio
of the intensities of the two fluorochromes gives a measure of the abundance of the
corresponding gene in the tumor tissue.
However, the quality of microarray images varies considerably, as the measured intensity of a spot includes a contribution of non-specific hybridization and other chemicals
on the glass. Suboptimal experimental conditions may strongly affect the spot quantification, and clean images as in Figure 2.6 are rather exceptional. Therefore, specific
image analysis algorithms are used to extract corrected fluorescent intensities.
Several further transformations of the intensity values are needed before the copy number changes can be analyzed and interpreted. This adjustment is called normalization
(see [4]), and it consists of three steps:
- Transformation to normality refers to the adjustment of the intensity ratios such
that they approximate a Gaussian distribution. A normal variation of the data is
a prerequisite of many types of statistical analysis. There is a general agreement
that a log transformation of arrayCGH data provides a good approximation of a
Gaussian distribution. If x and y denote the corrected fluorescent intensities of
a spot, the transformed values log2 (x/y), or, equivalently, log2 (x) − log2 (y), are
approximately normally distributed.
- Centralization removes biases from the data. Several sources, including variation
within and among arrays, unequal dye incorporation or poor scanning quality
introduce uneven bias along the microarray. Among the proposed methods for
centralization, LOESS normalization has been widely used [8].
- Re-scaling is a final step that may be applied to ensure that data from different hybridizations have equal variances. This step is usually omitted since the
variances may differ not only due to error in the experiments, but also due to
treatment effects, which should not be altered.
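As a rough illustration of the transformation and centralization steps listed above, the following Python sketch computes Log2 Ratios from corrected spot intensities. It is only a sketch under simplifying assumptions: subtracting the median is used as a crude stand-in for LOESS centralization, no re-scaling is applied, and the function name `log2_ratios` and the intensity values are hypothetical.

```python
import numpy as np

def log2_ratios(tumor, reference):
    """Return median-centered log2(tumor / reference) values.

    `tumor` and `reference` are corrected fluorescent intensities per spot.
    Median centering is an assumption made here for brevity; it is a much
    simpler bias correction than LOESS.
    """
    tumor = np.asarray(tumor, dtype=float)
    reference = np.asarray(reference, dtype=float)
    ratios = np.log2(tumor) - np.log2(reference)   # log2(x/y) = log2(x) - log2(y)
    return ratios - np.median(ratios)              # crude centralization

# Hypothetical intensities: spot 2 is doubled in the tumor, spot 3 is halved.
tumor = [200.0, 400.0, 100.0, 210.0, 190.0]
reference = [200.0, 200.0, 200.0, 200.0, 200.0]
centered = log2_ratios(tumor, reference)
```

A two-fold gain maps to a Log2 Ratio of +1 and a two-fold loss to -1, which is why gains and losses appear symmetric around 0 after the log transformation.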
Figure 2.7 shows an example of a normalized CGH array. In what follows, we will refer
to the normalized logarithm of the ratios of dye intensities as Log2 Ratios, or, in a
more relaxed notation, as copy numbers.
Figure 2.7: Chromosome 1 from a CGH experiment. Each point on the plot corresponds
to a gene. The x-axis represents the position of the gene on the chromosome, in Kilo basepairs. The y-axis shows the normalized Log2 Ratio of the fluorescent intensities of the spot
corresponding to the gene. A Log2 Ratio close to 0 indicates a normal abundance of the gene
in the tumor tissue, whereas a level significantly greater (or smaller) than 0 indicates a gain
(or loss) in copy number.
2.3 Statistical analysis of arrayCGH data
2.3.1 ArrayCGH smoothing
Array CGH data can reveal chromosomal deletions and amplifications in tumor tissues. We expect changes in copy number to cover multiple consecutive genes, since
usually segments of chromosomes are affected. However, array CGH data is noisy,
which means that a smoothing method should be applied to increase the reliability of
detecting changes.
Classical regression methods are not very suitable for this purpose, as they tend to
blur the sudden changes in copy number and round the segments between jumps, instead of flattening them. The reason is that most regression models are continuous
functions, while our purpose is to detect discontinuities.
The problem reduces to fitting a piecewise constant function to the data (see Fig.
2.8), and it consists of two subproblems: finding the breakpoints that separate segments with homogeneous Log2 Ratios, and estimating the copy number for each such
segment. The first subproblem is more difficult, and many efficient approaches
have been proposed to solve it. The next section is a summary of the most frequently used
algorithms for this purpose. A comparative analysis can be found in [21].
Figure 2.8: Piecewise constant regression. Chromosome 1 from an arrayCGH experiment
exhibits deletions and amplifications. Each point on the plot corresponds to a gene. The
x-axis represents the position of the gene on the chromosome, in Kilo basepairs. The y-axis
shows the normalized Log2 Ratio of the fluorescent intensities of the spot corresponding to the
gene. The fitted red line indicates regions of constant copy number.
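To make the second, easier subproblem concrete, the following Python sketch estimates the level of each segment once the breakpoints are assumed to be known. The function name `piecewise_constant_fit`, the index-based breakpoint convention and the data are hypothetical and serve only as an illustration.

```python
import numpy as np

def piecewise_constant_fit(y, breakpoints):
    """Fit a step function: each segment receives the mean of its observations.

    `breakpoints` holds the indices at which a new segment starts
    (an assumed convention for this sketch).
    """
    y = np.asarray(y, dtype=float)
    edges = [0, *sorted(breakpoints), len(y)]
    fit = np.empty_like(y)
    for start, stop in zip(edges[:-1], edges[1:]):
        fit[start:stop] = y[start:stop].mean()
    return fit

# Hypothetical Log2 Ratios: a normal region, a gain and a loss.
y = [0.1, -0.1, 0.0, 0.9, 1.1, 1.0, -0.6, -0.4]
fit = piecewise_constant_fit(y, breakpoints=[3, 6])
```

With the breakpoints fixed, the fit is just a per-segment mean; the difficulty of the problem lies entirely in locating the breakpoints.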
ArrayCGH smoothing algorithms overview
The most common approach is to model the data as a partition in segments, with
unknown boundaries and unknown height, which will have to be estimated from the
observations. An optimization criterion must involve the quality of the fit, but should
also penalize the number of discontinuities, to avoid overfitting. Jong et al. [18] use a
stochastic genetic algorithm to maximize a likelihood with a penalty term containing
the number of breakpoints.
Fridlyand et al. [11] use a Hidden Markov model (HMM) in which the underlying
copy numbers are hidden states with associated transition probabilities. The HMM is
fitted to the observed data using the Forward-Backward and Baum-Welch methods.
Eilers et al. [10] modify a penalized sum of squares method by using the L1 norm instead
of the L2 norm, which gives sharper boundaries between segments. The
optimization problem is then solved using a quantile regression idea.
Hsu et al. [14] propose to fit wavelets to the data and shrink the coefficients. In
the flat parts most of the higher frequency coefficients will become zero, but near the
jumps they will be retained, which indicates the positions of the jumps.
One can also systematically traverse the data series and introduce or remove breakpoints iteratively, maximizing a likelihood statistic. This method is called Circular
Binary Segmentation, and it was proposed by Olshen et al. in [26].
Hupé et al.[16] use adaptive weights smoothing to detect the breakpoints. The method
is called GLAD, and we used this algorithm in our analysis. One of the reasons that
favored this particular choice is that, unlike most of the other approaches, it does not
assume equal distances between consecutive probes, but uses their real positions on the
chromosome. This additional information is very important for an accurate alignment
of the smoothing functions of multiple CGH arrays, given that Log2 Ratio intensities are frequently not available for arbitrary positions (due to the quality of the image spots, for
example). The next subsection is a detailed presentation of the GLAD algorithm.
The GLAD algorithm
The GLAD (Gain and Loss Analysis of DNA) algorithm was proposed by Hupé et al.
(2004) [16]. It solves a piecewise constant regression problem by using a local maximum likelihood approach. The method is divided into two main steps: detecting the
breakpoints (positions where the underlying copy number changes) and estimating the
status of each segment. We discuss the algorithm in detail, as presented in the cited
publication.
Formal model
Each arrayCGH experiment can be formally represented as a series of $N$ independent observations $(X_1, Y_1), ..., (X_N, Y_N)$, where each $X_i$ represents the position on the
chromosome and each $Y_i$ is the corresponding measured Log2 Ratio. The locations are
sorted increasingly: $X_1 < ... < X_i < ... < X_N$. The underlying statistical model
assumed to have produced this data is a piecewise constant function with additive
Gaussian noise, meaning that the random variable $Y_i$ depends on the location $X_i$ via a
parameter $\theta$ and an additive Gaussian noise $\epsilon_i$ as follows:
\[ Y_i = \theta(X_i) + \epsilon_i \tag{2.1} \]
In this context, $\theta$ is the copy number that needs to be estimated and the $\epsilon_i$ are i.i.d.
$N(0, \sigma^2)$. The function $\theta$ is piecewise constant, which is formally expressed as:
\[ \theta(x) = \sum_{m=1}^{M} a_m 1_{I_m}(x) \tag{2.2} \]
The number $M$ of disjoint segments $I_1, ..., I_M$, the limits of the segments and the values
$a_m$ are unknown. They will be estimated from the data, such that a local likelihood
statistic is maximized.
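The model in Formulas 2.1 and 2.2 can be made tangible by simulating from it. The Python sketch below draws one artificial profile; the function name `simulate_profile`, the segment limits, the levels and the noise level are all hypothetical values chosen for illustration, not quantities taken from our data.

```python
import numpy as np

def simulate_profile(positions, segment_ends, levels, sigma, seed=0):
    """Draw Y_i = theta(X_i) + eps_i with eps_i ~ N(0, sigma^2).

    Segment m covers all positions up to and including segment_ends[m]
    and has constant level levels[m]; the last entry of segment_ends is
    assumed to be at least as large as the largest position.
    """
    positions = np.asarray(positions, dtype=float)
    idx = np.searchsorted(segment_ends, positions, side="left")
    theta = np.asarray(levels, dtype=float)[idx]       # piecewise constant theta(X_i)
    rng = np.random.default_rng(seed)
    return theta, theta + rng.normal(0.0, sigma, size=positions.size)

# Hypothetical chromosome of 250000 Kb with one gain and one loss.
positions = np.linspace(0.0, 250000.0, 500)
theta_true, y_sim = simulate_profile(
    positions,
    segment_ends=[80000.0, 120000.0, 250000.0],
    levels=[0.0, 0.8, -0.5],
    sigma=0.1,
)
```

Here $M = 3$, the levels play the role of the $a_m$, and the segment limits define the intervals $I_m$; a smoothing algorithm only sees `positions` and `y_sim` and has to recover the rest.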
The detection of the breakpoints uses the Adaptive Weights Smoothing (AWS) procedure proposed by Polzehl and Spokoiny (2002) [28]. This is an iterative algorithm that
finds around each location $X_i$ a maximum neighborhood in which the copy number
can be assumed constant and then fits a local maximum likelihood value at $X_i$. The
locality is forced by assigning weights $w_{ij}$ ($0 \le w_{ij} \le 1$) to all other observations $X_j$
and readapting these weights to the data such that eventually the observations that lie
within the same constant segment receive much larger weights than the others. The
algorithm is a two-stage iterative process: readapting the weights and reestimating the
parameters.
The readapting of the weights is done via a location penalty kernel $K_l$ that punishes distant observations (w.r.t. position on the chromosome) and a statistical penalty kernel
$K_s$ that punishes two different local models. Kernels are widely used in local regression models, with the purpose of assigning higher weights to close observations than
to distant ones and therefore forcing the prediction at a certain data point to depend
only on neighboring observations. By adding a second kernel for statistical penalty,
the AWS procedure decreases the large weights of neighboring observations to $X_i$ that
do not also have close response values to $Y_i$.
Formally, the kernels are non-increasing symmetrical functions that fulfill $K_l(0) = K_s(0) = 1$, controlled by two parameters: a geometric growth rate $a$ that enlarges the
neighborhood around the observation at $X_i$ with every iteration, and $\lambda$, that adjusts
the magnitude of the statistical penalty. A memory parameter $\eta$ stabilizes the procedure by involving the old weights in the computation of the new ones.
The reestimation of the parameters reduces to maximizing a weighted likelihood statistic of the form:
\[ L(W_i, \theta, \theta') = \sum_{j=1}^{N} w_{ij} \log \frac{p(Y_j, \theta)}{p(Y_j, \theta')} \tag{2.3} \]
where $\theta'$ is an arbitrary point in the parameter space, $W_i = \mathrm{diag}\{w_{i1}, ..., w_{iN}\}$ and
$p(\cdot, \theta)$ is the probability distribution of the response variables for a given parameter $\theta$.
Thus, the MLE for $\theta$ is given by:
\[ \hat\theta_i := \hat\theta(X_i) = \operatorname*{argsup}_{\theta \in \Theta} L(W_i, \theta, \theta') \tag{2.4} \]
Intuitively, the estimate at observation $X_i$ is the value $\hat\theta_i$ that maximizes the likelihood
that the $Y$ values in the neighborhood of $X_i$ are approximated by the constant $\theta$.
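Given the Gaussian model of the data (Formula 2.1), one can prove that the MLE
estimate coincides with the weighted least squares estimator by replacing
$p(y, \theta) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y-\theta)^2}{2\sigma^2}}$ in Formula 2.3:
\[
\begin{aligned}
\hat\theta_i &= \operatorname*{argsup}_{\theta \in \Theta} \sum_{j=1}^{N} w_{ij} \left( \log p(Y_j, \theta) - \log p(Y_j, \theta') \right) \\
&= \operatorname*{argsup}_{\theta \in \Theta} \frac{1}{2\sigma^2} \sum_{j=1}^{N} w_{ij} \left( (Y_j - \theta')^2 - (Y_j - \theta)^2 \right) \\
&= \operatorname*{argsup}_{\theta \in \Theta} \left( -\left( \sum_{j=1}^{N} w_{ij} \right) \theta^2 + 2 \left( \sum_{j=1}^{N} w_{ij} Y_j \right) \theta + \left( \sum_{j=1}^{N} w_{ij} (\theta'^2 - 2 Y_j \theta') \right) \right) \\
&= \frac{\sum_{j=1}^{N} w_{ij} Y_j}{\sum_{j=1}^{N} w_{ij}}, \quad \forall \theta' \in \Theta
\end{aligned}
\tag{2.5}
\]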
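The closed form in Formula 2.5 can be checked numerically. The Python sketch below compares the weighted mean with a brute-force maximization of the weighted Gaussian log-likelihood over a grid; the data, the weights and the helper name `weighted_log_likelihood` are hypothetical, and the grid search is only a sanity check, not part of the algorithm.

```python
import numpy as np

def weighted_log_likelihood(theta, y, w, sigma=1.0):
    """Sum_j w_j * log p(Y_j, theta) for a Gaussian density, up to an additive constant."""
    return -np.sum(w * (y - theta) ** 2) / (2.0 * sigma ** 2)

# Hypothetical responses and weights for one location X_i.
y = np.array([0.2, -0.1, 0.05, 1.0])
w = np.array([1.0, 0.8, 0.9, 0.1])

# Closed form from Formula 2.5: the weighted mean.
closed_form = np.sum(w * y) / np.sum(w)

# Brute-force maximization on a grid with step 1e-4.
grid = np.linspace(-1.0, 2.0, 30001)
values = np.array([weighted_log_likelihood(t, y, w) for t in grid])
grid_argmax = grid[np.argmax(values)]
```

The term depending on $\theta'$ is constant in $\theta$, which is why it can be dropped from the objective without changing the maximizer.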
The AWS procedure requires an estimate of the standard deviation $\sigma$ of the noise. A
robust estimator is given in [16] by the formula:
\[ \hat\sigma = \frac{\mathrm{IQR}(Z_1, ..., Z_{N-1})}{\mathrm{IQR}(N(0, 1)) \times \sqrt{2}} \]
where $Z_i = Y_{i+1} - Y_i$ and IQR is the interquartile range (i.e. the difference between
the third and the first quartile of the given sample).
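A minimal Python sketch of this estimator is given below. The function name `robust_sigma` and the simulated data are hypothetical; the quartiles are computed with NumPy's default interpolation, which may differ slightly from the convention used in [16].

```python
import numpy as np
from statistics import NormalDist

def robust_sigma(y):
    """IQR-based noise estimate from the differences Z_i = Y_{i+1} - Y_i."""
    z = np.diff(np.asarray(y, dtype=float))
    q1, q3 = np.percentile(z, [25, 75])
    # IQR of the standard normal distribution, approximately 1.349.
    normal_iqr = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)
    # Differences of i.i.d. N(0, sigma^2) values have variance 2 * sigma^2,
    # hence the additional factor sqrt(2).
    return (q3 - q1) / (normal_iqr * np.sqrt(2.0))

# Hypothetical profile: noise level 0.2 with one jump of height 1 in the middle.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.2, 2000), rng.normal(1.0, 0.2, 2000)])
sigma_hat = robust_sigma(y)
naive_std = np.std(y)
```

Working on differences makes the estimate insensitive to the jump: only one of the $Z_i$ is affected by it, and the IQR ignores such isolated outliers, whereas the plain standard deviation of `y` is inflated by the change in level.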
Initially, all weights $w_{ij}$ are set to 1 and the parameter estimates for all observations
are given by the maximum likelihood constant function fitted to the weighted data,
which is in fact the mean of all observations.
The AWS algorithm
Below we give the AWS algorithm, as presented in [28].
Input: a set of N observations (Xi , Yi )i=1..N .
Output: smoothing values (θ̂i )i = 1..N .
Parameters: penalty kernels Kl and Ks , memory parameter η, initial bandwidth h(1) ,
growth rate a, maximal bandwidth h∗ and statistical penalty control λ. We discuss the
possible values and the meaning of the parameters after the algorithm is presented.
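AWS procedure
(1) Initialization: Calculate the global MLE $\hat{\theta}^{(0)}$ of θ:
$$\hat{\theta}^{(0)} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
For every i = 1, ..., N, set $\hat{\theta}_i^{(0)} = \hat{\theta}^{(0)}$ and define $W_i^{(0)}$ as the unit matrix. Set k = 1.
(2) Iteration: for every i = 1, ..., N:
(a) Calculate the adaptive weights: For every point $X_j$, calculate the penalties:
$$l_{ij}^{(k)} = \left| \rho(X_i, X_j) / h^{(k)} \right|^2,$$
$$
\begin{aligned}
s_{ij}^{(k)} &= \lambda^{-1} \left[ L\left(W_i^{(k-1)}, \hat{\theta}_i^{(k-1)}, \hat{\theta}_j^{(k-1)}\right) + L\left(W_j^{(k-1)}, \hat{\theta}_j^{(k-1)}, \hat{\theta}_i^{(k-1)}\right) \right] / 2 \\
&= \lambda^{-1} \cdot \frac{1}{2\hat{\sigma}^2} \cdot \left( \hat{\theta}_i^{(k-1)} - \hat{\theta}_j^{(k-1)} \right) \sum_{m=1}^{N} \left( w_{im}^{(k-1)} - w_{jm}^{(k-1)} \right) \left( Y_m - \frac{\hat{\theta}_i^{(k-1)} + \hat{\theta}_j^{(k-1)}}{2} \right)
\end{aligned}
$$
where $\rho(x, x')$ is a metric in the input space and $h^{(k)}$ controls the size of
the neighborhood of each $X_i$. Calculate:
$$\tilde{w}_{ij}^{(k)} = K_l\left(l_{ij}^{(k)}\right) K_s\left(s_{ij}^{(k)}\right)$$
and define the weight $w_{ij}^{(k)}$ as:
$$w_{ij}^{(k)} = \eta\, w_{ij}^{(k-1)} + (1 - \eta)\, \tilde{w}_{ij}^{(k)}$$
Denote by $W_i^{(k)}$ the diagonal matrix $W_i^{(k)} = \mathrm{diag}\{w_{i1}^{(k)}, ..., w_{iN}^{(k)}\}$.
(b) Estimation: Calculate the new local MLE $\hat{\theta}_i^{(k)}$ of $\theta_i$:
$$\hat{\theta}_i^{(k)} = \frac{\sum_{j=1}^{N} w_{ij}^{(k)} Y_j}{\sum_{j=1}^{N} w_{ij}^{(k)}}$$
(3) Stopping: Stop if $a h^{(k)} > h^*$, otherwise increase k by 1, set $h^{(k)} = a h^{(k-1)}$ and
continue with step 2.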
The AWS procedure provides an estimate $\hat{\theta}_i$ for each observation at $X_i$. The breakpoints are the positions $X_i$ such that $\hat{\theta}_i \notin [\hat{\theta}_{i+1} - \epsilon, \hat{\theta}_{i+1} + \epsilon]$. In the default case, $\epsilon = 10^{-2}$.
The final step of the regression is to choose a maximum likelihood constant fit
within each segment, which is the mean of the Yi values, given the assumed normality
of the data.
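To make the iteration concrete, the following Python sketch runs the procedure on one-dimensional data. It is a simplified illustration under stated assumptions, not the GLAD implementation: the function name `aws_smooth` and the default values of `eta`, `h1`, `a` and `h_max` are hypothetical, both kernels are the exponential kernel, the metric is the absolute distance between positions, and the noise level `sigma` is supplied by the caller. The sum in the statistical penalty is evaluated through the row sums of the previous weights, which is algebraically equivalent to the sum over all observations.

```python
import math
from statistics import NormalDist


def aws_smooth(x, y, sigma, eta=0.5, h1=1.0, a=1.25, h_max=20.0, lam=None):
    """One-dimensional AWS sketch with kernels K_l(u) = K_s(u) = exp(-|u|)."""
    n = len(y)
    if lam is None:
        # 0.985 quantile of chi-squared(1) = squared 0.9925 quantile of N(0, 1).
        lam = NormalDist().inv_cdf(0.9925) ** 2
    # (1) Initialization: global mean, all weights equal to 1.
    theta = [sum(y) / n] * n
    w = [[1.0] * n for _ in range(n)]
    h = h1
    while True:
        # Row sums of the previous weights: sum_m w_im and sum_m w_im * Y_m.
        sw = [sum(row) for row in w]
        swy = [sum(wi * yj for wi, yj in zip(row, y)) for row in w]
        new_w = []
        for i in range(n):
            row = []
            for j in range(n):
                # (2a) Location penalty and statistical penalty.
                l_ij = ((x[i] - x[j]) / h) ** 2
                mid = 0.5 * (theta[i] + theta[j])
                s_ij = ((theta[i] - theta[j])
                        * ((swy[i] - swy[j]) - mid * (sw[i] - sw[j]))
                        / (2.0 * sigma ** 2 * lam))
                w_tilde = math.exp(-l_ij) * math.exp(-abs(s_ij))
                # Memory step: mix the old weight with the new one.
                row.append(eta * w[i][j] + (1.0 - eta) * w_tilde)
            new_w.append(row)
        w = new_w
        # (2b) New local estimates: weighted means of the observations.
        theta = [sum(wi * yj for wi, yj in zip(row, y)) / sum(row) for row in w]
        # (3) Stopping rule on the bandwidth.
        if a * h > h_max:
            return theta
        h *= a
```

On a noisy step function, the estimates on either side of the jump stay close to the respective level, because the statistical penalty suppresses the weights between observations whose current estimates differ strongly.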
Parameters discussion
The AWS procedure involves several parameters that can be tuned in order to make
the method more or less sensitive to discontinuities. We present them below, using the
same notation as in [16].
(a) Kernels Ks and Kl . By default, the kernels are exponential functions: $K_l(u) = K_s(u) = e^{-|u|}$. In order to decrease the computational complexity of the method,
the statistical penalty kernel can be chosen to be the triangle function: $K_s(u) = (1 + u)_+ \mathbf{1}_{(-\infty,0]}(u) + (1 - u)_+ \mathbf{1}_{[0,\infty)}(u)$.
(b) Memory parameter η. The value η ∈ (0, 1) can be viewed as the memory
parameter of the algorithm. The larger the value of η, the more stable the
method w.r.t. iteration. However, it decreases the sensitivity to local changes.
(c) Bandwidth parameters h(1) , a and h∗ . The initial bandwidth h(1) should be
taken as small as possible, as to comprise only one observation. The parameter
a controls the growth rate of the local neighborhoods for each observation Xi . It
should be selected such that at each iteration, the number of data points within
a distance of h(k) from Xi grows geometrically with a factor $a_{grow}$. The maximal
bandwidth h∗ can be taken very large, but since it is involved in the termination condition of the algorithm, a smaller value would reduce the computational
complexity. Data-driven optimal stopping can be decided via cross validation,
for example.
(d) Parameter λ. This is the most important parameter of the method, because
it scales the statistical penalty sij and thus directly controls the sensitivity of
the method to local changes. Small values of λ lead to overpenalization, which
may result in instability in the prediction in homogeneous regions (potentially
too many breakpoints). On the other hand, large values of λ will oversmooth
the data (less sensitivity to changes). Theoretical arguments are given in [28] to
support the choice $\lambda = t_{0.985}(\chi^2_1)$, the 0.985 quantile of a chi-squared distribution
with 1 degree of freedom.
2.3.2 Aberrations detection
Array CGH smoothing methods such as AWS do not completely solve the problem
of detecting the gained and lost DNA segments. There are regions with an estimated
Log2 Ratio very close to 0 which we might not want to classify as aberrations, since the
deviation might still be due to noise. The problem has been addressed in several publications,
typically by choosing an appropriate cutoff c and calling aberrant the segments with |Log2 Ratio| > c. A very popular cutoff
is c = 0.2, or, as proposed by Nakao et al. in [25], c = 0.225. Other, more adaptive
cutoffs take into account the genome-wide standard deviation σ of the Log2 Ratios.
The cutoff is then chosen as, for example, c = 1.3σ. Hupé et al. ([16]) cluster the
smoothed Log2 Ratios, take as normal level the maximum cardinality cluster, and assign all others to gains or losses.
Figure 2.9: Histogram of smoothed Log2 Ratios from an arrayCGH experiment along the whole
genome (x-axis: LogRatio, y-axis: Frequency). The distribution is approximately Gaussian with mean zero.
We developed our own method to determine the thresholds. Typically, we select the
segments with values significantly different from zero. Therefore, the problem reduces
to assessing significance for each homogeneous region. We also decided to compute
separate cutoffs for each array. The reason behind this decision is that each array is
a stand-alone experiment, and thus the errors introduced might have different magnitudes.
In order to get an intuition on the distribution of the GLAD estimates we analyzed corresponding histograms, which in the majority of the cases had the shape of Gaussians,
centered around zero (see Figure 2.9). This observation agrees with the expectation
that the genes with a normal copy number are more frequent than the ones in gained
(or lost) regions, when the entire genome is considered.
We gain more information about the underlying distribution by analyzing quantile-quantile
plots. For the purpose of comparing a sample distribution against a Gaussian,
theoretical quantiles of a standard normal are plotted against sample quantiles (see Figure 2.10). If the sample is Gaussian, then the plotted pairs should follow a linear trend.
As observed in almost all arrays, the middle section of the data (in the order statistic)
can be clearly considered normal, while the tails deviate significantly. The red line in
Figure 2.10 is defined by two points in the plane of theoretical versus sample quantiles, which
correspond to the first (0.25) and the third (0.75) quartiles. The quartiles are the values
that divide a distribution, or a sample from a distribution, into four equal parts. The
Gaussian distribution with the same first and third quartiles as the given sample is
unique and it is called a robust normal, because it does not depend on the first and the
last 25% of the data (in the order statistic). Figure 2.11 shows this normal distribution
fitted to the data.
Figure 2.10: QQ-plot of the smoothed Log2 Ratios from an arrayCGH experiment
along the whole genome. The x-axis shows theoretical quantiles of a standard normal distribution and the y-axis shows sample quantiles. The red line is determined by the 25%
and the 75% percentiles and corresponds to a robust normal.
Figure 2.11: Robust normal fitted to the smoothed Log2 Ratios (the red curve). It passes
through the first and the third quartiles of the data. The blue vertical lines are the loss (left)
and gain (right) cutoffs.
The cutoffs that determine gains and losses in copy number are selected two standard deviations to the left and right of the mean of the fitted robust normal. Figure
2.11 shows these cutoffs (the blue lines).
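The robust normal and the resulting cutoffs can be computed directly from the two quartiles. The Python sketch below is illustrative only: the function names are hypothetical, and the quartile interpolation rule ("inclusive") is an assumption, since different software packages use slightly different conventions. By symmetry, the mean of the fitted normal is the midpoint of the quartiles, and its standard deviation follows from the fact that the third quartile of a normal lies about 0.6745 standard deviations above the mean.

```python
from statistics import NormalDist, quantiles


def robust_normal_cutoffs(smoothed, n_sd=2.0):
    """Return (loss_cutoff, gain_cutoff) from a normal matched to the quartiles."""
    q1, _, q3 = quantiles(smoothed, n=4, method="inclusive")
    z75 = NormalDist().inv_cdf(0.75)   # about 0.6745
    mu = 0.5 * (q1 + q3)               # mean of the robust normal
    sigma = (q3 - q1) / (2.0 * z75)    # its standard deviation
    return mu - n_sd * sigma, mu + n_sd * sigma


def status_array(smoothed, loss_cutoff, gain_cutoff):
    """Map smoothed Log2 Ratios to -1 (loss), 1 (gain) or 0 (normal)."""
    return [-1 if v < loss_cutoff else 1 if v > gain_cutoff else 0
            for v in smoothed]
```

For example, `status_array(values, *robust_normal_cutoffs(values))` yields the status of every position of a single array.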
The algorithm presented above for computing thresholds that separate amplified or
deleted regions from normal regions is based on observations and heuristics. No theoretical argument for assessing its performance on data sets other than the ones we studied is
available yet. However, the method is highly intuitive and robust and it worked well on
the Prostate Cancer and Brain Tumor data sets (see Chapter 4 for validation results).
Each region in the genome with constant copy number is assigned a status, either
lost, gained or normal. A natural assignment of -1, 1 or 0 to the genes in the corresponding regions will reflect their status accordingly. We thus obtain arrays of values
{-1,0,1} and size equal to the initial arrays. Moreover, the probes that were not available on the initial arrays (for example spots that could not be resolved by the image
analysis) can now be assigned a status. This can be done by first locating them within
a segment of constant copy number (such segments exist and are unique, since they
define a partition of the genome) and then assigning the respective status to the gene.
In what follows, we will refer to these {-1,0,1} arrays as status arrays, or gain/loss
arrays.
2.3.3 Region selection
Each cancer type has specific amplifications and deletions that can be determined via
a consensus analysis of multiple CGH arrays. We have created an automated method
that selects highly recurrent alterations across many samples of the same tumor type.
The problem that needs to be solved is that of deciding in how many arrays
a region must be amplified or, respectively, deleted for it to be selected among the representative mutations for the disease. A second issue is that of determining the boundaries of
such regions when the samples do not agree on a clear start or end position. This can
also be due to biological causes, not only to our method of selecting altered segments.
To our knowledge, there is no proposed automated method for selecting highly recurrent regions across arrayCGH samples. However, from a biological point of view,
all mutations are interesting for further analysis, even those that occur in a small fraction of the arrays. These regions may show interesting particularities of the disease,
that may be correlated to individual genotypic or phenotypic characteristics.
Our solution to the region detection problem is to consider the number of arrays that
have a certain gene simultaneously gained or lost, respectively, as a random variable
Z, and then associate a p-value to each observation of this variable. The p-values can
be estimated by analyzing the distribution of Z along the genome. In what follows, we
describe only the algorithm that computes p-values for the gains; the procedure is
similar for losses.
Formally, let n be the number of arrays in the data set. We associate the random
variable Zi to each array i, representing the status of a random gene in the array:
Zi = 1 if the gene is gained and Zi = 0 otherwise. We can estimate the
distribution of each Zi . Define pi = Pr(Zi = 1), which implies Pr(Zi = 0) = 1 − pi . We
estimate pi by the frequency of gains along the genome in array i:
$$\hat{p}_i = \frac{\text{number of gained genes in array } i}{\text{total number of genes}}$$
Observe that $Z = Z_1 + Z_2 + ... + Z_n$. Since all $Z_i$ are Bernoulli distributed, $Z_i \sim \mathrm{Bernoulli}(\hat{p}_i)$, and the arrays are treated as independent, for any given k ∈ {0, ..., n} the probability Pr(Z = k) can be computed
as:
$$\Pr(Z = k) = \Pr(Z_1 + ... + Z_n = k) = \sum_{\substack{a_i \in \{0,1\} \\ \sum_i a_i = k}} \prod_{i=1}^{n} \hat{p}_i^{a_i} (1 - \hat{p}_i)^{1 - a_i}$$
Figure 2.12: P-values measuring significance of each gene in terms of the number of arrays
in which it is amplified (example). The x-axis shows the number of arrays. The y-axis shows
the associated p-value. The blue horizontal line is the significance cutoff.
For efficient computation, a recursion can be used. Denote by:
$$P_k(l) = \Pr(Z_1 + ... + Z_l = k), \qquad l \in \{1, .., n\},\ k \in \{0, .., n\}$$
Then, the following recurrence holds:
$$P_k(l) = P_{k-1}(l - 1) \cdot \hat{p}_l + P_k(l - 1) \cdot (1 - \hat{p}_l), \qquad \forall l \in \{2, .., n\},\ \forall k \in \{1, .., n\}$$
with starting values for l = 1 and k = 0:
$$P_0(l) = \prod_{i=1}^{l} (1 - \hat{p}_i) \qquad \forall l \in \{1, ..., n\} \tag{2.6}$$
$$P_k(1) = \begin{cases} 1 - \hat{p}_1 & \text{if } k = 0, \\ \hat{p}_1 & \text{if } k = 1, \\ 0 & \text{if } k > 1 \end{cases} \qquad \forall k \in \{0, ..., n\} \tag{2.7}$$
This gives $\Pr(Z = k) = \Pr(Z_1 + ... + Z_n = k) = P_k(n)$ for all k ∈ {0, .., n}.
The p-value associated to observation k of variable Z is:
$$\text{p-value}(k) = \Pr(Z \ge k) = \sum_{i=k}^{n} \Pr(Z = i)$$
An occurrence of k simultaneous gains at a certain position is unlikely to happen by
chance if the corresponding p-value is small. The p-value cutoff is a parameter of the
method. A very small cutoff would result in a small number of highly recurrent selected regions, potentially leaving out relevant mutations. However, if the cutoff is
large, many regions will be selected which could contain and therefore mask highly
relevant ones.
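The recursion and the p-values can be sketched as follows in Python. The function names are hypothetical and the code is only meant to illustrate the computation: it builds the distribution of the number of gained arrays one array at a time, which is the recurrence above written as an in-place update, and then accumulates the upper tail.

```python
def gain_count_pmf(p_hat):
    """Return [Pr(Z = 0), ..., Pr(Z = n)] for Z = sum of independent Bernoulli(p_hat[i])."""
    pmf = [1.0]  # distribution of the empty sum: Pr(0 gains) = 1
    for p in p_hat:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # current array is not gained
            nxt[k + 1] += mass * p      # current array is gained
        pmf = nxt
    return pmf


def p_values(p_hat):
    """Return [Pr(Z >= 0), ..., Pr(Z >= n)]."""
    pmf = gain_count_pmf(p_hat)
    out = [0.0] * len(pmf)
    tail = 0.0
    for k in range(len(pmf) - 1, -1, -1):
        tail += pmf[k]
        out[k] = tail
    return out
```

A position gained in k arrays is then selected if `p_values(p_hat)[k]` is below the chosen cutoff; the same functions apply to losses when `p_hat` holds the loss frequencies.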
Given the property of the mutations to be locally constant, the selection method will
output continuous regions, significantly amplified across all arrays.
In what follows, we will refer to the selected mutated regions (both gains and losses)
as genetic events. We say that an event has occurred in a sample if there is at least
one gene gained (lost, respectively) in the corresponding gained (lost) region.
Formally, let V = {1, ..., l} be the set of the l events output by the method. We can
associate with each gain/loss array s a binary array $p_s \in \{0, 1\}^l$ of length l, such that:
$$p_s[i] = \begin{cases} 1 & \text{if event } i \text{ has occurred in } s \\ 0 & \text{if event } i \text{ has not occurred in } s \end{cases} \tag{2.8}$$
We will refer to these binary arrays as genetic patterns.
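As an illustration, the two steps "group selected positions into regions" and "derive a genetic pattern from a status array" can be written as below. This is a sketch under simplifying assumptions: the function names and the representation of an event as a `(start, end, sign)` tuple of position indices are hypothetical, and chromosome boundaries, which a full implementation would have to respect, are ignored.

```python
def contiguous_regions(selected):
    """Group consecutive selected positions into (start, end) pairs, end inclusive."""
    regions, start = [], None
    for j, flag in enumerate(selected):
        if flag and start is None:
            start = j                       # a new region begins
        elif not flag and start is not None:
            regions.append((start, j - 1))  # the current region ends
            start = None
    if start is not None:
        regions.append((start, len(selected) - 1))
    return regions


def genetic_pattern(status, events):
    """events: list of (start, end, sign), sign = 1 for a gain, -1 for a loss."""
    # An event has occurred if at least one position in its region
    # carries the corresponding status.
    return [int(any(status[j] == sign for j in range(start, end + 1)))
            for start, end, sign in events]
```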
2.3.4 ArrayCGH analysis algorithm
We summarize our arrayCGH analysis method below.
Input: a set of n CGH arrays $(X, Y_k)_{1 \le k \le n}$, where $X = (X_1, ..., X_N)$ gives the genome
positions of the microarray spots and $Y_k = (Y_{k1}, ..., Y_{kN})$ are the Log2 Ratios of the k-th
experiment.
Output: a list of genetic events $E = (e_1, ..., e_l)$ and a corresponding list of genetic
patterns $P = (p_1, ..., p_n)$, $p_i \in \{0, 1\}^l$. Each genetic event is specified by its chromosome number, start and end position and type of mutation.
Parameters: smoothing parameters α (discussed in the previous section), p-value
cutoff β.
ArrayCGH analysis algorithm
1. ArrayCGH Smoothing
   For each array k ∈ {1, .., n}:
   Apply glad(α) to compute smoothed Log2 Ratios $\Theta_k = (\theta_{1k}, ..., \theta_{Nk})$
2. Aberrations detection
   For each array k ∈ {1, .., n}:
   a) Compute the first and the third quartiles of the smoothed Log2 Ratios $\Theta_k$
   b) Compute the mean $\mu_k$ and the standard deviation $\sigma_k$ of a robust Gaussian
      distribution with the same first and third quartiles as obtained in a).
   c) Compute the gain cutoff $g_k$ and the loss cutoff $l_k$:
      $$g_k = \mu_k + 2\sigma_k, \qquad l_k = \mu_k - 2\sigma_k$$
   d) Compute status arrays $s_k = (s_{1k}, ..., s_{Nk})$:
      $$s_{ik} = \begin{cases} -1, & \theta_{ik} < l_k; \\ 1, & \theta_{ik} > g_k; \\ 0, & \text{otherwise,} \end{cases} \qquad i \in \{1, ..., N\}$$
3. Region selection
   Compute p-values associated to gains and losses: $p_{gain}$ and $p_{loss}$, as described
   in (2.6) and (2.7).
   For all positions j ∈ {1, .., N}:
      If $p_{gain}(|\{k : s_{jk} = 1,\ 1 \le k \le n\}|) < \beta$ then select position j as gain.
      If $p_{loss}(|\{k : s_{jk} = -1,\ 1 \le k \le n\}|) < \beta$ then select position j as loss.
   Output the list of regions of continuous loss or gain $E = (e_1, ..., e_l)$.
   Compute genetic patterns $P = (p_1, ..., p_n)$ as in (2.8).
Chapter 3
Genetic Tumor Progression
The problem of choosing appropriate treatments for cancer patients is currently addressed based almost exclusively on clinical measurements, such as the age and sex
of the patient, the tumor volume, the lymph node spread, and the presence or absence of
metastasis. The stage of the tumor and the time until the death of a patient or until a
relapse can be better determined if additional biological markers are analyzed.
Genetic mutations such as chromosomal amplifications and deletions accumulate during cancer progression in preferred orders, allowing a more precise stage prediction.
These mutations are now measurable with higher precision via arrayCGH technology.
The accumulation process can be modeled using oncogenetic tree mixtures, as proposed
by Beerenwinkel et al. [30].
In what follows, we will introduce the formal statistical model underlying the oncogenetic trees. A measure of the progression status of a tumor computed based on the
tree mixture model will be presented in Section 3.2.
3.1 Oncogenetic Trees
The oncogenetic trees model the order in which permanent genetic changes occur during cancer evolution. Typically, the genetic events targeted are amplifications and
deletions. They are represented as the nodes of the trees, each edge in the tree being labeled with the conditional probability that the child event occurs given that the
parent event occurred. The topology of the tree mixture and the model parameters
are learned from a set of genetic patterns, that can be the result of arrayCGH data
analysis, as in our case.
3.1.1 Formal model
Let V = {1, ..., l} be a set of genetic events, to which we artificially add a null event
that occurs with probability 1. Consider also a set of n genetic patterns which describe
subsets of observed events. Each genetic pattern is modified by adding a 1 before the
first position, to indicate the occurrence of the null event.
The set of all n observed patterns can then be represented by a binary matrix of
dimension n × (l + 1):
$$X = (x_{ij})_{1 \le i \le n,\ 0 \le j \le l} \quad \text{with} \quad x_{ij} = \begin{cases} 1, & \text{if the } j\text{-th genetic event occurs in the } i\text{-th pattern;} \\ 0, & \text{otherwise,} \end{cases}$$
where column j = 0 corresponds to the null event.
Each event j has an associated binary random variable Zj that indicates the occurrence
of the event in a genetic pattern. Therefore, column j of the pattern matrix is a set of
observations of size n on the variable Zj .
Formally, an oncogenetic tree T = (V, E, r, p) has the genetic events as vertices, with
the null event r as root. The set E of edges are labeled with conditional probabilities
p : E → [0, 1], such that for an edge e = (u, v) ∈ E, p(e) = Pr(Zv = 1 | Zu = 1) is the
conditional probability of event v given event u.
Oncogenetic trees estimate the joint distribution of the events based on the observed
genetic patterns. An oncogenetic tree induces a probability distribution over the set
of all possible patterns $\Omega = \{0, 1\}^l$. Let x be a pattern and S ⊆ V the set of events
present in x. If there exists a subset $E' \subseteq E$ such that S is the set of all vertices
reachable from r in the induced subtree $T_x = (V[E'], E')$, then x can be generated by
T with the positive probability:
$$\Pr(x \mid T) = \prod_{e \in E'} p(e) \cdot \prod_{e \in E \cap (S \times (V \setminus S))} (1 - p(e)).$$
Otherwise, if there is no such edge subset $E'$, pattern x cannot be generated by T and
Pr(x | T) = 0.
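The probability of a pattern under a single tree can be evaluated edge by edge. The sketch below is an illustration with hypothetical names: the tree is stored as a dictionary `parent` mapping each event to its parent (the null event is labelled 0), and `prob[v]` holds the conditional probability on the edge entering v.

```python
def pattern_probability(parent, prob, events):
    """Pr(x | T) for the pattern whose present events are listed in `events`."""
    present = set(events) | {0}  # the null event (root) always occurs
    result = 1.0
    for v, u in parent.items():
        if v in present:
            if u not in present:
                return 0.0          # v occurred without its parent
            result *= prob[v]       # edge (u, v) was realised
        elif u in present:
            result *= 1.0 - prob[v]  # edge leaving the pattern was not realised
    return result
```

For a tree with edges 0 → 1 (0.5), 1 → 2 (0.4) and 0 → 3 (0.2), the pattern containing only event 1 has probability 0.5 · 0.6 · 0.8, while the pattern containing only event 2 has probability zero because its parent is missing.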
Learning oncogenetic trees
An algorithm for fitting a tree that approximates the multivariate distribution of the
patterns is presented by Desper et al. in [7]. The tree is constructed as a maximum
weight branching in a complete graph on l + 1 vertices, using Edmonds' algorithm in
O(|V||E|) time. The weight of an edge e = (u, v) is given by:
$$w(e) = \log \left( \frac{\Pr(u)}{\Pr(u) + \Pr(v)} \cdot \frac{\Pr(u, v)}{\Pr(u)\Pr(v)} \right) = \log \Pr(u, v) - \log(\Pr(u) + \Pr(v)) - \log \Pr(v)$$
where Pr(u) is the marginal probability of event u and Pr(u, v) is the joint probability
of events u and v. Intuitively, the weights reflect the desirability of having event v as
a direct descendant of u in the tree.
In practice, the joint probabilities are not known, therefore they have to be estimated
from the set of observations. For a large data set, the algorithm reconstructs the correct oncogenetic tree with high probability. The estimated model is not necessarily the
maximum likelihood one (quantitative statements about the quality of the approximation are given in Desper et al. [7]).
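The edge weights can be estimated from a pattern matrix by replacing the probabilities with relative frequencies. The following sketch covers only this estimation step; the maximum weight branching itself (Edmonds' algorithm) is not shown. The function name is hypothetical, and assigning a weight of minus infinity to pairs of events that never co-occur is a convention chosen here for illustration.

```python
import math


def edge_weights(patterns):
    """patterns: list of equal-length 0/1 lists; column 0 is the null event."""
    n, m = len(patterns), len(patterns[0])
    marginal = [sum(row[u] for row in patterns) / n for u in range(m)]
    weights = {}
    for u in range(m):
        for v in range(1, m):  # no edge enters the root
            if u == v:
                continue
            joint = sum(row[u] * row[v] for row in patterns) / n
            if joint == 0.0:
                weights[(u, v)] = -math.inf  # events never observed together
            else:
                weights[(u, v)] = (math.log(joint)
                                   - math.log(marginal[u] + marginal[v])
                                   - math.log(marginal[v]))
    return weights
```

Because the last term penalizes a frequent child, the edge from the more frequent to the less frequent of two co-occurring events receives the larger weight.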
Oncogenetic tree mixtures
In a real-life scenario, the assumption that all observable patterns are generated by
a single tree topology is too restrictive in a probabilistic sense. In order to avoid
the situation when an oncogenetic tree estimated based on a set of patterns does not
generate all of them with positive probability, tree mixtures have been proposed by
Beerenwinkel et al. [1]. Typically, the first tree in the mixture is called the noise component and it has a star topology, allowing all patterns to have a positive likelihood.
Figure 3.1 shows an example of oncogenetic tree mixture.
Formally, we define a K-oncogenetic tree mixture M as a collection of K oncogenetic
trees $T_k = (V, E_k, r, p_k)$ that induces a mixed distribution on the pattern space:
$$M = \sum_{k=1}^{K} \alpha_k T_k$$
with $\alpha_k \in [0, 1]$ and $\sum_{k=1}^{K} \alpha_k = 1$. Consequently, the likelihood of a pattern x in the
mixture model is given by:
$$\Pr(x \mid M) = \sum_{k=1}^{K} \alpha_k \Pr(x \mid T_k)$$
Learning oncogenetic tree mixtures
Given the number K of trees, the tree mixture has to be reconstructed based on the
observed set of patterns X. This translates into estimating the parameters αk and the
tree topologies Tk . Assume that, for each pattern, we know the tree component that
generated it. In this case, we can use the algorithm described above K times to reconstruct all trees from the corresponding subsets of patterns. However, these assignments of patterns to tree components (the responsibilities) are
not known, and therefore have to be estimated from the data. This procedure results
in an EM-like algorithm (Dempster et al., [6]).
Figure 3.1: Oncogenetic tree mixture example. 8% of the samples are best explained by independence of the events. With probability
0.92, a genetic pattern is generated by the nontrivial tree component. The edges of the trees are labeled with the conditional
probabilities between the events, the confidence intervals and the bootstrap sample counts.
We present below the main steps of the algorithm, as given in [30]:
1. Guess initial responsibilities: Run (K-1)-means clustering algorithm on all patterns and set responsibilities according to clustering results.
2. Maximization-like step: Estimate the star tree T1 and the other components
T2 ...TK with Edmonds’ algorithm from all events weighted with their responsibilities. Compute the mixture parameters as the sum of responsibilities.
3. Expectation step: Compute new responsibilities of all patterns from likelihood
with respect to tree components.
The only parameter that still needs to be estimated is the optimum number of trees
K. This can be done via cross-validation, by trying out several values for K and then
choosing the simplest model within one standard deviation of the one that achieves the
maximum mean log-likelihood.
The presented algorithm differs from the traditional EM algorithm in the fact that
the maximization step does not provide an ML estimate. Moreover, convergence to a
local maximum of the log-likelihood is not guaranteed.
The detailed algorithm is given below (here N denotes the number of patterns, written n above):
EM-like algorithm for learning K-oncogenetic tree mixtures
1. INPUT
• Patterns of events $X = (x_{ij})_{1 \le i \le N,\ 1 \le j \le l}$
• Number of oncogenetic trees K ≥ 2
2. OUTPUT
• K-oncogenetic tree mixture model $\sum_{k=1}^{K} \alpha_k T_k$
3. PROCEDURE
1. Guess initial responsibilities:
(a) Run (K-1)-means clustering algorithm
(b) Set responsibilities
$$\gamma_{ik} = \begin{cases} \frac{1}{2}, & \text{if } x_i \text{ is in cluster } k - 1; \\ \frac{1}{2(K-1)}, & \text{else.} \end{cases}$$
2. M-like step. Update model parameters:
Set $N_k = \sum_{i=1}^{N} \gamma_{ik}$ for all k = 1, ..., K
Let $T_1$ be a star with edge weights
$$\beta = \frac{1}{l N_1} \sum_{j=1}^{l} \sum_{i=1}^{N} \gamma_{i1} x_{ij}$$
For k = 2, ..., K:
(a) For all pairs of events (u, v), 1 ≤ u, v ≤ l, estimate their joint probabilities
$$p_k(u, v) = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_{iu} x_{iv}.$$
(b) Compute the maximum weight branching $T_k$ from the complete digraph
with weights w derived from $p_k$.
(c) Compute the mixture parameter $\alpha_k = N_k / N$.
3. E-step. Compute responsibilities:
$$\gamma_{ik} = \frac{\alpha_k \Pr(x_i \mid T_k)}{\sum_{m=1}^{K} \alpha_m \Pr(x_i \mid T_m)}$$
4. Iterate steps 2 and 3 until convergence.
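The E-step and the update of the mixture parameters are simple enough to sketch directly. The code below is an illustration with hypothetical function names; it assumes that the likelihoods Pr(x_i | T_k) of every pattern under every tree component have already been computed and are passed in as a matrix.

```python
def e_step(alpha, likelihoods):
    """likelihoods[i][k] = Pr(x_i | T_k); returns responsibilities gamma[i][k]."""
    gamma = []
    for row in likelihoods:
        joint = [a * lk for a, lk in zip(alpha, row)]
        total = sum(joint)  # Pr(x_i | M); positive because of the star component
        gamma.append([j / total for j in joint])
    return gamma


def mixture_weights(gamma):
    """alpha_k = N_k / N, with N_k the sum of responsibilities of component k."""
    n = len(gamma)
    return [sum(row[k] for row in gamma) / n for k in range(len(gamma[0]))]
```

Each row of the returned responsibilities sums to one, so the updated mixture parameters also sum to one.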
Model stability
Beerenwinkel et al. [30] propose to use bootstrap analysis (Hastie, Tibshirani and
Friedman, 2001) for measuring the stability of the trees, i.e. for measuring the dependence of the topology on sampling effects. The task is reduced to single oncogenetic
trees. Given the responsibilities γ computed with the EM algorithm, resampling with
replacement is carried out for each pattern xi with probability γik . From the bootstrap
sample of size N , an oncogenetic tree is reconstructed. The procedure is repeated sufficiently many times. As a test statistic, the relative count of each edge e ∈ E in the
bootstrap trees is considered.
The second, nontrivial component in Figure 3.1 shows strong evidence of the events
8q13,24+ and 14q12,24 as initial events, and, for example, of the succession 1q21-23+
→ 18q21,23-.
3.2 Genetic Progression Score
Various clinical and histological markers have been proposed and used for determining
the progression status of human tumors. The main applications are prediction of survival time and selection of a suitable therapy for every patient. Scores that measure
the progression status of tumor samples can be computed from their associated genetic
patterns. Many naive scores assume independence and cumulative effects of the genetic events, assumptions that do not hold in general. The genetic progression score
(GPS) proposed in [1] integrates dependencies between events by using oncogenetic
tree mixture models. In what follows, we will formally present the GPS and briefly
discuss its predictive power.
Time stamps are added to the oncogenetic trees in order to express the accumulation of genetic events not only in terms of preferred order, but also considering the
time intervals at which subsequent events occur. Given this supplementary information, we can estimate the "age" of a tumor and also give a prediction of the survival
time.
A timed oncogenetic tree can be obtained by assuming independent Poisson processes
for the occurrence of events on the tree edges and for the sampling time of the tumor
(i.e. the time from onset until the tumor is analyzed). Denote by Ti the waiting time of
event i, representing the difference of occurrence times between the parent event pa(i)
and the event i itself. Assume Ti is exponentially distributed with parameter λi and
let the sampling time of the tumor Ts be exponentially distributed with parameter λs .
We want to label all edges (pa(i), i) in the tree with the estimated value of the waiting
time Ti .
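Observe that, if X and Y are independent and exponentially distributed with parameters λ and µ
respectively, the following relations hold:
$$E[X] = \frac{1}{\lambda} \tag{3.1}$$
$$\Pr(X \ge t) = \int_t^{\infty} \lambda e^{-\lambda x}\, dx = e^{-\lambda t}, \qquad \forall t > 0 \tag{3.2}$$
$$
\begin{aligned}
\Pr(X \ge \alpha + \beta \mid X \ge \beta) &= \frac{\Pr(X \ge \alpha + \beta \wedge X \ge \beta)}{\Pr(X \ge \beta)} \\
&= \frac{\Pr(X \ge \alpha + \beta)}{\Pr(X \ge \beta)} \\
&\overset{(3.2)}{=} \frac{e^{-\lambda(\alpha + \beta)}}{e^{-\lambda \beta}} = e^{-\lambda \alpha} \\
&= \Pr(X \ge \alpha), \qquad \forall \alpha, \beta > 0
\end{aligned}
\tag{3.3}
$$
and
$$
\begin{aligned}
\Pr(X \ge Y) &= \int_0^{\infty} \left( \int_y^{\infty} \lambda e^{-\lambda x}\, dx \right) \mu e^{-\mu y}\, dy \\
&= \mu \int_0^{\infty} e^{-(\lambda + \mu) y}\, dy \\
&= \frac{\mu}{\lambda + \mu}
\end{aligned}
\tag{3.4}
$$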
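The relations above are used for computing the expected waiting time $E[T_i]$. Denote
by $T_{[i]}$ the cumulative time until the occurrence of event i. Then
$$
\begin{aligned}
p_i &= \Pr(Z_i = 1 \mid Z_{pa(i)} = 1) \\
&= \Pr(T_s \ge T_i + T_{[pa(i)]} \mid T_s \ge T_{[pa(i)]}) \\
&\overset{(3.3)}{=} \Pr(T_s \ge T_i) \\
&\overset{(3.4)}{=} \frac{\lambda_i}{\lambda_i + \lambda_s}
\end{aligned}
\tag{3.5}
$$
Therefore,
$$E[T_i] \overset{(3.1)}{=} \frac{1}{\lambda_i} \overset{(3.5)}{=} \frac{1 - p_i}{p_i} \cdot \frac{1}{\lambda_s} = \frac{1 - p_i}{p_i}\, E[T_s] \tag{3.6}$$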
The tumor age at the time of sampling is not known, in general. Thus, the parameter
λs cannot be estimated from the data and the expected waiting time E[Ti ] cannot be
scaled to the true time scale of the oncogenetic process. We define unitless waiting
times $E[T_i]$ by normalizing the mean sampling time to $E[T_s] = \frac{1}{\lambda_s} = 1$.
The expected times E[Ti ] can therefore be explicitly calculated from the oncogenetic
trees using formula (3.6). In what follows, we show how to extend the computations
to estimating waiting time for a genetic pattern x = (x1 , ..., xl ).
Intuitively, waiting times accumulate when traversing the tree from the root towards
the leaves. However, in general, one cannot compute the expectation of the
resulting random variable in closed form. Therefore, the proposed solution is to simulate the waiting
process along the tree $n_{sim}$ times (typically $n_{sim} \ge 10^6$) by drawing observations from
variables $T_i \sim \exp(\lambda_s p_i / (1 - p_i))$ on all tree edges. For each simulation, a natural
consistency rule is applied to filter out the cases when there exist events i and j such
that xi = 1 and xj = 0, but the observed waiting time for event i is larger than the
waiting time for event j. The observed waiting time for an event i is the sum of all
simulated waiting times of events lying on the path from i to the root.
In cases of inconsistency, a NULL waiting time is returned. For all consistent simulations, the cumulative rule summarized in Figure 3.2 is applied to obtain an observation
on the waiting time for x. The expected waiting time of the pattern is finally estimated
as the average of all observed waiting times.
Figure 3.2: Estimating waiting times for patterns in timed oncogenetic trees. For pattern
x in which all events occurred, i.e. x = (1, 1, 1, 1), first observations $t_i$ are drawn from
exponential distributions with parameters $\lambda_s p_i / (1 - p_i)$ and a cumulative waiting time is
computed as $t = \max(t_1 + t_3, t_2)$. In general, waiting times of subsequent events are added
and the maximum of the cumulative times of events in different subtrees is chosen.
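The simulation can be sketched for a single timed tree as follows. This is a simplified illustration, not the implementation of [30]: the function name, the default number of simulations and the fixed seed are assumptions made here, the tree is given by a dictionary `parent` (the null event is labelled 0) and the edge probabilities `prob[i]`, which must lie strictly between 0 and 1, and inconsistent draws are simply discarded.

```python
import random


def simulate_waiting_time(parent, prob, pattern, n_sim=20000, seed=0, lambda_s=1.0):
    """Monte Carlo estimate of the expected waiting time of a pattern on one tree."""
    rng = random.Random(seed)
    present = set(pattern)
    # Rate of the waiting time on the edge entering i: lambda_s * p_i / (1 - p_i).
    rate = {i: lambda_s * p / (1.0 - p) for i, p in prob.items()}

    def path_time(i, draws):
        # Observed waiting time of event i: sum of the draws up to the root.
        total = 0.0
        while i != 0:
            total += draws[i]
            i = parent[i]
        return total

    total, accepted = 0.0, 0
    for _ in range(n_sim):
        draws = {i: rng.expovariate(rate[i]) for i in parent}
        times = {i: path_time(i, draws) for i in parent}
        latest_present = max((times[i] for i in present), default=0.0)
        earliest_absent = min((times[j] for j in parent if j not in present),
                              default=float("inf"))
        if latest_present > earliest_absent:
            continue  # inconsistent: a present event came after an absent one
        total += latest_present
        accepted += 1
    return total / accepted if accepted else None
```

With the mean sampling time normalized to 1, an edge probability of 0.5 corresponds to a unit expected waiting time, so a chain of two such edges with both events present gives an estimate close to 2. The function returns `None` when no simulation is consistent, for example when a child event is present but its parent is not.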
We refer to this unitless waiting time as the GPS of the pattern. Thus, GPS reflects
the progression of tumor development along the oncogenetic tree model of genetic
aberrations. For the sample x, we define:
$$\mathrm{GPS}(x) = E_M(T_x),$$
where $T_x$ denotes the waiting time until pattern x and the expectation is taken with
respect to the distribution induced by the underlying oncogenetic tree mixture model
M. Figure 3.3 shows the nontrivial oncogenetic tree from Figure 3.1, annotated with
waiting times.
In [30], GPS was computed for tumor samples from Prostate Cancer and Glioblastomas. For both cases, the results showed that GPS has prognostic value, i.e. it can
be used to differentiate patient subgroups with respect to expected clinical outcome.
For example, in the prostate cancer case, the patients with GPS < 1 have a significantly longer time to PSA (prostate specific antigen) recurrence than the ones with
GPS > 1. The prostate specific antigen is a substance produced by the prostate that
may be found in an increased amount in the blood of patients who have prostate cancer, and it is widely used as an indicator of the presence of the disease. Moreover, the GPS can
improve performance over established histopathological parameters, such as the Gleason
score in the prostate cancer case. The Gleason score is a measure of tumor aggressiveness that ranges between 2 (the lightest) and 10 (the most severe). For
the largest group of tumors, with an average Gleason score of 7, experiments showed
that GPS further identifies subgroups with different prognosis with respect to time to
relapse after surgery.
Figure 3.3: Example of timed oncogenetic tree. Each edge is annotated with the expected
waiting time until the occurrence of the child, once the parent has occurred. For this example,
the mean sample time was assumed to be $E[T_s] = 1/\lambda_s = 100$, i.e. $\lambda_s = 0.01$.
Chapter 4
Application to Cancer Data
We tested our algorithm for analysis of arrayCGH data on two array CGH data sets,
representing experiments on prostate and glioblastoma tumors. This chapter presents
the results of the two applications, starting with the description of the input data and
technical details of the implementation, followed by the presentation, interpretation
and validation of the results.
4.1 Array CGH data sets
Prostate Cancer
The prostate cancer data set was made available by the Department of Urology and
Pediatric Urology of University of Saarland, via a collaboration with Prof. Dr. Bernd
Wullich and Dr. Jörn Kamradt. It consists of 17 array CGH experiments on tumors
belonging to 4 prostate cancer cell lines: PC3, DU145, LNCaP and CWR22.
Cell lines are populations of cells derived from a single ancestor cell and grown in
the laboratory. In cancer research, a parent cell from a tumor tissue is cultivated in
vitro or in vivo (typically implanted and grown in mice). The resulting cells are genetically identical due to the common ancestor, therefore arrayCGH experiments on cell
lines have high quality.
PC3 cell lines were generated from a skeletal metastasis and DU145 from a brain metastasis. They are both very advanced forms of cancer and highly resistant to therapies.
The LNCaP cells originate from a lymph node metastatic lesion of human prostatic
carcinoma, and have been widely used in the study of prostate cancer. CWR22 derive
from a primary prostate tumor with bone metastasis with a Gleason score of 9, a measure
of the tumor aggressiveness that ranges from 2 (the lightest) to 10 (the most severe).
It is shown in the literature that prostate tumors are characterized by an increased
level of male-specific hormones called androgens and that they evolve from an androgen-dependent
early stage, which responds well to antiandrogenic treatments, to advanced
stages which are androgen-independent. Cell lines PC3 and DU145 are androgen-independent,
while LNCaP and CWR22 are androgen-dependent.
Due to the different metastasis locations and stages of evolution of the 4 cell lines,
it is expected that they exhibit different patterns of accumulation of genetic mutations.
The array CGH analysis in Section 4.3 will show these differences.
Glioblastomas
The brain tumor data set was published by Markus Bredel et al. in 2005 and is
available at the Stanford Microarray Database (footnote 1). The description of the experiment
parameters can be found in [4]. Out of the 54 arrays, we selected 21 that refer to
glioblastomas, a type of malignant brain tumor that grows rapidly and is fatal in most
cases. It is known that this type of tumor has cells that are genetically very
different from healthy cells.
The resolution of the experiments is approximately 40,000 genes, out of which 36,000
were mapped to a genome position. Normalized Log2 Ratios, locuslink IDs and gene
names are available for each spot on the array. Section 4.4 presents the results of our
statistical analysis of the array CGH experiments on the glioblastomas data set.
4.2 Implementation
We implemented our method in R (version 2.1.1), a software environment for statistical
computing and graphics (footnote 1) that is often used in bioinformatics applications. R is a high-level
programming language, suitable for our purposes since it allows easy and efficient
computations with large matrices. Moreover, a comprehensive collection of packages
implementing most of the frequently used statistical learning algorithms is available.
The main steps of the application are described below, together with the packages and
functions used (see Chapter 2, Section 2.3.4 for the algorithm).
4.2.1 Implementation steps
1. Input
Each input CGH array is a data frame object that keeps for each spot on the
experimental chip the following information:
• Chromosome: the chromosome to which it is mapped
(Footnotes. Stanford Microarray Database: http://smd.stanford.edu/cgi-bin/publication/viewPublication.pl?pub_no=182. R: http://www.r-project.org)
• PosOrder: an index which gives the relative order of the spots on the corresponding chromosome
• PosBase: the base position on the specified chromosome at which the corresponding sequence begins
• LogRatio: the normalized log2 ratio of the two fluorescence intensities as
output by the array CGH experiment
These fields are obligatory. Additionally, if available, information on the gene
name, locuslink ID or other gene identifiers is useful for a quick interpretation
of the results.
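The input format above can be illustrated with a minimal sketch. It is written in Python rather than R, so it is not the code used in this work; the helper name check_spots, the optional GeneName field and all numeric values are hypothetical and serve only to show the four obligatory fields and the ordering of the spots along each chromosome.

```python
REQUIRED_FIELDS = ("Chromosome", "PosOrder", "PosBase", "LogRatio")

def check_spots(spots):
    """Check the obligatory fields and return the spots ordered along the genome."""
    for index, spot in enumerate(spots):
        missing = [field for field in REQUIRED_FIELDS if field not in spot]
        if missing:
            raise ValueError(f"spot {index} lacks required field(s): {missing}")
    # PosOrder gives the relative order of the spots on their chromosome
    return sorted(spots, key=lambda s: (s["Chromosome"], s["PosOrder"]))

# Illustrative values only; the second spot has an unresolved (missing) ratio
spots = [
    {"Chromosome": 1, "PosOrder": 2, "PosBase": 2_500_000, "LogRatio": 0.12},
    {"Chromosome": 1, "PosOrder": 1, "PosBase": 1_200_000, "LogRatio": None},
    {"Chromosome": 2, "PosOrder": 1, "PosBase": 300_000, "LogRatio": -0.40,
     "GeneName": "hypothetical"},
]
ordered = check_spots(spots)
```

Optional identifiers such as a gene name are simply carried along as extra fields.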
2. Array CGH smoothing
The array CGH smoothing step is necessary for noise reduction and amounts
to fitting a piecewise constant function to the Log2 Ratios. We used the package
GLAD (footnote 1) which, apart from the main function glad that implements the piecewise
constant regression, contains several plotting facilities for a convenient visualization
of the results, e.g. cytogenetic band annotation. Cytogenetic bands are
segments of the chromosomes of different fluorescent intensities as a result of
staining.
The function glad adapts the more general function aws from the R package
AWS (footnote 2), which implements a local polynomial adaptive weights smoothing method,
to the particular setting of array CGH analysis.
We used glad with the default parameters setting:
glad(profileCGH, smoothfunc="lawsglad",
model="Gaussian", lkern="Exponential",
qlambda=0.985, ...)
• profileCGH: the input array CGH data in the format required above;
• smoothfunc: chooses the piecewise regression method used, in the default
case adaptive weights smoothing (aws);
• model: determines the underlying assumption about the distribution of the
Log2 Ratios, in the default case Gaussian;
• lkern: chooses the location kernel for the aws function (see the Kl parameter
in the parameter discussion section on page 17);
(Footnotes. 1: http://www.bioconductor.org/packages/bioc/stable/src/contrib/html/GLAD.html. 2: http://cran.r-project.org/src/contrib/Descriptions/aws.html)
• qlambda: stochastic penalty that tunes the sensitivity of the method to local
changes (see parameter λ on page 17)
The function glad calls aws, with the following parameters:
aws(y, x = NULL, p = 0, sigma2 = NULL, qlambda = NULL,
eta = 0.5, lkern = "Triangle", hinit = NULL,
hincr = NULL, hmax = 10, ...)
• y: the observed Log2 Ratios;
• x: the chromosomal locations of the observations;
• p: degree of the local polynomial used in the regression model; p = 0 for a
local constant fit;
• sigma2: estimate of the variance of the model; if NULL, it is estimated from
the data;
• qlambda: stochastic penalty, same as in glad;
• eta: memory parameter used to stabilize the procedure, with 0.5 default
value (see parameter η on page 17);
• lkern: location kernel, same as in glad, but with a different default value
(see parameter Kl on page 17);
• hinit: initial bandwidth for the location penalty (see parameter h(0) on
page 17); glad sets this parameter to a default value of 1;
• hincr: factor to increase the bandwidth between iterations (see parameter
a on page 17); its default value when called by glad is 1.2
• hmax: maximal bandwidth to be used, set to 10 by default; it determines
the number of iterations and is used as the stopping rule (see parameter h∗
on page 17).
The output of the glad function is a data frame object that contains the complete input information and a new column Smoothing, specifying the fitted values.
Missing values handling
Array CGH data contain many missing (NA) Log2 Ratio values due to unresolved spot images on the experimental chips. The missing values are nevertheless
sparse, and predicting them is useful for a coherent alignment of the
arrays in the consensus analysis. We use the smoothing function as a natural predictor of
missing values: following the local constant assumption, we assign to a missing
Log2 Ratio the fitted value of the neighboring observations.
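A minimal sketch of this imputation idea is given below, in Python rather than in the R code used for this work. It assumes that "neighboring" means the nearest observed spot on the same chromosome, and that a tie between two equally distant neighbors goes to the left one; both choices, as well as the function name fill_missing, are assumptions made for illustration.

```python
def fill_missing(smoothed):
    """Replace None entries by the fitted value of the nearest observed spot.

    `smoothed` holds the fitted values of one chromosome in PosOrder order.
    Ties between equally distant neighbors go to the left one (an assumption).
    """
    observed = [i for i, value in enumerate(smoothed) if value is not None]
    if not observed:
        raise ValueError("no observed value on this chromosome")
    filled = list(smoothed)
    for i, value in enumerate(smoothed):
        if value is None:
            nearest = min(observed, key=lambda j: (abs(j - i), j))
            filled[i] = smoothed[nearest]
    return filled

# Illustrative piecewise constant profile with three missing spots
chromosome = [0.0, 0.0, None, 0.6, 0.6, None, None, -0.5]
filled = fill_missing(chromosome)
```

Because the fitted profile is piecewise constant, a missing spot inside a segment simply inherits the level of that segment; only spots at a breakpoint depend on the tie rule.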
3. Aberrations detection
We used hist from the graphics package to visualize histograms of the smoothed
Log2 Ratios for each array separately, which suggested normal distributions. We
also used the package stats, containing the function qqnorm, which produces
quantile-quantile plots, and the function qqline, which draws a line through the first
and third quartiles of such a plot; the intercept and slope of this line give the mean and
standard deviation of a robust normal fitted to the Log2 Ratios.
The gain and loss cutoffs at 2 standard deviations from the mean determine
the status of each observation spot to be either 1 (gain), 0 (normal) or −1 (loss).
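The aberration detection step can be sketched as follows, in Python rather than R. The sketch assumes that the robust normal is the one whose first and third quartiles match those of the data, which is the line a quartile-based quantile-quantile fit produces; the function names and the toy values are hypothetical.

```python
from statistics import NormalDist

def quantile(sorted_values, p):
    """Linear-interpolation quantile (the default rule of R's quantile)."""
    h = (len(sorted_values) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (h - lo) * (sorted_values[hi] - sorted_values[lo])

def robust_normal(values):
    """Mean and standard deviation of the normal matching the data quartiles."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    z1, z3 = NormalDist().inv_cdf(0.25), NormalDist().inv_cdf(0.75)
    sd = (q3 - q1) / (z3 - z1)   # slope of the line through the quartiles
    mean = q1 - sd * z1          # intercept of that line
    return mean, sd

def call_status(values, n_sd=2.0):
    """Return 1 (gain), 0 (normal) or -1 (loss) for every smoothed value."""
    mean, sd = robust_normal(values)
    gain_cut, loss_cut = mean + n_sd * sd, mean - n_sd * sd
    return [1 if v > gain_cut else -1 if v < loss_cut else 0 for v in values]

# Toy smoothed profile: a bulk near zero plus one gained and one lost spot
smoothed = [0.0] * 4 + [0.05] * 4 + [-0.05] * 4 + [0.9, -1.1]
mean, sd = robust_normal(smoothed)
status = call_status(smoothed)
```

Using quartiles instead of the sample mean and variance keeps the two extreme spots from inflating the estimated spread, so the cutoffs stay close to the bulk of the data.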
4. Region selection
The region selection algorithm amounts to aligning all the status arrays and
computing a p-value for each observation. We implemented the dynamic programming scheme suggested in Section 2.3.3, which requires quadratic
running time in the number of arrays.
A cutoff of 0.01 was used to filter out all chromosomal regions with a larger p-value. The remaining regions are called genetic events. They are stored in a data
frame object containing the following information:
• Chromosome: the chromosome on which the region is located
• Start: the starting position of the region, in base pairs
• End: the ending position of the region, in base pairs
• Status: the type of mutation, either 1 (gain) or −1 (loss)
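The following Python sketch shows one way a per-position p-value can be computed by dynamic programming in time quadratic in the number of arrays. It is not a transcription of the scheme of Section 2.3.3: it assumes a null model in which each array shows an aberration at a given position independently, with a probability equal to its own overall aberration frequency, so that the number of aberrant arrays follows a Poisson-binomial distribution. The function names and the frequencies used are hypothetical.

```python
def count_distribution(probs):
    """dist[k] = probability that exactly k of the independent arrays are aberrant."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this array is normal at the position
            nxt[k + 1] += mass * p       # this array is aberrant at the position
        dist = nxt
    return dist

def p_value(probs, observed_count):
    """Probability of at least `observed_count` aberrant arrays under the null."""
    return sum(count_distribution(probs)[observed_count:])

def min_significant_count(probs, cutoff=0.01):
    """Smallest number of aberrant arrays whose p-value falls below the cutoff."""
    for k in range(len(probs) + 1):
        if p_value(probs, k) < cutoff:
            return k
    return None

# Hypothetical per-array gain frequencies for 17 arrays
gain_rates = [0.05] * 17
needed = min_significant_count(gain_rates)
```

Under such a model the p-value at a position depends only on how many arrays are aberrant there, which is why a fixed cutoff translates into a minimum number of arrays showing the aberration.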
5. Oncogenetic trees estimation
To estimate oncogenetic trees from the genetic events selected, we used the
mtreemix software package (footnote 1) developed by Niko Beerenwinkel et al. [2].
Input files format
Two input files with the same name and different extensions are required as input: a profile (.prf) and a pattern (.pat) file.
The profile file contains a listing of the genetic event labels, one label per line.
For an easy interpretation, these labels should refer to established chromosome
region nomenclature, such as cytogenetic bands for large regions or individual gene
names in the case of narrow regions.
(Footnote 1: http://mtreemix.bioinf.mpi-sb.mpg.de)
The pattern file contains a binary data matrix preceded by its dimensions (number of rows, number of columns) in the first line. Each column corresponds to a
genetic event, starting with the null event added artificially as explained in section 3.1.1. Each row represents the genetic pattern of an array, starting with a 1
in the null event column and followed by a space-separated list of zeroes and ones.
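The two input files can be produced with a few lines of code. The Python sketch below follows only the description given above; the helper name write_input_files is hypothetical, and whether the profile file should carry a label for the null event is an assumption (here the caller passes it as the first label).

```python
import os
import tempfile

def write_input_files(basename, labels, patterns):
    """Write basename.prf (one label per line) and basename.pat (binary matrix).

    `labels` starts with a label for the null event (an assumption);
    each pattern lists the remaining events as 0/1 values.
    """
    for pattern in patterns:
        if len(pattern) != len(labels) - 1:
            raise ValueError("pattern length does not match the number of events")
    with open(basename + ".prf", "w") as prf:
        for label in labels:
            prf.write(label + "\n")
    with open(basename + ".pat", "w") as pat:
        pat.write(f"{len(patterns)} {len(labels)}\n")   # rows, columns
        for pattern in patterns:
            row = [1] + list(pattern)                   # null event always present
            pat.write(" ".join(str(v) for v in row) + "\n")

# Illustrative use with two events and two arrays
workdir = tempfile.mkdtemp()
basename = os.path.join(workdir, "example")
write_input_files(basename, ["null", "8q+", "8p22-"], [[1, 0], [1, 1]])
```

Both files share the base name and differ only in the extension, as required.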
Applications
• mtreemix_bootstrap: fits a mixture model and analyzes its stability using
bootstrap sampling. Among the parameters required are the profile and
pattern files, the number of trees in the mixture model, the parameter of
the exponential sampling times and the minimum conditional probability to
include an edge in the model.
• mtreemix_time: adds waiting time estimates to the mixture model. The
parameters required include the ones listed above; in addition, the number
of simulations used for time estimation must be specified.
• mtreemix_select: carries out model selection, optimizing the number of
trees in the mixture model. The model selection method can be either
cross-validation, modified Bayes Information Criterion (BIC) or standard
BIC [13].
The trees, the bootstrap stability values and the estimated waiting times can be
visualized using the treeify function.
4.3 Results for analysis of prostate cancer data
This section presents the results of our arrayCGH analysis applied to the Prostate
Cancer data set, compares them to previous experiments and gives an interpretation
of the estimated oncogenetic trees.
4.3.1 Genetic mutations in individual arrays
The analysis of individual arrays supports the assumption that the smoothed Log2 Ratios
follow normal distributions around 0, but the quantile-quantile plots show that the
tails of these distributions deviate. Figures A.1 to A.17 from Appendix A show the
histograms of the smoothed Log2 Ratios together with the fitted robust normals, the
corresponding qq-plots, the gains/losses cutoffs and the resulting mutations on the first
chromosome of each of the 17 arrays.
Our algorithm successfully chooses gain and loss cutoffs that separate the tails of the
distributions in a fully data-adaptive way. However, since our method searches for
regions with Log2 Ratios significantly different from 0, it cannot identify low-magnitude
amplifications or deletions which might appear repeatedly in a large fraction of the arrays. For instance, experiments HV3 20 68CGH (Fig. A.4), HV3 20 60CGH (Fig.
A.7), HV3 20 58CGH (Fig. A.15), HV3 20 70CGH (Fig. A.16) and HV3 20 61CGH
(Fig. A.17) show a low-magnitude loss on the short arm of chromosome 1, between
1p12 and 1p34, which is not output by our method.
Figure 4.1: p-values associated with gains and losses in the Prostate data set. Regions with
p-values smaller than 0.01 are called significant. The figures show that at least 5 gains or 4
losses should be observed at a certain position in order to select it.
From a biological point of view, these regions may be interesting to analyze, and
our method can be adjusted to be more sensitive to low amplifications or deletions by
shrinking the cutoffs (e.g. to 1.5 times the standard deviation of the robust normals).
However, this might lead to an undesired masking of narrow regions of high gain or
loss, which lie within low regions.
In our experiments, we focus on identifying rather narrow, highly specific gained or
lost regions, as they make the search for single genes involved in tumor genesis easier.
The same reason also justifies the use of high resolution CGH array technology.
4.3.2 Consensus analysis
The p-value cutoff of 0.01 selects the regions where at least 5 out of 17 arrays show a
gain in copy number and those where at least 4 out of 17 show a loss. Figure 4.1 shows
the p-value curves for gains and losses in the Prostate Cancer data set.
Our method outputs 19 genetic events, listed in Table 4.1. We compared our results
with previously published studies on Prostate Cancer genetic mutations [22], which
were typically the outcome of CGH experiments.
Chromosome 1. Amplifications of the entire long arm (1q) were indicated previously. Our method detects narrower gains at 1q21-q23 and 1q24, due to the higher
resolution of array CGH data. It also shows a clear loss at 1q25-q32, a mutation that
was not detected by chromosomal CGH experiments.
Chromosome 4. Previous results show a loss in the long arm, from 4q31 to the end,
a mutation also detected by our analysis. An additional loss was identified at 4q22.
Chromosome 5. Amplifications of the entire short arm 5p were identified by chromosomal CGH. Array CGH data show a much narrower gain at 5p12.
Chromosome 7. Gains located on the 7th chromosome are frequently reported in the
literature, either of the entire chromosome, or only of the short arm (7p) or regions of
the long arm. The array CGH analysis shows that 5 out of the 17 arrays we used have
an amplification of the SEMA3C gene, positioned within 7q21 band. Overexpression
of this gene has already been correlated with ovarian and breast cancer [12] and malignant gliomas [31]. It is believed to play a role in cancer metastasis.
Chromosome 8. Gains of the long arm 8q of chromosome 8 and losses of the short
arm 8p are two of the most referenced mutations in prostate cancer. Our method
confirms the amplification of the entire 8q arm, but narrows the loss in the short arm
down to 8p22.
Chromosome 10. Amplifications located at 10q are known from CGH experiments.
Our analysis of CGH arrays outputs a much narrower region, located at 10q22.
Chromosome 12. Our analysis locates a gain in copy number of the KRAS gene, a
well-known oncogene located in band 12p12.1 and involved in many types of cancer.
Chromosome 14. Chromosomal CGH reports amplifications along the 14q arm,
which are confirmed by our analysis. Our analysis identifies three narrow regions, one of which reduces to a few genes.
Chromosome 18. Among the frequently referenced mutations in prostate cancer is
the loss of the long arm of chromosome 18. Our method identifies the same region.
Chromosome 19. 5 out of the 17 CGH arrays show a loss of a narrow region located at 19q13.4. This region contains several genes that encode zinc finger proteins
(ZNF701, ZNF137). These proteins are involved in DNA transcription processes. Many
genes encoding zinc finger proteins appear as candidate tumor suppressor genes in the
literature [34].
Figure 4.2 shows on two separate consensus plots the amplifications and deletions detected by our method. For a more detailed view, see Figures A.18 to A.30 in Appendix
A.
Chromosome   Start Position   End Position   Mutation   Label
1            140608453        158830654      gain       1q21-q23+
1            168325828        171032956      gain       1q24+
1            171062336        196730612      loss       1q25-q32-
2            43742879         53975994       loss       2p16-p21-
2            162306337        162426906      loss       PSMD14-
4            89900841         99866357       loss       4q22-
4            147757519        191599721      loss       4q31-q35-
5            29115522         43704902       gain       5p12+
7            79984333         79984333       gain       SEMA3C+
8            68008046         142189219      gain       8q+
8            185394           11602966       loss       8p22-
10           73973114         81580145       gain       10q22+
12           25249834         25259666       gain       12p11.1+
12           27004735         29616341       gain       KRAS+
14           27243148         34770818       gain       14q12-q21+
14           45903485         71980491       gain       14q22-q31+
14           88854803         89570828       gain       CALM1,RPS6KA5+
18           37913083         76104136       loss       18q-
19           57707639         57792000       loss       ZNF701,ZNF137-
Table 4.1: Summary of amplifications and deletions identified in the Prostate Cancer data
set. For each mutation, the chromosome location, the starting and ending positions in
base pairs, the type of mutation and the label (referring to cytogenetic bands) are given.
Figure 4.2: Mutations in Prostate Cancer arrayCGH data set. The gains (top figure) and the losses (bottom figure) are aligned
along the genome. The 24 chromosomes are annotated on the top axis of each plot. The black stripes indicate the locations of either
gain or loss, in each of the 17 arrays.
4.3.3 Oncogenetic trees
We associated genetic patterns to the 17 prostate cancer CGH arrays, corresponding to
the 19 genetic events presented in the previous section (Table 4.1). We then estimated
an oncogenetic tree mixture model with two components based on this set of genetic
patterns (see Figure 4.3).
The weight of the star component in the mixture is 0.76, therefore as many as 76% (13
out of 17) of the genetic patterns are most likely generated by the model that assumes
independence of the events (noise component). We explain this poor mixture model
by the heterogeneity of the genetic patterns and the relatively small number of arrays
compared to the number of events. The LNCaP and CWR22 cell lines are not well represented (only 2 samples of each), therefore their specific dependencies do not have
strong enough support to be captured by the tree models and are treated as noise.
We decided to leave out the LNCaP and CWR22 cell lines and estimate a tree mixture
model based only on the better represented PC3 and DU145 cell lines. The mixture
model again has two components (see Figure 4.4), but a higher proportion of the arrays (8
out of 13) is now explained by the nontrivial tree component. The stronger dependencies between the events within this restricted data set are explained by the increased
similarity between the arrays.
The analysis of the nontrivial component shows that events 8q+ and 14q22-q31+ appear early in the course of the disease. The literature reports gain of 8q as one of the
mutations associated with the onset of malignancy in prostate cancer, and therefore present
in a large fraction of the advanced cases. The relative counts of bootstrap samples (687
out of 1000 for 8q+ and 685 out of 1000 for 14q22-q31+) strongly support both
as initial events.
The topology of the tree indicates two distinct evolutionary pathways. The left subtree characterizes PC3 cell line arrays, while the right subtree better describes DU145
arrays. The events located in the subtree rooted at 1q25-q32- appear predominantly in
PC3 cell lines, while the events in the subtree rooted at 1q21-q23+ are found rather in
DU145 cell lines (Table 4.2). Therefore, from data originating from two different cell
lines, the tree model learns two different progression pathways of prostate cancer.
Bootstrap counts show strong dependencies between events 1q24+ and 8p22-, 1q21-q23+
and 4q31-q35-, and 1q24+ and KRAS+. However, in general, the confidence intervals
are rather large and the relative bootstrap counts small. This poor quality of the tree
mixture model is a consequence of having too many features (genetic events) and not
enough examples (arrays) to learn from. This is not a problem particular to oncogenetic
trees, but one shared by most statistical learning models.
Figure 4.3: Oncogenetic tree mixture model for Prostate Cancer. The top box contains the star-shaped noise component and
the bottom box contains the nontrivial tree component. Each edge in the model is annotated with the conditional probability, the
associated confidence interval and the bootstrap samples count as a measure of stability. The weight of each component in the mixture
is given in the top left corner of the box.
Figure 4.4: Oncogenetic tree mixture model based on arrays from the PC3 and DU145 cell lines. The tree topology separates the
two cell lines into two evolutionary pathways. The subtree rooted at 1q25-q32- characterizes the PC3 cell lines, while the subtree
rooted at 1q21-q23+ contains events predominant in the DU145 cell line.
Table 4.2: Genetic patterns of arrays from PC3 and DU145 cell lines. Each row corresponds to a genetic event and each column
to an array. The genetic events are grouped relative to their position in the oncogenetic tree mixture model: the first two events are
initial events, the second group forms the left subtree and is highly characteristic for PC3 arrays, while the third group belongs to
the right subtree and is representative for the DU145 arrays.
Event labels: 1q25-32-; 1q24+; 4q22-; 10q22+; 8p22-; KRAS+; CALM1,RPS6KA5+; 1q21-23+; 4q31-35-; 5p12+; 14q12-21+; 18q-; ZNF701,ZNF137-; SEMA3C+; 12p11.1+; 8q+; 14q22-31+
Patterns (one array per line, 17 event indicators each):
PC3     0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1
PC3     0 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1
PC3     0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0
PC3     0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0
PC3     1 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1
PC3     0 0 0 0 0 0 0 1 1 1 0 1 0 1 1 1 1
DU145   1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1
DU145   1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1
DU145   1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1
DU145   1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 1
DU145   1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1
DU145   1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1
PC3     0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0
4.4 Results for analysis of glioblastomas data
This section presents the results of our arrayCGH analysis of the Glioblastomas data
set, compares them to previous experiments and gives an interpretation of the estimated
oncogenetic trees.
4.4.1 Genetic mutations in individual arrays
The analysis of the individual arrays shows substantial differences between the Prostate
Cancer data set and the Glioblastomas data set. The histograms of the smoothed
Log2 Ratios indicate that a large proportion of them accumulate very close to zero,
having a high peak around the mean and steep slopes towards the tails. As a consequence, the amplifications and deletions cutoffs are rather close to the mean. However,
the qq-plots show the same normal distribution trend with deviating tails (see Figures
B.1 to B.26 in Appendix B).
Several reasons explain the differences between the distributions of the smoothed
Log2 Ratios from the two data sets. In the case of Prostate Cancer, the cell lines
used for experiments contain only tumor cells, in (approximately) the same stage of
progression. The Glioblastomas experiments use tumor tissue taken directly from
diagnosed patients, which contains cancerous cells in different stages of progression,
potentially mixed with healthy cells. With a certain probability, healthy cells labeled
as cancerous hybridize against the reference tissue, influencing the ratio of fluorescent
intensities. Therefore, the amplifications and deletions measured by arrayCGH might
have a lower magnitude compared to the true levels, as measured in pure tumor cells.
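The attenuation caused by healthy cells in the sample can be made concrete with a simple mixing calculation, sketched here in Python. The model is an assumption and not part of our analysis: it supposes that the measured ratio is the average copy number of the mixed sample divided by the normal copy number of 2, and it ignores all other sources of signal compression.

```python
import math

def expected_log2_ratio(tumor_copies, tumor_fraction, normal_copies=2):
    """Expected Log2 Ratio of a locus in a sample containing healthy cells.

    tumor_copies:   copy number of the locus in the tumor cells
    tumor_fraction: proportion of tumor cells in the sample (0 to 1)
    A complete deletion in a pure sample has no finite log ratio and
    raises ValueError.
    """
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    return math.log2(mixed / normal_copies)

# Single-copy gain (3 copies) measured in a pure and in a half-pure sample
pure = expected_log2_ratio(3, 1.0)
mixed = expected_log2_ratio(3, 0.5)
```

Under this model a single-copy gain produces a clearly smaller Log2 Ratio in the half-pure sample than in the pure one, which is consistent with values concentrating closer to zero in tissue samples than in cell lines.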
Another reason that explains the steep slopes of the histograms is the low resolution
of mutations. In general, the Glioblastomas show narrower and less coherent amplified
or deleted regions, compared to the Prostate Cancer arrays. This may be due to the
impure tissue, as discussed above, or due to biologically different evolution patterns of the
two diseases.
4.4.2 Consensus analysis
A p-value cutoff of 0.01 determines the selection of regions where at least 7 out of the
26 arrays have a mutation of a fixed type (either gain or loss). Figure 4.5 shows the
p-value curves for gains and losses in the Glioblastomas data set.
Our region selection method outputs 34 genetic events, out of which 13 are mapped to
chromosome 22. Figure 4.6 highlights these regions on a consensus plot. The oversegmentation in this case has both advantages and drawbacks. The positive aspect is that
narrow regions serve well the goal of identifying single causative genes. It is not clear,
though, whether learning a prediction model for tumor progression from a set of many
small regions would result in a more powerful model than one learned from few larger
compact segments. In the case of oncogenetic trees, too many events may lead to poor
tree mixtures, as shown in the previous section. Biological considerations should also
be involved in the selection decision. This is a typical feature selection problem and, in
a general setting, one seeks to optimize an objective function. In our case, however, the
goal is to build tumor progression models that can predict tumor stage and survival
time. We do not propose a solution to the region selection problem here; it is subject
to future work.
Figure 4.5: p-values associated with gains and losses in the Glioblastomas data set. Regions
with p-values smaller than 0.01 are called significant. The figures show that at least 7 gains
(or losses) should be observed at a certain position in order to select it.
In order to continue our analysis with the estimation of oncogenetic tree mixtures,
we decided to select two broad regions from the losses of chromosome 22. They are
listed in Table 4.3, together with all other mutations found by our analysis. In what
follows, we comment on these results and we compare them with previously published
studies on Glioblastomas, which are based on CGH experiments (see [17], [15] for summaries of aberrations in Glioblastomas).
Chromosome 1. Frequent gains of 1q are reported in the literature. Our method outputs a recurrent gain of a much more restricted region, at 1q43-q44. Losses at the terminal
position of the short arm 1p were identified by chromosomal CGH and confirmed by
our method, but two particular genes, MAD2L2 and GGPS1, were found to be lost in a
high number of patients. MAD2L2 is involved in cell division and has been associated
with cancer genesis.
Chromosome 5. Our analysis detected a recurrent gain (7 out of 26) of the PRLR (prolactin receptor) gene located at 5p12. This gene is involved in anti-apoptotic processes,
therefore it may play a role in tumor development.
Chromosome 6. Losses of the long arm were detected by chromosomal CGH. Our
method identifies a single gene (KIAA0408) within the 6q23 cytogenetic band, lost in
7 out of 26 patients. The function of this gene is still unknown.
Figure 4.6: Losses on chromosome 22. Some of the regions are separated by very small
intervals.
Chromosome 7. Our analysis confirms the amplification of the whole chromosome 7 reported by CGH experiments.
Chromosome 8. 10 out of 26 patients have amplifications located at the terminal
positions (8q24.3) of the long arm. Much broader regions were reported before.
Chromosome 9. CGH experiments find gains of the long arm 9q and losses of the
short arm 9p. We detect recurrent deletions of a segment of the short arm, 9p21-p24.
Chromosome 10. Partial or full deletions of chromosome 10 in almost all patients
confirm the CGH results and identify this mutation as a good genetic marker of glioma
oncogenesis.
Chromosome 11. We detect an amplification of gene UCP3 in 8 patients. The overexpression of this gene is related to muscle wasting during cancer [5].
Chromosome 13. Deletions of the long arm in a large fraction of patients indicate
the presence of tumor suppressor genes and identify this mutation as one of the most
common in gliomas progression.
Chromosome 16. We locate a recurrent gain of the gene FLJ37464 at 16q22 in 9
patients. No previous references relate this gene to cancer.
Chromosome 18. CGH experiments show sparse gains and losses along chromosome 18. Our method identifies a deletion of the TYMS (thymidylate synthetase) gene
in 10 patients. It is known to be involved in DNA repair processes and therefore to
play a role in tumor genesis.
Chromosome 19. We find amplifications along the entire chromosome in almost all
patients. A particular recurrent loss of the CACNA1A (calcium channel) gene at 19p13.1 is
also identified. This gene is not directly connected to cancer in the literature.
Chromosome 20. Gains of the entire chromosome are typical mutations in CGH
experiments. Our method identifies amplified regions that span almost the whole chromosome.
Chromosome 22. Recurrent losses of the long arm were frequently detected by
CGH experiments. We identify multiple lost regions, one of which consists of only
the neurofibromin 2 gene (NF2), which is a known tumor suppressor gene involved in
meningiomas.
Chromosome   Start Position   End Position   Mutation   Label
1            236978915        245410192      gain       1q43-q44+
1            11668803         11668803       loss       MAD2L2-
1            231817793        231817793      loss       GGPS1-
5            35099984         35099984       gain       PRLR+
6            127803428        127803428      loss       KIAA0408-
7            288195           158320342      gain       7+
8            143950776        146248629      gain       8q24.3+
9            111040           21957751       loss       9p21-p24-
9            27099285         27099285       loss       TEK-
10           170642           135256286      loss       10-
11           73388984         73388984       gain       UCP3+
13           19105879         114098116      loss       13q-
16           65587028         65587028       gain       FLJ37464+
18           586997           647650         loss       TYMS-
19           232045           63778638       gain       19+
19           13179114         13179114       loss       CACNA1A-
20           71251            1491567        gain       20p13+
20           5934878          7811630        gain       20p12.3+
20           9024931          62357564       gain       20p12.2-qter+
21           44256633         44256633       loss       TMEM1-
22           15446202         18085619       loss       22q11.21-
22           28324118         41872032       loss       22q12.2-q13.2-
23           108585154        108585154      loss       NXT2-
Table 4.3: Summary of amplifications and deletions identified in the Glioblastomas data set.
For each mutation, the chromosome location, the starting and ending positions in base pairs,
the type of mutation and the label (referring to cytogenetic bands) are given.
Figure 4.7 shows on two separate consensus plots the amplifications and deletions detected by our method. For a more detailed view, see Figures B.27 to B.44 in Appendix
B.
Figure 4.7: Mutations in Glioblastomas arrayCGH data set. The gains (top figure) and the losses (bottom figure) are aligned along
the genome. The 24 chromosomes are annotated on the top axis of each plot. The black stripes indicate the locations of either gain
or loss, in each of the 26 arrays.
4.4.3 Oncogenetic trees
We estimated an oncogenetic tree mixture with two components based on the 26 genetic patterns and 24 events presented in the previous section. The trees are shown in
Figure 4.8.
The nontrivial component of the mixture has weight 0.12, therefore only 3 patterns are
likely to be generated by the corresponding tree. A mixture of three tree components
does not capture the dependencies between the events any better, as only 7 patterns are
explained by the nontrivial topologies. In fact, cross-validation selects the tree mixture
with only one component (the star-shaped tree) as the optimum model, thus independence of the events is the most likely assumption given the observed patterns.
We believe that we do not obtain meaningful oncogenetic tree models because of the
large number of events compared to the number of observed patterns. Since
aggressive tumor types such as Glioblastomas show a large number of
mutations and high resolution arrayCGH data can capture deletions and amplifications down to single genes, it is necessary to develop a strategy for selecting the key events
that mark the stages of tumor progression for different types of cancer. However, it
may be that not all types of cancer accumulate mutations in a specific order,
which would contradict current beliefs.
Regardless of the quality of the tree mixture obtained, we expect mutations 10-, 7+,
19+ and 13q- to appear in the early stages of evolution of Glioblastomas. Much research
focuses on identifying candidate genes for tumor onset within these regions.
Figure 4.8: Oncogenetic tree mixture model for Glioblastomas. The top box contains the star-shaped noise component and the bottom
box contains the nontrivial tree component. Each edge in the model is annotated with the transition probability, the associated
confidence interval and the bootstrap samples count as a measure of stability. The weight of each component in the mixture is given in the
top left corner of the box.
4.5 Validation
The goal of our arrayCGH data analysis method is the identification of genetic markers
(chromosomal amplifications and deletions) that characterize specific types of tumors.
Our algorithm consists of several steps of data filtering based on statistical decisions.
We believe that the information we extract characterizes well the stage of tumors, while
the dimension reduction is substantial, from approximately 30000 Log2 Ratios (CGH
arrays resolution) to 20 genetic events.
To support this claim, we validated our method based on the following scenario:
we clustered the 17 Prostate Cancer arrays (see Section 4.1) represented in four different feature spaces, expecting arrays from the same cell line to group together. The biological
reasons that support our expectation have been presented in Section 4.1. The feature
spaces are described below.
1. Initial Log2 Ratios. The arrays are represented in the 30000 dimensional space of
the initial Log2 Ratios of intensities.
2. 50 highest variance genes. We select 50 genes that have the largest variance
across all arrays and use them as features.
3. Status arrays. The −1 (loss), 0 (normal level), 1 (gain) information is used to
represent the arrays in a Euclidean space with 30000 dimensions.
4. Genetic patterns. The genetic patterns are represented in the space of the 19
genetic events.
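As an illustration of feature spaces 2 and 3, the following minimal Python sketch selects the highest-variance features and encodes smoothed Log2 Ratios as status values. The function names, the toy data and the fixed cutoffs of -0.3 and 0.3 are hypothetical and serve only the example; in our method the cutoffs are chosen for each array from the fitted robust normals rather than fixed in advance.

```python
from statistics import pvariance


def top_variance_features(arrays, k):
    """Return the indices of the k features with the largest variance across arrays."""
    n_features = len(arrays[0])
    # Variance of each feature (column) computed over all arrays (rows).
    variances = [pvariance([a[j] for a in arrays]) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(order[:k])


def to_status(smoothed, loss_cutoff, gain_cutoff):
    """Encode smoothed Log2 Ratios as -1 (loss), 0 (normal level), 1 (gain)."""
    return [-1 if x < loss_cutoff else 1 if x > gain_cutoff else 0 for x in smoothed]


# Toy data: 3 arrays with 4 measured positions each (hypothetical values).
arrays = [
    [0.0, 0.9, -0.1, 0.05],
    [0.1, -0.8, 0.0, 0.00],
    [0.0, 0.7, 0.1, -0.05],
]

selected = top_variance_features(arrays, 2)
status = to_status(arrays[0], loss_cutoff=-0.3, gain_cutoff=0.3)
print(selected, status)
```

Ranking by population or sample variance gives the same selection, since the two differ only by a constant factor.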
In each of the settings above, we used an agglomerative hierarchical clustering method
with Euclidean distance between the data points and average-linkage distance between
clusters [13]. The method starts with each data point assigned to a separate cluster
and then iteratively joins pairs of closest clusters, until one cluster is obtained. The
result of hierarchical clustering can be visualized by means of dendrograms, see Figure
4.9 for the clustering results in each of the four settings.
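The clustering procedure described above can be sketched as follows. This is a naive illustration of agglomerative clustering with Euclidean distance and average linkage, not the implementation used to produce Figure 4.9; the function name `average_linkage` and the toy binary patterns are assumptions made for the example. Instead of building the full dendrogram, the sketch stops merging once a requested number of clusters is reached, which corresponds to cutting the dendrogram at that level.

```python
from itertools import combinations
from math import dist


def average_linkage(points, n_clusters):
    """Agglomerative clustering with Euclidean distance and average linkage.

    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    # Start with every data point in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average of all pairwise Euclidean distances between two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters)


# Toy genetic patterns: 5 arrays described by 4 binary events (hypothetical).
patterns = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]

print(average_linkage(patterns, 3))
print(average_linkage(patterns, 2))
```

The sketch recomputes all linkage distances in every iteration, which is acceptable for a handful of arrays but would be too slow for large collections.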
It becomes clear from the clustering dendrograms that the genetic events selected by
our method best separate the four cell lines. The initial Log2 Ratios do not discriminate
between the PC3 and DU145 cell lines and only a slight improvement is achieved when
selecting the 50 highest variance genes. However, when clustering the status arrays, the
CWR and the DU145 cell lines form separate groups. When using the genetic events
as features, the cell lines separate very well, in the sense that a partition into 4 clusters
groups the CWR and LNCaP arrays together in one cluster and all DU145 arrays in a
second cluster, while the PC3 arrays are split into 2 groups. A partition into 5 clusters
further separates the CWR and LNCaP cell lines.
We also computed the correlations between the mutual distances of cell lines (represented in each of the four feature spaces described) and ideal distances, defined
as zero between the same cell lines and one between different cell lines. The correlation
coefficient increases from 0.27 for the Log2 Ratios representation to 0.75 for the
genetic patterns representation. The correlation coefficients are shown
in Figure 4.10.
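The comparison with ideal distances can be sketched as below, assuming that the Pearson correlation coefficient is meant and that distances are taken over all pairs of arrays. The helper names and the toy coordinates are hypothetical; the cell line labels are used only to build the ideal 0/1 distances.

```python
from itertools import combinations
from math import dist, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def distance_label_correlation(points, labels):
    """Correlate pairwise Euclidean distances with ideal 0/1 label distances."""
    pairs = list(combinations(range(len(points)), 2))
    observed = [dist(points[i], points[j]) for i, j in pairs]
    # Ideal distance: 0 for arrays of the same cell line, 1 otherwise.
    ideal = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    return pearson(observed, ideal)


# Toy example: two well separated groups of two arrays each (hypothetical).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = ["PC3", "PC3", "DU145", "DU145"]
r_good = distance_label_correlation(points, labels)

# The same points with labels that ignore the group structure.
mixed_labels = ["PC3", "DU145", "PC3", "DU145"]
r_bad = distance_label_correlation(points, mixed_labels)
print(r_good, r_bad)
```

A feature space in which arrays of the same cell line lie close together and arrays of different cell lines lie far apart yields a coefficient close to one.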
These results support the claim that the information we extract from the arrayCGH data
captures the genetic differences between the cell lines.
Figure 4.9: Dendrograms of the hierarchical clustering of the Prostate Cancer arrays in
different feature spaces: a) the initial Log2 Ratios; b) the 50 highest variance genes; c) the
status arrays; d) the genetic patterns. The labels indicate cell lines.
Figure 4.10: Correlation coefficients between the mutual distances of cell lines (represented
in 4 feature spaces) and true distances.
Chapter 5
Conclusions and future work
We presented an automated method for statistical analysis of arrayCGH data which
detects genetic amplifications and deletions in different types of cancer. Our contribution to the method consists of an algorithm for determining appropriate gain and
loss cutoffs in individual CGH arrays and a method for selecting mutated regions
when a large collection of experiments is available. The genetic events computed by
our method can be further used for learning oncogenetic evolutionary models, which
can help to predict tumor stages and survival time.
We applied our algorithm to Prostate Cancer and Glioblastoma data sets. As the
results show, the high resolution of arrayCGH data allows a more precise characterization of cancer mutations. The regions obtained often reduce to single genes, which
can be further investigated as candidates for cancer onset and progression. The evolutionary tree models estimated for the Prostate Cancer data show two different progression
patterns, characterizing two different cell lines. Therefore, the events we select capture
the characteristics of different disease subtypes. This conclusion is also confirmed by
the clustering results of the four cell lines.
We can further improve our algorithm by adding a step of automated parameter
optimization, to adapt to the particularities of each data set. This is not a difficult
task and established methods such as cross-validation are expected to work well. Region
selection, however, raises the more complicated problem of reducing the genetic
mutations to a small set of highly representative key events. The optimization of mutation sets needs to consider statistical as well as biological criteria.
For further assessment, we need to test our method on larger arrayCGH data sets,
which are already publicly available. We can also test the prediction accuracy of the
oncogenetic tree mixture models if arrayCGH experiments contain additional clinical
information, such as tumor stage as predicted by traditional methods or survival time
of the patients.
Appendix A
Prostate Cancer
Individual analysis
Figures A.1 to A.17 show histograms, qq-plots of the smoothed Log2 Ratios and the
first chromosome of each array from the Prostate Cancer arrayCGH data set, as output
by our method. Each figure is labeled with the ID of the arrayCGH experiment.
Figure A.1: HV3 7 18CGH
Figure A.2: HV3 7 13CGH
Figure A.3: HV3 7 06CGH
Figure A.4: HV3 20 68CGH
Figure A.5: HV3 20 66CGH
Figure A.6: HV3 20 63CGH
Figure A.7: HV3 20 60CGH
Figure A.8: HV3 20 67CGH
Figure A.9: HV3 20 53CGH
Figure A.10: HV3 7 25CGH
Figure A.11: HV3 7 19CGH
Figure A.12: HV3 7 09CGH
Figure A.13: HV3 20 64CGH
Figure A.14: HV3 20 59CGH
Figure A.15: HV3 20 58CGH
Figure A.16: HV3 20 70CGH
Figure A.17: HV3 20 61CGH
Consensus Plots
Figures A.18 to A.30 are consensus plots of the chromosomes which contain the aberrations detected, showing individual mutations of all arrays as well as highlighting
the regions selected. The x-axis of each plot shows the position on the corresponding
chromosome and is annotated with the cytogenetic bands represented by the light blue
vertical lines. The y-axis aligns the 17 arrays. The black stripes indicate the regions
of either gain or loss, for each array individually. The regions with a low p-value are
highlighted by green, vertical stripes.
Figure A.18: Gains on chromosome 1. 8 arrays indicate gains of 1q21-q23 and 7 show gains
of the 1q24 region.
Figure A.19: Losses on chromosome 1. There is strong evidence of a loss of 1q25-q32.
Figure A.20: Losses on chromosome 2. Narrow regions at 2p16-p21 and 2q24 are lost.
Figure A.21: Losses on chromosome 4. There are recurrent losses on the long arm 4q, but
regions 4q22 and 4q31-q35 appear to have a higher frequency.
Figure A.22: Gains on chromosome 5. There are repeated gains at 5p12.
Figure A.23: Gains on chromosome 7. Two arrays show an amplification of the whole
chromosome and one only of the long arm. A very narrow region within the 7q21 band which
contains gene SEMA3C is lost in 5 arrays.
Figure A.24: Gains on chromosome 8. Gains along the long arm are identified in 6 arrays.
Figure A.25: Losses on chromosome 8. There are deletions of the 8p22 band in 4 arrays.
Figure A.26: Gains on chromosome 10. Amplifications at 10q22 are frequent.
Figure A.27: Gains on chromosome 12. 2 arrays have amplifications of the entire chromosome, while a narrow region at 12p11.1 appears to be gained in 6 arrays.
Figure A.28: Gains on chromosome 14. Multiple gains are located along the 14q arm of the
chromosome. 3 regions are indicated as significant: 14q12-q21, 14q22-q31 and a short
region at 14q32.
Figure A.29: Losses on chromosome 18. A large part of the long arm is deleted.
Figure A.30: Losses on chromosome 19. A region which contains several zinc-finger protein
encoding genes is lost in 5 arrays.
Appendix B
Glioblastomas
Individual analysis
Figures B.1 to B.26 show histograms and qq-plots of the smoothed Log2 Ratios of each
array from the Glioblastomas arrayCGH data set, as output by our method. Each
figure is labeled with the ID of the arrayCGH experiment.
Figure B.1: bredel51551
Figure B.2: bredel51550
Figure B.3: bredel51552
Figure B.4: bredel51557
Figure B.5: bredel51559
Figure B.6: bredel51516
Figure B.7: bredel51544
Figure B.8: bredel51554
Figure B.9: bredel51565
Figure B.10: bredel51563
Figure B.11: bredel51558
Figure B.12: bredel51564
Figure B.13: bredel51518
Figure B.14: bredel51531
Figure B.15: bredel51566
Figure B.16: bredel51546
Figure B.17: bredel51530
Figure B.18: bredel51556
Figure B.19: bredel51511
Figure B.20: bredel51529
Figure B.21: bredel51528
Figure B.22: bredel51515
Figure B.23: bredel51549
Figure B.24: bredel51548
Figure B.25: bredel51536
Figure B.26: bredel51540
Consensus plots
Figures B.27 to B.44 are consensus plots of the chromosomes which contain the aberrations detected, showing individual mutations of all arrays as well as highlighting the
regions selected. The x-axis of each plot shows the position on the corresponding chromosome, and is annotated with the cytogenetic bands represented by the light blue
vertical lines. The y-axis aligns the 26 arrays. The light grey points indicate gene
locations, and also correspond to spots on the array CGH chips. The black points
emphasize the locations of either gain or loss, for each array individually. The regions
with a low p-value are highlighted by green, vertical stripes.
Figure B.27: Chromosome 1 gains
Figure B.28: Chromosome 1 losses
Figure B.29: Chromosome 5 gains
Figure B.30: Chromosome 6 losses
Figure B.31: Chromosome 7 gains
Figure B.32: Chromosome 8 gains
Figure B.33: Chromosome 9 losses
Figure B.34: Chromosome 10 losses
Figure B.35: Chromosome 11 gains
Figure B.36: Chromosome 13 losses
Figure B.37: Chromosome 16 gains
Figure B.38: Chromosome 18 losses
Figure B.39: Chromosome 19 gains
Figure B.40: Chromosome 19 losses
Figure B.41: Chromosome 20 gains
Figure B.42: Chromosome 21 losses
Figure B.43: Chromosome 22 losses
Figure B.44: Chromosome 23 losses
Bibliography
[1] Beerenwinkel,N., Rahnenführer,J., Däumer,M., Hoffmann,D., Kaiser,R., Selbig,J., Lengauer,T.
(2005) Learning multiple evolutionary pathways from cross-sectional data. Journal of Computational Biology, 12, 584-598
[2] Beerenwinkel,N., Rahnenführer,J., Kaiser,R., Hoffmann,D., Selbig,J., Lengauer,T. (2005)
Mtreemix: a software package for learning and using mixture models of mutagenetic trees. Bioinformatics, 21(9), 2106-2107
[3] Bilke,S., Chen,Q.-R., Whiteford,C.C., Khan,J. (2005) Detection of low level genomic alterations
by comparative genomic hybridization based on cDNA micro-arrays. Bioinformatics, 21(7), 1138-1145
[4] Bredel,M., Bredel,C., Juric,D., Harsh,G.R., Vogel,H., Recht,L.D., Sikic,B.I. (2005) High-Resolution Genome-Wide Mapping of Genetic Alterations in Human Glial Brain Tumors. Cancer
Research, 65(10), 4088-4096
[5] Busquets,S., Garcia-Martinez,C., Olivan,M., Barreiro,E., Lopez-Soriano,F.J., Argiles,J.M. (2005)
Overexpression of UCP3 in both murine and human myotubes is linked with the activation of
proteolytic systems: a role in muscle wasting? Biochimica et Biophysica, 1760, 253-258
[6] Dempster,A., Laird,N.M., Rubin,D.B. (1977) Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38
[7] Desper,A., Jiang,F., Kallioniemi,O.P. (1999) Inferring tree models for oncogenesis from comparative genome hybridization data. Journal of Computational Biology, 6(1), 37-51
[8] Dudoit,S., Yang,Y.H., Callow,M.J., Speed,T.P. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-139
[9] Eilers,P.H.C., de Menezes,R.X. (2005) Quantile smoothing of array CGH data. Bioinformatics,
21, 1146-1153
[10] Eilers,P.H.C., de Menezes,R.X. (2005) Quantile smoothing of array CGH data. Bioinformatics,
21, 1146-1153
[11] Fridlyand,J., Snijders,A.M., Pinkel,D., Albertson,D.J., Jain,A. N. (2004) Hidden Markov models
approach to the analysis of array CGH data. Journal of multivariate analysis, 90 (1), 132-153
[12] Galani,E., Sgouros,J., Petropoulou,C., Janinis,J., Aravantinos,G., Dionysiou-Asteriou,D., Skarlos,D., Gonos,E. (2002) Correlation of MDR-1, nm23-H1 and H Sema E gene expression with
histopathological findings and clinical outcome in ovarian and breast cancer patients. Anticancer
Research, 22, 2275-2280
[13] Hastie,T., Tibshirani,R., Friedman,J. (2001) The Elements of Statistical Learning. Springer-Verlag, 193-222
[14] Hsu,L., Self,S.G., Grove,D., Randolph,T., Wang,K., Delrow,J.J., Loo,L., Porter,P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics 6, 211-226
[15] Huhn,S.L., Mohapatra,G., Bollen,A., Lamborn,K., Prados,M.D., Feuerstein,B.G. (1999) Chromosomal Abnormalities in Glioblastoma Multiforme by Comparative Genomic Hybridization:
Correlation with Radiation Treatment Outcome. Clinical Cancer Research, 5, 1435-1443
[16] Hupé,P., Stransky,N., Thiery,J.P., Radvanyi,F., Barillot,E. (2004) Analysis of array CGH data:
from signal ratio to gain and loss of DNA regions. Bioinformatics, 20, 3413-3422
[17] Inda,M.M., Fan,X., Munoz,J., Perot,C., Fauvet,D., Danglot,G., Palacio,A., Madero,P., Zazpe,I.,
Portillo,E., Tunon,T., Martinez-Penuela,H.M., Alfaro,J., Eiras,J., Bernheim,A., Castresana,J.S.
(2003) Chromosomal Abnormalities in Human Glioblastomas: Gain in Chromosome 7p Correlating With Loss in Chromosome 10q. Molecular Carcinogenesis, 36, 6-14
[18] Jong,K., Marchiori,E., van der Vaart,A., Ylstra,B., Meijer,G., Weiss,M. (2003) Chromosomal
breakpoint detection in human cancer. Lecture notes in Computer Science, Springer-Verlag,
Berlin, vol. 2611, 54-65
[19] Kallioniemi,A., Kallioniemi,O.P., Sudar,D., Rutovitz,D., Gray,J.W., Waldman,F., Pinkel,D.
(1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors.
Science, 258, 818-821
[20] Lodish,H., Berk,A., Zipursky,L.S., Matsudaira,P., Baltimore,D., Darnell,J. (October 1999) Molecular Cell Biology. W. H. FREEMAN
[21] Lai,W.R., Johnson,M.D., Kucherlapati,R., Park,P.J. (2005) Comparative analysis of algorithms
for identifying amplifications and deletions in array CGH data. Bioinformatics, 21, 3763-3770
[22] Lensch,R., Götz,C., Andres,C., Bex,A., Lehmann,J., Zwergel,T., Unteregger,G., Kamradt,J.,
Stoeckle,M., Wullich,B. (2002) Comprehensive genotypic analysis of human prostate cancer cell
lines and sublines derived from metastases after orthotopic implantation in nude mice. International Journal of Oncology, 21, 695-706
[23] Ludwig,J.A., Weinstein,J.N. (2005) Biomarkers in cancer staging, prognosis and treatment selection. Nature Reviews Cancer, 5, 845-856
[24] Molinaro,A.M., van der Laan,M.J., Moore,D.H. (2002) Comparative genomic hybridization array
analysis. U.C.Berkeley Division of Biostatistics Working Paper Series
[25] Nakao,K., Mehta,K.R., Fridlyand,J., Moore,D.H., Jain,A.N., Lafuente,A., Wiencke,J.W., Terdiman,J.P., Waldman,F.M. (2004) High-resolution analysis of DNA copy number alterations in
colorectal cancer by array-based comparative genomic hybridization. Carcinogenesis, 25, 1345-1357
[26] Olshen,A.B., Venkatraman,E.S., Lucito,R., Wigler,M. (2004) Circular binary segmentation for
the analysis of array-based DNA copy number data. Biostatistics, 5, 557-572
[27] Pinkel,D., Segraves,R., Sudar,D., Clark,S., Poole,I., Kowbel,D., Collins,C., Kuo,W.-L., Chen,C.,
Zhai,Y., Dairkee,S.H., Ljung,B.-M., Gray,J.W., Albertson,D.G. (1998) High resolution analysis of
DNA copy number variation using comparative genomic hybridization microarrays. Nat. Genet.,
20, 207-211
[28] Polzehl,J. and Spokoiny,V. (2002) Local likelihood smoothing by adaptive weights smoothing.
WIAS-Preprint 787
[29] Pollack,J.R., Perou,C.M., Alizadeh,A.A., Eisen,M.B., Pergamenschikov,A., Williams,C.F., Jeffrey,S.S., Botstein,D., Brown,P.O. (1999) Genome-wide analysis of DNA copy-number changes
using cDNA microarrays. Nat. Genet., 23(1), 41-46
[30] Rahnenführer,J., Beerenwinkel,N., Schultz,W.A., Hartmann,C., von Deimling,A., Wullich,B.,
Lengauer,T. (2005) Estimating cancer survival and clinical outcome based on genetic tumor
progression scores. Bioinformatics, 21, 2438-2446
[31] Rieger,J., Wick,W., Weller,M. (2003) Human malignant glioma cells express semaphorins and
their receptors, neuropilins and plexins. Glia, 42(4), 379-389
[32] Solinas-Toldo,S., Lampel,S., Stilgenbauer,S., Nickolenko,J., Benner,A., Döhner,H., Cremer,T.,
Lichter,P. (1997) Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer, 20, 399-407
[33] Tinker,N.A., Robert,L.S., Butler,G., Harris,L.J. (2003) Data Pre-Processing Issues in Microarray
Analysis. A practical approach to microarray data analysis, 47-65
[34] Tommerup,N., Vissing,H. (1995) Isolation and Fine Mapping of 16 Novel Human Zinc Finger-Encoding cDNAs Identify Putative Candidate Genes for Developmental and Malignant Disorders.
Genomics, 27(2), 259-264