Bag of Words approaches for Bioinformatics

Pietro Lovato
Ph.D. Thesis
XXVII cycle (January 2012 - December 2014)
Università degli Studi di Verona
Dipartimento di Informatica
Advisor:
Dr. Manuele Bicego
Series N°: TD-03-15
Università di Verona
Dipartimento di Informatica
Strada le Grazie 15, 37134 Verona
Italy
Abstract
In recent years, several Pattern Recognition problems have been successfully faced
by approaches based on the “bag of words” representation. This representation
is particularly appropriate when the pattern is characterized (or assumed to be
characterized) by the repetition of basic, “constituting” elements called words. By
assuming that all possible words are stored in a dictionary, the bag of words vector
for one particular object is obtained by counting the number of times each element
of the dictionary occurs in the object.
Even if largely applied in several scientific fields (with increasingly sophisticated approaches), techniques based on this representation have not been fully exploited in Bioinformatics, due to the methodological and applicative challenges posed by this peculiar scenario. However, in this context the bag of words paradigm seems to be particularly well suited: on one hand, many biological mechanisms inherently subsume a counting process; on the other hand, in many Bioinformatics scenarios the objects of the problem are either unstructured or of unknown structure, so that one of the main drawbacks of the bag of words representation (it destroys the object's structure) no longer holds. This makes it possible to derive highly effective and interpretable solutions, a stringent need in today's Bioinformatics research.
This thesis is set in the scenario described above, and promotes the use of the bag of words paradigm to face problems in Bioinformatics. We investigated the different issues and aspects related to the creation of bag of words models and representations for some specific Bioinformatics problems, and we propose original solutions and approaches based on this representation. In particular, in this thesis three scenarios have been analyzed: gene expression analysis, the modeling of HIV infection, and protein remote homology detection. For each scenario, motivations, advantages, and challenges of the bag of words representations are addressed, and possible solutions are proposed. The merits of bag of words representations and models have been demonstrated in extensive experimental evaluations, exploiting widely used benchmarks as well as datasets derived from direct interactions with biological and clinical laboratories and research groups. With this thesis, we provide evidence that the bag of words representation can have a significant impact on the Bioinformatics and Computational Biology communities.
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
  1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
  1.2 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . 7
  1.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 The Bag of Words paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . 9
  2.1 What to count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
    2.1.1 Easy-to-define words . . . . . . . . . . . . . . . . . . . . . . . . 11
    2.1.2 Difficult-to-define words . . . . . . . . . . . . . . . . . . . . . 12
  2.2 How to count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
  2.3 How to model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
    2.3.1 The bag of words as a multinomial . . . . . . . . . . . . . . . . . 16
    2.3.2 Probabilistic models . . . . . . . . . . . . . . . . . . . . . . . . 18
    2.3.3 Bayesian networks . . . . . . . . . . . . . . . . . . . . . . . . . 19
    2.3.4 Inference and learning in bayesian networks . . . . . . . . . . . . 21

Part I Gene expression analysis

3 The gene expression analysis problem . . . . . . . . . . . . . . . . . . . . 27
  3.1 Background: gene expression . . . . . . . . . . . . . . . . . . . . . . . 27
    3.1.1 DNA Microarray . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
  3.2 Computational analysis of a gene expression matrix . . . . . . . . . . . 30
  3.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Gene expression classification using topic models . . . . . . . . . . . . . 35
  4.1 Topic models and gene expression . . . . . . . . . . . . . . . . . . . . 35
    4.1.1 Probabilistic Latent Semantic Analysis (PLSA) . . . . . . . . . . . 36
  4.2 The proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  4.3 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 41
    4.3.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
  4.4 The interpretability of the feature vector . . . . . . . . . . . . . . . 45

5 The Counting Grid model for gene expression data analysis . . . . . . . . . 49
  5.1 The Counting Grid model . . . . . . . . . . . . . . . . . . . . . . . . . 51
  5.2 Class embedding and biomarker identification . . . . . . . . . . . . . . 52
  5.3 Example: mining yeast expression . . . . . . . . . . . . . . . . . . . . 55
  5.4 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 56
    5.4.1 Embedding and clustering performances . . . . . . . . . . . . . . . 58
    5.4.2 Qualitative evaluation of gene selection . . . . . . . . . . . . . . 60
    5.4.3 Quantitative evaluation of gene selection . . . . . . . . . . . . . 62
    5.4.4 Classification results . . . . . . . . . . . . . . . . . . . . . . . 63

Part II HIV modeling

6 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
  6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
  6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

7 Regression of HIV viral load using bag of words . . . . . . . . . . . . . . 77
  7.1 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
  7.2 The proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . 79
    7.2.1 What to count . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
    7.2.2 How to count . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
    7.2.3 How to model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
    7.2.4 Information extracted: regression of viral load value . . . . . . . 81
  7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
    7.3.1 Experiment 1: modeling antigen presentation with the counting grid . 83
    7.3.2 Experiment 2: comparison with the state of the art . . . . . . . . . 84

8 Bag of words analysis for T-Cell Receptors . . . . . . . . . . . . . . . . . 87
  8.1 The proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . 88
    8.1.1 Diversity measures . . . . . . . . . . . . . . . . . . . . . . . . . 89
    8.1.2 Reliability of the bag of words . . . . . . . . . . . . . . . . . . 90
  8.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 91
    8.2.1 Dataset statistics . . . . . . . . . . . . . . . . . . . . . . . . . 91
    8.2.2 Shannon index analysis . . . . . . . . . . . . . . . . . . . . . . . 92
    8.2.3 Nuanced patterns in the bag of words . . . . . . . . . . . . . . . . 95
    8.2.4 Rarefaction curves . . . . . . . . . . . . . . . . . . . . . . . . . 95
    8.2.5 Total number of species estimation . . . . . . . . . . . . . . . . . 96
    8.2.6 Reliability of the bag of words . . . . . . . . . . . . . . . . . . 97

Part III Protein remote homology detection

9 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
  9.1 Background: protein functions and homology . . . . . . . . . . . . . . . 108
  9.2 Computational protein remote homology detection . . . . . . . . . . . . . 110
  9.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

10 Soft Ngram representation and modeling for protein remote homology detection 115
  10.1 Profile-based Ngram representation . . . . . . . . . . . . . . . . . . . 115
  10.2 The proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . 116
    10.2.1 Modeling: soft bag of words . . . . . . . . . . . . . . . . . . . . 118
    10.2.2 Soft PLSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
    10.2.3 SVM classification . . . . . . . . . . . . . . . . . . . . . . . . . 120
  10.3 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 121
    10.3.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . 121
    10.3.2 Detection results and discussion . . . . . . . . . . . . . . . . . . 122

11 A multimodal approach for protein remote homology detection . . . . . . . . 127
  11.1 Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . 128
    11.1.1 Componential Counting Grid . . . . . . . . . . . . . . . . . . . . . 128
  11.2 The proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . 130
    11.2.1 Data representation . . . . . . . . . . . . . . . . . . . . . . . . 130
    11.2.2 Multimodal learning . . . . . . . . . . . . . . . . . . . . . . . . 131
    11.2.3 Classification scheme . . . . . . . . . . . . . . . . . . . . . . . 132
  11.3 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 133
    11.3.1 First analysis: families 3.42.1.1 and 3.42.1.5 . . . . . . . . . . . 133
    11.3.2 Second analysis: all families . . . . . . . . . . . . . . . . . . . 134
  11.4 Multimodal analysis of bitter taste receptor TAS2R38 . . . . . . . . . . 136

12 Conclusions and future works . . . . . . . . . . . . . . . . . . . . . . . . 141

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Sommario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
1
Introduction
Humans have developed highly sophisticated skills to sense the environment and to
take actions according to what they observe. Daily routine deals with recognizing
faces, understanding spoken words, reading handwritten characters, and so on.
However, complex processes underlie these acts of pattern recognition. Following a
classical definition by Duda [66]: “pattern recognition is the act of taking in raw data and taking an action based on the ‘category’ of the pattern”. The notion
of pattern is extremely general, and is defined by Watanabe [237] as “an entity,
vaguely defined, that could be given a name”. A fingerprint image, a human face,
a DNA sequence, or a text document are just examples of patterns; they are the
objects, or instances, of the problem under consideration.
Over the years, researchers have asked whether it is possible to give similar capabilities to machines: from automated speech recognition, fingerprint identification, optical character recognition, DNA sequence manipulation, and much more, it is clear that reliable, accurate pattern recognition by machines would be immensely useful. Automatic pattern recognition entails that these real-world objects are acquired and abstracted, through sensors and measurements, into a digital representation that a machine is able to understand. From an initial set of raw data one can derive features, i.e. characteristics intrinsic to the object itself, intended to be informative and non-redundant, to facilitate the decision-making model, and possibly to lead to better human interpretation. In the end, the individual objects are described by the set of values assigned to the different features,
encoded in a mathematical entity such as a vector, a string, a tree, a graph, or
others. Given the representation of objects, the pattern recognition strategy typically exploits the so-called “learning from examples” paradigm, in which a large
set of N objects or instances of the problem – called a training set – is acquired
and represented, and used to learn the parameters of a model or a classifier. Once
the model is trained, it can determine the category of new objects, which are said
to comprise a testing set. The ability to correctly categorize new examples that
differ from those used for training is known as generalization capability.
In a simple example, depicted in figure 1.1, the objects to be represented and
modeled are text documents, where the goal can be for example to categorize them
into literary genres. Suppose that, after the acquisition phase, the raw data for
a document consists of a sequence of ASCII characters. This already represents
a possible representation for documents, where features are individual characters
listed in a structure whose length varies according to the number of characters
in the documents. More than this, numerical features can be extracted from the
raw data: for example, one can compute the document length and the average
length of the words comprising it, representing the document with two numerical
values stored in a fixed-length vector. In other words, a document is represented
as a point in a vector space (2-dimensional in this case), also called feature space.
Fig. 1.1. In the example, a document is acquired and digitally encoded as a list of ASCII characters (top part of the figure). Another representation for the document, shown in the bottom part, consists of a numerical vector obtained by measuring the document length and the average length of its words.

Even though some information may be lost during this process of projecting objects into a feature space, such an approach is perhaps the most widely employed in pattern recognition. Its strength is that it can leverage a wide spectrum of mathematical tools ranging from statistics, to geometry, to optimization. Of course the choice and the combination of features is crucial, and many different ways to characterize an object exist and have been proposed in the past. In particular, an effective one called the “bag of words” [203] has asserted itself and assumed great importance in recent years. The bag of words is a representation particularly appropriate when the pattern is characterized (or assumed to be characterized) by the repetition of basic constituting elements called words¹. If we assume that all possible words
are stored in a dictionary, the bag of words vector for one particular object is
obtained by counting the number of times each element of the dictionary occurs in
the object. Looking back at the document categorization example, if we are given
a dictionary that lists all possible words, the document can be characterized with
a vector where each element counts the number of occurrences – in the document –
of each given word in the dictionary. A scheme of the bag of words representation
for documents is shown in figure 1.2.
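As a minimal illustration of this counting scheme (a hypothetical toy document and dictionary, not code from the thesis), the vector mirrors the one of Fig. 1.2:

```python
from collections import Counter

def bag_of_words(document: str, dictionary: list) -> list:
    """Count how many times each dictionary word occurs in the document."""
    tokens = document.lower().split()       # naive whitespace tokenization
    counts = Counter(tokens)                # word -> number of occurrences
    return [counts[w] for w in dictionary]  # fixed-length vector of counts

dictionary = ["aardvark", "about", "all", "apple", "gas", "zebra"]
document = "all about gas all about about"
print(bag_of_words(document, dictionary))   # [0, 3, 2, 0, 1, 0]
```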
The bag of words has been very successful in the literature: a non-exhaustive list of fields where it has been effectively applied includes Natural Language Processing, where it was originally introduced [70, 200, 203]; Signal Processing,
where it has been employed to model audio signals [165, 215], biochemical signals
(such as NMR spectra) [37], or medical signals (such as EEG) [150]; robotics,
where it is mainly employed for fast robot localization [75]. Also, in the fields of
Image Processing and Computer Vision, it has been proposed to characterize textures [55,230], contours [155], images [54,57,141], 3D shapes [127], and videos [212].
¹ The terminology stemmed from the Natural Language Processing community [203], where it is assumed that the constituting elements of a document are words.
Fig. 1.2. The bag of words representation. Given a document to represent, and a dictionary containing all possible words, the bag of words vector is obtained by counting how many times each word of the dictionary appears in the document.
One of its main advantages is that it can represent in a vector space many types of objects, even ones that are non-vectorial in nature (like documents in the example of Fig. 1.2), for which fewer computational tools are available. The obtained results motivated researchers to delve deeper into this paradigm, by proposing methods to refine the information contained in a bag of words, to better interpret it, and to boost performances in classification tasks. For example, by taking into explicit consideration that the values observed in a bag of words arise from a counting process, probabilistic models have been successfully developed and exploited; the connection is that counts are naturally explained through the multinomial distribution. Many probabilistic models have been proposed in the literature for
bag of words: among others, there is a class of models whose importance has drastically grown in recent years, called topic models [29–31, 94, 104, 148, 176]. Topic
models were originally introduced in the Natural Language Processing community [31, 94], with the main goal of describing documents (represented as bags of
words), abstracting the topics the various documents talk about. Their
wide usage is motivated by their expressiveness and efficiency, by the interpretability of the solution provided [44], and by state of the art results achieved in several
applications [12, 33, 53, 129, 185, 188, 215, 232, 243].
Facing a task with a bag of words approach requires some issues to be addressed,
which can be summarized as follows:
• What to count? The definition and the extraction of the “building blocks”, i.e. the entities to be counted, is crucial for any bag of words method. Back to the document example, depending on the task it could be more appropriate to define syllables, rather than words, as building blocks. In Computer Vision, the dictionary elements of a bag of words are usually saliency points in images: the definition of saliency is not trivial, and its extraction usually goes through a complex processing scheme, whose pros and cons must be addressed carefully. In particular, since the dictionary must be “discrete”, it is sometimes required to quantize the continuous signals. In practice, the definition of the words requires the experimentation of multiple possibilities, and the combination of automated techniques with the intuition and knowledge of the domain expert.
• How to count? Counting implies measuring the level of presence of an element (the “word”) in an object. The counting process can be easy and straightforward, or can be more difficult. Counting words in a document seems like a natural and easy task to perform; however, quantifying the number of molecules in a cell may be challenging due to technological problems. Moreover, counts derive from a measurement process: thus, they may be affected by noise or prone to systematic errors, derived from a technological problem (as in the above example), or from an inappropriate processing of the raw data. This in turn means that counts are affected by uncertainty, which has rarely been taken into account in bag of words approaches. Finally, it should be noted that a proper count should in principle be a discrete number, but – generally speaking – it can also be any real number that reflects a level of presence / importance / power.
• How to model? The bag of words representation can be modeled further, and depending on the task at hand, different models can be employed. As is often the case, a more complex model may offer a richer description or higher performances, at the cost of increased complexity. The literature on the subject is huge, including fast and straightforward solutions, as well as complicated models which take into consideration many facets and aspects.
It is very important to notice that one of the main drawbacks of the bag of words
representation is that – in many domains and applications – it destroys the possible
structure of objects. The term “bag” is used because (in the text domain where
it has been originally introduced) the ordering of the words in the document is
lost: the sentences “the man killed the wolf” and “the wolf killed the man” will
result in the same bag of words vector despite the huge difference in semantics.
Nevertheless, it may still be very convenient to employ the bag of words, as it
readily extracts a numerical vector which may facilitate the subsequent steps and
achieve high accuracy. This representation is particularly well suited to those scenarios where the object is either unstructured or of unknown structure: in this case there is no information loss, and the bag of words can express its full potential.
In this general picture, the rapidly developing field of bioinformatics is increasingly providing scenarios where the bag of words representation seems to be a very
promising possibility, for a twofold reason: on one hand, the bag of words seems
a natural choice since many problems are intrinsically formulated as counting; on
the other hand, in different contexts the structure is truly absent or unknown, excising one of the main drawbacks of the bag of words representation. For example,
the cell regulates its function by adjusting the amount (i.e. the level of presence)
of particular molecules – proteins – it manufactures. This regulation phenomenon,
called gene expression, is the process by which the information encoded in a gene is
used to direct the assembly of proteins. The most important aspect is the amount
(count) of these molecules, which primarily determines the function of the cell.
Moreover, there is no obvious ordering of the genes: the biological machine inherently works ignoring the spatial position of the molecules, and gene expression is
carried out by simply looking at types and concentration of substances. For these
reasons, the bag of words seems to be a very suitable choice of representation for
the gene expression domain. However, there are methodological and applicative
challenges – derived from the peculiar scenario – that have to be addressed to
completely exploit the bag of words approach. For example, it is not straightforward to map gene expression into a count value: the most widely used measuring technology, called DNA Microarray [47], essentially measures fluorescence emitted by the gene products². In another example, the immune system gathers evidence
of the execution of various molecular processes, both foreign and the cells’ own,
because particular receptors (called TCRs) observe sets of epitopes, small segments
of the proteins involved in these processes. Epitopes do not have any obvious ordering in this scheme: the immune system, through TCRs, sees these epitope sets
as disordered “bags”, based on whose counts the action needs to be taken. In this
context, the bag of words would provide a set of tools for capturing correlations
in the immune target abundances during cellular immune surveillance, and could
be immensely useful for detecting patients or populations that are likely to react
similarly to an infection, or for rational vaccine design.
From the modeling point of view, it seems that in the examples portrayed
probabilistic models and topic models (based on the bag of words) represent particularly suited choices. Aside from the state of the art performances obtained in
several other scenarios, probability theory provides a consistent framework for the
quantification and manipulation of noise and uncertainty, being able to explain the
process of data generation, to increase the predictive accuracy, and to provide a
more interpretable description. Particularly in biology and medicine, interpretability is a key requirement and a stringent need: very often, the final goal is helping
the biological expert to gain a deeper understanding of the phenomenon under
investigation.
1.1 Contributions
This thesis is set in the framework described above, and is aimed at investigating and promoting the applicability of bag of words representations and models
in the wide field of bioinformatics. The first contribution of this thesis is therefore
the identification of some bioinformatics scenarios which can be faced from a bag
of words perspective. For each scenario, motivations, advantages, and challenges
of the bag of words representations are addressed, proposing possible solutions.
From a methodological point of view, this thesis contributes in two ways: i) bag
of words approaches have been exported from other contexts and tailored to the
specific bioinformatics scenario; ii) novel bag of words representations and models
have been derived. From a more applicative perspective, the derived representation
and models have been extensively tested, contributing to push forward the state of
the art. Primary importance has been given to the interpretability of the approaches' results, in an effort to provide biologists and clinicians with tools that allow them to gain relevant insights into the phenomenon under consideration.
More in detail, three applicative contexts have been analyzed:
² Emerging technologies such as RNA-seq [235] are increasingly providing a way to directly observe and count such molecules, although they are not as widespread as microarrays.
• Gene expression analysis:
In this context, the first original contribution was in recognizing that a vector
of gene expressions can be considered as a bag of words. Given that, we investigated the capabilities of topic models for the classification task, reaching
state of the art results on many datasets; considerations on the interpretability
of the obtained representations are provided, with the use of a real dataset involving different species of grapevine (resulting from a collaboration with the
Functional Genomics Lab at the University of Verona). Finally, we show the suitability of more recent models to mine knowledge from gene expression data (beyond classification): our approach makes it possible to visualize a gene expression dataset by embedding biological samples in a 2D map, and to derive a principled method to highlight the most discriminative genes involved in a pathology.
• HIV modeling:
This thesis contributed to HIV and immune system modeling, by promoting the usage of the bag of words representation and models for epitope sets, as well as by studying the variation of TCR counts upon infection. In fact, upon HIV
infection, two phenomena co-occur: i) the patient’s bag of epitopes changes,
since new fragments of the virus are presented for immune surveillance; ii) the
patient’s bag of TCRs changes, since HIV/AIDS implies a progressive failure
of the immune system, resulting in a drastic decrease of TCR levels. In the
first case, a bag of words representation has been derived and modeled with
the final goal of regressing the viral load value (an estimate of the patient
HIV status). In the second case, the quality of TCR counts, extracted via
454 pyrosequencing from different HIV patients, has been assessed. Using the
proposed approach, realized in collaboration with the David Geffen School of
Medicine (UCLA), we were able to propose a reliable estimate of the bag of
words (which is heavily prone to noise and sequencing errors) and to statistically validate clinical hypotheses.
• Protein remote homology detection:
Finally, this thesis addressed the protein remote homology detection problem,
a crucial task in bioinformatics where the goal is to determine if two proteins
have a similar biological function even when their sequence similarity is low.
In this context, the bag of words approach has already been investigated in the literature, and has proved successful: by positing an analogy with the document scenario, biological “words” have been extracted from a protein sequence (for example using Ngrams, namely short contiguous subsequences of N symbols). This thesis contributed in two different directions. The first is aimed at integrating evolutionary information into the bag of words representation, equipping each word/Ngram with a weight that encodes its conservation
across evolution. A novel bag of words approach, called soft bag of words, has
been devised, together with a novel probabilistic model able to handle the presence of a weight associated with each word. The second research direction is
aimed at properly integrating into existing models partial information derived
from other sources. In particular, there is a source of information which is typically disregarded by classical approaches: the available experimentally-solved,
possibly few, 3D structures of proteins. In this thesis a multimodal approach
for protein remote homology detection has been proposed, validating it using
standard benchmarks, as well as employing a real dataset involving the superfamily of GPCR proteins (in collaboration with the Applied Bioinformatics
Group at the University of Verona).
1.2 Organization of the thesis
This thesis is divided into an introductory chapter and three main parts. The first
chapter formally presents the bag of words paradigm, and introduces the notation
and formalism employed in the subsequent chapters. The three main parts describe
the proposed approaches in the three bioinformatics scenarios, namely the gene
expression analysis, the HIV modeling, and the protein remote homology detection.
In the gene expression part, Chap. 3 introduces and describes the problem,
and summarizes the recent literature. Then, Chap. 4 describes how to employ bag
of words and topic models for gene expression classification, also discussing the
interpretability of the method. Finally, Chap. 5 deals with the usage of a more recent and sophisticated model for gene expression, presenting methodological and
applicative contributions achieved. In the HIV modeling part, Chap. 6 introduces
and describes the problem, as well as the state of the art. Then, Chap. 7 discusses
the proposed approach for HIV viral load regression, whereas Chap. 8 describes the
detailed analysis performed on TCR bags of different HIV patients. In the protein
remote homology detection part, Chap. 9 introduces the problem and surveys the
state of the art, while the subsequent chapters detail the lines of researches investigated: Chap. 10 presents the novel soft bag of words approach, whereas Chap.
11 is concerned with the study of a multimodal approach to integrate structural
information to ease the detection task. Finally, in Chap. 12 conclusions are drawn
and future perspectives are envisaged.
1.3 Publications
Some parts of this thesis have been published in conference proceedings or in
international journals. In the context of gene expression analysis, Chap. 4 has been
published in [24]; a preliminary study of the ideas presented in Chap. 5 has been
published in [136], whereas the comprehensive approach has been submitted to a
journal [137]. In the context of HIV modeling, Chap. 7 has been published in [179],
whereas Chap. 8 is still under consideration for publication. In the context of
protein remote homology detection, Chap. 10 has been submitted to a journal [138];
the multimodal approach of Chap. 11 has been preliminarily presented as a poster
in [139], and the complete study is in press for publication [140].
2
The Bag of Words paradigm
There are several application scenarios where the bag of words scheme has been
applied with success; some examples have been presented in Chap. 1. All these
approaches follow a common pipeline, which consists of several steps that lead to
the construction of the bag of words representation and the solution of the task.
While the general idea is clear (i.e. representing an object with a vector of counts),
a general formalization seems to be missing in the literature. This chapter fills this
gap, by defining a possible pipeline that can be employed to face a problem using
a bag of words approach. This pipeline is schematically depicted in Fig. 2.1, and
explained briefly in the following.
The starting point is the problem to solve or the task at hand; in the pattern
recognition approach, we are given several training examples, i.e. instances or
objects of the problem. The crucial aspect is to recognize that a given object of
the problem can be seen as composed by simpler, “constituting” elements – that
we will call words in the following (referring to the text domain where they have
been originally introduced [203]). In other fields of Computer Science, the concept
of “word” is sometimes called atom, token, chunk, or building block. Depending on
the problem, the identification of words can be straightforward or not: for example,
it is intuitive that textual words are constituting elements of a text document,
whereas it is not so easy to define words for an image. The universe of all words –
that can constitute every possible object of the problem – is called dictionary. In
Fig. 2.1 this stage in the pipeline is represented by the diamond “what to count”.
The second step is to perform the counting process, i.e. to determine the number of times each word of the dictionary appears in the object to represent. This
leads to a numerical vector, where each element is associated with a word of the
dictionary; its value is the number of times this word appears in the object. In
Fig. 2.1, this stage in the pipeline is represented by the diamond “how to count”.
Through the bag of words vector, objects are embedded into a feature space: vectors in this feature space can already be used to solve the task, for example as
input for a classifier.
Otherwise, the bag of words representation can be modeled, for example by
taking into explicit consideration the fact that these features are counts. One
possible choice is to employ probabilistic modeling, which provides a consistent
framework to explain the process of data generation, to manage the presence of
uncertainty/noise, to provide a more interpretable description, and to possibly
increase the predictive accuracy. This stage in the pipeline is represented by the
diamond “how to model”.

[Pipeline: Problem/Task → what to count → Dictionary → how to count → Bag of words representation → how to model → Solution/Knowledge]
Fig. 2.1. A possible pipeline of execution of a bag of words approach.
The next sections are devoted to detailing the aforementioned steps: for every step, the mathematical notation and the formalization of concepts are introduced,
along with a brief survey of the state of the art.
2.1 What to count
Suppose that we are given a set of training objects X = {x1 , . . . , xT }. In this stage,
two problems have to be addressed. The first one is to define simpler, “constituting”
elements whose repetition characterizes the object. For example, in Fig. 2.2(a) the
object is a truck built with the famous Lego bricks, and it is reasonable to define
individual elements as the different types of bricks. The truck is a complex object,
but composed by the repetition of some simpler bricks, opportunely assembled. In
the following, we will refer to these elements (bricks) as words, and we will denote
them with the symbol w. In the example, individual words are the different types
of bricks that can be used.
The second problem is to collect all words – that can constitute every possible
object of the problem – in a dictionary. The mathematical definition of a dictionary
is a set D comprising all possible words: D = {w1 , . . . , wN }. Fig. 2.2 (b) shows
the dictionary of our example, which should contain all possible brick types, not
limited to the ones needed for building the truck (see for instance the green brick
in the dictionary, which is not a piece of the truck). The dictionary D can be prespecified and known a priori, or can be created by aggregating all words observed
at least once in at least one training instance. In any case, it is worth stressing
that the dictionary represents a universe: constituting elements of any object must
be elements in the dictionary. If a novel word – not contained in the dictionary –
is observed, for example during the testing phase, it should either be discarded, or the dictionary has to be re-tuned.

Fig. 2.2. (a) Words can be defined as the different bricks composing the Lego truck; (b) the dictionary is the set of all possible words.
In the literature, there are many contexts where the dictionary and words
therein are clearly identifiable. In some other contexts it is more difficult, and the
main effort of defining the bag of words is the identification of such words. In the
following we detail these two possible cases.
2.1.1 Easy-to-define words
The first example in this scenario is the Natural Language Processing field, where
the bag of words was originally introduced [68, 203]. In the original formulation, words are seen as the constituting elements of a text, and the dictionary
has an intuitive and literal meaning. However, depending on the task, it may be
more convenient to decompose a text in Ngrams (sets of N consecutive characters
extracted from a word) [71], or syllables/phonemes [164].
This way of reasoning has also been exported in biology: many molecules are
essentially strings or sequences (called polymers) composed of many repeated subunits. The most striking example is perhaps DNA, a long polymer “written” using four letters (A, T, C, G) called nucleotides. Similarly, proteins are made up of linear chains of 20 different “building blocks” called amino acids [126]. In
this context, words can be defined by taking individual symbols or Ngrams (like
before, N consecutive nucleotides/amino acids extracted from a sequence – sometimes called Kmers in this biological context) [231]; or a word can be defined
through complex heuristics that take into explicit consideration the biological significance [204]. Another bioinformatics example is shown in Fig. 2.3. The figure
schematically represents a portion of the cell, centered around the nucleus. Colored dots correspond to mRNA molecules, whose amount and types ensure proper
growth, development, and health maintenance of the cell [126]. mRNA molecules
are copies transcribed from genes, and the production of these copies is regulated by an important process called gene expression. This mechanism acts as
both an “on/off” switch to control which genes are expressed in a cell as well as a
“volume control” that increases or decreases the level of expression of particular
genes as necessary [126]. Thus, the more expressed a gene is, the more copies will
be transcribed. Given this, it is reasonable to assert that genes are “constituting
elements” of the cell, and can be employed as words in a bag of words representation [23]. The dictionary, i.e. the ensemble of all the genes in an organism, is usually known a priori, and obtained through complex sequencing studies like the human genome project [50].

Fig. 2.3. The amount and types of mRNA molecules in a cell – represented by colored dots in the schematic portion of the cell portrayed – reflect the function of the cell. On the right, a possible dictionary containing the list of all known genes.
In a very similar fashion, the immune system gathers evidence of a viral infection by surveying the amount and types of epitopes (small segments of the viral
proteins) which are cleaved in the cell and presented to the cell surface as a means of warning. The immune system sees these epitope sets as disordered “bags”, based on whose counts action needs to be taken [49]. In this context, it may be reasonable to employ the epitopes as words.
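As a concrete sketch of such sequence-based words (illustrative code, not from the thesis; the choice of k is arbitrary), a sequence can be decomposed into overlapping Kmers:

```python
from collections import Counter

def kmer_counts(sequence: str, k: int) -> Counter:
    """Decompose a sequence into overlapping Kmers and count each of them."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

dna = "ATCGATCGA"
print(kmer_counts(dna, k=3))
# Counter({'ATC': 2, 'TCG': 2, 'CGA': 2, 'GAT': 1})
```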
2.1.2 Difficult-to-define words
In the previous section we gave a brief and non-exhaustive list of examples where
the word definition is somehow natural or easily derived from the problem. However, for some other applications the concept of “word” is not so explicitly defined. A striking example can be found in Computer Vision, where bag of words
approaches brought a substantial boost to the state of the art, and have been a
turning point in the field. The main merit of these approaches is that they were
able to define words that can be counted in natural images. For example, a successful line of research is aimed at extracting repeating keypoints that may represent
local salient areas of the images – much like words are local features of a document. Saliency can be defined for example as stability under minor affine and
photometric transformations (such as in SIFT, SURF, HOG [57, 141]), or based
on computational models of the human visual attention system – these last approaches are concerned with finding locations in images that are visually salient
(e.g. high spatial frequency edges) [82].
In any case, however, these local image descriptors are high-dimensional, real-valued feature vectors. Thus, a vector quantization step is required to obtain a
2.2 How to count
13
discrete vocabulary: this is traditionally performed with a clustering algorithm
such as k-means [100], or learned with more sophisticated approaches [3] which
take into account the final task (e.g. classification). In the end, words correspond to clusters: different SIFTs in the same cluster are represented by the same word,
with a consequent information loss. A graphical representation of the approach is illustrated in Fig. 2.4.

Fig. 2.4. Pipeline for defining words in images. (a) Keypoints are extracted from training images and embedded in a vector space; (b) the dictionary is derived with a vector quantization step that clusters similar keypoints together into a single “word”; (c) whenever a new keypoint is extracted, it is assigned to the nearest word.
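Steps (b) and (c) of the pipeline can be sketched in a few lines; here random arrays stand in for real local descriptors (an assumption made purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

W = 100                                        # dictionary size (visual words)
train_descriptors = np.random.rand(5000, 128)  # stand-in for SIFT descriptors
image_descriptors = np.random.rand(300, 128)   # descriptors of one new image

# (b) build the dictionary: each cluster centroid is one visual "word"
kmeans = KMeans(n_clusters=W, n_init=10).fit(train_descriptors)

# (c) assign every keypoint of the new image to its nearest word, then count
word_ids = kmeans.predict(image_descriptors)
bag_of_words = np.bincount(word_ids, minlength=W)  # length-W count vector
```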
Finally, another research line suggests generating keypoints by sampling the images using a grid or pyramid structure, or even by random sampling: these have historically been preferred for fast extraction of words in videos [212]. It is worth mentioning that in Computer Vision, bag of words representations have also been proposed to characterize textures [55, 230], where words are the repeating texton elements, and 2D and 3D shapes [127].
A final consideration: as in the text scenario, when applied to images the bag
of words destroys the structure, i.e. the spatial layout of keypoints in images,
contrary to what happens in some of the examples we presented in the field of
bioinformatics.
Another interesting application domain where bag of words approaches have been successfully used, but where the definition of words required some effort, is audio processing. In particular, [215] noted that sounds that human listeners find
meaningful are best represented in the time-frequency domain, and these kinds
of representations are essentially counting the number of time-frequency acoustic
quanta that collectively make up complex sound scenes, similar to how we count
words that make up documents. It is important to notice that a time-frequency
transform is often complex-valued, and is often computed with tools such as the
short-time Fourier transform, constant-Q transforms, wavelets, etc. However, because the hearing system is more sensitive to the relative energy between different
frequencies, for most practical applications only the modulus of these transforms
is used, whereas the phase is discarded. Thus, a discrete non-negative count value
can be derived, from which the analogy sound frequency / word can be established [211, 213–215]. Other approaches for audio processing are instead based on
Mel Frequency Cepstral Coefficients (MFCCs [242]), and employ a vector quantization step (similarly to the Computer Vision scenario) of these coefficients to
derive the acoustic words [115, 123].
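A minimal sketch of this magnitude-based counting (a synthetic one-second tone; all parameter choices are merely illustrative):

```python
import numpy as np
from scipy.signal import stft

fs = 8000                                # sampling rate (Hz)
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)     # one second of a 440 Hz tone

f, frames, Zxx = stft(signal, fs=fs, nperseg=256)
magnitude = np.abs(Zxx)                  # keep the modulus, discard the phase

# treating each frequency bin as a "word", summing magnitudes over time
# yields a non-negative, count-like value per acoustic word
acoustic_bag = magnitude.sum(axis=1)
print(acoustic_bag.shape)                # (129,) = nperseg // 2 + 1 bins
```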
Once the dictionary is built, the next step is to perform the counting process
and obtain the bag of words vector representation.
2.2 How to count
Counting is perhaps one of the oldest mathematical activities, and one of the
first we learn as children. Intuitively, to count means to determine the number of
elements in a set. Standard English dictionaries define it as “to say numbers one
after the other in order, to calculate the number of people or things in a group”.
The mathematical definition of “counting” [69,85] resembles this last definition:
given a finite set of elements Y , to count is to establish a function between the
elements of Y and the natural numbers N0 in progressive order (zero is excluded).
Therefore, one element of the set is associated with “1”, another one with “2”,
and so on until all elements of Y are assigned a natural number. We would like
this function to be bijective, to determine the count value. Thus, we first define
Nn = {x ∈ N | 1 ≤ x ≤ n}

For each integer n ∈ N0, Nn is the set of natural numbers up to n. Then, to count a
finite set Y is to establish a bijection f : Y → Nn for some n ∈ N0 . In other words,
if there exists a bijection f : Y → Nn , then we say that the number of elements in
Y is n and write |Y| = n. A graphical representation using sets is depicted in Fig. 2.5.

Fig. 2.5. To count means to establish a bijective function between the set Y and the natural numbers considered until n.
This definition can be useful in the bag of words representation to count the
number of instances of one word; since in general the object is composed of the
repetition of different words, we want to extend this notion defined on a set to a
more general scenario. For this reason, we denote an object X as a multiset [32],
a generalization of the notion of a set where the members are allowed to appear
more than once. For example, X = {’a’,’a’,’a’,’b’,’d’}. In addition, as described in
the previous section, we are given a dictionary D with |D| = W elements: in this
particular example, we define D = {’a’,’b’,’c’,’d’}, and W = 4. To build a bag of
words vector x for the object X, we count how many times each word wi ∈ D
occurs in X. Specifically, we build a vector where each element represents a word
in the dictionary and the value of that element is the number of times the word
appears in the object. Mathematically, we can think of this as a function
count : (X, D) → N,   (X, wi) ↦ |{w | w ∈ X, w = wi}|     (2.1)
The bag of words vector x of the object X is then a vector of size W defined as
x = [count(X, w1), count(X, w2), . . . , count(X, wW)]     (2.2)
In the example, x = [3, 1, 0, 1]. In this particular formulation, counts are required to be discrete values. However, the definition can also be extended to continuous values, motivated by the two following considerations:
• There are cases where the count is a discrete value, but due to technological problems we can only observe a continuous value “proportional” in some way to the real count. Consider for example the gene expression scenario presented in the previous section: genes, and molecules in general, are very difficult to observe directly. The current technology to measure gene expression, called DNA microarray, detects and quantifies mRNA by detection of fluorescence-labeled targets. A researcher must then use a special scanner to measure such fluorescent intensity, and the raw image extracted is elaborated with image processing techniques to obtain a final expression value.
• More generally, the constraint can be relaxed by drawing a parallelism between a count value and a measure that reflects a level of presence, importance, power, or frequency. Actually, it is reasonable to interpret these values as counts: the more present an element is in an object, the higher its count. For example, as described above, acoustic words are counted by computing the magnitude of a signal in the Fourier domain, and this can result in a real value [215].
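Returning to the discrete case, Eqs. 2.1–2.2 translate directly into code on the multiset example above (a minimal sketch):

```python
def count(X: list, w: str) -> int:
    """Eq. 2.1: the number of occurrences of word w in the multiset X."""
    return sum(1 for element in X if element == w)

X = ['a', 'a', 'a', 'b', 'd']   # the object, seen as a multiset
D = ['a', 'b', 'c', 'd']        # the dictionary, W = 4

# Eq. 2.2: the bag of words vector collects the count of every dictionary word
x = [count(X, w) for w in D]
print(x)                        # [3, 1, 0, 1]
```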
Finally, it is worth noticing that a concept often related in the literature with
counting is the histogram [167]. In its precise definition, a histogram is a graphical
representation of the distribution of continuous data. For simplicity, consider the
1-D case, where we have several real numbers laid on the x axis. A histogram is
obtained by first “binning” the range of possible values that are observed – that
is, divide the entire range of values into a series of small, discrete intervals. Then,
one counts how many values fall into each interval, and draws a rectangle having
width equal to the interval range, and height equal to the count. An example of a histogram is pictured in Fig. 2.6. From the bag of words perspective, each bin in a histogram represents a word, which has been obtained through a discretization (i.e.
a vector quantization) of the observed continuous values; the height of each bin is
the count value that is present in an entry of the bag of words vector.
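This correspondence is immediate in code (a minimal sketch with made-up values):

```python
import numpy as np

values = np.array([0.5, 1.2, 2.7, 3.1, 3.3, 5.8, 6.4, 9.9])  # continuous data

# "binning": discretize the observed range into intervals (one word per bin)
counts, bin_edges = np.histogram(values, bins=4, range=(0, 12))
print(counts)     # [3 3 1 1]  -> the bag of words vector
print(bin_edges)  # [ 0.  3.  6.  9. 12.]  -> the quantized "dictionary"
```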
2.3 How to model
Through the bag of words representation, an object is projected in a vector space.
In this space, the problem under consideration may be solved. However – depending
on the task at hand – it can be convenient to employ a model that further exploits
the information contained in the bag of words vectors. This is done in order to
increase the performances, to explain the process of data generation (if the goal
is classification or clustering), to highlight particular facets of the data by providing a more interpretable description (if the goal is visualization/interpretation), or to validate the observed bag of words.

Fig. 2.6. A histogram is a rough estimate of the distribution of continuous values, obtained by depicting the count of values occurring in certain ranges.
Many models exist and have been proposed, depending on the application and
the problem to solve. This thesis adopts the perspective of probabilistic modeling,
which will be explained in the following sections: statistics and probability theory
provide a consistent framework for the quantification and manipulation of uncertainty, and can take into account all of the aforementioned considerations. We will
introduce the notions and a general framework for probabilistic modeling, along
with some examples: specific models will be presented when needed throughout
the thesis. Before that, we will introduce how the bag of words can be seen from
a probabilistic perspective.
2.3.1 The bag of words as a multinomial
In this section we will describe how the bag of words can be regarded as a random
variable. Consider a simple example where a die is thrown. The result of one
throw is a discrete random variable that can take one of 6 possible mutually
exclusive values. In the spirit of the bag of words approach, we will refer to each
of these values as a word, and the 6 possible words constitute the dictionary.
Therefore D = {1, 2, 3, 4, 5, 6}. There are different ways of expressing the variable
characterizing a word: a particularly convenient representation is the “1-of-W”
scheme, where the variable is a W-dimensional (W=6 in our dice example) vector
w in which one of the elements wv equals 1, and all remaining elements equal 0.
Suppose for example that a particular observation of the variable corresponds to
the result “4” of the die. Then w will be represented as
w = [0, 0, 0, 1, 0, 0]
We can think of this as an indicator function that “selects” the observed word
in the dictionary. Thus, we can refer to a word either with its index v in the
dictionary, or with a “1-of-W” vector w where wv = 1.
In addition to that, we are aware of the probabilities of the different words: in
our example, they are all equal to 1/6. If these probabilities are encoded in the
vector

π = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

then the distribution of w is

p(w|π) = ∏_{v=1}^{W} πv^wv     (2.3)

Since all entries of w are zeros except one, in our example this formula simply reduces to p(w|π) = π4 = 1/6.
Then we can make a step forward: we would like to obtain a vector x that counts
how many times the different words w occurred throughout N independent throws
of the die, i.e. a proper bag of words vector. First, let us denote each of the N
results as w1 , . . . , wN . Through the 1-of-W scheme, x is easily computed with the
element-wise sum of each wn :
x = w1 + w2 + . . . + wN     (2.4)
Suppose for example that in 5 different throws we obtain the results {3, 3, 5, 3, 2}.
Through the 1-of-W scheme, the bag of words is computed as follows:
w1 = [0 0 1 0 0 0]
w2 = [0 0 1 0 0 0]
w3 = [0 0 0 0 1 0]
w4 = [0 0 1 0 0 0]
w5 = [0 1 0 0 0 0]

∑i wi = x = [0 1 3 0 1 0]
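The same construction can be written down in a few lines (a sketch, reusing the five throws above):

```python
import numpy as np

W = 6                            # dictionary size: the faces of the die
throws = [3, 3, 5, 3, 2]         # observed results, N = 5

def one_of_W(v: int, W: int) -> np.ndarray:
    """1-of-W encoding: all zeros except a 1 at the position of word v."""
    w = np.zeros(W, dtype=int)
    w[v - 1] = 1                 # faces are 1-based, array indices 0-based
    return w

# Eq. 2.4: the bag of words is the element-wise sum of the indicator vectors
x = sum(one_of_W(v, W) for v in throws)
print(x)                         # [0 1 3 0 1 0]
```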
A multinomial distribution is the probability distribution that describes x, namely the number of times each of the W possible words occurs out of N trials, where each word has probability πk; π represents the parameter of the multinomial distribution.

Usually, we are interested in computing the probability that a particular observation x is generated by a multinomial distribution with known parameter π. Since each word wi is independent, the probability mass function can be derived from Eq. 2.3:

p(x|π, N) = (N choose x1, x2, . . . , xW) · ∏_{k=1}^{W} πk^xk     (2.5)
where the normalization coefficient is the number of ways of partitioning N words into W groups of size x1, . . . , xW, and is given by

(N choose x1, x2, . . . , xW) = N! / (x1! x2! . . . xW!)     (2.6)
Note that the variables xk are subject to the constraint ∑_{k=1}^{W} xk = N.
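Eqs. 2.5–2.6 translate directly into code (a minimal sketch; the example reuses the counts of the five fair-die throws above):

```python
from math import factorial

def multinomial_pmf(x: list, pi: list) -> float:
    """Eq. 2.5: probability of count vector x under word probabilities pi."""
    N = sum(x)
    coeff = factorial(N)              # Eq. 2.6: multinomial coefficient
    for xk in x:
        coeff //= factorial(xk)
    prob = float(coeff)
    for xk, pk in zip(x, pi):
        prob *= pk ** xk
    return prob

x = [0, 1, 3, 0, 1, 0]                # counts observed in N = 5 throws
print(multinomial_pmf(x, [1/6] * 6))  # 20 * (1/6)^5 ≈ 0.00257
```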
2.3.2 Probabilistic models
Once the key distributions are defined, probabilistic manipulations can be expressed in terms of two simple equations, known as the sum rule and the product
rule [28]. In general, given two random variables a and b we can write the following
equations:
(Sum rule)       p(a) = ∑_b p(a, b)     (2.7)

(Product rule)   p(a, b) = p(b|a) p(a)     (2.8)
The sum rule is sometimes called marginalization, and the sum is over all possible
values b can take. Note also that the summation must be replaced by an integral
if b is continuous rather than discrete. All of the probabilistic inference and learning
manipulations discussed in this thesis (but the statement is much more general),
no matter how complex, amount to repeated application of these two equations.
For example, by applying the product rule twice, one can easily derive Bayes' rule, which states the following:

(Bayes' rule)   p(a|b) = p(b|a) p(a) / p(b)     (2.9)
These basic equations serve as the main ingredients of probabilistic generative models [81]. The goal of generative modeling is to formally develop statistical models that can explain the input data, or visible variables x, as tangible effects generated from a combination of hidden variables h representing the causes, possibly coupled with conditional interdependencies.
Let us look back at the dice example, where a die is thrown N times. We already introduced the variable x, representing the sum of the vectors w_1 + ... + w_N (each w_i corresponding to the result of one throw represented through the "1-of-W" scheme). We also noted that x is a multinomial variable. We can complicate the example by supposing that there are two possible dice: one is a common die (denoted h_1), the other has only the odd numbers, each duplicated (denoted h_2). In
this example, before throwing the die N times, the identity of the die h is chosen.
Moreover, we are only able to see the result of the throws, but not the identity of
the die. Our goal is to understand if a particular observation x resulted by throwing
N times either h1 or h2 . In order to do so, the idea is to compute p(h = h1 |x) and
p(h = h2 |x), called the posterior probability; after that, we can decide that x has
been generated by the die ĥ, where
$$\hat{h} = \arg\max_h\, p(h|\mathbf{x})$$
Fig. 2.7. A simple Bayesian Network (over three variables a, b, and c).
The problem of course is to compute the posterior p(h|x), which can be solved by
reversing the conditional probability by using Bayes’ law, thus leading to
$$p(h = h_1|\mathbf{x}) = \frac{p(\mathbf{x}|h = h_1)\, p(h = h_1)}{p(\mathbf{x})}$$
and, in a similar way, we can compute p(h = h2 |x). At this point, one should recall
that p(x|h) is a multinomial distribution whose parameter varies depending on the
die h chosen:
$$p(\mathbf{x}|h, N) = \binom{N}{x_1\, x_2\, \ldots\, x_W} \prod_{v=1}^{W} \left( \pi_v^{(h)} \right)^{x_v} \qquad (2.10)$$
To put these formulae into concrete perspective, suppose we instantiate our example with the following:
• x = [4 0 3 0 3 0], N = 10;
• p(h_1) = p(h_2) = 0.5, i.e. the prior probability that one die is preferred to the other is flat;
• π^(h_1) = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6], whereas π^(h_2) = [1/3, 0, 1/3, 0, 1/3, 0].
Then,
$$p(h = h_1|\mathbf{x}) = \frac{p(\mathbf{x}|h = h_1)\, p(h = h_1)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|h = h_1)\, p(h = h_1)}{\sum_{i=1}^{2} p(\mathbf{x}|h = h_i)\, p(h_i)} = \frac{7 \cdot 10^{-5} \cdot 0.5}{0.0356} = 0.001$$

and

$$p(h = h_2|\mathbf{x}) = \frac{p(\mathbf{x}|h = h_2)\, p(h = h_2)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|h = h_2)\, p(h = h_2)}{\sum_{i=1}^{2} p(\mathbf{x}|h = h_i)\, p(h_i)} = \frac{0.0711 \cdot 0.5}{0.0356} = 0.999$$
We can conclude that it is much more likely that the observed x has been generated
by throwing 10 times the die h2 .
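The computation above can be verified numerically; the following is a minimal sketch (Python with SciPy; variable names are illustrative):

```python
import numpy as np
from scipy.stats import multinomial

x = np.array([4, 0, 3, 0, 3, 0])                   # observed counts, N = 10
pis = {"h1": np.full(6, 1 / 6),                    # common die
       "h2": np.array([1/3, 0, 1/3, 0, 1/3, 0])}   # die with duplicated faces
prior = {"h1": 0.5, "h2": 0.5}

# Likelihood p(x | h) for each die (Eq. 2.10), then Bayes' rule.
lik = {h: multinomial.pmf(x, n=x.sum(), p=p) for h, p in pis.items()}
evidence = sum(lik[h] * prior[h] for h in lik)     # p(x), about 0.0356
posterior = {h: lik[h] * prior[h] / evidence for h in lik}
print(posterior)                                   # h1: ~0.001, h2: ~0.999
```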
2.3.3 Bayesian networks
One could proceed to formulate and solve complicated probabilistic models purely
by the algebraic manipulation introduced in the previous section. However, this can result in an unnecessarily complex framework, leading to a proliferation of formulae to keep track of. For this reason it is highly advantageous to augment the analysis using graphical representations of probability distributions, called bayesian networks, which start from the concept of a graph.

Fig. 2.8. A simple Bayesian Network for the dice example (left: one node per throw; right: plate notation). In other literature, this model is known as a mixture of unigrams.

In a bayesian network, each
node represents a random variable, and the links express probabilistic relationships between these variables. The graph then captures the way in which the joint
distribution over all of the random variables can be decomposed into a product of
factors, each one depending only on a subset of the variables. More specifically, a
bayesian network for random variables a1 , . . ., aN is a directed acyclic graph on
the set of variables, along with one conditional distribution for each variable given
its parents, p(a_i | a_pa_i).
Then, the graph captures the way in which the joint probability of all the
variables is decomposed, namely by saying that
$$p(a_1, \ldots, a_N) = \prod_{i=1}^{N} p(a_i \mid a_{\mathrm{pa}_i})$$
that is, the observation of a particular value for a variable is influenced only by
the value taken by the direct parents of such variable. A graph example is shown
in figure 2.7. The joint distribution of the three variables a, b, and c is given by
p(a, b, c) = p(c|a, b)p(a)p(b).
However, in a typical bayesian network some of the nodes v are clamped to
observed values (i.e. measurements), and some other nodes h are hidden, and aim
at representing the causes that generated that particular set of observations, which
we called the training data. Consider the bayesian network for the dice example,
which is shown in Fig. 2.8 (left). First, note that the variable h is represented with
a shaded node, to denote that it is hidden. As already mentioned, h is a discrete
variable that represents the index 1, 2 of the die from which the data point w, i.e.
the result of a throw (visible variable), is generated. The joint distribution of this
model is
$$p(\mathbf{w}_1, \ldots, \mathbf{w}_N, h) = p(h) \cdot \prod_{i=1}^{N} p(\mathbf{w}_i|h) \qquad (2.11)$$
This network describes the generation of a bag of words x by simulating the
process of throwing a die N times. In fact, first we select a die h according to p(h)
(h has no parent in the graph), and then assign a value to N different variables wi
drawing N times from p(w|h). The final bag of words x is constructed by simply
summing: x = w1 + . . . + wN . The graphical representation can be made more
compact, by surrounding some of the nodes with a box, called a plate (as in the
right part of Fig. 2.8). When dealing with complex models, it is inconvenient to
explicitly draw N nodes w1 , . . . , wN ; plate notation allows such multiple nodes to
be expressed more compactly, where a single representative node w is surrounded
with a plate labeled with N, indicating that there are N nodes of this kind, independent and identically distributed (i.i.d.). Finally, it is possible to include
in the representation the parameters of the distributions. In Fig. 2.8 we explicitly
depicted π, not circled, to denote that it is not treated as a random variable¹. This simple bag of words model was originally named in the literature the mixture of unigrams model [158].
So far, we described the meaning of a bayesian network and how it is possible –
given some specified parameters – to evaluate the joint probability of the variables.
In the following, we will introduce the main tasks that represent the true essence of bayesian networks: learning and inference. In the learning phase, training data are used to infer plausible configurations of the model parameters π. With the model trained, inference consists in "querying" the model, in order to compute estimates or make decisions about essentially every probabilistic relation that can be expressed between any two or more variables in the model.
2.3.4 Inference and learning in bayesian networks
When the model is trained and its parameters fully specified, the main inferential
problem consists in computing the posterior distributions of one or more subsets
of hidden nodes h, which will be denoted with q(h):
$$q(h) = p(h|v) \qquad (2.12)$$
In general, probabilistic inference reduces to using Bayes' rule and the other rules of probability to express the posterior distributions as functions of the conditional ones specified by the joint probability of the model.
As an example, consider again the mixture of unigram model for the dice
example in Fig. 2.8: the goal is to infer the posterior distribution over h, p(h|w),
where for simplicity we will refer to w instead of w1 , . . . , wN . Using Bayes’ rule
we can write
$$q(h) = p(h|\mathbf{w}) = \frac{p(\mathbf{w}|h)\, p(h)}{p(\mathbf{w})}$$
where p(w) can be computed using the sum and product rules:

$$p(\mathbf{w}) = \sum_h p(\mathbf{w}, h) = \sum_h p(\mathbf{w}|h)\, p(h)$$

¹ Of course, it is possible to treat parameters as random variables. In that case, we should have also specified a distribution over the parameters, p(π), including it in the joint probability decomposition.
In general, when dealing with complex models or models with more than one
hidden variable, it may happen that the distribution p(h|v) cannot be computed,
or it requires an exponential number of values to be stored. In such cases, the
posterior p(h|v) is said to be intractable, and a variety of approximations (called
variational approximations) can be made in order to make computations possible
and efficient [81]. In the exact formulation, if there are H hidden variables h_1, ..., h_H, one must use an unconstrained form for p(h|v), that is, no factorization over the h_i is assumed:

$$q(h) = p(h|v) = p(h_1, \ldots, h_H|v)$$
The main idea of variational approximation is to keep only a few dependencies between the hidden variables in the posterior. Assume for example only one dependence: h_i depends on h_j. This yields the following factorization:

$$q(h) = \left[ \prod_{n=1,\, n \neq i}^{H} p(h_n|v) \right] p(h_i|h_j, v) \qquad (2.13)$$
This approximation is called a structured variational approximation [81]. Another
common choice is to assume the complete factorization over the hidden variables
h_i. In this case the equation for the posterior becomes:

$$q(h) = \prod_{n=1}^{H} p(h_n|v) \qquad (2.14)$$
This kind of approximation is called mean field approximation and is the most
frequently used method, due to its simplicity.
In the learning phase, we want to estimate plausible configurations of the model
parameters. Learning is possible provided that we are given several examples, or
training data: the main idea is that there is a setting of the parameters that
produced the observed training data. Since the model parameters π are unknown
at the moment, we consider them as hidden variables.
At this point, hidden variables can be divided into the parameters, denoted by π, and one set of hidden variables h^(t) for each training case t, t = 1, ..., T. So, h = [π, h^(1), ..., h^(T)]. In a similar way, there is one set of visible variables for each training case, so v = [v^(1), ..., v^(T)]. Assuming that the training cases are independent and identically distributed (i.i.d.), the distribution over all visible and hidden variables (including parameters) is

$$p(h, v) = p(\pi) \prod_{t=1}^{T} p(h^{(t)}, v^{(t)}|\pi) \qquad (2.15)$$
For example, in the mixture of unigrams model, with T i.i.d. training cases, the joint probability is given by

$$p(h, \mathbf{w}) = p(\pi) \prod_{t=1}^{T} p(h^{(t)}, \mathbf{w}^{(t)}|\pi) = p(\pi) \prod_{t=1}^{T} p(h^{(t)}) \prod_{n=1}^{N} \prod_{v=1}^{W} \pi_{h^{(t)},\, v}^{\; w^{(t)}_{n,v}} \qquad (2.16)$$

where w^(t)_{n,v} denotes the v-th entry of the n-th throw in training case t, and π_{h,v} the probability of word v under die h.
Given this quantity, the learning problem can be seen as the problem of maximizing the data likelihood

$$p(v) = \sum_h p(h, v) \qquad (2.17)$$
i.e. finding the model which best fits the data. In formulae, the best parameter configuration π̂ is

$$\hat{\pi} = \arg\max_{\pi} \sum_h p(h, v) \qquad (2.18)$$
For the same reasons discussed in the inference phase, it may be that the likelihood is intractable, and approximate techniques must be employed. One of the most famous tools in statistical estimation for approximate inference is Expectation-Maximization (EM [81]), which will be presented in the next section.
The Expectation-Maximization algorithm
In the context described so far, for a set of parameters π and remaining hidden variables h^(1), ..., h^(T), EM is an algorithm that obtains a point estimate for π, which will be called π̂, and computes the exact posterior over the other RVs h^(t), given π.
The starting point is to derive a bound on the log-likelihood, i.e. the log-probability of the visible RVs, ln p(v). This derivation can be carried out using Jensen's inequality: given a real convex function f, numbers x_1, ..., x_n in its domain, and probabilities μ_1, ..., μ_n:

$$f\!\left( \sum_{k=1}^{n} \mu_k x_k \right) \leq \sum_{k=1}^{n} \mu_k f(x_k) \qquad (2.19)$$
If the function f is concave instead of convex, the direction of the inequality is
simply reversed. To obtain a convex combination inside the concave ln function
of the log-likelihood, we employ the q posterior distribution we discussed during
inference:
$$\ln p(v) = \ln\!\left( \sum_h p(h, v) \right) = \ln\!\left( \sum_h q(h)\, \frac{p(h, v)}{q(h)} \right) \qquad (2.20)$$

$$\geq \sum_h q(h) \ln\!\left( \frac{p(h, v)}{q(h)} \right) = -\mathcal{F}(q, p) \qquad (2.21)$$
The function F is called the free energy, and is an upper bound on the negative
log-likelihood. Moreover, since we have to account for training data, we can rewrite
the free energy formula as:
$$\mathcal{F}(q, p) = -\ln p(\pi) + \sum_{t=1}^{T} \sum_{h^{(t)}} q(h^{(t)}) \ln \frac{q(h^{(t)})}{p(h^{(t)}, v^{(t)}|\pi)} \qquad (2.22)$$
which is the main equation we will employ when making reference to the free energy. Note that, since p(π) is constant by definition, it can be omitted in the subsequent derivations. In this equation, two quantities are unknown: the distribution q(h^(t)) and the parameters π.
EM estimates these two unknown quantities by alternating between minimizing F(q, p) with regard to the set of distributions q(h^(1)), ..., q(h^(T)) (Expectation step, or E-step), and minimizing F(q, p) with regard to π (Maximization step, or M-step). These two solutions give the EM algorithm, summarized in the following steps:
• Initialization: Choose values for π̂ (randomly, or using some clever strategy).
• E-Step: Compute p(h^(t)|v^(t), π̂), then assign q(h^(t)) ← p(h^(t)|v^(t), π̂).
• M-Step: Minimize F(q, p) w.r.t. π̂ by solving

$$\sum_{t=1}^{T} \sum_{h^{(t)}} q(h^{(t)})\, \frac{\partial}{\partial \hat{\pi}} \ln p(h^{(t)}, v^{(t)}|\hat{\pi}) = 0$$

• Repeat for a fixed number of iterations or until convergence.
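To make the alternation concrete, here is a minimal NumPy sketch of EM for the mixture of unigrams of Fig. 2.8, where each training case is a bag of words row vector (function and variable names, and the log-space handling, are illustrative choices, not part of the formal derivation):

```python
import numpy as np

def em_mixture_of_unigrams(X, n_components, n_iter=100, seed=0):
    """X: (T, W) matrix of bag of words counts, one training case per row."""
    rng = np.random.default_rng(seed)
    T, W = X.shape
    # Initialization: random word distributions pi and uniform mixing weights.
    pi = rng.dirichlet(np.ones(W), size=n_components)      # (H, W)
    mix = np.full(n_components, 1.0 / n_components)        # p(h)
    for _ in range(n_iter):
        # E-step: q(h | x_t) proportional to p(h) * prod_v pi[h, v]**x[t, v].
        log_q = np.log(mix) + X @ np.log(pi.T + 1e-12)     # (T, H)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and p(h) from the expected counts.
        counts = q.T @ X                                   # (H, W)
        pi = counts / counts.sum(axis=1, keepdims=True)
        mix = q.mean(axis=0)
    return pi, mix, q
```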
Summarizing, the bag of words paradigm provides a general framework – articulated in three main steps – that can be employed to solve a pattern recognition problem. Many contributions in the literature provided enhancements to each stage, although the formalization of a general pipeline seems to be missing: this chapter provided a possible one, which will be exploited for the solution of the problems described in the next parts of this thesis.
Part I
Gene expression analysis
3
The gene expression analysis problem
In recent years, the research areas of molecular biology and genomics experienced
a rapid and profitable growth thanks to advances in knowledge and in technology.
On one hand, individual studies led to new discoveries about the roles played by
specific genes in the development of diseases. On the other hand, population studies are possible with technologies such as DNA microarrays [92] and RNA-seq [235],
which provided scientists with a way to measure the expression levels of thousands
of genes simultaneously. This created a stringent need for algorithmic approaches
able to extract information from data and create compact, interpretable representations of the problem.
This part of the thesis describes how gene expression data can be approached
with bag of words approaches. In particular, this chapter explains the problem and
the computational challenges related to the analysis of gene expression data, along
with the state of the art present in the recent literature. Subsequently, motivations
and original contributions in this context are summarized, and then detailed in the next two chapters.
3.1 Background: gene expression
The balance of cell processes like growth, response to stimuli, and maintenance is intricately regulated by the mechanism of gene expression. In classical genetics, a gene is an abstract concept: a unit of inheritance that ferries a characteristic from parent to child [166]. Examples of such inherited characteristics are a person's eye color, the blood type, or diseases such as haemophilia or color blindness, to name a few.
Further studies, and the development of biochemistry, proved that the hereditary nature of every living organism is defined by its genome, which contains the
genes [41]. The genome consists of a long sequence of molecules called nucleic
acids – in particular, DNA. Most DNA molecules consist of two strands coiled
around each other to form a double helix. The two DNA strands are known as
polynucleotides since they are composed of simpler units called nucleotides: there
are four possible nucleotides in DNA: guanine (G), adenine (A), thymine (T), or cytosine (C). Structurally, A pairs with T and C with G, mainly for dimensional reasons – only this combination fits the constant width geometry of the
DNA spiral.
DNA provides the information needed to construct the organism. The term
information is used because the genome does not itself perform any active role in
the development of the organism. By a complex series of interactions, the gene
sequence is used to produce another type of molecule, the proteins, at the appropriate time, place, and quantity [118]. Proteins either form part of the structure
of the organism, or have the capacity to perform the chemical reactions necessary
for life. The process by which information from a gene is used in the synthesis
of a functional product (i.e. a protein) is called gene expression. This process is
essentially articulated in two stages:
Transcription DNA expresses its genetic instructions by first transferring its
information to a messenger RNA (mRNA) molecule, in a process called transcription. The term transcription is appropriate because, although the information is
transferred from DNA to RNA, the information remains in the language of nucleic acids. A gene encoding for a protein contains not only the sequence that will
eventually be directly translated into the protein (the coding sequence) but also
regulatory sequences that direct and regulate the synthesis of that protein.
Translation The mRNA molecule then transfers the genetic information to a
protein by specifying its amino acid sequence. This process is termed translation
because the information must be translated from the language of nucleotides into
the language of amino acids. Since the cardinality of the protein alphabet is greater than that of the nucleotides, an important question is how many nucleotides are necessary to specify a single amino acid. With a sequence of 3 nucleotides, there are 4³ = 64 possible combinations of the 4 RNA alphabet symbols, more than enough to specify 20 different amino acids (in fact, this code is redundant – a mechanism aimed at preventing translation errors). During translation, the RNA nucleotides
are “read” by translational machinery in a sequence of nucleotide triplets, each
one coding for a specific amino acid. Two special triplets specify the start and the
end of the protein sequence.
Gene expression is a highly complex process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs. This mechanism acts both as an "on/off" switch to control which genes are expressed in a cell and as a "volume control" that increases or decreases the level of expression of particular genes as necessary. Disruptions or changes in gene expression
are responsible for many diseases. Gene expression may be controlled at any of a number of points along the molecular pathway from DNA to protein: the most important one is transcription, where the cell selects which genes are to be transcribed into mRNA and sets the efficiency of the process, namely how many protein copies have to be produced.
3.1.1 DNA Microarray
The ability to detect the expression levels of the genes in an organism has been made possible only in recent years, with the advent of a technology called DNA microarray.

Fig. 3.1. (left) A microarray slide contains thousands of probes, each one corresponding to a known gene. (right) Labeled cDNA hybridizes to the slide, and the fluorescence emitted is a measure of gene expression.

A microarray is a laboratory tool used to detect the
expression of thousands of genes at the same time [38, 186]. They are microscope slides printed with thousands of tiny spots arranged in a grid, each spot containing a known, single-stranded DNA sequence from a gene (Fig. 3.1).
The DNA molecules attached to each slide act as probes to detect gene expression,
which is the set of messenger RNA (mRNA) transcripts expressed by a group of
genes.
To perform a microarray analysis, mRNA molecules are typically collected
from an experimental sample: the mRNAs are converted into complementary DNA
(cDNA), and labeled with a fluorescent dye. The sample is then allowed to bind
to the microarray slide, in a process called hybridization. If a particular gene is
very active, it produces many molecules of messenger RNA, thus, more labeled
cDNA which hybridize to the probes on the microarray slide and generate a very
bright fluorescent area. Genes that are somewhat less expressed produce fewer mRNA molecules, thus less labeled cDNA hybridizes, resulting in dimmer fluorescent spots. If there is no fluorescence, none of the mRNAs have hybridized to the DNA, indicating that the gene is inactive [38, 186].
Following hybridization, the microarray is scanned to measure the expression
of each gene printed on the slide, resulting in a fluorescence image such as the
one shown in Fig. 3.2. The output of the digital system that scans the fluorescence
image is a matrix of numbers which gives a quantitative value for the expression of
each gene in the various spots of the microarray. This is done with image-processing techniques, which help to segment the spots, remove noise, measure the quality of the spots, and quantify the signal. Clearly, this is a non-trivial task due to a wide variety of factors such as the strong noise present in the image, the difficulty of estimating background/foreground, the non-perfect alignment of the spots in an ideal grid, and others [221].
Fig. 3.2. Example of fluorescence image derived after microarray hybridization.
Usually, several hybridizations are carried out: the idea is to measure the expression of different samples, which can belong to one or several classes. For example, some samples could be collected from healthy individuals, and other samples
could be collected from individuals with a disease like cancer. The final result,
which is typically the data investigated with pattern recognition techniques, is a
gene expression matrix, in which a row corresponds to a gene, a column to a sample, and a given entry represents the expression level of that particular gene in
a given experiment (sample). Summarizing, a gene expression matrix is a combination of many different microarray experiments, each one arranged in a column,
measuring the expression levels of all genes (each one arranged in a row). An
example is shown in figure 3.3.
As a final comment, it is important to notice that, even if the microarray technology is currently the most widespread, emerging and more advanced technologies will eventually make microarrays obsolete. One worth mentioning is RNA-Seq [235], which makes it possible to investigate at high resolution all the RNAs present in a sample, characterizing their sequences and quantifying their abundances at the same time. In practice, millions of short strings, called "reads", are sequenced from random positions of the input RNAs. These reads can then be computationally mapped on a reference genome to reveal a "transcriptional map", where the number of reads aligned to each gene gives a direct count of its expression level [76].
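As an idea of how such counts arise, here is a minimal sketch (Python; the gene names are toy data and the read-to-gene mapping is assumed to have been produced by a dedicated aligner, which is not shown):

```python
from collections import Counter

# Toy example: each sequenced read has already been mapped to a gene.
aligned_reads = ["BRCA1", "TP53", "BRCA1", "GAPDH", "TP53", "BRCA1"]

# The expression level of each gene is the number of reads aligned to it.
expression = Counter(aligned_reads)
print(expression)   # Counter({'BRCA1': 3, 'TP53': 2, 'GAPDH': 1})
```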
3.2 Computational analysis of a gene expression matrix
The advent of the microarray technology has created a need for algorithmic approaches that extract information from gene expression data to create compact,
interpretable representations.

Fig. 3.3. A gene expression matrix (taken from [102]), where rows indicate genes, columns different experiments, and the color indicates the expression level.

Many computational tasks can be carried out to analyze this matrix: i) selection of differentially expressed or discriminative genes;
ii) classification of samples; iii) clustering of genes or samples (e.g. to identify pathological subtypes); iv) biclustering, i.e. simultaneous clustering of genes and samples. In the following, we will briefly review each of these.
Gene selection
A typical gene expression matrix contains hundreds of experiments and thousands
of genes. Depending on the task, gene selection techniques may represent an important class of preprocessing tools: such methods, by eliminating uninformative
genes, can reduce the dimension of the problem space, and alleviate the curse of
dimensionality issue [66, 89]. Moreover, in this context such operation may have
a large impact from the biological / medical point of view, because it can help
researchers to identify a stable and informative set of biomarkers for cancer diagnosis, prognosis, and therapeutic targeting [89,199]. Several approaches have been
proposed in the literature, ranging from simple filters based on variance or entropy up to complex methods which consider labels and concepts like redundancy
and relevance. A comprehensive and recent review on gene selection can be found
in [121].
Sample classification
The classification of samples is an important emerging clinical application of gene expression analysis, for example to distinguish different diseases according to different expression levels in normal and tumor cells. As introduced in the previous
paragraph, the major challenge is perhaps the curse of dimensionality; there are
very few samples in comparison to the number of genes analyzed, and many models
may not generalize well to new data despite excellent performance on the training set [219]. Furthermore, many of the features are irrelevant or redundant to
the problem being researched, making gene selection (described in the previous
paragraph) necessary even if an algorithm could handle the large quantity of data.
Other approaches perform feature extraction, which aims at “summarizing” the
numerous genes in the form of a small number of new components (often linear combinations of the original expression levels). Some examples are Partial Least Squares
(PLS) [34, 35], generalized Partial Least Squares [79], or Independent Component
Analysis (ICA) [217].
After dimension reduction, one can apply any classification method to the constructed components. Several reviews on the subject exist [36, 67, 111, 122, 144, 219]. Perhaps the most popular classification algorithm is the Support Vector Machine (SVM), which is suitable for classifying high dimensional data without suffering too much from the curse of dimensionality, and good performance in the gene expression scenario has been demonstrated in many studies [56, 83, 112, 219, 233].
Clustering
In this task, the goal is to subdivide a set of genes or samples in such a way
that genes (samples) with similar expressions along samples (genes) fall into the
same cluster, whereas dissimilar items fall in different clusters. Beyond simple
visualization, when clustering genes usually the goal is to identify functionally
similar genes, or to infer a functional role for unknown genes in the same cluster
[84]. When clustering samples, it may be employed to identify biologically relevant
structure in large data sets: in this context, the most employed clustering analysis
is perhaps hierarchical clustering, which allows the inference of possible group structure within the data [194] and, thus, has been used to indicate tentative new subtypes for some cancers [5].
Two reviews on the subject can be found in [110, 218].
Biclustering
In the clustering context, a recent trend is represented by the application of biclustering methodologies, namely clustering techniques able to simultaneously group
genes and samples [143, 183]; a bicluster may be defined as a subset of genes that
show similar activity patterns in a specific subset of samples.
This kind of analysis may have a clear biological impact in the gene expression
scenario, where a bicluster may be associated to a biological process that is active
only in some samples and may involve only a subset of genes. Different approaches
for biclustering gene expression data have been presented in the literature in the
past, each one characterized by different features, like computational complexity,
effectiveness, interpretability, optimization criterion and others (some reviews can
be found in [143, 183, 224]). Generally speaking, most biclustering methodologies have been obtained by adapting, tailoring and suitably combining existing clustering techniques [22, 74, 86, 248].
To computationally analyze a gene expression matrix, it is important to notice that most approaches do not consider that expression levels are essentially counts of mRNA molecules (gene products), even though such counts are very difficult to measure directly. This consideration can motivate the use of a bag of words approach, where a column in the gene expression matrix is interpreted as a numerical vector counting how many times each word/gene occurs in the sample. Another motivation is that genes and mRNA molecules in the cell have a "bag" structure: there is no obvious ordering of the genes in this context, therefore a bag of words representation does not destroy the structure of the objects (samples). Finally, there are probabilistic models for bag of words that have provided state of the art classification and clustering accuracy, as well as highly interpretable solutions. In particular, we introduced in Chap. 1 topic models, probabilistic models aimed at representing the various topics that a corpus of documents talks about.

For these reasons, the bag of words representation and its related models (in particular, topic models) appear to be a convenient tool for the gene expression data analysis problem.
3.3 Contributions
In this part of the thesis we report the contributions achieved in the field of gene expression data analysis. In particular, after postulating the bag of words representation for gene expression samples, we investigated the capabilities of topic models (which had never been investigated in the gene expression domain) in mining information from such data, employing them for a variety of tasks. The topics introduced by topic models capture fundamental information in this context, namely co-occurrence patterns of genes: they are latent modules that assign high probability to genes that tend to be highly co-expressed.
We faced the sample classification task by extracting topic model-derived feature vectors to be used in a discriminative setting with support vector machines. This results in a hybrid generative-discriminative scheme [99], where surrogates or by-products of the generative model learning are injected as features in a discriminative classifier. The proposed approach has been extensively tested on 10
different benchmark data sets (in other works, it is common to only evaluate 2-3),
by employing several different topic models, different ways of extracting feature
vectors from the trained topic models, different classification schemes, and different kernels. Obtained results, when compared to the state of the art, confirm
the suitability of this bag of words approach for the classification of gene expression data. Moreover, considerations on the interpretability of the obtained feature
descriptors have been provided, with the use of a real dataset involving different
species of grape plants. This work is described in chapter 4 and has been published
in [24].
Then, we made one step forward along the direction of modeling gene expression with topic models. In particular, we employed a more recent and sophisticated topic model called the Counting Grid [104, 175] to mine and extract an informative representation for a set of expression samples. The main motivation is that most topic models, even when they represent a proper choice, have a clear drawback: they assume that topics act independently of each other. While this assumption is often needed to simplify computations and inference, it may be too restrictive in the gene expression scenario, where it is known that biological processes are tightly co-regulated and interdependent in a complex way [126]. The Counting Grid model copes with the aforementioned limitation: the idea behind the model is that topics are arranged in a discrete grid, learned in such a way that "similar" topics are placed close together. Similar biological samples, i.e. those sharing some topics and active genes, will be mapped close together on the grid, allowing for an intuitive visualization of the dataset. We made a comprehensive evaluation of the model in the gene expression scenario, by i) visualizing on four different datasets how samples are embedded and clustered together on the grid, naturally separating between classes; ii) validating this claim numerically, also performing a thorough evaluation of the sensitivity to parameters; iii) proposing a novel methodology to highlight and automatically select genes particularly involved in the pathology or in the phenomenon of interest; iv) demonstrating that the model achieves state-of-the-art results for classification tasks.
4
Gene expression classification using topic models
This chapter describes how to employ the bag of words representation and topic
models for classification of gene expression data. First, motivations and considerations on the suitability of the proposed approach are presented, together with
a brief review of the models employed. Then, a classification scheme is proposed,
based on highly interpretable features extracted from topic models. An extensive experimental evaluation, involving ten different literature benchmarks, demonstrates the suitability of topic models for this classification task. Finally, we performed a qualitative analysis on a dataset involving grapevine plant expression data, which confirms the great interpretability of the proposed approach.
4.1 Topic models and gene expression
As already introduced, the basic idea underlying topic models is that each document may be characterized by the presence of one or more topics (e.g. sport,
finance, politics), which induce the presence of some particular words. From a
probabilistic point of view, the document may be seen as a mixture of topics. The
representation of documents and words with topic models has one clear advantage:
each topic is individually interpretable, providing a probability distribution over
words that picks out a coherent cluster of correlated terms. This may be really
advantageous in the gene expression context, since the final goal is to provide
knowledge about biological systems, and highlight possible hidden correlations.
As largely detailed in the previous chapters, the novel application of topic models in the gene expression scenario starts from the analogy that can be drawn between the pair word-document and the pair gene-sample: it is reasonable to interpret the samples as documents and the genes as words. In fact, each sample is characterized by a vector of gene expressions: the expression level of a gene in a sample may be easily interpreted as the count of a word in a document (the higher the level, the more present the gene/word is in the sample/document). This makes it possible to consider the expression matrix as a bag of words matrix, thus opening the possibility of exploiting all the tools developed for the bag of words representation.
representation. At this point it is important to notice that, contrarily to many
other bag of words applications where the word order is lost, expression levels
36
4 Gene expression classification using topic models
Topic = 'economics'
80%
Dictionary words
20%
Document st mostly talking about economics
Fig. 4.1. Intuitive representation of the PLSA model for document analysis.
have a natural “bag” structure: there is no obvious ordering of the genes in this
picture, therefore the bag of words representation does not alter the underlying
structure of samples.
Usually, topic models take as input a set of documents, each one containing a
set of words. The documents are summarized by an occurrence matrix, where each
entry indicates the number of occurrences of a given word in a given document.
In the same way, in the gene expression scenario the input is a set of T samples,
summarized by an expression matrix n(gn , st ) which measures the expression level
of the gene gn in the sample st . The dictionary is of size N , namely we have N
different genes appearing in the sample set, and the dictionary indexes these genes.
The simplest model employed in this chapter is called Probabilistic Latent Semantic Analysis (PLSA [94]). Even though this model was introduced in the text analysis community, in the next section we re-formulate its theory in order to deal with the gene expression scenario, assuming the analogies gene/word, sample/document, and expression-level/word-count.
4.1.1 Probabilistic Latent Semantic Analysis (PLSA)
In PLSA, the presence of a gene g_n in the sample s^t is mediated by a latent topic variable z ∈ Z = {z_1, ..., z_K}, also called aspect class. Intuitively, a topic may represent a biological process, which is active only in a subset of samples, and characterized by the high expression levels of a subset of the genes. The joint probability of the observed variables is

$$p(g_n, s^t) = \sum_k p(g_n, z_k, s^t) = p(s^t) \cdot \sum_k p(g_n|z_k) \cdot p(z_k|s^t) \qquad (4.1)$$
In other words, the topic z_k is a probabilistic co-occurrence of genes encoded by the distribution β_{z_k}(g) = p(g_n|z_k), g = {g_1, ..., g_N}; p(z_k|s^t) represents the proportion of the topics in the sample s^t; finally, p(s^t) accounts for the global expression pattern of the sample s^t (in the document scenario, this accounts for documents of different lengths). An intuitive visualization of the PLSA model is depicted in Fig. 4.1, whereas the bayesian network representation is shown in Fig. 4.2.

Fig. 4.2. Bayesian network representation of the PLSA model.

From this, the generative process for a sample s^t can be derived as follows. First, a topic z_k is drawn from the distribution p(z|s^t): a topic particularly present in sample s^t will more likely be selected. Then, a gene g_n is drawn from the distribution p(g_n|z_k), which is conditioned on the value assumed by z_k. Finally, the process is repeated, selecting another topic and another gene, until the whole sample is generated.
The hidden distributions of the model, p(g|z) and p(z|s), are learned using an
exact Expectation-Maximization (EM) algorithm. The EM iteratively learns the model by minimizing a bound F on the log-likelihood L (i.e. the probability of the visible variables p(g)) by alternating the E- and M-steps (F is the free energy defined in Chap. 2, Eq. 2.22). In this context, the data log-likelihood L is:

$$\mathcal{L} = \sum_{n=1}^{N} \sum_{t=1}^{T} n(g_n, s^t) \cdot \log p(g_n, s^t) \qquad (4.2)$$
For the E-step, p(z_k|s^t, g_n) can be obtained by looking at the bayesian network structure:

$$q_k^{(n,t)} \doteq p(z_k|g_n, s^t) = \frac{p(z_k|s^t)\, p(g_n|z_k)}{\sum_{k'} p(z_{k'}|s^t)\, p(g_n|z_{k'})} \qquad (4.3)$$
With this notation, the free energy for the PLSA model can be written as

$$\mathcal{F} = \sum_{n,t} n(g_n, s^t) \left[ \sum_k q_k^{(n,t)} \ln q_k^{(n,t)} - \sum_k q_k^{(n,t)} \ln p(s^t)\, p(z_k|s^t)\, p(g_n|z_k) \right] \qquad (4.4)$$
In the M-step, the minimum of the free energy is found by setting the various derivatives to zero. Three normalization constraints have to be accounted for, one for each hidden distribution. Thus, the free energy formula has to be augmented by appropriate Lagrange multipliers τ_k, ρ_t and φ, giving the following constraints:

$$\tau_k \cdot \left( 1 - \sum_n p(g_n|z_k) \right) = 0 \qquad (4.5)$$

$$\rho_t \cdot \left( 1 - \sum_k p(z_k|s^t) \right) = 0 \qquad (4.6)$$

$$\varphi \cdot \left( 1 - \sum_t p(s^t) \right) = 0 \qquad (4.7)$$
After eliminating the Lagrange multipliers, the M-step re-estimation equations can be obtained. For example, deriving w.r.t. p(g_n|z_k) and setting the derivatives equal to 0 leads to:

$$\begin{cases} p(g_n|z_k) \cdot \tau_k = \sum_t n(g_n, s^t)\, q_k^{(n,t)} \\ \sum_n p(g_n|z_k) = 1 \end{cases} \;\Longrightarrow\; \begin{cases} p(g_n|z_k) = \left( \sum_t n(g_n, s^t)\, q_k^{(n,t)} \right) / \tau_k \\ \tau_k = \sum_n \sum_t n(g_n, s^t)\, q_k^{(n,t)} \end{cases}$$

The final result, which gives the M-step equation for p(g_n|z_k), is:

$$p(g_n|z_k) = \frac{\sum_t n(g_n, s^t)\, q_k^{(n,t)}}{\sum_n \sum_t n(g_n, s^t)\, q_k^{(n,t)}} \qquad (4.8)$$
In a similar way, one can obtain estimates for p(z_k|s^t) and p(s^t). The other M-step updates are summarized as follows:

$$p(z_k|s^t) = \frac{\sum_n n(g_n, s^t)\, q_k^{(n,t)}}{\sum_k \sum_n n(g_n, s^t)\, q_k^{(n,t)}} \qquad (4.9)$$

$$p(s^t) = \frac{\sum_n n(g_n, s^t)}{\sum_t \sum_n n(g_n, s^t)} \qquad (4.10)$$
The E-step and the M-step equations are alternated until a termination condition is met, for example when the data log-likelihood changes very little across two consecutive iterations.
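For concreteness, the whole loop can be condensed in a minimal NumPy sketch (dense arrays, no numerical safeguards beyond a small epsilon; all names are illustrative):

```python
import numpy as np

def plsa(counts, n_topics, n_iter=200, seed=0):
    """Fit PLSA by EM on an (N genes, T samples) count matrix n(g_n, s^t)."""
    rng = np.random.default_rng(seed)
    N, T = counts.shape
    p_g_z = rng.dirichlet(np.ones(N), size=n_topics).T   # (N, K): p(g | z)
    p_z_s = rng.dirichlet(np.ones(n_topics), size=T).T   # (K, T): p(z | s)
    p_s = counts.sum(axis=0) / counts.sum()              # (T,):   p(s), Eq. 4.10
    for _ in range(n_iter):
        # E-step (Eq. 4.3): q[n, t, k] = p(z_k | g_n, s^t)
        q = p_g_z[:, None, :] * p_z_s.T[None, :, :]      # (N, T, K)
        q /= q.sum(axis=2, keepdims=True) + 1e-12
        # M-step (Eqs. 4.8 and 4.9)
        weighted = counts[:, :, None] * q                # n(g_n, s^t) q_k^{(n,t)}
        p_g_z = weighted.sum(axis=1)                     # (N, K)
        p_g_z /= p_g_z.sum(axis=0, keepdims=True)
        p_z_s = weighted.sum(axis=0).T                   # (K, T)
        p_z_s /= p_z_s.sum(axis=0, keepdims=True)
    return p_g_z, p_z_s, p_s
```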
It is important to note that t is a dummy index into the list of documents in the training set. In other words, s^t is a random variable with as many possible values as training documents: the model learns the topic mixtures p(z|s) only for those documents on which it is trained. For this reason, PLSA cannot be considered a well-defined generative model of documents; there is no natural way to use it to assign a probability to a previously unseen document.
However, once the model has been learned, one can estimate the topic proportions of an unseen sample. This is achieved by applying the learning algorithm while keeping the previously learned parameters p(g_n|z_k) fixed, and estimating p(z_k|s^t) for the sample at hand using Eq. 4.9.
As a final consideration, note that this is an admixture model [81]: one sample is made by the contribution of different topics, each one with its own proportion, contrary to the mixture of unigrams model presented in Chap. 2. This is particularly appropriate in the gene expression scenario, where a topic can be intended as a biological process. It is reasonable that many processes are active at the same time (with different levels) in a sample, and these processes influence the measured expression levels.
4.2 The proposed approach
As explained in the previous section, given the analogy between the pair word-document and the pair gene-sample, we can in general associate the expression matrix n(g_n, s^t) (e.g. coming from a DNA microarray experiment) to the count matrix of topic models, to be explicitly or implicitly used to train the topic model. Note that the framework can be employed with any topic model, not just PLSA (in fact, in the experimental evaluation we also considered some other topic models).
In order to classify samples, we propose to exploit a hybrid generative-discriminative scheme [26, 99, 120, 173], where the generative and the discriminative paradigms are merged together. Recall that generative approaches, such as topic models, are based on probabilistic class models and a priori class probabilities, learned from training data and combined via Bayes' rule to yield posterior probabilities. On the contrary, discriminative learning methods aim at learning class boundaries or posterior class probabilities directly from data, without relying on intermediate generative class models. Generative and discriminative classification schemes represent the two main directions for classifying data: each philosophy brings pros and cons; the latest research frontier aims at fusing them, following heterogeneous recipes [173]. In the hybrid generative-discriminative scheme adopted
here, the typical pipeline is to learn a generative model – suitable to properly
describe the problem –, and use it to project every object in a feature space (the
so-called generative embedding space), where a discriminative classifier may be
trained. In particular, the approach we employed here is realized as follows:
Generative model training
Given the training set, the topic model is trained as explained in the previous
section. Different schemes may be adopted to fit the best model (or set of models)
to the data, namely by learning one model per class, one per the whole dataset or
others – an interesting analysis has been reported, for the Hidden Markov Model
case, in [21]. Here we employ the basic one, namely training one single model for
all classes.
Generative embedding
Within this step, all the objects are projected, through the learned model, to a vector space. In this way, standard discriminative classifiers have been shown to achieve higher performance than a purely generative or a purely discriminative approach [99, 120, 173]. There are many embeddings that can be built on the generative model: a first and simple choice is to use the estimated topic posterior distribution. The intuition is that, since every topic may be approximately associated with a biological process (or with a set of processes [22]), the topic distribution p(z|s^t) characterizing a sample may indicate which processes are active in such a sample, and to which extent, thus representing a significant and possibly discriminant feature from a threefold perspective:
1. they provide a really interpretable representation of the microarray experiments, in terms of biological processes, as shown in section 4.4;
2. the dimensionality of the feature vector is reduced from the number of genes N to the number of topics K, with K ≪ N – thus providing a more compact and easy-to-manage representation;
3. finally, such descriptors represent multinomial distributions, which are suitable to be classified using kernels on probability measures (also called Information Theoretic Kernels and detailed in the following section) – which have been shown to be very effective in classification problems involving text, images, and other types of data (see [147] and the references therein); moreover, very recently, they have been shown to be very suitable for the hybrid generative-discriminative approach (see for example [25]).
Moreover, it is important to notice that this representation with the topic posteriors has already been successfully used in computer vision for classification purposes [33, 53], as well as in the medical informatics domain [42] (this being confirmed by our experimental evaluation).
More sophisticated choices are of course possible: here we investigated a possible extension of the proposed approach by employing a more complex descriptor called FESS (Free Energy Score Space [174]). In the FESS, the embedding is achieved via the unique decomposition into addends that compose the free energy of the model. For PLSA, it has been defined in Eq. 4.4, but we can re-write it in the following way:

$$\mathcal{F}(s^t) = \sum_n n(g_n, s^t) \cdot \sum_k p(z_k|g_n, s^t) \cdot \log p(z_k|g_n, s^t) \;-\; \sum_n n(g_n, s^t) \cdot \sum_k p(z_k|g_n, s^t) \cdot \log p(g_n, s^t, z_k) \qquad (4.11)$$
where the first term represents the (negative) entropy of the posterior distribution and the second term is the cross-entropy. As visible in Eq. 4.11, both terms are composed of K × N addends, and their sum is equal to the free energy. The idea of FESS is to decompose the free energy F(s^t) into its addends. For PLSA this results in a space of dimension equal to 2 × K × N; we will refer to this as FESS L3. In [174], the authors point out that, if the dimensionality is too high, some of the sums can be carried out to reduce the dimensionality of the vector. The choice of the addends to sum over is intuitive, but guided by the particular application. In our case, as previously done in [174, 178], we perform the sums over the gene indices, retaining the per-topic contributions. The resulting score space has dimension equal to 2 × K; we will refer to this space as FESS L2.
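As an illustration of how such a descriptor could be computed from a trained PLSA model, the following is a minimal sketch (Python/NumPy; this is one plausible reading of the per-topic sums, with illustrative names, not the reference implementation of [174]):

```python
import numpy as np

def fess_l2(counts_t, p_g_z, p_z_s_t, p_s_t):
    """FESS L2 descriptor for one sample: per-topic free-energy addends.

    counts_t: (N,) counts n(g_n, s^t) for the sample.
    p_g_z:    (N, K) matrix p(g | z) from a trained PLSA model.
    p_z_s_t:  (K,) topic proportions p(z | s^t) for the sample.
    p_s_t:    scalar p(s^t).
    """
    joint = p_g_z * p_z_s_t[None, :] * p_s_t        # (N, K): p(g_n, s^t, z_k)
    q = p_g_z * p_z_s_t[None, :]
    q /= q.sum(axis=1, keepdims=True) + 1e-12       # (N, K): p(z_k | g_n, s^t)
    # Per-topic sums over the gene indices of the two addend families.
    entropy = (counts_t[:, None] * q * np.log(q + 1e-12)).sum(axis=0)      # (K,)
    cross = -(counts_t[:, None] * q * np.log(joint + 1e-12)).sum(axis=0)   # (K,)
    return np.concatenate([entropy, cross])         # 2K-dimensional score vector
```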
In a few words, the FESS expresses how well each data point (i.e. sample) fits different parts of a trained generative model. It has been found that the FESS is highly informative for discriminative learning, yielding state-of-the-art results in several contexts [172, 174]. However, its suitability in the gene expression context had never been investigated.
Discriminative classification
In the resulting generative embedding space, any discriminative vector-based classifier may be employed. In this fashion, according to the generative/discriminative classification paradigm, we used the information coming from the generative process as input features for a discriminative classifier. As almost always in hybrid generative-discriminative schemes, Support Vector Machines (SVM [52]) are employed in the resulting generative embedding space.
4.3 Experimental evaluation
The suitability of the proposed classification scheme has been extensively tested on
ten different well-known datasets, briefly summarized in Table 4.1 (in the literature
usually only 2-3 datasets are employed).
As in many gene expression analyses, a beneficial effect may be obtained by selecting a subgroup of genes, in order to limit the dimensionality of the problem and to reduce the possible redundancy present in the dataset. Here we employed the Minimum-Redundancy Maximum-Relevance feature selection approach [63, 170]¹. In order to have a fair comparison with the state of the art, for every dataset we selected the best result in the literature (at least to the best of our knowledge) – they are reported in Table 4.1; we then used, in our experiments, the same number of genes used in that paper (when specified); if not specified, we retained 500 genes (as in several other papers [23, 177, 194]). For similar reasons, the cross validation protocols – again reported in Table 4.1 – have also been chosen by looking at the corresponding state of the art papers.
In the learning phase, the PLSA model has been built only on the training set.
Since the training procedure can converge to local optima of the likelihood, the
training has been repeated 20 times, starting from different random initializations,
retaining the model with the highest data likelihood.
The number of topics is a free parameter in topic models, and should be set in
advance. Different automatic techniques have been proposed in the literature to
set such a number, ranging from hold-out likelihood [194] to cross validation, from
a priori knowledge to probabilistic model selection methods – e.g. the Bayesian Information Criterion (BIC – [205]). Here we adopted a very simple scheme: starting
from the observation that topic models are adequate in finding clusters (they were designed as clustering techniques), we thought it reasonable to fix the number of topics as proportional to the number of classes (after a few trials, we found that three times the number of classes was a reasonable choice). Despite the simplicity of this rule, the obtained results were very satisfactory. An analysis of the performance of PLSA with respect to this parameter is discussed in the next section.
Table 4.1. Summary of the employed datasets. In particular, N represents the number of genes, T the number of samples, and C the number of classes.

Dataset    N, T, C          Citation   Test Protocol
leuk2      11225, 72, 3     [10]       5-fold CV
leuk1      5327, 72, 3      [87]       10-fold CV
11tumors   12533, 174, 11   [222]      5-fold CV
colon      2000, 62, 2      [7]        LOO CV
brain1     5920, 90, 5      [182]      4-fold CV
brain2     10367, 50, 4     [161]      10-fold CV
lung       12600, 203, 5    [19]       5-fold CV
nci60      7129, 60, 9      [196]      10-fold CV
prostate   10509, 102, 2    [210]      LOO CV
9tumors    5726, 60, 9      [220]      10-fold CV

¹ http://www.mathworks.com/matlabcentral/fileexchange/14916.
A final note on the training of the PLSA model: in some cases the raw expression matrix contains negative values, and cannot be used as-is as the count matrix of topic models (which requires non-negative values); therefore, a simple shifting step has been applied to the matrix in order to obtain non-negative values.
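A minimal sketch of such a shifting step (Python/NumPy; the function name is illustrative):

```python
import numpy as np

def shift_nonnegative(expr):
    """Translate an expression matrix so that all entries are non-negative."""
    return expr - min(expr.min(), 0.0)

expr = np.array([[-1.5, 2.0], [0.5, -0.2]])
print(shift_nonnegative(expr))   # every entry shifted up by 1.5
```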
Table 4.2. Classification errors of the proposed approaches for different datasets.

Method         Leuk2   Leuk1   11Tumors  Colon   Brain1  Brain2  Lung    NCI60   Prostate  9Tumors
PLSA Lin       0.0267  0.0286  0.0900    0.0968  0.0982  0.1000  0.0541  0.4800  0.0686    0.3200
PLSA JS        0.0267  0.0143  0.0583    0.0968  0.0982  0.1200  0.0447  0.3067  0.0490    0.2533
PLSA JT        0.0267  0.0286  0.0534    0.1129  0.0967  0.2200  0.0397  0.3067  0.0490    0.2733
FESS L2        0.0267  0.0278  0.0465    0.0833  0.0773  0.0778  0.0711  0.0963  0.0392    0.0516
FESS L3        0.0267  0.0417  0.0457    0.0833  0.0761  0.0778  0.0711  0.0963  0.0392    0.0556
LPD Lin        0.0133  0.0125  0.0947    0.1935  0.1205  0.1800  0.0585  0.2700  0.0588    0.3433
LPD JS         0.0133  0.0393  0.0671    0.1935  0.1429  0.1800  0.0673  0.2533  0.0588    0.3600
LPD JT         0.0133  0.0143  0.0840    0.1774  0.1101  0.1800  0.0578  0.2867  0.0490    0.3767
Our Best       0.0267  0.0143  0.0457    0.0833  0.0761  0.0778  0.0397  0.0963  0.0392    0.0516
Bayesian       0.0297  0.0143  0.0847    0.1452  0.0863  0.2800  0.0542  0.3433  0.0882    0.2933
Supervised TM  0.0810  0.0833  0.5866    0.0806  0.3318  0.3200  0.1541  0.6733  0.0588    0.5900
State-of-art   0.0520  0.0650  0.1350    0.1500  0.0620  0.1170  0.0240  0.2460  0.0150    0.0250
(Reference)    [60]    [135]   [162]     [219]   [46]    [250]   [234]   [91]    [246]     [219]
As almost always in hybrid generative-discriminative schemes, the classification accuracies have been computed using Support Vector Machines in the resulting generative embedding space – the parameter C has been selected using cross validation on the training set. As already discussed, beyond the standard linear kernel, we exploited the probabilistic nature of the feature vector by using different kernels on measures (also called information theoretic kernels [147]), which measure the similarity between probability distributions. It has been shown in
other contexts (see for example [25]) that such combination may be beneficial for
some hybrid generative-discriminative methods. In particular, here we employ the
standard Jensen-Shannon kernel (JS), based on the Jensen-Shannon divergence
between two distributions p1 and p2 :
$$JS(p_1, p_2) = H\!\left( \frac{p_1 + p_2}{2} \right) - \frac{H(p_1) + H(p_2)}{2} \qquad (4.12)$$
where H is the Shannon entropy. We also employed a more recent kernel, introduced by [147], which is based on a non-extensive generalization of the classical
Shannon information theory, and defined on (possibly unnormalized) probability
measures: the Jensen-Tsallis (JT) kernel, defined as:
$$K_q^{JT}(p_1, p_2) = \ln_q(2) - T_q(p_1, p_2) \qquad (4.13)$$

where $\ln_q(x) = (x^{1-q} - 1)/(1 - q)$ is the q-logarithm, and

$$T_q(p_1, p_2) = S_q\!\left( \frac{p_1 + p_2}{2} \right) - \frac{S_q(p_1) + S_q(p_2)}{2^q} \qquad (4.14)$$
is the Jensen-Tsallis q-difference and S_q(r) is the Tsallis non-extensive entropy, defined, for a multinomial distribution r = (r_1, ..., r_W), as

$$S_q(r_1, \ldots, r_W) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{W} r_i^q \right) \qquad (4.15)$$
The parameter q has been adjusted by cross validation on the training set. As for the FESS, after extracting the descriptors, we used in our experiments an SVM with the linear kernel.
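A minimal sketch of these two kernel quantities (Python with NumPy/SciPy; illustrative names, computing one pairwise entry at a time) could look as follows:

```python
import numpy as np
from scipy.special import xlogy

def shannon_entropy(p):
    return -xlogy(np.asarray(p, dtype=float), p).sum()   # xlogy(0, 0) = 0

def js_divergence(p1, p2):
    """Jensen-Shannon divergence between two distributions (Eq. 4.12)."""
    m = (np.asarray(p1) + np.asarray(p2)) / 2
    return shannon_entropy(m) - (shannon_entropy(p1) + shannon_entropy(p2)) / 2

def tsallis_entropy(p, q):
    """Tsallis non-extensive entropy (Eq. 4.15)."""
    return (1 - np.sum(np.asarray(p, dtype=float) ** q)) / (q - 1)

def jt_kernel_value(p1, p2, q):
    """Jensen-Tsallis kernel value (Eqs. 4.13-4.14)."""
    m = (np.asarray(p1) + np.asarray(p2)) / 2
    t_q = tsallis_entropy(m, q) - (tsallis_entropy(p1, q)
                                   + tsallis_entropy(p2, q)) / 2 ** q
    ln_q_2 = (2 ** (1 - q) - 1) / (1 - q)                # q-logarithm of 2
    return ln_q_2 - t_q
```

In practice, a full kernel matrix built from such pairwise values can be fed to an SVM that accepts precomputed kernels.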
Another interesting point of analysis is related to the different possible ways in
which topic models can be exploited for classification. Alternatives to our staged
scheme exist: in particular, here we compared our approach to a simple Bayesian
scheme – which trains one model per class and performs classification with the
Bayes rule –, and to the supervised topic models approach [148] – which explicitly
takes into account the labels in the training process².
Finally, we compared our classification results with the ones obtained using as a topic model the Latent Process Decomposition (LPD [194]), a generative model explicitly proposed for the microarray scenario to cluster genes. The LPD is inspired by topic models; however, in PLSA a word is generated by a multinomial distribution p(g_n|z_k), whereas in LPD the word-topic probability is modeled by a single gaussian (μ_{g_n,z_k}, σ_{g_n,z_k}), thus reflecting the continuous nature of the expression level.
All the obtained results are reported in Table 4.2, together with state of the art results. "Lin", "JS", and "JT" stand for the linear, Jensen-Shannon and Jensen-Tsallis kernels, respectively. "FESS L2" and "FESS L3" are the two variants of the FESS approach introduced before.
4.3.1 Discussion
As a general comment, from the table it can be argued that descriptors extracted from topic models are really effective for expression microarray classification.
When compared with literature, we can observe that our results are in line with the
state of the art. Moreover, in three cases (Brain1, Brain2 and 9Tumors), our best
result is substantially better than the state of the art. It is important to notice,
at this point, that we compared our results (obtained within a single framework)
with results obtained with many different techniques on different datasets, each
technique possibly tailored for the specific dataset (which are very different in
terms of composition and difficulty – see table 4.1).
Some more specific observations can be drawn from the table: in particular, by looking at the behavior of the different kernels, we can notice that a beneficial effect is obtained when exploiting the probabilistic nature of the feature vector through the information-theoretic kernels. When comparing PLSA with LPD, on average there does not seem to be a large difference in terms of accuracy, with some datasets slightly preferring PLSA. A possible explanation lies in the sensitivity of the LPD model to the choice of the number of topics. To investigate
2 The code can be found at http://cran.r-project.org/web/packages/lda/.
Fig. 4.3. Accuracies on the Leuk dataset by varying the number of topics. Results are
shown for PLSA and LPD with linear (left) and Jensen-Tsallis (right) kernels.
such behavior, we performed an exhaustive analysis on the Leuk2 dataset, by
varying such number from 3 to 30 (step 3). In Figure 4.3 the error curves are
displayed, employing the linear kernel (on the left) and the Jensen-Tsallis kernel
(on the right).
It seems evident from the plots that the accuracies for the PLSA vary little as the number of topics changes, whereas the LPD is more sensitive to this choice (when the number of topics is properly chosen, LPD outperforms PLSA). This holds both for the linear and for the JT kernels.
The accuracy, as well as the room for extension, of the proposed approach becomes evident when looking at the results obtained with FESS. It turned out that when the topic proportion descriptor alone is not discriminative enough (see for example NCI60 and 9Tumors), the FESS signature unravels this complexity, leading to excellent results (on the contrary, when the topic proportion feature vector already works well, only a marginal improvement is obtained by using FESS).
Finally, by comparing the different ways of exploiting topic models for classification (our approach, the Bayesian scheme, and the supervised topic models method), it seems evident that in problems with few classes a supervised topic model is a good choice, leading to very good results, whereas when the number of classes increases the per-class separation of the training set made by the Bayesian scheme is more appropriate. In general, nevertheless, our hybrid approach performs best, confirming the fact, shown in many other contexts, that this scheme is able to exploit the complementarity of the generative and the discriminative paradigms: generative models, which are better suited to describe data, are used to derive features, which are then classified by discriminative techniques, which are better suited to find decision boundaries.
4.4 The interpretability of the feature vector
In this section we demonstrate that the extracted p(z|s) vectors of the PLSA are highly interpretable. In particular, p(z|s) characterizes "how present" every topic is in a given sample, and we already posited that a topic may easily be associated with a biological process. Actually, by definition, a topic characterizes a subset of samples where the gene expressions are highly correlated. Therefore p(z|s) may be used to infer the different biological processes that are active over the different samples. It should be noted that the probability of the genes given the topic may also be very useful: it may be interpreted as the impact of the different genes on a particular biological process. Moreover, the probabilistic nature of these models also encodes the level of this impact, thus taking into account the well known fact that not all biological processes take place in every sample.
To show these characteristics we applied the proposed scheme on a real dataset, in a study conducted in collaboration with the Functional Genomics Lab at the University of Verona. The dataset included 48 samples (and 24676 genes) of microarray expressions of two grapevine species, V. vinifera and V. riparia, both subjected to infection with Plasmopara viticola, a pathogen responsible for a destructive disease. It is known that V. riparia is resistant to the pathogen, while V. vinifera is more susceptible to infection, and the study focused on understanding the molecular switches, signals and effectors involved in resistance [181]. In that paper, the authors reported a microarray analysis of early transcriptional changes associated with P. viticola infection in both susceptible V. vinifera and resistant V. riparia plants (12 and 24 h post inoculation). The same experiments were conducted with plants treated with water, a neutral agent used as control. We chose this dataset because it is complex and structured, and several class partitions can be highlighted: samples can be divided on the basis of the type of plant (V. vinifera or V. riparia), of the time point (12h or 24h), or of the pathogen/water treatment.
In the training phase, we employed the Bayesian Information Criterion (BIC [205]) to obtain a rough estimate of the best number of topics. In very few words, BIC adds a term that penalizes the likelihood of the model depending on the number of its free parameters; in this way, larger models – which do not lead to a substantial increase of the likelihood – are discouraged. A PLSA model was trained 50 times and the best model (in the likelihood sense) was retained. Using BIC, we found that the best values ranged in the interval [5 8]. Guided by the expertise of biologists, the number of topics was set to 6. Then, information has been
extracted from the topic/document and word/topic distributions. In particular,
in figure 4.4 we report on the left an intuitive bar-plot of the probability p(z|d)
(different rows correspond to different topics z), while the figure on the right
represents the functional categories, as analyzed by the biologists, of the most
important genes (found by looking at p(g|z)).
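For illustration, here is a minimal sketch of the BIC-guided selection step; plsa_loglik is a hypothetical training routine returning the model log-likelihood, and the free-parameter count assumes the standard asymmetric PLSA parameterization.

```python
import numpy as np

def bic_score(log_likelihood, n_free_params, n_observations):
    # BIC: penalize the likelihood with a term growing with model size,
    # so that larger models must "pay" for their extra parameters
    return -2.0 * log_likelihood + n_free_params * np.log(n_observations)

# Hypothetical sweep: retrain several times per topic number, keep the
# best run in the likelihood sense, and compare BIC across values of K:
# for K in range(3, 13):
#     best_ll = max(plsa_loglik(counts, n_topics=K) for _ in range(50))
#     n_params = K * (n_genes - 1) + n_samples * (K - 1)  # asymmetric PLSA
#     print(K, bic_score(best_ll, n_params, counts.sum()))
```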
Studying the composition of the dataset, we observed that it is rather accurately reflected by the p(z|d) distribution (on the left of the figure). Actually every
topic can reflect a different aspect of the dataset. For example, some topics show
groups of samples which are more correlated with the effects of treatment at the
Fig. 4.4. PLSA analysis. (a) Bar representation of the p(z|s) distribution for each of the
6 topics. The main classes are represented on the bottom of the figure. (b) Functional
category distribution of topic specific genes.
different time points, rather than with a specific reaction to the pathogen in comparison with the control (water). This is evident in the 3rd and 4th topics, which represent V. vinifera after 12 and 24 hours respectively, the former without pathogen inoculation and the latter infected. The last topic captures the processes of V. riparia 12 hours after the infiltration, in the first case with water, in the second with the pathogen.
From the specific disease-resistance point of view, the analysis confirmed the tendency towards a specific response in V. riparia. In fact, the 1st topic deals with samples related to infected V. riparia leaves at both time points (12 and 24 hours after infection). By looking at the genes which are most active in the 1st topic, biologists found that their distribution is particularly significant. In fact, important
functional categories among the involved genes (listed on the right side of figure
4.4) are carbohydrate metabolism and transport, in contrast with a strong contribution of photosynthesis-related gene expression in other topics. As previously
reported, the primary metabolic reprogramming underlies defense in biotrophic interactions in order to potentially supply both energy and precursors to implement
a defense mode.
It is also worth noting that, within topic 1, the same trend of the last 12 experiments is visible on the classes of V. vinifera subjected to inoculation (samples 12-24). This means that an activation of some genes – possibly involved in the response to the pathogen – is under way, but the response is too weak, explaining the susceptibility of the plant to P. viticola.
Concluding, all these observations qualitatively confirm the capabilities of the
proposed descriptors to encode different aspects of the dataset.
5
The Counting Grid model for gene expression
data analysis
In the previous chapter, we showed that topic models introduce an interesting intermediate level of representation, based on the concept of topic, which essentially expresses a co-occurrence pattern of genes: topics are latent modules that assign high probability to genes that tend to be highly co-expressed. However, a common assumption of most topic models is that these modules act independently of each other. While this assumption is often needed to simplify computations and inference, it may be too simplistic in the gene expression scenario, where it is known that biological processes are tightly co-regulated and interdependent in complex ways.
In this chapter we take a step forward – pursuing the bag of words and topic model philosophy while coping with the above-described limitation – presenting a novel strategy to extract an informative representation of a set of experimental samples through a generative model called the Counting Grid (CG [104]). The Counting Grid is a model for objects represented as bags of words, recently introduced for text mining [104] and image processing [175]. The key idea of topic models is still present: a document is abstracted into an intermediate representation of "topics", which are probability distributions over words that pick out coherent clusters of correlated terms. However, here topics are arranged on a discrete grid, learned in such a way that "similar" topics are closely arranged. Fig. 5.1 pictures this idea, and compares it with the PLSA model.
Similar biological samples, i.e. samples sharing some topics and active genes, will be mapped close together on the grid, allowing for an intuitive visualization of the dataset. More specifically, the CG seems to be very suitable in the gene expression scenario for the following reasons:
• The CG provides a powerful representation – successful in other fields [104, 171, 175] – which permits capturing the evolution of patterns in the experiments, which can be clearly visualized.
• The CG is well suited for data that exhibit smooth variation between samples. Expression values are biologically constrained to lie within certain bounds by purifying selection [106], and variation in only a few expression values can cause a pathology. This specific property of the data is captured well by the model.
Fig. 5.1. In the PLSA model, a document is an admixture of independent topics. In the example, document st is composed of the topics 'economics' and 'politics' with a 0.8:0.2 proportion. In the CG model, neighboring topics are similar, and a document is generated from one window in the grid. Traveling in any direction on the grid leads to a smooth topic transition.
• Last, but not least, it is possible (as preliminarily investigated in [177]) to achieve better classification accuracy with respect to other probabilistic approaches, as well as to the recent state of the art.
In this chapter we make a comprehensive evaluation of the CG model in the
gene expression scenario by providing the following main contributions.
1. By testing and visualizing different data sets, we show that samples belonging
to different biological conditions (such as different types of cancer) cluster
together on the grid.
2. We prove that the model is able to select genes involved in the pathology, or in the phenomenon which motivated the experiment, deriving a principled and well-founded way to extract the most important genes.
3. We show that the model achieves state-of-the-art results for classification tasks.
4. We evaluate the sensitivity of the model to parameters such as grid and window
size and the robustness of the model to overfitting.
Fig. 5.2. (a) The Counting Grid model. Closely mapped samples s1 and s2 share some
topics, as they have a common subset of genes particularly active. (b) Bayesian network
representation for the Counting Grid model.
Before detailing how these goals are achieved, in the following we will review the
Counting Grid model.
5.1 The Counting Grid model
We have shown in the previous chapter that, from a set of samples $s^t$, $t = 1, \ldots, T$, the PLSA topic model learns a small number of topics which correlate related genes particularly active in a subset of samples. However, there are no strong constraints on how topics are mixed, because they are assumed to be statistically independent. In the Counting Grid model, the distributions representing topics are arranged on a discrete grid.
Formally, the Counting Grid $\pi_{i,n}$ is a D-dimensional discrete grid indexed by $i = (i_1, \ldots, i_D)$, where each $i_d \in [1 \ldots E_d]$ and $E = (E_1, \ldots, E_D)$ describes the extent of the counting grid. Each cell represents a tight distribution over genes (indexed by $n$), so $\sum_n \pi_{i,n} = 1$ everywhere on the grid. A given sample $s^t$, represented by expression values $\{g_n^t\}$, is assumed to follow a distribution found in a window somewhere in the counting grid. In particular, using windows of dimensions $W = [W_1, \ldots, W_D]$, each bag can be generated by first averaging all expression levels in the hypercube window $W_k = [k \ldots k + W]$ – starting from the location $k$ (upper-left corner of the window) and extending in each direction $d$ by $W_d$ grid positions – to form the histogram $h_{k,n} = \frac{1}{\prod_d W_d} \sum_{i \in W_k} \pi_{i,n}$, and then generating the bag of genes from this averaged histogram. In other words, the position (upper-left corner) of the window $k$ in the grid is a latent variable given
which the probability of the bag of genes $\{g_n^t\}$ for sample $s^t$ is

$$ p(\{g_n^t\} \mid k) = \prod_n \left( h_{k,n} \right)^{g_n^t} = \prod_n \left( \frac{1}{\prod_d W_d} \sum_{i \in W_k} \pi_{i,n} \right)^{g_n^t} $$
Relaxing the terminology, we will refer to E and W respectively as the counting grid size and the window size, indicating with $W_k$ the particular window placed at location $k$. We will refer to the ratio of the grid and window volumes, $\kappa$, as the capacity of the model in terms of an equivalent number of topics, as this represents how many non-overlapping windows can be fit onto the grid. An example of a 2D grid is depicted in figure 5.2 on the left; on the right, the Bayesian network for the model is depicted.
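The window-averaging step that produces $h_{k,n}$ can be sketched as follows (our illustrative code, not the authors'); the grid is treated as a torus, as required by the learning procedure described below.

```python
import numpy as np

def window_histograms(pi, W):
    """pi: (E1, E2, N) grid of distributions over N genes; W: (W1, W2).
       Returns h of shape (E1, E2, N): h[k] is the mean of pi over the
       window whose upper-left corner is k, wrapping around the torus."""
    E1, E2, N = pi.shape
    h = np.zeros_like(pi)
    for k1 in range(E1):
        for k2 in range(E2):
            rows = np.arange(k1, k1 + W[0]) % E1   # toroidal indices
            cols = np.arange(k2, k2 + W[1]) % E2
            h[k1, k2] = pi[np.ix_(rows, cols)].mean(axis=(0, 1))
    return h
```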
To learn a Counting Grid, we need to maximize the log likelihood of the data:
$$ \log P = \sum_{t=1}^{T} \log \left( \sum_k \prod_n h_{k,n}^{\,g_n^t} \right) \qquad (5.1) $$
The sum over the latent variables k makes it difficult to perform assignment
to the latent variables while also estimating the model parameters. The problem
is solved by employing an EM procedure [81], which – as explained in Chap. 2 –
iteratively learns the model by minimizing the free energy F by alternating the E
and M-step. In particular, for the CG model the free energy F is equal to
$$ L \ge F = -\sum_t \sum_k q_k^t \cdot \log q_k^t + \sum_t \sum_k q_k^t \sum_n g_n^t \cdot \log \sum_{i \in W_k} \pi_{i,n} \qquad (5.2) $$
where $q_k^t = P(k|s^t)$ is the posterior distribution over the latent mapping onto the
counting grid of the t-th sample, and L is the data log likelihood. The E-step
estimates q, aligning all bags to grid windows to match the bags’ histograms. The
M-step re-estimates the counting grid π given the current q. To avoid local minima,
it is important to consider the counting grid as a torus, and perform all windowing
operations accordingly.
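For concreteness, here is a rough, unoptimized sketch of one EM iteration as we read Eq. (5.2); it reuses window_histograms from the previous sketch, and the M-step (redistributing each window's expected counts over the cells it covers, in proportion to the current π) is our reading of the procedure, not the authors' implementation.

```python
import numpy as np

def cg_em_step(pi, g, W, eps=1e-12):
    """pi: (E1, E2, N) grid; g: (T, N) gene counts; W: window size."""
    E1, E2, N = pi.shape
    h = window_histograms(pi, W)                     # (E1, E2, N)
    # E-step: q[t, k] proportional to exp(sum_n g[t, n] * log h[k, n])
    log_q = np.einsum('tn,xyn->txy', g, np.log(h + eps))
    log_q -= log_q.max(axis=(1, 2), keepdims=True)   # numerical stability
    q = np.exp(log_q)
    q /= q.sum(axis=(1, 2), keepdims=True)
    # M-step: each window redistributes its expected counts over the
    # cells it covers, weighted by the current pi; then renormalize.
    pi_new = np.full_like(pi, eps)
    for k1 in range(E1):
        for k2 in range(E2):
            rows = np.arange(k1, k1 + W[0]) % E1     # toroidal windowing
            cols = np.arange(k2, k2 + W[1]) % E2
            expected = q[:, k1, k2] @ g              # (N,) counts for this window
            block = pi[np.ix_(rows, cols)] * (expected / (h[k1, k2] + eps))
            for a, r in enumerate(rows):
                for b, c in enumerate(cols):
                    pi_new[r, c] += block[a, b]
    return pi_new / pi_new.sum(axis=2, keepdims=True), q
```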
5.2 Class embedding and biomarker identification
In this section the novel methodology aimed at extracting the most relevant and discriminant genes is presented. The starting point is the process which we called label embedding. As is often the case with gene expression, samples have an associated label that reflects, for example, a pathological subtype or a different tissue of the organism. Suppose that for each sample we are given a label $y^t = l$, $l \in \{1, \ldots, L\}$, representing its class index. Once a Counting Grid is learned and each sample is located on the grid (by looking at $q_k^t$), it is possible to obtain a posterior probability of each class $p(l|i) = \gamma_l(i)$ in each position $i$: this indicates which regions of the CG better "explain" the class labeled by $l$. This is achieved by using the posterior probabilities $q_k^t$ already inferred:
Fig. 5.3. (a) Label embedding γi . (b) Gradient of the embedding. (c) Counting grid for
a particular gene (πn ) and its gradient. (d) Fn,i
$$ \gamma_l(i) = \frac{\sum_t \sum_{k \,|\, i \in W_k} q_k^t \cdot [y^t = l]}{\sum_t \sum_{k \,|\, i \in W_k} q_k^t} \qquad (5.3) $$
(5.3)
where [·] is the indicator function, that indicates membership of an element in
the class. The output is 1 if sample st belongs to class l, 0 otherwise. Roughly
speaking, the main idea is to “average” all the mappings of the training samples
belonging to a given class. If the CG is able to capture the underlying behavior
of a specific class, then only a part of this averaged map will be different than
zero, possibly in a spatially coherent small region – the region which more likely
“explains” the training patterns of that class. In order to clarify this concept, in
Fig. 5.3 (a) we show the label embedding for the prostate cancer dataset [210],
which comprises two classes. In the figure the tumoral class is embedded. Please
observe that the active (non zero) locations are all grouped in spatially coherent
zones of the averaged map. Therefore, even if the labels are not used during the
learning of the CG, tumoral and non-tumoral samples are naturally separated
(since we are in a two class problem, the embedding of non tumoral class is simply
obtained by reversing this image); this suggests that indeed the CG is suitable to
describe the latent structure which generates the data.
As a second step, we compute the gradient of the embedding, $\nabla\gamma_i$, which returns information about where and how the classes separate – see Fig. 5.3(b). In this case the idea is to find the regions of the CG where the first class "translates" to the second class, or vice versa. Please note that in the two-class case we only need to compute the gradient on one map, since the map of the second class is just the complement of the first. The generalization to the multiclass case can simply be addressed by considering one-vs-all embeddings, although alternatives are possible.
As a final step, to get the gene score $F_n$, upon which we will base the strategy to rank genes, we evaluate how much the expression of the different genes varies along the borders between the classes. The idea is straightforward: to discriminate between the two classes, the most useful genes are the ones which vary most where the class transition occurs. For example, in Fig. 5.3(c) we show, for a particular gene $\hat{n}$, the map $\pi_{\hat{n},i}$, which represents where that gene is more expressed in the grid. We also show its gradient in each position (yellow arrows). Comparing with Fig. 5.3(b), one can see that $\hat{n}$ is mostly expressed in tumoral samples and often varies where a transition between tumoral and non-tumoral samples is present; this suggests that the gene is important for classification and related to the disease.
To capture this idea mathematically, we compute the directional derivatives of the $\pi_{n,i}$ in the direction $\vec{v}$ of the gradient of the class embedding, $\vec{v} = \nabla\gamma_i$, and we sum over all the locations $i$ in the grid. To reward the variation in expression more where the variation between classes is high, we also multiply by the modulus of $\vec{v}$.
In formulae, the feature score is equal to:

$$ F_n = \sum_i \left| |\vec{v}| \cdot \frac{\vec{v}}{|\vec{v}|} \cdot \nabla\pi_{n,i} \right| = \sum_i \left| \vec{v} \cdot \nabla\pi_{n,i} \right| \qquad (5.4) $$
In the formula, we take the absolute value because we regard as equally relevant genes which under-express in the transition to class $l$ and genes which over-express in the transition to class $l$. Fig. 5.3(d) shows that $F_{\hat{n},i} \ne 0$ only along the borders between the 2 classes.
$F_n$ represents the rank score of every gene, which permits ordering the genes from the most prominent (i.e. the one which varies the most in the direction of "transition" of the classes) to the least.
Summarizing, the proposed gene ranking approach consists of the following steps (described for the two-class case; generalizing to more classes is straightforward – a code sketch follows the list):
1. Training of the Counting Grid on the whole dataset (generative step, labels are not used)
2. Label embedding of the training samples of one class
3. Computation of the gradient of the map, which estimates the regions of the map where the transition from one class to the other occurs
4. Computation, in such zones, of the gradient of the genes
5. As a final score, each gene is ranked by its averaged variation in the direction where the two classes vary most.
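The sketch below illustrates steps 2-5 under our assumptions: a 2-D grid, q_cell holding for each sample the posterior mass accumulated at each cell (the inner sums of Eq. (5.3)), and np.gradient standing in for the discrete derivatives; all names and shapes are ours.

```python
import numpy as np

def label_embedding(q_cell, y, label):
    # Eq. (5.3): per-cell posterior of one class, averaging the mappings
    # of the training samples that carry that label
    mask = (y == label).astype(float)
    num = np.einsum('t,txy->xy', mask, q_cell)
    return num / (q_cell.sum(axis=0) + 1e-12)

def gene_scores(gamma, pi):
    # Eq. (5.4): F_n = sum_i |v . grad(pi_n)|, with v = grad(gamma), so
    # variation is rewarded where the class transition is sharp
    vy, vx = np.gradient(gamma)
    N = pi.shape[-1]
    F = np.zeros(N)
    for n in range(N):
        gy, gx = np.gradient(pi[:, :, n])
        F[n] = np.abs(vy * gy + vx * gx).sum()
    return F  # rank genes by decreasing F
```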
Fig. 5.4. Temporal profile of the cell density, as measured by OD at 600 nm and glucose
concentration in the media.
5.3 Example: mining yeast expression
To illustrate the main features of the proposed framework we present a simple example, where we studied a dataset by DeRisi et al. [61], measuring the expression of 6400 genes in Saccharomyces cerevisiae during the diauxic shift, a recurring cycle in the natural history of yeast that involves a shift from anaerobic (fermentation) to aerobic (respiration) metabolism.
In few words, if the yeast finds itself in a medium rich in sugar, it undergoes rapid growth fueled by fermentation, with the production of ethanol. When the fermentable sugar is exhausted, the yeast cells turn to ethanol as a carbon source for aerobic growth, as depicted in Fig. 5.4. This switch from anaerobic growth to aerobic respiration upon depletion of glucose is known to be correlated with widespread changes in the expression of genes involved in fundamental cellular processes [61]. In this particular experiment, expression values have been measured
Fig. 5.5. Yeast dataset embedding. Each time point is placed in a location of the grid,
highlighted in red (left part of the figure). There is a clear path connecting the dots: since
the most pronounced transition occurs between the 3rd and the 4th time points, in the
right part we show the class embedding γl (i) of samples l = {4, 5, 6, 7}.
at 7 different time points, as shown in Fig. 5.4. From our point of view, each time
point is a bag st = {gnt }, n = 1, . . . , 6400. As done in the previous chapter, we
performed a filtering of the genes1 , obtaining a final refined dataset of 310 gene
expression values at 7 time points.
We learned the CG using these 7 samples, setting the parameter κ to 4: specifically, we opted for a 12×12 grid with a 6×6 window for a clearer visualization. In the left part of figure 5.5 we provide a visualization of the mapping position on the learned CG of the 7 experiments – each red dot corresponds to the maximum of $q^t$, i.e. to the most probable position of a given time point $t$. The highlighted path connects the temporal transitions between the 7 time points, permitting a clear understanding of the dataset. By looking at this embedding, it seems that the most pronounced transition occurs between the 3rd and the 4th time points. Thus, we roughly divided the dataset into 2 classes: we can see the distribution of the "respiration" class (samples 4-5-6-7), i.e. the map $\gamma_l(i)$, in the right part of figure 5.5. From this map, we computed the gradient of $\gamma_l(i)$ (portrayed in figure 5.6), and identified the genes which vary the most along the direction of the gradient, as described in section 5.2. For example, the gene highlighted in the zoomed portion of the grid (figure 5.6) is gad1, which seems to rapidly activate during the transition from fermentation to respiration. This is in line with previous findings reported in the literature [193, 197].
We extracted the top 10 relevant genes using the framework described in section 5.2; they are reported in Tab. 5.1 (note that gad1 is indeed the most relevant gene). To prove that these genes are relevant from a biological point of view, we looked for terms in the Gene Ontology (GO) [11] which are highly over-represented among these 10 genes, with respect to all other terms pertaining to the remaining 300 genes2. Statistically significant (p < 0.05) terms are reported in table 5.2, and they are interestingly related to the synthesis of sugar and to the response to oxidative stress. The p-values are computed employing a chi-squared test with Benjamini multiple-hypothesis correction (more details can be found in [17]).
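Reference [17] describes the test in detail; the following is only our illustrative reading of it – a 2×2 chi-squared test per GO term, followed by a Benjamini-Hochberg correction, which we assume is the intended "Benjamini" procedure.

```python
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

def term_pvalue(hits_sel, n_sel, hits_bg, n_bg):
    # 2x2 contingency: term membership vs. selected/background gene set
    table = [[hits_sel, n_sel - hits_sel],
             [hits_bg, n_bg - hits_bg]]
    return chi2_contingency(table)[1]

# pvals = [term_pvalue(...) for each GO term]
# reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
```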
This simple example shows the main features of the proposed framework: i) the 7 experiments are projected onto the grid in a meaningful way, with a clear path which indicates the temporal evolution of the gene expressions; ii) by looking at the gradient of the class embeddings we can highlight genes which are responsible for the transition of the gene expressions from "fermentation" to "respiration", this being qualitatively confirmed by the GO analysis.
5.4 Experimental evaluation
To quantitatively assess the merits of the Counting Grid model in the gene expression scenario, we performed several experiments on three datasets widely employed
in the literature. The first one is a prostate cancer dataset by [62], containing the
1 Following http://www.mathworks.com/help/bioinfo/examples/gene-expression-profile-analysis.html
2 We carried out this analysis by employing the online tool GOstat: http://gostat.wehi.edu.au/.
Fig. 5.6. Derivatives computed on the map γl (i). On the right, a zoom of an area of the
CG where the gradient is high. The highlighted gene is the one which varies the most in
the gradient direction.
Table 5.1. Top genes selected with the proposed approach

Rank  Gene name         Description
1     gad1 (YMR250W)    Glutamate decarboxylase
2     hsp12 (YFL014W)   Heat shock protein
3     gsy1 (YFR015C)    Glycogen synthase
4     ygp1 (YNL160W)    Yeast glycoprotein
5     ctt1 (YGR088W)    Cytosolic catalase T
6     sam4 (YPL273W)    S-adenosylmethionine metabolism
7     gsy2 (YLR258W)    Glycogen synthase
8     sol4 (YGR248W)    6-phosphogluconolactonase
9     hsp30 (YCR021C)   Heat shock protein
10    pgm2 (YMR105C)    Phosphoglucomutase
Table 5.2. Statistically significant GO terms over-represented in the pool of the 10 selected genes.

GO          Description                    Genes                            p-value
GO:0005978  glycogen biosynth. process     gsy1, pgm2, gsy2                 0.0225
GO:0006979  response to oxidative stress   hsp12, ctt1, gad1                0.0235
GO:0006950  response to stress             hsp30, ygp1, hsp12, ctt1, gad1   0.0247
expression of 9984 genes in 53 different samples: 14 samples labeled for benign prostatic hyperplasia (BPH), three normal adjacent prostate (NAP), one normal adjacent tumour (NAT), 14 localized prostate cancer (PCA), one prostatitis (PRO), and 20 metastatic tumours (MET). The second is a lung cancer dataset [19], also employed in the experiments done in the previous chapter, consisting of 203 gene expression profiles from normal and tumour samples, with the tumors labelled as squamous, COID, small cell, and adenocarcinoma (5 classes in total). Finally, the
Fig. 5.7. CG embeddings for the three studied datasets.
brain tumor dataset [182] contains the expression levels of 7129 genes measured in 90 different patients classified in 5 classes (normal, primitive neuroectodermal tumor – PNET, atypical teratoid/rhabdoid tumors – Rhab, medulloblastoma, and malignant gliomas).
We reduced the dimensionality of the original data sets by retaining the top 500 genes ranked by variance. In the following, we first show that the model is able to properly embed the samples on separated parts of the grid, where different zones reflect different sample classes/conditions – this shows that the framework captures well the differences in gene expression related to different classes; then, we extract the most relevant genes with the approach of section 5.2, validating them from a medical point of view; finally, we report classification accuracies obtained by using descriptors extracted from the model, reaching state-of-the-art performance.
5.4.1 Embedding and clustering performances
Following the original recipe of [104], a single CG is learned using all samples (but
ignoring their labels). Data samples are embedded into the CG space: we show
some embeddings on a 15×15 grid (using a 3×3 window) in figure 5.7, to have an
immediate insight into the datasets. To evaluate how well samples cluster on the
grid, we resort to the external criterion of purity [145]. In few words, we leave out
one sample and estimate $\gamma_l(k)$ on the remaining data by employing Eq. (5.3). Then,
we assign a label to the test sample by computing
$$ y^{test} = \arg\max_l \sum_k q_k^{test} \cdot \gamma_l(k) \qquad (5.5) $$
The accuracy obtained with this nearest neighbor strategy is our purity score.
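In code, this leave-one-out purity can be sketched as follows (same assumptions and label_embedding helper as the earlier snippets; y and labels are assumed to be NumPy arrays).

```python
import numpy as np

def loo_purity(q_cell, y, labels):
    # Eq. (5.5): classify each held-out sample with the class embeddings
    # estimated on the remaining data, then score the fraction correct
    T = len(y)
    correct = 0
    for t in range(T):
        keep = np.arange(T) != t
        scores = [np.sum(q_cell[t] * label_embedding(q_cell[keep], y[keep], l))
                  for l in labels]
        correct += (labels[int(np.argmax(scores))] == y[t])
    return correct / T
```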
We considered CG dimensionalities from 1 to 5, testing systematically up to 40
complexities per dimension. Results, shown in figure 5.8, confirm the capability
Fig. 5.8. Purity results.
of the proposed framework to embed the different classes of each dataset in different regions of the grid; moreover, except for the Brain tumor dataset, it seems that the grid size and the choice of the capacity do not affect the clustering ability much (with only 1-dimensional counting grids being slightly worse). Interestingly, performances do not drop even for very large complexities, suggesting that the model is robust with respect to overtraining.
5.4.2 Qualitative evaluation of gene selection
Table 5.3. Top genes selected with the proposed approach

Prostate dataset (stability index: 0.891)
Rank  Gene name  Description
1     CTGF       Connective tissue growth factor
2     EGR1       Early growth response 1
3     AMACR      Alpha-methylacyl-CoA racemase
4     ATF3       Activating transcription factor 3
5     LUM        Lumican
6     MMP7       Matrix metalloproteinase 7
7     SPRY4      Sprouty (Drosophila) homolog 4
8     FOSB       FBJ murine osteosarcoma viral oncogene homolog B
9     FGG        Fibrinogen, gamma polypeptide
10    DCT        Dopachrome tautomerase

Lung dataset (stability index: 0.907)
Rank  Gene name  Description
1     GAPDH      Glyceraldehyde-3-phosphate dehydrogenase
2     MAPK3      Mitogen-activated protein kinase 3
3     IL13RA2    Interleukin 13 receptor
4     NCAM1      Neural cell adhesion molecule 1
5     TIE1       Tyrosine kinase
6     CYP2C19    Cytochrome P450
7     SLC20A1    Solute carrier family 20
8     YWHAE      Tyrosine 3-monooxygenase
9     ERF        Ets2 repressor factor
10    CXCR5      Chemokine (C-X-C motif) receptor 5

Brain dataset (stability index: 0.813)
Rank  Gene name  Description
1     MAPK3      Mitogen-activated protein kinase 3
2     CXCR5      Chemokine (C-X-C motif) receptor 5
3     TIE1       Tyrosine kinase
4     CYP2C19    Cytochrome P450
5     DUSP1      Dual specificity phosphatase 1
6     HINT1      Histidine triad nucleotide binding protein 1
7     MAPK11     Mitogen-activated protein kinase 11
8     RABGGTA    Rab geranylgeranyltransferase
9     EIF2AK2    Eukaryotic translat. initiation factor 2-alpha kinase 2
10    IL13RA2    Interleukin 13 receptor, alpha 2
In this section we provide a qualitative evaluation of the gene selection procedure, in order to understand if the most relevant genes extracted are significant
from a medical point of view. In the next section, a quantitative evaluation and
comparison with other state of the art methods is reported.
With the approach proposed in section 5.2, we extracted the 10 most relevant genes involved in a particular tumor class (metastasis for the prostate dataset, adenocarcinoma for the lung, and medulloblastoma for the brain). Ideally, the genes selected by our framework should not vary too much when varying the model capacity – thus confirming the results shown in section 5.4.1, figure 5.8. To investigate this aspect we ran the gene selection several times using CGs of different complexities, and validated the "stability" of the selected genes through the index proposed by [119]: this index takes values in the range [−1, 1], and the higher its value, the larger the number of commonly selected genes across different trainings of the algorithm. More in detail, given two sets of genes f1 and f2, the stability index is defined as follows:
$$ KI(f_1, f_2) = \frac{r - (s^2/N)}{s - (s^2/N)} \qquad (5.6) $$
where s denotes the signature size, r = |f1 ∩ f2 | and N is the total number of genes
in the dataset.
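In code, Eq. (5.6) is a one-liner (a direct transcription, assuming both signatures have the same size s):

```python
def stability_index(f1, f2, n_genes):
    # Eq. (5.6): overlap r corrected for the chance overlap s^2/N
    s = len(f1)
    r = len(set(f1) & set(f2))
    return (r - s**2 / n_genes) / (s - s**2 / n_genes)
```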
For every dataset, such stability index was never below 0.8, as reported in Tab.
5.3, confirming a preliminary investigation carried out in [136]. In Tab. 5.3 we
report the most frequently selected genes while varying the model complexity: on
these genes we carried out a detailed investigation in order to assess their potential
significance for cancer biology.
Prostate Cancer dataset
The top gene highlighted by the algorithm for prostate cancer is CTGF. CTGF belongs to the CCN protein family, which is involved in functions such as cell adhesion, proliferation, differentiation and apoptosis [107]. CCN family proteins have been identified as diagnostic and therapeutic agents for cancer [107]. Expression of CCN family proteins is altered in various cancers, including breast, colorectal, gallbladder, gastric, ovarian, pancreatic, and prostate cancers, gliomas, hepatocellular carcinoma, non-small cell lung and squamous cell carcinoma, lymphoblastic leukemia, melanoma, and cartilaginous tumors [45]. CTGF specifically has been shown to be involved in the invasiveness of cancer cells [45]. Similarly, it has been reported [72] that tumor angiogenesis and tumor growth are critically dependent on the activity of EGR1, the second top gene selected. The gene ATF3 codes for a transcription factor that affects cell death and cell cycle progression. There is some evidence [142] that this gene can suppress ras-mediated tumorigenesis. Lumican
levels in breast cancer are associated with disease progression and have been used
to predict survival ([228] reported that low levels of lumican are related to tumor
size), while FOSB has been found to drive ovarian cancer [206], and can be used
as a prognostic indicator for epithelial ovarian cancer. Finally, MMP7 has been
found to be involved in cancer metastasis and has been proposed to be used as a
target for drug intervention in cancer [240].
We also compared this result with the one obtained with the LPD model [194] described in the previous chapter. Interestingly, there is some overlap between our result and theirs: of the 6 genes we found to be correlated with cancer, their model was able to highlight CTGF and EGR1, although they were ranked lower.
Lung Cancer dataset
GAPDH expression was found to be strongly elevated in human lung cancer cells [227]. It is also correlated with breast cancer cell proliferation and aggressiveness [192]. IL13RA was found to be one of the genes that mediate the metastasis of breast cancer to the lung [151]. NCAM has been researched as a target for cancer immunotherapy, as it is expressed in small cell lung cancer, neuroblastoma, rhabdomyosarcoma, brain tumours, multiple myelomas and acute myeloid leukaemia. TIE1 is involved in angiogenesis, the creation of new blood vessels, which is an important process also in tumor progression [58]. The experimental deletion of this gene from mice inhibits tumor angiogenesis and growth [58]. YWHAE is correlated with survival in breast cancer: it was found to be enriched in metastatic tumor cell pseudopods [207], and is involved in the pathology of small cell lung cancer.
Brain Tumor dataset
MAPK3 belongs to a family of proteins that regulate cell proliferation, differentiation and cell cycle progression. It was shown to be a prognostic biomarker in gastric
cancer and implicated in the progression of hepatocellular carcinoma [114]. CXCR5
is a protein in the CXC chemokine receptor family, which plays a role in the spread
of cancer, including metastases [13]. TIE1 was implicated as a prognostic marker
for gastric cancer [131] and showed over-expression in breast cancer. DUSP1 is a
promoter of tumor angiogenesis, invasion and metastasis in non-small-cell lung
cancer [153] and plays a prognostic role in breast cancer [27]. HINT1 is a tumor
suppressor gene [249].
5.4.3 Quantitative evaluation of gene selection
We numerically assessed the performance of the gene selection approach presented in section 5.2 by performing a classification experiment on two benchmark datasets (namely Colon and Prostate – summarized in Tab. 5.4), employing only the genes selected with the proposed approach. We compare our results with state-of-the-art methodologies for gene selection. To have a fair comparison with the state of the art, we adopted the testing protocol of [245]: the data set was randomly split into 2/3 for training and 1/3 for testing.
Table 5.4. Summary of the datasets used

Name      N. Genes  N. Samples  Reference
Colon     2000      62          [7]
Prostate  6033      102         [210]
Table 5.5. Classification results (AUC) for the datasets used.

Colon dataset
                               Gene Signature Size
Gene Sel. Method       10      50      100     150     200
SVM-RFE [245]          76.4    77.5    79.2    79.4    80.1
Ens. SVM-RFE [245]     80.3    79.4    78.6    78.6    79.4
SW SVM-RFE [245]       79.5    81.2    78.4    76.2    76.2
ReliefF [245]          78.8    80.1    78.5    77.5    76.1
Ens. ReliefF [245]     78.9    80.2    79.1    77.3    76.1
SW ReliefF [245]       78.3    79.6    78.1    76.4    75.4
[2]                    85.0    86.0    87.0    87.5    86.5
Our method             81.38   89.53   89.64   89.25   88.97

Prostate dataset
                               Gene Signature Size
Gene Sel. Method       10      50      100     150     200
SVM-RFE [245]          89.8    91.3    92.1    92.1    92.2
Ens. SVM-RFE [245]     92.9    92.0    92.0    92.6    92.7
SW SVM-RFE [245]       93.4    91.3    90.0    90.7    91.2
ReliefF [245]          93.3    93.0    91.4    91.4    91.7
Ens. ReliefF [245]     93.4    92.4    91.4    91.0    91.9
SW ReliefF [245]       93.3    92.7    91.4    91.3    91.4
[2]                    95.5    96.0    95.0    94.0    94.0
Our method             78.21   88.30   92.45   94.99   95.73
More in detail, we employed the whole dataset to train a CG (of course, labels are ignored in this phase), from which we computed the Fn score for each gene; after that, only the top-ranked genes have been retained: in particular, we kept the top [10 50 100 150 200] genes. Then classification is performed using a linear SVM with parameter C = 1, using the area under the ROC curve (AUC) as an estimate of the classification performance. The test has been repeated 100 times, and the mean of the computed AUCs is shown in table 5.5, along with comparative state-of-the-art results (see the references between brackets). As for the Counting Grid size, we varied its dimensions by selecting κ between 5 and 40, reporting in the table the mean of the obtained AUCs.
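The protocol can be sketched as follows, with scikit-learn standing in for the SVM/AUC machinery; the gene scores F are assumed to come from the ranking of section 5.2, and the helper name is ours.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def signature_auc(X, y, F, size, seed):
    top = np.argsort(F)[::-1][:size]                 # top-ranked genes by F_n
    Xtr, Xte, ytr, yte = train_test_split(X[:, top], y,
                                          test_size=1/3, random_state=seed)
    clf = SVC(kernel='linear', C=1).fit(Xtr, ytr)
    return roc_auc_score(yte, clf.decision_function(Xte))

# mean AUC over 100 random splits, e.g. for a 50-gene signature:
# np.mean([signature_auc(X, y, F, 50, seed) for seed in range(100)])
```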
From table 5.5 it is evident that the proposed approach produces results comparable with, and in many cases superior to, state-of-the-art techniques. Furthermore, when looking at the stability, we can observe that our approach is very competitive: the obtained indices are shown in Table 5.6. Since the proposed approach is aimed at explaining the data through a generative model, with labels used only later on, the stability index is very high: for both datasets and all signature sizes, it is always above 0.9, while the best result found in the references we used for comparison is 0.78.
5.4.4 Classification results
As a last experiment, we employed the Counting Grid in a classification setting.
We followed the standard hybrid generative-discriminative recipe explained in the
Table 5.6. Stability of the proposed approach

Colon dataset
                               Gene Signature Size
Gene Sel. Method       10      50      100     150     200
Best [245]             78.00   75.00   70.00   69.00   67.00
[2]                    65.00   59.00   58.00   61.00   62.00
Our method             94.32   92.40   91.73   90.79   90.53

Prostate dataset
                               Gene Signature Size
Gene Sel. Method       10      50      100     150     200
Best [245]             68.00   65.00   68.00   68.50   69.00
[2]                    72.00   72.00   73.00   72.00   71.00
Our method             90.04   94.36   95.60   95.73   96.37
previous chapter [173]: the idea is to characterize every sample with a feature
vector obtained from the learned CG, so that samples are projected in a highly
informative space where standard discriminative classifiers such as Support Vector
Machines (SVM) can be used.
In our experiments, we employed two strategies, both based on the definition of a kernel to be used with an SVM classifier. In few words, in the former case [171] the kernel is defined on the basis of a geometric reasoning on the grid of the learned CG, and is called the Spreading Similarity Measure:
$$ SSM_S(s_1, s_2) = SM(q_k^1 \ast S_W, \; q_k^2 \ast S_W) \qquad (5.7) $$
In particular, we used the variant with the Histogram Intersection kernel:
$$ SM_{int}(a, b) = \sum_{i=1}^{K} \min(a_i, b_i) \qquad (5.8) $$
The second kernel employed is the Fisher Kernel [99], whose derivation in the CG case has been proposed in [177]. In the original formulation, the authors first define the Fisher score for a gene, $FS_{k,n}^t$:

$$ FS_{k,n}^t = g_n^t \cdot \sum_{i \in W_k^t} \frac{q_i^t}{h_{i,n}} \qquad (5.9) $$
and the concatenation of the F S, computed for all genes n, comprises the Fisher
score for a sample. Then, the standard linear kernel is computed from these Fisher
score vectors.
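A sketch of the two kernels under our assumptions follows: 2-D posterior maps q on a toroidal grid, a box filter standing in for the spreading window S_W, and Fisher score vectors assumed to be precomputed via Eq. (5.9).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spreading_hi_kernel(q1, q2, spread=3):
    # Eqs. (5.7)-(5.8): spread each posterior map over the grid, then
    # take the histogram intersection of the two smoothed maps
    a = uniform_filter(q1, size=spread, mode='wrap')   # toroidal smoothing
    b = uniform_filter(q2, size=spread, mode='wrap')
    return float(np.minimum(a, b).sum())

def linear_fisher_kernel(fs1, fs2):
    # standard linear kernel on concatenated per-gene Fisher scores
    return float(np.dot(fs1.ravel(), fs2.ravel()))
```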
These two classification strategies have been applied to the three datasets3.
3 Since the prostate dataset has never been studied for classification (some classes have too few samples), to have a comparison with the literature we used another widely employed prostate cancer dataset [210], which contains expression profiles of 102 samples (2-class problem).
Accuracies have been computed using the dataset authors' protocols: Leave-One-Out for the Prostate dataset, 5-fold cross-validation for the Lung dataset, and 4-fold cross-validation for the Brain dataset.
The best result obtained by varying the complexity of the grid is reported in table 5.7. In order to have a clear insight into the gain obtained by explicitly considering the relations between topics, as done in the CG case, we applied the same hybrid classification strategies to the PLSA, in the same way described in the previous chapter; finally, we compare our results to those obtained with the LPD – we took the results from Tab. 4.2 of Chap. 4. On two datasets out of three, the CG model (equipped with the Fisher kernel) was able to outperform the other topic model-based approaches, as well as the current state of the art (taken from [177]).
Table 5.7. Classification results

             HI PLSA   HI CG   Fisher PLSA   Fisher CG   LPD     Best SoA   Reference
Prostate     0.826     0.773   0.921         0.940       0.951   0.982      [234]
Lung         0.911     0.918   0.938         0.959       0.942   0.938      [46]
Brain        0.858     0.869   0.862         0.900       0.890   0.865      [162]
Final remarks on gene expression analysis
In this part of the thesis, we explored the potential of topic models for the classification and interpretation of gene expression data. Looking at the pipeline of the bag of words paradigm presented in Chap. 2 (Fig. 2.1), this part contributed to all stages of the pipeline: from a methodological point of view, by casting the gene expression scenario in the bag of words framework; from an applicative point of view, it particularly contributed to the "How to model" stage, by tailoring and applying topic models to solve the classification task, and by motivating the use of the very recent CG model to mine knowledge from gene expression data. In particular, Chap. 4 proposed a classification scheme based on highly interpretable features extracted from topic models. This resulted in a hybrid generative-discriminative approach for classification. An extensive experimental evaluation, involving 10 different literature benchmarks, confirmed the suitability of topic models for classifying this kind of data. Finally, a qualitative analysis of grapevine expression suggested the great expressiveness of the proposed approach. Chap. 5, building upon the motivations and the results obtained so far, investigated the Counting Grid model as a tool applicable to different analyses of a gene expression matrix, particularly suited because it models the smooth changes in gene expression over time or over different samples. The model provides an intuitive embedding, where samples effectively cluster together according to the patterns that categorize them in one or several tumoral classes. Also, as a novel methodological contribution, we employed the model to perform gene selection. Finally, we assessed its capabilities in a classification setting. All these merits have been extensively tested: results demonstrate the suitability of the model from a twofold perspective: numerically, by reaching state-of-the-art accuracies in classification and gene selection experiments; clinically, by realizing that many of the selected genes are potentially significant for cancer biology.
Part II
HIV modeling
6
Introduction
Understanding the human immune system, namely the ways in which the body protects itself from diseases and infections, is one of the most challenging topics in biology. There is a very broad branch of the biological sciences – immunology – devoted to the study of an organism's defense (immune) system. As in many other contexts in the life sciences, recent technological advances have drastically transformed this research field: the sequencing of the human genome has produced increasingly large volumes of data relevant to immunology research; at the same time, huge amounts of functional and clinical data are being reported in the scientific literature and stored in clinical records. Thus, computational methodologies can be immensely useful for extracting meaningful representations from such data, and for capturing correlations between the pathogen's actions and the defense system's reactions, which are still largely unknown. One of the most challenging scenarios is perhaps understanding these correlations in the case of the human immunodeficiency virus (HIV), which ranks among the most deadly infectious epidemics ever recorded [239]. HIV is a severe disease that targets the immune system and destroys the ability of a person to react to other opportunistic infections.
In this part of the thesis we focus on two aspects of HIV that can be computationally analyzed from a bag of words perspective. The first one concerns the
problem of determining a correlation between a patient's HIV status and a phenomenon called antigen presentation. In few words, cells present on their surface disordered fragments of their proteins. If the cell is infected, some of these fragments will belong to the foreign pathogen, and will be detected by specialized
receptors called TCRs. As these fragments do not appear in a particular spatial
organization on the surface, the immune system effectively sees the infection as a
bag of molecules, based on whose counts the action needs to be taken. In this context the bag of words approach seems particularly suited, with all of the aspects
of the bag of words pipeline described in Chap. 2 involved. Such a perspective
has never been adopted in this context. We will show that through the bag of
words representation and models proposed in Chap. 7 we can predict the severity
of the HIV infection in a person. The second aspect regards the analysis of counts
of TCRs derived from a set of sequencing experiments. Even if it is known that
the main consequence of HIV is depleting the types and counts of TCRs, some
methodological and applicative research lines are still open. In particular, we focused
on the reliability of observed counts, and on robust estimation of TCR population diversity. These analyses, missing from the literature, can have a profound clinical impact, as shown in Chap. 8.
6.1 Background
The collection of cells, tissues, and molecules that mediate resistance to infections is called the immune system, a highly sophisticated machinery that recognizes and responds to possibly harmful substances called antigens. The term antigen is very generic, and refers to any substance that causes an immune response; in particular, disease-causing antigens are called pathogens [1]. In order to understand the computational problems faced in this part of the thesis, we will briefly introduce some biological background. In particular, we will describe, in a very simplified fashion, how a specific part of the immune system is able to recognize pathogen-infected cells.
A central role is played by T cells, a type of white blood cell that circulates around the body, scanning for cellular abnormalities and infections. T cells cannot "see" inside cells to detect an infection, but rely on a phenomenon called antigen presentation. In very few words, most cells present on their surface short fragments derived from cellular proteins, as a means of advertising their state to
the immune system. These fragments, called epitopes, are the part of an antigen
– both foreign and the cell’s own – that are effectively recognized by the immune
system [1]. Moreover, the epitope transport from inside the cell towards the surface
is mediated by special molecules called major histocompatibility complex, or MHC
(which in humans is called human leukocyte antigen – HLA). MHC/HLA molecules
are proteins that provide a recognizable scaffolding to present an antigen to the
T cell. Thus, T cells only recognize an antigen if it is carried on the surface of a
cell [4]. Finally, the specialized receptors on T cells that perform such recognition
are called T cell receptors (TCRs).
The input to the cellular immune surveillance, summarizing the concepts described, is illustrated in Fig. 6.1. We show a simplified illustration of an infected
cell which expresses both self (black) and viral (red) proteins (Fig.6.1a). MHC
molecules bind to a small fraction of peptides from these proteins, the epitopes.
Inside these MHC complexes, epitopes are transported to the surface of the cell,
where they may be “spotted” by the T cells through the specialized TCRs. As the
sampled epitopes do not appear in a particular spatial organization on the surface
(Fig. 6.2a), the immune system effectively sees the infection as a bag of MHC
molecules loaded with different viral epitopes. Depending on the application, this
representation may be further simplified into a bag of epitopes (Fig.6.2b), under
the assumption that the main effect of the MHC molecules is the epitope selection
(e.g. choosing conserved vs non-conserved targets [93]).
The challenge of this recognition is that the organism cannot predict the precise
pathogen-derived antigens that will be encountered. For this reason, the immune
system relies on the generation and maintenance of a diverse TCR repertoire.
In other words, TCR diversification must have evolved to keep up with emerging pathogens, to cover most of the antigenic universe with corresponding receptors [159]. This diversification occurs through a complex genetic mechanism called
Fig. 6.1. (a) MHC type I binds to a fraction of proteins and exports them to the cell
surface, where these sampled peptides appear without particular order. In this cartoon
image only 3 different MHC I molecules are present. (b) Specialized T cells recognize
epitopes through receptors called TCRs.
Fig. 6.2. (a) Epitopes appear on the cellular surface without particular order. (b) A bag
of epitopes and its relative counts cz .
VDJ recombination, which consists of nearly random mutations in the TCR gene sequence. In particular, the TCR genomic region includes a large number of variable (V), diversity (D) and joining (J) gene segments that are used to produce functional TCRs. A simplified view is given in Fig. 6.3, showing how VDJ recombination leads to the construction of a functional TCR gene. Moreover, at recombination junctions some bases can be randomly added/subtracted [14, 59]. For this reason, immune receptors are extraordinarily difficult sequencing targets, because any given receptor variant may be present in very low abundance and may differ by only a single nucleotide. Through VDJ recombination the enormous repertoire of antigen receptors is generated, providing the versatility that is essential to normal immune functioning: in fact, it has been observed that around $10^6$ distinct TCR molecules can be generated by VDJ recombination, although estimates of the potential repertoire are around $10^9$ [236]. T-cell diversity contributes
to immune defense in two ways: on one hand, it provides an initial pool from which
the best and most efficient T cells will be selected to attack the pathogen; on the
Fig. 6.3. As a TCR develops, it rearranges its DNA by randomly choosing different
segments of the V, D, and J regions, cutting them out and pasting them back together
in random combinations.
other hand, it provides the flexible TCR reserve should the pathogen attempt to
escape by mutation.
Measuring the diversity and quantity of TCRs is a crucial task in immunology, as
the concentration of T-cells is a general predictor of the likelihood of opportunistic
infections. For example, a sharp rise in the incidence of otherwise rare infections has
been observed when counts of T cells fall below 200 cells/µL of blood [152]. This is
exactly the target of the Human immunodeficiency virus (HIV), which binds and
infects T-cells. As the disease progresses, the number of T cells declines below its
normal level, destroying the ability of the patient to mount an immune response. As
a consequence, the patient becomes hypersusceptible to opportunistic infections –
often fatal – by pathogens. The severity of HIV viral infection is generally measured
by clinicians with the so-called viral load, a quantity that reflects the number of
virus particles in a milliliter of blood.
6.2 Contributions
In the above described context, this part of the thesis addresses two problems: the
first one is to investigate the correlation between the bag of epitopes presented
on the cell surface and the patient's HIV status; the second one is related to a
robust statistical analysis of the counts of TCRs in HIV infected patients, aimed
at discovering how the progress of the disease affects different kinds of patients.
More in detail, the first contribution of this part is arguing for new applications of the bag of words paradigm as a set of tools for capturing correlations in
the immune target abundances in cellular immune surveillance. Consequently, we
propose a novel way of modeling bags of words in this context which i) treats
observed epitope abundances as counts and ii) moves away from the traditional
componential structure towards a spatial embedding that captures smooth changes
in antigen presentation. We promote the use of topic models to capture cellular
presentation, and more generally the view that the immune system has of the invading pathogens. Furthermore, we demonstrate that the newest of these models,
the counting grid employed in Chap. 5, seems to be especially well suited to this
task, providing stronger predictions than what can be found in the biomedical literature. In the experimental section, we restrict ourselves to the analysis of the links between
the HIV viral load and the patients' HLA types, leading to significant improvement
with respect to the state of the art.
As a second contribution, we analyzed the counts of TCRs derived from a set of sequencing experiments in collaboration with the David Geffen School of Medicine (UCLA). We developed a framework that allowed clinicians to assess the diversity of TCR populations in healthy, diseased, and perinatally infected patients (i.e. youths that contracted the disease in the womb because the mother was infected). Moreover, we looked from a methodological point of view at the reliability of the observed counts: as mentioned before, many TCRs differ by only a single nucleotide, and are observed very few times (in most cases, only once).
A final comment: most of the work presented in this part of the thesis has been set up during an internship abroad at Microsoft Research, Redmond (US), in the eScience group under the supervision of N. Jojic.
7
Regression of HIV viral load using bag of words
This chapter discusses the problem of modeling the immune response to HIV from a bag of words perspective. In particular, we fully exploit the pipeline of Chap. 2 with the final goal of regressing the HIV viral load (i.e. a number that reflects the severity of the infection) starting from a bag of words representation of epitope sets.
In the previous chapter we introduced the mammalian immune system as a number of interacting subsystems employing various infection-clearing paths, with antigen presentation playing a central role in many of them. Moreover, we discussed that the immune system needs to recognize a virus not as a whole but as a set of disordered viral epitopes (a "bag" of epitopes). Like the gene expression context, this is another example in computational biology where the bag of words representation seems particularly suited, because the structure of the objects in the problem is truly unknown, rather than just sacrificed for computational efficiency.
This chapter has a dual purpose: i) it argues for the application of the bag of words paradigm as a set of tools for capturing correlations between the epitope abundances in HIV patients and their viral load; and ii) it demonstrates that it is possible to effectively model the bag of words representation, by deriving a novel regression method – based on the counting grid [104] – that provides stronger predictions than what can be found in the biomedical literature.
In the remainder of this chapter (after a brief review of the state of the art), we first explain how to extract and model the bag of words representation from epitope sets: in the explanation, we address every step of the pipeline proposed in Chap. 2. We then report an experimental evaluation, leading to a significant improvement with respect to the state of the art.
7.1 State of the art
Explaining the differences in viral load among HIV patients is a crucial problem investigated by researchers in the HIV community (e.g., [6, 149]). The hypothesis is that the variation in epitope presentation across patients is expected to be reflected, at least to some extent, in the variation in viral load [93, 154]. In particular, early studies showed that changes in viral load occur in synchrony with the emergence of new epitopes in immune assays (e.g., [6, 149]).
Table 7.1. Comparisons with the biomedical and computational biology literature. The percentage of viral load (VL) explained is the square of Pearson's linear correlation coefficient ×100 (see Tab. 7.2).

Reference      Major result
[154]          VL considered too noisy; associations with mutations found
[113]          1-2% of VL variance explained through individual allele association
[93]           4% of VL variance explained by targeting efficiency
[96]           4.3%-9% of VL variance explained by combinations of epitopes
Bag of words   Up to 13.5% of VL variance explained by embedding into the CG
However, in the case of the highly polymorphic HIV, a handful of epitopes usually fails to control the infection, and so researchers turned to population studies in search of optimal immune targets. These early studies [6, 149] failed to detect significant links between patients' epitopes and viral load, mainly because the straightforward statistical approaches could not handle small dataset sizes (typically around 200 patients or fewer). However, in [154], evidence of an association between viral mutations and patients' epitope types was recognized. Generally speaking, the viral load is highly variable and may depend on numerous factors, such as gender, age, prior infections, and the general health of the individual. Yet, any statistically significant result has been seen as having important consequences for HIV research.
Eventually, larger cohorts allowed researchers to clearly assess the link between
epitope types and viral load. For example, in [113], certain epitope types were found
to strongly associate with low viral load in a cohort of over 700 HIV patients in
southern Africa. In these studies, despite the statistically strong associations, the
viral load in positive and negative patients still had such large variance that each
of these epitope types alone could only explain less than 2% of the total log viral
load variance in the population. For these reasons, computational methods that capture these correlations and are able to regress the viral load value for different patients have assumed great importance in recent years.
Tab. 7.1 provides an immediate insight into the state of the art of the computational methods that have faced this task. To put these numbers into perspective, it is important to make two observations. First, even weak signals, e.g. in [113, 154], had the tendency to move the entire field, as valuable characteristics of the interaction between HIV and the host immune system were revealed, informing both the research on HIV drugs and the research on HIV vaccines. Second, in addition to the high variation of the viral load due to factors that relate to age and general health, it is known that the set-point viral load depends strongly on the infecting strain (see [6] for a recent study); and as HIV was found to mutate in its reactions to HLA presentation, this variation in fitness of the infecting strains may itself be due to the HLA pressure from previous hosts. Thus the increase in explanatory power of HLA types, from around 4% of the log viral load to around 13.5%, is potentially of great importance. Further analysis of selected combinations of features in the counting grid may lead to further advances in understanding the evolutionary arms race between HIV and the human immune system.
7.2 The proposed approach
Our problem is to model epitope sets with the bag of words paradigm, in particular
to show that it is possible to perform regression and find correlations between this
bag and the HIV viral load. In order to do so, it is necessary to i) extract a
dictionary of possible epitopes that can be generated from the viral proteins; ii)
count the abundance of these epitopes (this task is particularly difficult because
there is no technique able to directly measure such counts); iii) choose a model that
is able to capture epitope co-presentation and perform regression. In the following
sections each step is addressed.
7.2.1 What to count
The first observation to be made is that the concentration of any viral epitope on the cellular surface depends on the source protein's expression level. In the following, we will denote a viral protein sequence as $S = s_1 s_2 \ldots s_L$. Moreover, it has been shown that most epitopes transported to the surface by HLA molecules are of length 9 [198].
For these reasons, we identified as words the 9mers of protein sequences. In principle, the dictionary should be composed of all possible 9mers, leading to a dictionary of size $20^9 = 512 \cdot 10^9$. Instead, we opted for a more data-driven dictionary: given a sequence $S = s_1 s_2 \ldots s_L$ corresponding to a viral protein, we extracted all possible overlapping 9mers observed in the sequence:

$$W = \{w_1, w_2, \ldots, w_l, \ldots, w_{L-9+1}\}$$

where $w_l = [s_l \ldots s_{l+9-1}]$ is the $l$-th 9mer extracted from the sequence. In particular, we considered three essential HIV proteins in isolation, whose sequences are known in the literature: GAG (core structural protein), POL (reverse transcriptase, used by the virus to integrate itself into the human genome), and VPR (essential for the replication of the virus). In this way, the dictionary is composed by listing all unique 9mers observed, and each word/9mer in the dictionary is indexed by $n$.
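As an illustration, a minimal sketch of this extraction step (the sequence below is a toy fragment, not taken from the actual GAG/POL/VPR data):

```python
def extract_kmers(sequence, k=9):
    """Extract all overlapping k-mers (the words) from a protein sequence."""
    return [sequence[l:l + k] for l in range(len(sequence) - k + 1)]

# Toy fragment (not a real viral protein): the dictionary simply lists
# the unique 9mers observed, each indexed by n.
seq = "MSLYNTVATLYCVHQRIDVK"
words = extract_kmers(seq)           # L - 9 + 1 overlapping 9mers
dictionary = sorted(set(words))      # word n = dictionary[n]
```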
7.2.2 How to count
Unfortunately, there is no technology able to directly measure the epitope concentrations (counts) on the cell surface. Thus, to obtain the count value we reasoned about the mechanism employed by HLA molecules for generating 9mers starting from the viral proteins. In particular, each HLA molecule, indexed by $m$ (in a human host there are up to 6 different HLAs), binds to the viral protein and cuts it to obtain a 9mer, which is later transported to the cell surface.

To generate the count value, we exploited an HLA-epitope complex prediction algorithm that can estimate the binding energy $E_b(n, m)$ for each of the epitopes $n$ and the different patient's HLA molecules $m$ [105]: the higher the energy, the less likely it is that the HLA $m$ will create epitope $n$. We also used a cleavage energy estimate $E_c(n)$ [157], which estimates – for each protein – the energy required to create the epitope $n$.
We combined this information and turned the total energy into a count (concentration) as follows:

$$c_n = e^{-E_c(n) - \min_m E_b(n, m)} \qquad (7.1)$$
Even if other techniques for the estimation of surface epitope (relative) counts exist (see [247] for a recent review), we chose to employ this particular one as it provides predictions for arbitrary HLA types, simply defined by their protein sequence.

In the end, the vector $x^t = [c_1^t, \ldots, c_N^t]$ is the bag of words representation for the epitopes characterizing patient $t$.
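A minimal sketch of this counting step (Eq. 7.1), assuming the binding and cleavage energies have already been predicted; the arrays below are random placeholders, not the outputs of the actual predictors [105, 157]:

```python
import numpy as np

def epitope_counts(E_b, E_c):
    """Eq. 7.1: c_n = exp(-E_c(n) - min_m E_b(n, m)).
    E_b, shape (N, M): binding energy of epitope n to HLA molecule m.
    E_c, shape (N,):   cleavage energy required to create epitope n."""
    return np.exp(-E_c - E_b.min(axis=1))

rng = np.random.default_rng(0)
E_b = rng.uniform(0.0, 10.0, size=(492, 6))   # up to 6 HLA molecules per host
E_c = rng.uniform(0.0, 5.0, size=492)
x_t = epitope_counts(E_b, E_c)                # bag of words vector x^t
```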
7.2.3 How to model
It is important to notice that the counts $c_n$ are not independent. The MHC system, as well as viral mutations, create links among the abundances of different viral peptides in the observed bag. First, two patients infected by the same virus, e.g. HIV, are highly unlikely to have exactly the same HLA molecules: therefore, each of their HLAs will select specific epitopes from HIV proteins, and the patients' sets of immune targets will likely overlap only partially. In other words, each HLA molecule has its binding preferences, which lead to the selection of only one out of a hundred to a thousand epitopes. Second, the variation of the HIV epitope sets found in different patients exhibits strong co-occurrence patterns, where a high count of one peptide often implies the inclusion of several others, as they are all good binders to a particular HLA. This is precisely what topic models were meant to do for text documents, as summarized by Fig. 7.1 (a,b).
In particular, the topic proportions $p(z|t)$ (as given for example by a PLSA model) for individual patients $t$ can be used as a compact representation that discards the superfluous aspects of the bag of words. In this context, the HIV viral load can be regressed directly on these hidden variables. Modeling cellular peptide presentation as a mixture of topics can capture some of the presentation patterns discussed above. Upon model fitting, the topics may correspond to individual MHC molecules that are more frequent in the patient cohort, or to entire families of MHC types that have similar presentation (sometimes referred to as MHC supertypes). In this case, the topic probability distribution would reflect the probabilities of binding of a particular MHC (super)type to these different peptides. Some topics may also capture the HIV clade structure, as mutations in each clade alter the MHC binding patterns.
Among the different topic models, there are some reasons why the counting grid may be a more appropriate model of the variation in epitope bags. These reasons relate to the manner in which biological entities interact and adapt to each other, leading to patterns of slow evolution characterized by genetic drift, local co-adaptation, as well as punctuated equilibrium. In the case of antigen presentation, for example, millions of years of evolution created certain typical variants of MHC, as well as minor variations on each of these major types. These variations are at least in part due to the interaction with viruses [93], and similarly the genetic variation in viruses reflects some of this evolutionary arms race, too. Thus, the HIV clade constraints, as well as the MHC binding characteristics, may be so interwoven that a
rigid view of cellular presentation as a mix of a small number of topics may be inappropriate. In the counting grid, the major variants of cellular presentation can be modeled as far-away windows, while minor variations would be captured by slight window shifts in certain regions of the grid.

Fig. 7.1. Capturing dependencies in bags of words.
7.2.4 Information extracted: regression of viral load value
The final task in our problem is the following: given the bag of words representation of each patient $t$ and a model for the set of patients, we want to derive a regression method able to predict the viral load. In the following, we present a novel methodological contribution, aimed at embedding continuous values $y^t$ (e.g., the HIV viral load) on the grid and performing regression.

First, let us look back at Eq. 5.3, where we embedded discrete labels into the learned grid. Here we generalize this notion: by using the inferred posterior probabilities $q_k^t$, we compute an M-step using the target value in place of the counts $(c_n^t)$. In formulae, this is equivalent to
$$\gamma(i) = \frac{\sum_t \sum_{k \mid i \in W_k} q_k^t \cdot y^t}{\sum_t \sum_{k \mid i \in W_k} q_k^t} \qquad (7.2)$$
Fig. 7.2. HIV viral load embedding in 2D, for $E = [63, 63]$, $W = [11, 11]$ ($\rho = 0.339$) and $E = [63, 63]$, $W = [22, 22]$ ($\rho = 0.314$). The window is shown with a dotted line in the figure.
In Fig. 7.1 (c,d) we visually show the effect of this equation: the viral load $y^t$ of each sample is "copied" into the window positioned by $q_k^t$ (i.e., $W_k$), and then the result is averaged over all the samples. Also, in Fig. 7.2 we show a couple of γs, estimated from the dataset we used in the experiments.

The function γ can then be used for regression in what is essentially a nearest-neighbor strategy: when a new test patient is embedded based on its bag of words, the target $y^{test}$ is simply read out from γ, which is dominated by the training points that were mapped in the same region. In other words, given the mapping
location $q_k^{test}$ of the test sample, its prediction $y^{test}$ will be

$$y^{test} = \sum_k q_k^{test} \cdot \gamma(k) \qquad (7.3)$$
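A minimal sketch of the embedding and of the read-out (Eqs. 7.2 and 7.3), assuming the posteriors q over window positions have already been inferred by a counting grid with a toroidal geometry; averaging γ under each window is one plausible reading of γ(k):

```python
import numpy as np

def embed_targets(q, y, E, W):
    """Eq. 7.2: spread each target y[t] over the grid cells covered by the
    window at position k, weighted by the posterior q[t, k], then normalize."""
    num, den = np.zeros(E), np.zeros(E)
    for t in range(q.shape[0]):
        for k in np.ndindex(*E):                              # window position
            rows = [(k[0] + a) % E[0] for a in range(W[0])]   # toroidal wrap
            cols = [(k[1] + b) % E[1] for b in range(W[1])]
            num[np.ix_(rows, cols)] += q[t][k] * y[t]
            den[np.ix_(rows, cols)] += q[t][k]
    return num / np.maximum(den, 1e-12)

def predict(q_test, gamma, E, W):
    """Eq. 7.3: posterior-weighted read-out of gamma for a test sample."""
    y_hat = 0.0
    for k in np.ndindex(*E):
        rows = [(k[0] + a) % E[0] for a in range(W[0])]
        cols = [(k[1] + b) % E[1] for b in range(W[1])]
        y_hat += q_test[k] * gamma[np.ix_(rows, cols)].mean()
    return y_hat
```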
Besides this simple scheme, we also propose a more complex one, inspired by [96]. The idea is to regress the reconstruction error $E_n^t = \tilde{c}_n^t - R_n^t$ on the residual viral load $y_{RED}^t = y^t - y_{CG}^t$, where $y_{CG}^t$ is the viral load prediction using the counting grid and $\tilde{c}_n^t$ are the normalized epitope counts. Following [96], we used a regularized linear regression with L1 norm (also known as LASSO [226]).
7.3 Experiments
The experimental evaluation is aimed firstly at proving the suitability of the CG model, also providing a comparison with other bag of words models (which had never been applied to this task). Then, we demonstrate that – when used for regression – the CG significantly outperforms the state of the art in the biomedical and computational biology literature [93, 96, 113, 154] (see Tab. 7.1 for a summary).
As input data, we analyzed the cellular presentation of HIV patients from the Western Australia cohort [154]. We represented each patient's cellular presentation by a set of 492 counts over that many 9-long peptides from the GAG protein, previously found to be targeted by the immune system as explained above. The counts were calculated based on the patients' HLA types and the energy estimation procedure discussed in [93] and in Section 7.2.2. This provides us with bags of epitopes (counts over the 492 words) that represent GAG in 135 different patients. We used
the same process for two more proteins: POL and VPR. This resulted in bag of words matrices of respectively 88×135 and 939×135 words×samples. We analyzed only the clade B infected patients.

Fig. 7.3. HIV viral load regression. On the top row, we depict the variation of the correlation factor ρ for the CG model with different complexities (Pearson's linear correlation coefficient vs. capacity, for the POL, VPR, and GAG proteins). We used colors to represent the CG size; the same capacity can in fact be obtained with different E/W combinations. On the bottom row, the same analysis on the GAG protein with LPD.
7.3.1 Experiment 1: modeling antigen presentation with the counting grid
To employ the counting grid model, we first trained it on the bags of words $c_n^t$, without using the regression targets $y^t$ (log viral load). Then, in a leave-one-out fashion, we held out a sample $\hat{t}$ and estimated the regression function γ (see Eq. 7.2, with $t \neq \hat{t}$) using all the others; finally, by reading out γ in the appropriate location we obtained the viral load prediction for sample $\hat{t}$ using Eq. 7.3: $y_{CG}^{\hat{t}} = \sum_k q_k^{\hat{t}} \cdot \gamma(k)$.
Once we computed the estimated regression target for all the samples, we computed ρ, the linear correlation coefficient between the true and the estimated viral loads. The proposed approach based on CGs has been compared with the LPD model introduced in Chap. 4 [194]. To evaluate LPD we proceeded as for CGs: we learned a single model (without using the targets) and predicted the viral load for the left-out sample using linear regression based on the topic proportions $p(z|t)$.
We considered Counting Grids of various complexities, picking between $E = [12, 15, 18, 21, 25, 30, 35, 40, 50]$ and $W = [2, 3, 4]$, trying only the combinations with capacity κ between 1.5 and $T/2$, where $T$ is the number of samples available.
Results for all the proteins are shown in Fig. 7.3, where we report the results for a range of capacities κ which are roughly equivalent to the number of LPD topics K. LPD and CGs reach similar results on POL and VPR, while CGs have a clear advantage on GAG (shown graphically in Fig. 7.3). It is important to note how, for the Counting Grid, the correlation factor varies much more regularly with the capacity κ.
7.3.2 Experiment 2: comparison with the state of the art
In order to have a comprehensive comparison with the state of the art, we evaluated the regression results obtained with standard phylogenetic analysis [223]. We built a phylogenetic tree from the GAG, POL, and VPR sequences using the maximum likelihood approach of [223]: the aim is to model the low-level generative process of random mutations by learning the probability distributions which govern it. The purpose of this analysis is to show whether the evolution of the sequences alone can give some insight into the viral load. A few parameters have to be tuned when computing such trees: in our experiments, we picked the WAG model [238] as the rate substitution matrix, and we allowed for rate variations across sites, setting 4 discrete gamma categories [244] (these are the default values set by the phylogenetic tool employed, MEGA [223]).
Then, we want to predict the viral load $\hat{y}$ for a testing sequence $x^{test}$ by looking at sequences which are close to it in the tree topology. For this reason, regression is carried out with the following formula:

$$\hat{y}^{test} = \sum_t e^{-C \cdot dist(x^{test}, x^t)} \cdot y^t \qquad (7.4)$$

where $t$ indexes the training sequences $x^t$ and their associated viral load values $y^t$. The parameter $C$ has been found with crossvalidation on the training set.
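A minimal sketch of Eq. 7.4, assuming the tree (patristic) distances between the test sequence and each training sequence have already been extracted from the fitted phylogenetic tree:

```python
import numpy as np

def tree_regression(dists, y_train, C):
    """Eq. 7.4: each training viral load y^t contributes with a weight that
    decays exponentially with its tree distance to the test sequence.
    C is chosen by crossvalidation on the training set."""
    dists, y_train = np.asarray(dists), np.asarray(y_train)
    return float(np.sum(np.exp(-C * dists) * y_train))
```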
To have a fair comparison and evaluate the robustness of our framework, for each protein we performed leave-one-out crossvalidation on the training set to pick the best model complexity (E/W for Counting Grids, number of topics K for LPD), and we compared the results with the trees. In more detail, we proceeded as explained in the previous section but – using only the training data – we regressed the viral load and computed the linear correlation factor for each complexity. Then we picked the complexity that gave the best result and predicted the viral load on the test sample. It is important to note that now i) the viral load of each sample can in principle be predicted with a different complexity, and ii) the test sample is not used to train the model.

As a final experiment, we also employed the advanced scheme proposed in Sec. 7.2.4, which we dub CGs→LASSO: as before, we used leave-one-out crossvalidation to choose the best model complexity.
Results are shown in Tab. 7.2. For LPD, this process failed and we could not obtain statistically significant results because of severe overtraining issues. As visible from Tab. 7.2, column CGs→LASSO, the advanced CG scheme improved the performances in all cases. The model complexities chosen by each round of leave-one-out did not differ much: regardless of the protein considered, the same complexity was typically chosen for more than 89% of the data points, as reported in the last column of Tab. 7.2.
Fig. 7.4. Evolution of the viral load across the iterations (1, 5, 10, 15, 20, 25, 30, 40).
The medical literature also reports other results obtained on the GAG protein; comparisons are shown in Tab. 7.1. As visible, our approach strongly outperforms all the methods [93, 96, 113, 154].
It remains to be understood exactly why CGs exhibit such a strong advantage over topic models (LPD). One intuitive explanation is that the slow, smooth variations in count data that can be captured in counting grids better represent the dependencies produced by millions of years of coevolution between the HLA system and various invading pathogens [113]. This process involved numerous mixings of both the immune types and the viral strains, and may have produced the sort of thematic shifts in antigen presentation that CGs are designed to represent. A more speculative possibility is that the immune system, through some unknown mechanism, collates the reports from circulating T cells into an immune memory with a similar structure.
A final note on the embedding function γ: the bags of peptides are mapped to the counting grid iteratively, as the grid is estimated so as to best model the bags, but the regression target, the viral load, was not used during the learning of the CG or LPD models. However, the inferred mapping after each iteration can be used
Table 7.2. Pearson's linear correlation (after crossvalidation where applicable). Crossvalidation for LPD was found not statistically significant (NS) for GAG and POL. The last column reports the most common CG complexity chosen in the rounds of leave-one-out crossvalidation.

Protein   CGs ρ    CGs→LASSO ρ   Trees ρ   LPD ρ    Ridge Regr. ρ   Complexity chosen
GAG       0.3301   0.3674        0.3519    NS       0.1835          [30,5] - 89%
VPR       0.2011   0.2546        0.1061    0.1202   NS              [50,8] - 94%
POL       0.2338   0.2443        0.1812    NS       NS              [40,11] - 97%
to visualize how the embedded viral load γ evolves. This is illustrated in Fig. 7.4 for a model of complexity $E = [30 \times 30]$, $W = [8 \times 8]$. The emergence of areas of high (red) and low (blue) viral load indicates that, as the structure in the antigen presentation is discovered, it does indeed reflect the variation in viral load.
8
Bag of words analysis for T-Cell Receptors
In the previous chapter we studied one particular aspect of the HIV virus, namely how it is presented on the cell surface for T-cell recognition. Nevertheless, the infection caused by HIV affects different parts of the immune system. In particular, from a clinical point of view, T-cell depletion is the central abnormality associated with HIV infection. This is what leads to AIDS, a severe condition where – because of the low count of T-cells in the body – the ability of a person to react to other opportunistic infections is compromised. While antiretroviral therapy (ART) can restore T-cell counts, the adequacy of the diversity of the reconstituting cells, particularly in long-term survivors of perinatal infection (i.e. infection passed from a mother to her baby during pregnancy), is not well understood. While the quantity of T-cells is broadly indicative of the immunological status of an individual, the diversity, i.e. the number of different types of T-cells, is also a crucial factor [15, 208]. In fact, it has been observed [48] that even if such diversity is generated in an antigen-independent fashion, during an immune response particular T-cells are selected and overproduced, in a process called clonal selection.
In this chapter we build a bag of words representation able to characterize this aspect of the HIV infection: in particular, individual TCR sequences – which are called species in this context – compose the dictionary, and each patient is represented with a bag of words vector counting the occurrences of the different TCRs.
This work stemmed from a collaboration with the David Geffen School of Medicine (UCLA), with the main goal of studying the diversity of the bags of words in different classes of HIV patients. In a few words, quantifying the diversity of a sample means assessing the total number of species present. However, here this analysis is particularly challenging, because i) the obtained sequences are extremely noisy, and ii) the observed sequences may be too few to correctly capture the underlying true species distribution. Moreover, typical dataset sizes are very small, because only a handful of patients are available. Therefore, any claim about which class of HIV patients is more diverse is extremely difficult to prove.

These concerns, typically disregarded in the literature, have been taken into account in this chapter by performing a critical and robust statistical analysis of the bag of words representation. As a second contribution, we questioned from a more methodological point of view the reliability of the observed bags of words,
proposing a possible criterion to assess their reliability. The investigation stems from the consideration that any given TCR variant may be present in very low abundance and may differ by only a single nucleotide. In fact, most species (the different TCR sequences in the dictionary) are very rare, with a corresponding count of 1, and any estimate of diversity is inevitably biased.

The clinical partners provided a dataset obtained from TCR sequencing using 454 Pyrosequencing technology [195] in people surviving beyond 15 years after perinatal infection. In particular, the pool of TCR sequences has been derived from 9 different patients: 3 perinatally infected subjects that received ART therapy (positive samples), 3 healthy children (negative samples), and 3 cord samples used as controls. For each patient, TCR sequences can be classified according to their VDJ recombination. In particular, the sequenced TCRs are divided into 3 known V segments and 15 known J segments. The D region is hypervariable, and the main source of diversity.
We performed a thorough evaluation of several statistics that estimate the diversity of the bag of words, providing useful insights into the data. The interesting conclusion reached is that during ART of HIV-infected children, an early and sustained increase in TCRs is seen, and this is somewhat in line with previous findings [101, 190, 191]. Surprisingly, we also observed that TCR diversity in the positive (infected) children is even higher than in the negative ones, suggesting a greater baseline thymic function (the thymus is the gland responsible for T-cell production). This claim has been numerically evaluated using several methods that provide robust estimates of diversity.
However, even if some claims about differences between positive and negative patients can be made, it is not possible to reliably estimate the total number of species present in a sample. In fact, the results of our reliability analysis show that the bag of words contains too many rare species, and these estimates of total diversity cannot be considered reliable.

In the following section, we describe the proposed approach, detailing i) the diversity measures that can be employed to robustly estimate differences between patients and ii) the methodology to assess the reliability of a bag of words vector. Then, results are reported, and conclusions drawn.
8.1 The proposed approach
In order to compare TCR sets from different patients, the starting point is to derive the bag of words representation. The pool of TCR sequences has to be categorized into species, i.e. the words indexed in the dictionary. We propose three different schemes to extract words and build the dictionary:

1. Raw scheme: Using this simple scheme, each unique sequence is assigned to a species: sequences belonging to a species are identical in every position of their nucleotide sequence. In other words, if two sequences differ by even a single nucleotide, they are assigned to two different species (see the sketch after this list).
2. Low error scheme: In order to deal with sequencing errors (even if in principle they should increase diversity in all samples equally) and to reduce noise, we propose the following filtering steps. First, all sequences whose local alignment with one of the known V or J strands scores more than 5% errors (an error can be a mismatch, an insertion, or a deletion) are removed. Then, individual sequences are counted using the raw scheme.
3. Cluster scheme: With this scheme, sequences which are very similar are clustered and aggregated into a single species. Clustering is done by measuring the biological similarity between sequences; sophisticated approaches that take into account the TCR nature of the sequences have been proposed in the past: here we employed the well-known cd-hit-454 tool [160] (publicly available at http://weizhong-lab.ucsd.edu/cdhit 454/cgi-bin/index.cgi).
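As an aside, a minimal sketch of the raw scheme: building the bag of words simply amounts to counting identical sequences (the sequences below are toy placeholders):

```python
from collections import Counter

def raw_scheme(sequences):
    """Raw scheme: every distinct nucleotide sequence is its own species."""
    return Counter(sequences)        # maps species -> count

bag = raw_scheme(["ACGTTA", "ACGTTA", "ACGTTG"])
# Counter({'ACGTTA': 2, 'ACGTTG': 1}): a single mismatch -> two species
```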
In the following section, we describe how to quantify the diversity of this bag of
words representation.
8.1.1 Diversity measures
The simplest way of assessing the diversity of a bag of words vector is by measuring its entropy, which in this context is sometimes called the Shannon index. First, the counts $c_n^t$ (number of sequences in the $n$-th species of the $t$-th sample) are normalized to obtain the frequencies

$$f_n^t = \frac{c_n^t}{\sum_m c_m^t} \qquad (8.1)$$

Then, the Shannon index for the $t$-th bag is equal to

$$D^t = -\sum_n f_n^t \log_2 f_n^t \qquad (8.2)$$
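A minimal sketch of Eqs. 8.1-8.2:

```python
import numpy as np

def shannon_index(counts):
    """Normalize the counts to frequencies (Eq. 8.1) and compute the
    Shannon index D = -sum_n f_n log2 f_n (Eq. 8.2)."""
    f = np.asarray(counts, dtype=float)
    f = f / f.sum()
    f = f[f > 0]                     # 0 * log2(0) is taken as 0
    return float(-np.sum(f * np.log2(f)))
```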
However, the sample size (i.e. the number of observed species) has a strong effect on the Shannon index. Since the observed TCR counts are only a fraction – or a sample – of a larger distribution of rare variants that could not be observed with the sequencing technology at our disposal, the Shannon index may be too biased for estimating diversity. For this reason we propose to employ a robust histogram shape estimation technique called the "Unseen estimator" [229]. This technique uses the observed distribution of species in a sample to estimate how many undetected dictionary elements, i.e. species, occur in various probability ranges. Given such a reconstruction, one can then use it to estimate any property of the distribution which depends only on the shape; such properties are termed symmetric, and include the entropy (i.e. the Shannon index) and the support size [229].
Another robust strategy that can be employed in order to have consistent estimates across different patients is to subsample each patient so that all patients have the same number of sequences. Of course, the subsampling procedure has to be repeated many times in order to have a significant comparison.
The last robust measure of diversity we employed is the so-called rarefaction curve [189], originally introduced in ecology. The idea is the following: if we are given $W$ individual sequences with $N < W$ different species, one way to visualize the diversity patterns is to run randomization tests, in each of which a subset of $v < W$ sequences is picked at random and the number of unique sequence types $n$ is counted. When these tests are done many times for all values of $v$, we can plot the graph of species accumulation by averaging over samples. The curve – which is often called a rarefaction curve [189] – is nearly linear at the beginning, as picking a handful of sequences from a large diverse set is not likely to result in any sequence repetitions. As $v$ increases, the graph starts to curve and would asymptotically reach the total number of species in the population (if $W$ is large enough so that the sample covers all of the diversity). If we have multiple groups of sequences with different numbers of sampled sequences, then the comparison of these graphs provides some idea as to which sample came from a more diverse population.
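A minimal sketch of this randomization procedure, assuming the input is the species label of each individual sequence:

```python
import numpy as np

def rarefaction_curve(labels, sizes, n_rep=50, seed=0):
    """For each subsample size v, draw v sequences at random without
    replacement and count the unique species; average over repetitions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    return [float(np.mean([np.unique(rng.choice(labels, size=v,
                                                replace=False)).size
                           for _ in range(n_rep)]))
            for v in sizes]
```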
8.1.2 Reliability of the bag of words
In this section we address the problem of assessing the reliability of the bag of words representation. This analysis stems from the consideration that typical TCR sequencing produces many rare variants, i.e. species observed only once, and any technique that attempts to estimate statistics from them may produce unreliable results.

A crucial question is: "What can one infer about an unknown distribution based on a random sample?" If the underlying distribution is relatively "simple" in comparison to the sample size – for example, if our sample consists of 1000 independent draws from a distribution supported on 100 domain elements – then the empirical distribution given by the sample will likely be an accurate representation of the true distribution. If, on the other hand, we are given a relatively small sample in relation to the size and complexity of the distribution – for example, a sample of size 100 drawn from a distribution supported on 1000 domain elements – then the empirical distribution may be a poor approximation of the true distribution. To assess whether our bags of words fall into this last category, we propose an approach based on the Unseen estimator [229]. The idea is that below a certain abundance threshold (i.e. for very rare species), the unseen estimator is not able to reconstruct the shape of the "missing" part of the distribution. We propose a reliability criterion based on the consistency of the prediction across subsamplings.
In more detail, suppose that we are given a bag $x = [x_1, \ldots, x_W]$ with total count $N$ (i.e. $N$ sequences have been observed, $N = \sum_{n=1}^{W} x_n$). The dictionary, of size $W$, lists all species $D = \{w_1, \ldots, w_W\}$. Consider also the normalized version with frequencies instead of counts: $f = [f_1, \ldots, f_W]$, where $f_n = x_n / N$.

First, we define $W_\alpha$ as the number of species with frequency above a given value $\alpha$:

$$W_\alpha = |\{w_i \mid f_i > \alpha\}| \qquad (8.3)$$
Then we compute, for all $\alpha$, the values $\hat{W}_\alpha$ obtained with the unseen estimator, and repeat the same procedure after subsampling $x$ so that it has $N/2$ total counts. In the end, after many subsamplings, we obtain an average $\hat{W}_\alpha^{(0.5)}$.

The reliability threshold $\hat{\alpha}$ is obtained by aggregating two criteria:

1. The average number of predicted species after subsampling, $\hat{W}_{\hat{\alpha}}^{(0.5)}$, is within 20% of the number $\hat{W}_{\hat{\alpha}}$ predicted with the full data;
2. The standard deviation over the repetitions is within 20% of the average prediction $\hat{W}_{\hat{\alpha}}^{(0.5)}$.
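A minimal sketch of the two criteria at a fixed threshold α, assuming the Unseen estimates themselves are computed elsewhere (the estimator of [229] is not reproduced here):

```python
import numpy as np

def is_reliable(W_hat_full, W_hat_half_reps, tol=0.20):
    """W_hat_full:      Unseen estimate of W_alpha on the full bag.
    W_hat_half_reps: Unseen estimates over repeated N/2 subsamplings.
    Returns True when both 20% consistency criteria are satisfied."""
    mean_half = np.mean(W_hat_half_reps)
    crit1 = abs(mean_half - W_hat_full) <= tol * W_hat_full
    crit2 = np.std(W_hat_half_reps) <= tol * mean_half
    return bool(crit1 and crit2)
```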
Fig. 8.1. Reliability evaluation for patient AP04 (observed data: 1657 reliable species; subsampled data: 1930 ± 353; x axis: sequence abundance threshold; y axis: number of species). The shaded area indicates consistent estimates, namely estimates of the number of species that are robust with respect to subsampling.
Fig. 8.1 depicts an example: the solid curve represents the (estimated) number
of species above the threshold specified on the x axis, whereas the dashed curve
represents the (estimated) number of species after 50% of sequences have been
subsampled at random many times. As can be observed from the graph, in the
left part (which is heavily influenced by rare, low frequency species), the two
curves are rather different; however, they tend to converge moving towards the
right. In addition, note that the standard deviation over the different subsamplings
decreases when raising the threshold. The shaded portion of the curve indicates
where estimates are reliable: when the shaded area starts, the average number of
predicted species after subsampling (1930) is within 20% of the number predicted
with the full data (1657). Moreover, the standard deviation over the repetitions
(353 species) is within 20% of the average prediction of 1930 species.
8.2 Experimental results
8.2.1 Dataset statistics
We report in Tab. 8.1 some basic statistics to give a general overview of the dataset used in our experimental evaluation. In more detail, in the top part of the table, the total number of TCR sequences for each patient is displayed; we also report how these counts are distributed over the three V families considered, V4, V9 and V17 (according to the nomenclature in http://www.imgt.org). In the bottom part, we show the counts of species (i.e. unique sequences) obtained with the Raw scheme.
Table 8.1. Sequences and species (unique sequences) counts for the TCR dataset.

Count of TCR sequences
           Positive patients          Negative patients           Cord samples
           AP04    AP22    CP04      CN13     CN02     BN02      Cord12   Cord11   Cord13
V 4        38031   3377    24832     12789    33727    2883      29607    19827    9627
V 9        27537   32125   37942     112979   54912    25396     47801    82367    10987
V 17       25337   13015   23497     72688    30868    4734      10868    73392    11325
V 4/9/17   90905   48517   86271     198456   119507   33013     88276    175586   31939

Count of species - raw scheme
           Positive patients          Negative patients           Cord samples
           AP04    AP22    CP04      CN13     CN02     BN02      Cord12   Cord11   Cord13
V 4        19666   2862    17277     9532     15772    1962      25966    14231    8114
V 9        9646    13608   15016     35440    18910    8395      29072    54042    8911
V 17       11113   8865    10428     32168    11508    2529      8396     43461    11075
V 4/9/17   40425   25335   42721     77140    46190    12886     63434    111734   28100
From the table it can be noted that there is a very large variability in the number of sequenced TCRs: unfortunately, this is a limitation due to the technology, not to biology.

In order to make these numbers more robust, we employed the Low error and Cluster schemes for building the bags of words; for the Cluster scheme, we employed the default parameters of the algorithm. Counts of these low error and clustered sequences are reported in Tab. 8.2, top and bottom part respectively.
8.2.2 Shannon index analysis
In our first evaluation, we estimated diversity using the Shannon index as described in Sec. 8.1.1. The main hypothesis that we want to confirm is that the diversity of cord samples $D_{cord}$ is greater than the diversity of positive samples $D_{pos}$, which in turn is greater than the diversity of negative samples $D_{neg}$; in formula: $D_{cord} > D_{pos} > D_{neg}$.

In Tab. 8.3 we report the Shannon index computed on the bag of words for each patient – also keeping the sequences divided per V family. Using this simple scheme, our hypothesis $D_{cord} > D_{pos} > D_{neg}$ seems to be supported by the values in Tab. 8.3, although without statistical significance. However, when computing the Shannon index on the histogram reconstructed using the Unseen estimator, the hypothesis is strongly supported, with a p-value p<0.01 (excluding the cord sample and negative control with a much lower yield of high quality sequences; the two excluded samples, however, are also consistent with this inequality, as can be observed from the table). Since Gaussianity assumptions on the data do not hold, the p-value has been computed through randomization tests [78].
Table 8.2. Species counts after the low error and clustering processing steps.

Count of species - low error scheme
           Positive patients          Negative patients          Cord samples
           AP04    AP22    CP04      CN13     CN02    BN02      Cord12   Cord11   Cord13
V 4        14038   2399    12649     7700     10585   1416      23315    11580    6909
V 9        6662    9208    9871      22972    12065   5406      21501    41019    7354
V 17       7347    4683    6427      17032    7373    1412      6108     24979    5749
V 4/9/17   28047   16290   28947     47704    30023   8234      50924    77578    20012

Count of species - cluster scheme
           Positive patients          Negative patients          Cord samples
           AP04    AP22    CP04      CN13     CN02    BN02      Cord12   Cord11   Cord13
V 4        3358    1571    4990      3862     2384    737       14079    5667     5202
V 9        2210    3110    3142      5701     3437    1489      12414    23041    5842
V 17       2000    1879    1825      4459     1461    348       3708     9959     4911
V 4/9/17   7568    6560    9957      14022    7282    2574      30201    38667    15955
Encouraged by these results, we made a step forward by observing that the obtained counts have some noticeable differences in sample size: they span from around 24000 sequences (patient Cord13, low error scheme) to 175000 (patient CN13, low error scheme). As discussed in Sec. 8.1.1, this may affect the estimation of the Shannon index. Therefore, we subsample each of them to get a random pool of sequences, so that each patient has the same number of sequences $M$:

$$M = \min_t \sum_n c_n^t \qquad (8.4)$$
From this subsampled data, we computed statistics such as the number of species, the Shannon index, and the fraction of singletons (number of sequences occurring only once divided by the total number of sequences). Results are reported in Tab. 8.4, where the numbers displayed are an average over 100 different random subsamplings. The last column in the table reports the sample size used for the subsampling in each case.
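A minimal sketch of the subsampling step of Eq. 8.4 (names are illustrative):

```python
import numpy as np

def subsample_bag(counts, M, rng):
    """Draw M sequences without replacement from a bag of words vector
    (counts per species) and return the subsampled counts."""
    pool = np.repeat(np.arange(len(counts)), counts)   # one entry per sequence
    picked = rng.choice(pool, size=M, replace=False)
    return np.bincount(picked, minlength=len(counts))

# e.g., for one patient, average the number of observed species over
# 100 random subsamplings of size M:
# rng = np.random.default_rng(0)
# n_species = np.mean([(subsample_bag(bag, M, rng) > 0).sum()
#                      for _ in range(100)])
```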
From the tables it seems evident that $D_{cord} > D_{pos} > D_{neg}$ (also confirmed by p<0.01). Moreover, positive patients have overall more species and a higher fraction of singletons, suggesting that the tail of the distribution is longer. Another representation that can be useful to assess sample diversity is a pie chart, such as the ones reported in Fig. 8.2. In the figure, each slice aggregates 512 species, sorted by descending frequencies: by visual inspection of the graph, it can be noted that the positive patients' distributions more closely resemble those of the cords. For example, in the first 3/4 of the charts (where the rarer species are located) there are more slices in positive patients than in negative ones, again suggesting the higher diversity of positives w.r.t. negatives.

Finally, we wanted to assess the abundance levels of the different species, in order to have a clearer picture of how they accumulate (from rare species to highly frequent ones) in HIV patients. Since every patient has the same sample size, the
probability mass occupied by one sequence is 1/(sample size = 24k) everywhere: the same abundance threshold can therefore be used for comparing different samples. In Fig. 8.3 (a) we report, for the different abundance levels shown on the x axis, the number of species whose sequences are at least that abundant (in other words, given a point on the x axis, the corresponding y indicates how many species are more frequent than the chosen x). Confirming the previous hypothesis, negative patients have more highly frequent species (as the blue line is above the red one towards the right part of the plot), in contrast with positives having more rare species and an overall higher diversity.
Table 8.3. Shannon index D for the TCR dataset.

Shannon index - low error scheme
           Positive patients       Negative patients        Cord samples
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        12.74  11.12  13.13    12.60  11.95  10.19      14.39   13.22   11.82
V 9        10.04  11.67  11.03    11.14  10.59  9.02       13.65   14.80   11.48
V 17       11.19  11.45  10.68    11.90  10.22  8.93       12.35   13.81   12.48
V 4/9/17   13.05  12.68  13.11    12.70  12.41  10.00      15.12   15.64   13.39

Shannon index after unseen estimator - low error scheme
           Positive patients       Negative patients        Cord samples
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        14.07  14.23  14.45    14.08  12.92  11.77      16.81   14.66   15.59
V 9        10.79  12.62  11.84    11.71  11.23  9.63       15.06   15.95   14.23
V 17       12.27  12.81  11.69    12.99  11.02  10.46      14.01   15.00   16.86
V 4/9/17   14.11  13.78  14.14    13.50  13.23  10.79      17.15   16.82   16.83

Shannon index - cluster scheme
           Positive patients       Negative patients        Cord samples
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        10.87  10.37  11.74    11.49  10.10  9.13       13.32   12.02   11.10
V 9        8.69   10.31  9.66     9.82   9.20   7.66       12.70   13.78   10.89
V 17       9.53   10.02  9.04     10.21  8.43   7.08       11.46   12.40   12.14
V 4/9/17   11.40  11.35  11.67    11.28  10.79  8.59       14.13   14.45   12.82

Shannon index after unseen estimator - cluster scheme
           Positive patients       Negative patients        Cord samples
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        11.05  11.10  12.04    11.88  10.25  9.60       14.17   12.35   12.78
V 9        8.80   10.51  9.84     9.89   9.27   7.77       13.16   14.21   12.39
V 17       9.70   10.40  9.20     10.33  8.46   7.24       12.10   12.66   14.37
V 4/9/17   11.58  11.64  11.91    11.41  10.94  8.74       14.76   14.81   14.56
Table 8.4. Statistics of the TCR data after subsampling.

Average count of species - subsampled data
           Positive patients       Negative patients        Cord samples               Sample size
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        1684   1812   1867     1853   1441   1416       2047    1934    1767       2087
V 9        3047   4075   3638     3257   3176   2684       6967    8460    7354       9958
V 17       1853   2296   1637     2026   1405   1412       3060    3153    3663       3718
V 4/9/17   10849  10775  11610    9576   9156   7129       18858   19384   20012      24369

Average Shannon index - subsampled data
           Positive patients       Negative patients        Cord samples               Sample size
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        10.59  10.74  10.80    10.79  10.25  10.19      10.99   10.87   10.16      2087
V 9        9.69   11.17  10.50    10.29  9.99   8.73       12.51   12.92   11.48      9958
V 17       10.27  10.78  9.83     10.47  9.36   8.93       11.47   11.53   11.83      3718
V 4/9/17   12.43  12.39  12.47    11.83  11.78  9.93       13.98   14.08   13.39      24369

Average fraction of singletons - subsampled data
           Positive patients       Negative patients        Cord samples               Sample size
           AP04   AP22   CP04     CN13   CN02   BN02       Cord12  Cord11  Cord13
V 4        0.67   0.76   0.81     0.79   0.52   0.52       0.96    0.86    0.83       2087
V 9        0.25   0.30   0.28     0.26   0.26   0.22       0.55    0.74    0.69       9958
V 17       0.37   0.47   0.34     0.41   0.30   0.32       0.69    0.73    0.97       3718
V 4/9/17   0.34   0.34   0.38     0.32   0.30   0.23       0.66    0.66    0.78       24369
8.2.3 Nuanced patterns in the bag of words
By looking at statistics of the whole sequence histogram, we concluded that $D_{cord} > D_{pos} > D_{neg}$. However, we wanted to investigate the possibility of a more nuanced picture in some specific parts of these distributions. After sorting the TCR counts in descending order (from high-frequency variants to low-frequency ones), we divided the histogram into 100 overlapping bins, each occupying 2% of the sequenced TCR total. In Fig. 8.4 we show the normalized partial Shannon index summations computed in these local portions of the histogram; the area under the curve equals the total Shannon index reported in Tab. 8.4. In every part of the spectrum, i.e. both for highly abundant (left part of the figure) and rare sequences (right part), the Shannon index of HIV+ dominates that of HIV-, and in general supports the conclusion that $D_{cord} > D_{pos} > D_{neg}$.
8.2.4 Rarefaction curves
The rarefaction curves for the dataset are shown in Fig. 8.5. We stopped the curves at the level of 1/3 of the smallest sample (9809 sequences) to make them comparable across samples. These rarefaction curves indicate with statistical significance (considering the number of species at x=9809, p=0.02) that the TCR diversity in positive patients is higher than in negative ones.
Fig. 8.2. Pie charts for the 9 samples after subsampling: positive samples AP04 (10849 species), AP22 (10775), CP04 (11610); negative samples CN13 (9576), CN02 (9156), BN02 (7129); cord samples Cord12 (18858), Cord11 (19384), Cord13 (20012). The first column comprises positive samples, the second one negative samples, and the third one cord samples.
We carried out the same analysis by keeping the three V families separate, and the result is shown in Fig. 8.6. Because we only have three patients per class, the p-value is weaker (p=0.04 at x=500), even if this result still statistically supports the difference between positives and negatives.
8.2.5 Total number of species estimation
We assessed the performance of the Unseen estimator in reconstructing the total number of species in our TCR populations. Results are reported in Tab. 8.5.
Fig. 8.3. (a) The y axis indicates the number of species above the abundance threshold on the x axis. (b) The same representation, with the y axis in the log domain for better insight into high-frequency species. (c) and (d) depict the same graphs after the histogram of TCR species has been reconstructed with the unseen estimator.
However, as we will demonstrate in the next section, these estimates cannot be considered reliable, as we do not have enough samples to make claims about the total number of species.
8.2.6 Reliability of the bag of words
The following analysis, described in Sec. 8.1.2, is aimed at detecting the abundance threshold at which we can reliably estimate – with the Unseen estimator – the number of different species having abundance above this threshold.

Fig. 8.7 reports the whole reliability study, evaluated for every patient. As a general comment, it can be noted that the reliability threshold is higher in positive patients, again suggesting that their species histogram comprises more rare variants (and therefore an increased difficulty for the estimator in reconstructing the number of unseen species). Finally, we considered a global threshold at which the predictions for every positive and negative patient are reliable: this threshold is set roughly to $2 \cdot 10^{-4}$ (as this is the highest threshold, found in BN02). In Tab. 8.6, we report the number of species predicted with the Unseen estimator above this abundance threshold.
Fig. 8.4. From the subsampled species histogram (sorted by descending frequencies) of each patient, we computed the Shannon index in contiguous bins centered at α, each region occupying 2% of the total TCRs (curves for positives, negatives, and cords).
Fig. 8.5. Rarefaction curves computed on the nine samples.
Fig. 8.6. Rarefaction curves computed on the nine samples, divided per V family (V 4, V 9, V 17).
Table 8.5. Number of species estimated with Unseen for the TCR data.

Estimated number of species - raw data
           Positive patients            Negative patients            Cord samples
           AP04    AP22    CP04        CN13    CN02    BN02         Cord12  Cord11  Cord13
V 4        449131  21417   126559      49168   108600  23092        251454  75056   368308
V 9        64853   89915   137566      336466  113224  57045        187519  285919  219232
V 17       77883   73365   96022       310024  149969  32634        44976   447732  285609
V 4/9/17   410614  186552  341697      652815  352962  112926       453996  712770  1037168

Estimated number of species - low error data
           Positive patients            Negative patients            Cord samples
           AP04    AP22    CP04        CN13    CN02    BN02         Cord12  Cord11  Cord13
V 4        206095  48092   68781       35568   71112   5992         165935  48358   174783
V 9        49537   84808   72943       197745  83000   34235        217107  162220  87685
V 17       81018   28639   57563       265697  58566   34382        34947   170585  119167
V 4/9/17   303702  112541  204755      380845  232023  64665        728317  362088  334221

Estimated number of species - clustered data
           Positive patients            Negative patients            Cord samples
           AP04    AP22    CP04        CN13    CN02    BN02         Cord12  Cord11  Cord13
V 4        11288   2692    12176       6528    7185    1273         35082   7568    33341
V 9        5274    8912    9737        17758   11628   4681         30393   43808   30374
V 17       4162    7814    4844        11439   3253    869          10288   22402   26568
V 4/9/17   20403   20328   26811       40787   22956   6264         82412   87530   72102
Table 8.6. Unseen prediction above the reliability threshold of $2 \cdot 10^{-4}$.

Positive patients          Negative patients
AP04    AP22    CP04       CN13    CN02    BN02
769     1013    666        770     763     281
Fig. 8.7. Reliability analysis on the 9 patients (sequence abundance threshold vs. number of species). In the legend of each panel, the number of observed species which can be reliably estimated is reported.
Final remarks on HIV modeling

In this part of the thesis we promoted the use of bag of words models to capture two aspects of the HIV infection, namely antigen presentation and TCR variation.

In Chap. 7, we derived a bag of words representation to characterize the view that the immune system has of the invading pathogens, covering all aspects of the pipeline proposed in Chap. 2. We also demonstrated that the counting grid model seems to be especially well suited to the modeling stage, providing stronger predictions than what can be found in the biomedical literature. Our experiments showed that cellular presentation of the GAG protein explains more than 13.5% of the log viral load. Although the viral load varies dramatically across patients for a variety of reasons, e.g. gender, previous exposures to related viruses, etc., the detection of statistically significant links between cellular presentation and viral load is expected to have important consequences for vaccine research [93].
In Chap. 8, we investigated a critical aspect that is usually overlooked, namely the quality of the bag of words representation of TCR populations. We derived several measures of diversity that can be employed to study differences between HIV patients, each one designed to be as robust as possible with respect to i) small sample sizes and ii) the abundance of rare species. From an applicative point of view, we reached the conclusion that TCR diversity in positive infected children is even higher than in negative ones, providing potential new insights into the thymus function (the thymus being the gland responsible for T-cell production), which may respond differently if the infection occurs during the early development of the fetus. From a methodological point of view, we provided some evidence that the current sequencing technology does not allow for accurate estimates of the total population size: it seems that the obtained sample size (which resulted in a huge number of rare species observed only once) is too poor to draw statistically significant conclusions. In conclusion, we provided a general scheme that could be employed in other validation contexts (for example, in ecology, where these types of analysis are crucial for characterizing an ecosystem).
Part III
Protein remote homology detection
9
Introduction
One of the cornerstones of bioinformatics is the process of identifying homologous proteins, i.e. detecting whether two proteins have a similar function or an evolutionary relationship. This relationship is usually established by measuring the similarity between the protein sequences. Through this comparative analysis, one can draw inferences regarding whether two proteins are homologous. It is important to distinguish similarity, which is a quantitative measure of how related two sequences are, from homology, which is the putative conclusion reached based on the assessment of their similarity [16]. Usually, homology is inferred when two sequences share more similarity than would be expected by chance; when significant similarity is observed, the simplest explanation is that the two sequences did not arise independently, but derived from a common ancestor [169]. However, homologous sequences do not always share significant sequence similarity: there are thousands of homologous proteins whose pairwise alignment is not significant, but for which other evidence (such as the molecular three-dimensional structure) clearly proves their homology. In these cases, detecting the homology on the basis of sequence alone is very challenging: in the literature, this problem of detecting homology in the presence of low sequence similarity is referred to as remote homology detection.
A large number of computational approaches have been proposed for solving
this task (an analysis of the state of the art is presented in Sec. 9.2). Among others,
the bag of words approach has already been investigated in the literature: a natural analogy can be made between biological protein sequences (which are essentially strings composed of symbols from a 20-letter alphabet) and text documents. Under this parallelism, the biological “words” that are usually extracted are called Ngrams or Kmers [125], such as the ones depicted in Fig. 9.1: they are short contiguous subsequences of N symbols. Consider for example the sequence MDCCDC, and suppose that we define as words 2grams, i.e. short subsequences of 2 amino acids. The dictionary then contains 20^2 = 400 2grams. The 2grams extracted from the example sequence are the ones in the multiset {MD,DC,CC,CD,DC} (usually overlapping Ngrams are considered). Then, the bag of words representation is a vector where each element corresponds to a 2gram in the dictionary and the value of that element is the number of times the 2gram appears in any position of the sequence.
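To make the counting process concrete, the following minimal Python sketch (the ngram_bag helper is purely illustrative) extracts the overlapping 2grams of the example sequence and builds the corresponding 400-dimensional bag of words vector:

    from itertools import product
    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino acid alphabet

    def ngram_bag(sequence, n=2):
        # dictionary: all 20^n possible Ngrams, in a fixed order
        dictionary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
        # overlapping Ngrams start at positions 0 .. len(sequence)-n
        counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
        return [counts[v] for v in dictionary]

    bag = ngram_bag("MDCCDC")
    # "DC" occurs twice; "MD", "CC", "CD" once; the remaining 396 entries are 0
    print(sum(bag))  # 5 overlapping 2grams extracted from a sequence of length 6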
Fig. 9.1. Ngram definition. Given a sequence and a fixed value of N, Ngrams are short consecutive (and overlapping) subsequences of N symbols.
This part of the thesis presents some contributions in this context, proposing novel bag of words approaches that can overcome the limits of existing ones. In particular, in Chap. 10 we propose a novel bag of words representation for protein sequences, enriched with evolutionary information, a kind of information not fully exploited in this context. Then, in Chap. 11 we propose a multimodal strategy to integrate structural information (a richer, yet more difficult to obtain, modality which is typically not used) into existing bag of words approaches for sequences.
Before going into the details of the contributions, we will clearly formalize the
problem and the current state of the art.
9.1 Background: protein functions and homology
Proteins are highly complex and functionally sophisticated molecules, which perform a vast array of functions within living organisms. Despite their diversity, from a structural point of view they all consist of one or more chains, where each building block of the chain is taken from a set of 20 amino acids. Thus, the simplest representation of a protein (called primary structure) is simply the sequence of its amino acids.
The primary structure drives the folding and intramolecular bonding of the linear amino acid chain, which ultimately determines the protein’s unique three-dimensional shape. Usually, stable patterns of folding occur: these regular substructures, known as alpha helices and beta sheets (see Fig. 9.2 for an example), constitute the secondary structure of a protein. Most proteins contain several helices and sheets, in addition to other less common patterns. The ensemble of formations and folds in a single linear chain of amino acids – sometimes called a polypeptide – constitutes the tertiary structure of a protein. Finally, some proteins are made by the aggregation of multiple polypeptide chains or subunits, and in this case the protein is said to have a quaternary structure. Fig. 9.2 graphically depicts the four different types of protein structure. A more formal definition of protein (taken from [41]) is “a biologically functional molecule that consists
Fig. 9.2. Levels of protein structure, from primary (amino acids sequence) to quaternary
(complex of functionally-folded three dimensional structures).
of one or more polypeptides (linear chain of many amino acids), each folded and
coiled into a specific three-dimensional structure”.
Proteins carry out an incredibly vast set of functions, which
spans from enzymatic catalysts, transport and storage of other substances, structural scaffolding for cells and tissues, control of biochemical reactions and immune
response, to regulation of growth and cell differentiation [41].
Such functions are univocally determined by the protein’s specific three-dimensional structure [41]. This is because the protein structure allows molecular
recognition: in almost every case, the function of a protein depends on its physical
interaction with other molecules. For example, enzymes are proteins that catalyze
biochemical reactions. The function of an enzyme relies on the structure of its
active site, a cavity in the protein with a shape and size that enable it to fit the
intended substrate, and with the correct chemical properties to bind the substrate
efficiently. In other words, the function of an enzyme is possible because both the
enzyme and the substrate possess specific complementary geometric shapes that fit
into one another [77]. Not all proteins are enzymes, but all bind to other molecules
in order to complete their tasks, and the precise function of a protein depends on
the way its exposed surfaces interact with those molecules. In conclusion, we can
generally assert that proteins with similar shape share similar function. A collection of proteins that are similar in shape and perform similar functions is said to comprise a protein family. Proteins from the same family also often have long
stretches of similar amino acid sequences within their primary structure. These
stretches have been conserved through evolution and are vital to the catalytic
function of the protein.
Summarizing, the key point is the following: to gain insights into the function of a newly discovered protein, it is of primary importance to identify its family, or some homologous members whose function has already been described in the biological literature. Given what we presented so far, the most reliable way to determine a protein family is to analyze its three-dimensional (3D) structure, i.e.
the cartesian coordinates of every atom of the protein. Unfortunately, acquiring
such coordinates requires some sophisticated experimental techniques. The most
common method is x-ray crystallography [109], which is based on the scattering of
X-rays by the electrons in the crystal’s atoms. Despite advances in techniques for
determining protein structure, the structures of many proteins are still unknown.
By contrast, determining the amino acid sequence of a protein is a more straightforward task, easily done with current technology [51]. One can verify this fact by looking at the number of discovered sequences and structures stored in online databases. A comprehensive, freely accessible resource of protein sequence and functional information, called Uniprot1 [51], contains (as of January 2015) around 547000 manually annotated (i.e. of high quality and confidence) sequences. The corresponding database of experimentally-determined 3D structures, the Protein Data Bank (PDB2), contains (as of January 2015) around 35000 solved structures, which can be retrieved in the form of cartesian coordinates for each atom in the protein [18].
Therefore, to determine homology a researcher usually has to resort to the analysis of the protein sequences alone. However, in the protein remote homology detection scenario, homologous proteins share low sequence similarity: in such cases, detecting the homology becomes a very challenging problem for which sophisticated techniques must be derived. The next section formalizes the problem and reviews the current state of the art.
9.2 Computational protein remote homology detection
From the computational point of view, protein remote homology detection is a
crucial and widely studied problem, which has assumed great importance in recent
years. Most of the approaches can be split into three basic groups, according to
the taxonomy proposed in [128]: i) pairwise sequence comparison algorithms; ii)
generative models for protein families; iii) discriminative classifiers.
In the first, simplest case, similarities between proteins are evaluated via pairwise sequence alignment, a technique aimed at finding the best superimposition between two sequences. In practice, a sequence alignment is obtained by inserting spaces inside the sequences (the so-called gaps) in order to maximize the point-to-point similarity between them [8]. A simple example is shown in Fig. 9.3. A huge number of algorithms for sequence alignment exist in the literature [8, 9, 156, 168, 216, 225], which can be classified in several different ways. The main taxonomy divides the approaches into three categories: global alignment methods, which aim at finding the best overall alignment between two sequences; local alignments, which detect related segments in a pair of sequences; and multiple alignments, which aim at simultaneously aligning more than two sequences. Among the algorithms proposed in the past for pairwise alignment, the
Needleman-Wunsch [156] and Smith-Waterman [216] algorithms are the most accurate methods, whereas heuristic algorithms such as BLAST [8] and FASTA [168]
trade reduced accuracy for improved efficiency. All of these techniques heavily rely
1 http://www.uniprot.org/
2 http://www.rcsb.org/
on a fundamental parameter, called the substitution matrix, which encodes the
biological knowledge and assigns a score for matches/mismatches based on the
rate at which one character in a sequence is likely to mutate into another one
(the higher, the more likely it is). Another important parameter, the gap penalty,
is specified by a pair of values representing the cost for inserting a gap and extending an existing one. Then, advanced approaches have obtained higher rate of
accuracy by collecting statistical information from a set of similar sequences. One
of the most famous methods, PSI-BLAST [9], uses BLAST to iteratively build
a probabilistic profile of a query sequence and obtains a more sensitive sequence
comparison score. Briefly, a profile derives from the results of a standard sequence alignment (through BLAST). These results are combined into a general sequence which summarizes the significant features present in the aligned sequences. A query against
the protein database is then run using this profile instead of a single sequence,
and a larger group of proteins is found as a result of this new query. This larger
group is used to construct another profile, and the process is iterated. By including
related proteins in the search, PSI-BLAST is much more sensitive in picking up
distant evolutionary relationships than a standard protein-protein BLAST.
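To make these parameters concrete, the following minimal sketch (ours) scores a pre-computed alignment given a substitution matrix and the two gap-penalty values; the toy pair scores are chosen to reproduce the example of Fig. 9.3, and are not taken from any standard matrix:

    def score_alignment(a, b, sub, gap_open=-10.0, gap_extend=-0.5):
        # a, b: aligned sequences of equal length, with '-' marking gaps
        score, in_gap = 0.0, False
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                score += gap_extend if in_gap else gap_open  # open vs. extend a gap
                in_gap = True
            else:
                score += sub[(x, y)]  # substitution matrix lookup
                in_gap = False
        return score

    # illustrative pair scores; a real tool would use e.g. BLOSUM62
    SUB = {("M", "M"): 5, ("V", "V"): 4, ("F", "F"): 6, ("C", "S"): -1, ("L", "I"): 2}
    print(score_alignment("MV---FFCL", "MVSSSFFSI", SUB))  # 5+4-10-.5-.5+6+6-1+2 = 11.0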
The second category in the taxonomy relies on generative models. The most
famous model employed in this context is the profile hidden Markov model (HMM)
[108], which uses examples of proteins in a family to train a generative model
which characterizes the family [184]. Generative models improve upon profile-based
methods by iteratively collecting homologous sequences from a large database and
incorporating the resulting statistics into a central model. All of the resulting
statistics, however, are generated from positive examples, i.e., from sequences that
are known or posited to be evolutionarily related to one another.
Because the homology detection task can be seen as the problem of discriminating between related and unrelated sequences, we can employ discriminative
approaches, which can explicitly model the difference between these two sets of sequences. In this context, the most employed classifier is the support vector machine
(SVM) [52], which uses both positive and negative examples, and has provided
state-of-the-art performances in this context [128]. Many SVM-based approaches
have been proposed: SVM-Fisher [97,98] couples an iterative HMM training scheme
        M V - - - F F C L
        | |       | | . :
        M V S S S F F S I
Scores: 5 4 -10 -.5 -.5 6 6 -1 2      Similarity: 11
Fig. 9.3. Alignment of two sequences: vertical bars mark matches, dots and colons mark mismatches, dashes mark gaps; per-column scores and the overall similarity are reported below the alignment.
with a discriminative classifier; SVM-LA [202] derives a string kernel obtained after pairwise sequence alignment; SVM-k-spectrum extracts Ngrams from the sequences and feeds the bag of words representation to the SVM. Other examples include Mismatch-SVM [124], SVM-pairwise [128], SVM-I-sites [95], SVM-SW [202],
and others. A more detailed comparison of SVM-based methods has been presented
in [201].
Apart from the specific classifier employed, it is important to notice that particular success has been obtained by employing representations based on sequence profiles. As explained before, the profile of a sequence S = s_1 … s_L is the result of a multiple sequence alignment between S and its closest neighbors found by a database search (such as the previously described PSI-BLAST [9]). The information contained in the profile may be very useful, and has been exploited for
protein remote homology detection in [133], where a novel representation called
top-Ngram is extracted by looking at the N most frequent amino acids in each
position of the profile. Another profile-based approach is proposed by Liu et al. in
a recent paper [134], where a profile-based sequence is derived by rewriting each
amino acid in the original sequence with the most probable one according to the
profile, and standard Ngrams (i.e. groups of N consecutive amino acids in this
new sequence) are extracted and used to classify sequences. In both cases, a bag of
words approach is employed: the feature vector is obtained by counting the number
of times each Ngram (or top-Ngram) occurs in the “profile-enriched” sequence.
9.3 Contributions
The profile-based approaches achieved state-of-the-art prediction performances, and therefore seem to hold high potential. However, it is important to observe that the profile information may be further exploited: in particular, in such approaches only a few amino acids of the profile are considered – one, in the approach of [134], N in the top-Ngram technique of [133]. Moreover, such approaches do not use the frequencies associated with the profile amino acids: for example, in the approach of [134], every sequence amino acid is replaced by the most frequent profile amino acid, regardless of how frequent it actually is; by doing so, there is no difference between a situation where a strong conservation throughout evolution is present (e.g. the frequency of the top amino acid is near 1, all the others are near 0) and a situation where this conservation is not present (e.g. the frequencies are more or less identical among different amino acids). The same reasoning also holds for the top-Ngram approach of [133]. We made a contribution by proposing a novel representation called soft Ngram, which is able to take all these aspects into consideration. Soft Ngrams are extracted from the profile of a sequence, explicitly considering and capturing the frequencies in the profile, thus reflecting the evolutionary history of the protein. Then, we propose two modeling approaches to derive feature vectors from the soft Ngram representation, employable as input for the SVM discriminative classifier. Starting from the bag-of-words model, we promote the use of topic models in the context of protein remote homology detection: we derived a soft PLSA model, which deals with the proposed characterization of sequences. In a thorough experimental evaluation, we demonstrated
on three benchmarks that the soft Ngram representation is more descriptive and
accurate than other profile-based approaches, being also superior to almost all the
approaches proposed in the literature.
Then, an alternative route is explored: the main idea is that the 3D structure of a protein (when available) represents a source of information which is typically disregarded by classical approaches. In particular, we provided some evidence that it is possible to improve sequence-based models by exploiting the available (even partial) 3D structures. The approach, based on topic models, allowed the derivation of a common, intermediate feature space – the topic space – which embeds sequences while being at the same time “structure aware”. We experimentally demonstrate that, in cases where the sequence modality alone fails, introducing only 10% of the training structures results in significant improvements in detection scores. Moreover, we applied the proposed approach to model a GPCR protein, finding evidence of structural correlations between sequence Ngrams: such correlations cannot be recovered employing a sequence-only technique. An interesting conclusion is that this multimodal scheme seems to be particularly suitable for those situations where the sequence modality alone fails.
10
Soft Ngram representation and modeling for
protein remote homology detection
This chapter presents the novel profile-based representation for sequences, called soft Ngram. This representation, which extends the traditional Ngram scheme, makes it possible to extract information from all the symbols in the profile, also considering the associated evolutionary frequencies: in practice, this is achieved by extracting Ngrams from the whole profile and equipping them with a weight directly computed from the corresponding evolutionary frequencies.
This chapter also illustrates two different approaches to model the proposed
representation and derive a feature vector, which can be effectively used for discriminative classification using a support vector machine (SVM). A thorough evaluation on three benchmarks demonstrates that the new approach outperforms
other Ngram-based methods, and achieves state-of-the-art results when compared
to a broader spectrum of techniques.
10.1 Profile-based Ngram representation
This section reviews the approaches of [133] and [134], which derive an Ngram representation on the basis of the profile. In both cases, the starting point is the profile of a sequence S = s_1 … s_L, which is represented by a matrix M:

M = \begin{pmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,L} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ m_{20,1} & m_{20,2} & \cdots & m_{20,L} \end{pmatrix}    (10.1)

where 20 is the total number of standard amino acids, L is the length of the sequence, and m_{i,l} reflects the probability of amino acid i (i = 1, …, 20) occurring at sequence position l (l = 1, …, L) across evolution. Thus, the elements in each column of M add up to 1.
Once the profile of a sequence is computed, the frequencies in each column of
M are sorted in descending order, with the resulting sorted matrix denoted M̃
(right part of figure 10.1b). An entry m̃i,l contains the frequency of the i-th most
probable amino acid in position l, which is then denoted s̃i,l . This matrix is then
employed to extract the Ngram representation. The two methods [133,134] employ
different strategies to extract Ngrams from the profile matrix M̃:
Column-Ngram [133]
In this approach, called top-Ngram in the original paper, each column of M̃ is considered independently. Given a column l, a column-Ngram is the concatenation of the N most probable amino acids in position l, and is denoted by v_l = s̃_{1,l} … s̃_{N,l}.
Row-Ngram [134]
In this approach, only the first row of M̃ is considered (i.e. only the most probable/frequent amino acid in each position of the profile): the original sequence is rewritten by substituting each amino acid with the corresponding most frequent amino acid of the profile. Then Ngrams are extracted as in other approaches [125], i.e. by considering N consecutive amino acids. Summarizing, a row-Ngram v_l is composed of amino acids s̃_{1,l} … s̃_{1,l+N−1} – note that neighboring Ngrams in the sequence overlap by N − 1 amino acids.
From the description above, it seems evident that neither of these approaches fully exploits the complete profile information contained in M̃: in both cases only a few amino acids of M̃ are considered – 1 for row-Ngrams, N for column-Ngrams. Moreover, the elements of M are used only to determine the ranking, completely disregarding the evolutionary information contained in the values of M: the approaches make no difference between a situation where a strong conservation throughout evolution is present (the top value of M̃ is near 1, all the others are near 0) and a situation where this conservation is not present (the values of M̃ are more uniformly distributed). We will see how both these aspects are considered in the proposed representation.
In any case, once extracted, the set of Ngrams of a given sequence is summarized with a bag of words vector, obtained by counting the number of times each possible Ngram appears in the sequence. More in detail, given all distinct Ngrams {v} – the dictionary – the bag of words c is defined as the vector of length V = |{v}| = 20^N, where an entry c(v) indicates the number of times the dictionary Ngram v is present in the set of Ngrams extracted from the sequence. This vector, computed for every sequence, is used for classification.
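The two extraction schemes can be summarized in the following sketch, assuming the profile M is a 20 × L NumPy array whose rows follow a fixed amino acid ordering (all helper names are ours):

    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def sort_profile(M):
        # sort each column of the 20 x L profile by decreasing frequency
        order = np.argsort(-M, axis=0)
        M_tilde = np.take_along_axis(M, order, axis=0)   # sorted frequencies
        S_tilde = np.array(list(AMINO_ACIDS))[order]     # corresponding residues
        return M_tilde, S_tilde

    def column_ngrams(S_tilde, n=2):
        # top-Ngram of [133]: the N most probable residues of each column
        return ["".join(S_tilde[:n, l]) for l in range(S_tilde.shape[1])]

    def row_ngrams(S_tilde, n=2):
        # approach of [134]: Ngrams read along the top row of the sorted profile
        top = "".join(S_tilde[0])
        return [top[l:l + n] for l in range(len(top) - n + 1)]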
10.2 The proposed approach
In this section the proposed approach is described: we first present the soft Ngram
representation and its major differences with the methods presented in the previous
section; then, the two modeling strategies to derive a fixed-length feature vector
are detailed. A summarizing scheme which graphically depicts the pipeline of the
proposed approach is shown in figure 10.1.
The basic idea behind the soft Ngram representation is that at a given position l of a sequence there are many plausible Ngrams, each with a different probability driven by evolution.

Fig. 10.1. The proposed soft Ngram representation. (a) The profile M of a sequence is computed with PSI-BLAST. (b) Each column in the profile is sorted, and soft Ngrams are extracted (row- or column-wise) from S̃. (c) After having built the dictionary, each soft Ngram weight vector w_l is computed for the sequence by combining frequency values in the sorted profile matrix M̃. (d) The final feature vector is derived with either of the two proposed modeling schemes – soft bag-of-words or soft PLSA.

To give an illustrative example, consider the situation where the Ngram size is 1. Assume that at a given position l of the sequence, the
top two amino acids are A with frequency 0.8, and R with frequency 0.2. Previous Ngram approaches would consider only one amino acid, A, which is the most probable. In our perspective, we consider two amino acids: A, with weight 0.8, and R, with weight 0.2. This makes it possible to encode the whole evolutionary information contained in the profile (taking into account also the frequency) and to discriminate between situations where the two top frequencies are different (e.g. if A and R had frequencies 0.6 and 0.4, respectively, standard profile-based approaches would again extract only A, whereas our representation treats these situations as different).
More in detail, the representation is obtained in two steps: Ngram extraction, and weight assignment.
Ngram extraction.
Ngrams are extracted by tailoring the previous definitions of column- and row-Ngrams in the following way:
• Soft column-Ngram. Ngrams are extracted from the whole column (not only from the top N positions): in particular, soft column-Ngrams are of the form v_{i,l} = s̃_{i,l} … s̃_{i+N−1,l}, ∀i ∈ [1, …, 20 − N + 1]. For each column, Ngrams are extracted with overlap degree N − 1.
• Soft row-Ngram. Ngrams are extracted from all possible rows of M̃: soft row-Ngrams are of the form v_{i,l} = s̃_{i,l} … s̃_{i,l+N−1}, ∀i ∈ [1, …, 20]. For each row, Ngrams are extracted with overlap degree N − 1.
Weight assignment.
The goal is to assign a weight to each soft column- or row-Ngram extracted in the previous step. Such a weight should reflect the evolutionary frequencies of the amino acids which compose it. Inspired by the score fusion technique of [117], we propose two simple strategies to compute this quantity – which we denote as w_l(v):
• Sum strategy, where the profile frequencies of the amino acids constituting the Ngram are summed;
• Prod strategy, where the profile frequencies of the amino acids constituting the Ngram are multiplied.
10.2.1 Modeling: soft bag of words
The goal of the modeling stage is to derive a single feature vector that characterizes any given sequence S^t. In this chapter we propose two methods, one called soft bag-of-words, the other soft PLSA. The former is presented in this section, whereas the latter is described in the next section.
In the classical bag of words representation (as said many times throughout this thesis), the feature vector is obtained by counting the number of times each Ngram of the dictionary occurs in the sequence. In our proposed soft bag-of-words, the feature vector is again obtained by a counting process, which however considers the weights: each Ngram extracted from the sequence does not count as “1”, but as much as its weight. In other words, for each soft Ngram of the dictionary, we
sum the weights of all its occurrences in the set of Ngrams extracted from the profile. In formulae, the entry c^t(v) of the feature vector characterizing sequence t is

c^t(v) = \sum_l w_l^t(v)    (10.2)

Once this quantity is computed for each element v in the dictionary, the final vector c^t represents the feature vector that can be used in a classification setting.
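A minimal sketch of the resulting soft bag-of-words computation, for the row-wise variant and reusing the sorted-profile arrays of the previous sketch (the soft_row_bag name is ours):

    from collections import defaultdict

    def soft_row_bag(M_tilde, S_tilde, n=2, strategy="prod"):
        # soft row-Ngrams: every row of the sorted profile yields overlapping
        # Ngrams, each weighted by the sum or product of its profile frequencies;
        # Eq. (10.2) then accumulates the weights instead of counting occurrences
        bag = defaultdict(float)
        rows, L = S_tilde.shape
        for i in range(rows):
            for l in range(L - n + 1):
                v = "".join(S_tilde[i, l:l + n])
                freqs = M_tilde[i, l:l + n]
                w = freqs.prod() if strategy == "prod" else freqs.sum()
                bag[v] += w  # c^t(v) = sum_l w_l^t(v)
        return bag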
10.2.2 Soft PLSA
A more sophisticated way of modeling the proposed representation stems from the already stressed consideration that objects represented as counts may be successfully modeled in a probabilistic way, using e.g. the already seen topic models. For example, the PLSA model presented in Chap. 4 seems a suitable model also in the context of protein remote homology detection: in this peculiar scenario, documents correspond to sequences and Ngrams correspond to words. From a probabilistic point of view, a sequence may be seen as a mixture of topics, each one providing a probability distribution over Ngrams. Standard Ngram representations [125], including the profile-based approaches of [133] and [134], can be directly modeled using PLSA. This model, however, cannot be applied as is to our proposed soft representation, due to the presence of weights. Here we propose an adaptation of PLSA, which we call soft PLSA, able to directly exploit the information contained in the weights. In essence, soft PLSA borrows the same metaphor as classic PLSA: given the set of soft Ngrams extracted from the profile of a sequence S^t, the presence of a particular soft Ngram v in such a set is mediated by a latent topic variable z ∈ Z = {z_1, …, z_K}. However, in our formulation the probability of observing a particular pair (v, S^t), i.e. the log-likelihood log p(w_l^t = v, S^t), is weighted by the soft value w_l^t(v):

\log p(w_l^t = v, S^t) = w_l^t(v) \cdot \log \left[ p(S^t) \left( \sum_{k=1}^{K} \beta_{vk} \theta_k^t \right) \right]    (10.3)
where β_{vk} = p(v|z_k) and θ_k^t = p(z_k|S^t). In practice, the topic z_k is a probabilistic co-occurrence of soft Ngrams encoded by the distribution β_{vk}. Intuitively, θ_k^t measures the level of presence of each topic z_k in the sequence S^t; on the other hand, β_{vk} expresses how much the soft Ngram indexed by v in the dictionary is related to topic z_k. Under this model, the full data log-likelihood for a training set of T sequences is

L = \sum_{t=1}^{T} \left[ \log p(d^t) + \sum_{v=1}^{V} \sum_{l=1}^{L_t} w_l^t(v) \log \sum_{k=1}^{K} \beta_{vk} \theta_k^t \right]
  = \sum_{t=1}^{T} \left[ \log p(d^t) + \sum_{v=1}^{V} c^t(v) \cdot \log \sum_{k=1}^{K} \beta_{vk} \theta_k^t \right]    (10.4)
where we highlighted the fact that the value c^t(v) is the sum over all weights assigned to the different occurrences of soft Ngram v in the sequence. Finally, p(d^t) accounts for sequences of different lengths.
Given the training set, we have to devise an algorithm which learns the parameters of the model, β and θ, such that the log-likelihood of the observations is maximized. Our learning strategy is based on the exact Expectation-Maximization (EM) algorithm, which, after initializing the parameters β and θ, iterates the following two steps:
• the E-step, which computes the posterior over the topics, q_{kvt} = p(z_k|v, d^t), given the current estimate of the model;
• the M-step, where β, θ and the prior over sequences p(S^t) are re-estimated given the q obtained in the previous E-step.
For a more detailed review of the EM algorithm, interested readers may refer to Chap. 2 or to [81]. In our context, the E-step formula is computed with the Bayes rule starting from the values of β and θ:

p(z_k | v, d^t) = q_{kvt} = \frac{\beta_{vk} \, \theta_k^t}{\sum_{k'=1}^{K} \beta_{vk'} \, \theta_{k'}^t}    (10.5)
The M-step rules for updating β and θ are as follows:

\beta_{vk} \propto \sum_{t=1}^{T} q_{kvt} \sum_{l=1}^{L} w_l^t(v)    (10.6)

\theta_k^t \propto \sum_{v=1}^{V} q_{kvt} \sum_{l=1}^{L} w_l^t(v)    (10.7)

p(S^t) \propto \sum_{v=1}^{V} \sum_{l=1}^{L} w_l^t(v)    (10.8)
where the symbol ∝ indicates that the result of each formula should be normalized so that the probability constraint (the sum should be 1) is satisfied. With the trained model, inference can be performed on an unknown sequence S^test in order to estimate its topic proportion vector θ^test; this quantity may be computed with a single M-step iteration.
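The EM recursions (10.5)–(10.7) can be condensed into the following NumPy sketch (ours, not the thesis implementation): it assumes the soft weights have already been aggregated into a T × V matrix C with C[t, v] = c^t(v), and omits the initialization heuristics and the p(S^t) update of Eq. (10.8):

    import numpy as np

    def soft_plsa_em(C, K, n_iter=50, seed=0):
        # C[t, v] holds the aggregated soft weights c^t(v) of Eq. (10.2)
        rng = np.random.default_rng(seed)
        T, V = C.shape
        beta = rng.random((V, K))
        beta /= beta.sum(axis=0, keepdims=True)   # p(v | z_k), columns sum to 1
        theta = np.full((T, K), 1.0 / K)          # p(z_k | S^t), uniform start
        for _ in range(n_iter):
            # E-step, Eq. (10.5): q[t, v, k] = p(z_k | v, S^t)
            q = beta[None, :, :] * theta[:, None, :]
            q /= q.sum(axis=2, keepdims=True) + 1e-12
            # M-steps, Eqs. (10.6)-(10.7): posteriors weighted by the soft counts
            weighted = q * C[:, :, None]
            beta = weighted.sum(axis=0)
            beta /= beta.sum(axis=0, keepdims=True)
            theta = weighted.sum(axis=1)
            theta /= theta.sum(axis=1, keepdims=True)
        return beta, theta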
Following again the hybrid generative-discriminative scheme [99, 173], we decided to employ as feature vector for a given sequence S^t the corresponding topic proportions vector θ^t = [θ_1^t, …, θ_K^t].
10.2.3 SVM classification
Once computed, the feature vectors ct (for the soft bag-of-words) or θt =
t
] (for the soft PLSA), can be used to face the protein remote homology
[θ1t , . . . , θK
detection problem; as done in many other remote homology detection systems,
the training feature vectors are input to learn a Support Vector Machine, which
is then used to classify the test protein sequences.
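As an illustration of this scheme, a minimal setup with scikit-learn’s SVC (a stand-in for the GIST implementation used later in the evaluation; the random features and labels below are placeholders for the actual θ vectors and family labels):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    theta_train, theta_test = rng.random((100, 20)), rng.random((40, 20))  # stand-in features
    y_train, y_test = rng.integers(0, 2, 100), rng.integers(0, 2, 40)      # stand-in labels

    clf = SVC(kernel="rbf").fit(theta_train, y_train)  # RBF kernel, as in the experiments
    scores = clf.decision_function(theta_test)
    print(roc_auc_score(y_test, scores))               # ROC score: area under the ROC curve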
[Fig. 10.2: tree with superfamilies F1, …, F23 and families f1, …, f54; labels mark the positive test set, the positive training set, and the randomly split negative train/test sets.]
Fig. 10.2. The tree shows the subdivision into families and superfamilies used to simulate remote homology in the SCOP dataset. In the example, the first protein family is highlighted as positive test set.
10.3 Experimental evaluation
10.3.1 Experimental details
The experimental evaluation is based on three benchmarks. The first one is a well-known dataset created from SCOP version 1.53 [80]. This dataset1 [128] contains 4352 sequences from 54 different protein families and 23 superfamilies, in a hierarchical structure like the one shown in Fig. 10.2. The idea is that – given a sequence – its remote homologues are the sequences in the same superfamily, but not in the same family. Thus, to simulate remote homology, 54 different subsets are created: in each of these, an entire target family is left out as positive testing set. Positive training sequences are selected from the other families belonging to the same superfamily (i.e. sharing remote homology), whereas negative examples are taken from different superfamilies and split between training and testing with the same proportions as the positive class. Class labels are very unbalanced, with a vast majority of objects belonging to the negative class: on average, the positive class (train + test) is composed of 49 sequences, whereas the negative one is composed of 4267.
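A minimal sketch of this leave-one-family-out protocol (the record format and the make_split helper are our own illustration):

    import random

    def make_split(records, target_family, seed=0):
        # records: list of (seq_id, family, superfamily) triples (hypothetical format)
        sf = next(s for _, f, s in records if f == target_family)
        pos_test = [r for r in records if r[1] == target_family]
        pos_train = [r for r in records if r[2] == sf and r[1] != target_family]
        negatives = [r for r in records if r[2] != sf]
        random.Random(seed).shuffle(negatives)
        # split negatives with the same test proportion as the positive class
        n_test = round(len(negatives) * len(pos_test) / (len(pos_test) + len(pos_train)))
        return pos_train, pos_test, negatives[n_test:], negatives[:n_test]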
The second dataset was created for our evaluation, starting from the observation that version 1.53 of SCOP is fairly outdated (September 2000); therefore, we downloaded sequences from the more recent SCOP 2.04, ensuring that all pairwise similarities have an E-value greater than 10^{-5}. In the end, a total of 8700 sequences was extracted. The subdivision is carried out with the same protocol as the SCOP 1.53 benchmark, resulting in 89 different subsets, each one corresponding to one particular protein family2.
Finally, we used a third dataset to assess the performances of our framework
in a more challenging task: in particular we employed a fold benchmark extracted
from SCOP 1.67 [90], where homologous sequences are taken at a superfamily level
1 Available at http://noble.gs.washington.edu/proj/svm-pairwise/
2 The dataset is available at http://www.pietrolovato.info/proj/softngrams.html
rather than at a family level – making this dataset considerably harder than the
SCOP 1.53 and SCOP 2.04 ones. The dataset contains 3840 sequences and is split
into 86 different subsets3.
The proposed soft Ngram approaches have been evaluated and compared
against the corresponding non-soft versions in different experimental conditions.
In particular, we performed different trials by varying the dictionary size – we considered 1grams, 2grams, and the concatenation of the two dictionaries: in this last
case the dictionary contains 420 distinct Ngrams. As for the second
model, the soft PLSA has been compared with the standard PLSA model, learned
on profile-based Ngrams. To the best of our knowledge, standard PLSA has never
been investigated for remote homology detection with profile-based representations. As detailed in the previous section, the models (both PLSA and soft PLSA)
are trained on the training set alone, and the feature vectors θ for testing sequences
are obtained via an inference step. Both models require the number of topics K
to be known beforehand. To set this parameter, we performed a coarse search,
finding that the most reasonable choice is to set it to ∼100. In all the experiments
we noticed that the learning is sensitive to the initial choice of the parameters
β and θ. In fact, the convergence of the EM algorithm to a good local optimum
depends on the choice of the starting point for the EM iterations [241]. A good
initialization is therefore crucial: following ideas contained in [73], we chose to initialize θ uniformly, i.e. θ_k^t = 1/K ∀k. To initialize β, we clustered sequences into K groups using the complete link algorithm for hierarchical clustering. This way, the k-th cluster groups together similar sequences: the average of their feature vectors, normalized so that it sums to 1, is the initialization for β_{v,k}.
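This clustering-based initialization can be sketched as follows (our illustrative init_beta helper, built on SciPy’s complete-link implementation; it assumes the cut yields K non-empty clusters):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def init_beta(X, K):
        # X: (T, V) training feature vectors; one initial topic per cluster
        labels = fcluster(linkage(X, method="complete"), t=K, criterion="maxclust")
        cols = [X[labels == k].mean(axis=0) for k in np.unique(labels)]
        beta = np.stack(cols, axis=1) + 1e-12          # avoid all-zero columns
        return beta / beta.sum(axis=0, keepdims=True)  # each column sums to 1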
As in many previous works [64, 65, 132–134, 187], classification is performed
using SVM via the public GIST implementation4 , setting the kernel type to radial
basis, and keeping the remaining parameters to their default values. Detection
accuracies are measured using the receiver operating characteristic (ROC) score
[88], which represents the area under the ROC curve (the larger this value the
better the detection).
10.3.2 Detection results and discussion
In the first set of experiments we compared the soft bag-of-words and the soft PLSA with the corresponding standard bag-of-words and PLSA models, on the SCOP 1.53 and SCOP 2.04 superfamily benchmarks. Averaged ROC scores, over all families, are presented in Tables 10.1 and 10.2, respectively. From the tables it can be observed that ROC scores are always higher when the soft representation is employed, showing that the considered information enriches the description of proteins and improves performance. To assess the statistical significance of our results, and to demonstrate that the increments in ROC score gained with the proposed approach are not due to mere chance, we performed a Wilcoxon signed-rank test with Bonferroni correction [133]: we found that in 52 out of 56 cases, the increased performances with soft Ngrams and soft PLSA are significant with p < 0.05. Additionally, we noticed that in many cases the product strategy works best
3 http://www.biomedcentral.com/1471-2105/8/23/additional
4 Downloadable from http://www.chibi.ubc.ca/gist/ [128]
Table 10.1. ROC scores computed on the SCOP 1.53 and SCOP 2.04 superfamily benchmarks, comparing the bag-of-words (BoW) and soft bag-of-words models.

SCOP 1.53
Dictionary          BoW     softBoW,sum   softBoW,prod
1-gram              0.906   0.923         0.923
row 2-gram [134]    0.930   0.944         0.941
col 2-gram [133]    0.929   0.950         0.933
row (1,2)-gram      0.947   0.940         0.944
col (1,2)-gram      0.947   0.957         0.934

SCOP 2.04
Dictionary          BoW     softBoW,sum   softBoW,prod
1-gram              0.923   0.937         0.937
row 2-gram [134]    0.953   0.958         0.960
col 2-gram [133]    0.949   0.961         0.947
row (1,2)-gram      0.952   0.952         0.958
col (1,2)-gram      0.958   0.959         0.956

(For 1-grams the sum and prod strategies coincide, since the weight of a 1-gram is simply its profile frequency.)
Table 10.2. ROC scores computed on the SCOP 1.53 and SCOP 2.04 superfamily benchmarks, comparing the PLSA and soft PLSA models.

SCOP 1.53
Dictionary          PLSA    softPLSA,sum   softPLSA,prod
1-gram              0.925   0.941          0.941
row 2-gram [134]    0.946   0.950          0.964
col 2-gram [133]    0.947   0.964          0.948
row (1,2)-gram      0.962   0.954          0.949
col (1,2)-gram      0.950   0.962          0.959

SCOP 2.04
Dictionary          PLSA    softPLSA,sum   softPLSA,prod
1-gram              0.939   0.951          0.951
row 2-gram [134]    0.963   0.965          0.970
col 2-gram [133]    0.959   0.970          0.960
row (1,2)-gram      0.964   0.955          0.966
col (1,2)-gram      0.967   0.963          0.971

(As in Table 10.1, the sum and prod strategies coincide for 1-grams.)
in combination with row-Ngrams, whereas the sum strategy works best with column-Ngrams: since multiplication implies statistical independence between amino acids, this may be a more reasonable assumption for the different amino acids within the same row.
Finally, in Table 10.3 we report comparative results with other approaches from the literature applied to the SCOP 1.53 benchmark. When compared to other
techniques that are based on Ngram counting, the proposed approach (by using
both soft BoW and soft PLSA) sets the best performance so far; looking at the
global picture, the table shows that, except in one case, the proposed approach
outperforms every state-of-the-art method.
In order to better investigate the behavior of the proposed framework, we
reported in Fig. 10.3 the ROC curves obtained on the SCOP 1.53 benchmark. To
Table 10.3. Average ROC scores for the 54 families in the SCOP 1.53 superfamily benchmark for different methods.

Method                          ROC     Reference
Soft Ngram (our best)           0.957   This chapter
Soft PLSA (our best)            0.964   This chapter

Ngram based methods
SVM-Ngram                       0.826   [65]
SVM-Ngram-LSA                   0.878   [65]
SVM-Top-Ngram (n=1)             0.907   [133]
SVM-Top-Ngram (n=2)             0.923   [133]
SVM-Top-Ngram-combine           0.933   [133]
SVM-Ngram-p1                    0.887   [134]
SVM-Ngram-KTA                   0.892   [134]

Other methods
SVM-pairwise                    0.896   [202]
SVM-LA                          0.925   [202]
Profile (5,7.5)                 0.980   [187]
SVM-Pattern-LSA                 0.879   [65]
SVM-Motif-LSA                   0.860   [65]
PSI-BLAST                       0.676   [65]
SVM-Bprofile                    0.921   [64]
SVM-PDT-profile (β=8,n=2)       0.950   [132]
HHSearch                        0.915   [132]
SVM-LA-p1                       0.958   [134]
draw the curves, we considered all 54 families at once: this means that the false
positive rate and the true positive rate are not relative to one particular family, but
rather they are an average over the different subsets. In each figure, we compared
the soft approach with its standard counterpart, reporting the area under the curve
in the legend. In every comparison, the proposed soft methods outperform their non-soft counterparts. Interestingly, there is a major boost when
1grams are employed. 1grams correspond to the amino acids readily available
from the profile, and are the core piece of information that we are considering;
this may suggest that exploiting all amino acids in the profile – along with their
corresponding frequency – is a key step in developing novel representations to ease
the remote detection problem.
Finally, in table 10.4 we reported results obtained on the SCOP 1.67 fold
benchmark, where the task is more challenging (detecting homologies at the superfamily level rather than at the family level). In the table, the best configuration
– achieved using row (1,2)-gram for soft BoW and soft PLSA – is reported. Even
in this difficult case, the proposed framework proved to be very effective, with our
soft PLSA approach setting a new state of the art.
[Fig. 10.3: five rows of ROC-curve panels (TPR vs. FPR), one per dictionary. Legend AUC values – 1gram: Soft BoW 0.930 vs. BoW 0.808; Soft PLSA 0.932 vs. PLSA 0.879. Row-2gram: Soft BoW 0.900 vs. BoW 0.850; Soft PLSA 0.944 vs. PLSA 0.932. Column-2gram: Soft BoW 0.939 vs. BoW 0.923; Soft PLSA 0.951 vs. PLSA 0.941. Row-(1,2)gram: Soft BoW 0.937 vs. BoW 0.844; Soft PLSA 0.946 vs. PLSA 0.939. Column-(1,2)gram: Soft BoW 0.939 vs. BoW 0.917; Soft PLSA 0.952 vs. PLSA 0.944.]
Fig. 10.3. ROC curves computed on the SCOP 1.53 dataset. In each subfigure, the proposed soft representation is compared with its standard counterpart.
Table 10.4. Average ROC scores for the 86 families in the SCOP 1.67 fold benchmark for different methods.

Method                          ROC     Reference
Soft Ngram (our best)           0.828   This chapter
Soft PLSA (our best)            0.861   This chapter

Ngram based methods
SVM-Top-Ngram (n=2)             0.813   [133]
SVM-Top-Ngram-combine-LSA       0.854   [133]

Other methods
PSI-BLAST                       0.501   [90]
SVM-pairwise                    0.724   [90]
SVM-LA                          0.834   [90]
Gpkernel                        0.844   [90]
Mismatch                        0.814   [90]
eMOTIF                          0.698   [90]
SVM-Bprofile (Ph=0.11)          0.804   [133]
SVM-Bprofile-LSA (Ph=0.11)      0.823   [133]
SVM-Nprofile-LSA (N=9)          0.823   [130]
11
A multimodal approach for protein remote
homology detection
Even if state-of-the-art approaches reach satisfactory accuracies on several benchmark datasets (e.g. the SCOP 1.53 dataset detailed in the previous chapter), there are still complex cases where they may perform poorly at the protein remote homology detection task. In such cases, it may be possible that information derived from other sources helps, provided that it is possible to properly integrate such (even partial) information into existing models. In the context of protein remote homology detection, there is a source of information which is typically disregarded by classical approaches: the available experimentally-solved, possibly few, 3D structures1. Now the question is: is it possible to improve sequence-based methods by integrating information derived from such 3D structures? In this chapter we
provide some evidence that this is possible, by deriving a multimodal approach2
for remote homology detection. We took inspiration from the multimodal image
and text retrieval context [103], where images are equipped with loosely related
narrative text descriptions, and retrieved by using textual queries. This scenario is
particularly interesting for our purposes, because it shares many similarities with our context: i) the link between the modalities is weak, partially hidden,
and, in general, difficult to infer; ii) most importantly, the context is asymmetric: one of the two modalities is richer than the other, yet being more difficult
or expensive to obtain – therefore fewer examples are typically available (it is
known that the number of experimentally-determined structures is one order of
magnitude lower than the number of known sequences). The goal is to develop an
approach which works directly on the weaker source of information (the sequence),
being however built taking into account the (possibly smaller) richer source (the
structure).
In this chapter we show that such multimodal point of view can be effectively
explored for protein remote homology detection: as said above, the richer modality
is represented by a (possibly small) subset of structures – retrieved from PDB –
which are used to derive a “structure-aware” model for sequences. Our multimodal
1 Some papers already show the potentialities which can be gained with structural information (see for example [95]); however, they are all based on 3D predictions made from sequences, therefore not using the true 3D structures found in PDB.
2 From a general point of view, a multimodal approach represents a technique aimed at solving a given task by integrating different sources of information.
approach, based on the recent [176], starts by encoding sequences and structures with a bag of words representation. In particular, sequences are described using counts of Ngrams (presented in the previous chapter); structures are described using counts of 3D fragments, as in [39]. Both representations are then modeled using topic models: we investigate here two models, the already presented PLSA [94] and the Componential Counting Grid (CCG) model [176]. The latter is a recent admixture extension of the Counting Grid whose use in the protein remote homology detection context has never been investigated.
For both models, we created an augmented model accounting for structural
information in two steps: i) a model (PLSA or CCG) for the available structures
is learned, creating a latent space which acts as a common, intermediate representation; ii) all the sequences are embedded into this space derived from structures.
Such embedding is determined by exploiting the (partial) available correspondences between sequences and structures.
The suitability of the proposed multimodal framework for protein remote homology detection has been evaluated in two ways: on one hand, we performed
various tests on the standard SCOP 1.53 benchmark [128], demonstrating that i)
the proposed framework permits drastic improvements in those scenarios where
sequence modality fails – even when only 10% of training sequences have their
corresponding structure; ii) on the whole benchmark (54 families), it favorably
compares with other recent approaches. On the other hand, we performed a thorough analysis on a member of the GPCR superfamily, suggesting that the proposed
multimodal approach can extract information that cannot be derived by employing
only sequence-based approaches.
11.1 Materials and methods
This section briefly summarizes the probabilistic models (in particular, CCG) employed in our approach.
To employ these models, a document should be represented with a bag of
words vector, where each entry n^t(w_i) counts the number of times a given word w_i occurs in a given document (indexed by t). In our biological scenario documents correspond to proteins, while basic building blocks (such as sequence Ngrams) are the observed words. Once learned, the topic models make it possible to represent all proteins in the topic space: even if in the protein case this space does not have a straightforward biological meaning3, it turns out to be really informative for protein comparison, as largely shown in [209]. In the following, we detail the Componential Counting Grid model (CCG, [176]).
11.1.1 Componential Counting Grid
The Componential Counting Grid (CCG – [176]), introduced in the context of Natural Language Processing, is a recent extension that combines the basic ideas of the
Counting Grid [104] with the “admixture” nature of PLSA (i.e. different words of
3 In some other cases a biological interpretation can be easily assigned, as in the gene expression case (see Chap. 4 and 5).
Fig. 11.1. Difference between CG and CCG models. In the generative process of the
CG, all words from a sample are generated from the same window in the grid. In the
CCG, words composing a sample are allowed to be generated from multiple windows.
a document may be drawn from different topics). Like the Counting Grid, the model stems from the fact that, in many text corpora, documents evolve into one another in a smooth way, with some words dropping out and new ones being introduced. For example, news stories change smoothly across the days, as certain evolving stories progressively fall out of novelty and new events create new stories. CCG introduces these topological constraints by arranging topics in a 2-dimensional grid; similar topics are placed nearby, so that they can be contained in a fixed-size window inside the grid. Contrary to the CG, where one document is assumed to be generated by only one window in the appropriate position, in the CCG different words in the same document may be generated from multiple windows. This difference is highlighted in Fig. 11.1.
More formally, the componential counting grid is a grid of discrete locations with fixed dimensions E = E_1 × E_2. Each location k = (x, y) is endowed with a distribution π_k over all V words, which acts exactly like the distribution p(w|z) of PLSA: π_k is a multinomial distribution describing the probability of each word given that location (i.e. a topic). To model smooth transitions between topics, CCG assumes that a word is not generated from the single distribution π_k of one grid position k (as in PLSA), but also considers the distributions in a neighborhood of k. In particular, a word in a document t is generated by i) choosing a location z_k from a multinomial distribution p(z | t) = θ_t (like the topic proportions of PLSA); ii) sampling from the average of all the π_k relative to a window of fixed dimensions W = W_1 × W_2 centered at z_k.
As detailed in [176], model parameters and hidden distributions are learned using a variational EM algorithm. Similarly to PLSA, the model is completely specified given the parameters α (Dirichlet prior over locations) and π. Again, given these quantities, inference on an unknown object makes it possible to recover the value of θ_t^new.
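The two-step generative process described above can be sketched as follows (illustrative code of ours; it assumes, as is common for counting grid models, that windows wrap around the grid edges):

    import numpy as np

    def sample_ccg_word(pi, theta, W, rng):
        # pi: (E1, E2, V) per-location word distributions; theta: (E1, E2) location
        # prior for the current document (sums to 1); W = (W1, W2) window size
        E1, E2, V = pi.shape
        flat = rng.choice(E1 * E2, p=theta.ravel())  # i) pick a location z_k ~ theta
        x, y = divmod(flat, E2)
        # ii) average the word distributions over the window centered at (x, y),
        # wrapping around the grid edges (toroidal grid assumption)
        xs = np.arange(x - W[0] // 2, x - W[0] // 2 + W[0]) % E1
        ys = np.arange(y - W[1] // 2, y - W[1] // 2 + W[1]) % E2
        window = pi[np.ix_(xs, ys)].reshape(-1, V).mean(axis=0)
        return rng.choice(V, p=window)               # sample the word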
11.2 The proposed approach
In this section the multimodal approach used to integrate structural and sequence information is explained. From a very general perspective, the main idea is the following (see Fig. 11.2): suppose we have a set of sequences {seq_i}; for some of them we also know the corresponding structures {struct_i}. Then, from the set of structures {struct_i} we determine a function f(struct) which projects all structures into a feature space (Fig. 11.2(a)). The goal is to determine a function g(seq) such that f(struct_i) ≡ g(seq_i) for all available structures (i.e. corresponding sequences and structures should share the same representation). The found function g can then be used to project any sequence into the common space, which is now built using structural information (Fig. 11.2(b)).
In order to realize this, we exploit an approach derived from the multimodal image-text retrieval literature [176], based on the topic models described in the previous section. Even if different alternatives exist [31, 103], in that retrieval context the approach proposed in [176] appeared to be simpler and more effective.
11.2.1 Data representation
To employ topic models, we have to define a bag of words representation for proteins, for both sequences and structures. For the sequence modality, we use Ngrams as words: more in detail, in all our experiments we used bigrams, i.e. subsequences composed of two consecutive amino acids. In the structural domain, we employed structural fragments as words, as proposed in [39]: each fragment is a list of 3D coordinates of consecutive Cα atoms in the backbone of the protein – in their original work, the authors provide different dictionaries of fragments. In our study, following other papers [39, 209], we employed the 400-11 dictionary (composed of 400 structural fragments, each of length 11).
In the end, we have two different dictionaries, one for each modality: a dictionary D_ST = {w_1^ST, …, w_{V_ST}^ST} for structures, and a dictionary D_SE = {w_1^SE, …, w_{V_SE}^SE} for sequences.
The inputs and data involved in our method are:
• A set of S pairs of corresponding sequence/structure counts (bags), for a subset of training proteins: {(ST_Tr^t, SE_Tr^t)}, t = 1, …, S, where ST_Tr^t = n^t(w_i^ST), i = 1, …, V_ST, and SE_Tr^t = n^t(w_i^SE), i = 1, …, V_SE.
• A set of T − S sequence bags, representing the training sequences without a corresponding 3D structure: {SE_Tr^{S+1}, …, SE_Tr^T}.
• A set of N testing sequence bags: {SE_Te^1, …, SE_Te^N}, where SE_Te^t = n^t(w_i^SE).

Fig. 11.2. The idea of the multimodal scheme.
11.2.2 Multimodal learning
The key idea of the proposed multimodal approach is that the latent topic space learned by PLSA (or CCG) establishes a common representation where both sequences and structures can be embedded. Since the two modalities are asymmetric (the structural one being the richer), we force this latent space to be driven by the (possibly few) structures. The proposed approach articulates into three major steps:
Topic model learning on structures. First of all, we learn a topic model (PLSA or CCG) using the available structure counts {ST_Tr^1, …, ST_Tr^S}: acknowledging the superiority of the structural modality, we force the topic space to be “structure-driven”.
For what concerns the learning, we already emphasized that choosing a good initialization for the parameters p(w|z) (π for CCG) is crucial – the typical random initialization may lead to poor local minima. To overcome this issue, we used the same initialization described in the previous chapter: we cluster words into Z groups (where Z is the number of topics) using the complete link algorithm, which performs an agglomerative clustering. Then, we initialize β (π) so that each topic has a high probability of generating the words inside its cluster, and a low probability of generating words outside the cluster.
At the end of this learning stage, each structure is characterized in the latent space by its corresponding vector θ_ST^t, t = 1, …, S.
Multimodal projection. In this step, we exploit the correspondences between structures and sequences, projecting the sequences into the latent space learned with structures in the previous step. We impose that the topic proportions $\theta^t_{SE}$ for the S training sequences are equal to the $\theta^t_{ST}$ obtained from the corresponding structures. In this way we establish a 1:1 mapping between the structural topics and the sequential topics. In practice, this is achieved by learning the PLSA/CCG model on the sequence counts keeping $\theta^t_{SE}$ fixed and set to $\theta^t_{ST}$. As a result, the parameters $\beta_{SE}$ and $\alpha_{SE}$ ($\pi_{SE}$ and $\alpha_{SE}$ for CCG) of the learned model are completely specified in the sequence domain; however, they have been learned taking into consideration the topic proportions derived from the model learned on structures.
Inference on the remaining training and testing sequences. For training proteins in the set $\{SE^{S+1}_{Tr}, \ldots, SE^{T}_{Tr}\}$, whose 3D structures are unknown, an inference step with the learned, enriched model can be performed to recover the topic proportions $\theta^t_{SE}$, $t = S+1, \ldots, T$. The same inference is performed on the testing sequences to derive $\theta^t_{SE}$ for $SE^t_{Te}$, $t = 1, \ldots, N$. As explained in the background section, inference is performed by keeping $\alpha$ and $\beta$ fixed ($\alpha$ and $\pi$ for CCG), and estimating $\theta^t_{SE}$ for the new samples.
To summarize, we propose to learn a topic model on the richer structural modality, and then embed the corresponding sequences in the same latent space, discovering the parameters that govern the Ngram (sequence word) distributions in a “structure-aware” sense.
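To make the projection step concrete, the following numpy sketch fits the sequence-side β of a PLSA model by EM while keeping θ fixed to the structure-derived proportions; the function name, iteration count, and initialization are illustrative assumptions, not the thesis implementation.

import numpy as np

def plsa_beta_given_theta(counts, theta, n_iter=100, eps=1e-12):
    """Fit the word-topic matrix beta of a PLSA model while keeping the
    document-topic proportions theta fixed (here: theta comes from the
    model learned on structures, counts are the sequence bigram bags).

    counts : (D, V) bag-of-words matrix
    theta  : (D, Z) fixed topic proportions, rows sum to 1
    returns beta : (Z, V) word distributions per topic, rows sum to 1"""
    D, V = counts.shape
    Z = theta.shape[1]
    beta = np.random.default_rng(0).dirichlet(np.ones(V), size=Z)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w) proportional to theta[d,z]*beta[z,w]
        joint = theta[:, :, None] * beta[None, :, :]   # (D, Z, V)
        joint /= joint.sum(axis=1, keepdims=True) + eps
        # M-step (beta only, theta stays fixed):
        # beta[z,w] proportional to sum_d counts[d,w] * p(z|d,w)
        beta = (joint * counts[:, None, :]).sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True) + eps
    return beta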
11.2.3 Classification scheme
In order to perform classification, we employed the generative embedding scheme [120], using the topic posterior $\theta^t$ as feature vector for training a discriminative classifier such as an SVM. SVMs are therefore trained using all the $\theta^t_{SE}$ ($t = 1, \ldots, T$) in the training set, whereas classification is carried out on the testing $\theta^t_{SE}$ ($t = 1, \ldots, N$).
11.3 Experimental evaluation
In this section the proposed approach is evaluated on the standard and widely used SCOP 1.53 benchmark [128] described in the previous chapter. In particular, we first perform a thorough analysis of two cases where the sole sequence modality fails, showing that drastic improvements can be obtained by the multimodal approach, even when using few structures; then we evaluate the proposed approach on the whole benchmark, in order to allow a clear comparison with alternative approaches in the state of the art.
As in the previous chapter, detection accuracies are measured using the receiver operating characteristic (ROC) score [88] (the larger this value, the better the detection).
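For reference, the ROC score is the area under the ROC curve computed from the classifier's ranking of the test sequences; a minimal sketch using scikit-learn (our tooling choice for illustration only, as the experiments below rely on libsvm and GIST; labels and scores are made-up numbers):

from sklearn.metrics import roc_auc_score

# y_true: 1 for family members (positives), 0 for non-members;
# scores: SVM decision values for the test sequences (hypothetical)
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, -0.2, 0.8, 0.1]
print(roc_auc_score(y_true, scores))  # 1.0 = perfect ranking, 0.5 = random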
11.3.1 First analysis: families 3.42.1.1 and 3.42.1.5
In this first part we performed a thorough analysis of two cases where the sequence modality fails (i.e. cases where a proper characterization of the family cannot be determined). In particular, we concentrate on families 3.42.1.1 and 3.42.1.5, on which models based solely on sequences obtain almost random accuracies. We applied the proposed multimodal scheme to these two families, starting from the corresponding 3D structures downloaded from the PDB. In particular, once the sequences and the structures are encoded as explained in the previous sections, the models (PLSA or CCG) are learned from the training set, in order to get the θs used to train the SVM; the θs for the testing set are then extracted via model inference. When using PLSA, taking inspiration from [65, 209], we set the number of topics to 100. For CCG, we exploited the concept of capacity [176], already defined for the Counting Grid: it measures how many non-overlapping windows can fit onto the grid, and can be equated to the number of topics of a topic model. We therefore set the CCG dimensions to E = [20, 20] and W = [2, 2], so that the capacity equals (20/2) × (20/2) = 100. After computing the θs, the classification has been carried out using the public libsvm implementation⁴ [43], employing the RBF kernel. The parameter C of the SVM has been set to $10^{-3}$ for every experiment, whereas the RBF parameter σ has been found by grid search (testing powers of 2: $[2^{-4}, \ldots, 2^{4}]$), retaining for each family the value performing best on average (reasonable values lie around $2^{-2}$).
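A sketch of this model selection step, assuming a held-out validation split and using scikit-learn's SVC in place of the libsvm command line (note that scikit-learn parameterizes the RBF kernel by γ rather than σ; all variable names are illustrative, and validation accuracy is used as the selection criterion here for simplicity):

import numpy as np
from sklearn.svm import SVC

def grid_search_sigma(theta_train, y_train, theta_val, y_val):
    best_sigma, best_acc = None, -np.inf
    for sigma in 2.0 ** np.arange(-4, 5):          # powers of 2: 2^-4 .. 2^4
        # scikit-learn's RBF kernel is exp(-gamma * ||x - x'||^2),
        # so gamma = 1 / (2 * sigma^2) matches the sigma parameterization
        clf = SVC(C=1e-3, kernel="rbf", gamma=1.0 / (2 * sigma**2))
        clf.fit(theta_train, y_train)
        acc = clf.score(theta_val, y_val)
        if acc > best_acc:
            best_sigma, best_acc = sigma, acc
    return best_sigma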
In order to get a complete understanding of the proposed approach, we also assessed the performance when only a limited number of structures is available for learning. In particular, we used an increasing fraction of randomly chosen structures to build the structure model. Since there is a very limited number of positive examples (29 for the first family, 26 for the second), we decided to always consider all of them, sampling the negative training examples at random. The structure model is then transferred to the sequence model; inference on the enriched sequence model finally provides descriptors for all training and testing sequences, to be used by the SVM classifier. Detection results, for fractions ranging from 0.1 to 1 (i.e. all training structures), are averaged over 50 runs and reported in Figure 11.3, for both the PLSA and CCG models.
⁴ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Fig. 11.3. Detection scores displayed as a function of the number of structures used in the multimodal approach. “mmPLSA” (“mmCCG”) stands for the proposed multimodal approach using the PLSA (CCG) model. Filled markers indicate statistically significant improvements over the baseline. Results are reported for (left) family 3.42.1.1 and (right) family 3.42.1.5.
We also determined whether the improvement gained with the proposed multimodal approach is statistically significant, using a standard t-test with alternative hypothesis “multimodal results are greater than the baseline”. In Figure 11.3, filled markers indicate statistical significance at level α = 0.05.
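Such a one-sided test can be sketched as follows (the run scores are illustrative numbers, not thesis data; we assume an unpaired test, and scipy's `alternative` keyword requires scipy ≥ 1.6):

import numpy as np
from scipy.stats import ttest_ind

# ROC scores over the random runs (illustrative values only)
roc_mm   = np.array([0.91, 0.88, 0.93, 0.90, 0.92])
roc_base = np.array([0.84, 0.86, 0.83, 0.85, 0.82])

# one-sided test: H1 = "multimodal results are greater than the baseline"
stat, pval = ttest_ind(roc_mm, roc_base, alternative="greater")
print(pval < 0.05)  # True -> filled marker in Fig. 11.3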
From these plots it seems evident that the use of structural information permits the derivation of a better sequence model: in both families, CCG achieves significant improvements when employing only 10% of all training structures. For the second family, even if multimodal PLSA accuracies are higher than the baseline, statistical significance is obtained only when 80% or more of the structures are employed. When all training structures are considered, the improvement is rather large for both models.
When comparing the two probabilistic models, it appears evident that the Componential Counting Grid outperforms the PLSA model, both when used on the sequence modality alone and when employed in a multimodal framework. Such a model, never before used in the context of protein remote homology detection, permits the derivation of a better and more discriminant description of count data, as outlined in [176] for other application fields.
11.3.2 Second analysis: all families
In this second analysis, the proposed approach has been tested on all the families of the SCOP dataset; this is particularly important for comparing the proposed scheme with the state of the art. In this case we slightly changed some details of our experimental pipeline: in particular, since we are dealing with 54 different classification problems (i.e. 54 families), we did not fix a single number of topics, but let it vary in a reasonable range, keeping the best value. Moreover, in order to be fully comparable with many works in the state of the art [64, 65, 132–134, 187], the classification is performed using SVM via the public GIST implementation⁵, setting the kernel type to radial basis and keeping the remaining parameters at their default values.
Method                          ROC    Reference
Monomodal PLSA                  0.921  This chapter
Monomodal CCG                   0.903  This chapter
Multimodal PLSA                 0.925  This chapter
Multimodal CCG                  0.932  This chapter

Ngram-based methods
SVM-Ngram                       0.826  [65]
SVM-Ngram-LSA                   0.878  [65]
SVM-Top-Ngram (n=1)             0.907  [133]
SVM-Top-Ngram (n=2)             0.923  [133]
SVM-Top-Ngram-combine           0.933  [133]
SVM-Ngram-p1                    0.887  [134]
SVM-Ngram-KTA                   0.892  [134]

Other methods
SVM-pairwise                    0.896  [202]
SVM-LA                          0.925  [202]
Profile (5,7.5)                 0.980  [187]
SVM-Pattern-LSA                 0.879  [65]
SVM-Motif-LSA                   0.860  [65]
PSI-BLAST                       0.676  [65]
SVM-Bprofile                    0.921  [64]
SVM-PDT-profile (β=8,n=2)       0.950  [132]
HHSearch                        0.915  [132]
SVM-LA-p1                       0.958  [134]

Table 11.1. Average ROC scores for the 54 families in the SCOP 1.53 superfamily benchmark for different methods.
Results are presented in Table 11.1, in comparison with the literature; in particular, the state of the art is split into methods which employ Ngrams (Ngram-based methods) and methods which do not (Other methods). From the table it can be observed that the framework is rather accurate: when compared with the other Ngram-based methods, our best result outperforms all other approaches (except SVM-Top-Ngram-combine [133], which however combines different Ngram representations). Moreover, the proposed multimodal technique also compares reasonably well with other, more complex approaches. Interestingly, CCG outperforms PLSA only when used in a multimodal framework.
⁵ Downloadable from http://www.chibi.ubc.ca/gist/ [128]
11.4 Multimodal analysis of bitter taste receptor TAS2R38
The main goal of this section is to qualitatively validate the proposed multimodal scheme in a real scenario. In particular, we focus on a specific protein (the bitter taste receptor TAS2R38 [40, 116]) belonging to the G-protein coupled receptor (GPCR) superfamily. This large group (with over 900 members in humans alone) of cell-signaling membrane proteins is of major importance for drug development, as GPCRs are among the primary targets currently under investigation [163].
From our perspective, this context is very interesting for three reasons: i) sequence identities between members of different GPCR families are extremely low, making the detection of remote homologues very challenging; ii) only 24 unique human GPCRs⁶ had an experimentally determined structure as of January 2015 (i.e. very little structural information is available); iii) most importantly, it has already been shown that the closest homologue of the TAS2R38 receptor (as returned by standard programs for sequence search, without manual intervention) does not represent a good template for unraveling structural/functional elements (in particular, regarding the active site and the specific residues involved in ligand binding) [20].
We show here that our multimodal approach can be used to suggest an alternative template. We support this template by providing some elements showing that the obtained multimodal model captures structural/functional information. To do so, a multimodal PLSA (with 3 topics⁷) has been trained, using all sequences and the 24 known structures (downloaded from the PDB): as a result, all GPCR sequences are embedded in the topic probability θ space. The query TAS2R38 sequence is embedded in the same space via inference on the model: the nearest neighbor with known structure represents the suggested template. In this case it is the N/OFQ Opioid Receptor (PDB id: 4EA3). On the contrary, if we perform the same analysis with the single-modality PLSA, we obtain as nearest neighbor the CCR5 chemokine receptor (PDB id: 4MBS); as described above, modeling TAS2R38 using this template alone does not allow a correct characterization of the binding cavity of the receptor [20].
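The template-suggestion step reduces to a nearest-neighbor query in the topic space; a minimal sketch, where the Euclidean metric and all names are our own assumptions:

import numpy as np

def suggest_template(theta_query, theta_known, pdb_ids):
    """Suggest a structural template for a query sequence: embed everything
    in the topic space and return the PDB id of the nearest neighbor among
    the proteins with a known structure (here, 4EA3 for the TAS2R38 case).

    theta_query : (Z,) topic posterior of the query sequence
    theta_known : (K, Z) topic posteriors of the K solved structures
    pdb_ids     : list of K PDB identifiers"""
    dists = np.linalg.norm(theta_known - theta_query, axis=1)
    return pdb_ids[int(np.argmin(dists))]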
To validate the new template, we mine the obtained multimodal model, in order to see whether the information it contains exhibits structure-driven importance. To do so, we analyze, for every topic, the 5 most probable Ngrams (as given by the distribution β), trying to understand whether they are related to positions in the two proteins that are important from a structural point of view. We found that some of these Ngrams (shown in the top part of Fig. 11.4, together with the topic probabilities θ of the query and of the corresponding nearest neighbor) represent words located with primary importance in the binding cavity of both proteins – critical residues already shown to be involved in ligand recognition on our query TAS2R38 [146]. If we repeat the same analysis using a PLSA model built using only sequences (central part of Fig. 11.4), no evident structural or functional information can be derived, suggesting that the N/OFQ Opioid Receptor, being obtained with a more “structure-aware” model, can represent a valid alternative to the CCR5 chemokine receptor.
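Extracting the per-topic Ngram rankings used in this analysis amounts to sorting the rows of β; a small illustrative helper (names are our own, not thesis code):

import numpy as np

def top_ngrams(beta, dictionary, k=5):
    """For every topic, list the k most probable Ngrams according to the
    word-topic distribution beta (Z x V); `dictionary` maps each column
    index to the corresponding Ngram string."""
    return [[dictionary[i] for i in np.argsort(row)[::-1][:k]]
            for row in beta]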
⁶ The list of such proteins is obtained from http://blanco.biomol.uci.edu/mpstruc/
⁷ In this case we had to drastically reduce the number of topics, since only 24 structures are available – the topic space is built using the structural information.
Fig. 11.4. In the top part of the figure, the first 5 Ngrams (sorted in descending order w.r.t. their β probabilities) for each topic are listed. The highlighted Ngrams are known to occur in the binding site locations of either of the two proteins. To the right, the θ distributions (with 3 topics) are displayed for the query TAS2R38 and its closest neighbor. In the central part of the figure, we visualize the same information employing the PLSA in a single-modal way. Finally, in the bottom part of the figure, the same information has been extracted with the multimodal approach employing both real and predicted structures. Interestingly, adding such predicted structures deteriorates the qualitative results obtained by the multimodal scheme.
A final experiment has been carried out in order to investigate whether it may be possible, in cases like this where very few structures are available, to enlarge the structural information of the training set by also using predicted 3D structure models⁸. To test this, we applied the proposed multimodal approach, enlarging the training set with the predicted structures of different proteins belonging to the TAS2R group (24 GPCR models, downloaded from http://zhanglab.ccmb.med.umich.edu/GPCR-HGmod/). Results are displayed in the bottom part of Fig. 11.4: even if we obtain the same suggested template (the N/OFQ Opioid Receptor – PDB id: 4EA3), the quality of the multimodal space seems worse than that of the true multimodal approach. It seems that adding predicted models does not help the proposed approach but, on the contrary, adds some noise. This was somewhat expected, and confirms the intuition gained from the other quantitative experiments: the full exploitation of the proposed framework relies on a small but extremely informative piece of information (as real structures are, compared to simulated ones).
In conclusion, the availability of a method that, by augmenting the descriptive power of a sequence-based model, is able to predict relevant structural positions (i.e. those involved in ligand binding) is a fundamental step for setting up the modeling protocol when no 3D experimental information is available. In the studied case, the information obtained using our approach could be essential for guiding the selection of better and biologically relevant target-template alignments.
⁸ For example those obtained using http://zhanglab.ccmb.med.umich.edu/GPCR-HGmod/
Final remarks on protein remote homology detection
In this part of the thesis we addressed protein remote homology detection, where some bag of words approaches have already proved successful in the literature. However, they could be further exploited: in Chap. 10, we derived a novel bag of words representation – which we dubbed Soft Ngram – extracted from the profile of a sequence, explicitly considering and capturing the frequencies in the profile, thus reflecting the evolutionary history of the protein. We proposed two modeling approaches to derive feature vectors from the soft Ngram representation, usable as input for the SVM discriminative classifier. Starting from the bag of words model, we promoted the use of topic models in the context of protein remote homology detection: we derived a soft PLSA model that deals with the proposed characterization of sequences. In a thorough experimental evaluation, we demonstrated on three benchmarks that the soft Ngram representation is more descriptive and accurate than other profile-based approaches, and is also superior to almost all the approaches proposed in the literature. Looking back at the pipeline proposed in Chap. 2, this chapter contributed to every aspect of the pipeline: in the “what to count” stage, by considering every Ngram in the sequence profile to build the dictionary; in the “how to count” stage, by associating with the count value a probability that reflects evolutionary conservation; and in the “how to model” stage, by deriving a novel, soft PLSA topic model.
In Chap. 11, we investigated a multimodal approach for protein remote homology detection. In particular, we provided evidence that it is possible to improve sequence-based models by exploiting the available (even partial) 3D structures. The approach, based on topic models, allowed the derivation of a common, intermediate feature space – the topic space – which embeds sequences while being at the same time “structure-aware”. We experimentally demonstrated that, in cases where the sequence modality alone fails, introducing only 10% of the training structures results in significant improvements of the detection scores. Moreover, we applied the proposed approach to model a GPCR protein, finding evidence of structural correlations between sequence Ngrams: such correlations cannot be recovered with a sequence-only technique.
12 Conclusions and future works
This thesis investigated and promoted the bag of words paradigm for representing
and approaching problems in the wide field of Bioinformatics. The bag of words is
a vector representation particularly appropriate when the pattern is characterized
(or assumed to be characterized) by the repetition of basic, “constituting” elements
called words. By assuming that all possible words are stored in a dictionary, the
bag of words vector for one particular object is obtained by counting the number
of times each element of the dictionary occurs in the object.
The bag of words is particularly suited to bioinformatics for a twofold reason: on the one hand, it can be well justified by our current understanding of biology, where a bag with no structure is precisely what we are able to observe (thus one of the main drawbacks of the representation – that it destroys the object structure – is relaxed and alleviated). On the other hand, some Bioinformatics problems are inherently formulated as counting: we mentioned, for example, that measuring gene expression means counting the number of mRNA molecules.
In this general picture, this thesis has been devoted to demonstrating that bag of words representations and models can be conveniently exported and employed in different scenarios of the Bioinformatics domain, and that bag of words approaches can have a significant impact on the Bioinformatics and Computational Biology communities. More in detail, the main contributions of this thesis are:
• The proposal of a possible formalization of the bag of words paradigm to represent and model objects, by means of a detailed pipeline that can be employed to face a problem using a bag of words approach.
• The identification of three scenarios where bioinformatics problems can be effectively faced from a bag of words perspective, proposing different contributions at different levels of the pipeline:
– Gene expression analysis: in this context, this thesis contributed by recognizing the bag of words representation in gene expression data, and subsequently by investigating the capabilities of topic models for the classification of gene expression experiments. More than this, we provided several considerations on the interpretability of the obtained results, using a real dataset involving different species of grapevine, resulting from a collaboration with the Functional Genomics Lab at the University of Verona. Encouraged by the promising results, we performed a comprehensive evaluation of a more recent and powerful topic model, the Counting Grid, which copes with a possible drawback of classic topic models, namely that topics (i.e. biological processes, in this context) act independently of each other. We promote the use of the CG model as an effective tool for visualization, gene selection, and classification of gene expression samples.
– HIV infection modeling: in this context, this thesis argued for the usage of bag of words representations and models for analyzing aspects of the HIV infection in humans, focusing on i) the patient's bag of epitopes, which we found to be correlated with the patient's HIV status by employing and tailoring the Counting Grid model for this purpose, and ii) the patient's bag of TCRs, for which robust statistics measuring the diversity of samples have been thoroughly evaluated, using a dataset derived from a collaboration with the David Geffen School of Medicine, UCLA. As a second contribution, a principled way of assessing the reliability of the bag of words has been devised.
– Protein remote homology detection: in this context, this thesis contributed by proposing a novel bag of words approach to characterize protein sequences, fully integrating evolutionary information in the representation: each word has been equipped with a weight that encodes its conservation across evolution. Moreover, a novel probabilistic model able to handle the presence of this weight associated with each word has been developed. A second contribution aimed at properly integrating into existing models partial information derived from other sources. In particular, there is a source of information that is typically disregarded by classical approaches: the available experimentally solved, possibly few, 3D structures of proteins. A multimodal approach for protein remote homology detection has therefore been derived, which permits the integration of the possibly few available 3D structures into a model. A validation using standard benchmarks confirms the potential of the proposed approach, as does a qualitative analysis performed in collaboration with the Applied Bioinformatics group (University of Verona) on a real dataset of GPCR proteins.
For each scenario, motivations, advantages, and challenges of the bag of words representations have been addressed, together with possible solutions that have been thoroughly experimentally evaluated, exploiting literature benchmarks as well as datasets derived from direct interactions with clinical and biological laboratories and research groups (the Functional Genomics Lab and Applied Bioinformatics group at the University of Verona, and the David Geffen School of Medicine at UCLA).
The work done in this thesis paves the way for further studies, aimed at approaching novel bioinformatics challenges from a bag of words perspective. On the one hand, we identified the characteristics of a problem that make the bag of words representation particularly suited. On the other hand, we demonstrated that bag of words models can be extremely versatile, and can be tailored to a vast range of tasks (visualization, classification, clustering, interpretation, feature selection, statistical analysis, and reliability assessment).
Further contributions could also be aimed at improving the existing results obtained in the scenarios addressed in this thesis: all the approaches we proposed in the specific bioinformatics contexts open new perspectives.

Fig. 12.1. It is possible to integrate the soft PLSA model – portrayed in (a) – with “biologically-aware” similarity measures for Ngrams (such as those derived from a sequence alignment), and to enhance the model (b) by clustering (soft) Ngrams and assigning a label c to each of them.
More in detail, in the context of gene expression it is possible to propose novel topic models, enriching and extending existing ones to address the specific gene expression scenario: for example, it is possible to integrate gene dependencies known a priori (preliminarily investigated in [180]) to better model the gene-topic distributions, leading to a better characterization of samples. Another research line can be devoted to boosting the gene selection technique and enhancing the interpretability of the Counting Grid by applying a sparse regressor (such as the LASSO [226]) to the $q_{kt}$ distribution (the p(z|s) of the PLSA model); a sketch of this idea is given below. The LASSO can be exploited to learn the most discriminative locations in the CG space, which has already been shown to provide a good embedding of samples. Then, an analysis of the genes most prominent in these locations may better highlight gene expression patterns that are associated with a disease. Finally, one line of research that has been only marginally investigated is biclustering, where the bag of words could be effectively employed [22].
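A possible sketch of the LASSO-based location selection mentioned above, assuming `q` holds the per-sample grid proportions and `y` the class labels coded as 0/1 (all names and the penalty value are illustrative; this is a speculative future-work idea, not an implemented thesis method):

import numpy as np
from sklearn.linear_model import Lasso

def discriminative_locations(q, y, alpha=0.01):
    """Fit a LASSO regression of the labels on the CG embedding
    (q: n_samples x n_locations matrix of q_kt values, i.e. p(z|s));
    locations with non-zero coefficients are the most discriminative."""
    model = Lasso(alpha=alpha).fit(q, y)
    return np.flatnonzero(model.coef_)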
In the context of HIV modeling, we plan to study probabilistic models of epitope co-presentation for a broader spectrum of tasks, from correcting association studies, to detecting patients or populations that are likely to react similarly to an infection, to rational vaccine design. In addition, a more comprehensive evaluation of the reliability technique on benchmark datasets of TCR sequences is currently under consideration.
In the context of protein remote homology detection, a future work along this line of research is directed toward the study of more sophisticated models for the Soft Ngram representation. For example, it may be possible to take into account a sequence similarity measure between Ngrams: if two sequences have similar soft bag of words representations, it could be interesting to check whether the observed differences arise from Ngram substitutions that are likely to occur in nature (this information being encoded in the substitution matrix). This dependence may be introduced in the Bayesian network of the soft PLSA through a variable, shown in Fig. 12.1, modeling a clustering of Ngrams. Moreover, we are currently studying more robust multimodal approaches, which can for example learn how to move from the structure space to the sequence space. As a final consideration, we believe that one of the most important trends in current Bioinformatics is the integration of information from heterogeneous sources, and the bag of words can provide a common representation for all of these sources.
In conclusion, this thesis demonstrated the possibility of facing some Bioinformatics problems from a bag of words perspective. More than that, we gathered evidence that this paradigm can be successfully exported to many other biological contexts, and can help biomedical experts gain a deeper understanding of their specific problems.
References
1. A.K. Abbas, A.H.H. Lichtman, and S. Pillai. Basic immunology: functions and
disorders of the immune system. Elsevier Health Sciences, 2012.
2. T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys. Robust biomarker
identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26:392–398, 2010.
3. M. Aharon, M. Elad, and A. Bruckstein. K-svd: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
4. B. Alberts, A. Johnson, J. Lewis, D. Morgan, M. Raff, K. Roberts, and P. Walter.
Molecular biology of the cell. Garland Science, 6 edition, 2014.
5. A.A. Alizadeh, M.B. Eisen, E. Davis, C. Ma, I. Lossos, A. Rosenwald, J.C. Boldrick,
H. Sabet, T. Tran, and X. Yu. Distinct types of diffuse large b-cell lymphoma
identified by gene expression profiling. Nature, 403(6769):503–511, 2000.
6. S. Alizon, V. von Wyl, T. Stadler, R.D. Kouyos, S. Yerly, B. Hirschel, J. Böni,
C. Shah, T. Klimkait, and H. Furrer. Phylogenetic approach reveals that virus
genotype largely determines hiv set-point viral load. PLoS pathogens, 6(9):e1001123,
2010.
7. U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine.
Broad patterns of gene expression revealed by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. Proceedings of the National
Academy of Sciences, 96(12):6745–6750, 1999.
8. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local
alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.
9. S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J.
Lipman. Gapped blast and psi-blast: a new generation of protein database search
programs. Nucleic acids research, 25(17):3389–3402, 1997.
10. S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. den Boer, M.D.
Minden, S.E. Sallan, E.S. Lander, T.R. Golub, and S.J. Korsmeyer. Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia.
Nature genetics, 30(1):41–47, 2001.
11. M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P.
Davis, K. Dolinski, S.S. Dwight, and J.T. Eppig. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000.
12. H.U. Asuncion, A.U. Asuncion, and R.N. Taylor. Software traceability with topic
modeling. In Proc. of the 32nd ACM/IEEE Int. Conference on Software Engineering, volume 1 of ICSE ’10, pages 95–104, 2010.
13. F. Balkwill. Cancer and the chemokine network. Nature Reviews Cancer, 4(7):540–
550, 2004.
14. C.H. Bassing, W. Swat, and F.W. Alt. The mechanism and regulation of chromosomal v(d)j recombination. Cell, 109(2):S45–S55, 2002.
15. P.D. Baum, J.J. Young, D. Schmidt, Q. Zhang, R. Hoh, M. Busch, J. Martin,
S. Deeks, and J.M. McCune. Blood t-cell receptor diversity decreases during the
course of hiv infection, but the potential for a diverse repertoire persists. Blood,
119(15):3469–3477, 2012.
16. A.D. Baxevanis and B.F.F. Ouellette. Bioinformatics: a practical guide to the analysis of genes and proteins, volume 43. Wiley, 2004.
17. T. Beißbarth and T.P. Speed. Gostat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics, 20(9):1464–1465, 2004.
18. H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N.
Shindyalov, and P.E. Bourne. The protein data bank. Nucleic Acids Research,
28(1):235–242, 2000.
19. A. Bhattacharjee, W.G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd,
J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E.J. Mark, E.S. Lander, W. Wong, B.E. Johnson, T.R. Golub, D.J. Sugarbaker, and M. Meyerson.
Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences,
98(24):13790–13795, 2001.
20. X. Biarnés, A. Marchiori, A. Giorgetti, C. Lanzara, P. Gasparini, P. Carloni, S. Born,
A. Brockhoff, M. Behrens, and W. Meyerhof. Insights into the binding of phenyltiocarbamide (ptc) agonist to its target human tas2r38 bitter receptor. PLoS ONE,
5(8):e12394, 2010.
21. M. Bicego, M. Cristani, V. Murino, E. Pekalska, and R.P.W. Duin. Clustering-based construction of hidden markov models for generative kernels. In Energy
Minimization Methods in Computer Vision and Pattern Recognition, pages 466–
479, 2009.
22. M. Bicego, P. Lovato, A. Ferrarini, and M. Delledonne. Biclustering of expression microarray data with topic models. In Proc. of Int. Conference on Pattern
Recognition (ICPR), pages 2728–2731, 2010.
23. M. Bicego, P. Lovato, B. Oliboni, and A. Perina. Expression microarray classification using topic models. In ACM symposium on applied computing (SAC), pages
1516–1520, 2010.
24. M. Bicego, P. Lovato, A. Perina, M. Fasoli, M. Delledonne, M. Pezzotti, A. Polverari, and V. Murino. Investigating topic models’ capabilities in expression microarray
data classification. IEEE/ACM Tran. on Computational Biology and Bioinformatics, 9(6):1831–1836, 2012.
25. M. Bicego, A. Perina, V. Murino, A. Martins, P. Aguiar, and M. Figueiredo. Combining free energy score spaces with information theoretic kernels: Application to
scene classification. In Proc. of Int. Conference on Image Processing (ICIP), pages
2661–2664, 2010.
26. M. Bicego, A. Ulaş, U. Castellani, A. Perina, V. Murino, A.F.T. Martins, P.M.Q.
Aguiar, and M.A.T. Figueiredo. Combining information theoretic kernels with generative embeddings for classification. Neurocomputing, 101:161–169, 2013.
27. I. Bieche, F. Lerebours, S. Tozlu, M. Espie, M. Marty, and R. Lidereau. Molecular profiling of inflammatory breast cancer: Identification of a poor-prognosis gene
expression signature. Clinical Cancer Research, 10(20):6789–6795, 2004.
28. C.M. Bishop. Pattern recognition and machine learning. Springer New York, 2006.
29. D. Blei and J. Lafferty. Correlated topic models. Advances in neural information
processing systems, 18:147, 2006.
30. D.M. Blei. Probabilistic topic models. Communications of ACM, 55(4):77–84, 2012.
31. D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
32. W.D. Blizard. Multiset theory. Notre Dame Journal of formal logic, 30(1):36–66,
1988.
33. A. Bosch, A. Zisserman, and X. Munoz. Scene classification via plsa. In Proc. of
European Conference on Computer Vision, volume 4, pages 517–530, 2006.
34. A-L. Boulesteix. Pls dimension reduction for classification with microarray data.
Statistical Applications in Genetics and Molecular Biology, 3(1), 2004.
35. A-L. Boulesteix and K. Strimmer. Partial least squares: a versatile tool for the
analysis of high-dimensional genomic data. Briefings in bioinformatics, 8(1):32–44,
2007.
36. A-L. Boulesteix, C. Strobl, T. Augustin, and M. Daumer. Evaluating microarray-based classifiers: an overview. Cancer Informatics, 6:77, 2008.
37. G. Brelstaff, M. Bicego, N. Culeddu, and M. Chessa. Bag of peaks: interpretation
of nmr spectrometry. Bioinformatics, 25(2):258–264, 2009.
38. P.O. Brown and D. Botstein. Exploring the new world of the genome with dna
microarrays. Nature Genetics, 21:33–37, 1999.
39. I. Budowski-Tal, Y. Nov, and R. Kolodny. Fragbag, an accurate representation of
protein structure, retrieves structural neighbors from the entire pdb quickly and
accurately. Proceedings of the National Academy of Sciences, 107(8):3481–3486,
2010.
40. B. Bufe, P.A.S. Breslin, C. Kuhn, D.R. Reed, C.D. Tharp, J.P. Slack, U-K. Kim,
D. Drayna, and W. Meyerhof. The molecular basis of individual differences in
phenylthiocarbamide and propylthiouracil bitterness perception. Current Biology,
15(4):322–327, 2005.
41. N.A. Campbell and J.B. Reece. Biology. Sixth edition. Pearson, 2002.
42. U. Castellani, A. Perina, V. Murino, M. Bellani, G. Rambaldelli, M. Tansella, and
P. Brambilla. Brain morphometry by probabilistic latent semantic analysis. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010, pages
177–184. 2010.
43. C-C. Chang and C-J. Lin. LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
44. J. Chang, S. Gerrish, C. Wang, J.L. Boyd-graber, and D.M. Blei. Reading tea leaves:
How humans interpret topic models. In Advances in Neural Information Processing
Systems 22, pages 288–296. 2009.
45. C-C. Chen and L.F. Lau. Functions and mechanisms of action of ccn matricellular
proteins. The International Journal of Biochemistry and Cell Biology, 41(4):771–
783, 2009.
46. P-C. Chen, S-Y. Huang, W.J. Chen, and C.K. Hsiao. A new regularized least
squares support vector regression for gene selection. BMC bioinformatics, 10(1):44,
2009.
47. G.A. Churchill. Fundamentals of experimental design for cdna microarrays. Nature
Genetics, 32:490–495, 2002.
48. M. Cohn, N.A. Mitchison, W.E. Paul, A.M. Silverstein, D.W. Talmage, and
M. Weigert. Reflections on the clonal-selection theory. Nature Reviews Immunology,
7(10):823–830, 2007.
49. M. Connors, J.A. Kovacs, S. Krevat, J.C. Gea-Banacloche, M.C. Sneller, M. Flanigan, J.A. Metcalf, R.E. Walker, J. Falloon, M. Baseler, R. Stevens, I. Feuerstein,
H. Masur, and H.C. Lane. Hiv infection induces changes in cd4+ t-cell phenotype
and depletions within the cd4+ t-cell repertoire that are not immediately restored
by antiviral or immune-based therapies. Nature Medicine, 3:533–540, 1997.
50. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.
51. The UniProt Consortium. Activities at the universal protein resource (uniprot).
Nucleic Acids Research, 42(D1):D191–D198, 2014.
52. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297,
1995.
53. M. Cristani, A. Perina, U. Castellani, and V. Murino. Geo-located image analysis
using latent representations. In Proc. of IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR), pages 1–8, 2008.
54. G. Csurka, C.R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization
with bags of keypoints. In Workshop on Statistical Learning in Computer Vision,
ECCV, pages 1–22, 2004.
55. O.G. Cula and K.J. Dana. Compact representation of bidirectional texture functions. In Proc. Int. Conference on Computer Vision and Pattern Recognition
(CVPR), volume 1, pages 1041–1047, 2001.
56. O. Dagliyan, F. Uney-Yuksektepe, I.H. Kavakli, and M. Turkay. Optimization based
tumor classification from microarray gene expression data. PLoS One, 6(2):e14579,
2011.
57. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection.
In Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR),
volume 2, pages 886–893, 2005.
58. G. D’Amico, E.A. Korhonen, A. Anisimov, G. Zarkada, T. Holopainen, R. Hagerling,
F. Kiefer, L. Eklund, R. Sormunen, H. Elamaa, R.A. Brekken, R.H. Adams, G.Y.
Koh, P. Saharinen, and K. Alitalo. Tie1 deletion inhibits tumor growth and improves
angiopoietin antagonist therapy. The Journal of Clinical Investigation, 124(2):824–
834, 2014.
59. M.M. Davis and P.J. Bjorkman. T-cell antigen receptor genes and t-cell recognition.
Nature, 334(6181):395–402, 1988.
60. J. José del Coz, J. Diez, and A. Bahamonde. Learning nondeterministic classifiers.
The Journal of Machine Learning Research, 10:2273–2293, 2009.
61. J.L. DeRisi, V.R. Iyer, and P.O. Brown. Exploring the metabolic and genetic control
of gene expression on a genomic scale. Science, 278(5338):680–686, 1997.
62. S.M. Dhanasekaran, T.R. Barrette, D. Ghosh, R. Shah, S. Varambally, K. Kurachi,
K.J. Pienta, M.A. Rubin, and A.M. Chinnaiyan. Delineation of prognostic biomarkers in prostate cancer. Nature, 412(6849):822–826, 2001.
63. C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene
expression data. Journal of bioinformatics and computational biology, 3(02):185–
205, 2005.
64. Q. Dong, L. Lin, and X. Wang. Protein remote homology detection based on binary
profiles. In Bioinformatics Research and Development, volume 4414 of Lecture Notes
in Computer Science, pages 212–223. 2007.
65. Q. Dong, X. Wang, and L. Lin. Application of latent semantic analysis to protein
remote homology detection. Bioinformatics, 22(3):285–290, 2006.
66. R.O. Duda, P.E. Hart, and D.G Stork. Pattern Classification (2nd Edition). Wiley
Interscience, 2001.
67. S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for
the classification of tumors using gene expression data. Journal of the American
statistical association, 97(457):77–87, 2002.
68. T. Dunning. Statistical identification of language, 1994.
69. P.J. Eccles. An introduction to mathematical reasoning. Cambridge University
Press, 1997.
70. J. Eisenstein, A. Ahmed, and E.P. Xing. Sparse additive generative models of text.
In ICML, 2011.
71. F.C. Ekmekcioglu, M.F. Lynch, A.M. Robertson, T.M.T. Sembok, and P. Willett.
Comparison of ngram matching and stemming for term conflation in english, malay,
and turkish texts. Text Technology, 6:1–14, 1996.
72. R.G. Fahmy, C.R. Dass, L-Q. Sun, C.N. Chesterman, and L.M. Khachigian. Transcription factor egr-1 supports fgf-dependent angiogenesis during neovascularization
and tumor growth. Nature Medicine, 9(8):1026–1032, 2003.
73. A. Farahat and F. Chen. Improving probabilistic latent semantic analysis with
principal component analysis. In EACL, 2006.
74. A. Farinelli, M. Denitto, and M. Bicego. Biclustering of expression microarray data using affinity propagation. In Pattern Recognition in Bioinformatics, LNCS, pages 13–24. 2011.
75. D. Filliat. A visual bag of words method for interactive qualitative localization and
mapping. In Proc. Int. Conference on Robotics and Automation (ICRA), 2007.
76. F. Finotello and B. Di Camillo. Measuring differential gene expression with rna-seq:
challenges and strategies for data analysis. Briefings in functional genomics, page
elu035, 2014.
77. E. Fischer. The influence of configuration on enzyme activity. Dtsch Chem Ges
(Translated from German), 27:2984–2993, 1894.
78. R.A. Fisher. Statistical methods for research workers. Number 5. Genesis Publishing
Pvt Ltd, 1936.
79. G. Fort and S. Lambert-Lacroix. Classification using partial least squares with
penalized logistic regression. Bioinformatics, 21(7):1104–1111, 2005.
80. N.K. Fox, S.E. Brenner, and J-M. Chandonia. Scope: Structural classification of
proteins - extended, integrating scop and astral data and classification of new structures. Nucleic Acids Research, 42(Database-Issue):304–309, 2014.
81. B. Frey and N. Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 27:1–25, 2005.
82. S. Frintrop, E. Rome, and H.I. Christensen. Computational visual attention systems
and their cognitive foundations: a survey. ACM Transactions on Appied Perceptions,
7(1):6:1–6:39, 2010.
83. T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples
using microarray expression data. Bioinformatics, 16(10):906–914, 2000.
84. M.E. Garber, O.G. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler, M. Pacyna-Gengelbach, M. van de Rijn, G.D. Rosen, C.M. Perou, and R.I. Whyte. Diversity of
gene expression in adenocarcinoma of the lung. Proceedings of the National Academy
of Sciences, 98(24):13784–13789, 2001.
85. L. Gerstein. Introduction to mathematical structures and proofs. Springer, 2012.
86. G. Getz, E. Levine, and E. Domany. Coupled two-way clustering analysis of gene
microarray data. Proceedings of the National Academy of Sciences, 97(22):12079–
12084, 2000.
87. T.R. Golub, D.K. Slonim, Pablo P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov,
H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C. Bloomfield, and E. Lander.
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
88. M. Gribskov and N.L. Robinson. Use of receiver operating characteristic (roc)
analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25–33,
1996.
89. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal
of Machine Learning Research, 3:1157–1182, 2003.
90. T. Handstad, A.J.H. Hestnes, and P. Saetrom. Motif kernel generated by genetic
programming improves remote homology and fold detection. BMC Bioinformatics,
8(1), 2007.
91. X. Hang. Cancer classification by sparse representation using microarray gene expression data. In IEEE Int. Conf. on Bioinformatics and Biomeidcine Workshops
(BIBMW), pages 174–177, 2008.
92. M.J. Heller. Dna microarray technology: devices, systems, and applications. Annual
review of biomedical engineering, 4(1):129–153, 2002.
93. T. Hertz, D. Nolan, I. James, M. John, S. Gaudieri, E. Phillips, J.C. Huang, G. Riadi, S. Mallal, and N. Jojic. Mapping the landscape of host-pathogen coevolution:
Hla class i binding and its relationship with evolutionary conservation in human
and viral proteins. Journal of virology, 85(3):1310–1321, 2011.
94. T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, 2001.
95. Y. Hou, W. Hsu, M-L. Lee, and C. Bystroff. Efficient remote homology detection
using local structure. Bioinformatics, 19(17):2294–2301, 2003.
96. J.C. Huang and N. Jojic. Variable selection through correlation sifting. In Research
in Computational Molecular Biology, pages 106–123, 2011.
97. T. Jaakkola, M. Diekhans, and D. Haussler. Using the fisher kernel method to
detect remote protein homologies. In ISMB, volume 99, pages 149–158, 1999.
98. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of computational biology, 7(1-2):95–114,
2000.
99. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. pages 487–493, 1999.
100. A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: A review. ACM Computer
Surveys, 31(3):264–323, 1999.
101. S. Jankelevich, B.U. Mueller, C.L. Mackall, S. Smith, S. Zwerski, L.V. Wood, S.L.
Zeichner, L. Serchuck, S.M. Steinberg, R.P. Nelson, et al. Long-term virologic and
immunologic responses in human immunodeficiency virus type 1-infected children
treated with indinavir, zidovudine, and lamivudine. Journal of Infectious Diseases,
183(7):1116–1120, 2001.
102. D. Jardine, L. Cornel, and M. Emond. Gene expression analysis characterizes antemortem stress and has implications for establishing cause of death. Physiological
genomics, 43(16):974–980, 2011.
103. Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In Proc. of Int. Conference on Computer Vision (ICCV), pages 2407–
2414, 2011.
104. N. Jojic and A. Perina. Multidimensional counting grids: Inferring word order from
disordered bags of words. In Uncertainty in Artificial Intelligence, pages 547–556,
2011.
105. N. Jojic, M. Reyes-Gomez, D. Heckerman, C. Kadie, and O. Schueler-Furman.
Learning mhc i-peptide binding. Bioinformatics, 22(14):e227–e235, 2006.
106. I.K. Jordan, L. Marino-Ramirez, and E. Koonin. Evolutionary significance of gene
expression divergence. Gene, 345(1):119–126, 2005.
107. J-I. Jun and L.F. Lau. Taking aim at the extracellular matrix: Ccn proteins as
emerging therapeutic targets. Nature Reviews Drug Discovery, 10(12):945–963,
2011.
108. K. Karplus, C. Barrett, and R. Hughey. Hidden markov models for detecting remote
protein homologies. Bioinformatics, 14(10):846–856, 1998.
109. J.C. Kendrew, G. Bodo, H.M. Dintzis, R.G. Parrish, H. Wyckoff, and D.C. Phillips.
A three-dimensional model of the myoglobin molecule obtained by x-ray analysis.
Nature, 181(4610):662–666, 1958.
110. G. Kerr, H.J. Ruskin, M. Crane, and P. Doolan. Techniques for clustering gene
expression data. Computers in biology and medicine, 38(3):283–293, 2008.
111. T.M. Khoshgoftaar, D.J. Dittman, R. Wald, and W. Awada. A review of ensemble
classification for dna microarrays data. In IEEE Int. Conference on Tools with
Artificial Intelligence (ICTAI), pages 381–389, 2013.
112. M. Khoshhali, A. Moslemi, M. Saidijam, J. Poorolajal, and H. Mahjub. Predicting
the categories of colon cancer using microarray data and nearest shrunken centroid.
Journal of Biostatistics and Epidemiology, 1(1), 2014.
113. P. Kiepiela, A.J. Leslie, I. Honeyborne, D. Ramduth, C. Thobakgale, S. Chetty,
P. Rathnavalu, C. Moore, K.J. Pfafferott, and L. Hilton. Dominant influence of hla-b in mediating the potential co-evolution of hiv and hla. Nature, 432(7018):769–775,
2004.
114. J.G. Kim, S.J. Lee, Y.S. Chae, B.W. Kang, Y.J. Lee, S.Y. Oh, M.C. Kim, K.H. Kim,
and S.J. Kim. Association between phosphorylated amp-activated protein kinase
and mapk3/1 expression and prognosis for patients with gastric cancer. Oncology,
85(2):78–85, 2013.
115. S. Kim, P. Georgiou, and S. Narayanan. Latent acoustic topic models for unstructured audio classification. APSIPA Tran. on Signal and Information Processing,
1:e6, 2012.
116. U-K. Kim, E. Jorgenson, H. Coon, M. Leppert, N. Risch, and D. Drayna. Positional cloning of the human quantitative trait locus underlying taste sensitivity to
phenylthiocarbamide. Science, 299(5610):1221–1225, 2003.
117. J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On combining classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
118. J.E. Krebs, B. Lewin, E.S. Goldstein, and S.T. Kilpatrick. Lewin’s essential genes.
Jones and Bartlett Publishers, 2013.
119. L.I. Kuncheva. A stability index for feature selection. In Proc. of Int. Conference
of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications (AIAP), pages 390–395, 2007.
120. J. Lasserre and C.M. Bishop. Generative or discriminative? getting the best of both
worlds. Bayesian Statistics, 8:3–24, 2007.
121. C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. De
Schaetzen, R. Duque, H. Bersini, and A. Nowe. A survey on filter techniques for
feature selection in gene expression microarray analysis. IEEE/ACM Transactions
on Computational Biology and Bioinformatics, 9(4):1106–1119, 2012.
122. J.W. Lee, J.B. Lee, M. Park, and S.H. Song. An extensive comparison of recent
classification tools applied to microarray data. Computational statistics and data
analysis, 48(4):869–885, 2005.
123. K. Lee and D.P.W. Ellis. Audio-based semantic concept classification for consumer
video. IEEE Tran. on Audio, Speech, and Language Processing, 18(6):1406–1416,
2010.
124. C.S. Leslie, E. Eskin, A. Cohen, J. Weston, and W.S. Noble. Mismatch string kernels
for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
125. C.S. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel for
svm protein classification. In Proc. of Pacific Symposium on Biocomputing (PSB),
pages 566–575, 2002.
126. B. Lewin and G. Dover. Genes V, volume 299. 1994.
127. X. Li and A. Godil. Investigating the bag-of-words method for 3d shape retrieval.
EURASIP Journal on Advances in Signal Processing, (1):108130, 2010.
128. L. Liao and W.S. Noble. Combining pairwise sequence similarity and support vector
machines for detecting remote protein evolutionary and structural relationships.
Journal of Computational Biology, 10(6):857–868, 2003.
129. M. Lienou, H. Maitre, and M. Datcu. Semantic annotation of satellite images using
latent dirichlet allocation. IEEE Geoscience and Remote Sensing Letters, 7(1):28–
32, 2010.
130. L. Lin, Y. Shen, B. Liu, and X. Wang. Protein fold recognition and remote homology
detection based on profile-level building blocks. In IEEE ICBECS, pages 1–5, 2010.
131. W-C. Lin, A.F.Y. Li, C-W. Chi, W-W. Chung, C.L. Huang, W-Y. Lui, H-J. Kung,
and C-W. Wu. tie-1 protein tyrosine kinase: A novel independent prognostic marker
for gastric cancer. Clinical Cancer Research, 5(7):1745–1751, 1999.
132. B. Liu, X. Wang, Q. Chen, Q. Dong, and X. Lan. Using amino acid physicochemical
distance transformation for fast protein remote homology detection. PLoS ONE,
7(9), 2012.
133. B. Liu, X. Wang, L. Lin, Q. Dong, and X. Wang. A discriminative method for protein
remote homology detection and fold recognition combining top-n-grams and latent
semantic analysis. BMC Bioinformatics, 9(1):510, 2008.
134. B. Liu, D. Zhang, R. Xu, J. Xu, X. Wang, Q. Chen, Q. Dong, and K-C. Chou. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics, 30(4):472–
479, 2014.
135. H. Liu, L. Liu, and H. Zhang. Ensemble gene selection by grouping for microarray
data classification. Journal of biomedical informatics, 43(1):81–87, 2010.
136. P. Lovato, M. Bicego, M. Cristani, N. Jojic, and A. Perina. Feature selection using
counting grids: application to microarray data. In Proc. Int. Workshop on Statistical
Techniques in Pattern Recognition (SPR2012), volume 7626 of LNCS, pages 629–
637, 2012.
137. P. Lovato, M. Bicego, M. Kesa, V. Murino, N. Jojic, and A. Perina. Traveling on
discrete embeddings of gene expression. Bioinformatics, 2015. submitted.
138. P. Lovato, M. Cristani, and M. Bicego. Soft ngram representation and modeling for
protein remote homology detection. IEEE/ACM Tran. on Computational Biology
and Bioinformatics, 2015. submitted.
139. P. Lovato, A. Giorgetti, and M. Bicego. A multimodal approach to protein remote
homology detection. http://f1000.com/posters/browse/summary/1097145, 2014.
140. P. Lovato, A. Giorgetti, and M. Bicego. A multimodal approach for protein remote
homology detection. IEEE/ACM Tran. on Computational Biology and Bioinformatics, 2015. in press.
141. D.G. Lowe. Object recognition from local scale-invariant features. In Proc. Int.
Conference on Computer Vision (ICCV), page 1150, 1999.
142. D. Lu, C.D. Wolfgang, and T. Hai. Activating transcription factor 3, a stressinducible gene, suppresses ras-stimulated tumorigenesis. Journal of Biological
Chemistry, 281(15):10473–10481, 2006.
143. S.C. Madeira and A.L. Oliveira. Biclustering algorithms for biological data analysis:
a survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on,
1(1):24–45, 2004.
144. M.Z. Man, G. Dyson, K. Johnson, and B. Liao. Evaluating methods for classifying
expression data. Journal of Biopharmaceutical statistics, 14(4):1065–1084, 2004.
145. C.D. Manning, P. Raghavan, and H. Schutze. Introduction to information retrieval,
volume 1. Cambridge university press Cambridge, 2008.
146. A. Marchiori, L. Capece, A. Giorgetti, P. Gasparini, M. Behrens, P. Carloni, and
W. Meyerhof. Coarse-grained/molecular mechanics of the tas2r38 bitter taste receptor: Experimentally-validated detailed structural prediction of agonist binding.
PLoS ONE, 8(5):e64675, 2013.
147. A. Martins, N.A. Smith, E.P. Xing, P.M.Q. Aguiar, and M.A.T. Figueiredo. Nonextensive information theoretic kernels on measures. The Journal of Machine Learning
Research, 10:935–975, 2009.
148. J.D. Mcauliffe and D.M. Blei. Supervised topic models. In Advances in neural
information processing systems, pages 121–128, 2008.
149. A.J. McMichael and S.L. Rowland-Jones. Cellular immune responses to
hiv. Nature, 410(6831):980–987, 2001.
150. L.M. Merino, J. Meng, S. Gordon, B.J. Lance, T. Johnson, V. Paul, K. Robbins,
J.M. Vettel, and Y. Huang. A bag-of-words model for task-load prediction from eeg
in complex environments. In ICASSP, pages 1227–1231, 2013.
151. A.J. Minn, G.P. Gupta, P.M. Siegel, P.D. Bos, W. Shu, D.D. Giri, A. Viale, A.B.
Olshen, W.L Gerald, and J. Massague. Genes that mediate breast cancer metastasis
to lung. Nature, 436(7050):518–524, 2005.
152. S. Moir, T-W. Chun, and A.S. Fauci. Pathogenic mechanisms of hiv disease. Annual
Review of Pathology: Mechanisms of Disease, 6:223–248, 2011.
153. V. Moncho-Amor, I. Ibanez de Caceres, E. Bandres, B. Martinez-Poveda, J.L.
Orgaz, I. Sanchez-Perez, S. Zazo, A. Rovira, J. Albanell, B. Jimenez, F. Rojo,
C. Belda-Iniesta, J. Garcia-Foncillas, and R. Perona. Dusp1/mkp1 promotes angiogenesis, invasion and metastasis in non-small-cell lung cancer. Oncogene, 30(6):668–
678, 2011.
154. C.B. Moore, M. John, I.R. James, F.T. Christiansen, C.S. Witt, and S.A. Mallal.
Evidence of hiv-1 adaptation to hla-restricted immune responses at a population
level. Science, 296(5572):1439–1443, 2002.
155. G. Mori, S. Belongie, and J. Malik. Shape contexts enable efficient retrieval of similar
shapes. In Proc. Int. Conference on Computer Vision and Pattern Recognition
(CVPR), volume 1, pages 723–730, 2001.
156. S.B. Needleman and C.D. Wunsch. A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Journal of molecular biology,
48(3):443–453, 1970.
157. H. Nielsen, J. Engelbrecht, S. Brunak, and G. von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein
engineering, 10(1):1–6, 1997.
158. K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell. Text classification from
labeled and unlabeled documents using em. Machine Learning, 39(2-3):103–134,
2000.
159. J. Nikolich-Zugich, M.K. Slifka, and I. Messaoudi. The many important facets of T-cell repertoire diversity. Nature Reviews Immunology, 4(2):123–132, 2004.
160. B. Niu, L. Fu, S. Sun, and W. Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics, 11(1):187, 2010.
161. C.L. Nutt, D.R. Mani, R.A. Betensky, P. Tamayo, J.G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M.E. McLaughlin, T.T. Batchelor, P. Black, A. von Deimling, S. Pomeroy, T. Golub, and D. Louis. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research, 63(7):1602–1607, 2003.
162. A. Osareh and B. Shadgar. Classification and diagnostic prediction of cancers using
gene microarray data analysis. Journal of Applied Sciences, 9(3):459–468, 2009.
163. J.P. Overington, B. Al-Lazikani, and A.L. Hopkins. How many drug targets are
there? Nature Reviews Drug Discovery, 5(12):993–996, 2006.
164. G. Paass, E. Leopold, M. Larson, J. Kindermann, and S. Eickeler. SVM classification using sequences of phonemes and syllables. In Principles of Data Mining and Knowledge Discovery, pages 373–384, 2002.
165. S. Pancoast and M. Akbacak. Bag-of-audio-words approach for multimedia event
classification. In INTERSPEECH, 2012.
166. H. Pearson. Genetics: what is a gene? Nature, 441(7092):398–401, 2006.
167. K. Pearson. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London, pages 343–414, 1895.
168. W.R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183:63–98, 1990.
169. W.R. Pearson. An introduction to sequence similarity (“homology”) searching. Current Protocols in Bioinformatics, page 3.1, 2013.
170. H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
171. A. Perina, M. Bicego, U. Castellani, and V. Murino. Exploiting geometry in counting
grids. In SIMBAD, pages 250–264, 2013.
172. A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic. A hybrid generative/discriminative classification framework based on free-energy terms. In Proc. of
Int. Conference on Computer Vision (ICCV), pages 2058–2065, 2009.
173. A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic. Free energy score
spaces: using generative information in discriminative classifiers. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 34(7):1249–1262, 2012.
174. A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic. Free energy score space. In Advances in Neural Information Processing Systems, 2009.
175. A. Perina and N. Jojic. Image analysis by counting on a grid. In Proc. of IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR), pages 1985–1992, 2011.
176. A. Perina, N. Jojic, M. Bicego, and A. Turski. Documents as multiple overlapping windows into grids of counts. In Advances in Neural Information Processing Systems (NIPS), pages 10–18, 2013.
177. A. Perina, M. Kesa, and M. Bicego. Expression microarray data classification using counting grids and Fisher kernel. In Proc. of Int. Conference on Pattern Recognition (ICPR), pages 1770–1775, 2014.
178. A. Perina, P. Lovato, M. Cristani, and M. Bicego. A comparison on score spaces for expression microarray data classification. In Proc. of Pattern Recognition in Bioinformatics (PRIB), pages 202–213, 2011.
179. A. Perina, P. Lovato, and N. Jojic. Bags of words models of epitope sets: HIV viral load regression with counting grids. In Proc. Int. Pacific Symposium on Biocomputing (PSB), pages 288–299, 2014.
180. A. Perina, P. Lovato, V. Murino, and M. Bicego. Biologically-aware latent Dirichlet allocation (BaLDA) for the classification of expression microarray. In Pattern Recognition in Bioinformatics (PRIB), LNCS, pages 230–241, 2010.
181. M. Polesani, L. Bortesi, A. Ferrarini, A. Zamboni, M. Fasoli, C. Zadra, A. Lovato, M. Pezzotti, M. Delledonne, and A. Polverari. General and species-specific transcriptional responses to downy mildew infection in a susceptible (Vitis vinifera) and a resistant (V. riparia) grapevine species. BMC Genomics, 11(1):117, 2010.
182. S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J. Allen, D. Zagzag, J. Olson, T. Curran, C. Wetmore, J. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. Louis, J. Mesirov, E. Lander, and T. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436–442, 2002.
183. A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.
184. B. Qian and R.A. Goldstein. Performance of an iterated T-HMM for homology detection. Bioinformatics, 20(14):2175–2180, 2004.
185. K.M. Quinn, B.L. Monroe, M. Colaresi, M.H. Crespin, and D.R. Radev. How to
analyze political attention with minimal assumptions and costs. American Journal
of Political Science, 54(1):209–228, 2010.
186. G. Ramsay. DNA chips: state-of-the-art. Nature Biotechnology, 16(1):40–44, 1998.
187. H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology
detection and fold recognition. Bioinformatics, 21(23):4239–4247, 2005.
188. N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G.R.G. Lanckriet, R. Levy,
and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proc.
of Int. Conference on Multimedia, pages 251–260, 2010.
189. D.M. Raup. Taxonomic diversity estimation using rarefaction. Paleobiology, pages
333–342, 1975.
190. S. Resino, J.M. Bellon, D. Gurbindo, J.A. Leon, and M. Muñoz-Fernández. Recovery of T-cell subsets after antiretroviral therapy in HIV-infected children. European Journal of Clinical Investigation, 33(7):619–627, 2003.
191. S. Resino, E. Seoane, A. Pérez, E. Ruiz-Mateos, M. Leal, and M. Muñoz-Fernández. Different profiles of immune reconstitution in children and adults with HIV infection after highly active antiretroviral therapy. BMC Infectious Diseases, 6(1):112, 2006.
192. F. Revillion, V. Pawlowski, L. Hornez, and J.P. Peyrat. Glyceraldehyde-3-phosphate
dehydrogenase gene expression in human breast cancer. European Journal of Cancer, 36(8):1038–1042, 2000.
193. M.J. Rodriguez-Colman, G. Reverter-Branchat, M.A. Sorolla, J. Tamarit, J. Ros, and E. Cabiscol. The forkhead transcription factor Hcm1 promotes mitochondrial biogenesis and stress resistance in yeast. Journal of Biological Chemistry, 285(47):37092–37101, 2010.
194. S. Rogers, M. Girolami, C. Campbell, and R. Breitling. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(2):143–156, 2005.
195. M. Ronaghi, M. Uhlén, and P. Nyrén. A sequencing method based on real-time
pyrophosphate. Science, 281(5375):363–365, 1998.
196. D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, C. Rees, P. Spellman, V. Iyer, S.S. Jeffrey, M. Van de Rijn, M. Waltham, A. Pergamenschikov, J. Lee, D. Lashkari, D. Shalon, T. Myers, J. Weinstein, D. Botstein, and P. Brown. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24(3):227–235, 2000.
197. T. Rossignol, L. Dulau, A. Julien, and B. Blondin. Genome-wide monitoring of
wine yeast gene expression during alcoholic fermentation. Yeast, 20(16):1369–1385,
2003.
198. D.E. Sabatino, F. Mingozzi, D.J. Hui, H. Chen, P. Colosi, H.C.J. Ertl, and K.A. High. Identification of mouse AAV capsid-specific CD8+ T cell epitopes. Molecular Therapy, 12(6):1023–1033, 2005.
199. Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in
bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
200. M. Sahlgren and R. Cöster. Using bag-of-concepts to improve the performance of
support vector machines in text categorization. In Proceedings of the 20th International Conference on Computational Linguistics, page 487, 2004.
201. H. Saigo, J-P. Vert, T. Akutsu, and N. Ueda. Comparison of SVM-based methods for remote homology detection. Genome Informatics, 13:396–397, 2002.
202. H. Saigo, J-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using
string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
203. G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986.
204. G. Sandve and F. Drablos. A survey of motif discovery methods in an integrated
framework. Biology Direct, 1(1):11, 2006.
205. G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
206. M.M.K. Shahzad, J.M. Arevalo, G.N. Armaiz-Pena, C. Lu, R.L. Stone, M. Moreno-Smith, M. Nishimura, J-W. Lee, N.B. Jennings, J. Bottsford-Miller, P. Vivas-Mejia, S.K. Lutgendorf, G. Lopez-Berestein, M. Bar-Eli, S.W. Cole, and A.K. Sood. Stress effects on FosB- and interleukin-8 (IL8)-driven ovarian cancer growth and metastasis. Journal of Biological Chemistry, 285(46):35462–35470, 2010.
207. J. Shankar, A. Messenberg, J. Chan, T.M. Underhill, L.J. Foster, and I.R.
Nabi. Pseudopodial actin dynamics control epithelial-mesenchymal transition in
metastatic cancer cells. Cancer Research, 70(9):3780–3790, 2010.
208. D. Shibata. Clonal diversity in tumor progression. Nature Genetics, 38(4):402–403, 2006.
209. S. Shivashankar, S. Srivathsan, B. Ravindran, and A.V. Tendulkar. Multi-view methods for protein structure comparison using latent Dirichlet allocation. Bioinformatics, 27(13):161–168, 2011.
210. D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo,
A.R. Renshaw, A.V. D’Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff,
T.R. Golub, and W.R. Sellers. Gene expression correlates of clinical prostate cancer
behavior. Cancer cell, 1(2):203–209, 2002.
211. R. Singh, B. Raj, and P. Smaragdis. Latent-variable decomposition based dereverberation of monaural and multi-channel signals. In Int. Conf. on Acoustics Speech
and Signal Processing (ICASSP), pages 1914–1917, 2010.
212. J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proc. Int. Conference on Computer Vision (ICCV), volume 2, pages 1470–1477, 2003.
213. P. Smaragdis, B. Raj, and M. Shashanka. Missing data imputation for time-frequency representations of audio signals. Journal of Signal Processing Systems, 65(3):361–370, 2011.
214. P. Smaragdis, M. Shashanka, and B. Raj. A sparse non-parametric approach for single channel separation of known sounds. In Advances in Neural Information Processing Systems, pages 1705–1713, 2009.
215. P. Smaragdis, M. Shashanka, and B. Raj. Topic models for audio mixture analysis.
In NIPS Workshop on Applications for Topic Models: Text and Beyond, pages 1–4,
2009.
216. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
217. T.G. Smolinski, R. Buchanan, G.M. Boratyn, M. Milanova, and A.A. Prinz. Independent component analysis-motivated approach to classificatory decomposition of cortical evoked potentials. BMC Bioinformatics, 7(Suppl. 2):S8, 2006.
218. M. De Souto, I. Costa, D. Araujo, T. Ludermir, and A. Schliep. Clustering cancer
gene expression data: a comparative study. BMC Bioinformatics, 9(1):497, 2008.
219. A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, 2005.
220. J.E. Staunton, D.K. Slonim, H.A. Coller, P. Tamayo, M.J. Angelo, J. Park,
U. Scherf, J.K. Lee, W.O. Reinhold, J.N. Weinstein, J. Mesirov, E. Lander, and
T. Golub. Chemosensitivity prediction by transcriptional profiling. Proceedings of
the National Academy of Sciences, 98(19):10787–10792, 2001.
221. D. Stekel. Microarray bioinformatics. Cambridge University Press, 2003.
222. A.I. Su, J.B. Welsh, L.M. Sapinoso, S.G. Kern, P. Dimitrov, H. Lapp, P.G. Schultz, S.M. Powell, C.A. Moskaluk, H.F. Frierson, and G. Hampton. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Research, 61(20):7388–7393, 2001.
223. K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution, 28(10):2731–2739, 2011.
224. A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. Handbook
of computational molecular biology, 9(1-20):122–124, 2005.
225. J.D. Thompson, D.G. Higgins, and T.J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673–4680, 1994.
226. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.
227. K. Tokunaga, Y. Nakamura, K. Sakata, K. Fujimori, M. Ohkubo, K. Sawada, and S. Sakiyama. Enhanced expression of a glyceraldehyde-3-phosphate dehydrogenase gene in human lung cancers. Cancer Research, 47(21):5616–5619, 1987.
228. S. Troup, C. Njue, E.V. Kliewer, M. Parisien, C. Roskelley, S. Chakravarti, P.J. Roughley, L.C. Murphy, and P.H. Watson. Reduced expression of the small leucine-rich proteoglycans, lumican, and decorin is associated with poor outcome in node-negative invasive breast cancer. Clinical Cancer Research, 9(1):207–214, 2003.
229. P. Valiant and G. Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems
(NIPS), pages 2157–2165, 2013.
230. M. Varma and A. Zisserman. Classifying images of materials: achieving viewpoint
and illumination independence. In Proc. European Conference on Computer Vision
(ECCV), volume 3, pages 255–271, 2002.
231. S. Vinga and J. Almeida. Alignment-free sequence comparison – a review. Bioinformatics, 19(4):513–523, 2003.
232. C. Wang, D. Blei, and F-F. Li. Simultaneous image classification and annotation. In
Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages
1903–1910, 2009.
233. L. Wang, J. Zhu, and H. Zou. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics, 24(3):412–419, 2008.
234. X. Wang and O. Gotoh. A robust gene selection method for microarray-based cancer
classification. Cancer informatics, 9:15, 2010.
235. Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
236. R.L. Warren, J.D. Freeman, T. Zeng, G. Choe, S. Munro, R. Moore, J.R. Webb, and R.A. Holt. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome Research, 21(5):790–797, 2011.
237. S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley, 1985.
238. S. Whelan and N. Goldman. A general empirical model of protein evolution derived
from multiple protein families using a maximum-likelihood approach. Molecular
biology and evolution, 18(5):691–699, 2001.
239. World Health Organization (WHO). Number of deaths due to HIV/AIDS, 2013. http://www.who.int/gho/hiv/epidemicstatus/deaths/en/.
240. B. Wielockx, C. Libert, and C. Wilson. Matrilysin (matrix metalloproteinase-7): a new promising drug target in cancer and inflammation? Cytokine and Growth Factor Reviews, 15(2-3):111–115, 2004.
241. C.F.J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.
242. M. Xu, L-Y. Duan, J. Cai, L-T. Chia, C. Xu, and Q. Tian. HMM-based audio keyword generation. In Advances in Multimedia Information Processing, pages 566–574, 2005.
243. S.H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha. Like like alike:
joint friendship and interest propagation in social networks. In Proc. of the 20th
International Conference on World Wide Web (WWW), WWW ’11, pages 537–546,
2011.
244. Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution, 39(3):306–314, 1994.
245. L. Yu, Y. Han, and M.E. Berens. Stable gene selection from microarray data via sample weighting. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 9:262–272, 2012.
246. N. Yukinawa, S. Oba, K. Kato, and S. Ishii. Optimal aggregation of binary classifiers for multiclass cancer diagnosis using gene expression profiles. IEEE/ACM
Transactions on Computational Biology and Bioinformatics, 6(2):333–343, 2009.
247. G.L. Zhang, H. Rahman Ansari, P. Bradley, G.C. Cawley, T. Hertz, X. Hu, N. Jojic, Y. Kim, O. Kohlbacher, O. Lund, C. Lundegaard, C.A. Magaret, M. Nielsen, H. Papadopoulos, G.P.S. Raghava, V-S. Tal, L.C. Xue, C. Yanover, S. Zhu, M.T. Rock, J.E. Crowe Jr., C. Panayiotou, M.M. Polycarpou, W. Duch, and V. Brusic. Machine learning competition in immunology – prediction of HLA class I binding peptides. Journal of Immunological Methods, 374(1-2):1–4, 2011.
248. H. Zhang, C-Y. Yu, B. Singer, and M. Xiong. Recursive partitioning for tumor
classification with gene expression microarray data. Proceedings of the National
Academy of Sciences, 98(12):6730–6735, 2001.
249. Y-J. Zhang, H. Li, H-C. Wu, J. Shen, L. Wang, M-W. Yu, P-H. Lee, I.B. Weinstein, and R.M. Santella. Silencing of HINT1, a novel tumor suppressor gene, by promoter hypermethylation in hepatocellular carcinoma. Cancer Letters, 275(2):277–284, 2009.
250. S. Zhu, D. Wang, K. Yu, T. Li, and Y. Gong. Feature selection for gene expression
using model-based entropy. IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 7(1):25–36, 2010.
Sommario (Summary)
Many statistical Pattern Recognition problems have been tackled in the recent literature through the “bag of words” representation, a representation that is particularly appropriate when simple “constituting” elements can be identified within the objects of the problem. With the bag of words representation, objects are characterized by a vector in which each element counts the number of occurrences of one constituent in the object.
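As a purely illustrative sketch (not code from this thesis), the following Python fragment builds such a count vector for a toy example; the dictionary entries, the token list, and the helper name bag_of_words are all hypothetical:

    from collections import Counter

    def bag_of_words(tokens, dictionary):
        # Count the occurrences of each token, then read the counts
        # back in the fixed order given by the dictionary.
        counts = Counter(tokens)
        return [counts[word] for word in dictionary]

    # Toy example: the "words" could be k-mers of a biological sequence,
    # quantized local image features, or terms of a text document.
    dictionary = ["AC", "CG", "GT", "TA"]
    tokens = ["AC", "CG", "AC", "GT", "AC", "CG"]
    print(bag_of_words(tokens, dictionary))  # prints [3, 2, 1, 0]

Objects are thus compared through fixed-length count vectors, regardless of the order in which their constituents appear.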
Despite the great success obtained in several fields of scientific research, techniques and models based on this representation have not yet been fully exploited in Bioinformatics, owing to the methodological and applicative challenges posed by this specific discipline. Nonetheless, in this context the bag of words representation seems particularly appropriate: on one hand, many bioinformatics problems are inherently posed in terms of counting mechanisms; on the other hand, in many biological scenarios the structure of the objects involved is absent or unknown, so that one of the main drawbacks of the bag of words representation (the fact that it does not model such structure) no longer applies.
This thesis fits into the context just presented, and promotes the use of the bag of words representation to characterize objects and problems in Bioinformatics and Computational Biology. The thesis investigates the issues involved in creating bag of words representations and models for specific problems, and proposes possible solutions and approaches. In detail, three specific bioinformatics problems have been identified and analyzed: gene expression analysis, the modeling of HIV infection, and protein remote homology detection. For each scenario, the motivations, advantages, and challenges posed by the use of bag of words representations and models have been analyzed, and several solutions have been proposed. The merits of the proposed approaches have been demonstrated through extensive experimental validations, exploiting both benchmarks widely used in the literature and data derived from direct interaction with clinical/biological laboratories and research groups. The conclusion reached indicates that approaches based on the bag of words representation can have a decisive impact on the Bioinformatics and Computational Biology communities.