Bag of Words approaches for Bioinformatics
Pietro Lovato
Bag of Words approaches for Bioinformatics
Ph.D. Thesis, XXVII cycle (January 2012 - December 2014)
Università degli Studi di Verona, Dipartimento di Informatica
Advisor: dr. Manuele Bicego
Series N°: TD-03-15

Università di Verona, Dipartimento di Informatica, Strada le Grazie 15, 37134 Verona, Italy

Abstract

In recent years, several Pattern Recognition problems have been successfully faced by approaches based on the “bag of words” representation. This representation is particularly appropriate when the pattern is characterized (or assumed to be characterized) by the repetition of basic, “constituting” elements called words. By assuming that all possible words are stored in a dictionary, the bag of words vector for one particular object is obtained by counting the number of times each element of the dictionary occurs in the object. Even if largely applied in several scientific fields (with increasingly sophisticated approaches), techniques based on this representation have not been fully exploited in Bioinformatics, due to the methodological and applicative challenges posed by this peculiar scenario. However, in this context the bag of words paradigm seems to be particularly suited: on one hand, many biological mechanisms inherently subsume a counting process; on the other hand, in many Bioinformatics scenarios the objects of the problem are either unstructured or of unknown structure, so that one of the main drawbacks of the bag of words representation (it destroys the object’s structure) does not hold anymore. This makes it possible to derive highly effective and interpretable solutions, a stringent need in today’s Bioinformatics research.

This thesis is set in the scenario described above, and promotes the use of the bag of words paradigm to face problems in Bioinformatics. We investigated the different issues and aspects related to the creation of bag of words models and representations for some specific Bioinformatics problems, and we propose original solutions and approaches based on this representation. In particular, three scenarios have been analyzed in this thesis: gene expression analysis, the modeling of HIV infection, and protein remote homology detection. For each scenario, motivations, advantages, and challenges of the bag of words representations are addressed, proposing possible solutions. The merits of bag of words representations and models have been demonstrated in extensive experimental evaluations, exploiting widely used benchmarks as well as datasets derived from direct interactions with biological and clinical laboratories and research groups. With this thesis, we provided evidence that the bag of words representation can have a significant impact on the Bioinformatics and Computational Biology communities.

Contents

1 Introduction
  1.1 Contributions
  1.2 Organization of the thesis
  1.3 Publications
2 The Bag of Words paradigm
  2.1 What to count
    2.1.1 Easy-to-define words
    2.1.2 Difficult-to-define words
  2.2 How to count
  2.3 How to model
    2.3.1 The bag of words as a multinomial
    2.3.2 Probabilistic models
    2.3.3 Bayesian networks
    2.3.4 Inference and learning in bayesian networks

Part I Gene expression analysis

3 The gene expression analysis problem
  3.1 Background: gene expression
    3.1.1 DNA Microarray
  3.2 Computational analysis of a gene expression matrix
  3.3 Contributions
4 Gene expression classification using topic models
  4.1 Topic models and gene expression
    4.1.1 Probabilistic Latent Semantic Analysis (PLSA)
  4.2 The proposed approach
  4.3 Experimental evaluation
    4.3.1 Discussion
  4.4 The interpretability of the feature vector
5 The Counting Grid model for gene expression data analysis
  5.1 The Counting Grid model
  5.2 Class embedding and biomarker identification
  5.3 Example: mining yeast expression
  5.4 Experimental evaluation
    5.4.1 Embedding and clustering performances
    5.4.2 Qualitative evaluation of gene selection
    5.4.3 Quantitative evaluation of gene selection
    5.4.4 Classification results

Part II HIV modeling

6 Introduction
  6.1 Background
  6.2 Contributions
7 Regression of HIV viral load using bag of words
  7.1 State of the art
  7.2 The proposed approach
    7.2.1 What to count
    7.2.2 How to count
    7.2.3 How to model
    7.2.4 Information extracted: regression of viral load value
  7.3 Experiments
    7.3.1 Experiment 1: modeling antigen presentation with the counting grid
    7.3.2 Experiment 2: comparison with the state of the art
8 Bag of words analysis for T-Cell Receptors
  8.1 The proposed approach
    8.1.1 Diversity measures
    8.1.2 Reliability of the bag of words
  8.2 Experimental results
    8.2.1 Dataset statistics
    8.2.2 Shannon index analysis
    8.2.3 Nuanced patterns in the bag of words
    8.2.4 Rarefaction curves
    8.2.5 Total number of species estimation
    8.2.6 Reliability of the bag of words

Part III Protein remote homology detection

9 Introduction
  9.1 Background: protein functions and homology
  9.2 Computational protein remote homology detection
  9.3 Contributions
10 Soft Ngram representation and modeling for protein remote homology detection
  10.1 Profile-based Ngram representation
  10.2 The proposed approach
    10.2.1 Modeling: soft bag of words
    10.2.2 Soft PLSA
    10.2.3 SVM classification
  10.3 Experimental evaluation
    10.3.1 Experimental details
    10.3.2 Detection results and discussion
11 A multimodal approach for protein remote homology detection
  11.1 Materials and methods
    11.1.1 Componential Counting Grid
  11.2 The proposed approach
    11.2.1 Data representation
    11.2.2 Multimodal learning
    11.2.3 Classification scheme
  11.3 Experimental evaluation
    11.3.1 First analysis: families 3.42.1.1 and 3.42.1.5
    11.3.2 Second analysis: all families
  11.4 Multimodal analysis of bitter taste receptor TAS2R38
12 Conclusions and future works

References

Sommario

1 Introduction

Humans have developed highly sophisticated skills to sense the environment and to take actions according to what they observe. Daily routines involve recognizing faces, understanding spoken words, reading handwritten characters, and so on. However, complex processes underlie these acts of pattern recognition. Following a classical definition by Duda [66]: “pattern recognition is the act of taking in raw data and taking an action based on the ‘category’ of the pattern”. The notion of pattern is extremely general, and is defined by Watanabe [237] as “an entity, vaguely defined, that could be given a name”. A fingerprint image, a human face, a DNA sequence, or a text document are just examples of patterns; they are the objects, or instances, of the problem under consideration. Over the years, researchers have asked whether similar capabilities could be given to machines: from automated speech recognition, fingerprint identification, and optical character recognition to DNA sequence manipulation and much more, it is clear that reliable, accurate pattern recognition by machines would be immensely useful.

Automatic pattern recognition entails that these real-world objects are acquired and abstracted, through sensors and measurements, into a digital representation that a machine is able to understand. From an initial set of raw data one can derive features, i.e. characteristics intrinsic to the object itself, intended to be informative and non-redundant, to facilitate the decision-making model, and possibly to lead to better human interpretation. In the end, the individual objects are described by the set of values assigned to the different features, encoded in a mathematical entity such as a vector, a string, a tree, a graph, or others. Given the representation of objects, the pattern recognition strategy typically exploits the so-called “learning from examples” paradigm, in which a large set of N objects or instances of the problem – called a training set – is acquired and represented, and used to learn the parameters of a model or a classifier. Once the model is trained, it can determine the category of new objects, which are said to comprise a testing set. The ability to correctly categorize new examples that differ from those used for training is known as generalization capability.
In a simple example, depicted in figure 1.1, the objects to be represented and modeled are text documents, where the goal can be, for example, to categorize them into literary genres. Suppose that, after the acquisition phase, the raw data for a document consists of a sequence of ASCII characters. This already represents a possible representation for documents, where features are individual characters listed in a structure whose length varies according to the number of characters in the document.

Fig. 1.1. In the example, a document is acquired and digitally encoded as a list of ASCII characters (top part of the figure). Another representation for the document, in the bottom, consists of a numerical vector that is obtained by measuring the document length and the average length of its words.

Beyond this, numerical features can be extracted from the raw data: for example, one can compute the document length and the average length of the words comprising it, representing the document with two numerical values stored in a fixed-length vector. In other words, a document is represented as a point in a vector space (2-dimensional in this case), also called feature space. Even if there may be an information loss during this process of projecting objects into a feature space, such an approach is perhaps the most employed in pattern recognition. Its strength is that it can leverage a wide spectrum of mathematical tools ranging from statistics, to geometry, to optimization. Of course the choice and the combination of features is crucial, and many different ways to characterize an object exist and have been proposed in the past. In particular, an effective one called the “bag of words” [203] has asserted itself and assumed great importance in recent years.

The bag of words is a representation particularly appropriate when the pattern is characterized (or assumed to be characterized) by the repetition of basic constituting elements called words¹. If we assume that all possible words are stored in a dictionary, the bag of words vector for one particular object is obtained by counting the number of times each element of the dictionary occurs in the object. Looking back at the document categorization example, if we are given a dictionary that lists all possible words, the document can be characterized with a vector where each element counts the number of occurrences – in the document – of each given word in the dictionary. A scheme of the bag of words representation for documents is shown in figure 1.2. The bag of words has been very successful in the literature: a non-exhaustive list of fields where it has been effectively applied includes Natural Language Processing, where it was originally introduced [70, 200, 203]; Signal Processing, where it has been employed to model audio signals [165, 215], biochemical signals (such as NMR spectra) [37], or medical signals (such as EEG) [150]; robotics, where it is mainly employed for fast robot localization [75].
Also, in the fields of Image Processing and Computer Vision, it has been proposed to characterize textures [55, 230], contours [155], images [54, 57, 141], 3D shapes [127], and videos [212].

¹ The terminology stemmed from the Natural Language Processing community [203], where it is assumed that the constituting elements of a document are words.

Fig. 1.2. The bag of words representation. Given a document to represent, and a dictionary containing all possible words, the bag of words vector is obtained by counting how many times each word of the dictionary appears in the document.

One of its main advantages is that it can represent in a vector space many types of objects, even ones that are non-vectorial in nature (like documents in the example of Fig. 1.2), for which fewer computational tools are available. The obtained results motivated researchers to delve deeper into this paradigm, by proposing methods to refine the information contained in a bag of words, to better interpret it, and to boost performances in classification tasks. For example, by taking into explicit consideration that the values observed in a bag of words arise from a counting process, probabilistic models have been successfully developed and exploited. The relation is that counts are consistently explained through the multinomial distribution. Many probabilistic models have been proposed in the literature for bag of words: among others, there is a class of models whose importance has drastically grown in recent years, called topic models [29–31, 94, 104, 148, 176]. Topic models were originally introduced in the Natural Language Processing community [31, 94], with the main goal of describing documents (represented as bags of words) by abstracting the topics the various documents are speaking about. Their wide usage is motivated by their expressiveness and efficiency, by the interpretability of the solution provided [44], and by state of the art results achieved in several applications [12, 33, 53, 129, 185, 188, 215, 232, 243].

Facing a task with a bag of words approach requires some issues to be addressed, which can be summarized as follows:

• What to count? The definition and the extraction of the “building blocks”, i.e. the entities to be counted, is crucial for any bag of words method. Back to the document example, depending on the task it could be more appropriate to define syllables, rather than words, as building blocks. In Computer Vision, the dictionary elements of a bag of words are usually saliency points in images: the definition of saliency is not trivial, and its extraction usually goes through a complex processing scheme, whose pros and cons must be addressed carefully. In particular, since the dictionary must be “discrete”, it is sometimes required to quantize the continuous signals. In practice, the definition of the words requires experimenting with multiple possibilities, and the combination of automated techniques with the intuition and knowledge of the domain expert.
• How to count? Counting implies measuring the level of presence of an element (the “word”) in an object.
The counting process can be easy and straightforward, or it can be more difficult. Counting words in a document seems like a natural and easy task to perform; however, quantifying the number of molecules in a cell may be challenging due to technological problems. Moreover, counts derive from a measurement process. Thus, they may be affected by noise or prone to systematic errors, deriving from a technological problem (as in the above example) or from an inappropriate processing of the raw data. This in turn means that counts are affected by uncertainty, which has rarely been taken into account in bag of words approaches. Finally, it should be noted that a proper count should in principle be a discrete number, but – generally speaking – it can also be any real number that reflects a level of presence / importance / power.
• How to model? The bag of words representation can be modeled further, and depending on the task at hand, different models can be employed. As is often the case, a more complex model may offer a richer description or higher performances, at the cost of an increased complexity. The literature on the subject is huge, including fast and straightforward solutions, as well as complicated models which take into consideration many facets and aspects.

It is very important to notice that one of the main drawbacks of the bag of words representation is that – in many domains and applications – it destroys the possible structure of objects. The term “bag” is used because (in the text domain where it was originally introduced) the ordering of the words in the document is lost: the sentences “the man killed the wolf” and “the wolf killed the man” will result in the same bag of words vector despite the huge difference in semantics. Nevertheless, it may still be very convenient to employ the bag of words, as it readily extracts a numerical vector which may facilitate the subsequent steps and achieve high accuracy. This representation is completely suited to those scenarios where the object is either unstructured or of unknown structure: in this case there is no information loss, and the bag of words can bring out all its potentialities.

In this general picture, the rapidly developing field of bioinformatics is increasingly providing scenarios where the bag of words representation seems to be a very promising possibility, for a twofold reason: on one hand, the bag of words seems a natural choice since many problems are intrinsically formulated in terms of counting; on the other hand, in different contexts the structure is truly absent or unknown, removing one of the main drawbacks of the bag of words representation. For example, the cell regulates its function by adjusting the amount (i.e. the level of presence) of particular molecules – proteins – it manufactures. This regulation phenomenon, called gene expression, is the process by which the information encoded in a gene is used to direct the assembly of proteins. The most important aspect is the amount (count) of these molecules, which primarily determines the function of the cell. Moreover, there is no obvious ordering of the genes: the biological machine inherently works ignoring the spatial position of the molecules, and gene expression is carried out by simply looking at the types and concentrations of substances. For these reasons, the bag of words seems to be a very suitable choice of representation for the gene expression domain.
However, there are methodological and applicative challenges – derived from the peculiar scenario – that have to be addressed to completely exploit the bag of words approach. For example, it is not straightforward to map gene expression into a count value: the most widely used measuring technology, called DNA Microarray [47], essentially measures fluorescence emitted by the gene products². In another example, the immune system gathers evidence of the execution of various molecular processes, both foreign and the cells’ own, because particular receptors (called TCRs) observe sets of epitopes, small segments of the proteins involved in these processes. Epitopes do not have any obvious ordering in this scheme: the immune system, through TCRs, sees these epitope sets as disordered “bags”, based on whose counts the action needs to be taken. In this context, the bag of words would provide a set of tools for capturing correlations in the immune target abundances during cellular immune surveillance, and could be immensely useful for detecting patients or populations that are likely to react similarly to an infection, or for rational vaccine design.

² Emerging technologies such as RNA-seq [235] are increasingly providing a way to directly observe and count such molecules, although they are not as widespread as microarrays.

From the modeling point of view, it seems that in the examples portrayed probabilistic models and topic models (based on the bag of words) represent particularly suited choices. Aside from the state of the art performances obtained in several other scenarios, probability theory provides a consistent framework for the quantification and manipulation of noise and uncertainty, being able to explain the process of data generation, to increase the predictive accuracy, and to provide a more interpretable description. Particularly in biology and medicine, interpretability is a key requirement and a stringent need: very often, the final goal is helping the biological expert to gain a deeper understanding of the phenomenon under investigation.

1.1 Contributions

This thesis is set in the framework described above, and is aimed at investigating and promoting the applicability of bag of words representations and models in the wide field of bioinformatics. The first contribution of this thesis is therefore the identification of some bioinformatics scenarios which can be faced from a bag of words perspective. For each scenario, motivations, advantages, and challenges of the bag of words representations are addressed, proposing possible solutions. From a methodological point of view, this thesis contributes in two ways: i) bag of words approaches have been exported from other contexts and tailored to the specific bioinformatics scenario; ii) novel bag of words representations and models have been derived. From a more applicative perspective, the derived representations and models have been extensively tested, contributing to push forward the state of the art. Primary importance has been given to the interpretability of the approaches’ results, in an effort to provide a biologist or a clinician with tools that permit gaining relevant insights into the phenomenon under consideration. More in detail, three applicative contexts have been analyzed:

• Gene expression analysis: In this context, the first original contribution was in recognizing that a vector of gene expressions can be considered as a bag of words.
Given that, we investigated the capabilities of topic models for the classification task, reaching state of the art results on many datasets; considerations on the interpretability of the obtained representations are provided, with the use of a real dataset involving different species of grapevine (resulting from a collaboration with the Functional Genomics Lab at the University of Verona). Finally, we show the suitability of more recent models to mine knowledge from gene expression data (beyond classification): our approach makes it possible to visualize a gene expression dataset by embedding biological samples in a 2D map, and to derive a principled, well-founded method to highlight the most discriminative genes involved in a pathology.
• HIV modeling: This thesis contributed to HIV and immune system modeling, by promoting the usage of the bag of words representation and models for epitope sets, as well as by studying the variation of TCR counts upon infection. In fact, upon HIV infection, two phenomena co-occur: i) the patient’s bag of epitopes changes, since new fragments of the virus are presented for immune surveillance; ii) the patient’s bag of TCRs changes, since HIV/AIDS implies a progressive failure of the immune system, resulting in a drastic decrease of TCR levels. In the first case, a bag of words representation has been derived and modeled with the final goal of regressing the viral load value (an estimate of the patient’s HIV status). In the second case, the quality of TCR counts, extracted via 454 pyrosequencing from different HIV patients, has been assessed. Using the proposed approach, realized in collaboration with the David Geffen School of Medicine (UCLA), we were able to propose a reliable estimate of the bag of words (which is heavily prone to noise and sequencing errors) and to statistically validate clinical hypotheses.
• Protein remote homology detection: Finally, this thesis addressed the protein remote homology detection problem, a crucial task in bioinformatics where the goal is to determine if two proteins have a similar biological function even when their sequence similarity is low. In this context, the bag of words approach has already been investigated in the literature, and has proved to be successful: by positing an analogy with the document scenario, biological “words” have been extracted from a protein sequence (for example using Ngrams, namely short contiguous subsequences of N symbols). This thesis contributed in two different directions. The first one is aimed at integrating evolutionary information into the bag of words representation, equipping each word/Ngram with a weight that encodes its conservation across evolution. A novel bag of words approach, called soft bag of words, has been devised, together with a novel probabilistic model able to handle the presence of a weight associated with each word. The second research direction is aimed at properly integrating into existing models partial information derived from other sources. In particular, there is a source of information which is typically disregarded by classical approaches: the available experimentally-solved, possibly few, 3D structures of proteins. In this thesis a multimodal approach for protein remote homology detection has been proposed, validating it using standard benchmarks, as well as employing a real dataset involving the superfamily of GPCR proteins (in collaboration with the Applied Bioinformatics Group at the University of Verona).
1.2 Organization of the thesis

This thesis is divided into an introductory chapter and three main parts. The first chapter formally presents the bag of words paradigm, and introduces the notation and formalism employed in the subsequent chapters. The three main parts describe the proposed approaches in the three bioinformatics scenarios, namely gene expression analysis, HIV modeling, and protein remote homology detection. In the gene expression part, Chap. 3 introduces and describes the problem, and summarizes the recent literature. Then, Chap. 4 describes how to employ bag of words and topic models for gene expression classification, also discussing the interpretability of the method. Finally, Chap. 5 deals with the usage of a more recent and sophisticated model for gene expression, presenting the methodological and applicative contributions achieved. In the HIV modeling part, Chap. 6 introduces and describes the problem, as well as the state of the art. Then, Chap. 7 discusses the proposed approach for HIV viral load regression, whereas Chap. 8 describes the detailed analysis performed on the TCR bags of different HIV patients. In the protein remote homology detection part, Chap. 9 introduces the problem and surveys the state of the art, while the subsequent chapters detail the lines of research investigated: Chap. 10 presents the novel soft bag of words approach, whereas Chap. 11 is concerned with the study of a multimodal approach that integrates structural information to ease the detection task. Finally, in Chap. 12 conclusions are drawn and future perspectives are envisaged.

1.3 Publications

Some parts of this thesis have been published in conference proceedings or in international journals. In the context of gene expression analysis, Chap. 4 has been published in [24]; a preliminary study of the ideas presented in Chap. 5 has been published in [136], whereas the comprehensive approach has been submitted to a journal [137]. In the context of HIV modeling, Chap. 7 has been published in [179], whereas Chap. 8 is still under consideration for publication. In the context of protein remote homology detection, Chap. 10 has been submitted to a journal [138]; the multimodal approach of Chap. 11 has been presented in preliminary form as a poster in [139], and the complete study is in press for publication [140].

2 The Bag of Words paradigm

There are several application scenarios where the bag of words scheme has been applied with success; some examples have been presented in Chap. 1. All these approaches follow a common pipeline, which consists of several steps that lead to the construction of the bag of words representation and the solution of the task. While the general idea is clear (i.e. representing an object with a vector of counts), a general formalization seems to be missing in the literature. This chapter fills this gap, by defining a possible pipeline that can be employed to face a problem using a bag of words approach. This pipeline is schematically depicted in Fig. 2.1, and explained briefly in the following.

The starting point is the problem to solve or the task at hand; in the pattern recognition approach, we are given several training examples, i.e. instances or objects of the problem. The crucial aspect is to recognize that a given object of the problem can be seen as composed of simpler, “constituting” elements – that we will call words in the following (referring to the text domain where they have been originally introduced [203]).
In other fields of Computer Science, the concept of “word” is sometimes called atom, token, chunk, or building block. Depending on the problem, the identification of words can be straightforward or not: for example, it is intuitive that textual words are constituting elements of a text document, whereas it is not so easy to define words for an image. The universe of all words – that can constitute every possible object of the problem – is called dictionary. In Fig. 2.1 this stage of the pipeline is represented by the diamond “what to count”.

The second step is to perform the counting process, i.e. to determine the number of times each word of the dictionary appears in the object to represent. This leads to a numerical vector, where each element is associated with a word of the dictionary; its value is the number of times this word appears in the object. In Fig. 2.1, this stage of the pipeline is represented by the diamond “how to count”.

Through the bag of words vector, objects are embedded into a feature space: vectors in this feature space can already be used to solve the task, for example as input for a classifier. Otherwise, the bag of words representation can be modeled, for example by taking into explicit consideration the fact that these features are counts. One possible choice is to employ probabilistic modeling, which provides a consistent framework to explain the process of data generation, to manage the presence of uncertainty/noise, to provide a more interpretable description, and to possibly increase the predictive accuracy. This stage of the pipeline is represented by the diamond “how to model”.

Fig. 2.1. A possible pipeline of execution of a bag of words approach: problem/task → what to count → dictionary → how to count → bag of words representation → how to model → solution/knowledge.

The next sections are devoted to detailing the aforementioned steps: for every step, the mathematical notation and the formalization of the concepts is introduced, along with a brief survey of the state of the art.

2.1 What to count

Suppose that we are given a set of training objects X = {x_1, . . . , x_T}. In this stage, two problems have to be addressed. The first one is to define simpler, “constituting” elements whose repetition characterizes the object. For example, in Fig. 2.2(a) the object is a truck built with the famous Lego bricks, and it is reasonable to define the individual elements as the different types of bricks. The truck is a complex object, but it is composed of the repetition of some simpler bricks, opportunely assembled. In the following, we will refer to these elements (bricks) as words, and we will denote them with the symbol w. In the example, individual words are the different types of bricks that can be used. The second problem is to collect all words – that can constitute every possible object of the problem – in a dictionary. The mathematical definition of a dictionary is a set D comprising all possible words: D = {w_1, . . . , w_N}. Fig. 2.2(b) shows the dictionary of our example, which should contain all possible brick types, not limited to the ones needed for building the truck (see for instance the green brick in the dictionary, which is not a piece of the truck).

Fig. 2.2. (a) Words can be defined as the different bricks composing the Lego truck; (b) the dictionary is the set of all possible words.

The dictionary D can be prespecified and known a priori, or it can be created by aggregating all words observed at least once in at least one training instance.
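As a minimal illustration of this second option (a sketch, not code from the thesis; the toy documents and the whitespace tokenization are made up), a dictionary can be built by aggregating every word observed in a small training set:

```python
# Build a dictionary D by aggregating all words observed at least once
# in at least one training instance (toy documents, whitespace tokenization).
training_docs = [
    "the man killed the wolf",
    "the wolf killed the man",
    "humans recognize faces and spoken words",
]

dictionary = sorted({word for doc in training_docs for word in doc.split()})
print(dictionary)
# ['and', 'faces', 'humans', 'killed', 'man', 'recognize', 'spoken', 'the', 'wolf', 'words']
```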
In any case, it is worth stressing that the dictionary represents a universe: the constituting elements of any object must be elements of the dictionary. If a novel word – not contained in the dictionary – is observed, for example during the testing phase, it should either be discarded, or the dictionary has to be re-tuned. In the literature, there are many contexts where the dictionary and the words therein are clearly identifiable. In some other contexts it is more difficult, and the main effort of defining the bag of words is the identification of such words. In the following we detail these two possible cases.

2.1.1 Easy-to-define words

The first example in this scenario is the Natural Language Processing field, where the bag of words was originally introduced [68, 203]. In the original formulation, words are seen as the constituting elements of a text, and the dictionary has an intuitive and literal meaning. However, depending on the task, it may be more convenient to decompose a text into Ngrams (sets of N consecutive characters extracted from a word) [71], or syllables/phonemes [164]. This way of reasoning has also been exported to biology: many molecules are essentially strings or sequences (called polymers) composed of many repeated subunits. The most striking example is perhaps DNA, a long polymer “written” using four letters (A, T, C, G) called nucleotides. Similarly, proteins are made up of the linear combination of 20 different “building blocks” called amino acids [126]. In this context, words can be defined by taking individual symbols or Ngrams (like before, N consecutive nucleotides/amino acids extracted from a sequence – sometimes called Kmers in this biological context) [231]; or a word can be defined through complex heuristics that take into explicit consideration the biological significance [204].

Another bioinformatics example is shown in Fig. 2.3. The figure schematically represents a portion of the cell, centered around the nucleus. Colored dots correspond to mRNA molecules, whose amount and types ensure proper growth, development, and health maintenance of the cell [126]. mRNA molecules are copies transcribed from genes, and the production of these copies is regulated by an important process called gene expression. This mechanism acts both as an “on/off” switch to control which genes are expressed in a cell and as a “volume control” that increases or decreases the level of expression of particular genes as necessary [126]. Thus, the more expressed a gene is, the more copies will be transcribed. Given this, it is reasonable to assert that genes are “constituting elements” of the cell, and can be employed as words to be used in a bag of words representation [23]. The dictionary, i.e. the ensemble of all the genes in an organism, is usually known a priori, and obtained through complex sequencing studies like the human genome project [50].

Fig. 2.3. The amount and types of mRNA molecules in a cell – represented by colored dots in the schematic portion of the cell portrayed – reflect the function of the cell. On the right, a possible dictionary containing the list of all known genes.

In a very similar fashion, the immune system gathers evidence of a viral infection by surveying the amount and types of epitopes (small segments of the viral proteins) which are cleaved in the cell and presented to the cell surface as a means of warning.
The immune system sees these epitope sets as disordered “bags”, based on whose counts the action needs to be taken [49]. In this context, it may be reasonable to employ the epitopes as words.

2.1.2 Difficult-to-define words

In the previous section we gave a brief and non-exhaustive list of examples where the word definition is somehow natural or easily derived from the problem. However, for some other applications the concept of “word” is not so explicitly defined. A striking example can be found in Computer Vision, where bag of words approaches brought a substantial boost to the state of the art, and have been a turning point in the field. The main merit of these approaches is that they were able to define words that can be counted in natural images. For example, a successful line of research is aimed at extracting repeating keypoints that may represent local salient areas of the images – much like words are local features of a document. Saliency can be defined for example as stability under minor affine and photometric transformations (such as in SIFT, SURF, HOG [57, 141]), or based on computational models of the human visual attention system – these last approaches are concerned with finding locations in images that are visually salient (e.g. high spatial frequency edges) [82]. In any case, however, these local image descriptors are high-dimensional, real-valued feature vectors. Thus, a vector quantization step is required to obtain a discrete vocabulary: this is traditionally performed with a clustering algorithm such as k-means [100], or learned with more sophisticated approaches [3] which take into account the final task (e.g. classification). In the end, words correspond to clusters: different SIFTs in the same cluster are represented by the same word, with a consequent information loss. A graphical representation of the approach is illustrated in Fig. 2.4.

Fig. 2.4. Pipeline for defining words in images. (a) Keypoints are extracted from training images and embedded in a vector space; (b) the dictionary is derived with a vector quantization step that clusters together similar keypoints into a single “word”; (c) whenever a new keypoint is extracted, it is assigned to the nearest word.

Finally, another research line suggests generating keypoints by sampling the images using a grid or pyramid structure, or even by random sampling: these have been historically preferred for fast extraction of words in videos [212]. It is worth mentioning that in computer vision, bag of words representations have also been proposed to characterize textures [55, 230], where words are the repeating texton elements, and 2D and 3D shapes [127]. A final consideration: as in the text scenario, when applied to images the bag of words destroys the structure, i.e. the spatial layout of keypoints in images, contrarily to what happens in some of the examples we presented in the field of bioinformatics.

Another interesting application domain where bag of words approaches have been successfully used, but where the definition of words required some effort, is audio processing. In particular, [215] noted that sounds that human listeners find meaningful are best represented in the time-frequency domain, and these kinds of representations essentially count the number of time-frequency acoustic quanta that collectively make up complex sound scenes, similar to how we count words that make up documents. It is important to notice that a time-frequency transform is often complex-valued, and is often computed with tools such as the short-time Fourier transform, constant-Q transforms, wavelets, etc. However, because the hearing system is more sensitive to the relative energy between different frequencies, for most practical applications only the modulus of these transforms is used, whereas the phase is discarded. Thus, a discrete non-negative count value can be derived, from which the analogy sound frequency / word can be established [211, 213–215]. Other approaches for audio processing are instead based on Mel Frequency Cepstral Coefficients (MFCCs [242]), and employ a vector quantization step (similarly to the Computer Vision scenario) of these coefficients to derive the acoustic words [115, 123].

Once the dictionary is built, the next step is to perform the counting process and obtain the bag of words vector representation.
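Before moving to the counting step, the “easy-to-define” biological case of Sect. 2.1.1 can be made concrete with a short sketch (toy sequences and an arbitrary K, not taken from the thesis): overlapping Kmers of a DNA string are used as words and aggregated into a dictionary.

```python
from collections import Counter

def kmers(sequence, k):
    """All overlapping words of length k (Kmers) in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequences = ["ATGCGATG", "GCGATGCA"]      # toy training sequences
K = 3

# Dictionary: every Kmer observed at least once in the training set.
dictionary = sorted({w for s in sequences for w in kmers(s, K)})

# Bag of words vector for the first sequence.
counts = Counter(kmers(sequences[0], K))
print(dictionary)                          # ['ATG', 'CGA', 'GAT', 'GCA', 'GCG', 'TGC']
print([counts[w] for w in dictionary])     # [2, 1, 1, 0, 1, 1]
```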
2.2 How to count

Counting is perhaps one of the oldest mathematical activities, and one of the first we learn as children. Intuitively, to count means to determine the number of elements in a set. Standard English dictionaries define it as “to say numbers one after the other in order, to calculate the number of people or things in a group”. The mathematical definition of “counting” [69, 85] resembles this last definition: given a finite set of elements Y, to count is to establish a function between the elements of Y and the natural numbers N_0 in progressive order (zero is excluded). Therefore, one element of the set is associated with “1”, another one with “2”, and so on until all elements of Y are assigned a natural number. We would like this function to be bijective, so as to determine the count value. Thus, we first define

N_n = {x ∈ N | 1 ≤ x ≤ n}

For each integer n ∈ N_0, N_n is the set of natural numbers up to n. Then, to count a finite set Y is to establish a bijection f : Y → N_n for some n ∈ N_0. In other words, if there exists a bijection f : Y → N_n, then we say that the number of elements in Y is n and write |Y| = n. A graphical representation using sets is depicted in Fig. 2.5.

Fig. 2.5. To count means to establish a bijective function between the set Y and the natural numbers considered up to n.

This definition can be useful in the bag of words representation to count the number of instances of one word; since in general the object is composed of the repetition of different words, we want to extend this notion, defined on a set, to a more general scenario. For this reason, we denote an object X as a multiset [32], a generalization of the notion of a set where the members are allowed to appear more than once. For example, X = {'a','a','a','b','d'}. In addition, as described in the previous section, we are given a dictionary D with |D| = W elements: in this particular example, we define D = {'a','b','c','d'}, and W = 4. To build a bag of words vector x for the object X, we count how many times each word w_i ∈ D occurs in X. Specifically, we build a vector where each element represents a word in the dictionary and the value of that element is the number of times the word appears in the object. Mathematically, we can think of this as a function

count : (X, D) → N,   (X, w_i) ↦ |{w_i | w_i ∈ X}|    (2.1)

The bag of words vector x of the object X is then a vector of size W defined as

x = [count(X, w_1), count(X, w_2), . . . , count(X, w_W)]    (2.2)

In the example, x = [3, 1, 0, 1].
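Eqs. 2.1-2.2 translate directly into code; the following minimal sketch (function names are ours) reproduces the multiset example above:

```python
def count(X, w):
    """Eq. 2.1: number of occurrences of the word w in the multiset X."""
    return sum(1 for element in X if element == w)

def bag_of_words(X, D):
    """Eq. 2.2: one count per dictionary word, in dictionary order."""
    return [count(X, w) for w in D]

X = ['a', 'a', 'a', 'b', 'd']   # the object, as a multiset
D = ['a', 'b', 'c', 'd']        # the dictionary, |D| = W = 4
print(bag_of_words(X, D))       # [3, 1, 0, 1]
```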
In this particular formulation, counts are required to be discrete values. However, the definition can also be extended to continuous values, motivated by the two following considerations:

• There are cases where the count is a discrete value, but due to technological problems we can only observe a continuous value “proportional” in some way to the real count. Consider for example the gene expression scenario presented in the previous section: genes, and molecules in general, are very difficult to observe directly. The current technology to measure gene expression, called DNA microarray, detects and quantifies mRNA by detecting fluorescence-labeled targets. A researcher must then use a special scanner to measure such fluorescent intensity, and the raw image extracted is processed with image processing techniques to obtain a final expression value.
• More generally, the constraint can be relaxed by drawing a parallelism between a count value and a measure that reflects a value of presence, importance, power, or frequency. Actually, it is reasonable to interpret these values as counts: the more present an element is in an object, the higher its count. For example, as described above, acoustic words are counted by computing the magnitude of a signal in the Fourier domain, and this can result in a real value [215].

Finally, it is worth noticing that a concept often related in the literature to counting is the histogram [167]. In its precise definition, a histogram is a graphical representation of the distribution of continuous data. For simplicity, consider the 1-D case, where we have several real numbers laid on the x axis. A histogram is obtained by first “binning” the range of possible values that are observed – that is, dividing the entire range of values into a series of small, discrete intervals. Then, one counts how many values fall into each interval, and draws a rectangle having width equal to the interval range, and height equal to the count. An example of a histogram is pictured in Fig. 2.6. From the bag of words perspective, each bin in a histogram represents a word, which has been obtained through a discretization (i.e. a vector quantization) of the observed continuous values; the height of each bin is the count value that is present in an entry of the bag of words vector.

Fig. 2.6. A histogram is a rough estimate of the distribution of continuous values, obtained by depicting the count of values occurring in certain ranges.
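The histogram view can be reproduced in a few lines (a sketch with NumPy; the data and the number of bins are arbitrary): each bin plays the role of a word, and the bin counts form the bag of words vector.

```python
import numpy as np

# Arbitrary continuous observations.
values = np.array([0.7, 1.2, 2.9, 3.1, 3.4, 5.0, 5.2, 7.8, 9.9, 10.4, 11.6])

# Binning: split the range [0, 12) into 6 equal intervals; each interval is a "word".
counts, bin_edges = np.histogram(values, bins=6, range=(0, 12))

print(bin_edges)   # 7 edges delimiting the 6 bins, i.e. the quantized dictionary
print(counts)      # the bag of words vector: one count per bin
```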
2.3 How to model

Through the bag of words representation, an object is projected into a vector space. In this space, the problem under consideration may be solved. However – depending on the task at hand – it can be convenient to employ a model that further exploits the information contained in the bag of words vectors. This is done in order to increase the performances, to explain the process of data generation (if the goal is classification or clustering), to highlight particular facets of the data providing a more interpretable description (if the goal is visualization/interpretation), or to validate the observed bag of words. Many models exist and have been proposed, depending on the application and the problem to solve. This thesis adopts the perspective of probabilistic modeling, which will be explained in the following sections: statistics and probability theory provide a consistent framework for the quantification and manipulation of uncertainty, and can take into account all of the aforementioned considerations. We will introduce the notions and a general framework for probabilistic modeling, along with some examples: specific models will be presented when needed throughout the thesis. Before that, we will introduce how the bag of words can be seen from a probabilistic perspective.

2.3.1 The bag of words as a multinomial

In this section we will describe how the bag of words can be regarded as a random variable. Consider a simple example where a die is thrown. The result of one throw is a discrete random variable that can take one of 6 possible mutually exclusive values. In the spirit of the bag of words approach, we will refer to each of these values as a word, and the 6 possible words constitute the dictionary. Therefore D = {1, 2, 3, 4, 5, 6}. There are different ways of expressing the variable characterizing a word: a particularly convenient representation is the “1-of-W” scheme, where the variable is a W-dimensional (W = 6 in our dice example) vector w in which one of the elements w_v equals 1, and all remaining elements equal 0. Suppose for example that a particular observation of the variable corresponds to the result “4” of the die. Then w will be represented as

w = [0, 0, 0, 1, 0, 0]

We can think of this as an indicator function that “selects” the observed word in the dictionary. Thus, we can refer to a word either with its index v in the dictionary, or with a “1-of-W” vector w where w_v = 1. In addition to that, we are aware of the probabilities of the different words: in our example, they are all equal to 1/6. If these probabilities are encoded in the vector

π = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

then the distribution of w is

p(w|π) = ∏_{v=1}^{W} π_v^{w_v}    (2.3)

Since all entries of w are zeros except one, in our example this formula simply reduces to p(w|π) = π_4 = 1/6. Then we can make a step forward: we would like to obtain a vector x that counts how many times the different words w occurred throughout N independent throws of the die, i.e. a proper bag of words vector. First, let us denote each of the N results as w_1, . . . , w_N. Through the 1-of-W scheme, x is easily computed with the element-wise sum of the w_n:

x = w_1 + w_2 + . . . + w_N    (2.4)

Suppose for example that in 5 different throws we obtain the results {3, 3, 5, 3, 2}. Through the 1-of-W scheme, the bag of words is computed as follows:

w_1 = [0 0 1 0 0 0]
w_2 = [0 0 1 0 0 0]
w_3 = [0 0 0 0 1 0]
w_4 = [0 0 1 0 0 0]
w_5 = [0 1 0 0 0 0]
Σ_i w_i = x = [0 1 3 0 1 0]

A multinomial distribution is the probability distribution that describes x, namely the number of times each of the W possible words occurs out of N trials, where each word has probability π_k; π represents the parameter of the multinomial distribution. Usually, we are interested in computing the probability that a particular observation x is generated by a multinomial distribution with known parameter π. Since each word w_i is independent, the probability mass function can be derived from Eq. 2.3:

p(x|π, N) = (N choose x_1, x_2, . . . , x_W) ∏_{k=1}^{W} π_k^{x_k}    (2.5)

where the normalization coefficient is the number of ways of partitioning N words into W groups of size x_1, . . . , x_W and is given by

(N choose x_1, x_2, . . . , x_W) = N! / (x_1! x_2! . . . x_W!)    (2.6)

Note that the variables x_k are subject to the constraint

Σ_{k=1}^{W} x_k = N
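The example of this section can be checked numerically. The sketch below (NumPy/SciPy; the variable names are ours) encodes the five throws {3, 3, 5, 3, 2} in the 1-of-W scheme, sums them into the bag of words x of Eq. 2.4, and evaluates the multinomial probability of Eq. 2.5.

```python
import numpy as np
from scipy.stats import multinomial

W = 6                                  # dictionary size (faces of the die)
pi = np.full(W, 1.0 / 6)               # parameter vector of a fair die

throws = [3, 3, 5, 3, 2]               # observed results
one_hot = np.zeros((len(throws), W))   # 1-of-W encoding of each throw
one_hot[np.arange(len(throws)), np.array(throws) - 1] = 1

x = one_hot.sum(axis=0).astype(int)    # Eq. 2.4: element-wise sum
print(x)                               # [0 1 3 0 1 0]

# Eq. 2.5: probability of observing x in N = 5 throws of the fair die.
print(multinomial.pmf(x, n=len(throws), p=pi))
```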
2.3.2 Probabilistic models

Once the key distributions are defined, probabilistic manipulations can be expressed in terms of two simple equations, known as the sum rule and the product rule [28]. In general, given two random variables a and b, we can write the following equations:

(Sum rule)      p(a) = Σ_b p(a, b)    (2.7)
(Product rule)  p(a, b) = p(b|a) p(a)    (2.8)

The sum rule is sometimes called marginalization, and the sum is over all possible values b can take. Note also that the summation must be replaced by an integral if b is continuous rather than discrete. All of the probabilistic inference and learning manipulations discussed in this thesis (but the statement is much more general), no matter how complex, amount to repeated application of these two equations. For example, by applying the product rule twice, one can easily derive Bayes’ rule, which states the following:

(Bayes’ rule)   p(a|b) = p(b|a) p(a) / p(b)    (2.9)

These basic equations serve as the main ingredients of probabilistic generative models [81]. The goal of generative modeling is to formally develop statistical models that can explain the input data, or visible variables x, as tangible effects that are generated from a combination of hidden variables h, representing the causes, also coupled with conditional interdependencies.

Let us look back at the dice example, where a die is thrown N times. We already introduced the variable x, representing the sum of the vectors w_1 + . . . + w_N (each w_i corresponding to the result of one throw represented through the “1-of-W” scheme). We also noted that x is a multinomial variable. We can complicate the example by supposing that there are two possible dice: one is a common die (denoted h_1), the other has only the odd numbers, duplicated (denoted h_2). In this example, before throwing the die N times, the identity of the die h is chosen. Moreover, we are only able to see the result of the throws, but not the identity of the die. Our goal is to understand whether a particular observation x resulted from throwing N times either h_1 or h_2. In order to do so, the idea is to compute p(h = h_1|x) and p(h = h_2|x), called the posterior probabilities; after that, we can decide that x has been generated by the die ĥ, where

ĥ = arg max_h p(h|x)

The problem of course is to compute the posterior p(h|x), which can be solved by reversing the conditional probability using Bayes’ rule, thus leading to

p(h = h_1|x) = p(x|h = h_1) p(h = h_1) / p(x)

and, in a similar way, we can compute p(h = h_2|x). At this point, one should recall that p(x|h) is a multinomial distribution whose parameter varies depending on the die h chosen:

p(x|h, N) = (N choose x_1, x_2, . . . , x_W) ∏_{k=1}^{W} (π_k^{(h)})^{x_k}    (2.10)

To put these formulae into concrete perspective, suppose we instantiate our example with the following:

• x = [4 0 3 0 3 0], N = 10;
• p(h_1) = p(h_2) = 0.5, i.e. the prior probability that one die is preferred to the other is flat;
• π^{(h_1)} = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6], whereas π^{(h_2)} = [1/3, 0, 1/3, 0, 1/3, 0]

Then,

p(h = h_1|x) = p(x|h = h_1) p(h = h_1) / p(x) = p(x|h = h_1) p(h = h_1) / Σ_{i=1}^{2} p(x|h = h_i) p(h_i) = (7 · 10^{-5} · 0.5) / 0.0356 ≈ 0.001

and

p(h = h_2|x) = p(x|h = h_2) p(h = h_2) / p(x) = p(x|h = h_2) p(h = h_2) / Σ_{i=1}^{2} p(x|h = h_i) p(h_i) = (0.0711 · 0.5) / 0.0356 ≈ 0.999

We can conclude that it is much more likely that the observed x has been generated by throwing the die h_2 ten times.
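The same posterior can be verified with a few lines of code (a sketch using SciPy; it reproduces the numbers above up to rounding).

```python
import numpy as np
from scipy.stats import multinomial

x = np.array([4, 0, 3, 0, 3, 0])               # observed bag of words, N = 10
N = int(x.sum())

prior = {"h1": 0.5, "h2": 0.5}                 # flat prior over the two dice
pi = {
    "h1": np.full(6, 1.0 / 6),                 # common die
    "h2": np.array([1/3, 0, 1/3, 0, 1/3, 0]),  # die with only odd faces, duplicated
}

# Likelihoods p(x|h) from the multinomial of Eq. 2.10.
lik = {h: multinomial.pmf(x, n=N, p=pi[h]) for h in prior}

# Bayes' rule: p(h|x) = p(x|h) p(h) / sum_i p(x|h_i) p(h_i).
evidence = sum(lik[h] * prior[h] for h in prior)
posterior = {h: lik[h] * prior[h] / evidence for h in prior}
print(posterior)                               # approx. {'h1': 0.001, 'h2': 0.999}
```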
2.3.3 Bayesian networks

One could proceed to formulate and solve complicated probabilistic models purely by the algebraic manipulations introduced in the previous section. However, this can result in an unnecessarily complex framework, leading to a proliferation of formulae to keep track of. For this reason it is highly advantageous to augment the analysis with graphical representations of probability distributions, called Bayesian networks, which build on the concept of a graph. In a Bayesian network, each node represents a random variable, and the links express probabilistic relationships between these variables. The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors, each depending only on a subset of the variables. More specifically, a Bayesian network for random variables a_1, ..., a_N is a directed acyclic graph on the set of variables, along with one conditional distribution for each variable given its parents, p(a_i|a_{pa_i}). The graph captures the way in which the joint probability of all the variables is decomposed, namely

p(a_1, ..., a_N) = \prod_{i=1}^{N} p(a_i|a_{pa_i})

that is, the value taken by a variable is influenced only by the values taken by its direct parents. A graph example is shown in Fig. 2.7: the joint distribution of the three variables a, b, and c is given by p(a, b, c) = p(c|a, b)\,p(a)\,p(b). In a typical Bayesian network, however, some of the nodes v are clamped to observed values (i.e. measurements), while other nodes h are hidden and aim at representing the causes that generated that particular set of observations, which we call the training data.

Consider the Bayesian network for the dice example, shown in Fig. 2.8 (left). First, note that the variable h is represented with a shaded node, to denote that it is hidden. As already mentioned, h is a discrete variable that represents the index 1, 2 of the die from which the data point w, i.e. the result of a throw (visible variable), is generated. The joint distribution of this model is

p(w_1, ..., w_N, h) = p(h) \cdot \prod_{i=1}^{N} p(w_i|h)    (2.11)

Fig. 2.8. A simple Bayesian network for the dice example, drawn with the nodes w_1, ..., w_N made explicit (left) and in plate notation (right). In other literature, this model is known as a mixture of unigrams.

This network describes the generation of a bag of words x by simulating the process of throwing a die N times. In fact, first a die h is selected according to p(h) (h has no parent in the graph), and then values are assigned to the N variables w_i by drawing N times from p(w|h). The final bag of words x is constructed by simply summing: x = w_1 + ... + w_N.

The graphical representation can be made more compact by surrounding some of the nodes with a box, called a plate (as in the right part of Fig. 2.8). When dealing with complex models, it is inconvenient to explicitly draw the N nodes w_1, ..., w_N; plate notation allows such multiple nodes to be expressed more compactly: a single representative node w is surrounded by a plate labeled with N, indicating that there are N nodes of this kind, which are independent and identically distributed (i.i.d.). Finally, it is possible to include in the representation the parameters of the distributions. In Fig. 2.8 we explicitly depicted π, not circled, to denote that it is not treated as a random variable¹. This simple bag of words model was originally named in the literature the mixture of unigrams model [158].

¹ Of course, it is possible to treat parameters as random variables. In that case, we should also specify a distribution over the parameters, p(π), and include it in the joint probability decomposition.

So far, we described the meaning of a Bayesian network and how it is possible – given some specified parameters – to evaluate the joint probability of the variables.
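The generative reading of Eq. 2.11 can be illustrated with a minimal sampling sketch (the variable names are illustrative); note how a single hidden draw h is shared by all N words of the bag.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bag_of_words(p_h, pi, N):
    """Mixture of unigrams (Fig. 2.8): pick a hidden die h ~ p(h),
    draw N words from p(w|h) in 1-of-W form, and sum them."""
    h = rng.choice(len(p_h), p=p_h)                 # hidden component
    words = rng.multinomial(1, pi[h], size=N)       # N one-of-W draws
    return h, words.sum(axis=0)                     # x = w_1 + ... + w_N

p_h = [0.5, 0.5]
pi = [np.full(6, 1.0 / 6), np.array([1/3, 0, 1/3, 0, 1/3, 0])]
h, x = sample_bag_of_words(p_h, pi, N=10)
```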
In the following, we introduce the two main tasks that represent the true essence of Bayesian networks: learning and inference. In the learning phase, training data are used to infer plausible configurations of the model parameters π. With the model trained, inference consists in "querying" the model, in order to compute estimates or make decisions about essentially any probabilistic relation that can be expressed between two or more variables of the model.

2.3.4 Inference and learning in Bayesian networks

When the model is trained and its parameters are fully specified, the main inferential problem consists in computing the posterior distribution of one or more subsets of hidden nodes h, which will be denoted with q(h):

q(h) = p(h|v)    (2.12)

In general, probabilistic inference reduces to using Bayes' rule and the other rules of probability to express the posterior distributions as functions of the conditional distributions specified by the joint probability of the model. As an example, consider again the mixture of unigrams model for the dice example in Fig. 2.8: the goal is to infer the posterior distribution over h, p(h|w), where for simplicity we write w instead of w_1, ..., w_N. Using Bayes' rule we can write

q(h) = p(h|w) = \frac{p(w|h)\,p(h)}{p(w)}

where p(w) can be computed using the sum and product rules:

p(w) = \sum_h p(w, h) = \sum_h p(w|h)\,p(h)

In general, when dealing with complex models or models with more than one hidden variable, it may happen that the distribution p(h|v) cannot be computed, or that it requires an exponential number of values to be stored. In such cases, the posterior p(h|v) is said to be intractable, and a variety of approximations (called variational approximations) can be made in order to make computations possible and efficient [81]. In the exact formulation, if there are N hidden variables, one must use an unconstrained form for p(h|v), i.e. no factorization over the h_i is assumed:

q(h) = p(h|v) = p(h_1, ..., h_N|v)

The main idea of variational approximation is to keep only a few dependencies between the hidden variables in the posterior. Assume for example a single dependence: h_i depends on h_j. This yields the following factorization:

q(h) = \left[ \prod_{n=1,\, n \neq i}^{N} p(h_n|v) \right] p(h_i|h_j, v) \approx p(h|v)    (2.13)

This approximation is called a structured variational approximation [81]. Another common choice is to assume complete factorization over the hidden variables h_i; in this case the posterior becomes

q(h) = \prod_{n=1}^{N} p(h_n|v) \approx p(h|v)    (2.14)

This kind of approximation is called the mean field approximation, and it is the most frequently used, due to its simplicity.

In the learning phase, we want to estimate plausible configurations of the model parameters. Learning is possible provided that we are given several examples, or training data: the idea is that there is a setting of the parameters that produced the observed training data. Since the model parameters π are unknown at this point, we treat them as hidden variables. Hidden variables can thus be divided into the parameters, denoted by π, and one set of hidden variables h^{(t)} for each training case t, t = 1, ..., T, so that h = [π, h^{(1)}, ..., h^{(T)}]. In a similar way, there is one set of visible variables for each training case, so v = [v^{(1)}, ..., v^{(T)}].
Assuming that the training cases are independent and identically distributed (i.i.d.), the distribution over all visible and hidden variables (including the parameters) is

p(h, v) = \prod_{t=1}^{T} p(h^{(t)}, v^{(t)}, \pi) = p(\pi) \prod_{t=1}^{T} p(h^{(t)}, v^{(t)}|\pi)    (2.15)

For example, in the mixture of unigrams model, with T i.i.d. training bags x^{(t)}, t = 1, ..., T, the joint probability is given by

p(h, w) = \prod_{t=1}^{T} p(h^{(t)}, w^{(t)}, \pi) = p(\pi) \prod_{t=1}^{T} p(h^{(t)}, w^{(t)}|\pi) = p(\pi) \prod_{t=1}^{T} p(h^{(t)}) \prod_{n=1}^{N} \prod_{v=1}^{W} \left( \pi_v^{(h^{(t)})} \right)^{w_{n,v}^{(t)}}    (2.16)

where w_{n,v}^{(t)} denotes the v-th component of the n-th 1-of-W word vector of training case t. Given this quantity, the learning problem can be seen as the problem of maximizing the data likelihood

p(v) = \sum_h p(h, v)    (2.17)

i.e. finding the model which best fits the data. In formulae, the best parameter configuration \hat\pi is

\hat\pi = \arg\max_\pi \sum_h p(h, v)    (2.18)

For the same reasons discussed for the inference phase, the likelihood may be intractable, and approximate techniques must be employed. One of the most famous tools in statistical estimation for approximate inference is Expectation-Maximization (EM [81]), which is presented in the next section.

The Expectation-Maximization algorithm

In the context described so far, for a set of parameters π and remaining hidden variables h^{(1)}, ..., h^{(T)}, EM is an algorithm that obtains a point estimate of π, which will be called \hat\pi, and computes the exact posterior over the other random variables h^{(t)} given \hat\pi. The starting point is to derive a bound on the log-likelihood, i.e. on the log-probability of the visible variables, ln p(v). This derivation can be carried out using Jensen's inequality: given a real convex function f, numbers x_1, ..., x_n in its domain, and probabilities μ_1, ..., μ_n,

f\!\left( \sum_{k=1}^{n} \mu_k x_k \right) \le \sum_{k=1}^{n} \mu_k f(x_k)    (2.19)

If the function f is concave instead of convex, the direction of the inequality is simply reversed. To obtain a convex combination inside the concave ln function of the log-likelihood, we employ the posterior distribution q discussed for inference:

\ln p(v) = \ln\!\left( \sum_h p(h, v) \right) = \ln\!\left( \sum_h q(h) \frac{p(h, v)}{q(h)} \right)    (2.20)

\ge \sum_h q(h) \ln\!\left( \frac{p(h, v)}{q(h)} \right) = -\mathcal{F}(q, p)    (2.21)

The function \mathcal{F} is called the free energy, and it is an upper bound on the negative log-likelihood. Moreover, since we have to account for the training data, we can rewrite the free energy as

\mathcal{F}(q, p) = -\ln p(\pi) + \sum_{t=1}^{T} \sum_{h^{(t)}} q(h^{(t)}) \ln \frac{q(h^{(t)})}{p(h^{(t)}, v^{(t)}|\pi)}    (2.22)

which is the main equation we will refer to when speaking of the free energy. Note that, since the prior p(π) is assumed constant, it can be omitted in the subsequent derivations. In this equation there are two unknown quantities: the distributions q(h^{(t)}) and the parameters π. EM estimates these two quantities by alternating between minimizing \mathcal{F}(q, p) with respect to the set of distributions q(h^{(1)}), ..., q(h^{(T)}) (Expectation step, or E-step) and minimizing \mathcal{F}(q, p) with respect to π (Maximization step, or M-step). These two solutions give the EM algorithm, summarized in the following steps:

• Initialization: choose values for \hat\pi (randomly, or using some clever strategy).
• E-step: compute p(h^{(t)}|v^{(t)}, \hat\pi), then assign q(h^{(t)}) ← p(h^{(t)}|v^{(t)}, \hat\pi).
• M-step: minimize \mathcal{F}(q, p) with respect to \hat\pi by solving
  \frac{\partial}{\partial \hat\pi} \left( \sum_{t=1}^{T} \sum_{h^{(t)}} q(h^{(t)}) \ln p(h^{(t)}, v^{(t)}|\hat\pi) \right) = 0
• Repeat the E- and M-steps for a fixed number of iterations or until convergence.
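A minimal numerical sketch of these steps for the mixture of unigrams is given below; it assumes a flat prior over π (so the -ln p(π) term of Eq. 2.22 is dropped) and uses a small smoothing constant only to avoid taking the logarithm of zero.

```python
import numpy as np

def em_mixture_of_unigrams(X, K=2, n_iter=100, seed=0):
    """Minimal EM for the mixture of unigrams: X is a (T, W) matrix of bag of
    words vectors; returns the mixing weights p(h) and one multinomial
    parameter vector pi[h] per component (a flat prior over pi is assumed)."""
    rng = np.random.default_rng(seed)
    T, W = X.shape
    p_h = np.full(K, 1.0 / K)
    pi = rng.dirichlet(np.ones(W), size=K)                  # random initialization
    for _ in range(n_iter):
        # E-step: q(h^(t)) proportional to p(h) * prod_v pi[h, v]^x[t, v] (log space)
        log_q = np.log(p_h) + X @ np.log(pi.T + 1e-12)
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the expected counts
        p_h = q.mean(axis=0)
        pi = q.T @ X + 1e-12
        pi /= pi.sum(axis=1, keepdims=True)
    return p_h, pi
```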
Summarizing, the bag of words paradigm provides a general framework – articulated in three main steps – that can be employed to solve a pattern recognition problem. Many contributions in the literature have enhanced each stage, but the formalization of a general pipeline seems to be missing: this chapter provided a possible one, which will be exploited in the solution of the problems described in the next parts of this thesis.

Part I
Gene expression analysis

3 The gene expression analysis problem

In recent years, the research areas of molecular biology and genomics have experienced rapid and profitable growth, thanks to advances in knowledge and in technology. On one hand, individual studies led to new discoveries about the roles played by specific genes in the development of diseases. On the other hand, population studies became possible with technologies such as DNA microarrays [92] and RNA-seq [235], which provide scientists with a way to measure the expression levels of thousands of genes simultaneously. This created a stringent need for algorithmic approaches able to extract information from the data and to create compact, interpretable representations of the problem. This part of the thesis describes how gene expression data can be approached with bag of words techniques. In particular, this chapter explains the problem and the computational challenges related to the analysis of gene expression data, along with the state of the art in the recent literature. Motivations and original contributions in this context are then summarized, and detailed in the following two chapters.

3.1 Background: gene expression

The balance of cell processes such as growth, response to stimuli, and maintenance is regulated in a complex way by the mechanism of gene expression. In classical genetics, a gene is an abstract concept – a unit of inheritance that ferries a characteristic from parent to child [166]. Examples of inherited characteristics are the color of a person's eyes, the blood type, or diseases such as haemophilia and color blindness, to name a few. Further studies, and the development of biochemistry, proved that the hereditary nature of every living organism is defined by its genome, which contains the genes [41]. The genome consists of a long sequence of molecules called nucleic acids – in particular, DNA. Most DNA molecules consist of two strands coiled around each other to form a double helix. The two DNA strands are known as polynucleotides, since they are composed of simpler units called nucleotides: there are four possible nucleotides in DNA, guanine (G), adenine (A), thymine (T), and cytosine (C). Structurally, A pairs with T and C with G, mainly for dimensional reasons – only these combinations fit the constant-width geometry of the DNA spiral.

DNA provides the information needed to construct the organism. The term information is used because the genome does not itself perform any active role in the development of the organism: through a complex series of interactions, the gene sequence is used to produce another type of molecules, the proteins, at the appropriate time, place, and quantity [118]. Proteins either form part of the structure of the organism, or have the capacity to perform the chemical reactions necessary for life. The process by which the information of a gene is used in the synthesis of a functional product (i.e. a protein) is called gene expression. This process is essentially articulated in two stages:

Transcription: DNA expresses its genetic instructions by first transferring its information to a messenger RNA (mRNA) molecule, in a process called transcription.
The term transcription is appropriate because, although the information is transferred from DNA to RNA, it remains in the language of nucleic acids. A gene encoding a protein contains not only the sequence that will eventually be translated into the protein (the coding sequence), but also regulatory sequences that direct and regulate the synthesis of that protein.

Translation: the mRNA molecule then transfers the genetic information to a protein by specifying its amino acid sequence. This process is termed translation because the information must be translated from the language of nucleotides into the language of amino acids. Since the cardinality of the protein alphabet is greater than that of the nucleotides, an important question is how many nucleotides are necessary to specify a single amino acid. With a sequence of 3 nucleotides there are 4³ = 64 possible combinations of the 4 RNA alphabet symbols, more than enough to specify 20 different amino acids (in fact, the code is redundant – a mechanism aimed at preventing translation errors). During translation, the RNA nucleotides are "read" by the translational machinery as a sequence of nucleotide triplets, each one coding for a specific amino acid. Two special triplets specify the start and the end of the protein sequence.

Gene expression is a highly complex process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs. This mechanism acts both as an "on/off" switch, controlling which genes are expressed in a cell, and as a "volume control", increasing or decreasing the level of expression of particular genes as necessary. Disruptions or changes in gene expression are responsible for many diseases. Gene expression may be controlled at any of a number of points along the molecular pathway from DNA to protein: the most important one is transcription, where regulation selects the genes to be transcribed into mRNA and sets the efficiency of the process, namely how many proteins have to be produced.

3.1.1 DNA Microarray

The technologies that permit the detection of the expression levels of the genes of an organism have become available only in recent years, with the advent of a technology called DNA microarray. A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time [38, 186]. Microarrays are microscope slides printed with thousands of tiny spots arranged in a grid, each spot containing a known, single-strand DNA sequence corresponding to a gene (Fig. 3.1). The DNA molecules attached to each slide act as probes to detect gene expression, i.e. the set of messenger RNA (mRNA) transcripts expressed by a group of genes.

Fig. 3.1. (left) A microarray slide contains thousands of probes, each one corresponding to a known gene. (right) Labeled cDNA hybridizes to the slide, and the emitted fluorescence is a measure of gene expression.

To perform a microarray analysis, mRNA molecules are typically collected from an experimental sample: the mRNAs are converted into complementary DNA (cDNA) and labeled with a fluorescent dye. The sample is then allowed to bind to the microarray slide, in a process called hybridization. If a particular gene is very active, it produces many molecules of messenger RNA, and thus more labeled cDNA, which hybridizes to the probes on the microarray slide and generates a very bright fluorescent area.
Genes that are less expressed produce fewer mRNA molecules, hence less labeled cDNA hybridizes, which results in dimmer fluorescent spots. If there is no fluorescence, no mRNA has hybridized to the probe, indicating that the gene is inactive [38, 186]. Following hybridization, the microarray is scanned to measure the expression of each gene printed on the slide, resulting in a fluorescence image such as the one shown in Fig. 3.2. The output of the digital system that scans the fluorescence image is a matrix of numbers giving a quantitative value for the expression of each gene in the various spots of the microarray. This is done with image-processing techniques, which help to segment the spots, remove noise, assess the quality of the spots, and quantify the signal. Clearly, this is a non-trivial task, due to a wide variety of factors such as the strong noise present in the image, the difficulty of estimating background and foreground, the imperfect alignment of the spots to an ideal grid, and others [221].

Fig. 3.2. Example of fluorescence image obtained after microarray hybridization.

Usually, several hybridizations are carried out: the idea is to measure the expression of different samples, which can belong to one or several classes. For example, some samples could be collected from healthy individuals, and others from individuals with a disease such as cancer. The final result, which is typically the data investigated with pattern recognition techniques, is a gene expression matrix, in which a row corresponds to a gene, a column to a sample, and a given entry represents the expression level of that particular gene in a given experiment (sample). Summarizing, a gene expression matrix is the combination of many different microarray experiments, each one arranged in a column, measuring the expression levels of all genes (each one arranged in a row). An example is shown in Fig. 3.3.

As a final comment, it is important to notice that, even if the microarray technology is currently the most widespread, emerging and more advanced technologies will eventually make microarrays obsolete. One worth mentioning is RNA-Seq [235], which makes it possible to investigate at high resolution all the RNAs present in a sample, characterizing their sequences and quantifying their abundances at the same time. In practice, millions of short strings, called "reads", are sequenced from random positions of the input RNAs. These reads can then be computationally mapped onto a reference genome to reveal a "transcriptional map", where the number of reads aligned to each gene gives a direct count of its expression level [76].

3.2 Computational analysis of a gene expression matrix

Fig. 3.3. A gene expression matrix (taken from [102]), where rows indicate genes, columns different experiments, and the color indicates the expression level.

The advent of the microarray technology has created a need for algorithmic approaches that extract information from gene expression data and create compact, interpretable representations. Many computational tasks can be carried out to analyze this matrix: i) selection of differentially expressed or discriminative genes; ii) classification of samples; iii) clustering of genes or samples (e.g. to identify pathological subtypes); iv) biclustering, i.e. simultaneous clustering of genes and samples.
In the following, we briefly review each of these.

Gene selection
A typical gene expression matrix contains hundreds of experiments and thousands of genes. Depending on the task, gene selection techniques may represent an important class of preprocessing tools: by eliminating uninformative genes, such methods reduce the dimension of the problem space and alleviate the curse of dimensionality [66, 89]. Moreover, in this context such an operation may have a large impact from the biological/medical point of view, because it can help researchers identify a stable and informative set of biomarkers for cancer diagnosis, prognosis, and therapeutic targeting [89, 199]. Several approaches have been proposed in the literature, ranging from simple filters based on variance or entropy up to complex methods which consider labels and concepts such as redundancy and relevance. A comprehensive and recent review on gene selection can be found in [121].

Sample classification
The classification of samples is an important emerging clinical application of gene expression analysis, for example to distinguish different diseases according to different expression levels in normal and tumor cells. As introduced in the previous paragraph, the major challenge is perhaps the curse of dimensionality: there are very few samples in comparison to the number of genes analyzed, and many models may not generalize well to new data despite excellent performance on the training set [219]. Furthermore, many of the features are irrelevant or redundant for the problem under study, making gene selection (described in the previous paragraph) necessary even if an algorithm could handle the large quantity of data. Other approaches perform feature extraction, which aims at "summarizing" the numerous genes in the form of a small number of new components (often linear combinations of the original expression levels). Some examples are Partial Least Squares (PLS) [34, 35], generalized Partial Least Squares [79], and Independent Component Analysis (ICA) [217]. After dimension reduction, one can apply any classification method to the constructed components. Several reviews on the subject are available [36, 67, 111, 122, 144, 219]. Perhaps the most popular classification algorithm is the Support Vector Machine (SVM), which is suitable for classifying high-dimensional data without suffering too much from the curse of dimensionality, and whose good performance in the gene expression scenario has been demonstrated in many studies [56, 83, 112, 219, 233].

Clustering
In this task, the goal is to subdivide a set of genes or samples in such a way that genes (samples) with similar expression across samples (genes) fall into the same cluster, whereas dissimilar items fall into different clusters. Beyond simple visualization, when clustering genes the goal is usually to identify functionally similar genes, or to infer a functional role for unknown genes belonging to the same cluster [84]. Clustering samples may instead be employed to identify biologically relevant structure in large data sets: in this context, the most employed analysis is perhaps hierarchical clustering, which allows the inference of possible group structure within the data [194] and has thus been used to indicate tentative new subtypes for some cancers [5]. Two reviews on the subject can be found in [110, 218].
Biclustering
In the clustering context, a recent trend is represented by the application of biclustering methodologies, namely clustering techniques able to simultaneously group genes and samples [143, 183]; a bicluster may be defined as a subset of genes that show similar activity patterns in a specific subset of samples. This kind of analysis may have a clear biological impact in the gene expression scenario, where a bicluster may be associated with a biological process that is active only in some samples and may involve only a subset of genes. Different approaches for biclustering gene expression data have been presented in the literature, each one characterized by different features, such as computational complexity, effectiveness, interpretability, optimization criterion, and others (some reviews can be found in [143, 183, 224]). Generally speaking, most biclustering methodologies have been obtained by adapting, tailoring, and opportunely combining existing clustering techniques [22, 74, 86, 248].

When computationally analyzing a gene expression matrix, it is important to notice that most approaches do not consider that expression levels are essentially counts of mRNA molecules (gene products), even if these counts are very difficult to measure directly. This consideration motivates the use of a bag of words approach, where a column of the gene expression matrix is interpreted as a numerical vector counting how many times each word/gene occurs in the sample. Another motivation is that genes and mRNA molecules in the cell have a "bag" structure: there is no obvious ordering of the genes in this context, therefore a bag of words representation does not destroy the structure of the objects (samples). Finally, there exist probabilistic models for bags of words that provided state-of-the-art classification and clustering accuracy, as well as highly interpretable solutions. In particular, in Chap. 1 we introduced topic models, probabilistic models aimed at representing the various topics that a corpus of documents is speaking about. For these reasons, the bag of words representation and its related models (in particular, topic models) appear to be a convenient tool for the gene expression data analysis problem.

3.3 Contributions

In this part of the thesis we report the contributions achieved in the field of gene expression data analysis. In particular, after postulating the bag of words representation for gene expression samples, we investigated the capabilities of topic models (never investigated before in the gene expression domain) in mining information from such data, employing them for a variety of tasks. The topics introduced by topic models capture a fundamental piece of information in this context, namely co-occurrent patterns of genes: they are latent modules that assign high probability to genes that tend to be highly co-expressed. We faced the sample classification task by extracting topic-model-derived feature vectors to be used in a discriminative setting with support vector machines. This results in a hybrid generative-discriminative scheme [99], where surrogates or by-products of the generative model learning are injected as features into a discriminative classifier.
The proposed approach has been extensively tested on 10 different benchmark data sets (in other works, it is common to evaluate only 2-3), employing several different topic models, different ways of extracting feature vectors from the trained topic models, different classification schemes, and different kernels. The obtained results, when compared to the state of the art, confirm the suitability of this bag of words approach for the classification of gene expression data. Moreover, considerations on the interpretability of the obtained feature descriptors are provided, with the use of a real dataset involving different species of grape plants. This work is described in Chapter 4 and has been published in [24].

Then, we made one step forward along the direction of modeling gene expression with topic models. In particular, we employed a more recent and sophisticated topic model, called the Counting Grid [104, 175], to mine and extract an informative representation for a set of expression samples. The main motivation is that most topic models, even if they represent a proper choice, have a clear drawback: they assume that topics act independently of each other. While this assumption is often needed to simplify computations and inference, it may be too impoverishing in the gene expression scenario, where it is known that biological processes are tightly co-regulated and interdependent in a complex way [126]. The Counting Grid model copes with this limitation: the idea behind the model is that topics are arranged in a discrete grid, learned in such a way that "similar" topics are closely arranged. Similar biological samples, i.e. samples sharing some topics and active genes, will be mapped close together on the grid, allowing for an intuitive visualization of the dataset. We made a comprehensive evaluation of the model in the gene expression scenario by i) visualizing, on four different datasets, how samples are embedded and clustered together on the grid, naturally separating between classes; ii) validating this claim numerically, also performing a thorough evaluation of the sensitivity to the parameters; iii) proposing a novel methodology to highlight and automatically select genes particularly involved in the pathology or in the phenomenon of interest; iv) demonstrating that the model achieves state-of-the-art results in classification tasks.

4 Gene expression classification using topic models

This chapter describes how to employ the bag of words representation and topic models for the classification of gene expression data. First, motivations and considerations on the suitability of the proposed approach are presented, together with a brief review of the models employed. Then, a classification scheme is proposed, based on highly interpretable features extracted from topic models. An extensive experimental evaluation, involving ten different literature benchmarks, demonstrates the suitability of topic models for this classification task. Finally, we performed a qualitative analysis on a dataset of grapevine expression, which confirms the high interpretability of the proposed approach.

4.1 Topic models and gene expression

As already introduced, the basic idea underlying topic models is that each document may be characterized by the presence of one or more topics (e.g. sport, finance, politics), which induce the presence of some particular words. From a probabilistic point of view, the document may be seen as a mixture of topics.
The representation of documents and words with topic models has one clear advantage: each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms. This may be really advantageous in the gene expression context, since the final goal is to provide knowledge about biological systems and to highlight possible hidden correlations.

As largely detailed in the previous chapters, the novel application of topic models in the gene expression scenario starts from the analogy that can be drawn between the pair word-document and the pair gene-sample: it is reasonable to treat samples as documents and genes as words. In fact, each sample is characterized by a vector of gene expressions: the expression level of a gene in a sample may be easily interpreted as the count of a word in a document (the higher the level, the more present the gene/word is in the sample/document). This permits us to consider the expression matrix as a bag of words matrix, thus opening the possibility of exploiting all the tools developed for the bag of words representation. At this point it is important to notice that, contrarily to many other bag of words applications where the word order is lost, expression levels have a natural "bag" structure: there is no obvious ordering of the genes in this picture, therefore the bag of words representation does not alter the underlying structure of the samples.

Usually, topic models take as input a set of documents, each one containing a set of words. The documents are summarized by an occurrence matrix, where each entry indicates the number of occurrences of a given word in a given document. In the same way, in the gene expression scenario the input is a set of T samples, summarized by an expression matrix n(g_n, s^t), which measures the expression level of the gene g_n in the sample s^t. The dictionary has size N, namely we have N different genes appearing in the sample set, and the dictionary indexes these genes. The simplest model employed in this chapter is Probabilistic Latent Semantic Analysis (PLSA [94]). Even if this model has been introduced in the text analysis community, in the next section we re-formulate its theory in order to deal with the gene expression scenario, assuming the analogy gene/word, sample/document, and expression level/word count.

Fig. 4.1. Intuitive representation of the PLSA model for document analysis: a document mostly talking about economics draws most of its words (e.g. 80%) from the 'economics' topic and the remaining ones from other topics.

4.1.1 Probabilistic Latent Semantic Analysis (PLSA)

In PLSA, the presence of a gene g_n in the sample s^t is mediated by a latent topic variable z ∈ Z = {z_1, ..., z_K}, also called aspect class. Intuitively, a topic may represent a biological process, which is active only in a subset of samples and is characterized by high expression levels of a subset of the genes. The joint probability of the observed variables is

p(g_n, s^t) = \sum_k p(g_n, z_k, s^t) = \sum_k p(s^t) \cdot p(g_n|z_k) \cdot p(z_k|s^t)    (4.1)

In other words, the topic z_k is a probabilistic co-occurrence of genes, encoded by the distribution β_{z_k}(g) = p(g|z_k), g ∈ {g_1, ..., g_N}; p(z_k|s^t), k = 1, ..., K, represents the proportions of the topics in the sample s^t; finally, p(s^t) accounts for the global expression level of the sample s^t (in the document scenario, this accounts for documents of different lengths).

An intuitive visualization of the PLSA model is depicted in Fig. 4.1, whereas the Bayesian network representation is shown in Fig. 4.2.

Fig. 4.2. Bayesian network representation of the PLSA model.

From the latter, the generative process for a sample s^t can be derived as follows. First, a topic z_k is drawn from the distribution p(z|s^t): a topic particularly present in sample s^t will more likely be selected. Then, a gene g_n is drawn from the distribution p(g|z_k), which is conditioned on the value assumed by z_k. Finally, the process is repeated, selecting another topic and another gene, until the whole sample is generated.
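The generative process just described can be illustrated with the following sketch, which samples a synthetic bag of words from given distributions p(z|s) and p(g|z); the toy numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_plsa_counts(p_z_given_s, p_g_given_z, n_counts):
    """PLSA generative process for one sample: repeatedly pick a topic
    z ~ p(z|s), then a gene g ~ p(g|z), and accumulate the counts."""
    K, N = p_g_given_z.shape
    x = np.zeros(N)
    for _ in range(n_counts):
        z = rng.choice(K, p=p_z_given_s)
        g = rng.choice(N, p=p_g_given_z[z])
        x[g] += 1
    return x

# toy example: 2 "biological processes" over 5 genes
p_z_given_s = np.array([0.7, 0.3])
p_g_given_z = np.array([[0.40, 0.40, 0.10, 0.05, 0.05],
                        [0.05, 0.05, 0.10, 0.40, 0.40]])
x = sample_plsa_counts(p_z_given_s, p_g_given_z, n_counts=100)
```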
The hidden distributions of the model, p(g|z) and p(z|s), are learned using an exact Expectation-Maximization (EM) algorithm. The EM iteratively learns the model by minimizing a bound F on the log-likelihood L (i.e. on the probability of the visible variables), alternating the E- and M-steps (F is the free energy defined in Chap. 2, Eq. 2.22). In this context, the data log-likelihood L is

L = \sum_{n=1}^{N} \sum_{t=1}^{T} n(g_n, s^t) \cdot \log p(g_n, s^t)    (4.2)

For the E-step, p(z_k|s^t, g_n) can be obtained by looking at the Bayesian network structure:

q_k^{(n,t)} \triangleq p(z_k|g_n, s^t) = \frac{p(z_k|s^t)\, p(g_n|z_k)}{\sum_{k} p(z_k|s^t)\, p(g_n|z_k)}    (4.3)

With this notation, the free energy of the PLSA model can be written as

\mathcal{F} = \sum_{n,t} n(g_n, s^t) \left[ \sum_k q_k^{(n,t)} \ln q_k^{(n,t)} - \sum_k q_k^{(n,t)} \ln \big( p(s^t)\, p(z_k|s^t)\, p(g_n|z_k) \big) \right]    (4.4)

In the M-step, the minimum of the free energy is found by setting the various derivatives to zero. Three normalization constraints have to be accounted for, one for each hidden distribution; the free energy therefore has to be augmented with appropriate Lagrange multipliers τ_k, ρ_t, and φ, giving the following constraints:

τ_k \cdot \left( 1 - \sum_n p(g_n|z_k) \right) = 0    (4.5)

ρ_t \cdot \left( 1 - \sum_k p(z_k|s^t) \right) = 0    (4.6)

φ \cdot \left( 1 - \sum_t p(s^t) \right) = 0    (4.7)

After eliminating the Lagrange multipliers, the M-step re-estimation equations can be obtained. For example, deriving with respect to p(g_n|z_k) and setting the derivative equal to zero leads to

p(g_n|z_k) \cdot τ_k = \sum_t n(g_n, s^t)\, q_k^{(n,t)}

which, together with the constraint \sum_n p(g_n|z_k) = 1, gives τ_k = \sum_n \sum_t n(g_n, s^t)\, q_k^{(n,t)}. The final result, i.e. the M-step equation for p(g_n|z_k), is

p(g_n|z_k) = \frac{\sum_t n(g_n, s^t)\, q_k^{(n,t)}}{\sum_n \sum_t n(g_n, s^t)\, q_k^{(n,t)}}    (4.8)

In a similar way, one can obtain the estimates for p(z_k|s^t) and p(s^t). The other M-step updates are

p(z_k|s^t) = \frac{\sum_n n(g_n, s^t)\, q_k^{(n,t)}}{\sum_k \sum_n n(g_n, s^t)\, q_k^{(n,t)}}    (4.9)

p(s^t) = \frac{\sum_n n(g_n, s^t)}{\sum_t \sum_n n(g_n, s^t)}    (4.10)

The E-step and M-step equations are alternated until a termination condition is met, for example when the data log-likelihood changes very little across two consecutive iterations.
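A minimal implementation of these updates is sketched below (illustrative code, written for clarity rather than efficiency: the posterior q is stored as a dense array, which is impractical for genome-sized dictionaries).

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0):
    """Minimal PLSA EM on a count matrix n(g, s) of shape (N, T).
    Returns p(g|z) with shape (N, K) and p(z|s) with shape (K, T),
    implementing Eqs. 4.3, 4.8 and 4.9 (p(s) follows directly from the data)."""
    rng = np.random.default_rng(seed)
    N, T = counts.shape
    p_g_z = rng.dirichlet(np.ones(N), size=K).T              # (N, K)
    p_z_s = rng.dirichlet(np.ones(K), size=T).T              # (K, T)
    for _ in range(n_iter):
        # E-step (Eq. 4.3): q[n, t, k] = p(z_k | g_n, s^t)
        joint = p_g_z[:, None, :] * p_z_s.T[None, :, :]      # (N, T, K)
        q = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        weighted = counts[:, :, None] * q                    # n(g, s) * q
        # M-step (Eqs. 4.8 and 4.9)
        p_g_z = weighted.sum(axis=1)
        p_g_z /= p_g_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_s = weighted.sum(axis=0).T
        p_z_s /= p_z_s.sum(axis=0, keepdims=True) + 1e-12
    return p_g_z, p_z_s
```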
It is important to note that t is a dummy index into the list of documents in the training set; in other words, s^t is a random variable with as many possible values as there are training documents, and the model learns the topic mixtures p(z|s) only for the documents on which it has been trained. For this reason, PLSA cannot be considered a well-defined generative model of documents: there is no natural way to use it to assign a probability to a previously unseen document. However, once the model has been learned, one can still estimate the topic proportions of an unseen sample. This is achieved by applying the learning algorithm while keeping the previously learned parameters p(g_n|z_k) fixed and estimating p(z_k|s^t) for the sample at hand using Eq. 4.9 (a procedure often referred to as folding-in). As a final consideration, note that this is an admixture model [81]: one sample is made up of the contributions of different topics, each with its own proportion, contrarily to the mixture of unigrams model presented in Chap. 2. This is particularly appropriate in the gene expression scenario, where a topic can be intended as a biological process: it is reasonable that many processes are active at the same time (with different intensities) in a sample, and that these processes influence the measured expression levels.

4.2 The proposed approach

As explained in the previous section, given the analogy between the pair word-document and the pair gene-sample, we can in general associate the expression matrix n(g_n, s^t) (e.g. coming from a DNA microarray experiment) with the count matrix of topic models, to be explicitly or implicitly used to train the topic model. Note that the framework can be employed with any topic model, not just PLSA (in fact, in the experimental evaluation we also considered other topic models). In order to classify samples, we propose to exploit a hybrid generative-discriminative scheme [26, 99, 120, 173], where the generative and the discriminative paradigms are merged together. Recall that generative approaches, such as topic models, are based on probabilistic class models and a priori class probabilities, learned from training data and combined via Bayes' rule to yield posterior probabilities. On the contrary, discriminative learning methods aim at learning class boundaries or posterior class probabilities directly from the data, without relying on intermediate generative class models. Generative and discriminative classification schemes represent the two main directions for classifying data: each philosophy brings pros and cons, and the latest research frontier aims at fusing them, following heterogeneous recipes [173]. In the hybrid generative-discriminative scheme adopted here, the typical pipeline is to learn a generative model – suitable to properly describe the problem – and to use it to project every object into a feature space (the so-called generative embedding space), where a discriminative classifier may be trained. In particular, the approach we employ here is realized as follows.

Generative model training. Given the training set, the topic model is trained as explained in the previous section. Different schemes may be adopted to fit the best model (or set of models) to the data, namely by learning one model per class, one for the whole dataset, or others – an interesting analysis has been reported, for the Hidden Markov Model case, in [21]. Here we employ the basic one, namely training one single model for all classes.

Generative embedding. In this step, all the objects are projected, through the learned model, into a vector space. In this way, standard discriminative classifiers have been shown to achieve higher performance than a purely generative or a purely discriminative approach [99, 120, 173]. Many embeddings can be built on the generative model: a first and simple choice is to use the estimated topic posterior distribution.
The intuition is that, since every topic may be approximately associated with a biological process (or with a set of processes [22]), the topic distribution p(z|s^t) characterizing a sample may indicate which processes are active in that sample, and to what extent. This makes it a significant and possibly discriminant feature from a threefold perspective:

1. it provides a truly interpretable representation of the microarray experiments, in terms of biological processes, as shown in Section 4.4;
2. the dimensionality of the feature vector is reduced from the number of genes N to the number of topics K, with K ≪ N – thus providing a more compact and easy-to-manage representation;
3. finally, such descriptors represent multinomial distributions, which are suitable to be classified using kernels on probability measures (also called information theoretic kernels, detailed in the following section) – these kernels have been shown to be very effective in classification problems involving text, images, and other types of data (see [147] and the references therein); moreover, very recently, they have been shown to be very suitable for the hybrid generative-discriminative approach (see for example [25]).

Moreover, it is important to notice that the representation based on topic posteriors has already been successfully used for classification purposes in computer vision [33, 53] as well as in the medical informatics domain [42] (and this is confirmed by our experimental evaluation).

More sophisticated choices are of course possible: here we investigated a possible extension of the proposed approach by employing a more complex descriptor called FESS (Free Energy Score Space [174]). In the FESS, the embedding is obtained via the unique decomposition into addends of the free energy of the model. For PLSA the free energy has been defined in Eq. 4.4, but it can be rewritten as

\mathcal{F}(s^t) = \sum_n \sum_k n(g_n, s^t) \cdot p(z_k|g_n, s^t) \cdot \log p(z_k|g_n, s^t) - \sum_n \sum_k n(g_n, s^t) \cdot p(z_k|g_n, s^t) \cdot \log p(g_n, s^t, z_k)    (4.11)

where the first term is (minus) the entropy of the posterior distribution and the second term is a cross-entropy term. As visible in Eq. 4.11, both terms are composed of K × N addends, and their sum is equal to the free energy. The idea of FESS is to decompose the free energy F(s^t) into its addends: for PLSA this results in a space of dimension 2 × K × N, which we will refer to as FESS L3. In [174], the authors point out that, if the dimensionality is too high, some of the sums can be carried out to reduce the dimensionality of the vector. The choice of the addends over which to sum is intuitive but guided by the particular application: in our case, as previously done in [174, 178], we perform the sums over the gene indices, keeping the per-topic contributions. The resulting score space has dimension 2 × K; we will refer to it as FESS L2. In a few words, the FESS expresses how well each data point (i.e. sample) fits different parts of a trained generative model. The FESS has been found to be highly informative for discriminative learning, yielding state-of-the-art results in several contexts [172, 174]; however, its suitability in the gene expression context has never been investigated.

Discriminative classification. In the resulting generative embedding space, any discriminative vector-based classifier may be employed. In this fashion, according to the generative/discriminative classification paradigm, the information coming from the generative process is used as a set of discriminative features for a discriminative classifier. As almost always in hybrid generative-discriminative schemes, Support Vector Machines (SVM [52]) are employed in the resulting generative embedding space.
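A compact sketch of the whole pipeline – PLSA trained on the training counts, folding-in of the test samples, topic posteriors used as features of a linear SVM – is given below. It reuses the plsa_em function sketched earlier, the toy data are random placeholders standing in for a real expression matrix, and the choice of scikit-learn's SVC is just one convenient option.

```python
import numpy as np
from sklearn.svm import SVC

def fold_in(counts_new, p_g_z, K, n_iter=50):
    """Estimate p(z|s) for unseen samples, keeping p(g|z) fixed (folding-in)."""
    N, T = counts_new.shape
    p_z_s = np.full((K, T), 1.0 / K)
    for _ in range(n_iter):
        joint = p_g_z[:, None, :] * p_z_s.T[None, :, :]
        q = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)        # Eq. 4.3
        p_z_s = (counts_new[:, :, None] * q).sum(axis=0).T            # Eq. 4.9
        p_z_s /= p_z_s.sum(axis=0, keepdims=True) + 1e-12
    return p_z_s

# toy data standing in for a real expression matrix (genes x samples)
rng = np.random.default_rng(0)
n_classes, N, T = 2, 100, 20
train_counts = rng.poisson(5.0, size=(N, T))
y_train = rng.integers(0, n_classes, size=T)
test_counts = rng.poisson(5.0, size=(N, 5))

K = 6                                                  # number of topics (free parameter)
p_g_z, p_z_s_train = plsa_em(train_counts, K)          # model fitted on the training set only
p_z_s_test = fold_in(test_counts, p_g_z, K)
clf = SVC(kernel="linear", C=1.0).fit(p_z_s_train.T, y_train)
y_pred = clf.predict(p_z_s_test.T)
```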
4.3 Experimental evaluation

The suitability of the proposed classification scheme has been extensively tested on ten well-known datasets, briefly summarized in Table 4.1 (in the literature, usually only 2-3 datasets are employed). As in many gene expression analyses, a beneficial effect may be obtained by selecting a subgroup of genes, in order to limit the dimensionality of the problem and to reduce the possible redundancy present in the dataset. Here we employed the Minimum-Redundancy Maximum-Relevance feature selection approach [63, 170]¹. In order to have a fair comparison with the state of the art, for every dataset we selected the best result in the literature (at least to the best of our knowledge) – they are reported in Table 4.1 – and then used, in our experiments, the same number of genes used in that paper (when specified); when not specified, we retained 500 genes (as in several other papers [23, 177, 194]). For similar reasons, also the cross-validation protocols – again reported in Table 4.1 – have been chosen by following the corresponding state-of-the-art papers.

In the learning phase, the PLSA model has been built only on the training set. Since the training procedure can converge to local optima of the likelihood, the training has been repeated 20 times, starting from different random initializations and retaining the model with the highest data likelihood. The number of topics is a free parameter in topic models, and has to be set in advance. Different automatic techniques have been proposed in the literature to set this number, ranging from hold-out likelihood [194] to cross-validation, from a priori knowledge to probabilistic model selection criteria – e.g. the Bayesian Information Criterion (BIC [205]). Here we adopted a very simple scheme: starting from the observation that topic models are adequate at finding clusters (they were designed as clustering techniques), we deemed it reasonable to fix the number of topics proportionally to the number of classes (after a few trials, we found that three times the number of classes was a reasonable choice). Despite the simplicity of this rule, the obtained results were very satisfactory; a small illustrative sketch of these model selection strategies is given after Table 4.1, and an analysis of the performance of PLSA with respect to this parameter is discussed in the next section.

Table 4.1. Summary of the employed datasets. N represents the number of genes, T the number of samples, and C the number of classes.

Dataset    |   N   |  T  |  C | Citation | Test protocol
leuk2      | 11225 |  72 |  3 | [10]     | 5-fold CV
leuk1      |  5327 |  72 |  3 | [87]     | 10-fold CV
11tumors   | 12533 | 174 | 11 | [222]    | 5-fold CV
colon      |  2000 |  62 |  2 | [7]      | LOO CV
brain1     |  5920 |  90 |  5 | [182]    | 4-fold CV
brain2     | 10367 |  50 |  4 | [161]    | 10-fold CV
lung       | 12600 | 203 |  5 | [19]     | 5-fold CV
nci60      |  7129 |  60 |  9 | [196]    | 10-fold CV
prostate   | 10509 | 102 |  2 | [210]    | LOO CV
9tumors    |  5726 |  60 |  9 | [220]    | 10-fold CV

¹ http://www.mathworks.com/matlabcentral/fileexchange/14916
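As a purely illustrative aid, the following sketch combines the two model selection ingredients mentioned above: repeated restarts retaining the most likely model, and a BIC-style score over candidate numbers of topics. It reuses the plsa_em function sketched earlier; the log-likelihood evaluation and the parameter count used in the penalty are simplifying choices, given only as an indication.

```python
import numpy as np

def log_likelihood(counts, p_g_z, p_z_s):
    """Data log-likelihood of Eq. 4.2, up to the constant p(s) term."""
    p_g_s = p_g_z @ p_z_s                       # p(g|s) = sum_k p(g|z_k) p(z_k|s)
    return float((counts * np.log(p_g_s + 1e-12)).sum())

def best_of_restarts(counts, K, restarts=20):
    """Repeat training from random initializations, keep the most likely model."""
    models = [plsa_em(counts, K, seed=r) for r in range(restarts)]
    return max(models, key=lambda m: log_likelihood(counts, *m))

def bic_score(counts, K):
    """BIC-style score: -2 log L plus a penalty on the number of free parameters
    (the parameter count and the use of the total count as sample size are
    simplifying assumptions of this sketch)."""
    p_g_z, p_z_s = best_of_restarts(counts, K)
    N, T = counts.shape
    n_params = K * (N - 1) + T * (K - 1)
    return -2 * log_likelihood(counts, p_g_z, p_z_s) + n_params * np.log(counts.sum())

# the rule actually adopted in the experiments is simply K = 3 * number_of_classes
```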
A final note on the training of the PLSA model: in some cases the raw expression matrix contained negative values and could not be used as-is as the count matrix of topic models (which requires non-negative values); therefore, a simple shift has been applied to the matrix in order to obtain positive values.

Table 4.2. Classification errors on the ten datasets of Table 4.1: per-dataset errors of our best configuration and of the two alternative topic model schemes (Bayesian and supervised). The complete comparison also includes PLSA and LPD with the linear (Lin), Jensen-Shannon (JS) and Jensen-Tsallis (JT) kernels, the two FESS variants (L2 and L3), and the best state-of-the-art results with their references.

Method        | Leuk2  | Leuk1  | 11Tumors | Colon  | Brain1 | Brain2 | Lung   | NCI60  | Prostate | 9Tumors
Our Best      | 0.0267 | 0.0143 | 0.0457   | 0.0833 | 0.0761 | 0.0778 | 0.0397 | 0.0963 | 0.0392   | 0.0516
Bayesian      | 0.0297 | 0.0143 | 0.0847   | 0.1452 | 0.0863 | 0.2800 | 0.0542 | 0.3433 | 0.0882   | 0.2933
Supervised TM | 0.0810 | 0.0833 | 0.5866   | 0.0806 | 0.3318 | 0.3200 | 0.1541 | 0.6733 | 0.0588   | 0.5900

The classification accuracies have been computed, as almost always in hybrid generative-discriminative schemes, using Support Vector Machines in the resulting generative embedding space – the parameter C has been selected using cross-validation on the training set. As already discussed, rather than using only the standard linear kernel, we exploited the probabilistic nature of the feature vector through different kernels on measures (also called information theoretic kernels [147]), which provide a similarity between probability distributions. It has been shown in other contexts (see for example [25]) that such a combination may be beneficial for some hybrid generative-discriminative methods. In particular, here we employ the standard Jensen-Shannon (JS) kernel, based on the Jensen-Shannon divergence between two distributions p_1 and p_2:

JS(p_1, p_2) = H\!\left( \frac{p_1 + p_2}{2} \right) - \frac{H(p_1) + H(p_2)}{2}    (4.12)

where H is the Shannon entropy. We also employed a more recent kernel, introduced in [147], which is based on a non-extensive generalization of the classical Shannon information theory and is defined on (possibly unnormalized) probability measures: the Jensen-Tsallis (JT) kernel, defined as

K_q^{JT}(p_1, p_2) = \ln_q(2) - T_q(p_1, p_2)    (4.13)

where \ln_q(x) = (x^{1-q} - 1)/(1 - q) is the q-logarithm,

T_q(p_1, p_2) = S_q\!\left( \frac{p_1 + p_2}{2} \right) - \frac{S_q(p_1) + S_q(p_2)}{2^q}    (4.14)

is the Jensen-Tsallis q-difference, and S_q(r) is the Tsallis non-extensive entropy, defined, for a multinomial distribution r = (r_1, ..., r_W), as

S_q(r_1, ..., r_W) = \frac{1}{q - 1}\left( 1 - \sum_{i=1}^{W} r_i^q \right)    (4.15)

The parameter q has been adjusted by cross-validation on the training set. As for the FESS, after extracting the descriptors we used the SVM with a linear kernel.
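These kernels are straightforward to implement; the sketch below follows Eqs. 4.12-4.15 directly. The value q = 1.5 is only an example (in the experiments q is selected by cross-validation), and the Jensen-Shannon divergence still has to be turned into a similarity (e.g. by subtracting it from ln 2, as in [147]) before being assembled into a precomputed Gram matrix for the SVM.

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(np.where(p > 0, p * np.log(p), 0.0))

def js_divergence(p1, p2):
    """Jensen-Shannon divergence of Eq. 4.12."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return shannon_entropy((p1 + p2) / 2) - (shannon_entropy(p1) + shannon_entropy(p2)) / 2

def tsallis_entropy(p, q):
    """Tsallis non-extensive entropy S_q of Eq. 4.15."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def jt_kernel(p1, p2, q=1.5):
    """Jensen-Tsallis kernel of Eqs. 4.13-4.14 (q tuned by cross-validation)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    t_q = tsallis_entropy((p1 + p2) / 2, q) \
        - (tsallis_entropy(p1, q) + tsallis_entropy(p2, q)) / (2 ** q)
    ln_q_2 = (2 ** (1 - q) - 1) / (1 - q)
    return ln_q_2 - t_q
```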
Another interesting point of analysis is related to the different possible ways in which topic models can be exploited for classification. Alternatives to our staged scheme exist: in particular, we compared our approach with a simple Bayesian scheme – which trains one model per class and performs classification with the Bayes rule – and with the supervised topic models approach [148], which explicitly takes the labels into account during training².

² The code can be found at http://cran.r-project.org/web/packages/lda/

Finally, we compared our classification results with those obtained using as topic model the Latent Process Decomposition (LPD [194]), a generative model explicitly proposed for the microarray scenario to cluster genes. The LPD is inspired by topic models; however, whereas in PLSA a word is generated by a multinomial distribution p(g_n|z_k), in LPD the word-topic probability is modeled by a single Gaussian with parameters (μ_{g_n,z_k}, σ_{g_n,z_k}), thus reflecting the continuous nature of the expression level.

All the obtained results are reported in Table 4.2, together with the state-of-the-art results. "Lin", "JS", and "JT" stand for the linear, Jensen-Shannon, and Jensen-Tsallis kernels, respectively; "FESS L2" and "FESS L3" are the two variants of the FESS approach introduced before.

4.3.1 Discussion

As a general comment, the table shows that descriptors extracted from topic models are really effective for expression microarray classification. When compared with the literature, our results are in line with the state of the art; moreover, in three cases (Brain1, Brain2, and 9Tumors), our best result is substantially better than the state of the art. It is important to notice, at this point, that we compared our results (obtained within a single framework) with results obtained with many different techniques on different datasets, each technique possibly tailored for the specific dataset (and the datasets are very different in terms of composition and difficulty – see Table 4.1).

Some more specific observations can be drawn from the table. In particular, by looking at the behavior of the different kernels, we notice that a beneficial effect is obtained when the probabilistic nature of the feature vector is exploited through the information theoretic kernels. When comparing PLSA with LPD, it seems that on average there is not a big difference in terms of accuracy, with some datasets slightly preferring PLSA. A possible explanation may be found in the sensitivity of the LPD model to the choice of the number of topics. To investigate this behavior, we performed an exhaustive analysis on the Leuk2 dataset, varying this number from 3 to 30 (in steps of 3). Figure 4.3 displays the resulting error curves, obtained with the linear kernel (left) and the Jensen-Tsallis kernel (right).

Fig. 4.3. Error curves on the Leuk2 dataset obtained by varying the number of topics, for PLSA and LPD with the linear (left) and Jensen-Tsallis (right) kernels.

It is evident from the plots that the accuracies of PLSA do not vary much when the number of topics changes, whereas LPD is more sensitive to this choice (when the number of topics is properly chosen, LPD outperforms PLSA). This is true both for the linear and for the JT kernel.
The accuracy and the possible margin of extensibility of the proposed approach become evident when looking at the results obtained with FESS. It turned out that when the topic proportion descriptor is not discriminative enough (see for example NCI60 and 9Tumors), the FESS signature permits to unravel such complexity, leading to excellent results (on the contrary, when the topic proportion feature vector works well, only a marginal improvement is obtained by using FESS). Finally, by comparing the different ways of exploiting topic models for classification (our approach, the Bayesian scheme, and the supervised topic models method), it seems evident that in problems with few classes a supervised topic model is a good choice, leading to very good results, whereas when the number of classes increases the per-class separation of the training set made by the Bayesian scheme is more appropriate. In general, nevertheless, our hybrid approach performs better, confirming the fact, shown in many other contexts, that this scheme is able to exploit the complementarity of the generative and the discriminative paradigms: generative models, better suited to describe the data, are used to derive features, which are then classified by discriminative techniques, better suited to find decision boundaries.

4.4 The interpretability of the feature vector

In this section we would like to demonstrate that the extracted p(z|s) vectors of PLSA are highly interpretable. In particular, p(z|s) characterizes "how present" every topic is in a given sample, and we already posited that a topic may be easily associated with a biological process: by definition, a topic characterizes a subset of samples in which the gene expressions are highly correlated. Therefore, p(z|s) may be used to infer the different biological processes which are active over the different samples. It should be noted that the probability of the genes given the topic may also be very useful: it may be interpreted as the impact of the different genes on a particular biological process. Moreover, the probabilistic nature of these models permits encoding also the level of this impact, thus taking into account the well-known fact that not all biological processes are taking place in every sample.

To show these characteristics, we applied the proposed scheme to a real dataset, in a study conducted in collaboration with the Functional Genomics Lab at the University of Verona. The dataset includes 48 samples (and 24676 genes) of microarray expressions from two grapevine species, V. vinifera and V. riparia, both subjected to infection with Plasmopara viticola, a pathogen responsible for a destructive disease. It is known that V. riparia is resistant to the pathogen, while V. vinifera is more susceptible to infection, and the study focused on understanding the molecular switches, signals and effectors involved in resistance [181]. The original paper reported a microarray analysis of early transcriptional changes associated with P. viticola infection in both the susceptible Vitis vinifera and the resistant Vitis riparia plants (12 and 24 hours post inoculation). The same experiments have been conducted with the plants treated with water, a neutral agent used as control. We chose this dataset since it is very complex and structured, and different classes can be highlighted: in particular, samples can be divided on the basis of the type of plant (V. vinifera or V. riparia), of the time point (12 or 24 hours), or of the pathogen/water treatment.
In the training phase, we employed the Bayesian Information Criterion (BIC [205]) to obtain a rough estimate of the best number of topics to set. In very few words, BIC defines a term that penalizes the likelihood of the model depending on the number of its free parameters; in this way, larger models – which do not lead to a substantial increase of the likelihood – are discouraged. A PLSA model was trained 50 times and the best model (in a likelihood sense) was retained. Using BIC, we found that the best values ranged in the interval [5, 8]. Guided by the expertise of biologists, the number of topics has been set to 6. Then information has been extracted from the topic/document and word/topic distributions. In particular, in figure 4.4 we report on the left an intuitive bar-plot of the probability p(z|d) (different rows correspond to different topics z), while the figure on the right represents the functional categories, as analyzed by the biologists, of the most important genes (found by looking at p(g|z)).

Fig. 4.4. PLSA analysis. (a) Bar representation of the p(z|s) distribution for each of the 6 topics. The main classes are represented on the bottom of the figure. (b) Functional category distribution of topic-specific genes.

Studying the composition of the dataset, we observed that it is rather accurately reflected by the p(z|d) distribution (on the left of the figure). Actually every topic can reflect a different aspect of the dataset. For example, some topics show groups of samples which are more correlated with the effects of treatment at the different time points rather than with a specific reaction to the pathogen in comparison with the control (water). This is evident in the 3rd and 4th topics, which represent V. vinifera after 12 hours and 24 hours respectively, the former without pathogen inoculation and the latter infected. The last topic captures the processes of V. riparia 12 hours after the infiltration, in the first case with water, in the second with the pathogen. From the specific disease resistance point of view, the analysis confirmed the tendency of a specific response in V. riparia. In fact, the 1st topic deals with samples related to infected V. riparia leaves at both time points (12 and 24 hours after infection). By looking at the genes which are most active in the 1st topic, biologists found that their distribution is particularly significant. In fact, important functional categories among the involved genes (listed on the right side of figure 4.4) are carbohydrate metabolism and transport, in contrast with a strong contribution of photosynthesis-related gene expression in other topics. As previously reported, primary metabolic reprogramming underlies defense in biotrophic interactions, potentially supplying both energy and precursors to implement a defense mode. It is also worth noting that, within topic 1, the same trend of the last 12 experiments is visible on the classes of V. vinifera subjected to inoculation (samples 12-24). This means that an activation of some genes – possibly involved in the response to the pathogen – is under way, but the response is too weak, explaining the susceptibility of the plant to P. viticola.
Concluding, all these observations qualitatively confirm the capability of the proposed descriptors to encode different aspects of the dataset.

5 The Counting Grid model for gene expression data analysis

In the previous section, we showed that topic models introduce an interesting intermediate level of representation, based on the concept of topic, which essentially expresses a co-occurrent pattern of genes: topics are latent modules that assign high probability to genes that tend to be highly co-expressed. However, a common assumption of most topic models is that these modules act independently of each other. While this assumption is often needed to simplify computations and inference, it may be too simplistic in the gene expression scenario, where it is known that biological processes are tightly co-regulated and interdependent in a complex way. In this chapter we make a step forward – pursuing the bag of words and topic model philosophy while coping with the afore-described limitation – presenting a novel strategy to extract an informative representation for a set of experimental samples through a generative model called Counting Grid (CG [104]). The Counting Grid is a model for objects represented as bags of words, recently introduced for text mining [104] and image processing [175]. The key idea of topic models is still present: a document is abstracted into an intermediate representation of "topics", which are probability distributions over words that pick out a coherent cluster of correlated terms. However, here topics are arranged on a discrete grid, learned in such a way that "similar" topics are closely arranged. Fig. 5.1 pictures this idea and compares it with the PLSA model. Similar biological samples, i.e. samples sharing some topics and active genes, will be mapped close on the grid, allowing for an intuitive visualization of the dataset. More specifically, the CG seems to be very suitable in the gene expression scenario for the following reasons:

• The CG provides a powerful representation – successful in other fields [104, 171, 175] – which permits capturing the evolution of patterns in the experiments, which can be clearly visualized.
• The CG is well suited for data that exhibit smooth variation between samples. Expression values are biologically constrained to lie within certain bounds by purifying selection [106], and variation in only a few expression values can cause a pathology. This specific property of the data is captured well by the model.
• Last, but not least, it is possible (preliminarily investigated in [177]) to achieve a better classification accuracy with respect to other probabilistic approaches as well as to the recent state of the art.

Fig. 5.1. In the PLSA model, a document is an admixture of independent topics. In the example, document st is composed of the topics 'economics' and 'politics' with a 0.8:0.2 proportion. In the CG model, neighboring topics are similar, and a document is generated from one window in the grid. Traveling in any direction on the grid leads to a smooth topic transition.

In this chapter we make a comprehensive evaluation of the CG model in the gene expression scenario by providing the following main contributions.
1. By testing and visualizing different datasets, we show that samples belonging to different biological conditions (such as different types of cancer) cluster together on the grid.
2. We prove that the model is able to select genes that are involved in the pathology or in the phenomenon which motivated the experiment, deriving a principled and founded way to extract the most important genes.
3. We show that the model achieves state-of-the-art results for classification tasks.
4. We evaluate the sensitivity of the model to parameters such as grid and window size, and the robustness of the model to overfitting.

Fig. 5.2. (a) The Counting Grid model. Closely mapped samples s1 and s2 share some topics, as they have a common subset of particularly active genes. (b) Bayesian network representation for the Counting Grid model.

Before detailing how these goals are achieved, in the following we will review the Counting Grid model.

5.1 The Counting Grid model

We have shown in the previous chapter that, from a set of samples $s_t$, $t = 1, \ldots, T$, the PLSA topic model learns a small number of topics which correlate related genes particularly active in a subset of samples. However, there are no strong constraints on how topics are mixed, because they are assumed to be statistically independent. In the Counting Grid model, these distributions representing topics are arranged on a discrete grid. Formally, the Counting Grid $\pi_{i,n}$ is a D-dimensional discrete grid indexed by $i = (i_1, \ldots, i_D)$, where each $i_d \in [1 \ldots E_d]$ and $E = (E_1, \ldots, E_D)$ describes the extent of the counting grid. Each cell represents a tight distribution over genes (indexed by n), so $\sum_n \pi_{i,n} = 1$ everywhere on the grid. A given sample $s_t$, represented by expression values $\{g_n^t\}$, is assumed to follow a distribution found in a window somewhere in the counting grid. In particular, using windows of dimensions $W = [W_1, \ldots, W_D]$, each bag can be generated by first averaging all expression levels in the hypercube window $W_k = [k \ldots k + W]$, starting from the location k (upper-left corner of the window) and extending in each direction d by $W_d$ grid positions, to form the histogram $h_{k,n} = \frac{1}{\prod_d W_d} \sum_{i \in W_k} \pi_{i,n}$, and then generating the bag of genes from such an averaged histogram. In other words, the position (upper-left corner) of the window k in the grid is a latent variable given which the probability of the bag of genes $\{g_n^t\}$ for sample $s_t$ is

$p(\{g_n^t\} \mid k) = \prod_n (h_{k,n})^{g_n^t} = \prod_n \Big( \frac{1}{\prod_d W_d} \sum_{i \in W_k} \pi_{i,n} \Big)^{g_n^t}$

Relaxing the terminology, we will refer to E and W respectively as the counting grid size and the window size, indicating with $W_k$ the particular window placed at location k. We will refer to the ratio of the grid volume to the window volume, κ, as the capacity of the model in terms of an equivalent number of topics, as this represents how many non-overlapping windows can be fit onto the grid. An example of a 2D grid is depicted in figure 5.2 on the left; on the right, the Bayesian network for the model is depicted. To learn a Counting Grid, we need to maximize the log likelihood of the data:

$\log P = \sum_{t=1}^{T} \log \sum_k \prod_n (h_{k,n})^{g_n^t} \qquad (5.1)$

The sum over the latent variables k makes it difficult to perform assignment to the latent variables while also estimating the model parameters. The problem is solved by employing an EM procedure [81] which, as explained in Chap. 2, iteratively learns the model by minimizing the free energy F, alternating the E- and M-steps.
In particular, for the CG model the free energy F is equal to

$\mathcal{L} \ge F = -\sum_t \sum_k q_k^t \log q_k^t + \sum_t \sum_k q_k^t \sum_n g_n^t \log \sum_{i \in W_k} \pi_{i,n} \qquad (5.2)$

where $q_k^t = P(k \mid s_t)$ is the posterior distribution over the latent mapping of the t-th sample onto the counting grid, and $\mathcal{L}$ is the data log likelihood. The E-step estimates q, aligning all bags to grid windows so as to match the bags' histograms. The M-step re-estimates the counting grid π given the current q. To avoid local minima, it is important to consider the counting grid as a torus, and to perform all windowing operations accordingly.

5.2 Class embedding and biomarker identification

In this section the novel methodology aimed at extracting the most relevant and discriminant genes is presented. The starting point is the process which we called label embedding. As is often the case with gene expression, samples have an associated label that reflects, for example, a pathological subtype or a different tissue of the organism. Suppose that for each sample we are given a label $y^t = l$, $l = [1, \ldots, L]$, representing its class index. Once a Counting Grid is learned and each sample is located on the grid (by looking at $q_k^t$), it is possible to obtain a posterior probability of each class $p(l \mid i) = \gamma_l(i)$ at each position i: this indicates which regions of the CG better "explain" the class labeled by l. This is achieved by using the posterior probabilities $q_k^t$ already inferred:

$\gamma_l(i) = \frac{\sum_t \sum_{k \mid i \in W_k} q_k^t \cdot [y^t = l]}{\sum_t \sum_{k \mid i \in W_k} q_k^t} \qquad (5.3)$

where [·] is the indicator function, which indicates membership of an element in the class: its output is 1 if sample $s_t$ belongs to class l, and 0 otherwise.

Fig. 5.3. (a) Label embedding γl(i). (b) Gradient of the embedding. (c) Counting grid for a particular gene (πn) and its gradient. (d) Fn,i.

Roughly speaking, the main idea is to "average" all the mappings of the training samples belonging to a given class. If the CG is able to capture the underlying behavior of a specific class, then only a part of this averaged map will be different from zero, possibly in a small, spatially coherent region – the region which more likely "explains" the training patterns of that class. In order to clarify this concept, in Fig. 5.3(a) we show the label embedding for the prostate cancer dataset [210], which comprises two classes; in the figure the tumoral class is embedded. Please observe that the active (non-zero) locations are all grouped in spatially coherent zones of the averaged map. Therefore, even if the labels are not used during the learning of the CG, tumoral and non-tumoral samples are naturally separated (since we are in a two-class problem, the embedding of the non-tumoral class is simply obtained by reversing this image); this suggests that the CG is indeed suitable to describe the latent structure which generates the data. As a second step, we compute the gradient of the embedding, $\nabla \gamma_i$, which returns information about where and how the classes separate – see Fig. 5.3(b). In this case the idea is to find the regions in the CG where the first class "translates" to the second class or vice versa.
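To fix ideas, the listing below is a minimal sketch of the label embedding of Eq. (5.3) and of its gradient, assuming a 2D grid stored as numpy arrays; the posteriors q and the window size are taken as given, and the toroidal wrap-around is handled with np.roll. It is an illustration of the idea, not the code used in the experiments.

    import numpy as np

    def smear(q, W):
        # For every grid cell i, sum the posterior over all window corners k
        # whose window covers i (toroidal wrap-around): sum_{k : i in W_k} q[k].
        out = np.zeros_like(q)
        for dx in range(W[0]):
            for dy in range(W[1]):
                out += np.roll(np.roll(q, dx, axis=0), dy, axis=1)
        return out

    def label_embedding(Q, y, label, W):
        # Q: (T, E1, E2) posteriors q_k^t; y: (T,) class indices.  Eq. (5.3).
        num = np.zeros(Q.shape[1:])
        den = np.zeros(Q.shape[1:])
        for qt, yt in zip(Q, y):
            s = smear(qt, W)
            den += s
            if yt == label:
                num += s
        return num / np.maximum(den, 1e-12)

    # gamma = label_embedding(Q, y, label=1, W=(6, 6))
    # gy, gx = np.gradient(gamma)   # gradient of the embedding, cf. Fig. 5.3(b)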
Please note that in the two-class case we only need to compute the gradient on one map, since the map of the second class is just the complement of the first. The generalization to the multiclass case can simply be addressed by considering 1-vs-all embeddings, although alternatives are possible. As a final step, to get the gene score Fn, upon which we will base the strategy to rank genes, we evaluate how much the expression of the different genes varies along the borders between the classes. The idea is straightforward: to discriminate between the two classes, the most useful genes are the ones which vary most where the class transition occurs. For example, in Fig. 5.3(c) we show for a particular gene n̂ the map $\pi_{\hat{n},i}$, which represents where that gene is more expressed in the grid. We also show its gradient in each position (yellow arrows). After a quick glance at Fig. 5.3(b) one can see that n̂ is mostly expressed in tumoral samples and often varies where a transition between tumoral and non-tumoral samples is present; this suggests that the gene is important for classification and related to the disease. To capture this idea mathematically, we compute the directional derivative of $\pi_{n,i}$ in the direction $\vec{v}$ of the gradient of the class embedding, $\vec{v} = \nabla \gamma_i$, and we sum over all the locations i in the grid. To further reward the variation in expression where there is a high variation between classes, we also multiply by the modulus of $\vec{v}$. In formulae, the feature score is equal to:

$F_n = \sum_i \left| \frac{\vec{v} \cdot \nabla \pi_{n,i}}{|\vec{v}|} \right| \cdot |\vec{v}| = \sum_i \left| \vec{v} \cdot \nabla \pi_{n,i} \right| \qquad (5.4)$

In the formula, we take the absolute value because we regard as equally relevant genes which under-express in the transition to class l and genes which over-express in the transition to class l. Fig. 5.3(d) shows that $F_{\hat{n},i} \ne 0$ only along the borders between the two classes. Fn represents the rank score of every gene, which permits ordering the genes from the most prominent (i.e. the one which varies the most in the direction of "transition" of the classes) to the least. Summarizing, the proposed gene ranking approach consists of the following steps (for the two-class case; generalizing to more classes is straightforward):

1. Training of the Counting Grid on the whole dataset (generative step, labels are not used)
2. Label embedding of the training samples of one class
3. Computation of the gradient of the map, which estimates the regions of the map where there is the transition from one class to the other
4. Computation in such zones of the gradient of the genes
5. As a final score, each gene is ranked by its averaged variation in the direction where the two classes vary most.

5.3 Example: mining yeast expression

To illustrate the main features of the proposed framework we present a simple example, where we studied a dataset by De Risi et al. [61], measuring the expression of 6400 genes in Saccharomyces cerevisiae during the diauxic shift, a recurring cycle in the natural history of yeast that involves a shift from anaerobic (fermentation) to aerobic (respiration) metabolism.

Fig. 5.4. Temporal profile of the cell density, as measured by OD at 600 nm, and glucose concentration in the media.

In few words, if the yeast finds itself in a medium rich in sugar, it undergoes rapid growth fueled by fermentation, with the production of ethanol.
When the fermentable sugar is exhausted, the yeast cells turn to ethanol as a carbon source for aerobic growth, as depicted in Fig. 5.4. This switch from anaerobic growth to aerobic respiration upon depletion of glucose is known to be correlated with widespread changes in the expression of genes involved in fundamental cellular processes [61]. In this particular experiment, expression values have been measured at 7 different time points, as shown in Fig. 5.4. From our point of view, each time point is a bag $s_t = \{g_n^t\}$, $n = 1, \ldots, 6400$. As done in the previous chapter, we performed a filtering of the genes (following http://www.mathworks.com/help/bioinfo/examples/gene-expression-profile-analysis.html), obtaining a final refined dataset of 310 gene expression values at 7 time points. We learned the CG using these 7 samples, setting the parameter κ to 4: specifically, we opted for a 12×12 grid with a 6×6 window for a clearer visualization.

Fig. 5.5. Yeast dataset embedding. Each time point is placed in a location of the grid, highlighted in red (left part of the figure). There is a clear path connecting the dots: since the most pronounced transition occurs between the 3rd and the 4th time points, in the right part we show the class embedding γl(i) of samples l = {4, 5, 6, 7}.

In the left part of figure 5.5 we provide a visualization of the mapping position on the learned CG of the 7 experiments – each red dot corresponds to the maximum of the $q^t$, i.e. to the most probable position of a given time point t. The highlighted path connects the temporal transitions between the 7 time points, permitting a clear understanding of the dataset. By looking at this embedding, it seems that the most pronounced transition occurs between the 3rd and the 4th time points. Thus, we roughly divided the dataset into 2 classes: we can see the distribution of the "respiration" class (samples 4-5-6-7), i.e. the map γl(i), in the right part of figure 5.5. From this map, we computed the gradient of γl(i) (portrayed in figure 5.6), and identified the genes which vary the most along the direction of the gradient, as described in section 5.2. For example, the gene highlighted in the zoomed portion of the grid (figure 5.6) is gad1, which seems to rapidly activate during the transition from fermentation to respiration. This is in line with previous findings reported in the literature [193, 197]. We extracted the top 10 relevant genes by using the framework described in section 5.2; they are reported in Tab. 5.1 (please note that gad1 is indeed the most relevant gene). To prove that these genes are indeed relevant from a biological point of view, we looked for terms in the Gene Ontology (GO) [11] which are highly over-represented among these 10 genes, with respect to all other terms pertaining to the remaining 300 genes (this analysis was carried out with the online tool GOstat, http://gostat.wehi.edu.au/). Statistically significant (p < 0.05) terms are reported in table 5.2, and they are interestingly related to synthesis of sugar and response to oxidative stress. The p-value is computed employing a chi-squared test with Benjamini multiple hypothesis correction (more details can be found in [17]).
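The actual over-representation analysis was carried out with the online GOstat tool; purely for illustration, a minimal sketch of such a test (chi-squared on a 2×2 contingency table, followed by Benjamini-Hochberg correction) could look as follows, where go_annotations is a hypothetical dictionary mapping each GO term to its annotated genes.

    import numpy as np
    from scipy.stats import chi2_contingency
    from statsmodels.stats.multitest import multipletests

    def go_enrichment(selected, background, go_annotations, alpha=0.05):
        # For each GO term, test whether it is over-represented among the
        # selected genes with respect to the remaining background genes.
        sel, bg = set(selected), set(background) - set(selected)
        terms, pvals = [], []
        for term, genes in go_annotations.items():
            genes = set(genes)
            table = np.array([[len(sel & genes), len(sel - genes)],
                              [len(bg & genes),  len(bg - genes)]])
            if table[0, 0] == 0:
                continue
            _, p, _, _ = chi2_contingency(table)
            terms.append(term)
            pvals.append(p)
        if not pvals:
            return []
        # Benjamini-Hochberg correction for multiple hypotheses
        reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
        return [(t, p) for t, p, r in zip(terms, p_adj, reject) if r]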
This simple example permits showing the main features of the proposed framework: i) the 7 experiments are projected onto the grid in a meaningful way, with a clear path which indicates the temporal evolution of the gene expressions; ii) by looking at the gradient of the class embeddings we can highlight genes which are responsible for the transition of the gene expressions from "fermentation" to "respiration", this being qualitatively confirmed by the GO analysis.

Fig. 5.6. Derivatives computed on the map γl(i). On the right, a zoom of an area of the CG where the gradient is high. The highlighted gene is the one which varies the most in the gradient direction.

Table 5.1. Top genes selected with the proposed approach
Rank | Gene name | Description
1 | gad1 (YMR250W) | Glutamate decarboxylase
2 | hsp12 (YFL014W) | Heat shock protein
3 | gsy1 (YFR015C) | Glycogen synthase
4 | ygp1 (YNL160W) | Yeast glycoprotein
5 | ctt1 (YGR088W) | Cytosolic catalase T
6 | sam4 (YPL273W) | S-adenosylmethionine metabolism
7 | gsy2 (YLR258W) | Glycogen synthase
8 | sol4 (YGR248W) | 6-phosphogluconolactonase
9 | hsp30 (YCR021C) | Heat shock protein
10 | pgm2 (YMR105C) | Phosphoglucomutase

Table 5.2. Statistically significant GO terms over-represented in the pool of the 10 selected genes.
GO | Description | Genes | p-value
GO:0005978 | glycogen biosynth. process | gsy1, pgm2, gsy2 | 0.0225
GO:0006979 | response to oxidative stress | hsp12, ctt1, gad1 | 0.0235
GO:0006950 | response to stress | hsp30, ygp1, hsp12, ctt1, gad1 | 0.0247

5.4 Experimental evaluation

To quantitatively assess the merits of the Counting Grid model in the gene expression scenario, we performed several experiments on three datasets widely employed in the literature. The first one is a prostate cancer dataset by [62], containing the expression of 9984 genes in 53 different samples: 14 samples labeled as benign prostatic hyperplasia (BPH), three normal adjacent prostate (NAP), one normal adjacent tumour (NAT), 14 localized prostate cancer (PCA), one prostatitis (PRO), and 20 metastatic tumours (MET). The second is a lung cancer dataset [19], also employed in the experiments done in the previous chapter, consisting of 203 gene expression profiles from normal and tumour samples, with the tumors labelled as squamous, COID, small cell, and adenocarcinoma (5 classes in total). Finally, the brain tumor dataset [182] contains the expression levels of 7129 genes measured in 90 different patients, classified into 5 classes (normal, primitive neuroectodermal tumor – PNET, atypical teratoid/rhabdoid tumors – Rhab, medulloblastoma, and malignant gliomas). We reduced the dimensionality of the original datasets by retaining the top 500 genes ranked by variance.
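The variance-based filtering used to reduce the dimensionality is straightforward; a minimal sketch, assuming the expression matrix is arranged as samples × genes, is given below.

    import numpy as np

    def top_variance_genes(X, n_genes=500):
        # Keep the n_genes columns (genes) with the largest variance across samples.
        idx = np.argsort(X.var(axis=0))[::-1][:n_genes]
        return X[:, idx], idx

    # X_reduced, kept_genes = top_variance_genes(X, n_genes=500)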
In the following, we first show that the model is able to properly embed the samples on separated parts of the grid, where different zones reflect different sample classes/conditions – this shows that the framework captures well the differences in gene expression related to different classes; then, we extract the most relevant genes with the approach of section 5.2, validating them from a medical point of view; finally, we report classification accuracies obtained by using descriptors extracted from the model, reaching state-of-the-art performances.

5.4.1 Embedding and clustering performances

Following the original recipe of [104], a single CG is learned using all samples (but ignoring their labels). Data samples are embedded into the CG space: we show some embeddings on a 15×15 grid (using a 3×3 window) in figure 5.7, to give an immediate insight into the datasets.

Fig. 5.7. CG embeddings for the three studied datasets.

To evaluate how well samples cluster on the grid, we resort to the external criterion of purity [145]. In few words, we leave out one sample and estimate γl(k) on the remaining data by employing Eq. 5.3. Then, we assign a label to the test sample by computing

$y^{test} = \arg\max_l \sum_k q_k^{test} \cdot \gamma_l(k) \qquad (5.5)$

The accuracy obtained with this nearest neighbor strategy is our purity score. We considered CG dimensionalities from 1 to 5, testing systematically up to 40 complexities per dimension.

Fig. 5.8. Purity results for the three datasets (nearest neighbor in CG space), as a function of the capacity κ and of the grid dimensionality (1D to 5D).

The results, shown in figure 5.8, confirm the capability of the proposed framework to embed the different classes of each dataset in different regions of the grid; moreover, except for the Brain tumor dataset, it seems that the grid size and the choice of the capacity do not affect the clustering ability much (with only 1-dimensional counting grids being slightly worse). Interestingly, performances do not drop even for very large complexities, suggesting that the model is robust with respect to overtraining.

5.4.2 Qualitative evaluation of gene selection
Table 5.3. Top genes selected with the proposed approach

Prostate dataset (stability index: 0.891)
Rank | Gene name | Description
1 | CTGF | Connective tissue growth factor
2 | EGR1 | Early growth response 1
3 | AMACR | Alpha-methylacyl-CoA racemase
4 | ATF3 | Activating transcription factor 3
5 | LUM | Lumican
6 | MMP7 | Matrix metalloproteinase 7
7 | SPRY4 | Sprouty (Drosophila) homolog 4
8 | FOSB | FBJ murine osteosarcoma viral oncogene homolog B
9 | FGG | Fibrinogen, gamma polypeptide
10 | DCT | Dopachrome tautomerase

Lung dataset (stability index: 0.907)
Rank | Gene name | Description
1 | GAPDH | Glyceraldehyde-3-phosphate dehydrogenase
2 | MAPK3 | Mitogen-activated protein kinase 3
3 | IL13RA2 | Interleukin 13 receptor
4 | NCAM1 | Neural cell adhesion molecule 1
5 | TIE1 | Tyrosine kinase
6 | CYP2C19 | Cytochrome P450
7 | SLC20A1 | Solute carrier family 20
8 | YWHAE | Tyrosine 3-monooxygenase
9 | ERF | Ets2 repressor factor
10 | CXCR5 | Chemokine (C-X-C motif) receptor 5

Brain dataset (stability index: 0.813)
Rank | Gene name | Description
1 | MAPK3 | Mitogen-activated protein kinase 3
2 | CXCR5 | Chemokine (C-X-C motif) receptor 5
3 | TIE1 | Tyrosine kinase
4 | CYP2C19 | Cytochrome P450
5 | DUSP1 | Dual specificity phosphatase 1
6 | HINT1 | Histidine triad nucleotide binding protein 1
7 | MAPK11 | Mitogen-activated protein kinase 11
8 | RABGGTA | Rab geranyltransferase
9 | EIF2AK2 | Eukaryotic translat. initiation factor 2-alpha kinase 2
10 | IL13RA2 | Interleukin 13 receptor, alpha 2

In this section we provide a qualitative evaluation of the gene selection procedure, in order to understand whether the most relevant extracted genes are significant from a medical point of view. In the next section, a quantitative evaluation and comparison with other state-of-the-art methods is reported. With the approach proposed in section 5.2, we extracted the 10 most relevant genes that are involved in a particular tumor class (metastasis for the prostate dataset, adenocarcinoma for the lung, and medulloblastoma for the brain). Ideally, the genes selected by our framework should not vary too much when varying the model capacity – thus confirming the results shown in section 5.4.1, figure 5.8. To investigate this aspect we ran the gene selection several times using CGs of different complexities, and validated the "stability" of the selected genes through the index proposed by [119]: this index takes values in the range [−1, 1], and the higher its value, the larger the number of commonly selected genes across different trainings of the algorithm. More in detail, given two sets of genes f1 and f2, the stability index is defined as follows:

$KI(f_1, f_2) = \frac{r - (s^2/N)}{s - (s^2/N)} \qquad (5.6)$

where s denotes the signature size, $r = |f_1 \cap f_2|$, and N is the total number of genes in the dataset. For every dataset, this stability index was never below 0.8, as reported in Tab. 5.3, confirming a preliminary investigation carried out in [136]. In Tab. 5.3 we report the most frequently selected genes while varying the model complexity: on these genes we carried out a detailed investigation in order to assess their potential significance for cancer biology.

Prostate Cancer dataset

The top gene highlighted by the algorithm for prostate cancer is CTGF. CTGF belongs to the CCN protein family, which is involved in functions such as cell adhesion, proliferation, differentiation and apoptosis [107]. CCN family proteins have been identified as diagnostic and therapeutic agents for cancer [107].
Expression of CCN family proteins is altered in various cancers, including breast, colorectal, gallbladder, gastric, ovarian, pancreatic, and prostate cancers, gliomas, hepatocellular carcinoma, non-small cell lung and squamous cell carcinoma, lymphoblastic leukemia, melanoma, and cartilaginous tumors [45]. CTGF specifically has been shown to be involved in the invasiveness of cancer cells [45]. Similarly, it has been reported [72] that tumor angiogenesis and tumor growth are critically dependent on the activity of EGR1, the second top gene selected. The gene ATF3 codes for a transcription factor that affects cell death and cell cycle progression. There is some evidence [142] that this gene can suppress ras-mediated tumorigenesis. Lumican levels in breast cancer are associated with disease progression and have been used to predict survival ([228] reported that low levels of lumican are related to tumor size), while FOSB has been found to drive ovarian cancer [206], and can be used as a prognostic indicator for epithelial ovarian cancer. Finally, MMP7 has been found to be involved in cancer metastasis and has been proposed as a target for drug intervention in cancer [240]. We also compared this result with the one obtained with the LPD model [194] described in the previous chapter. Interestingly, there is some overlap between our results and theirs: of the 6 genes we found to be correlated with cancer, their model was able to highlight the CTGF and EGR1 genes, although they were ranked lower.

Lung Cancer dataset

GAPDH expression was found to be strongly elevated in human lung cancer cells [227]. It is also correlated with breast cancer cell proliferation and aggressiveness [192]. IL13RA was found to be one of the genes that mediate the metastasis of breast cancer to the lung [151]. NCAM has been researched as a target for cancer immunotherapy, as it is expressed in small cell lung cancer, neuroblastoma, rhabdomyosarcoma, brain tumours, multiple myelomas and acute myeloid leukaemia. TIE1 is involved in angiogenesis, the creation of new blood vessels, which is an important process also in tumor progression [58]. The experimental deletion of this gene from mice inhibits tumor angiogenesis and growth [58]. YWHAE is correlated with survival in breast cancer: it was found to be enriched in metastatic tumor cell pseudopods [207], and is involved in the pathology of small cell lung cancer.

Brain Tumor dataset

MAPK3 belongs to a family of proteins that regulate cell proliferation, differentiation and cell cycle progression. It was shown to be a prognostic biomarker in gastric cancer and implicated in the progression of hepatocellular carcinoma [114]. CXCR5 is a protein in the CXC chemokine receptor family, which plays a role in the spread of cancer, including metastases [13]. TIE1 was implicated as a prognostic marker for gastric cancer [131] and showed over-expression in breast cancer. DUSP1 is a promoter of tumor angiogenesis, invasion and metastasis in non-small-cell lung cancer [153] and plays a prognostic role in breast cancer [27]. HINT1 is a tumor suppressor gene [249].

5.4.3 Quantitative evaluation of gene selection

We assessed numerically the performance of the gene selection approach presented in section 5.2 by performing a classification experiment on two benchmark datasets (namely Colon and Prostate, summarized in Tab. 5.4), employing only the genes selected with the proposed approach.
We compare our results with state-of-the-art methodologies for gene selection. To have a fair comparison with the state of the art, we adopted the testing protocol of [245]: the dataset was randomly split into 2/3 training and 1/3 testing.

Table 5.4. Summary of the datasets used
Name | N. Genes | N. Samples | Reference
Colon | 2000 | 62 | [7]
Prostate | 6033 | 102 | [210]

More in detail, we employed the whole dataset to train a CG (labels are of course ignored in this phase), from which we computed the Fn score for each gene; after that, only the top-ranked genes have been extracted: in particular, we retain the top [10, 50, 100, 150, 200] genes. Then classification is performed using a linear SVM with the parameter C = 1, using the area under the ROC curve (AUC) as an estimate of the classification performance. The test has been repeated 100 times, and the mean of the computed AUCs is shown in table 5.5, along with comparative state-of-the-art results (see the references between brackets). As for the Counting Grid size, we varied its dimensions by selecting κ between 5 and 40, reporting in the table the mean of the obtained AUCs.

Table 5.5. Classification results (AUC) for the datasets used.

Colon dataset – Gene Signature Size
Gene Sel. Method | 10 | 50 | 100 | 150 | 200
SVM-RFE [245] | 76.4 | 77.5 | 79.2 | 79.4 | 80.1
Ens. SVM-RFE [245] | 80.3 | 79.4 | 78.6 | 78.6 | 79.4
SW SVM-RFE [245] | 79.5 | 81.2 | 78.4 | 76.2 | 76.2
ReliefF [245] | 78.8 | 80.1 | 78.5 | 77.5 | 76.1
Ens. ReliefF [245] | 78.9 | 80.2 | 79.1 | 77.3 | 76.1
SW ReliefF [245] | 78.3 | 79.6 | 78.1 | 76.4 | 75.4
[2] | 85.0 | 86.0 | 87.0 | 87.5 | 86.5
Our method | 81.38 | 89.53 | 89.64 | 89.25 | 88.97

Prostate dataset – Gene Signature Size
Gene Sel. Method | 10 | 50 | 100 | 150 | 200
SVM-RFE [245] | 89.8 | 91.3 | 92.1 | 92.1 | 92.2
Ens. SVM-RFE [245] | 92.9 | 92.0 | 92.0 | 92.6 | 92.7
SW SVM-RFE [245] | 93.4 | 91.3 | 90.0 | 90.7 | 91.2
ReliefF [245] | 93.3 | 93.0 | 91.4 | 91.4 | 91.7
Ens. ReliefF [245] | 93.4 | 92.4 | 91.4 | 91.0 | 91.9
SW ReliefF [245] | 93.3 | 92.7 | 91.4 | 91.3 | 91.4
[2] | 95.5 | 96.0 | 95.0 | 94.0 | 94.0
Our method | 78.21 | 88.30 | 92.45 | 94.99 | 95.73

From table 5.5 it is evident that the proposed approach produces results comparable, and in many cases superior, to state-of-the-art techniques. Furthermore, when looking at the stability, we can observe that our approach is very competitive: the obtained indices are shown in Table 5.6. Since the proposed approach is aimed at explaining the data through a generative model, and labels are used only later on, the stability index is very high: for both datasets and all signature sizes it is always above 0.9, while the best result found in the references we used for comparison is 0.78.

Table 5.6. Stability of the proposed approach

Colon dataset – Gene Signature Size
Gene Sel. Method | 10 | 50 | 100 | 150 | 200
Best [245] | 78.00 | 75.00 | 70.00 | 69.00 | 67.00
[2] | 65.00 | 59.00 | 58.00 | 61.00 | 62.00
Our method | 94.32 | 92.40 | 91.73 | 90.79 | 90.53

Prostate dataset – Gene Signature Size
Gene Sel. Method | 10 | 50 | 100 | 150 | 200
Best [245] | 68.00 | 65.00 | 68.00 | 68.50 | 69.00
[2] | 72.00 | 72.00 | 73.00 | 72.00 | 71.00
Our method | 90.04 | 94.36 | 95.60 | 95.73 | 96.37

5.4.4 Classification results

As a last experiment, we employed the Counting Grid in a classification setting. We followed the standard hybrid generative-discriminative recipe explained in the previous chapter [173]: the idea is to characterize every sample with a feature vector obtained from the learned CG, so that samples are projected into a highly informative space where standard discriminative classifiers such as Support Vector Machines (SVMs) can be used.
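Before describing the two kernels, the evaluation protocol of the gene selection experiment above (Sect. 5.4.3) can be summarized by the following minimal sketch, which assumes the Fn ranking has already been computed (ranked_genes is simply an index array); it is a sketch of the protocol as described, not the code actually used.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def evaluate_signature(X, y, ranked_genes, signature_size, n_repeats=100):
        # Mean AUC over repeated 2/3-1/3 random splits, using only the
        # top-ranked genes and a linear SVM with C = 1.
        genes = ranked_genes[:signature_size]
        aucs = []
        for seed in range(n_repeats):
            Xtr, Xte, ytr, yte = train_test_split(
                X[:, genes], y, test_size=1/3, random_state=seed, stratify=y)
            clf = SVC(kernel="linear", C=1.0)
            clf.fit(Xtr, ytr)
            aucs.append(roc_auc_score(yte, clf.decision_function(Xte)))
        return float(np.mean(aucs))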
In our experiments, we employed two strategies, both based on the definition of a kernel to be used with an SVM classifier. In few words, in the former case [171] the kernel is defined on the basis of a geometric reasoning on the grid of the learned CG, called the Spreading similarity measure:

$SSM_S(s_1, s_2) = SM(q_k^1 \ast S_W, \; q_k^2 \ast S_W) \qquad (5.7)$

In particular, we used the variant with the Histogram Intersection kernel:

$SM_{int}(a, b) = \sum_{i=1}^{K} \min(a_i, b_i) \qquad (5.8)$

The second kernel employed is the Fisher kernel [99], whose derivation in the CG case has been proposed in [177]. In the original formulation, the authors first define the Fisher score for a gene $FS_{k,n}^t$

$FS_{k,n}^t = g_n^t \cdot \sum_{i \in W_{k^t}} \frac{q_i^t}{h_{i,n}} \qquad (5.9)$

and the concatenation of the FS, computed for all genes n, comprises the Fisher score for a sample. Then, the standard linear kernel is computed from these Fisher score vectors. These two classification strategies have been applied to the three datasets (since the prostate dataset has never been studied for classification – some classes have too few samples – to have a comparison with the literature we used another widely employed prostate cancer dataset [210], which contains expression profiles of 102 samples in a 2-class problem). Accuracies have been computed using the dataset authors' protocols: Leave-One-Out for the Prostate dataset, 5-fold cross-validation for the Lung dataset, 4-fold cross-validation for the Brain dataset. The best result obtained by varying the complexity of the grid is reported in table 5.7. In order to have a clear insight into the gain obtained by explicitly considering the relation between topics, as done in the CG case, we applied the same hybrid classification strategies to the PLSA, in the same way described in the previous chapter; finally we compare our results to those obtained with the LPD – we took the results from Tab. 4.2 of Chap. 4. In two datasets out of three, the CG model (equipped with the Fisher kernel) was able to outperform the other topic model-based approaches, as well as the current state of the art (taken from [177]).

Table 5.7. Classification results
Dataset | HI PLSA | HI CG | Fisher PLSA | Fisher CG | LPD | Best SoA | Reference
Prostate | 0.826 | 0.773 | 0.921 | 0.940 | 0.951 | 0.982 | [234]
Lung | 0.911 | 0.918 | 0.938 | 0.959 | 0.942 | 0.938 | [46]
Brain | 0.858 | 0.869 | 0.862 | 0.900 | 0.890 | 0.865 | [162]

Final remarks on gene expression analysis

In this part of the thesis, we explored the potential of topic models for classification and interpretation of gene expression data. Looking at the pipeline of the bag of words paradigm presented in Chap. 2 (Fig. 2.1), this part contributed to all stages of the pipeline from a methodological point of view, by casting the gene expression scenario into the bag of words framework; from an applicative point of view, it particularly contributed to the "How to model" stage, by tailoring and applying topic models to solve the classification task, and by motivating the use of the very recent CG model to mine knowledge from gene expression data. In particular, Chap. 4 proposed a classification scheme based on highly interpretable features extracted from topic models. This resulted in a hybrid generative-discriminative approach for classification. An extensive experimental evaluation, involving 10 different literature benchmarks, confirmed the suitability of topic models for classifying this kind of data. Finally, a qualitative analysis on grapevine plant expression suggested the great expressiveness of the proposed approach.
Chap. 5, building upon the motivations and the results obtained so far, investigates the Counting Grid model as a tool applicable to different analyses of a gene expression matrix, particularly suited because it models the smooth changes in gene expression over time or over different samples. The model provides an intuitive embedding, where samples effectively cluster together according to the patterns that categorize them into one or several tumoral classes. Also, as a novel methodological contribution, we employed the model to perform gene selection. Finally, we assessed its capability in a classification setting. All these merits have been extensively tested: the results demonstrate the suitability of the model from a twofold perspective: numerically, by reaching state-of-the-art accuracies in classification and gene selection experiments; clinically, by realizing that many of the selected genes are potentially significant for cancer biology.

Part II HIV modeling

6 Introduction

Understanding the human immune system, namely the ways in which the body protects itself from diseases and infections, is one of the most challenging topics in biology. There is a very broad branch of the biological sciences – immunology – devoted to the study of an organism's defense (immune) system. As in many other contexts in the life sciences, recent technological advances have drastically transformed this research field: the sequencing of the human genome has produced increasingly large volumes of data relevant to immunology research; at the same time, huge amounts of functional and clinical data are being reported in the scientific literature and stored in clinical records. Thus, computational methodologies can be immensely useful for extracting meaningful representations from such data, and for capturing correlations between the pathogen's actions and the defense system's reactions, which are still largely unknown. One of the most challenging scenarios is perhaps understanding these correlations in the case of the human immunodeficiency virus (HIV), which ranks among the most deadly infectious epidemics ever recorded [239]. HIV infection is a severe disease that targets the immune system and destroys the ability of a person to react to other opportunistic infections. In this part of the thesis we focus on two aspects of HIV that can be computationally analyzed from a bag of words perspective. The first one concerns the problem of determining a correlation between a patient's HIV status and a phenomenon called antigen presentation. In few words, cells present on their surface disordered fragments of their proteins. If the cell is infected, some of these fragments will belong to the foreign pathogen, and will be detected by specialized receptors called TCRs. As these fragments do not appear in a particular spatial organization on the surface, the immune system effectively sees the infection as a bag of molecules, on the basis of whose counts action needs to be taken. In this context the bag of words approach seems particularly suited, with all of the aspects of the bag of words pipeline described in Chap. 2 involved. Such a perspective has never been adopted in this context. We will show that through the bag of words representation and models proposed in Chap. 7 we can predict the severity of the HIV infection in a person. The second aspect regards the analysis of counts of TCRs derived from a set of sequencing experiments.
Even if it is known that the main consequence of HIV is depleting the types and counts of TCRs, some methodological and applicative research lines are still open. In particular, we focused on the reliability of the observed counts, and on robust estimations of the diversity of a TCR population. These analyses, missing from the literature, can have a profound clinical impact, as shown in Chap. 8.

6.1 Background

The collection of cells, tissues, and molecules that mediate resistance to infections is called the immune system, a highly sophisticated machinery that recognizes and responds to possibly harmful substances called antigens. The term antigen is very generic, and refers to any substance that causes an immune response; in particular, disease-causing antigens are called pathogens [1]. In order to understand the computational problems faced in this part of the thesis, we will briefly introduce some biological background. In particular, we will describe in a very simplified fashion how a specific part of the immune system is able to recognize pathogen-infected cells. A central role is played by T cells, a type of white blood cell that circulates around the body, scanning for cellular abnormalities and infections. T cells cannot "see" inside cells to detect an infection, but rely on a phenomenon called antigen presentation. In very few words, most cells present on their surface short fragments derived from cellular proteins, as a means of advertising their state to the immune system. These fragments, called epitopes, are the parts of an antigen – both foreign and the cell's own – that are effectively recognized by the immune system [1]. Moreover, the epitope transport from inside the cell towards the surface is mediated by special molecules called the major histocompatibility complex, or MHC (which in humans is called human leukocyte antigen – HLA). MHC/HLA molecules are proteins that provide a recognizable scaffolding to present an antigen to the T cell. Thus, T cells only recognize an antigen if it is carried on the surface of a cell [4]. Finally, the specialized receptors on T cells that perform such recognition are called T cell receptors (TCRs). The input to the cellular immune surveillance, summarizing the concepts described, is illustrated in Fig. 6.1. We show a simplified illustration of an infected cell which expresses both self (black) and viral (red) proteins (Fig. 6.1a). MHC molecules bind to a small fraction of peptides from these proteins, the epitopes. Inside these MHC complexes, epitopes are transported to the surface of the cell, where they may be "spotted" by the T cells through the specialized TCRs. As the sampled epitopes do not appear in a particular spatial organization on the surface (Fig. 6.2a), the immune system effectively sees the infection as a bag of MHC molecules loaded with different viral epitopes. Depending on the application, this representation may be further simplified into a bag of epitopes (Fig. 6.2b), under the assumption that the main effect of the MHC molecules is the epitope selection (e.g. choosing conserved vs non-conserved targets [93]).

Fig. 6.1. (a) MHC type I binds to a fraction of proteins and exports them to the cell surface, where these sampled peptides appear without particular order. In this cartoon image only 3 different MHC I molecules are present. (b) Specialized T cells recognize epitopes through receptors called TCRs.

Fig. 6.2. (a) Epitopes appear on the cellular surface without particular order. (b) A bag of epitopes and its relative counts cz.

The challenge of this recognition is that the organism cannot predict the precise pathogen-derived antigens that will be encountered. For this reason, the immune system relies on the generation and maintenance of a diverse TCR repertoire.
In other words, TCR diversification must have evolved to keep up with emerging pathogens, to cover most of the antigenic universe with corresponding receptors [159]. This diversification occurs through a complex genetic mechanism called VDJ recombination, which consists of nearly random mutations in the TCR gene sequence. In particular, the TCR genomic region includes a large number of variable (V), diversity (D) and joining (J) gene segments that are used to produce functional TCRs. A simplified view is shown in Fig. 6.3: VDJ recombination leads to the construction of a functional TCR gene. Moreover, at recombination junctions some bases can be randomly added/subtracted [14, 59]. For this reason, immune receptors are extraordinarily difficult sequencing targets, because any given receptor variant may be present in very low abundance and may differ by only a single nucleotide.

Fig. 6.3. As a TCR develops, it rearranges its DNA by randomly choosing different segments of the V, D, and J regions, cutting them out and pasting them back together in random combinations.

Through VDJ recombination the enormous repertoire of antigen receptors is generated, providing the versatility that is essential to normal immune functioning: in fact, it has been observed that around 10^6 distinct TCR molecules can be generated by VDJ recombination, although estimates of the potential repertoire are around 10^9 [236]. T-cell diversity contributes to immune defense in two ways: on one hand, it provides an initial pool from which the best and most efficient T cells will be selected to attack the pathogen; on the other hand, it provides the flexible TCR reserve should the pathogen attempt to escape by mutation. Measuring the diversity and quantity of TCRs is a crucial task in immunology, as the concentration of T cells is a general predictor of the likelihood of opportunistic infections. For example, a sharp rise in the incidence of otherwise rare infections has been observed when counts of T cells fall below 200 cells/µL of blood [152]. This is exactly the target of the human immunodeficiency virus (HIV), which binds and infects T cells. As the disease progresses, the number of T cells declines below its normal level, destroying the ability of the patient to mount an immune response. As a consequence, the patient becomes hypersusceptible to opportunistic infections – often fatal – by pathogens. The severity of HIV infection is generally measured by clinicians with the so-called viral load, a quantity that reflects the number of virus particles in a milliliter of blood.
6.2 Contributions

In the above described context, this part of the thesis addresses two problems: the first one is to investigate the correlation between the bag of epitopes presented on the cell surface and the patient's HIV status; the second one is related to a robust statistical analysis of the counts of TCRs in HIV-infected patients, aimed at discovering how the progress of the disease affects different kinds of patients. More in detail, the first contribution of this part is arguing for new applications of the bag of words paradigm as a set of tools for capturing correlations in the immune target abundances in cellular immune surveillance. Consequently, we propose a novel way of modeling bags of words in this context which i) treats observed epitope abundances as counts and ii) moves away from the traditional componential structure towards a spatial embedding that captures smooth changes in antigen presentation. We promote the use of topic models to capture cellular presentation, and more generally the view that the immune system has of the invading pathogens. Furthermore, we demonstrate that the newest of these models, the counting grid employed in Chap. 5, seems to be especially well suited to this task, providing stronger predictions than what can be found in the biomedical literature. In the experimental section, we restrict ourselves to the analysis of the links between the HIV viral load and the patients' HLA types, leading to a significant improvement with respect to the state of the art. As a second contribution, we analyzed the counts of TCRs derived from a set of sequencing experiments carried out in collaboration with the David Geffen School of Medicine (UCLA). We developed a framework that allowed clinicians to assess the diversity of TCR populations in healthy, diseased, and perinatally infected patients (i.e. youths that contracted the disease in the womb because the mother was infected). Moreover, we looked from a methodological point of view at the reliability of the observed counts: as mentioned before, many TCRs differ by only a single nucleotide, and are observed very few times (in most cases, only once). A final comment: most of the work presented in this part of the thesis was set up during an internship abroad at Microsoft Research, Redmond (US), in the eScience group under the supervision of N. Jojic.

7 Regression of HIV viral load using bag of words

This chapter discusses the problem of modeling the immune response to HIV from a bag of words perspective. In particular, we fully exploit the pipeline of Chap. 2 with the final goal of regressing the HIV viral load (i.e. a number that reflects the severity of the infection) starting from a bag of words representation of epitope sets. We mentioned in the previous chapter that the mammalian immune system consists of a number of interacting subsystems employing various infection-clearing paths, with antigen presentation playing a central role in many of them. Moreover, we discussed that the immune system needs to recognize a virus not as a whole but as a set of disordered viral epitopes (a "bag" of epitopes). Like the gene expression context, this is another example in computational biology where the bag of words representation seems particularly suited, because the structure of the objects in the problem is truly unknown, rather than just sacrificed for computational efficiency.
This chapter has a dual purpose: i) it argues for the application of the bag of words paradigm as a set of tools for capturing correlations between the epitope abundances in HIV patients and their viral load; and ii) it demonstrates that it is possible to effectively model the bag of words representation, by deriving a novel regression method – based on the counting grid [104] – that provides stronger predictions than what can be found in the biomedical literature. In the remainder of this chapter (after a brief review of the state of the art), we first explain how to extract and model the bag of words representation from epitope sets: in the explanation, we address every step of the pipeline proposed in Chap. 2. We then report an experimental evaluation, leading to a significant improvement with respect to the state of the art.

7.1 State of the art

Explaining the differences in viral load among different HIV patients is a crucial problem investigated by researchers in the HIV community (e.g., [6, 149]). The hypothesis is that the variation in epitope presentation across patients is expected to reflect on the variation in viral load, at least to some extent [93, 154]. In particular, early studies showed that changes in viral load occur in synchrony with the emergence of new epitopes in immune assays (e.g., [6, 149]). However, in the case of the highly polymorphic HIV, a handful of epitopes usually fails to control the infection, and so researchers turned to population studies in search of optimal immune targets. These early studies [6, 149] failed to detect significant links between patients' epitopes and viral load, mainly because the straightforward statistical approaches could not handle small dataset sizes (typically around 200 patients or fewer). However, in [154] the evidence of an association between viral mutations and patients' epitope types was recognized. Generally speaking, the viral load is highly variable and may depend on numerous factors, such as gender, age, prior infections and the general health of the individual. Yet, any statistically significant result has been seen as having important consequences for HIV research. Eventually, larger cohorts allowed researchers to clearly assess the link between epitope types and viral load. For example, in [113], certain epitope types were found to strongly associate with low viral load in a cohort of over 700 HIV patients in southern Africa. In these studies, despite the statistically strong associations, the viral load in positive and negative patients still had such large variance that each of these epitope types alone could only explain less than 2% of the total log viral load variance in the population. For these reasons, computational methods that capture these correlations and that are able to regress the viral load value for different patients have assumed great importance in recent years.

Table 7.1. Comparisons with the biomedical and computational biology literature. The percentage of viral load (VL) explained is the square of Pearson's linear correlation coefficient ×100 (see Tab. 7.2).
Reference | Major Result
[154] | VL considered too noisy. Associations with mutations found
[113] | 1-2% of VL variance explained through individual allele association
[93] | 4% of VL variance explained through targeting efficiency
[96] | 4.3%-9% of VL variance explained by combinations of epitopes
Bag of words | Up to 13.50% of VL variance explained by embedding into the CG
Tab. 7.1 provides an immediate insight into the state of the art on the computational methods that faced this task. To put these numbers into perspective, it is important to make two observations. First, even weak signals, e.g. those in [113, 154], had the tendency to move the entire field, as valuable characteristics of the interaction between HIV and the host immune system were revealed, informing both the research on HIV drugs and the research on HIV vaccines. Second, in addition to the high variation of the viral load due to factors that relate to age and general health, it is known that the set point viral load depends strongly on the infecting strain (see [6] for a recent study), and as HIV was found to mutate in its reactions to HLA presentation, this variation in fitness of the infecting strains may itself be due to the HLA pressure from previous hosts. Thus the increase in explanatory power of HLA types from around 4% of the log viral load to around 13.5% is potentially of great importance. Further analysis of selected combinations of features in the counting grid may lead to further advances in understanding the evolutionary arms race between HIV and the human immune system.

7.2 The proposed approach

Our problem is to model epitope sets with the bag of words paradigm, and in particular to show that it is possible to perform regression and find correlations between this bag and the HIV viral load. In order to do so, it is necessary to i) extract a dictionary of possible epitopes that can be generated from the viral proteins; ii) count the abundance of these epitopes (this task is particularly difficult because there is no technique able to directly measure such counts); iii) choose a model that is able to capture epitope co-presentation and perform regression. In the following sections each step is addressed.

7.2.1 What to count

The first observation that has to be made is that the concentration of any viral epitope on the cellular surface depends on the source protein's expression level. In the following, we will denote a viral protein sequence as S = s_1 s_2 ... s_L. Moreover, it has been shown that most epitopes transported to the surface by HLA molecules are of length 9 [198]. For these reasons, we identified as words the 9mers of protein sequences. In principle, the dictionary should be composed of all possible 9mers, leading to a dictionary of size 20^9 = 512 · 10^9. Instead, we opted for a more data-driven dictionary: given a sequence S = s_1 s_2 ... s_L corresponding to a viral protein, we extracted all possible overlapping 9mers observed in the sequence:

W = {w_1, w_2, ..., w_l, ..., w_{L-9+1}}

where w_l = [s_l ... s_{l+9-1}] is the l-th 9mer extracted from the sequence. In particular, we considered three essential HIV proteins in isolation, whose sequences are known in the literature: GAG (core structural protein), POL (reverse transcriptase, used by the virus to integrate itself into the human genome), and VPR (essential for the replication of the virus). In this way, the dictionary is composed by listing all unique 9mers observed, and each word/9mer in the dictionary is indexed by n.
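The extraction of this data-driven dictionary is a simple string-processing step. The following is a minimal sketch (in Python); the function names and the toy sequence are illustrative assumptions, and the real GAG, POL and VPR sequences would be passed in place of the toy string.

```python
from collections import Counter

def extract_kmers(sequence, k=9):
    """Return all overlapping k-mers (words) observed in a protein sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_dictionary(sequences, k=9):
    """List all unique k-mers observed across the given protein sequences.

    Each word/9mer in the dictionary is indexed by its position n in the list.
    """
    dictionary = []
    seen = set()
    for seq in sequences:
        for w in extract_kmers(seq, k):
            if w not in seen:
                seen.add(w)
                dictionary.append(w)
    return dictionary

# Toy example (a real GAG/POL/VPR sequence would be used instead):
gag_like = "MGARASVLSGGELDRWEKIRLRPGGKKKYKL"
dictionary = build_dictionary([gag_like], k=9)
raw_counts = Counter(extract_kmers(gag_like, 9))   # raw occurrence count per 9mer
```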
7.2.2 How to count

Unfortunately, there is no technology able to directly measure the epitope concentration (count) on the cell surface. Thus, to obtain the count value we reasoned about the mechanism employed by HLA molecules for generating 9mers starting from the viral proteins. In particular, each HLA molecule, indexed by m (in a human host there are up to 6 different HLAs), binds to the viral protein and cuts it to obtain a 9mer, which is later transported to the cell surface. To generate the count value, we exploited an HLA-epitope complex prediction algorithm that can estimate the binding energy E_b(n, m) for each of the epitopes n and the different patient's HLA molecules m [105]: the higher the energy, the less likely it is that the HLA m will create epitope n. We also used a cleavage energy estimate E_c(n) [157], which estimates – for each protein – the energy required to create the epitope n. We combined this information, and turned the total energy into a count (concentration) as follows:

c_n = e^{-E_c(n) - min_m E_b(n,m)}    (7.1)

Even if other techniques for the estimation of surface epitope (relative) counts exist (see [247] for a recent review), we chose to employ this particular one as it provides predictions for arbitrary HLA types, simply defined by their protein sequence. In the end, the vector x_t = [c^t_1, ..., c^t_N] is the bag of words representation for the epitopes characterizing patient t.
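A minimal sketch of how Eq. 7.1 turns the predicted energies into pseudo-counts is given below. The two energy callables are placeholders for the external predictors cited in the text ([105] for binding, [157] for cleavage); their names and signatures are assumptions made only for illustration.

```python
import math

def epitope_counts(dictionary, patient_hlas, binding_energy, cleavage_energy):
    """Turn predicted energies into pseudo-counts following Eq. 7.1.

    binding_energy(word, hla)  -> E_b(n, m), from an HLA-epitope binding predictor
    cleavage_energy(word)      -> E_c(n), from a cleavage-energy predictor
    Both callables stand in for the external tools cited in the text.
    """
    counts = []
    for word in dictionary:
        e_b = min(binding_energy(word, hla) for hla in patient_hlas)
        e_c = cleavage_energy(word)
        counts.append(math.exp(-e_c - e_b))   # c_n = exp(-E_c(n) - min_m E_b(n, m))
    return counts   # the bag of words vector x_t = [c_1, ..., c_N] for this patient
```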
7.2.3 How to model

It is important to notice that the counts c_n are not independent. The MHC system, as well as viral mutations, create links among the abundances of different viral peptides in the observed bag. First, two patients infected by the same virus, e.g. HIV, are highly unlikely to have exactly the same HLA molecules: therefore, each of their HLAs will select specific epitopes from the HIV proteins, and the patients' sets of immune targets will likely overlap only partially. In other words, each HLA molecule has its binding preferences, which lead to the selection of only one out of a hundred to a thousand epitopes. Second, the variation of the HIV epitope sets found in different patients exhibits strong co-occurrence patterns, where a high count of one peptide often implies the inclusion of several others, as they are all good binders to a particular HLA. This is precisely what topic models were meant to capture for text documents, as summarized by Fig. 7.1 (a,b). In particular, the topic proportions p(z|t) (as given for example by a PLSA model) for individual patients t can be used as a compact representation that discards the superfluous aspects of the bag of words. In this context, the HIV viral load can be regressed directly on these hidden variables.
Modeling cellular peptide presentation as a mixture of topics can capture some of the presentation patterns discussed above. Upon model fitting, the topics may correspond to individual MHC molecules that are more frequent in the patient cohort, or to entire families of MHC types that have similar presentation (sometimes referred to as MHC supertypes). In this case, the topic probability distribution would reflect the probabilities of binding of a particular MHC (super)type to these different peptides. Some topics may also capture the HIV clade structure, as mutations in each clade alter the MHC binding patterns.
Among the different topic models, there are some reasons why the counting grid may be a more appropriate model of variation in epitope bags. These reasons relate to the manner in which biological entities interact and adapt to each other, leading to patterns of slow evolution characterized by genetic drift, local coadaptation, as well as punctuated equilibrium. In the case of antigen presentation, for example, millions of years of evolution created certain typical variants of MHC as well as minor variations on each of these major types. These variations are at least in part due to the interaction with viruses [93], and similarly the genetic variation in viruses reflects some of this evolutionary arms race, too. Thus, the HIV clade constraints, as well as the MHC binding characteristics, may be so interwoven that a rigid view of cellular presentation as a mix of a small number of topics may be inappropriate. In the counting grid, the major variants of cellular presentation can be modeled as far away windows, while minor variations would be captured by slight window shifts in certain regions of the grid.

Fig. 7.1. Capturing dependencies in bags of words.

7.2.4 Information extracted: regression of viral load value

The final task in our problem is the following: given the bag of words representation of every given patient t and a model for the set of patients, we want to derive a regression method able to predict the viral load. In the following, we present a novel methodological contribution, aimed at embedding continuous values y^t (e.g., HIV viral load) on the grid and performing regression. First, let us look back at Eq. 5.3, where we embedded discrete labels into the learned grid. Here we generalize this notion: by using the inferred posterior probabilities q^t_k, we compute an M-step using the target value in place of the counts (c^t_n). In formulae, this is equivalent to

γ(i) = [ Σ_t Σ_{k | i ∈ W_k} q^t_k · y^t ] / [ Σ_t Σ_{k | i ∈ W_k} q^t_k ]    (7.2)

In Fig. 7.1 (c,d) we visually show the effect of this equation: the viral load y^t of each sample is "copied" in the window positioned by q^t_k (i.e., W_k) and then the result is averaged over all the samples. Also, in Fig. 7.2 we show a couple of γs, estimated from the dataset we used in the experiments; the window W is shown with a dotted line in the figure.

Fig. 7.2. HIV viral load embedding in 2D, for E = [63,63], W = [11,11] (ρ = 0.339) and E = [63,63], W = [22,22] (ρ = 0.314). The window is shown with a dotted line in the figure.

The function γ can then be used for regression in what is essentially a nearest-neighbor strategy: when a new test patient is embedded based on its bag of words, the target y^test is simply read out from γ, which is dominated by the training points that were mapped in the same region. In other words, given the mapping location q^test_k of the test sample, its prediction y^test will be

y^test = Σ_k q^test_k · γ(k)    (7.3)

Besides this simple scheme, we also propose a more complex one inspired by [96]. The idea is to regress the reconstruction error E^t_n = c̃^t_n − R^t_n on the residual viral load y^t_RED = y^t − y^t_CG, where y^t_CG is the viral load prediction obtained using the counting grid and c̃^t_n are the normalized epitope counts. Following [96], we used a regularized linear regression with L1 norm (also known as LASSO [226]).
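A minimal sketch of Eqs. 7.2 and 7.3 is given below, assuming that a counting grid has already been trained and that, for every training bag, the posterior window probabilities q^t_k are available as an array; the array layout (one entry per window, each window indexed by its grid position) and the function names are illustrative assumptions, not the original implementation.

```python
import numpy as np

def embed_targets(q, y, window_masks):
    """Embedding of continuous targets on the grid (Eq. 7.2).

    q            : (T, K) posterior window probabilities q_k^t for the T training bags
    y            : (T,)   continuous targets y^t (e.g., log viral load)
    window_masks : (K, I) boolean matrix, window_masks[k, i] = True iff grid point i
                   falls inside window W_k
    Returns gamma(i) for every grid point i.
    """
    weights = q @ window_masks                  # (T, I): sum over {k | i in W_k} of q_k^t
    num = (weights * y[:, None]).sum(axis=0)    # numerator of Eq. 7.2
    den = weights.sum(axis=0) + 1e-12           # denominator of Eq. 7.2
    return num / den

def predict_viral_load(q_test, gamma):
    """Read-out of Eq. 7.3: y_test = sum_k q_k^test * gamma(k).

    Windows and grid points are assumed to be in one-to-one correspondence
    (each window W_k indexed by its grid position k), so gamma can be indexed
    directly by k.
    """
    return float(q_test @ gamma)
```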
7.3 Experiments

The experimental evaluation is aimed firstly at proving the suitability of the CG model, providing also a comparison with other bag of words models (which have never been applied to this task). Then, we demonstrate that – when used for regression – the CG significantly outperforms the state of the art in the biomedical and computational biology literature [93,96,113,154] (see Tab. 7.1 for a summary).
As input data, we analyzed the cellular presentation of HIV patients from the Western Australia cohort [154]. We represented each patient's cellular presentation by a set of 492 counts over that many 9-long peptides from the GAG protein, previously found to be targeted by the immune system as explained. The counts were calculated based on the patients' HLA types and the energy estimation procedure discussed by [93] and in Section 7.2.2. This provides us with bags of epitopes (counts over the 492 words) that represent GAG in 135 different patients. We used the same process for two more proteins: POL and VPR. This resulted in bag of words matrices of respectively 88×135 and 939×135 words×samples. We analyzed only the clade B infected patients.

Fig. 7.3. HIV viral load regression. On the top row, we depict the variation of the correlation factor ρ for the CG model with different complexities (one panel per protein: POL, VPR, GAG; x axis: capacity, y axis: Pearson's linear correlation coefficient). We used colors to represent the CG size; the same capacity in fact can be obtained with different E/W combinations. On the bottom row, the same analysis on the GAG protein with LPD.

7.3.1 Experiment 1: modeling antigen presentation with the counting grid

To employ the counting grid model, we first trained it on the bags of words c^t_n, without using the regression targets y^t (log viral load). Then, in a leave-one-out fashion, we held out a sample t̂ and we estimated the regression function γ (see Eq. 7.2, with t ≠ t̂) using all the others; finally, by reading out γ in the appropriate location we obtained the viral load prediction for sample t̂ using Eq. 7.3: y^t̂_CG = Σ_k q^t̂_k · γ(k). Once we computed the estimated regression target for all the samples, we computed ρ, the pairwise correlation coefficient between the true and the estimated viral load. The proposed approach based on CG has been compared with the LPD model introduced in Chap. 4 [194]. To evaluate LPD we worked as for CGs: we learned a single model (without using the targets) and we predicted the viral load for the left out sample using linear regression based on the topic proportions p(z|t). We considered Counting Grids of various complexities, picking E among [12, 15, 18, 21, 25, 30, 35, 40, 50] and W among [2, 3, 4], only trying the combinations with capacity κ between 1.5 and T/2, where T is the number of samples available.
Results for all the proteins are shown in Fig. 7.3, where we reported the results for a range of capacities κ which are roughly equivalent to the number of LPD topics K. LPD and CG reach similar results on POL and VPR, while CG has a clear advantage on GAG (this last statement is shown graphically in Fig. 7.3). It is important to note how, for the Counting Grid, the correlation factor varies much more regularly with the capacity κ.
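The leave-one-out protocol of Experiment 1 can be summarized with a short sketch, reusing the embed_targets and predict_viral_load helpers sketched in Sec. 7.2.4 (the function names are illustrative assumptions, not the original implementation).

```python
import numpy as np

def loo_correlation(q, y, window_masks):
    """Leave-one-out evaluation of the counting-grid regressor.

    q holds the mapping of every patient onto a grid trained on the bags of
    words alone; the targets y (log viral loads) are used only to estimate
    gamma on the held-in samples, never to train the grid.
    """
    T = len(y)
    preds = np.empty(T)
    for t_hat in range(T):
        train = np.arange(T) != t_hat
        gamma = embed_targets(q[train], y[train], window_masks)   # Eq. 7.2
        preds[t_hat] = predict_viral_load(q[t_hat], gamma)        # Eq. 7.3
    rho = np.corrcoef(y, preds)[0, 1]   # Pearson's linear correlation coefficient
    return rho, preds
```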
7.3.2 Experiment 2: comparison with the state of the art

In order to have a comprehensive comparison with the state of the art, we evaluated the regression results obtained with a standard phylogenetic analysis [223]. We built a phylogenetic tree from the GAG, POL, and VPR sequences using the maximum likelihood approach of [223]: the aim is to model the low-level generative process of random mutations by learning the probability distributions which govern it. The purpose of this analysis is to show whether the evolution of the sequences alone can give some insights into the viral load. A few parameters have to be tuned when computing such trees: in our experiments, we picked the WAG model [238] as rate substitution matrix, and we allowed for rate variations across sites, setting 4 discrete gamma categories [244] (these are the default values set by the phylogenetic tool employed, MEGA [223]). Then, we want to predict the viral load ŷ for a testing sequence x^test by looking at sequences which are close to it in the tree topology. For this reason, regression is carried out with the following formula:

ŷ^test = Σ_t e^{-C · dist(x^test, x^t)} · y^t    (7.4)

where t indexes the training sequences x^t and their associated viral load values y^t. The parameter C has been found with crossvalidation on the training set.
To have a fair comparison and evaluate the robustness of our framework, for each protein we performed leave-one-out crossvalidation on the training set to pick the best model complexity (E/W for Counting Grids, number of topics K for LPD), and we compared the results with the trees. More in detail, we proceeded as explained in the previous section but – only using the training data – we regressed the viral load and computed the linear correlation factor for each complexity. Then we picked the complexity that gave the best result and we predicted the viral load on the test sample. It is important to note that now i) the viral load of each sample can in principle be predicted with a different complexity, and ii) the test sample is not used to train the model. As a final experiment, we also employed the advanced scheme proposed in Sec. 7.2.4, which we dub CGs→LASSO: as before, we used leave-one-out crossvalidation to choose the best model complexity.
Results are shown in Tab. 7.2. For LPD, this process failed and we could not obtain statistically significant results because of severe overtraining issues. As visible from Tab. 7.2, column CGs→LASSO, the advanced CG scheme improved the performances in all the cases. The model complexities chosen by each round of leave-one-out did not differ much: regardless of the protein considered, for more than 89% of the data points the same complexity was typically chosen, as reported in the last column of Tab. 7.2.

Fig. 7.4. Evolution of the viral load across the iterations (iterations 1, 5, 10, 15, 20, 25, 30 and 40 are shown).

The medical literature also reports other results obtained on the GAG protein; comparisons are shown in Tab. 7.1. As visible, our approach strongly outperforms all the methods [93, 96, 113, 154]. It remains to be understood exactly why CGs exhibit such a strong advantage over topic models (LPD). One intuitive explanation is that the slow smooth variations in count data that can be captured in counting grids better represent the dependencies that were produced by millions of years of coevolution between the HLA system and various invading pathogens [113].
This process involved numerous mixing of both the immune types and the viral strains, and may have produced the sort of thematic shifts in antigen presentation that CGs are designed to represent. A more speculative possibility is that the immune system, through some unknown mechanism, collates the reports from circulating T cells into an immune memory with similar structure. A final note on the embedding function γ: the bags of peptides are mapped to the counting grid iteratively as the grid is estimated as to best model the bags, but the regression target, the viral load, was not used during the learning of CGs or LPD models. However, the inferred mapping after each iteration can be used Table 7.2. Pearson’s linear correlation (after crossvalidation where applicable). Crossvalidation for LPD was found not statistically significant (NS) for GAG and POL. The last column report the most common CG’s complexity chosen in the rounds of leave-one-out crossvalidation. Protein GAG VPR POL CGs ρ 0.3301 0.2011 0.2338 CGs→LASSO Trees ρ ρ 0.3674 0.3519 0.2546 0.1061 0.2443 0.1812 LPD Ridge Regr. Complexity Chosen ρ ρ NS 0.1835 [30,5] - 89% 0.1202 NS [50,8] - 94% NS NS [40,11] - 97% 86 7 Regression of HIV viral load using bag of words to visualize how the embedded viral load γ evolves. This is illustrated in Fig.7.4 for a model of complexity E = [30 × 30], W = [8 × 8]. The emergence of areas of high (red) and low (blue) viral load indicates that as the structure in the antigen presentation is discovered, it does indeed reflect the variation in viral load. 8 Bag of words analysis for T-Cell Receptors In the previous chapter we studied one particular aspect of the HIV virus, namely how it is presented to the cell surface for T-cell recognition. Nevertheless, the infection carried on by HIV affects different parts of the immune system. In particular, from a clinical point of view, T-cell depletion is the central abnormality associated with HIV infection. This is the cause that leads to AIDS, a severe condition where – because of the low count of T-cells in the body – the ability of a person to react to other opportunistic infections is compromised. While antiretroviral therapy (ART) can restore T cell counts, the adequacy of the diversity of the reconstituting cells, particularly in long term survivors of perinatal infection (i.e. infection passed from a mother to her baby during pregnancy), is not well understood. While the quantity of T-cells is broadly indicative of the immunological status of an individual, the diversity, i.e. the number of different types of T-cells, is also a crucial factor [15,208]. In fact, it has been observed [48] that even if such diversity is generated in an antigen-independent fashion, during an immune response particular T-cells are selected and overproduced, in a process called clonal selection. In this chapter we build a bag of words representation able to characterize this aspect of the HIV virus: in particular, individual TCR sequences – which are called species in this context – compose the dictionary, and each patient is represented with a bag of word vector counting the number of different TCRs. This work stemmed from a collaboration with the David Geffen School of Medicine (UCLA), with the main goal of studying the diversity of the bag of words in different classes of HIV patients. In very few words, quantifying the diversity of a sample means to assess the total number of species present. 
However, here this analysis is particularly challenging, because i) the obtained sequences are extremely noisy, and ii) the observed sequences may be too few to correctly capture the underlying true species distribution. Moreover, typical dataset sizes are very reduced, because only a handful of patients are available. Therefore, making any claim about which HIV class of patients is more diverse is extremely difficult to prove. These concerns, typically disregarded in the literature, have been taken into account in this chapter by performing a critical and robust statistical analysis of the bag of words representation. As a second contribution, we questioned from a more methodological point of view the reliability of the observed bags of words, 88 8 Bag of words analysis for T-Cell Receptors proposing a possible criterion to assess their reliability. The investigation stems from the consideration that any given TCR variant may be present in very low abundance and may differ by only a single nucleotide. In fact, most species (the different TCR sequences in the dictionary) are very rare, with a corresponding count of 1, and any estimate of diversity is inevitably biased. The clinical partners provided a dataset obtained after a TCR sequencing using the technology of 454 Pyrosequencing [195] in people surviving beyond 15 years after perinatal infection. In particular, the pool of TCR sequences has been derived from 9 different patients: 3 perinatally infected subjects that received ART therapy (positive samples), 3 healthy children (negative samples), and 3 cord samples used as controls. For each patient, TCR sequences can be classified according to their VDJ recombination. In particular, sequenced TCR are divided in 3 known V segments and 15 known J segments. The D region is hyper variable, and the main source of diversity. We performed a thorough evaluation of several statistics that estimate diversity of the bag of words, providing useful insights on the data. The interesting conclusion reached is that during ART of HIV-infected children, an early and sustained increase in TCR is seen, and this is somewhat in line with previous findings [101, 190, 191]. Surprisingly, we also observed that TCR diversity in positive infected children is even higher than in negative ones, suggesting a greater baseline thymic function (the thymus is the gland responsible for T-cell production). This claim has been numerically evaluated using several methods that provide robust estimates of diversity. However, even if some claims about differences in positive and negative patients can be made, it is not possible to reliably estimate the total number of species present in a sample. In fact, the results of our reliability analysis shows that the bag of words contains too many rare species, and these estimates of total diversity can not be considered reliable. In the following section, we describe the proposed approach, detailing i) the diversity measures that can be employed to robustly estimate differences between patients and ii) explaining the methodology to assess the reliability of a bag of words vector. Then, results are reported, and conclusions drawn. 8.1 The proposed approach In order to compare TCR sets from different patients, the starting point is to derive the bag of words representation. The pool of TCR sequences have to be categorized in species, i.e. the words indexed in the dictionary. We propose three different schemes to extract words and build the dictionary: 1. 
Raw scheme: using this simple scheme, each unique sequence is assigned to a species: sequences belonging to a species are identical in every position of their nucleotide sequence. In other words, if two sequences differ only by a single nucleotide, they are assigned to two different species.
2. Low error scheme: in order to deal with sequencing errors (even if in principle they should increase diversity in all samples equally) and to reduce noise, we propose the following filtering steps. First, all sequences where a local alignment with one of the known V or J strands scores higher than 5% errors (an error can be a mismatch, an insertion or a deletion) are removed. Then, individual sequences are counted using the raw scheme.
3. Cluster scheme: with this scheme, sequences which are very similar can be clustered and aggregated in a single species. Clustering is done by measuring the biological similarity between sequences, and sophisticated approaches that take into account the TCR nature of the sequences have been proposed in the past: here we employed the well-known cd-hit-454 tool [160], publicly available at http://weizhong-lab.ucsd.edu/cdhit_454/cgi-bin/index.cgi.

In the following section, we describe how to quantify the diversity of this bag of words representation.

8.1.1 Diversity measures

The simplest way of assessing the diversity of a bag of words vector is by measuring its entropy, which in this context is sometimes called the Shannon index. First, the counts c^t_n (number of sequences in the n-th species of the t-th sample) are normalized to obtain the frequencies

f^t_n = c^t_n / Σ_m c^t_m    (8.1)

Then, the Shannon index for the t-th bag is equal to

D^t = − Σ_n f^t_n log_2 f^t_n    (8.2)

However, the sample size (i.e. the number of observed species) has a strong effect on the Shannon index. Since the observed TCR counts are only a fraction – or a sample – of a larger distribution of rare variants that could not be observed with the sequencing technology at our disposal, it may be that the Shannon index is too biased for estimating diversity. For this reason we propose to employ a robust histogram shape estimation technique called the "Unseen estimator" [229]. This technique uses the observed distribution of species in a sample to estimate how many undetected dictionary elements, i.e. species, occur in various probability ranges. Given such a reconstruction, one can then use it to estimate any property of the distribution which only depends on the shape; such properties are termed symmetric and include the entropy (i.e. the Shannon index) and the support size [229].
Another robust strategy that can be employed in order to have consistent estimates across different patients is to subsample each patient so that it has the same number of sequences as the others. Of course, the subsampling procedure has to be repeated many times in order to have a significant comparison.
The last robust measure of diversity we employed is the so-called rarefaction curve [189], originally introduced in ecology. The idea is the following: if we are given W individual sequences with N < W different species, one way to visualize the diversity patterns is to run randomization tests, in each of which a subset of v < W sequences is picked at random and the number of unique sequence types n is counted. When these tests are done many times for all values of v, we can plot the graph of species accumulation by averaging over samples. The curve – which is often called a rarefaction curve [189] – is nearly linear at the beginning, as picking a handful of sequences from a large diverse set is not likely to result in any sequence repetitions. As v increases, the graph starts to curve, and it would asymptotically reach the total number of species in the population (if W is large enough so that the sample covered all of the diversity). If we have multiple groups of sequences with different numbers of sampled sequences, then the comparison of these graphs provides some idea as to which sample came from a more diverse population.
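A minimal sketch of the two simplest measures above – the Shannon index of Eqs. 8.1–8.2 and a rarefaction curve obtained by repeated random subsampling – is given below; the function names and the choice of subsampling depths are illustrative assumptions.

```python
import numpy as np

def shannon_index(counts):
    """Shannon index (Eq. 8.2) of one bag of words: D = -sum_n f_n log2 f_n."""
    counts = np.asarray(counts, dtype=float)
    f = counts[counts > 0] / counts.sum()      # frequencies f_n (Eq. 8.1)
    return float(-(f * np.log2(f)).sum())

def rarefaction_curve(counts, steps=50, repeats=100, rng=None):
    """Average species-accumulation (rarefaction) curve from a bag of counts.

    Individual sequences are subsampled without replacement at increasing
    depths v, and the number of distinct species is averaged over repeats.
    """
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts, dtype=int)
    individuals = np.repeat(np.arange(len(counts)), counts)   # one entry per sequence
    depths = np.linspace(1, len(individuals), steps, dtype=int)
    curve = []
    for v in depths:
        n_species = [len(np.unique(rng.choice(individuals, size=v, replace=False)))
                     for _ in range(repeats)]
        curve.append(np.mean(n_species))
    return depths, np.array(curve)
```

Comparing such curves across patients sequenced at different depths is exactly the use made of them later, in Sec. 8.2.4.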
8.1.2 Reliability of the bag of words

In this section we address the problem of assessing the reliability of the bag of words representation. This analysis stems from the consideration that typical TCR sequencing produces many rare variants, i.e. species observed only once, and any technique that attempts to estimate statistics from such data may produce unreliable results.
A crucial question is "What can one infer about an unknown distribution based on a random sample?" If the underlying distribution is relatively "simple" in comparison to the sample size – for example if our sample consists of 1000 independent draws from a distribution supported on 100 domain elements – then the empirical distribution given by the sample will likely be an accurate representation of the true distribution. If, on the other hand, we are given a relatively small sample in relation to the size and complexity of the distribution – for example a sample of size 100 drawn from a distribution supported on 1000 domain elements – then the empirical distribution may be a poor approximation of the true distribution. To assess whether our bags of words fall into this last category, we propose an approach based on the Unseen estimator [229]. The idea is that below a certain abundance threshold (i.e. for very rare species), the Unseen estimator is not able to reconstruct the shape of the "missing" part of the distribution. We propose a reliability criterion, which is based on the consistency of the prediction across subsampling.
More in detail, suppose that we are given a bag x = [x_1, ..., x_W] with total count N (i.e. N sequences have been observed, N = Σ_{n=1}^{W} x_n). The dictionary, of size W, lists all species D = {w_1, ..., w_W}. Consider also the normalized version with frequencies instead of counts: f = [f_1, ..., f_W], where f_n = x_n / N. First, we define W_α as the number of species with frequency above a given value α:

W_α = |{w_i | f_i > α}|    (8.3)

Then we compute, for all α, the values Ŵ_α obtained after the Unseen estimator, and we repeat the same procedure after subsampling x so that it has N/2 total counts. In the end, after many subsamplings, we obtain an average Ŵ_α^(0.5). The reliability threshold α̂ is obtained by aggregating two criteria:
1. the average number of predicted species after subsampling, Ŵ_α̂^(0.5), is within 20% of the number Ŵ_α̂ predicted with the full data;
2. the standard deviation over the repetitions is within 20% of the average prediction Ŵ_α̂^(0.5).

Fig. 8.1. Reliability evaluation for patient AP04 (observed data: 1657 reliable species; subsampled data: 1930 ± 353 reliable species). The shaded area indicates consistent estimates, namely estimates of the number of species that are robust with respect to subsampling.

Fig.
8.1 depicts an example: the solid curve represents the (estimated) number of species above the threshold specified on the x axis, whereas the dashed curve represents the (estimated) number of species after 50% of sequences have been subsampled at random many times. As can be observed from the graph, in the left part (which is heavily influenced by rare, low frequency species), the two curves are rather different; however, they tend to converge moving towards the right. In addition, note that the standard deviation over the different subsamplings decreases when raising the threshold. The shaded portion of the curve indicates where estimates are reliable: when the shaded area starts, the average number of predicted species after subsampling (1930) is within 20% of the number predicted with the full data (1657). Moreover, the standard deviation over the repetitions (353 species) is within 20% of the average prediction of 1930 species. 8.2 Experimental results 8.2.1 Dataset statistics We report in Tab. 8.1 some basics statistics and give a general overview of the dataset used in our experimental evaluation. More in detail, in the top part of the table, the total number of TCR sequences for each patient is displayed; we reported also how these counts are distributed in the three V families considered, 92 8 Bag of words analysis for T-Cell Receptors Table 8.1. Sequences and species (unique sequences) counts for the TCR dataset. V4 V9 V 17 Count of TCR sequences Positive patients Negative patients AP04 AP22 CP04 CN13 CN02 BN02 38031 3377 24832 12789 33727 2883 27537 32125 37942 112979 54912 25396 25337 13015 23497 72688 30868 4734 V 4/9/17 90905 48517 86271 198456 119507 33013 Cord samples Cord12 Cord11 Cord13 29607 19827 9627 47801 82367 10987 10868 73392 11325 88276 175586 31939 V4 V9 V 17 Count of Positive patients AP04 AP22 CP04 19666 2862 17277 9646 13608 15016 11113 8865 10428 species - raw scheme Negative patients Cord samples CN13 CN02 BN02 Cord12 Cord11 Cord13 9532 15772 1962 25966 14231 8114 35440 18910 8395 29072 54042 8911 32168 11508 2529 8396 43461 11075 V 4/9/17 40425 77140 25335 42721 46190 12886 63434 111734 28100 V4, V9 and V172 . In the bottom part, we show the counts of species (i.e. unique sequences) obtained with the Raw scheme. From the table it can be noted that there is a very large variability in the number of sequenced TCRs: unfortunately, this is a limitation due to technology, not biology. In order to make these numbers more robust, we employed the Low error and Cluster scheme for building the bag of words: for the Cluster scheme, we employed the default parameters of the algorithm. Counts of these low error and clustered sequences are reported in Tab. 8.2, top and bottom part respectively. 8.2.2 Shannon index analysis In our first evaluation, we estimated diversity using the Shannon index as described in Sec. 8.1.1. The main hypothesis that we want to confirm is that the diversity of cord samples Dcord is greater than the diversity of positive samples Dpos , which in turn is greater than the diversity of negative samples Dneg , in formula: Dcord > Dpos > Dneg . In Tab. 8.3 we reported the Shannon index computed on the bag of words for each patient – also keeping sequences divided per V family. Using this simple scheme, our hypothesis Dcord > Dpos > Dneg seems to be supported by the values in Tab. 8.3, although not statistically supported. 
However, when computing the Shannon index on the reconstructed histogram using the Unseen estimator, the hypothesis is strongly supported, with a p-value p<0.01 (excluding the cord sample and negative control with a much lower yield of high quality sequences; the excluded two samples, however, are also consistent with this inequality, as can be observed from the table). Since many assumption on Gaussianity of the data are missing, the p-value has been computed through randomization tests [78]. 2 According to the nomenclature in http://www.imgt.org 8.2 Experimental results 93 Table 8.2. Species counts after low error and clustering processing steps. V4 V9 V 17 Count of species - low error scheme Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 14038 2399 12649 7700 10585 1416 23315 11580 6909 6662 9208 9871 22972 12065 5406 21501 41019 7354 7347 4683 6427 17032 7373 1412 6108 24979 5749 V 4/9/17 28047 V4 V9 V 17 Count of species - cluster scheme Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 3358 1571 4990 3862 2384 737 14079 5667 5202 2210 3110 3142 5701 3437 1489 12414 23041 5842 2000 1879 1825 4459 1461 348 3708 9959 4911 V 4/9/17 7568 16290 6560 28947 9957 47704 14022 30023 7282 8234 2574 50924 30201 77578 38667 20012 15955 Encouraged by these results, we made a step forward by observing that the obtained counts have some noticeable differences in sample size: they span from around 24000 sequences (patient cord13, low error scheme) to 175000 (patient CN13, low error scheme). As discussed in Sec. 8.1.1, this may affect the estimation of the Shannon index. Therefore, we subsample each of them to get a random pool of sequences, so that each patient has the same number of sequences M : X M = min ctn (8.4) t n From this subsampled data, we computed statistics such as the number of species, Shannon index, or fraction of singletons (number of sequences occurring only once divided by the total number of sequences). Results are reported in Tab. 8.4, where the numbers displayed are an average over 100 different random subsampling. The last column in the table report the exact number of species resulted after subsampling in each case. From the tables it seems evident that Dcord > Dpos > Dneg (also confirmed by p<0.01). Moreover, positive patients have overall more species and a higher fraction of singletons, suggesting that the tail of the distribution is longer. Another representation that can be useful to assess samples diversity is a pie chart, such as the ones reported in Fig. 8.2. In the figure, each slice aggregates 512 species, sorted by descending frequencies: by visual inspection of the graph, it can be noted that positive patients distributions resemble more the ones of the cords. For example, in the first 3/4 of the charts (where the more rare species are distributed) there are more slices in positive patients than in negatives, this again suggesting the higher diversity of positive w.r.t. negatives. Finally, we wanted to assess the abundance levels of the different species, in order to have a clearer picture of how they accumulate (from rare species to high frequent ones) in HIV patients. Since every patient has the same sample size, the 94 8 Bag of words analysis for T-Cell Receptors probability mass occupied by one sequence is 1/(sample size=24k) everywhere: the same abundance threshold can be used for comparing different samples. In Fig. 
8.3 (a) we reported, for different abundance levels shown on the x axis, the number of species where sequences are at least that abundant (in other words, given a point on the x axis, the corresponding y indicates how many species are more frequent than the x point chosen). Confirming previous hypothesis, negative patients have more high frequent species (as the blue line is above the red one towards the right part of the plot), in contrast with positives having more rare species, and an overall higher diversity. Table 8.3. Shannon index D for the TCR dataset. V4 V9 V 17 Shannon index - low error scheme Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 12.74 11.12 13.13 12.60 11.95 10.19 14.39 13.22 11.82 10.04 11.67 11.03 11.14 10.59 9.02 13.65 14.80 11.48 11.19 11.45 10.68 11.90 10.22 8.93 12.35 13.81 12.48 V 4/9/17 13.05 12.68 13.11 12.70 12.41 10.00 15.12 15.64 13.39 Shannon index after unseen estimator - low error scheme Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 V4 14.07 14.23 14.45 14.08 12.92 11.77 16.81 14.66 15.59 V9 10.79 12.62 11.84 11.71 11.23 9.63 15.06 15.95 14.23 V 17 12.27 12.81 11.69 12.99 11.02 10.46 14.01 15.00 16.86 V 4/9/17 14.11 13.78 14.14 13.50 13.23 10.79 V4 V9 V 17 17.15 16.82 16.83 Shannon index - cluster scheme Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 10.87 10.37 11.74 11.49 10.10 9.13 13.32 12.02 11.10 8.69 10.31 9.66 9.82 9.20 7.66 12.70 13.78 10.89 9.53 10.02 9.04 10.21 8.43 7.08 11.46 12.40 12.14 V 4/9/17 11.40 11.35 11.67 11.28 10.79 8.59 14.13 14.45 12.82 Shannon index after unseen estimator - cluster scheme Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 V4 11.05 11.10 12.04 11.88 10.25 9.60 14.17 12.35 12.78 V9 8.80 10.51 9.84 9.89 9.27 7.77 13.16 14.21 12.39 V 17 9.70 10.40 9.20 10.33 8.46 7.24 12.10 12.66 14.37 V 4/9/17 11.58 11.64 11.91 11.41 10.94 8.74 14.76 14.81 14.56 8.2 Experimental results 95 Table 8.4. Statistics of the TCR data after subsampling. V4 V9 V 17 Average count of species - subsampled data Positive patients Negative patients Cord samples Sample size AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 1684 1812 1867 1853 1441 1416 2047 1934 1767 2087 3047 4075 3638 3257 3176 2684 6967 8460 7354 9958 1853 2296 1637 2026 1405 1412 3060 3153 3663 3718 V 4/9/17 10849 10775 11610 9576 9156 V4 V9 V 17 18858 19384 20012 24369 Average shannon index - subsampled data Positive patients Negative patients Cord samples Sample size AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 10.59 10.74 10.80 10.79 10.25 10.19 10.99 10.87 10.16 2087 9.69 11.17 10.50 10.29 9.99 8.73 12.51 12.92 11.48 9958 10.27 10.78 9.83 10.47 9.36 8.93 11.47 11.53 11.83 3718 V 4/9/17 12.43 12.39 12.47 11.83 11.78 V4 V9 V 17 7129 9.93 13.98 14.08 13.39 24369 Average fraction of singletons - subsampled data Positive patients Negative patients Cord samples Sample size AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 0.67 0.76 0.81 0.79 0.52 0.52 0.96 0.86 0.83 2087 0.25 0.30 0.28 0.26 0.26 0.22 0.55 0.74 0.69 9958 0.37 0.47 0.34 0.41 0.30 0.32 0.69 0.73 0.97 3718 V 4/9/17 0.34 0.34 0.38 0.32 0.30 0.23 0.66 0.66 0.78 24369 8.2.3 Nuanced patterns in the bag of words By looking at statistics of the whole sequences histogram, we concluded that Dcord > Dpos > Dneg . 
However, we wanted to investigate the possibility of a more nuanced picture in some specific parts of these distributions. After sorting the TCR counts in descending order (from high frequent variants to low frequent ones), we divided the histogram in 100 overlapping bins, each occupying 2% of the sequenced TCR total. In Fig. 8.4 we show the normalized partial Shannon index summations computed in these local portions of the histogram; the area under the curve equals the total Shannon index reported in Tab. 8.4. In every part of the spectrum, i.e. both for highly abundant (left part of the figure) and rare sequences (right part), the Shannon index for HIV+ dominates the one of HIV−, and in general supports the conclusion that Dcord > Dpos > Dneg.

8.2.4 Rarefaction curves

The rarefaction curves for the dataset are shown in Fig. 8.5. We stopped the curves at the level of 1/3 of the smallest sample (9809 sequences) to make them comparable across samples. These rarefaction curves indicate with statistical significance (considering the number of species at x=9809, p=0.02) that TCR diversity in positive patients is higher than in negative ones. We carried out the same analysis by keeping the three V families separate, and the result is shown in Fig. 8.6. Because we only have three patients of each class, the p-value is weakened (p=0.04 at x=500), even if this result still supports statistically the difference between positives and negatives.

Fig. 8.2. Pie charts for the 9 samples after subsampling. In particular, the first column comprises positive samples ((+) AP04: 10849 species, (+) AP22: 10775 species, (+) CP04: 11610 species), the second one negative samples ((−) CN13: 9576 species, (−) CN02: 9156 species, (−) BN02: 7129 species), and the third one cord samples (Cord12: 18858 species, Cord11: 19384 species, Cord13: 20012 species).

8.2.5 Total number of species estimation

We assessed the performance of the Unseen estimator to reconstruct the total number of species in our TCR populations. Results are reported in Tab. 8.5.

Fig. 8.3. (a) The y axis indicates the number of species above the abundance threshold on the x axis (observed species per patient: AP04 12612, AP22 12385, CP04 13586, CN13 11177, CN02 10577). (b) The same representation, showing the y axis in log domain for a better insight into high frequent species. (c) and (d) depict the same graphs after the histogram of TCR species has been reconstructed with the Unseen estimator (estimated species: AP04 119223, AP22 120426, CP04 162350, CN13 114709, CN02 103192).
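The quantity plotted in Fig. 8.3 – the number of species whose frequency exceeds a given abundance threshold, i.e. the W_α of Eq. 8.3 – can be computed in a few lines; this is a sketch, with an illustrative function name.

```python
import numpy as np

def species_above_threshold(counts, thresholds):
    """Number of species whose frequency exceeds each abundance threshold.

    This is the quantity W_alpha of Eq. 8.3, plotted on the y axis of Fig. 8.3
    for a grid of abundance thresholds alpha.
    """
    counts = np.asarray(counts, dtype=float)
    f = counts / counts.sum()
    return np.array([(f > alpha).sum() for alpha in thresholds])
```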
However, as we will demonstrate in the next section, these estimates cannot be considered reliable, as we do not have enough samples to make claims about the total number of species.

8.2.6 Reliability of the bag of words

The following analysis, described in Sec. 8.1.2, is aimed at detecting the abundance threshold at which we can reliably estimate – with the Unseen estimator – the number of different species having abundance above this threshold. Fig. 8.7 reports the whole reliability study, evaluated for every patient. As a general comment, it can be noted that the reliability threshold is higher in positive patients, again suggesting that their species histogram comprises more rare variants (and therefore an increased difficulty for the estimator to reconstruct the number of unseen species). Finally, we considered a global threshold, at which the predictions for every positive and negative patient are reliable: this threshold is set roughly to 2−4 (as this is the highest threshold, found in BN02). In Tab. 8.6, we reported the number of species predicted with the Unseen estimator above this abundance threshold.

Fig. 8.4. From the subsampled species histogram (sorted by descending frequencies) of each patient, we computed the Shannon index in contiguous bins centered in α, each region occupying 2% of total TCRs (curves for positive, negative, and cord samples).

Fig. 8.5. Rarefaction curves computed on the nine samples (x axis: individual sequences; y axis: species).

Fig. 8.6. Rarefaction curves computed on the nine samples, divided per V family (V4, V9, and V17).

Table 8.5. Number of species estimated with Unseen for the TCR data.
V4 V9 V 17 Estimated number of species - raw data Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 449131 21417 126559 49168 108600 23092 251454 75056 368308 64853 89915 137566 336466 113224 57045 187519 285919 219232 77883 73365 96022 310024 149969 32634 44976 447732 285609 V 4/9/17 410614 186552 341697 652815 352962 112926 453996 712770 1037168 V4 V9 V 17 Estimated number of species - low error data Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 206095 48092 68781 35568 71112 5992 165935 48358 174783 49537 84808 72943 197745 83000 34235 217107 162220 87685 81018 28639 57563 265697 58566 34382 34947 170585 119167 V 4/9/17 303702 112541 204755 380845 232023 64665 V4 V9 V 17 Estimated number of species - clustered data Positive patients Negative patients Cord samples AP04 AP22 CP04 CN13 CN02 BN02 Cord12 Cord11 Cord13 11288 2692 12176 6528 7185 1273 35082 7568 33341 5274 8912 9737 17758 11628 4681 30393 43808 30374 4162 7814 4844 11439 3253 869 10288 22402 26568 V 4/9/17 20403 20328 26811 40787 22956 6264 728317 362088 334221 82412 72102 Table 8.6. Unseen prediction above the reliability threshold of 2−4 . Positive patients AP04 AP22 CP04 769 1013 666 Negative patients CN13 CN02 BN02 770 763 281 87530 Patient AP04 4 4 x 10 7.633e−05 1 0 Patient CN13 4 4 x 10 2 1.705e−04 1 0 2 4 6 10−4 Sequence abundance threshold 0 x 10 2 2.645e-05 1 0 0 x 10 Patient Cord12 0 x 10 2 2.075e−04 1 Patient Cord11 0 6.386e-04 1 −4 2 4 6 10 Sequence abundance threshold x 10 2 2.321e−04 1 0 Patient Cord13 # reliably observed species: 45 3 0 −4 2 4 6 10 Sequence abundance threshold 4 4 # reliably observed species: 15 2 Patient BN02 3 0 2 4 6 10−4 Sequence abundance threshold x 10 −4 2 4 6 10 Sequence abundance threshold # reliably observed species: 266 1 4 Number of species Number of species 0 4 4.283e−05 4 3 0 1.369e−04 1 4 2 # reliably observed species: 3 0 Patient CN02 3 0 2 4 6 10−4 Sequence abundance threshold 4 4 2 # reliably observed species: 1931 3 Number of species Number of species # reliably observed species: 2557 3 0 2 4 6 10−4 Sequence abundance threshold 4 4 Patient CP04 # reliably observed species: 925 3 Number of species 2 0 4 x 10 4 # reliably observed species: 1119 3 Number of species Number of species # reliably observed species: 2059 Patient AP22 Number of species x 10 Number of species 4 4 −4 2 4 6 10 Sequence abundance threshold 3 2 2.876e−04 1 0 0 −4 2 4 6 10 Sequence abundance threshold Fig. 8.7. Reliability analysis on the 9 patients. In the legend, the number of observed species which can be reliably estimated is reported. Final remarks on HIV modeling In this part of the thesis we promoted the use of bag of words models to capture two aspects of the HIV infection, namely antigen presentation and TCR variation. In Chap. 7, we derive a bag of words representation to characterize the view that the immune system has of the invading pathogens, covering all aspects of the pipeline proposed in Chap. 2. We also demonstrated that the counting grid model seems to be especially well suited in the modeling stage, providing stronger predictions than what can be found in biomedical literature. Our experiment showed that cellular presentation of the GAG protein explains more than 13.5% of the log viral load. Although viral load varies dramatically across patients for a variety of reasons, e.g. 
gender, previous exposures to related viruses, etc., detection of statistically significant links between cellular presentation and viral load is expected to have important consequences to vaccine research [93]. In Chap. 8, we investigate a critical aspect usually overlooked, namely the quality of the bag of words representation of TCR populations. We derived several measures of diversity that can be employed to sudy differences between HIV patients, each one designed to be as robust as possible with respect to i) small sample sizes and ii) abundance of rare species. From an applicative point of view, we reached the conclusion that TCR diversity in positive infected children is even higher than in negative ones, providing potential new insights into the thymus function (i.e. the gland responsible for T-cell production), which may respond differently if the infection occurs in early development of the fetus. From a methodological point of view, we provided some evidence that the current sequencing technology does not allow for accurate estimates of the total population size: it seems that the sample size obtained (which resulted in a huge amount of rare species observed only once) is too poor to draw statistically significant conclusions. Concluding, we gave a general scheme that could be employed in other validation contexts (for example, in ecology where these types of analysis are crucial for characterizing an ecosystem). Part III Protein remote homology detection 9 Introduction One of the cornerstones of bioinformatics is the process of identifying homologous proteins, i.e. to detect if two proteins have a similar function or have an evolutionary relationship. The establishment of this relationship is usually done by measuring the similarity between the protein sequences. Through this comparative analysis, one can draw inferences regarding whether two proteins are homologous. It is important to distinguish similarity, which is a quantitative measure of how related two sequences are, from homology, which is the putative conclusion reached based on the assessment of their similarity [16]. Usually, homology is inferred when two sequences share more similarity than would be expected by chance; when significant similarity is observed, the simplest explanation is that the two sequences did not arise independently, but they derived from a common ancestor [169]. However, homologous sequences do not always share significant sequence similarity; there are thousands of homologous proteins whose pairwise alignment is not significant, but other evidences (such as the molecular three dimensional structure) clearly prove their homology. In these cases, detecting the homology on the basis of sequence alone is very challenging: in the literature, this problem of detecting homology in the presence of low sequence similarity is referred to as remote homology detection. A large number of computational approaches have been proposed for solving this task (an analysis of the state of the art is presented in Sec. 9.2). Among others, the bag of words approach has already been investigated in the literature: a natural analogy can be made between biological protein sequences (that are essentially strings composed by symbols from a 20-letters alphabet) and text documents. Under this parallelism, biological “words” that are usually extracted are called Ngrams or Kmers [125], such as the ones depicted in Fig. 9.1: they are short contiguous subsequences of N symbols. 
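A minimal sketch of this extraction follows (the prose example below walks through the same toy sequence); the function name and the use of all 20^N possible Ngrams as the dictionary are illustrative choices made for this sketch.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20-letter protein alphabet

def ngram_bag(sequence, n=2):
    """Bag of words vector over all 20^n possible Ngrams (overlapping)."""
    dictionary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    index = {w: i for i, w in enumerate(dictionary)}
    bag = [0] * len(dictionary)
    for i in range(len(sequence) - n + 1):
        bag[index[sequence[i:i + n]]] += 1
    return bag, dictionary

bag, dictionary = ngram_bag("MDCCDC", n=2)   # 400-dimensional vector; e.g. "DC" occurs twice
counts = Counter({w: c for w, c in zip(dictionary, bag) if c > 0})
```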
Consider for example the sequence MDCCDC, and suppose that we define as words 2grams, i.e. short subsequences of 2 amino acids. Thus, the dictionary contains 20^2 = 400 2grams. The 2grams extracted from the example sequence are the ones in the multiset {MD, DC, CC, CD, DC} (usually overlapping Ngrams are considered). Then, the bag of words representation is a vector where each element corresponds to a 2gram in the dictionary and the value of that element is the number of times the 2gram appears in any position of the sequence.

Fig. 9.1. Ngram definition. Given a sequence (here MDCCDC) and a fixed value of N (N = 2, 3, 4 are shown), Ngrams are short consecutive (and overlapping) subsequences of N symbols.

This part of the thesis presents some contributions in this context, proposing novel bag of words approaches that can overcome the limits of existing approaches. In particular, in Chap. 10 we propose a novel bag of words representation for protein sequences, which is enriched with evolutionary information, a kind of information not fully exploited in this context. Then, in Chap. 11 we propose a multimodal strategy to integrate structural information (a richer, yet difficult to obtain modality which is never used) into existing bag of words approaches for sequences. Before going into the details of the contributions, we will clearly formalize the problem and the current state of the art.

9.1 Background: protein functions and homology

Proteins are highly complex and functionally sophisticated molecules, which perform a vast array of functions within living organisms. Despite their diversity, from a structural point of view they all consist of one or more chains, where each building block of the chain is taken from a set of 20 amino acids. Thus, the simplest representation of a protein (called primary structure) is simply the sequence of its amino acids. The primary structure drives the folding and intramolecular bonding of the linear amino acid chain, which ultimately determines the protein's unique three-dimensional shape. Usually, stable patterns of folding occur: these regular substructures, known as alpha helices and beta sheets (see Fig. 9.2 for an example), constitute the secondary structure of a protein. Most proteins contain several helices and sheets, in addition to other less common patterns. The ensemble of formations and folds in a single linear chain of amino acids – sometimes called a polypeptide – constitutes the tertiary structure of a protein. Finally, some proteins are made by the aggregation of multiple polypeptide chains or subunits, and in this case the protein is said to have a quaternary structure. Fig. 9.2 depicts graphically the four different types of protein structures.

Fig. 9.2. Levels of protein structure, from primary (amino acid sequence) to quaternary (complex of functionally-folded three-dimensional structures); the examples shown include an alpha helix, a beta sheet, the P13 protein, and hemoglobin.

A more formal definition of protein (taken from [41]) is "a biologically functional molecule that consists of one or more polypeptides (linear chains of many amino acids), each folded and coiled into a specific three-dimensional structure".
Proteins carry out an incredibly vast set of functions, which spans from enzymatic catalysis, transport and storage of other substances, structural scaffolding for cells and tissues, control of biochemical reactions and immune response, to regulation of growth and cell differentiation [41]. Such functions are uniquely determined by the protein's specific three-dimensional structure [41]. This is because the protein structure allows molecular recognition: in almost every case, the function of a protein depends on its physical interaction with other molecules. For example, enzymes are proteins that catalyze biochemical reactions. The function of an enzyme relies on the structure of its active site, a cavity in the protein with a shape and size that enable it to fit the intended substrate, and with the correct chemical properties to bind the substrate efficiently. In other words, the function of an enzyme is possible because both the enzyme and the substrate possess specific complementary geometric shapes that fit into one another [77]. Not all proteins are enzymes, but all bind to other molecules in order to complete their tasks, and the precise function of a protein depends on the way its exposed surfaces interact with those molecules. In conclusion, we can generally assert that proteins with similar shape share similar function. The collection of proteins that are similar in shape and perform similar functions is said to comprise a protein family. Proteins from the same family also often have long stretches of similar amino acid sequences within their primary structure. These stretches have been conserved through evolution and are vital to the catalytic function of the protein. Summarizing, the key point is the following: to gain insights into the function of a newly discovered protein, it is of primary importance to identify its family, or some homologous members whose function has already been characterized in the biological literature. Given what we presented so far, the most reliable way to determine a protein family is to analyze its three-dimensional (3D) structure, i.e. the Cartesian coordinates of every atom of the protein. Unfortunately, acquiring such coordinates requires sophisticated experimental techniques. The most common method is X-ray crystallography [109], which is based on the scattering of X-rays by the electrons of the crystal's atoms. Despite advances in techniques for determining protein structure, the structures of many proteins are still unknown. On the contrary, determining the amino acid sequence of a protein is a more straightforward task, easily doable with current technology [51]. One can appreciate this fact by looking at the number of discovered sequences and structures stored in public databases. A comprehensive, freely accessible resource of protein sequence and functional information, called Uniprot (http://www.uniprot.org/) [51], contains (as of January 2015) around 547000 manually annotated (i.e. of high quality and confidence) sequences. The corresponding database of experimentally determined 3D structures, the Protein Data Bank (PDB, http://www.rcsb.org/), contains (as of January 2015) around 35000 solved structures, which can be retrieved in the form of Cartesian coordinates for each atom of the protein [18]. Therefore, to determine homology, a researcher usually has to resort to the analysis of the protein sequences alone.
However, in the protein remote homology detection scenario, homologous proteins share low sequence similarity: in such cases, detecting the homology becomes a very challenging problem, for which sophisticated techniques should be derived. The next section clearly formalizes the problem and the current state of the art.

9.2 Computational protein remote homology detection

From the computational point of view, protein remote homology detection is a crucial and widely studied problem, which has assumed great importance in recent years. Most of the approaches can be split into three basic groups, according to the taxonomy proposed in [128]: i) pairwise sequence comparison algorithms; ii) generative models for protein families; iii) discriminative classifiers. In the first, simplest case, similarities between proteins are evaluated via pairwise sequence alignment, a technique aimed at finding the best superimposition between two sequences. In practice, a sequence alignment is obtained by inserting spaces inside the sequences (the so-called gaps) in order to maximize the point-to-point similarity between them [8]. A simple example is shown in Fig. 9.3.

Fig. 9.3. Alignment of two sequences (e.g. MV---FFCL against MVSSSFFSI): gaps are inserted so that matching positions are superimposed, and each position contributes a match, mismatch, or gap score.

A huge number of algorithms for sequence alignment exist in the literature [8, 9, 156, 168, 216, 225], which can be classified in several different categories. The main taxonomy divides the approaches into three categories: global alignment methods, which are aimed at finding the best overall alignment between two sequences; local alignments, which detect related segments in a pair of sequences; and multiple alignments, which are aimed at simultaneously aligning more than two sequences. Among the algorithms proposed in the past for pairwise alignment, the Needleman-Wunsch [156] and Smith-Waterman [216] algorithms are the most accurate, whereas heuristic algorithms such as BLAST [8] and FASTA [168] trade reduced accuracy for improved efficiency. All of these techniques heavily rely on a fundamental parameter, the substitution matrix, which encodes the biological knowledge and assigns a score to matches/mismatches based on the rate at which one character in a sequence is likely to mutate into another (the higher the score, the more likely the substitution). Another important parameter, the gap penalty, is specified by a pair of values representing the cost of inserting a gap and of extending an existing one. More advanced approaches have then obtained higher accuracy by collecting statistical information from a set of similar sequences. One of the most famous methods, PSI-BLAST [9], uses BLAST to iteratively build a probabilistic profile of a query sequence and obtain a more sensitive sequence comparison score. Briefly, a profile derives from the results of a standard sequence alignment (through BLAST): these results are combined into a general profile which summarizes the significant features present in the retrieved sequences. A query against the protein database is then run using this profile instead of a single sequence, and a larger group of proteins is found as a result of this new query. This larger group is used to construct another profile, and the process is iterated. By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.
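As a concrete illustration of the pairwise alignment machinery underlying all of these methods, the following minimal Python sketch computes a Needleman-Wunsch global alignment score with a toy match/mismatch scheme and a linear gap penalty (illustrative only: real tools use substitution matrices such as BLOSUM62 and affine gap costs, and also recover the aligned sequences via traceback).

```python
def needleman_wunsch_score(seq_a, seq_b, match=5, mismatch=-1, gap=-10):
    """Global alignment score with a toy match/mismatch scheme and linear gaps."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # dp[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap              # prefix of seq_a aligned against gaps
    for j in range(1, cols):
        dp[0][j] = j * gap              # prefix of seq_b aligned against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # match / mismatch
                           dp[i - 1][j] + gap,       # gap in seq_b
                           dp[i][j - 1] + gap)       # gap in seq_a
    return dp[-1][-1]

# Toy run on the two (ungapped) sequences of Fig. 9.3; parameters are illustrative.
print(needleman_wunsch_score("MVFFCL", "MVSSSFFSI"))
```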
The second category in the taxonomy relies on generative models. The most famous model employed in this context is the profile hidden Markov model (HMM) [108], which uses examples of proteins in a family to train a generative model that characterizes the family [184]. Generative models improve upon profile-based methods by iteratively collecting homologous sequences from a large database and incorporating the resulting statistics into a central model. All of the resulting statistics, however, are generated from positive examples, i.e., from sequences that are known or posited to be evolutionarily related to one another. Because the homology detection task can be seen as the problem of discriminating between related and unrelated sequences, we can employ discriminative approaches, which explicitly model the difference between these two sets of sequences. In this context, the most employed classifier is the support vector machine (SVM) [52], which uses both positive and negative examples, and has provided state-of-the-art performances in this context [128]. Many SVM-based approaches have been proposed: SVM-Fisher [97, 98] couples an iterative HMM training scheme with a discriminative classifier; SVM-LA [202] derives a string kernel obtained after pairwise sequence alignment; SVM-k-spectrum extracts Ngrams from the sequences and feeds the bag of words representation to the SVM. Other examples include Mismatch-SVM [124], SVM-pairwise [128], SVM-I-sites [95], SVM-SW [202], and others. A more detailed comparison of SVM-based methods has been presented in [201]. Apart from the specific classifier employed, it is important to notice that particular success has been obtained by employing representations based on sequence profiles. As explained before, the profile of a sequence S = s_1 ... s_L is the result of a multiple sequence alignment between S and its closest neighbors found by a database search (such as the aforementioned PSI-BLAST [9]). The information contained in the profile may be very useful, and has been exploited for protein remote homology detection in [133], where a novel representation called top-Ngram is extracted by looking at the N most frequent amino acids in each position of the profile. Another profile-based approach is proposed by Liu et al. in a recent paper [134], where a profile-based sequence is derived by rewriting each amino acid of the original sequence with the most probable one according to the profile, and standard Ngrams (i.e. groups of N consecutive amino acids in this new sequence) are extracted and used to classify sequences. In both cases, a bag of words approach is employed: the feature vector is obtained by counting the number of times each Ngram (or top-Ngram) occurs in the "profile-enriched" sequence.

9.3 Contributions

The profile-based approaches achieved state-of-the-art prediction performances, and therefore seem to hold high potential. However, it is important to observe that the profile information may be further exploited: in particular, in such approaches only few amino acids of the profile are considered – one in the approach of [134], N in the top-Ngram technique of [133].
Moreover, such approaches do not use the frequencies associated with the profile amino acids: for example, in the approach of [134], every sequence amino acid is replaced by the most frequent profile amino acid, no matter how frequent it is (simply the most frequent); by doing so, there is no difference between a situation where strong conservation throughout evolution is present (e.g. the frequency of the top amino acid is near 1 and all the others are near 0) and a situation where this conservation is not present (e.g. the frequencies are more or less identical among different amino acids). The same reasoning also holds for the top-Ngram approach of [133]. We contribute by proposing a novel representation, called soft Ngram, which is able to take all these aspects into consideration. Soft Ngrams are extracted from the profile of a sequence, explicitly considering and capturing the frequencies in the profile, thus reflecting the evolutionary history of the protein. We then propose two modeling approaches to derive feature vectors from the soft Ngram representation, employable as input for the SVM discriminative classifier. Starting from the bag-of-words model, we promote the use of topic models in the context of protein remote homology detection: we derived a soft PLSA model, which deals with the proposed characterization of sequences. In a thorough experimental evaluation on three benchmarks, we demonstrated that the soft Ngram representation is more descriptive and accurate than other profile-based approaches, being also superior to almost all the approaches proposed in the literature. Then, an alternative route is explored: the main idea is that the 3D structure of a protein (when available) represents a source of information which is typically disregarded by classical approaches. In particular, we provided some evidence that it is possible to improve sequence-based models by exploiting the available (even partial) 3D structures. The approach, based on topic models, allows the derivation of a common, intermediate feature space – the topic space – which embeds sequences while being at the same time "structure aware". We experimentally demonstrate that, in cases where the sequence modality alone fails, introducing only 10% of the training structures results in significant improvements of the detection scores. Moreover, we applied the proposed approach to model a GPCR protein, finding evidence of structural correlations between sequence Ngrams: such correlations cannot be recovered employing a sequence-only technique. An interesting conclusion is that this multimodal scheme seems to be particularly suitable for those situations where the sequence modality alone fails.

10 Soft Ngram representation and modeling for protein remote homology detection

This chapter presents the novel profile-based representation for sequences, called soft Ngram. This representation, which extends the traditional Ngram scheme, permits the extraction of information from all the symbols in the profile, also considering the associated evolutionary frequencies: in practice, this is achieved by extracting Ngrams from the whole profile and subsequently equipping them with a weight directly computed from the corresponding evolutionary frequencies. This chapter also illustrates two different approaches to model the proposed representation and derive a feature vector, which can be effectively used for discriminative classification using a support vector machine (SVM).
A thorough evaluation on three benchmarks demonstrates that the new approach outperforms other Ngram-based methods, and achieves state-of-the-art results when compared to a broader spectrum of techniques.

10.1 Profile-based Ngram representation

This section reviews the approaches of [133] and [134], which derive an Ngram representation on the basis of the profile. In both cases, the starting point is the profile of a sequence S = s_1 ... s_L, which is represented by a matrix M:

M = \begin{pmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,L} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ m_{20,1} & m_{20,2} & \cdots & m_{20,L} \end{pmatrix}    (10.1)

where 20 is the total number of standard amino acids, L is the length of the sequence, and m_{i,l} reflects the probability of amino acid i (i = 1, ..., 20) occurring at sequence position l (l = 1, ..., L) across evolution. Thus, the elements in each column of M add up to 1. Once the profile of a sequence is computed, the frequencies in each column of M are sorted in descending order, with the resulting sorted matrix denoted M̃ (right part of Fig. 10.1b). An entry m̃_{i,l} contains the frequency of the i-th most probable amino acid in position l, which is denoted s̃_{i,l}. This matrix is then employed to extract the Ngram representation. The two methods [133, 134] employ different strategies to extract Ngrams from the profile matrix M̃:

Column-Ngram [133]. In this approach, called top-Ngram in the original paper, each column of M̃ is considered independently. Given a column l, a column-Ngram is the concatenation of the most probable N amino acids in position l, and is denoted by v_l = s̃_{1,l} ... s̃_{N,l}.

Row-Ngram [134]. In this approach, only the first row of M̃ is considered (i.e. only the most probable/frequent amino acid in each position of the profile): the original sequence is rewritten by substituting each amino acid with the corresponding most frequent amino acid of the profile. Then, Ngrams are extracted as in other approaches [125], i.e. by considering N consecutive amino acids. Summarizing, a row-Ngram v_l is composed of the amino acids s̃_{1,l} ... s̃_{1,l+N-1} – note that neighboring Ngrams in the sequence overlap by N - 1 amino acids.

From the description above it seems evident that none of these approaches fully exploits the complete profile information contained in M̃: in both cases only few amino acids of M̃ are considered – 1 for row-Ngrams, N for column-Ngrams; moreover, the elements of M are used only to determine the ranking, completely disregarding the evolutionary information contained in the values of M: the approaches do not make any difference between a situation where strong conservation throughout evolution is present (the top value of M̃ is near 1, all the others are near 0) and a situation where this conservation is not present (the values of M̃ are more uniformly distributed). We will see how both these aspects are considered by the proposed representation. In any case, once extracted, the set of Ngrams of a given sequence is summarized with a bag of words vector, obtained by counting the number of times each possible Ngram appears in the sequence. More in detail, given all distinct Ngrams {v} – the dictionary – the bag of words c is defined as the vector of length V = |{v}| = 20^N, where an entry c(v) indicates the number of times the dictionary Ngram v is present in the set of Ngrams extracted from the sequence. This vector, computed for every sequence, is used for classification.
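To make the two extraction schemes concrete, the following minimal sketch (Python with NumPy; illustrative only, not the thesis implementation) sorts each column of a toy profile and derives the column-Ngrams (top-Ngrams) and the row-Ngrams; it assumes the rows of M follow a fixed amino acid ordering.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def sort_profile(M):
    """Sort each column of the 20 x L profile in descending order.
    Returns the sorted frequencies M_tilde and the matrix S_tilde of the
    corresponding amino acid symbols."""
    order = np.argsort(-M, axis=0)                  # row indices sorted per column
    M_tilde = np.take_along_axis(M, order, axis=0)
    S_tilde = np.array(AMINO_ACIDS)[order]          # 20 x L matrix of symbols
    return M_tilde, S_tilde

def column_ngrams(S_tilde, N):
    """Top-Ngrams [133]: concatenation of the N most probable amino acids per column."""
    return ["".join(S_tilde[:N, l]) for l in range(S_tilde.shape[1])]

def row_ngrams(S_tilde, N):
    """Row-Ngrams [134]: Ngrams over the sequence of most probable amino acids."""
    top_row = "".join(S_tilde[0, :])
    return [top_row[l:l + N] for l in range(len(top_row) - N + 1)]

# Toy profile for a sequence of length 6 (each column sums to 1).
rng = np.random.default_rng(0)
M = rng.random((20, 6)); M /= M.sum(axis=0)
M_tilde, S_tilde = sort_profile(M)
print(column_ngrams(S_tilde, 2), row_ngrams(S_tilde, 2))
```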
10.2 The proposed approach

In this section the proposed approach is described: we first present the soft Ngram representation and its major differences with respect to the methods presented in the previous section; then, the two modeling strategies used to derive a fixed-length feature vector are detailed. A scheme graphically depicting the pipeline of the proposed approach is shown in Fig. 10.1.

Fig. 10.1. The proposed soft Ngram representation. (a) The profile M of a sequence is computed with PSI-BLAST. (b) Each column in the profile is sorted, and soft Ngrams are extracted (row- or column-wise) from S̃. (c) After having built the dictionary, each soft Ngram representation vector w_l is computed for the sequence by combining frequency values in the sorted profile matrix M̃. (d) The final feature vector is derived with either of the two proposed modeling schemes – soft bag-of-words or soft PLSA.

The basic idea behind the soft Ngram representation is that in a given position l of a sequence there are many plausible Ngrams, each one with a different probability driven by evolution. To give an illustrative example, consider the situation where the Ngram size is 1. Assume that in a given position l of the sequence the top two amino acids are A, with frequency 0.8, and R, with frequency 0.2. Previous Ngram approaches would consider only one amino acid, A, which is the most probable. In our perspective, we consider two amino acids: A, whose "weight" is 0.8, and R, with weight equal to 0.2. This permits the encoding of the whole evolutionary information contained in the profile (taking into account also the frequency) and discriminates between situations where the two top frequencies are different (e.g. if A and R had frequencies 0.6 and 0.4, respectively, standard profile-based approaches would again extract only A, whereas our representation treats these situations as different). More in detail, the representation is obtained in two steps: Ngram extraction and weight assignment.

Ngram extraction. Ngrams are extracted by tailoring the previous definitions of column- and row-Ngrams in the following way:

• Soft column-Ngram. Ngrams are extracted from the whole column (not only from the top N positions): in particular, soft column-Ngrams are of the form v_{i,l} = s̃_{i,l} ... s̃_{i+N-1,l} for all i in [1, ..., 20 - N + 1]. For each column, Ngrams are extracted with overlap degree N - 1.
• Soft row-Ngram. Ngrams are extracted from all possible rows of M̃: soft row-Ngrams are of the form v_{i,l} = s̃_{i,l} ... s̃_{i,l+N-1} for all i in [1, ..., 20]. For each row, Ngrams are extracted with overlap degree N - 1.

Weight assignment. The goal is to assign a weight to each soft column- or row-Ngram extracted in the previous step.
Such a weight should reflect the evolutionary frequencies of the amino acids which compose the Ngram. Inspired by the score fusion technique of [117], we propose two simple strategies to compute this quantity – which we denote w_l(v):

• Sum strategy, where the profile frequencies of the amino acids constituting the Ngram are summed;
• Prod strategy, where the profile frequencies of the amino acids constituting the Ngram are multiplied.

10.2.1 Modeling: soft bag of words

The goal of the modeling stage is to derive a single feature vector that characterizes any given sequence S^t. We propose two methods, one called soft bag-of-words, the other soft PLSA. The former is presented in this section, whereas the latter is described in the next section. In the classical bag of words representation (as repeatedly noted throughout this thesis), the feature vector is obtained by counting the number of times each Ngram of the dictionary occurs in the sequence. In the proposed soft bag-of-words, the feature vector is again obtained by a counting process, which however considers the weights: each Ngram extracted from the sequence does not count as "1", but as much as its weight. In other words, for each soft Ngram of the dictionary, we sum the weights of all its occurrences in the set of Ngrams extracted from the profile. In formulae, an entry c^t(v) of the feature vector characterizing sequence t is

c^t(v) = \sum_l w_l^t(v)    (10.2)

Once this quantity is computed for each element v in the dictionary, the resulting vector c^t represents the feature vector that can be used in a classification setting.

10.2.2 Soft PLSA

A more sophisticated way of modeling the proposed representation stems from the already stressed consideration that objects represented as counts may be successfully modeled in a probabilistic way, using e.g. the already seen topic models. For example, the PLSA model presented in Chap. 4 seems a suitable model also in the context of protein remote homology detection: in this peculiar scenario, documents correspond to sequences and Ngrams correspond to words. From a probabilistic point of view, a sequence may be seen as a mixture of topics, each one providing a probability distribution over Ngrams. Standard Ngram representations [125], including the profile-based approaches of [133] and [134], can be directly modeled using PLSA. This model, however, cannot be applied as is to our proposed soft representation, due to the presence of weights. Here we propose an adaptation of PLSA, which we call soft PLSA, able to directly exploit the information contained in the weights. In essence, the soft PLSA borrows the same metaphor of classic PLSA: given the set of soft Ngrams extracted from the profile of a sequence S^t, the presence of a particular soft Ngram v in such a set is mediated by a latent topic variable z ∈ Z = {z_1, ..., z_K}. However, in our formulation the probability of observing a particular pair (v, S^t), i.e. the log-likelihood \log p(w_l^t = v, S^t), is weighted by the soft value w_l^t(v):

\log p(w_l^t = v, S^t) = w_l^t(v) \cdot \log \left[ p(S^t) \left( \sum_{k=1}^{K} \beta_{vk}\, \theta_k^t \right) \right]    (10.3)

where \beta_{vk} = p(v | z_k) and \theta_k^t = p(z_k | S^t). In practice, the topic z_k is a probabilistic co-occurrence of soft Ngrams encoded by the distribution \beta_{vk}. Intuitively, \theta_k^t measures the level of presence of each topic z_k in the sequence S^t. On the other hand, \beta_{vk} expresses how much the soft Ngram indexed by v in the dictionary is related to topic z_k.
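Before deriving the likelihood and the learning algorithm, the following minimal sketch (Python with NumPy; illustrative only, with our own function names) shows how the soft row-Ngrams could be weighted with the sum or prod strategy and accumulated into the soft bag-of-words of Eq. (10.2).

```python
import numpy as np

def soft_row_ngrams(M_tilde, S_tilde, N):
    """Yield (ngram, frequencies) pairs for every row i and position l of the
    sorted profile: ngram = s~_{i,l} ... s~_{i,l+N-1}."""
    rows, L = S_tilde.shape
    for i in range(rows):
        for l in range(L - N + 1):
            ngram = "".join(S_tilde[i, l:l + N])
            freqs = M_tilde[i, l:l + N]
            yield ngram, freqs

def soft_bag_of_words(M_tilde, S_tilde, N, strategy="sum"):
    """Soft BoW (Eq. 10.2): each occurrence contributes its weight instead of 1."""
    bow = {}
    for ngram, freqs in soft_row_ngrams(M_tilde, S_tilde, N):
        weight = freqs.sum() if strategy == "sum" else freqs.prod()
        bow[ngram] = bow.get(ngram, 0.0) + weight
    return bow

# M_tilde, S_tilde as produced by sort_profile() in the previous sketch, e.g.:
# soft_bow = soft_bag_of_words(M_tilde, S_tilde, N=2, strategy="prod")
```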
Under this model, the full data log-likelihood for a training set of T sequences is

L = \sum_{t=1}^{T} \left[ \log p(S^t) + \sum_{v=1}^{V} \left( \sum_{l=1}^{L^t} w_l^t(v) \right) \log \sum_{k=1}^{K} \beta_{vk}\, \theta_k^t \right] = \sum_{t=1}^{T} \left[ \log p(S^t) + \sum_{v=1}^{V} c^t(v) \cdot \log \sum_{k=1}^{K} \beta_{vk}\, \theta_k^t \right]    (10.4)

where we highlighted the fact that the value c^t(v) is the sum over all weights assigned to the different occurrences of soft Ngram v in the sequence. Finally, p(S^t) accounts for sequences of different lengths. Given the training set, we have to devise an algorithm which learns the parameters of the model, β and θ, such that the log-likelihood of the observations is maximized. Our learning strategy is based on the exact Expectation-Maximization (EM) algorithm, which, after initializing the parameters β and θ, iterates the following two steps:

• the E-step, which computes the posterior over the topics q_{kvt} = p(z_k | v, S^t), given the current estimate of the model;
• the M-step, where β, θ, and the prior over sequences p(S^t) are re-estimated given the q obtained in the previous E-step.

For a more detailed review of the EM algorithm, interested readers may refer to Chap. 2 or to [81]. In our context, the E-step formula is computed with the Bayes rule starting from the values of β and θ:

p(z_k | v, S^t) = q_{kvt} = \frac{\beta_{vk}\, \theta_k^t}{\sum_{k'=1}^{K} \beta_{vk'}\, \theta_{k'}^t}    (10.5)

The M-step rules for updating β, θ, and p(S^t) are as follows:

\beta_{vk} \propto \sum_{t=1}^{T} q_{kvt} \sum_{l=1}^{L^t} w_l^t(v)    (10.6)

\theta_k^t \propto \sum_{v=1}^{V} q_{kvt} \sum_{l=1}^{L^t} w_l^t(v)    (10.7)

p(S^t) \propto \sum_{v=1}^{V} \sum_{l=1}^{L^t} w_l^t(v)    (10.8)

where the symbol ∝ indicates that the result of each formula should be normalized so that the probability constraint (the sum should be 1) is satisfied. With the trained model, inference can be performed on an unknown sequence S^test, in order to estimate its topic proportion vector θ^test. Such a quantity may be computed with a single M-step iteration. Following again the hybrid generative-discriminative scheme [99, 173], we decided to employ as feature vector for a given sequence S^t the corresponding topic proportions vector θ^t = [θ_1^t, ..., θ_K^t].
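For concreteness, a minimal NumPy sketch of one EM iteration implementing Eqs. (10.5)-(10.7) is given below (illustrative only; it assumes the soft counts c^t(v) of Eq. (10.2) have already been arranged in a T × V matrix C, and it omits the length prior p(S^t) of Eq. (10.8) as well as convergence checks).

```python
import numpy as np

def soft_plsa_em_step(C, beta, theta):
    """One EM iteration of soft PLSA.
    C:     T x V matrix of soft counts c^t(v) (sums of weights, Eq. 10.2)
    beta:  V x K matrix, beta[v, k] = p(v | z_k)
    theta: T x K matrix, theta[t, k] = p(z_k | S^t)
    """
    T, V = C.shape
    new_beta = np.zeros_like(beta)
    new_theta = np.zeros_like(theta)
    for t in range(T):
        # E-step (Eq. 10.5): posterior q[v, k] = p(z_k | v, S^t)
        joint = beta * theta[t]                         # V x K
        q = joint / joint.sum(axis=1, keepdims=True)
        # M-step accumulation (Eqs. 10.6-10.7): weight posteriors by soft counts
        weighted = q * C[t][:, None]                    # V x K
        new_beta += weighted
        new_theta[t] = weighted.sum(axis=0)
    new_beta /= new_beta.sum(axis=0, keepdims=True)     # normalize each topic over words
    new_theta /= new_theta.sum(axis=1, keepdims=True)   # normalize each sequence over topics
    return new_beta, new_theta
```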
10.2.3 SVM classification

Once computed, the feature vectors c^t (for the soft bag-of-words) or θ^t = [θ_1^t, ..., θ_K^t] (for the soft PLSA) can be used to face the protein remote homology detection problem; as done in many other remote homology detection systems, the training feature vectors are used to learn a Support Vector Machine, which is then employed to classify the test protein sequences.

10.3 Experimental evaluation

10.3.1 Experimental details

The experimental evaluation is based on three benchmarks. The first one is a well-known dataset created from SCOP version 1.53 [80]. This dataset [128] (available at http://noble.gs.washington.edu/proj/svm-pairwise/) contains 4352 sequences from 54 different protein families and 23 superfamilies, in a hierarchical structure like the one shown in Fig. 10.2. The idea is that – given a sequence – its remote homologues are the sequences in the same superfamily, but not in the same family. Thus, to simulate remote homology, 54 different subsets are created: in each of them, an entire target family is left out as positive testing set.

Fig. 10.2. The tree shows the subdivision into families and superfamilies used to simulate remote homology in the SCOP dataset. In the example, the first protein family is highlighted as positive test set.

Positive training sequences are selected from other families belonging to the same superfamily (i.e. sharing remote homology), whereas negative examples are taken from different superfamilies and split between training and testing with the same proportions as the positive class. Class labels are very unbalanced, with a vast majority of objects belonging to the negative class: on average, the positive class (train + test) is composed of 49 sequences, whereas the negative one comprises 4267. The second dataset has been created for our evaluation, starting from the observation that version 1.53 of SCOP is fairly outdated (September 2000); therefore, we downloaded sequences from the more recent SCOP 2.04, ensuring that all pairwise similarities have E-value greater than 10^-5. A total of 8700 sequences were extracted. The subdivision is carried out with the same protocol as the SCOP 1.53 benchmark, resulting in 89 different subsets, each one corresponding to one particular protein family (the dataset is available at http://www.pietrolovato.info/proj/softngrams.html). Finally, we used a third dataset to assess the performances of our framework on a more challenging task: in particular, we employed a fold benchmark extracted from SCOP 1.67 [90], where homologous sequences are taken at the superfamily level rather than at the family level – making this dataset considerably harder than the SCOP 1.53 and SCOP 2.04 ones. The dataset contains 3840 sequences and is split into 86 different subsets (http://www.biomedcentral.com/1471-2105/8/23/additional). The proposed soft Ngram approaches have been evaluated and compared against the corresponding non-soft versions in different experimental conditions. In particular, we performed different trials by varying the dictionary size – we considered 1grams, 2grams, and the concatenation of the two dictionaries: in this last case the dictionary contains 420 distinct Ngrams. For what concerns the second model, the soft PLSA has been compared with the standard PLSA model, learned on profile-based Ngrams. To the best of our knowledge, standard PLSA has never been investigated for remote homology detection with profile-based representations. As detailed in the previous section, the models (both PLSA and soft PLSA) are trained on the training set alone, and the feature vectors θ for the testing sequences are obtained via an inference step. Both models require the number of topics K to be known beforehand. To set this parameter, we performed a coarse search, finding that the most reasonable choice is to set it to ∼100. In all the experiments we noticed that the learning is sensitive to the initial choice of the parameters β and θ. In fact, the convergence of the EM algorithm to a good local optimum depends on the choice of the starting point for the EM iterations [241]. A good initialization is therefore crucial: following ideas contained in [73], we chose to initialize θ uniformly, i.e. θ_k^t = 1/K for all k. To initialize β, we clustered sequences into K groups using the complete link algorithm for hierarchical clustering. This way, the k-th cluster groups together similar sequences: the average of their feature vectors, after normalization (so that the sum is equal to 1), is the initialization for β_{v,k}.
As in many previous works [64, 65, 132-134, 187], classification is performed using an SVM via the public GIST implementation (downloadable from http://www.chibi.ubc.ca/gist/ [128]), setting the kernel type to radial basis and keeping the remaining parameters at their default values. Detection accuracies are measured using the receiver operating characteristic (ROC) score [88], which represents the area under the ROC curve (the larger this value, the better the detection).

10.3.2 Detection results and discussion

In the first set of experiments we compared the soft bag-of-words and the soft PLSA with the corresponding standard bag-of-words and PLSA models, on the SCOP 1.53 and SCOP 2.04 superfamily benchmarks. Averaged ROC scores, over all families, are presented in Tables 10.1 and 10.2, respectively. From the tables it can be observed that ROC scores are always higher when the soft representation is employed, reflecting the fact that the considered information enriches the description of proteins and improves performances. To assess the statistical significance of our results and to demonstrate that the increments in ROC score gained with the proposed approach are not due to mere chance, we performed a Wilcoxon signed-rank test with Bonferroni correction [133]: we found that in 52 cases out of 56 the increased performances with soft Ngrams and soft PLSA are significant with p-value p < 0.05. Additionally, we noticed that in many cases the product strategy works best in combination with row-Ngrams, whereas the sum strategy works best with column-Ngrams: since multiplication implies statistical independence between amino acids, this may be a more reasonable assumption between different amino acids in the same row.

Table 10.1. ROC scores computed on the SCOP 1.53 and SCOP 2.04 superfamily benchmarks. The table compares the bag-of-words (BoW) and soft bag-of-words models (for 1-grams, the sum and prod strategies coincide, so a single soft value is reported).

SCOP 1.53
Dictionary            BoW     softBoW,sum   softBoW,prod
1-gram                0.906   0.930         -
row 2-gram [134]      0.929   0.947         0.947
col 2-gram [133]      0.923   0.944         0.950
row (1,2)-gram        0.940   0.957         0.941
col (1,2)-gram        0.933   0.944         0.934

SCOP 2.04
Dictionary            BoW     softBoW,sum   softBoW,prod
1-gram                0.923   0.953         -
row 2-gram [134]      0.949   0.952         0.958
col 2-gram [133]      0.937   0.958         0.961
row (1,2)-gram        0.952   0.959         0.960
col (1,2)-gram        0.947   0.958         0.956

Table 10.2. ROC scores computed on the SCOP 1.53 and SCOP 2.04 superfamily benchmarks. The table compares the PLSA and soft PLSA models.

SCOP 1.53
Dictionary            PLSA    softPLSA,sum  softPLSA,prod
1-gram                0.925   0.946         -
row 2-gram [134]      0.947   0.962         0.950
col 2-gram [133]      0.941   0.950         0.964
row (1,2)-gram        0.954   0.962         0.964
col (1,2)-gram        0.948   0.949         0.959

SCOP 2.04
Dictionary            PLSA    softPLSA,sum  softPLSA,prod
1-gram                0.939   0.963         -
row 2-gram [134]      0.959   0.964         0.967
col 2-gram [133]      0.951   0.965         0.970
row (1,2)-gram        0.955   0.963         0.970
col (1,2)-gram        0.960   0.966         0.971

Finally, we report in Table 10.3 comparative results with other approaches of the literature applied to the SCOP 1.53 benchmark. When compared to other techniques based on Ngram counting, the proposed approach (using both soft BoW and soft PLSA) sets the best performance so far; looking at the global picture, the table shows that, except in one case, the proposed approach outperforms every state-of-the-art method.
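Since all comparisons in this chapter are based on the ROC score, the following minimal Python sketch (illustrative only) shows how such a score can be computed from the SVM decision values of the positive and negative test sequences, as the empirical probability that a positive is ranked above a negative (ties counted as 0.5).

```python
def roc_score(pos_scores, neg_scores):
    """Area under the ROC curve: fraction of (positive, negative) pairs
    in which the positive receives the higher decision value."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Example with toy decision values produced by an SVM on a test split.
print(roc_score([0.9, 0.4, 0.7], [0.2, 0.5, 0.1]))  # 0.888...
```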
In order to better investigate the behavior of the proposed framework, we report in Fig. 10.3 the ROC curves obtained on the SCOP 1.53 benchmark.

Table 10.3. Average ROC scores for the 54 families in the SCOP 1.53 superfamily benchmark for different methods.

Method                          ROC     Reference
Soft Ngram (our best)           0.957   This chapter
Soft PLSA (our best)            0.964   This chapter
Ngram based methods
SVM-Ngram                       0.826   [65]
SVM-Ngram-LSA                   0.878   [65]
SVM-Top-Ngram (n=1)             0.907   [133]
SVM-Top-Ngram (n=2)             0.923   [133]
SVM-Top-Ngram-combine           0.933   [133]
SVM-Ngram-p1                    0.887   [134]
SVM-Ngram-KTA                   0.892   [134]
Other methods
SVM-pairwise                    0.896   [202]
SVM-LA                          0.925   [202]
Profile (5,7.5)                 0.980   [187]
SVM-Pattern-LSA                 0.879   [65]
SVM-Motif-LSA                   0.860   [65]
PSI-BLAST                       0.676   [65]
SVM-Bprofile                    0.921   [64]
SVM-PDT-profile (β=8,n=2)       0.950   [132]
HHSearch                        0.915   [132]
SVM-LA-p1                       0.958   [134]

To draw the curves, we considered all 54 families at once: this means that the false positive rate and the true positive rate are not relative to one particular family, but are rather averaged over the different subsets. In each subfigure, we compare the soft approach with its standard counterpart, reporting the area under the curve in the legend. For every comparison, we confirm that the proposed soft methods outperform their non-soft counterparts. Interestingly, there is a major boost when 1grams are employed. 1grams correspond to the amino acids readily available from the profile, and are the core piece of information that we are considering; this may suggest that exploiting all amino acids in the profile – along with their corresponding frequencies – is a key step in developing novel representations to ease the remote detection problem.

Fig. 10.3. ROC curves (FPR vs. TPR) computed on the SCOP 1.53 dataset, one panel per dictionary (1gram, row-2gram, column-2gram, row-(1,2)gram, column-(1,2)gram). In each subfigure, the proposed soft representation is compared with its standard counterpart (e.g., for 1grams, soft BoW reaches AUC 0.930 against 0.808 for standard BoW).

Finally, in Table 10.4 we report the results obtained on the SCOP 1.67 fold benchmark, where the task is more challenging (detecting homologies at the superfamily level rather than at the family level). In the table, the best configuration – achieved using row (1,2)-grams for both soft BoW and soft PLSA – is reported. Even in this difficult case, the proposed framework proved to be very effective, with our soft PLSA approach setting a new state of the art.
Table 10.4. Average ROC scores for the 86 families in the SCOP 1.67 fold benchmark for different methods.

Method                          ROC     Reference
Soft Ngram (our best)           0.828   This chapter
Soft PLSA (our best)            0.861   This chapter
Ngram based methods
SVM-Top-Ngram (n=2)             0.813   [133]
SVM-Top-Ngram-combine-LSA       0.854   [133]
Other methods
PSI-BLAST                       0.501   [90]
SVM-pairwise                    0.724   [90]
SVM-LA                          0.834   [90]
Gpkernel                        0.844   [90]
Mismatch                        0.814   [90]
eMOTIF                          0.698   [90]
SVM-Bprofile (Ph=0.11)          0.804   [133]
SVM-Bprofile-LSA (Ph=0.11)      0.823   [133]
SVM-Nprofile-LSA (N=9)          0.823   [130]

11 A multimodal approach for protein remote homology detection

Even if satisfactory accuracies are reached on several benchmark datasets (e.g. the SCOP 1.53 dataset detailed in the previous chapter), there are still complex cases where even state-of-the-art approaches may perform poorly on the protein remote homology detection task. In such cases, it may be possible that information derived from other sources helps, provided that it is possible to properly integrate such (even partial) information into existing models. In the context of protein remote homology detection, there is a source of information which is typically disregarded by classical approaches: the available, possibly few, experimentally solved 3D structures (some papers already show the potential of structural information – see for example [95] – however, they are all based on 3D predictions made from sequences, therefore not using the true 3D structures found in the PDB). Now the question is: is it possible to improve sequence-based methods by integrating information derived from such 3D structures? In this chapter we provide some evidence that this is possible, by deriving a multimodal approach for remote homology detection (from a general point of view, a multimodal approach is a technique aimed at solving a given task by integrating different sources of information). We took inspiration from the multimodal image and text retrieval context [103], where images are equipped with loosely related narrative text descriptions and are retrieved by using textual queries. This scenario is particularly interesting with respect to our scope, because it shares many similarities with our context: i) the link between the modalities is weak, partially hidden, and, in general, difficult to infer; ii) most importantly, the context is asymmetric: one of the two modalities is richer than the other, yet more difficult or expensive to obtain – therefore fewer examples are typically available (it is known that the number of experimentally determined structures is one order of magnitude lower than the number of known sequences). The goal is to develop an approach which works directly on the weaker source of information (the sequence), while being built taking into account the (possibly smaller) richer source (the structure). In this chapter we show that such a multimodal point of view can be effectively exploited for protein remote homology detection: as said above, the richer modality is represented by a (possibly small) subset of structures – retrieved from the PDB – which are used to derive a "structure-aware" model for sequences. Our multimodal approach, based on the recent [176], starts by encoding sequences and structures with a bag of words representation. In particular, sequences are described using counts of Ngrams (presented in the previous chapter); structures are described using counts of 3D fragments, as in [39].
Both representations are then modeled using topic models: we investigate here two models, the already presented PLSA [94] and the Componential Counting Grid (CCG) model [176]. The latter is a recent admixture extension of the Counting Grid, whose use in the protein remote homology detection context has never been investigated. For both models, we created an augmented model accounting for structural information in two steps: i) a model (PLSA or CCG) for the available structures is learned, creating a latent space which acts as a common, intermediate representation; ii) all the sequences are embedded into this space derived from structures. Such an embedding is determined by exploiting the (partial) available correspondences between sequences and structures. The suitability of the proposed multimodal framework for protein remote homology detection has been evaluated in two ways: on one hand, we performed various tests on the standard SCOP 1.53 benchmark [128], demonstrating that i) the proposed framework permits drastic improvements in those scenarios where the sequence modality fails – even when only 10% of the training sequences have their corresponding structure; ii) on the whole benchmark (54 families), it compares favorably with other recent approaches. On the other hand, we performed a thorough analysis on a member of the GPCR superfamily, suggesting that the proposed multimodal approach can extract information that cannot be derived by employing only sequence-based approaches.

11.1 Materials and methods

This section briefly summarizes the probabilistic models (in particular, CCG) employed in our approach. To employ these models, a document should be represented with a bag of words vector, where each entry n^t(w_i) counts the number of times a given word w_i occurs in a given document (indexed by t). In our biological scenario documents correspond to proteins, while basic building blocks (such as sequence Ngrams) are the observed words. Once learned, the topic models permit the representation of all proteins in the topic space: even if in the protein case this space does not have a straightforward biological meaning (in other cases a biological interpretation can be easily assigned, as in the gene expression case – see Chaps. 4 and 5), it turned out to be really informative for protein comparison, as largely shown in [209]. In the following, we will detail the Componential Counting Grid model (CCG, [176]).

11.1.1 Componential Counting Grid

The Componential Counting Grid (CCG, [176]), introduced in the context of Natural Language Processing, is a recent extension that combines the basic ideas of the Counting Grid [104] with the "admixture" nature of PLSA (i.e. different words of a document may be drawn from different topics). Like the Counting Grid, the model stems from the observation that, in many text corpora, documents evolve into one another in a smooth way, with some words dropping and new ones being introduced. For example, news stories smoothly change across the days, as certain evolving stories progressively fall out of novelty and new events create new stories.
CCG introduces these topological constraints by arranging topics on a 2-dimensional grid; similar topics are placed nearby, in such a way that they can be contained in fixed-size windows inside the grid. Contrary to the CG, where one document is assumed to be generated by only one window in the appropriate position, in the CCG different words of the same document may be generated from multiple windows. This difference is highlighted in Fig. 11.1.

Fig. 11.1. Difference between the CG and CCG models. In the generative process of the CG, all words of a sample are generated from the same window of the grid. In the CCG, the words composing a sample are allowed to be generated from multiple windows.

More formally, the componential counting grid is a grid of discrete locations π_{x,y}, with fixed dimensions E = E_1 × E_2. Each location is endowed with a distribution over all V words, which acts exactly like the distribution p(w|z) of PLSA: given a location z_k, with k = (x, y) (i.e. a topic), π_k represents a multinomial distribution describing the probability of each word given that location. To model smooth transitions between topics, the CCG assumes that a word is not generated from the single distribution π_k related to a single position k of the grid (as in PLSA), but also considering the distributions in a neighborhood of k. In particular, a word in a document t is generated by i) choosing a location z_k from a multinomial distribution p(z | t) = θ_t (like the topic proportions of PLSA); ii) sampling from the average of all the π distributions relative to a window of fixed dimensions W = W_1 × W_2 centered at z_k. As detailed in [176], model parameters and hidden distributions are learned using a variational EM algorithm. Similarly to PLSA, the model is completely specified given the parameters α (Dirichlet prior over locations) and π. Again, given these quantities, inference on an unknown object permits the recovery of the value of θ_t^new.
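The window-averaging step is the only ingredient that differs from PLSA; the following minimal NumPy sketch (illustrative only) computes the effective word distribution used when sampling from a location, i.e. the average of the π distributions inside the W_1 × W_2 window associated with that location. The sketch anchors the window at k and wraps around at the grid borders; the exact anchoring and boundary conventions of the real model follow [176].

```python
import numpy as np

def window_word_distribution(pi, k, W):
    """Average the per-location word distributions inside the W[0] x W[1] window
    anchored at location k = (x, y).
    pi: array of shape (E1, E2, V), each pi[x, y] a distribution over V words.
    Wrap-around at the grid borders is an assumption of this sketch."""
    E1, E2, _ = pi.shape
    rows = [(k[0] + i) % E1 for i in range(W[0])]
    cols = [(k[1] + j) % E2 for j in range(W[1])]
    window = pi[np.ix_(rows, cols)]              # shape (W1, W2, V)
    return window.reshape(-1, pi.shape[2]).mean(axis=0)

# Toy grid: E = 20 x 20 locations, V = 400 words, window W = 2 x 2.
rng = np.random.default_rng(0)
pi = rng.random((20, 20, 400))
pi /= pi.sum(axis=2, keepdims=True)              # normalize each location over words
h = window_word_distribution(pi, k=(5, 7), W=(2, 2))
print(h.shape, h.sum())                          # (400,) 1.0
```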
11.2 The proposed approach

In this section the multimodal approach used to integrate structural and sequential information is explained. From a very general perspective, the main idea is the following (see Fig. 11.2): suppose we have a set of sequences {seq_i}; for some of them we also know the corresponding structures {struct_i}. Then, we determine, from the set of structures {struct_i}, a function f(struct) which is able to project all structures into a feature space (Fig. 11.2(a)). The goal is to determine a function g(seq) so that f(struct_i) ≡ g(seq_i) for all available structures (i.e. corresponding sequences and structures should share the same representation). The found function g can then be used to project any sequence into the common space, which is now built using structural information (Fig. 11.2(b)). In order to realize this, we exploit an approach derived from the multimodal image-text retrieval literature [176], which is based on the topic models described in the previous section. Even if different alternatives exist [31, 103], in that retrieval context the approach proposed in [176] appeared to be simpler and more effective.

Fig. 11.2. The idea of the multimodal scheme.

11.2.1 Data representation

To employ topic models, we have to define a bag of words representation for proteins, and in particular for both sequences and structures. For the sequence modality, we use Ngrams as words: more in detail, in all our experiments we used bigrams, i.e. subsequences composed of two consecutive amino acids. In the structural domain, we employed structural fragments as words, as proposed in [39]: each fragment is a list of 3D coordinates of consecutive Cα atoms in the backbone of the protein – in their original work, the authors provide different dictionaries of fragments. In our study, following other papers [39, 209], we employed the 400-11 dictionary (composed of 400 structural fragments, each of length 11). In the end, we have two different dictionaries, one for each modality: a dictionary D_ST = {w_1^ST, ..., w_{V_ST}^ST} for structures, and a dictionary D_SE = {w_1^SE, ..., w_{V_SE}^SE} for sequences. The inputs and data involved in our method are composed of:

• a set containing S pairs of corresponding sequence/structure counts (bags), for a subset of training proteins, {(ST_Tr^t, SE_Tr^t)}, t = 1, ..., S, where ST_Tr^t = n^t(w_i^ST), i = 1, ..., V_ST, and SE_Tr^t = n^t(w_i^SE), i = 1, ..., V_SE;
• a set of T − S sequence bags, representing the sequences of the training set without a corresponding 3D structure, {SE_Tr^{S+1}, ..., SE_Tr^T};
• a set of N testing sequence bags, {SE_Te^1, ..., SE_Te^N}, where SE_Te^t = n^t(w_i^SE).

11.2.2 Multimodal learning

The key idea of the proposed multimodal approach is that the latent topic space learned by PLSA (or CCG) establishes a common representation where both sequences and structures can be embedded. Since the two modalities are asymmetric (with the structural one being the richer), we impose that this latent space is powered by the (possibly few) structures. The proposed approach articulates in three major steps.

Topic model learning on structures. First of all, we learn a topic model (PLSA or CCG) using the available structure counts {ST_Tr^1, ..., ST_Tr^S}: acknowledging the superiority of the structural modality, we force the topic space to be "structure-driven". For what concerns the learning, we already emphasized that choosing a good initialization for the parameters p(w|z) (π for CCG) is crucial – the typical random initialization may lead to poor local minima. In order to overcome this issue, we used the same initialization described in the previous chapter: we cluster words into Z groups (where Z represents the number of topics) using the complete link algorithm, which performs an agglomerative clustering. Then, we initialize β (π) so that each topic has high probability of generating the words inside its cluster, and low probability of generating words outside the cluster. At the end of this learning stage, each structure is characterized in the latent space by its corresponding vector θ_ST^t, t = 1, ..., S.

Multimodal projection. In this step, we exploit the correspondences between structures and sequences, projecting the sequences into the latent space learned in the previous step. We impose that the topic proportions θ_SE^t of the S training sequences are equal to the θ_ST^t obtained from the corresponding structures. In this way we establish a 1:1 mapping between the structural topics and the sequential topics. In practice, this is achieved by learning the PLSA/CCG model on the sequence counts keeping θ_SE^t fixed and set to θ_ST^t. As a result, the parameters β_SE and α_SE (π_SE and α_SE for CCG) of the learned model are completely specified in the sequence domain. However, they have been learned taking into consideration the topic proportions derived from the model learned on structures.
The same inference is performed on testing topic proportions θSE t for SEtTe , t = 1, . . . , N . As explained in the background sequences to derive θSE section, inference is performed by keeping fixed α, β (α and π for CCG), and t for the new samples. estimating θSE Summarizing, we propose to learn a topic model on the richer structural modality, and consequently embed the corresponding sequences in the same latent space, discovering the parameters governing Ngrams (sequence words) distributions in a “structure-aware” sense. 11.2.3 Classification scheme In order to perform classification, we employed the generative embedding scheme [120], using as feature vector the topic posterior θt , to be used for training a discriminative classifier such as an SVM. SVMs are therefore trained using all t t θSE (t = 1, . . . , T ) in the training set, whereas classification is carried out on θSE (t = 1, . . . , N ). 11.3 Experimental evaluation 133 11.3 Experimental evaluation In this section the proposed approach is evaluated with the standard and widely used SCOP 1.53 benchmark [128] described in the previous chapter. In particular, we first perform a thorough analysis on two cases where the sole sequence modality fails, showing that drastic improvements can be obtained by the multimodal approach, even if using few structures; then we evaluate the proposed approach on the whole benchmark, in order to have a clear comparison with alternative approaches in the state of the art. As done in the previous chapter, detection accuracies are measured using the receiver operating characteristic (ROC) score [88] (the larger this value the better the detection). 11.3.1 First analysis: families 3.42.1.1 and 3.42.1.5 In this first part we performed a thorough analysis on two cases where the sequence modality fails (i.e. cases where a proper characterization of the family cannot be determined). In particular, we concentrate on families 3.42.1.1 and 3.42.1.5, on which almost random accuracies are obtained by using models based solely on sequences. We applied the proposed multimodal scheme on these two families, starting from the corresponding 3D structures downloaded from PDB. In particular, once the sequences and the structures are encoded as explained in previous sections, the models (PLSA or CCG) are learned from the training set, in order to get the θs usable to train the SVM. θs for the testing set are then extracted via model inference. When using PLSA, taking inspiration from [65, 209], we set the number of topics to 100. For CCG, we exploited the concept of capacity [176], already defined for the Counting Grid: it measures how many non-overlapping windows can fit onto the grid. This can be equated to the number of topics in a topic model: therefore we set the CCG dimension as E = [20, 20] and W = [2, 2], so that the capacity equals to 100. After computing the θs, the classification has been carried out using the public libsvm implementation4 [43], employing the RBF kernel. Parameter C of the SVM has been set as 10−3 for every experiment, whereas the RBF parameter σ has been found by grid search (testing power of 2: [2−4 , . . . , 24 ]), retaining for each family the one performing better on average (reasonable values lie around 2−2 ). In order to get a complete understanding of the proposed approach, we also assessed the performances when only a limited number of structures are available for learning. In particular, we used an increasing fraction of randomly chosen structures to build the structure model. 
11.2.3 Classification scheme

In order to perform classification, we employed the generative embedding scheme [120], using the topic posterior θ^t as feature vector for training a discriminative classifier such as an SVM. SVMs are therefore trained using all the θ_SE^t (t = 1, ..., T) of the training set, whereas classification is carried out on the θ_SE^t (t = 1, ..., N) of the test set.

11.3 Experimental evaluation

In this section the proposed approach is evaluated on the standard and widely used SCOP 1.53 benchmark [128] described in the previous chapter. In particular, we first perform a thorough analysis of two cases where the sequence modality alone fails, showing that drastic improvements can be obtained by the multimodal approach, even when using few structures; then we evaluate the proposed approach on the whole benchmark, in order to have a clear comparison with alternative approaches of the state of the art. As done in the previous chapter, detection accuracies are measured using the receiver operating characteristic (ROC) score [88] (the larger this value, the better the detection).

11.3.1 First analysis: families 3.42.1.1 and 3.42.1.5

In this first part we performed a thorough analysis of two cases where the sequence modality fails (i.e. cases where a proper characterization of the family cannot be determined). In particular, we concentrate on families 3.42.1.1 and 3.42.1.5, on which almost random accuracies are obtained by models based solely on sequences. We applied the proposed multimodal scheme on these two families, starting from the corresponding 3D structures downloaded from the PDB. In particular, once the sequences and the structures are encoded as explained in the previous sections, the models (PLSA or CCG) are learned from the training set, in order to get the θs usable to train the SVM; the θs for the testing set are then extracted via model inference. When using PLSA, taking inspiration from [65, 209], we set the number of topics to 100. For CCG, we exploited the concept of capacity [176], already defined for the Counting Grid: it measures how many non-overlapping windows can fit onto the grid, and can be equated to the number of topics of a topic model; therefore we set the CCG dimensions to E = [20, 20] and W = [2, 2], so that the capacity equals 100. After computing the θs, the classification has been carried out using the public libsvm implementation (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) [43], employing the RBF kernel. The parameter C of the SVM has been set to 10^-3 for every experiment, whereas the RBF parameter σ has been found by grid search (testing powers of 2: [2^-4, ..., 2^4]), retaining for each family the value performing better on average (reasonable values lie around 2^-2). In order to get a complete understanding of the proposed approach, we also assessed the performances when only a limited number of structures is available for learning. In particular, we used an increasing fraction of randomly chosen structures to build the structure model. Since there is a very limited number of positive examples (29 for the first family, 26 for the second), we decided to always consider all of them, sampling the negative training examples at random. The structure model is then transferred to the sequence model; inference on the enriched sequence model finally permits to get descriptors for all training and testing sequences, to be used by the SVM classifier. Detection results, for fractions ranging from 0.1 to 1 (i.e. all training structures), are averaged over 50 runs and reported in Fig. 11.3, for both the PLSA and CCG models. We also determined whether the improvement gained with the proposed multimodal approach is statistically significant, using a standard t-test with alternative hypothesis "multimodal results are greater than the baseline". In Fig. 11.3, filled markers indicate statistical significance at significance level α = 0.05.

Fig. 11.3. Detection scores displayed as a function of the number of structures used in the multimodal approach. "mmPLSA" ("mmCCG") stands for the proposed multimodal approach using the PLSA (CCG) model. Filled markers indicate statistically significant improvements over the baseline. Results are reported for (left) family 3.42.1.1 and (right) family 3.42.1.5.

From these plots it seems evident that the use of structural information permits the derivation of a better sequence model: in both families, CCG achieves significant improvements when employing only 10% of all training structures. For the second family, even if the multimodal PLSA accuracies are higher than the baseline, statistical significance is obtained only when 80% or more of the structures are employed. When all training structures are considered, the improvement is rather high for both models. When comparing the two probabilistic models, it appears evident that the Componential Counting Grid outperforms the PLSA model, both when used on the sequence modality alone and when employed in the multimodal framework. Such a model, never used before in the context of protein remote homology detection, permits the derivation of a better and more discriminant description of count data, as outlined in [176] for other application fields.

11.3.2 Second analysis: all families

In this second analysis, the proposed approach has been tested on all the families of the SCOP dataset, this being particularly important to compare the proposed scheme with the state of the art. In this case we slightly changed some details of our experimental pipeline; in particular, since we are dealing with 54 different classification problems (i.e. 54 families), we did not fix a single number of topics, but we let it vary in a reasonable range, keeping the best value.
11.3.2 Second analysis: all families

In this second analysis, the proposed approach has been tested on all the families of the SCOP dataset, this being particularly important to compare the proposed scheme with the state of the art. In this case we slightly changed some details of our experimental pipeline; in particular, since we are dealing with 54 different classification problems (i.e. 54 families), we did not fix a single number of topics, but we let it vary in a reasonable range, keeping the best value. Moreover, in order to be fully comparable with many works in the state of the art [64, 65, 132–134, 187], the classification is performed using SVM via the public GIST implementation [128] (downloadable from http://www.chibi.ubc.ca/gist/), setting the kernel type to radial basis and keeping the remaining parameters at their default values.

Results are presented in Table 11.1, in comparison with the literature; in particular, the state of the art is split into methods which employ Ngrams (Ngram-based methods) and methods which do not (Other methods). From the table it can be observed that the framework is rather accurate: when compared with other Ngram-based methods, our best result outperforms all other approaches (except the SVM-Top-Ngram-combine [133] approach, which however combines different Ngram representations). Moreover, the proposed multimodal technique compares reasonably well also with other, more complex approaches. Interestingly, CCG outperforms PLSA only when used in a multimodal framework.

Table 11.1. Average ROC scores for the 54 families in the SCOP 1.53 superfamily benchmark for different methods.

Method                         ROC    Reference
Monomodal PLSA                 0.921  This chapter
Monomodal CCG                  0.903  This chapter
Multimodal PLSA                0.925  This chapter
Multimodal CCG                 0.932  This chapter

Ngram-based methods
SVM-Ngram                      0.826  [65]
SVM-Ngram-LSA                  0.878  [65]
SVM-Top-Ngram (n=1)            0.907  [133]
SVM-Top-Ngram (n=2)            0.923  [133]
SVM-Top-Ngram-combine          0.933  [133]
SVM-Ngram-p1                   0.887  [134]
SVM-Ngram-KTA                  0.892  [134]

Other methods
SVM-pairwise                   0.896  [202]
SVM-LA                         0.925  [202]
Profile (5,7.5)                0.980  [187]
SVM-Pattern-LSA                0.879  [65]
SVM-Motif-LSA                  0.860  [65]
PSI-BLAST                      0.676  [65]
SVM-Bprofile                   0.921  [64]
SVM-PDT-profile (β=8,n=2)      0.950  [132]
HHSearch                       0.915  [132]
SVM-LA-p1                      0.958  [134]

11.4 Multimodal analysis of bitter taste receptor TAS2R38

The main goal of this section is to qualitatively validate the proposed multimodal scheme in a real scenario. In particular we focus on a specific protein (the bitter taste receptor TAS2R38 [40, 116]) belonging to the G-protein coupled receptor (GPCR) superfamily. This large group (with over 900 members in humans alone) of cell-signaling membrane proteins is of major importance for drug development, as GPCRs are one of the primary targets currently under investigation [163]. From our perspective, this context is very interesting for three reasons: i) sequence identities between members of different GPCR families are extremely low, making the detection of remote homologues very challenging; ii) only 24 unique human GPCRs (list obtained from http://blanco.biomol.uci.edu/mpstruc/) have an experimentally determined structure as of January 2015 (i.e. very little structural information); iii) most importantly, it has already been shown that the closest homologue of the TAS2R38 receptor (as given by standard programs for sequence search, without manual intervention) does not represent a good template for unraveling structural/functional elements (in particular, regarding the active site and the specific residues involved in ligand binding) [20]. We show here that our multimodal approach can be used to suggest an alternative template. We support this template by providing some elements showing the capability of the obtained multimodal model to capture structural/functional elements.
To do that, a multimodal PLSA (with 3 topics: a drastically reduced number, since only 24 structures are available and the topic space is built using the structural information) has been trained, using all sequences and the 24 known structures (downloaded from the PDB): as a result, all GPCR sequences are embedded in the space of topic probabilities θ. The query TAS2R38 sequence is embedded in the same space via inference on the model: the nearest neighbor with known structure represents the suggested template. In this case it is the N/OFQ Opioid Receptor (PDB id: 4EA3). On the contrary, if we perform the same analysis with the single-modality PLSA, we obtain as nearest neighbor the CCR5 chemokine receptor (PDB id: 4MBS); as described above, modeling TAS2R38 using this template alone does not allow a correct characterization of the binding cavity of the receptor [20]. To validate the new template, we mine the obtained multimodal model, in order to see if the information it contains exhibits structure-driven importance. To do that, we analyze, for every topic, the 5 most probable Ngrams (as given by the distribution β), trying to understand if they are related to positions in the two proteins which are important from a structural point of view. We found that some of these Ngrams (shown in the top part of Fig. 11.4, together with the topic probabilities θ of the query and of the corresponding nearest neighbor) represent words located, with primary importance, in the binding cavity of both proteins – critical residues already shown to be involved in ligand recognition on our query TAS2R38 [146]. If we repeat the same analysis using a PLSA model built using only sequences (central part of Fig. 11.4), no evident structural or functional information can be derived, suggesting that the N/OFQ Opioid Receptor, being obtained with a more "structure-aware" model, can represent a valid alternative to the CCR5 chemokine receptor.

Fig. 11.4. On the top part of the figure, the first 5 Ngrams (sorted in descending order w.r.t. their β probabilities) for each topic are listed. Highlighted Ngrams are known to occur in the binding site locations of either of the two proteins. Slightly to the right, θ distributions (with 3 topics) are displayed for the query TAS2R38 and its closest neighbor. In the central part of the figure, we visualize the same information employing the PLSA in a single-modal way. Finally, in the bottom part of the figure, the same information has been extracted with the multimodal approach employing both real and predicted structures. Interestingly, adding such predicted structures deteriorates the qualitative results obtained by the multimodal scheme.
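For illustration, the two operations used in this analysis (suggesting a template as the nearest structure-annotated neighbor in the θ space, and listing the most probable Ngrams of each topic from β) can be sketched as follows. This is a minimal sketch under assumptions: the θ vectors come from model inference as NumPy arrays, β is a topics × dictionary matrix, and Euclidean distance is used as the notion of proximity in topic space (the metric is not specified in the text, so this is an illustrative choice); all names are hypothetical.

```python
import numpy as np

def suggest_template(theta_query, theta_structs, struct_ids):
    """Return the id of the structure whose topic proportions are closest
    to those of the query (Euclidean distance, an illustrative choice)."""
    dists = np.linalg.norm(theta_structs - theta_query, axis=1)
    return struct_ids[int(np.argmin(dists))]

def top_ngrams_per_topic(beta, dictionary, k=5):
    """List, for every topic, the k Ngrams with the highest probability in beta.

    beta: (n_topics, n_ngrams) matrix of word-given-topic probabilities;
    dictionary: list of Ngram strings aligned with beta's columns."""
    return [[dictionary[j] for j in np.argsort(row)[::-1][:k]] for row in beta]
```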
A final experiment has been carried out in order to investigate whether it may be possible, in cases like this where very few structures are available, to enlarge the structural information of the training set by also using predicted 3D structure models. To test this, we applied our multimodal approach by enlarging the training set with the predicted structures of different proteins belonging to the TAS2R group (24 GPCR models, downloaded from http://zhanglab.ccmb.med.umich.edu/GPCR-HGmod/). Results are displayed in the bottom part of Fig. 11.4: even if we obtain the same suggested template (the N/OFQ Opioid Receptor – PDB id: 4EA3), the quality of the multimodal space seems worse than that obtained with the true multimodal approach. It seems that adding predicted models does not help the proposed approach but, on the contrary, adds some noise. This was somewhat expected, and confirms the intuition gained from the other quantitative experiments: the full exploitation of the proposed framework is based on the use of a small piece of information, which should however be extremely informative (as real structures are, compared to simulated ones).

In conclusion, the availability of a method that, by augmenting the descriptive power of a sequence-based model, is able to predict relevant structural positions, i.e. those involved in ligand binding, is a fundamental step for setting up the modeling protocol when no 3D experimental information is available. In the studied case, the information obtained using our approach could be essential for guiding the selection of better and biologically relevant target-template alignments.

Final remarks on protein remote homology detection

In this part of the thesis we addressed protein remote homology detection, where some bag of words approaches have already proved successful in the literature. However, they could be further exploited: in Chap. 10, we derived a novel bag of words representation – which we dubbed Soft Ngram – extracted from the profile of a sequence, explicitly considering and capturing the frequencies in the profile, thus reflecting the evolutionary history of the protein. We proposed two modeling approaches to derive feature vectors from the soft Ngram representation, employable as input for the SVM discriminative classifier. Starting from the bag of words model, we promoted the use of topic models in the context of protein remote homology detection: we derived a soft PLSA model that deals with the proposed characterization of sequences. In a thorough experimental evaluation, we demonstrated on three benchmarks that the soft Ngram representation is more descriptive and accurate than other profile-based approaches, being also superior to almost all the approaches proposed in the literature. Looking back at the pipeline proposed in Chap. 2, this chapter contributed to every aspect of the pipeline: to the "what to count" stage, by considering every Ngram in the sequence profile to build the dictionary; to the "how to count" stage, by associating to the count value a probability that reflects evolutionary conservation; and to the "how to model" stage, by deriving a novel, soft PLSA topic model.

In Chap. 11, we investigated a multimodal approach for protein remote homology detection. In particular, we provided some evidence that it is possible to improve sequence-based models by exploiting the available (even partial) 3D structures. The approach, based on topic models, allowed the derivation of a common and intermediate feature space – the topic space – which embeds sequences while being at the same time "structure aware". We experimentally demonstrated that, in cases where the sequence modality alone fails, introducing only 10% of the training structures results in significant improvements in detection scores. Moreover, we applied the proposed approach to model a GPCR protein, finding evidence of structural correlations between sequence Ngrams: such correlations cannot be recovered employing a sequence-only technique.
12 Conclusions and future works

This thesis investigated and promoted the bag of words paradigm for representing and approaching problems in the wide field of Bioinformatics. The bag of words is a vector representation particularly appropriate when the pattern is characterized (or assumed to be characterized) by the repetition of basic, "constituting" elements called words. By assuming that all possible words are stored in a dictionary, the bag of words vector for one particular object is obtained by counting the number of times each element of the dictionary occurs in the object. The bag of words is particularly suited to bioinformatics for a twofold reason: on one hand, it can be well justified by our current understanding of biology, where a bag with no structure is precisely what we are able to observe (thus one of the main drawbacks of the representation – that it destroys the object structure – is relaxed and alleviated). On the other hand, it seems that some Bioinformatics problems are inherently formulated as counting: we mentioned, for example, that measuring gene expression means counting the number of mRNA molecules.

In this general picture, this thesis has been devoted to demonstrating that bag of words representations and models can be conveniently exported and employed in different scenarios of the Bioinformatics domain, and that bag of words approaches can have a significant impact on the Bioinformatics and Computational Biology communities. More in detail, the main contributions of this thesis are:

• The proposal of a possible formalization of the bag of words paradigm to represent and model objects, by means of a detailed pipeline that can be employed to face a problem using a bag of words approach.
• The identification of three scenarios where bioinformatics problems can be effectively faced from a bag of words perspective, proposing different contributions at different levels of the pipeline:
– Gene expression analysis: in this context, this thesis contributed by recognizing the bag of words representation in gene expression data, and subsequently by investigating the capabilities of topic models for the classification of gene expression experiments. More than this, we provided several considerations on the interpretability of the obtained results, using a real dataset involving different species of grapevine resulting from a collaboration with the Functional Genomics Lab at the University of Verona. Encouraged by the promising results, we performed a comprehensive evaluation of a more recent and powerful topic model, the Counting Grid, which copes with a possible drawback of classic topic models: topics, i.e. biological processes in this context, act independently of each other. We promote the use of the CG model as an effective tool for visualization, gene selection, and classification of gene expression samples.
– HIV infection modeling: in this context, this thesis argued for the usage of bag of words representations and models for analyzing aspects of the HIV infection in humans, focusing on i) the patient's bag of epitopes, which we found to be correlated with the patient's HIV status by employing and tailoring the Counting Grid model for this purpose, and ii) the patient's bag of TCRs, where robust statistics for measuring the diversity of samples have been thoroughly evaluated, using a dataset derived from a collaboration with the David Geffen School of Medicine, UCLA.
As a second contribution, a principled way of assessing the reliability of the bag of words has been devised.
– Protein remote homology detection: in this context, this thesis contributed by proposing a novel bag of words approach to characterize protein sequences, fully integrating evolutionary information in the representation: each word has been equipped with a weight that encodes its conservation across evolution. Moreover, a novel probabilistic model able to handle the presence of this weight associated with each word has been developed. A second contribution is aimed at properly integrating into existing models partial information derived from other sources. In particular, there is a source of information which is typically disregarded by classical approaches: the available, possibly few, experimentally solved 3D structures of proteins. A multimodal approach for protein remote homology detection has therefore been derived, which permits the integration of the possibly few available 3D structures into a model. A validation using standard benchmarks confirms the potential of the proposed approach, as does a qualitative analysis performed in collaboration with the Applied Bioinformatics group (University of Verona) on a real dataset of GPCR proteins.

For each scenario, motivations, advantages, and challenges of the bag of words representations have been addressed, together with possible solutions that have been thoroughly evaluated experimentally, exploiting literature benchmarks as well as datasets derived from direct interactions with clinical and biological laboratories / research groups (the Functional Genomics Lab and Applied Bioinformatics group at the University of Verona, and the David Geffen School of Medicine at UCLA).

The work done in this thesis paves the way for further studies, aimed at approaching novel bioinformatics challenges from a bag of words perspective. On one hand, we suggested the characteristics of a problem which make the bag of words representation particularly suited. On the other hand, we demonstrated that bag of words models can be extremely versatile, and can be tailored for a vast range of tasks (visualization, classification, clustering, interpretation, feature selection, statistical analysis and reliability assessment).

Further contributions could also be aimed at improving existing results obtained in the scenarios addressed in this thesis: all the approaches we proposed in the specific bioinformatics contexts open new perspectives.

Fig. 12.1. It is possible to integrate the soft PLSA model – portrayed in (a) – with "biologically-aware" similarity measures for Ngrams (such as those derived from a sequence alignment), and enhance the model (b) by clustering (soft) Ngrams and assigning a label c to each of them.

More in detail, in the context of gene expression it is possible to propose novel topic models, enriching and extending existing ones to address the specific gene expression scenario: for example, it is possible to integrate genes' dependencies known a priori (preliminarily investigated in [180]) to better model the gene-topic distributions, leading to a better characterization of samples. Another research line can be devoted to boosting the gene selection technique and enhancing the interpretability of the Counting Grid by applying a sparse regressor (such as LASSO [226]) to the q_kt distribution (the p(z|s) of the PLSA model). The LASSO can be exploited to learn the most discriminative locations in the CG space, which has already been shown to provide a good embedding of samples. Then, an analysis of the genes most prominent in these locations may better highlight gene expression patterns that are associated with a disease.
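A minimal sketch of this idea follows, under assumptions: the per-sample location activations q_kt are arranged as a samples × grid-locations matrix, the labels are numeric class indicators, and scikit-learn's Lasso is used as the sparse regressor; the helper is illustrative and not an implemented part of the thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def discriminative_cg_locations(q, y, alpha=0.01):
    """Fit a sparse linear model on the Counting Grid activations q (p(z|s))
    and return the indices of grid locations with non-zero coefficients,
    i.e. the candidate discriminative locations whose genes could be inspected."""
    model = Lasso(alpha=alpha)
    model.fit(q, y)                      # q: (n_samples, n_locations), y: labels
    return np.flatnonzero(model.coef_)   # sparse set of selected locations
```

The regularization strength alpha controls how many locations survive; inspecting the genes most prominent at the selected locations would then follow, as discussed above.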
Finally, one line of research which has been only marginally investigated is biclustering, where the bag of words could be effectively employed [22].

In the context of HIV modeling, we plan to study probability models of epitope co-presentation for a broader spectrum of tasks, from correcting association studies, to detecting patients or populations that are likely to react similarly to an infection, to rational vaccine design. In addition, a more comprehensive evaluation of the reliability technique on benchmark datasets of TCR sequences is currently under consideration.

In the context of protein remote homology detection, future work along this line of research is directed toward the study of more sophisticated models for the Soft Ngram representation. For example, it may be possible to take into account a sequence similarity measure between Ngrams. If two sequences have similar soft bag of words representations, it could be interesting to check whether the observed differences arise from Ngram substitutions that are likely to occur in nature (this information being encoded in the substitution matrix). This dependence may be introduced in the Bayesian network of the soft PLSA through a variable, shown in Fig. 12.1, modeling a clustering of Ngrams. Moreover, we are currently studying more robust multimodal approaches, which can for example learn how to move from the structure space to the sequence space. As a final consideration, we believe that one of the most important trends in current Bioinformatics is the integration of information from heterogeneous sources, and the bag of words can provide a common representation for all of these sources.

In conclusion, this thesis demonstrated the possibility of facing some Bioinformatics problems from a bag of words perspective. More than that, we gathered evidence that this paradigm can be successfully exported to many other biological contexts, and can help biomedical experts gain a deeper understanding of their specific problems.

References

1. A.K. Abbas, A.H.H. Lichtman, and S. Pillai. Basic immunology: functions and disorders of the immune system. Elsevier Health Sciences, 2012. 2. T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics, 26:392–398, 2010. 3. M. Aharon, M. Elad, and A. Bruckstein. K-svd: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006. 4. B. Alberts, A. Johnson, J. Lewis, D. Morgan, M. Raff, K. Roberts, and P. Walter. Molecular biology of the cell. Garland Science, 6 edition, 2014. 5. A.A. Alizadeh, M.B. Eisen, E. Davis, C. Ma, I. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, and X. Yu. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503–511, 2000. 6. S. Alizon, V. von Wyl, T. Stadler, R.D. Kouyos, S. Yerly, B. Hirschel, J. Böni, C. Shah, T. Klimkait, and H. Furrer. Phylogenetic approach reveals that virus genotype largely determines hiv set-point viral load.
PLoS pathogens, 6(9):e1001123, 2010. 7. U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999. 8. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990. 9. S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997. 10. S.A. Armstrong, J.E. Staunton, L.B. Silverman, R. Pieters, M.L. den Boer, M.D. Minden, S.E. Sallan, E.S. Lander, T.R. Golub, and S.J. Korsmeyer. Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature genetics, 30(1):41–47, 2001. 11. M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, and J.T. Eppig. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000. 12. H.U. Asuncion, A.U. Asuncion, and R.N. Taylor. Software traceability with topic modeling. In Proc. of the 32nd ACM/IEEE Int. Conference on Software Engineering, volume 1 of ICSE ’10, pages 95–104, 2010. 146 References 13. F. Balkwill. Cancer and the chemokine network. Nature Reviews Cancer, 4(7):540– 550, 2004. 14. C.H. Bassing, W. Swat, and F.W. Alt. The mechanism and regulation of chromosomal v(d)j recombination. Cell, 109(2):S45–S55, 2002. 15. P.D. Baum, J.J. Young, D. Schmidt, Q. Zhang, R. Hoh, M. Busch, J. Martin, S. Deeks, and J.M. McCune. Blood t-cell receptor diversity decreases during the course of hiv infection, but the potential for a diverse repertoire persists. Blood, 119(15):3469–3477, 2012. 16. A.D. Baxevanis and B.F.F. Ouellette. Bioinformatics: a practical guide to the analysis of genes and proteins, volume 43. Wiley, 2004. 17. T. Beißbarth and T.P. Speed. Gostat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics, 20(9):1464–1465, 2004. 18. H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The protein data bank. Nucleic Acids Research, 28(1):235–242, 2000. 19. A. Bhattacharjee, w.G Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti, R. Bueno, M. Gillette, M. Loda, G. Weber, E.J. Mark, E.S. Lander, W. Wong, B.E. Johnson, T.R. Golub, D.J. Sugarbaker, and M. Meyerson. Classification of human lung carcinomas by mrna expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 98(24):13790–13795, 2001. 20. X. Biarnés, A. Marchiori, A. Giorgetti, C. Lanzara, P. Gasparini, P. Carloni, S. Born, A. Brockhoff, M. Behrens, and W. Meyerhof. Insights into the binding of phenyltiocarbamide (ptc) agonist to its target human tas2r38 bitter receptor. PLoS ONE, 5(8):e12394, 2010. 21. M. Bicego, M. Cristani, V. Murino, E. Pekalska, and R.P.W. Duin. Clusteringbased construction of hidden markov models for generative kernels. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 466– 479, 2009. 22. M. Bicego, P. Lovato, A. Ferrarini, and M. Delledonne. Biclustering of expression microarray data with topic models. In Proc. of Int. Conference on Pattern Recognition (ICPR), pages 2728–2731, 2010. 23. M. 
Bicego, P. Lovato, B. Oliboni, and A. Perina. Expression microarray classification using topic models. In ACM symposium on applied computing (SAC), pages 1516–1520, 2010. 24. M. Bicego, P. Lovato, A. Perina, M. Fasoli, M. Delledonne, M. Pezzotti, A. Polverari, and V. Murino. Investigating topic models’ capabilities in expression microarray data classification. IEEE/ACM Tran. on Computational Biology and Bioinformatics, 9(6):1831–1836, 2012. 25. M. Bicego, A. Perina, V. Murino, A. Martins, P. Aguiar, and M. Figueiredo. Combining free energy score spaces with information theoretic kernels: Application to scene classification. In Proc. of Int. Conference on Image Processing (ICIP), pages 2661–2664, 2010. 26. M. Bicego, A. Ulaş, U. Castellani, A. Perina, V. Murino, A.F.T. Martins, P.M.Q. Aguiar, and M.A.T. Figueiredo. Combining information theoretic kernels with generative embeddings for classification. Neurocomputing, 101:161–169, 2013. 27. I. Bieche, F. Lerebours, S. Tozlu, M. Espie, M. Marty, and R. Lidereau. Molecular profiling of inflammatory breast cancer: Identification of a poor-prognosis gene expression signature. Clinical Cancer Research, 10(20):6789–6795, 2004. 28. C.M Bishop. Pattern recognition and machine learning. springer New York, 2006. 29. D. Blei and J. Lafferty. Correlated topic models. Advances in neural information processing systems, 18:147, 2006. References 147 30. D.M. Blei. Probabilistic topic models. Communications of ACM, 55(4):77–84, 2012. 31. D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. 32. W.D. Blizard. Multiset theory. Notre Dame Journal of formal logic, 30(1):36–66, 1988. 33. A. Bosch, A. Zisserman, and X. Munoz. Scene classification via plsa. In Proc. of European Conference on Computer Vision, volume 4, pages 517–530, 2006. 34. A-L. Boulesteix. Pls dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3(1), 2004. 35. A-L. Boulesteix and K. Strimmer. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in bioinformatics, 8(1):32–44, 2007. 36. A-L. Boulesteix, C. Strobl, T. Augustin, and M. Daumer. Evaluating microarraybased classifiers: an overview. Cancer Informatics, 6:77, 2008. 37. G. Brelstaff, M. Bicego, N. Culeddu, and M. Chessa. Bag of peaks: interpretation of nmr spectrometry. Bioinformatics, 25(2):258–264, 2009. 38. P.O. Brown and D. Botstein. Exploring the new world of the genome with dna microarrays. Nature Genetics, 21:33–37, 1999. 39. I. Budowski-Tal, Y. Nov, and R. Kolodny. Fragbag, an accurate representation of protein structure, retrieves structural neighbors from the entire pdb quickly and accurately. Proceedings of the National Academy of Sciences, 107(8):3481–3486, 2010. 40. B. Bufe, P.A.S. Breslin, C. Kuhn, D.R. Reed, C.D. Tharp, J.P. Slack, U-K. Kim, D. Drayna, and W. Meyerhof. The molecular basis of individual differences in phenylthiocarbamide and propylthiouracil bitterness perception. Current Biology, 15(4):322–327, 2005. 41. N.A. Campbell and J.B. Reece. Biology. Sixth edition. Pearson, 2002. 42. U. Castellani, A. Perina, V. Murino, M. Bellani, G. Rambaldelli, M. Tansella, and P. Brambilla. Brain morphometry by probabilistic latent semantic analysis. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010, pages 177–184. 2010. 43. C-C. Chang and C-J. Lin. LIBSVM: A library for support vector machines. 
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. 44. J. Chang, S. Gerrish, C. Wang, J.L. Boyd-graber, and D.M. Blei. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems 22, pages 288–296. 2009. 45. C-C. Chen and L.F. Lau. Functions and mechanisms of action of ccn matricellular proteins. The International Journal of Biochemistry and Cell Biology, 41(4):771– 783, 2009. 46. P-C. Chen, S-Y. Huang, W.J. Chen, and C.K. Hsiao. A new regularized least squares support vector regression for gene selection. BMC bioinformatics, 10(1):44, 2009. 47. G.A. Churchill. Fundamentals of experimental design for cdna microarrays. Nature Genetics, 32:490–495, 2002. 48. M. Cohn, N.A. Mitchison, W.E. Paul, A.M. Silverstein, D.W. Talmage, and M. Weigert. Reflections on the clonal-selection theory. Nature Reviews Immunology, 7(10):823–830, 2007. 49. M. Connors, J.A. Kovacs, S. Krevat, J.C. Gea-Banacloche, M.C. Sneller, M. Flanigan, J.A. Metcalf, R.E. Walker, J. Falloon, M. Baseler, R. Stevens, I. Feuerstein, H. Masur, and H.C. Lane. Hiv infection induces changes in cd4+ t-cell phenotype and depletions within the cd4+ t-cell repertoire that are not immediately restored by antiviral or immune-based therapies. Nature Medicine, 3:533–540, 1997. 148 References 50. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001. 51. The UniProt Consortium. Activities at the universal protein resource (uniprot). Nucleic Acids Research, 42(D1):D191–D198, 2014. 52. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995. 53. M. Cristani, A. Perina, U. Castellani, and V. Murino. Geo-located image analysis using latent representations. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. 54. G. Csurka, C.R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004. 55. O.G. Cula and K.J. Dana. Compact representation of bidirectional texture functions. In Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 1041–1047, 2001. 56. O. Dagliyan, F. Uney-Yuksektepe, I.H. Kavakli, and M. Turkay. Optimization based tumor classification from microarray gene expression data. PLoS One, 6(2):e14579, 2011. 57. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 886–893, 2005. 58. G. D’Amico, E.A. Korhonen, A. Anisimov, G. Zarkada, T. Holopainen, R. Hagerling, F. Kiefer, L. Eklund, R. Sormunen, H. Elamaa, R.A. Brekken, R.H. Adams, G.Y. Koh, P. Saharinen, and K. Alitalo. Tie1 deletion inhibits tumor growth and improves angiopoietin antagonist therapy. The Journal of Clinical Investigation, 124(2):824– 834, 2014. 59. M.M. Davis and P.J. Bjorkman. T-cell antigen receptor genes and t-cell recognition. Nature, 334(6181):395–402, 1988. 60. J. José del Coz, J. Diez, and A. Bahamonde. Learning nondeterministic classifiers. The Journal of Machine Learning Research, 10:2273–2293, 2009. 61. J.L. DeRisi, V.R. Iyer, and P.O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338):680–686, 1997. 62. S.M. Dhanasekaran, T.R. Barrette, D. Ghosh, R. Shah, S. Varambally, K. Kurachi, K.J. Pienta, M.A. Rubin, and A.M. Chinnaiyan. 
Delineation of prognostic biomarkers in prostate cancer. Nature, 412(6849):822–826, 2001. 63. C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02):185– 205, 2005. 64. Q. Dong, L. Lin, and X. Wang. Protein remote homology detection based on binary profiles. In Bioinformatics Research and Development, volume 4414 of Lecture Notes in Computer Science, pages 212–223. 2007. 65. Q. Dong, X. Wang, and L. Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285–290, 2006. 66. R.O. Duda, P.E. Hart, and D.G Stork. Pattern Classification (2nd Edition). Wiley Interscience, 2001. 67. S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American statistical association, 97(457):77–87, 2002. 68. T. Dunning. Statistical identification of language, 1994. 69. P.J. Eccles. An introduction to mathematical reasoning. Cambridge University Press, 1997. References 149 70. J. Eisenstein, A. Ahmed, and E.P. Xing. Sparse additive generative models of text. In ICML, 2011. 71. F.C. Ekmekcioglu, M.F. Lynch, A.M. Robertson, T.M.T. Sembok, and P. Willett. Comparison of ngram matching and stemming for term conflation in english, malay, and turkish texts. Text Technology, 6:1–14, 1996. 72. R.G. Fahmy, C.R. Dass, L-Q. Sun, C.N. Chesterman, and L.M. Khachigian. Transcription factor egr-1 supports fgf-dependent angiogenesis during neovascularization and tumor growth. Nature Medicine, 9(8):1026–1032, 2003. 73. A. Farahat and F. Chen. Improving probabilistic latent semantic analysis with principal component analysis. In EACL, 2006. 74. Alessandro Farinelli, Matteo Denitto, and Manuele Bicego. Biclustering of expression microarray data using affinity propagation. In Pattern Recognition in Bioinformatics, LNCS, pages 13–24. 2011. 75. D. Filliat. A visual bag of words method for interactive qualitative localization and mapping. In Proc. Int. Conference on Robotics and Automation (ICRA), 2007. 76. F. Finotello and B. Di Camillo. Measuring differential gene expression with rna-seq: challenges and strategies for data analysis. Briefings in functional genomics, page elu035, 2014. 77. E. Fisher. The influence of configuration on enzyme activity. Dtsch Chem Ges (Translated from German), 27:2984–2993, 1894. 78. R.A. Fisher. Statistical methods for research workers. Number 5. Genesis Publishing Pvt Ltd, 1936. 79. G. Fort and S. Lambert-Lacroix. Classification using partial least squares with penalized logistic regression. Bioinformatics, 21(7):1104–1111, 2005. 80. N.K. Fox, S.E. Brenner, and J-M. Chandonia. Scope: Structural classification of proteins - extended, integrating scop and astral data and classification of new structures. Nucleic Acids Research, 42(Database-Issue):304–309, 2014. 81. B. Frey and N. Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1–25, 2005. 82. S. Frintrop, E. Rome, and H.I. Christensen. Computational visual attention systems and their cognitive foundations: a survey. ACM Transactions on Appied Perceptions, 7(1):6:1–6:39, 2010. 83. T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. 
Bioinformatics, 16(10):906–914, 2000. 84. M.E. Garber, O.G. Troyanskaya, K. Schluens, S. Petersen, Z. Thaesler, M. PacynaGengelbach, M. van de Rijn, G.D. Rosen, C.M. Perou, and R.I. Whyte. Diversity of gene expression in adenocarcinoma of the lung. Proceedings of the National Academy of Sciences, 98(24):13784–13789, 2001. 85. L. Gerstein. Introduction to mathematical structures and proofs. Springer, 2012. 86. G. Getz, E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences, 97(22):12079– 12084, 2000. 87. T.R. Golub, D.K. Slonim, Pablo P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999. 88. M. Gribskov and N.L. Robinson. Use of receiver operating characteristic (roc) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25–33, 1996. 150 References 89. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003. 90. T. Handstad, A.J.H. Hestnes, and P. Saetrom. Motif kernel generated by genetic programming improves remote homology and fold detection. BMC Bioinformatics, 8(1), 2007. 91. X. Hang. Cancer classification by sparse representation using microarray gene expression data. In IEEE Int. Conf. on Bioinformatics and Biomeidcine Workshops (BIBMW), pages 174–177, 2008. 92. M.J. Heller. Dna microarray technology: devices, systems, and applications. Annual review of biomedical engineering, 4(1):129–153, 2002. 93. T. Hertz, D. Nolan, I. James, M. John, S. Gaudieri, E. Phillips, J.C. Huang, G. Riadi, S. Mallal, and N. Jojic. Mapping the landscape of host-pathogen coevolution: Hla class i binding and its relationship with evolutionary conservation in human and viral proteins. Journal of virology, 85(3):1310–1321, 2011. 94. T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, 2001. 95. Y. Hou, W. Hsu, M-L. Lee, and C. Bystroff. Efficient remote homology detection using local structure. Bioinformatics, 19(17):2294–2301, 2003. 96. J.C. Huang and N. Jojic. Variable selection through correlation sifting. In Research in Computational Molecular Biology, pages 106–123, 2011. 97. T. Jaakkola, M. Diekhans, and D. Haussler. Using the fisher kernel method to detect remote protein homologies. In ISMB, volume 99, pages 149–158, 1999. 98. T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of computational biology, 7(1-2):95–114, 2000. 99. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. pages 487–493, 1999. 100. A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: A review. ACM Computer Surveys, 31(3):264–323, 1999. 101. S. Jankelevich, B.U. Mueller, C.L. Mackall, S. Smith, S. Zwerski, L.V. Wood, S.L. Zeichner, L. Serchuck, S.M. Steinberg, R.P. Nelson, et al. Long-term virologic and immunologic responses in human immunodeficiency virus type 1-infected children treated with indinavir, zidovudine, and lamivudine. Journal of Infectious Diseases, 183(7):1116–1120, 2001. 102. D. Jardine, L. Cornel, and M. Emond. Gene expression analysis characterizes antemortem stress and has implications for establishing cause of death. Physiological genomics, 43(16):974–980, 2011. 
103. Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In Proc. of Int. Conference on Computer Vision (ICCV), pages 2407– 2414, 2011. 104. N. Jojic and A. Perina. Multidimensional counting grids: Inferring word order from disordered bags of words. In Uncertainty in Artificial Intelligence, pages 547–556, 2011. 105. N. Jojic, M. Reyes-Gomez, D. Heckerman, C. Kadie, and O. Schueler-Furman. Learning mhc ipeptide binding. Bioinformatics, 22(14):e227–e235, 2006. 106. I.K. Jordan, L. Marino-Ramirez, and E. Koonin. Evolutionary significance of gene expression divergence. Gene, 345(1):119–126, 2005. 107. J-I. Jun and L.F. Lau. Taking aim at the extracellular matrix: Ccn proteins as emerging therapeutic targets. Nature Reviews Drug Discovery, 10(12):945–963, 2011. 108. K. Karplus, C. Barrett, and R. Hughey. Hidden markov models for detecting remote protein homologies. Bioinformatics, 14(10):846–856, 1998. References 151 109. J.C. Kendrew, G. Bodo, H.M. Dintzis, R.G. Parrish, H. Wyckoff, and D.C. Phillips. A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature, 181(4610):662–666, 1958. 110. G. Kerr, H.J. Ruskin, M. Crane, and P. Doolan. Techniques for clustering gene expression data. Computers in biology and medicine, 38(3):283–293, 2008. 111. T.M. Khoshgoftaar, D.J. Dittman, R. Wald, and W. Awada. A review of ensemble classification for dna microarrays data. In IEEE Int. Conference on Tools with Artificial Intelligence (ICTAI), pages 381–389, 2013. 112. M. Khoshhali, A. Moslemi, M. Saidijam, J. Poorolajal, and H. Mahjub. Predicting the categories of colon cancer using microarray data and nearest shrunken centroid. Journal of Biostatistics and Epidemiology, 1(1), 2014. 113. P. Kiepiela, A.J. Leslie, I. Honeyborne, D. Ramduth, C. Thobakgale, S. Chetty, P. Rathnavalu, C. Moore, K.J. Pfafferott, and L. Hilton. Dominant influence of hlab in mediating the potential co-evolution of hiv and hla. Nature, 432(7018):769–775, 2004. 114. J.G. Kim, S.J. Lee, Y.S. Chae, B.W. Kang, Y.J. Lee, S.Y. Oh, M.C. Kim, K.H. Kim, and S.J. Kim. Association between phosphorylated amp-activated protein kinase and mapk3/1 expression and prognosis for patients with gastric cancer. Oncology, 85(2):78–85, 2013. 115. S. Kim, P. Georgiou, and S. Narayanan. Latent acoustic topic models for unstructured audio classification. APSIPA Tran. on Signal and Information Processing, 1:e6, 2012. 116. U-K. Kim, E. Jorgenson, H. Coon, M. Leppert, N. Risch, and D. Drayna. Positional cloning of the human quantitative trait locus underlying taste sensitivity to phenylthiocarbamide. Science, 299(5610):1221–1225, 2003. 117. J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998. 118. J.E. Krebs, B. Lewin, E.S. Goldstein, and S.T. Kilpatrick. Lewin’s essential genes. Jones and Bartlett Publishers, 2013. 119. L.I. Kuncheva. A stability index for feature selection. In Proc. of Int. Conference of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications (AIAP), pages 390–395, 2007. 120. J. Lasserre and C.M. Bishop. Generative or discriminative? getting the best of both worlds. Bayesian Statistics, 8:3–24, 2007. 121. C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. De Schaetzen, R. Duque, H. Bersini, and A. Nowe. A survey on filter techniques for feature selection in gene expression microarray analysis. 
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(4):1106–1119, 2012. 122. J.W. Lee, J.B. Lee, M. Park, and S.H. Song. An extensive comparison of recent classification tools applied to microarray data. Computational statistics and data analysis, 48(4):869–885, 2005. 123. K. Lee and D.P.W. Ellis. Audio-based semantic concept classification for consumer video. IEEE Tran. on Audio, Speech, and Language Processing, 18(6):1406–1416, 2010. 124. C.S. Leslie, E. Eskin, A. Cohen, J. Weston, and W.S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–476, 2004. 125. C.S. Leslie, E. Eskin, and W.S. Noble. The spectrum kernel: A string kernel for svm protein classification. In Proc. of Pacific Symposium on Biocomputing (PSB), pages 566–575, 2002. 126. B. Lewin and G. Dover. Genes V, volume 299. 1994. 127. X. Li and A. Godil. Investigating the bag-of-words method for 3d shape retrieval. EURASIP Journal on Advances in Signal Processing, (1):108130, 2010. 152 References 128. L. Liao and W.S. Noble. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology, 10(6):857–868, 2003. 129. M. Lienou, H. Maitre, and M. Datcu. Semantic annotation of satellite images using latent dirichlet allocation. IEEE Geoscience and Remote Sensing Letters, 7(1):28– 32, 2010. 130. L. Lin, Y. Shen, B. Liu, and X. Wang. Protein fold recognition and remote homology detection based on profile-level building blocks. In IEEE ICBECS, pages 1–5, 2010. 131. W-C. Lin, A.F.Y. Li, C-W. Chi, W-W. Chung, C.L. Huang, W-Y. Lui, H-J. Kung, and C-W. Wu. tie-1 protein tyrosine kinase: A novel independent prognostic marker for gastric cancer. Clinical Cancer Research, 5(7):1745–1751, 1999. 132. B. Liu, X. Wang, Q. Chen, Q. Dong, and X. Lan. Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS ONE, 7(9), 2012. 133. B. Liu, X. Wang, L. Lin, Q. Dong, and X. Wang. A discriminative method for protein remote homology detection and fold recognition combining top-n-grams and latent semantic analysis. BMC Bioinformatics, 9(1):510, 2008. 134. B. Liu, D. Zhang, R. Xu, J. Xu, X. Wang, Q. Chen, Q. Dong, and K-C. Chou. Combining evolutionary information extracted from frequency profiles with sequencebased kernels for protein remote homology detection. Bioinformatics, 30(4):472– 479, 2014. 135. H. Liu, L. Liu, and H. Zhang. Ensemble gene selection by grouping for microarray data classification. Journal of biomedical informatics, 43(1):81–87, 2010. 136. P. Lovato, M. Bicego, M. Cristani, N. Jojic, and A. Perina. Feature selection using counting grids: application to microarray data. In Proc. Int. Workshop on Statistical Techniques in Pattern Recognition (SPR2012), volume 7626 of LNCS, pages 629– 637, 2012. 137. P. Lovato, M. Bicego, M. Kesa, V. Murino, N. Jojic, and A. Perina. Traveling on discrete embeddings of gene expression. Bioinformatics, 2015. submitted. 138. P. Lovato, M. Cristani, and M. Bicego. Soft ngram representation and modeling for protein remote homology detection. IEEE/ACM Tran. on Computational Biology and Bioinformatics, 2015. submitted. 139. P. Lovato, A. Giorgetti, and M. Bicego. A multimodal approach to protein remote homology detection. http://f1000.com/posters/browse/summary/1097145, 2014. 140. P. Lovato, A. Giorgetti, and M. Bicego. A multimodal approach for protein remote homology detection. IEEE/ACM Tran. 
on Computational Biology and Bioinformatics, 2015. in press. 141. D.G. Lowe. Object recognition from local scale-invariant features. In Proc. Int. Conference on Computer Vision (ICCV), page 1150, 1999. 142. D. Lu, C.D. Wolfgang, and T. Hai. Activating transcription factor 3, a stressinducible gene, suppresses ras-stimulated tumorigenesis. Journal of Biological Chemistry, 281(15):10473–10481, 2006. 143. S.C. Madeira and A.L. Oliveira. Biclustering algorithms for biological data analysis: a survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1):24–45, 2004. 144. M.Z. Man, G. Dyson, K. Johnson, and B. Liao. Evaluating methods for classifying expression data. Journal of Biopharmaceutical statistics, 14(4):1065–1084, 2004. 145. C.D. Manning, P. Raghavan, and H. Schutze. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008. 146. A. Marchiori, L. Capece, A. Giorgetti, P. Gasparini, M. Behrens, P. Carloni, and W. Meyerhof. Coarse-grained/molecular mechanics of the tas2r38 bitter taste receptor: Experimentally-validated detailed structural prediction of agonist binding. PLoS ONE, 8(5):e64675, 2013. References 153 147. A. Martins, N.A. Smith, E.P. Xing, P.M.Q. Aguiar, and M.A.T. Figueiredo. Nonextensive information theoretic kernels on measures. The Journal of Machine Learning Research, 10:935–975, 2009. 148. J.D. Mcauliffe and D.M. Blei. Supervised topic models. In Advances in neural information processing systems, pages 121–128, 2008. 149. A.J. McMichael and Sarah L S.L. Rowland-Jones. Cellular immune responses to hiv. Nature, 410(6831):980–987, 2001. 150. L.M. Merino, J. Meng, S. Gordon, B.J. Lance, T. Johnson, V. Paul, K. Robbins, J.M. Vettel, and Y. Huang. A bag-of-words model for task-load prediction from eeg in complex environments. In ICASSP, pages 1227–1231, 2013. 151. A.J. Minn, G.P. Gupta, P.M. Siegel, P.D. Bos, W. Shu, D.D. Giri, A. Viale, A.B. Olshen, W.L Gerald, and J. Massague. Genes that mediate breast cancer metastasis to lung. Nature, 436(7050):518–524, 2005. 152. S. Moir, T-W. Chun, and A.S. Fauci. Pathogenic mechanisms of hiv disease. Annual Review of Pathology: Mechanisms of Disease, 6:223–248, 2011. 153. V. Moncho-Amor, I. Ibanez de Caceres, E. Bandres, B. Martinez-Poveda, J.L. Orgaz, I. Sanchez-Perez, S. Zazo, A. Rovira, J. Albanell, B. Jimenez, F. Rojo, C. Belda-Iniesta, J. Garcia-Foncillas, and R. Perona. Dusp1/mkp1 promotes angiogenesis, invasion and metastasis in non-small-cell lung cancer. Oncogene, 30(6):668– 678, 2011. 154. C.B. Moore, M. John, I.R. James, F.T. Christiansen, C.S. Witt, and S.A. Mallal. Evidence of hiv-1 adaptation to hla-restricted immune responses at a population level. Science, 296(5572):1439–1443, 2002. 155. G. Mori, S. Belongie, and J. Malik. Shape contexts enable efficient retrieval of similar shapes. In Proc. Int. Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 723–730, 2001. 156. S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443–453, 1970. 157. H. Nielsen, J. Engelbrecht, S. Brunak, and G. von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein engineering, 10(1):1–6, 1997. 158. K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using em. Machine Learning, 39(2-3):103–134, 2000. 159. J. Nikolich-Zugich, M.K. 
Slifka, and I. Messaoudi. The many important facets of t-cell repertoire diversity. Nature Reviews Immunology, 4(2):123–132, 2004. 160. B. Niu, L. Fu, S. Sun, and W. Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC bioinformatics, 11(1):187, 2010. 161. C.L. Nutt, D.R. Mani, R.A. Betensky, P. Tamayo, J.G. Cairncross, C. Ladd, U. Pohl, C. Hartmann, M.E. McLaughlin, T.T. Batchelor, P. Black, A. von Deimling, S. Pomeroy, T. Golub, and D. Louis. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer research, 63(7):1602–1607, 2003. 162. A. Osareh and B. Shadgar. Classification and diagnostic prediction of cancers using gene microarray data analysis. Journal of Applied Sciences, 9(3):459–468, 2009. 163. J.P. Overington, B. Al-Lazikani, and A.L. Hopkins. How many drug targets are there? Nature Reviews Drug Discovery, 5(12):993–996, 2006. 164. G. Paass, E. Leopold, M. Larson, J. Kindermann, and S. Eickeler. Svm classification using sequences of phonemes and syllables. In Principles of Data Mining and Knowledge Discovery, pages 373–384. 2002. 154 References 165. S. Pancoast and M. Akbacak. Bag-of-audio-words approach for multimedia event classification. In INTERSPEECH, 2012. 166. H. Pearson. Genetics: what is a gene? Nature, 441(7092):398–401, 2006. 167. K. Pearson. Contributions to the mathematical theory of evolution. ii. skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London, pages 343–414, 1895. 168. W.R. Pearson. Rapid and sensitive sequence comparison with fastp and fasta. Methods in enzymology, 183:63–98, 1990. 169. W.R. Pearson. An introduction to sequence similarity (“homology”) searching. Current protocols in bioinformatics, page 3.1, 2013. 170. H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005. 171. A. Perina, M. Bicego, U. Castellani, and V. Murino. Exploiting geometry in counting grids. In SIMBAD, pages 250–264, 2013. 172. A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic. A hybrid generative/discriminative classification framework based on free-energy terms. In Proc. of Int. Conference on Computer Vision (ICCV), pages 2058–2065, 2009. 173. A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic. Free energy score spaces: using generative information in discriminative classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1249–1262, 2012. 174. A. Perina, M. Cristani, U. Castellani, V. Murino, and N.Jojic. Free energy score space. In Advances in Neural Information Processing Systems, 2009. 175. A. Perina and N. Jojic. Image analysis by counting on a grid. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1985–1992, 2011. 176. A. Perina, N. Jojic, M. Bicego, and A. Truski. Documents as multiple overlapping windows into grids of counts. In Advances in Neural Processing Information Systems (NIPS), pages 10–18, 2013. 177. A. Perina, M. Kesa, and M. Bicego. Expression microarray data classification using counting grids and fisher kernel. In Proc. of Int. Conference on Pattern Recognition (ICPR), pages 1770–1775, 2014. 178. A. Perina, P. Lovato, M. Cristani, and M. Bicego. A comparison on score spaces for expression microarray data classification. In Proc. 
on Pattern Recognition in Bioinformatics (PRIB), pages 202–213. 2011.
179. A. Perina, P. Lovato, and N. Jojic. Bags of words models of epitope sets: Hiv viral load regression with counting grids. In Proc. Int. Pacific Symposium on Biocomputing (PSB), pages 288–299, 2014.
180. A. Perina, P. Lovato, V. Murino, and M. Bicego. Biologically-aware latent dirichlet allocation (balda) for the classification of expression microarray. In Pattern Recognition in Bioinformatics (PRIB), LNCS, pages 230–241. 2010.
181. M. Polesani, L. Bortesi, A. Ferrarini, A. Zamboni, M. Fasoli, C. Zadra, A. Lovato, M. Pezzotti, M. Delledonne, and A. Polverari. General and species-specific transcriptional responses to downy mildew infection in a susceptible (vitis vinifera) and a resistant (v. riparia) grapevine species. BMC genomics, 11(1):117, 2010.
182. S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M. McLaughlin, J.Y.H. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J. Allen, D. Zagzag, J. Olson, T. Curran, C. Wetmore, J. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. Louis, J. Mesirov, E. Lander, and T. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436–442, 2002.
183. A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.
184. B. Qian and R.A. Goldstein. Performance of an iterated t-hmm for homology detection. Bioinformatics, 20(14):2175–2180, 2004.
185. K.M. Quinn, B.L. Monroe, M. Colaresi, M.H. Crespin, and D.R. Radev. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science, 54(1):209–228, 2010.
186. G. Ramsay. Dna chips: State-of-the-art. Nature Biotechnology, 16(1):40–44, 1998.
187. H. Rangwala and G. Karypis. Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21(23):4239–4247, 2005.
188. N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G.R.G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proc. of Int. Conference on Multimedia, pages 251–260, 2010.
189. D.M. Raup. Taxonomic diversity estimation using rarefaction. Paleobiology, pages 333–342, 1975.
190. S. Resino, J.M. Bellon, D. Gurbindo, J.A. Leon, and M. Muñoz-Fernández. Recovery of t-cell subsets after antiretroviral therapy in hiv-infected children. European journal of clinical investigation, 33(7):619–627, 2003.
191. S. Resino, E. Seoane, A. Pérez, E. Ruiz-Mateos, M. Leal, and M. Muñoz-Fernández. Different profiles of immune reconstitution in children and adults with hiv-infection after highly active antiretroviral therapy. BMC infectious diseases, 6(1):112, 2006.
192. F. Revillion, V. Pawlowski, L. Hornez, and J.P. Peyrat. Glyceraldehyde-3-phosphate dehydrogenase gene expression in human breast cancer. European Journal of Cancer, 36(8):1038–1042, 2000.
193. M.J. Rodriguez-Colman, G. Reverter-Branchat, M.A. Sorolla, J. Tamarit, J. Ros, and E. Cabiscol. The forkhead transcription factor hcm1 promotes mitochondrial biogenesis and stress resistance in yeast. Journal of Biological Chemistry, 285(47):37092–37101, 2010.
194. S. Rogers, M. Girolami, C. Campbell, and R. Breitling. The latent process decomposition of cdna microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(2):143–156, 2005.
195. M. Ronaghi, M. Uhlén, and P. Nyrén. A sequencing method based on real-time pyrophosphate. Science, 281(5375):363–365, 1998.
196. D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, C. Rees, P. Spellman, V. Iyer, S.S. Jeffrey, M. Van de Rijn, M. Waltham, A. Pergamenschikov, J. Lee, D. Lashkari, D. Shalon, T. Myers, J. Weinstein, D. Botstein, and P. Brown. Systematic variation in gene expression patterns in human cancer cell lines. Nature genetics, 24(3):227–235, 2000.
197. T. Rossignol, L. Dulau, A. Julien, and B. Blondin. Genome-wide monitoring of wine yeast gene expression during alcoholic fermentation. Yeast, 20(16):1369–1385, 2003.
198. D.E. Sabatino, F. Mingozzi, D.J. Hui, H. Chen, P. Colosi, H.C.J. Ertl, and K.A. High. Identification of mouse aav capsid-specific cd8+ t cell epitopes. Molecular Therapy, 12(6):1023–1033, 2005.
199. Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
200. M. Sahlgren and R. Cöster. Using bag-of-concepts to improve the performance of support vector machines in text categorization. In Proceedings of the 20th International Conference on Computational Linguistics, page 487, 2004.
201. H. Saigo, J-P. Vert, T. Akutsu, and N. Ueda. Comparison of svm-based methods for remote homology detection. Genome Informatics, 13:396–397, 2002.
202. H. Saigo, J-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, 2004.
203. G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986.
204. G. Sandve and F. Drablos. A survey of motif discovery methods in an integrated framework. Biology Direct, 1(1):11, 2006.
205. G. Schwarz. Estimating the dimension of a model. The annals of statistics, 6(2):461–464, 1978.
206. M.M.K. Shahzad, J.M. Arevalo, G.N. Armaiz-Pena, C. Lu, R.L. Stone, M. Moreno-Smith, M. Nishimura, J-W. Lee, N.B. Jennings, J. Bottsford-Miller, P. Vivas-Mejia, S.K. Lutgendorf, G. Lopez-Berestein, M. Bar-Eli, S.W. Cole, and A.K. Sood. Stress effects on fosb- and interleukin-8 (il8)-driven ovarian cancer growth and metastasis. Journal of Biological Chemistry, 285(46):35462–35470, 2010.
207. J. Shankar, A. Messenberg, J. Chan, T.M. Underhill, L.J. Foster, and I.R. Nabi. Pseudopodial actin dynamics control epithelial-mesenchymal transition in metastatic cancer cells. Cancer Research, 70(9):3780–3790, 2010.
208. D. Shibata. Clonal diversity in tumor progression. Nature genetics, 38(4):402–403, 2006.
209. S. Shivashankar, S. Srivathsan, B. Ravindran, and A.V. Tendulkar. Multi-view methods for protein structure comparison using latent dirichlet allocation. Bioinformatics, 27(13):161–168, 2011.
210. D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J. Manola, C. Ladd, P. Tamayo, A.R. Renshaw, A.V. D’Amico, J.P. Richie, E.S. Lander, M. Loda, P.W. Kantoff, T.R. Golub, and W.R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2):203–209, 2002.
211. R. Singh, B. Raj, and P. Smaragdis. Latent-variable decomposition based dereverberation of monaural and multi-channel signals. In Int. Conf. on Acoustics Speech and Signal Processing (ICASSP), pages 1914–1917, 2010.
212. J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. In Proc. Int. Conference on Computer Vision (ICCV), volume 2, pages 1470–1477, 2003.
213. P. Smaragdis, B. Raj, and M. Shashanka. Missing data imputation for time-frequency representations of audio signals. Journal of signal processing systems, 65(3):361–370, 2011.
214. P. Smaragdis, M. Shashanka, and B. Raj. A sparse non-parametric approach for single channel separation of known sounds. In Advances in Neural Information Processing Systems, pages 1705–1713. 2009.
215. P. Smaragdis, M. Shashanka, and B. Raj. Topic models for audio mixture analysis. In NIPS Workshop on Applications for Topic Models: Text and Beyond, pages 1–4, 2009.
216. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981.
217. T.G. Smolinski, R. Buchanan, G.M. Boratyn, M. Milanova, and A.A. Prinz. Independent component analysis-motivated approach to classificatory decomposition of cortical evoked potentials. BMC bioinformatics, 7(Suppl. 2):S8, 2006.
218. M. De Souto, I. Costa, D. Araujo, T. Ludermir, and A. Schliep. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1):497, 2008.
219. A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631–643, 2005.
220. J.E. Staunton, D.K. Slonim, H.A. Coller, P. Tamayo, M.J. Angelo, J. Park, U. Scherf, J.K. Lee, W.O. Reinhold, J.N. Weinstein, J. Mesirov, E. Lander, and T. Golub. Chemosensitivity prediction by transcriptional profiling. Proceedings of the National Academy of Sciences, 98(19):10787–10792, 2001.
221. D. Stekel. Microarray bioinformatics. Cambridge University Press, 2003.
222. A.I. Su, J.B. Welsh, L.M. Sapinoso, S.G. Kern, P. Dimitrov, H. Lapp, P.G. Schultz, S.M. Powell, C.A. Moskaluk, H.F. Frierson, and G. Hampton. Molecular classification of human carcinomas by use of gene expression signatures. Cancer research, 61(20):7388–7393, 2001.
223. K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar. Mega5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular biology and evolution, 28(10):2731–2739, 2011.
224. A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. Handbook of computational molecular biology, 9(1-20):122–124, 2005.
225. J.D. Thompson, D.G. Higgins, and T.J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22):4673–4680, 1994.
226. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
227. K. Tokunaga, Y. Nakamura, K. Sakata, K. Fujimori, M. Ohkubo, K. Sawada, and S. Sakiyama. Enhanced expression of a glyceraldehyde-3-phosphate dehydrogenase gene in human lung cancers. Cancer research, 47(21):5616–5619, 1987.
228. S. Troup, C. Njue, E.V. Kliewer, M. Parisien, C. Roskelley, S. Chakravarti, P.J. Roughley, L.C. Murphy, and P.H. Watson. Reduced expression of the small leucine-rich proteoglycans, lumican, and decorin is associated with poor outcome in node-negative invasive breast cancer. Clinical Cancer Research, 9(1):207–214, 2003.
229. P. Valiant and G. Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems (NIPS), pages 2157–2165, 2013.
230. M. Varma and A. Zisserman. Classifying images of materials: achieving viewpoint and illumination independence. In Proc. European Conference on Computer Vision (ECCV), volume 3, pages 255–271, 2002.
231. S. Vinga and J. Almeida. Alignment-free sequence comparison – a review. Bioinformatics, 19(4):513–523, 2003.
232. C. Wang, D. Blei, and F-F. Li. Simultaneous image classification and annotation. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1903–1910, 2009.
233. L. Wang, J. Zhu, and H. Zou. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics, 24(3):412–419, 2008.
234. X. Wang and O. Gotoh. A robust gene selection method for microarray-based cancer classification. Cancer informatics, 9:15, 2010.
235. Z. Wang, M. Gerstein, and M. Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
236. R.L. Warren, J.D. Freeman, T. Zeng, G. Choe, S. Munro, R. Moore, J.R. Webb, and R.A. Holt. Exhaustive t-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome research, 21(5):790–797, 2011.
237. S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley, 1985.
238. S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular biology and evolution, 18(5):691–699, 2001.
239. World Health Organization (WHO). Number of deaths due to hiv/aids, 2013. http://www.who.int/gho/hiv/epidemicstatus/deaths/en/.
240. B. Wielockx, C. Libert, and C. Wilson. Matrilysin (matrix metalloproteinase-7): a new promising drug target in cancer and inflammation? Cytokine and Growth Factor Reviews, 15(23):111–115, 2004.
241. C.F.J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.
242. M. Xu, L-Y. Duan, J. Cai, L-T. Chia, C. Xu, and Q. Tian. Hmm-based audio keyword generation. In Advances in Multimedia Information Processing, pages 566–574. 2005.
243. S.H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha. Like like alike: joint friendship and interest propagation in social networks. In Proc. of the 20th International Conference on World Wide Web (WWW), WWW ’11, pages 537–546, 2011.
244. Z. Yang. Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods. Journal of Molecular evolution, 39(3):306–314, 1994.
245. L. Yu, Y. Han, and M.E. Berens. Stable gene selection from microarray data via sample weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9:262–272, 2012.
246. N. Yukinawa, S. Oba, K. Kato, and S. Ishii. Optimal aggregation of binary classifiers for multiclass cancer diagnosis using gene expression profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):333–343, 2009.
247. G.L. Zhang, H. Rahman Ansari, P. Bradley, G.C. Cawley, T. Hertz, X. Hu, N. Jojic, Y. Kim, O. Kohlbacher, O. Lund, C. Lundegaard, C.A. Magaret, M. Nielsen, H. Papadopoulos, G.P.S. Raghava, V-S. Tal, L.C. Xue, C. Yanover, S. Zhu, M.T. Rock, J.E. Crowe Jr., C. Panayiotou, M.M. Polycarpou, W. Duch, and V. Brusic. Machine learning competition in immunology – prediction of hla class i binding peptides. Journal of Immunological Methods, 374(1-2):1–4, 2011.
248. H. Zhang, C-Y. Yu, B. Singer, and M. Xiong. Recursive partitioning for tumor classification with gene expression microarray data. Proceedings of the National Academy of Sciences, 98(12):6730–6735, 2001.
249. Y-J. Zhang, H. Li, H-C. Wu, J. Shen, L. Wang, M-W. Yu, P-H. Lee, I.B. Weinstein, and R.M. Santella. Silencing of hint1, a novel tumor suppressor gene, by promoter hypermethylation in hepatocellular carcinoma. Cancer Letters, 275(2):277–284, 2009.
250. S. Zhu, D. Wang, K. Yu, T. Li, and Y. Gong. Feature selection for gene expression using model-based entropy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(1):25–36, 2010.

Sommario

Molti problemi di Pattern Recognition statistica sono stati affrontati nella letteratura recente attraverso la rappresentazione “bag of words”, una rappresentazione particolarmente appropriata quando negli oggetti del problema si riescono ad individuare dei semplici elementi “costituenti”. Mediante la rappresentazione bag of words, gli oggetti vengono caratterizzati da un vettore in cui ogni elemento conta il numero di occorrenze dei costituenti nell’oggetto. Nonostante il grande successo ottenuto in diversi campi della ricerca scientifica, tecniche e modelli basati su questa rappresentazione non sono ancora stati sfruttati appieno in Bioinformatica, a causa delle sfide metodologiche e applicative poste da questa specifica disciplina. Ciononostante, in questo contesto la rappresentazione bag of words sembra essere particolarmente appropriata: da un lato, numerosi problemi bioinformatici sono inerentemente posti attraverso meccanismi di conteggio; dall’altro, in molti scenari biologici la struttura degli oggetti che li caratterizzano è assente o sconosciuta, e uno dei maggiori svantaggi della rappresentazione bag of words (che non modella tale struttura) viene a cadere.

Questa tesi si inserisce nel contesto appena presentato, e promuove l’utilizzo della rappresentazione bag of words per caratterizzare oggetti e problemi in Bioinformatica e Biologia Computazionale. In questa tesi vengono investigate tutte le problematiche relative alla creazione di rappresentazioni e modelli bag of words per specifici problemi, e vengono proposte possibili soluzioni e approcci. In dettaglio, sono stati individuati ed analizzati in questa tesi tre specifici problemi bioinformatici: l’analisi dell’espressione genica, il modeling dell’infezione HIV, e l’identificazione di omologia remota fra proteine. Per ogni scenario sono state analizzate le motivazioni, i vantaggi, e le sfide poste dall’utilizzo di rappresentazioni e modelli bag of words, e sono state proposte diverse soluzioni.

I meriti degli approcci proposti sono stati dimostrati attraverso estese validazioni sperimentali, sia sfruttando benchmark ampiamente utilizzati in letteratura, sia utilizzando dati derivanti dall’interazione diretta con laboratori e gruppi di ricerca clinici/biologici. La conclusione raggiunta indica che gli approcci basati sulla rappresentazione bag of words possono avere un impatto determinante nelle comunità della Bioinformatica e Biologia Computazionale.