Open Challenges for Data Stream Mining Research
Georg Krempl (University Magdeburg, Germany), Indre Žliobaite (Aalto University and HIIT, Finland), Dariusz Brzeziński (Poznan U. of Technology, Poland), Eyke Hüllermeier (University of Paderborn, Germany), Mark Last (Ben-Gurion U. of the Negev, Israel), Vincent Lemaire (Orange Labs, France), Tino Noack (TU Cottbus, Germany), Ammar Shaker (University of Paderborn, Germany), Sonja Sievi (Astrium Space Transportation, Germany), Myra Spiliopoulou (University Magdeburg, Germany), Jerzy Stefanowski (Poznan U. of Technology, Poland)

ABSTRACT
Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are typically designed for well-behaved, controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion of eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

1. INTRODUCTION
The volumes of automatically generated data are constantly increasing. According to the Digital Universe Study [18], over 2.8 ZB of data were created and processed in 2012, with a projected increase of 15 times by 2020. This growth in the production of digital data results from our surrounding environment being equipped with more and more sensors. People carrying smartphones produce data, database transactions are counted and stored, and streams of data are extracted from virtual environments in the form of logs or user-generated content. A significant part of such data is volatile, which means it needs to be analyzed in real time as it arrives. Data stream mining is a research field that studies methods and algorithms for extracting knowledge from volatile streaming data [14; 5; 1]. Although data streams, online learning, big data, and adaptation to concept drift have become important research topics during the last decade, truly autonomous, self-maintaining, adaptive data mining systems are rarely reported. This paper identifies real-world challenges for data stream research that are important but as yet unsolved. Our objective is to present to the community a position paper that could inspire and guide future research in data streams. This article builds upon discussions at the International Workshop on Real-World Challenges for Data Stream Mining (RealStream)1 in September 2013, in Prague, Czech Republic.

Several related position papers are available. Dietterich [10] presents a discussion focused on predictive modeling techniques that are applicable to streaming and non-streaming data. Fan and Bifet [12] concentrate on challenges presented by large volumes of data. Zliobaite et al. [48] focus on concept drift and adaptation of systems during online operation. Gaber et al. [13] discuss ubiquitous data mining with attention to collaborative data stream mining.
In this paper, we focus on research challenges for streaming data inspired and required by real-world applications. In contrast to existing position papers, we raise issues connected not only with large volumes of data and concept drift, but also with such practical problems as privacy constraints, availability of information, and dealing with legacy systems. The scope of this paper is not restricted to algorithmic challenges; it aims at covering the full cycle of knowledge discovery from data (CRISP [40]), from understanding the context of the task, to data preparation, modeling, evaluation, and deployment. We discuss eight challenges: making models simpler, protecting privacy and confidentiality, dealing with legacy systems, stream preprocessing, timing and availability of information, relational stream mining, analyzing event data, and evaluation of stream mining algorithms. Figure 1 illustrates the positioning of these challenges in the CRISP cycle. Some of these apply to traditional (non-streaming) data mining as well, but they are critical in streaming environments. Along with further discussion of these challenges, we present our position on where the forthcoming research and development efforts should be directed to address them. In the remainder of the article, section 2 gives a brief introduction to data stream mining, sections 3–7 discuss the identified challenges, and section 8 highlights action points for future research.

1 http://sites.google.com/site/realstream2013

SIGKDD Explorations, Volume 16, Issue 1

Figure 1: CRISP cycle with data stream research challenges.

2. DATA STREAM MINING
Mining big data streams faces three principal challenges: volume, velocity, and volatility. Volume and velocity require a high volume of data to be processed in limited time. Starting from the first arriving instance, the amount of available data constantly increases from zero to potentially infinity.
This requires incremental approaches that incorporate information as it becomes available, and online processing if not all data can be kept [15]. Volatility, on the other hand, corresponds to a dynamic environment with ever-changing patterns. Here, old data is of limited use, even if it could be saved and processed again later. This is due to change, which can affect the induced data mining models in multiple ways: change of the target variable, change in the available feature information, and drift. Changes of the target variable occur, for example, in credit scoring, when the definition of the classification target “default” versus “non-default” changes due to business or regulatory requirements. Changes in the available feature information arise when new features become available, e.g. due to a new sensor or instrument. Similarly, existing features might need to be excluded due to regulatory requirements, or a feature might change in its scale, if data from a more precise instrument becomes available. Finally, drift is a phenomenon that occurs when the distributions of features x and target variables y change over time. The challenge posed by drift has been the subject of extensive research, thus we provide here solely a brief categorization and refer to recent surveys like [17]. In supervised learning, drift can affect the posterior P(y|x), the conditional feature P(x|y), the feature P(x), and the class prior P(y) distribution. The distinction based on which distribution is assumed to be affected, and which is assumed to be static, serves to assess the suitability of an approach for a particular task. It is worth noting that the problem of changing distributions is also present in unsupervised learning from data streams. A further categorization of drift can be made by:

• smoothness of concept transition: transitions between concepts can be sudden or gradual. The former is sometimes also denoted in the literature as shift or abrupt drift.
• systematic or unsystematic: in the former case, there are patterns in the way the distributions change that can be exploited to predict change and perform faster model adaptation. Examples are subpopulations that can be identified and show distinct, trackable evolutionary patterns. In the latter case, no such patterns exist and drift occurs seemingly at random. An example of the latter is fickle concept drift.

• real or virtual: while the former requires model adaptation, the latter corresponds to observing outliers or noise, which should not be incorporated into a model.

• singular or recurring contexts: in the former case, a model becomes obsolete once and for all when its context is replaced by a novel context. In the latter case, a model's context might reoccur at a later moment in time, for example due to a business cycle or seasonality; therefore, obsolete models might still regain value.

Stream mining approaches in general address the challenges posed by volume, velocity, and volatility of data. However, in real-world applications these three challenges often coincide with other, to date insufficiently considered ones. The next sections discuss eight identified challenges for data stream mining, providing illustrations with real-world application examples, and formulating suggestions for forthcoming research.

3. PROTECTING PRIVACY AND CONFIDENTIALITY
Data streams present new challenges and opportunities with respect to protecting privacy and confidentiality in data mining. Privacy-preserving data mining has been studied for over a decade (see, e.g., [3]). The main objective is to develop data mining techniques that do not uncover information or patterns which compromise confidentiality and privacy obligations. Modeling can be done on original or anonymized data, but when the model is released, it should not contain information that may violate privacy or confidentiality.
This is typically achieved by controlled distortion of sensitive data, by modifying the values or adding noise. Ensuring privacy and confidentiality is important for gaining the trust of users and society in autonomous stream data mining systems. While in offline data mining a human analyst working with the data can do a sanity check before releasing the model, in data stream mining privacy preservation needs to be done online. Several existing works relate to privacy preservation in publishing streaming data (e.g. [46]), but no systematic research in relation to broader data stream challenges exists. We identify two main challenges for privacy preservation in mining data streams.

The first challenge is incompleteness of information. Data arrives in portions and the model is updated online. Therefore, the model is never final and it is difficult to judge privacy preservation before seeing all the data. For example, suppose GPS traces of individuals are being collected for modeling the traffic situation, and suppose person A currently travels from the campus to the airport. The privacy of this person will be compromised if there are no similar trips by other persons in the very near future. However, near-future trips are unknown at the current time, when the model needs to be updated. On the other hand, data stream mining algorithms may have some inherent privacy preservation properties due to the fact that they do not need to see all the modeling data at once, and can be incrementally updated with portions of data. Investigating the privacy preservation properties of existing data stream algorithms is another interesting direction for future research.

The second important challenge for privacy preservation is concept drift. As data may evolve over time, fixed privacy preservation rules may no longer hold. For example, suppose winter comes, snow falls, and far fewer people commute by bike.
By knowing that a person comes to work by bike and having a set of GPS traces, it may not be possible to identify this person uniquely in summer, when there are many cyclists, but it may become possible in winter. Hence, an important direction for future research is to develop adaptive privacy preservation mechanisms that diagnose such situations and adapt themselves to preserve privacy under the new circumstances.

4. STREAMED DATA MANAGEMENT
Most of the data stream research concentrates on developing predictive models that address a simplified scenario, in which data is already pre-processed and completely and immediately available for free. However, successful business implementations depend strongly on the alignment of the used machine learning algorithms with both the business objectives and the available data. This section discusses often omitted challenges connected with streaming data.

4.1 Streamed Preprocessing
Data preprocessing is an important step in all real-world data analysis applications, since data comes from complex environments, may be noisy and redundant, and may contain outliers and missing values. Many standard procedures for preprocessing offline data are available and well established, see e.g. [33]; however, the data stream setting introduces new challenges that have not received sufficient research attention yet. While in traditional offline analysis data preprocessing is a once-off procedure, usually done by a human expert prior to modeling, in the streaming scenario manual processing is not feasible, as new data continuously arrives. Streaming data needs fully automated preprocessing methods that can optimize their parameters and operate autonomously. Moreover, preprocessing models need to update themselves automatically along with evolving data, in a similar way as predictive models for streaming data do. Furthermore, all updates of preprocessing procedures need to be synchronized with the subsequent predictive models; otherwise, after an update in preprocessing, the data representation may change and, as a result, the previously used predictive model may become useless. Except for some studies, mainly focusing on feature construction over data streams, e.g. [49; 4], no systematic methodology for data stream preprocessing is currently available.

As an illustrative example of the challenges related to data preprocessing, consider predicting traffic jams based on mobile sensing data. People using navigation services on mobile devices can opt to send anonymized data to the service provider. Service providers, such as Google, Yandex or Nokia, provide estimations and predictions of traffic jams based on this data. First, the data of each user is mapped to the road network, the speed of each user on each road segment of the trip is computed, data from multiple users is aggregated, and finally the current speed of the traffic is estimated. There are many data preprocessing challenges associated with this task. First, the noisiness of GPS data might vary depending on the location and the load of the telecommunication network. There may be outliers, for instance, if somebody stopped in the middle of a segment to wait for a passenger, or a car broke down. The number of pedestrians using mobile navigation may vary and require adaptive instance selection. Moreover, road networks may change over time, leading to changes in average speeds, in the number of cars, and even in car types (e.g. heavy trucks might be banned, new optimal routes emerge). All these issues require automated preprocessing actions before feeding the newest data to the predictive models.

The problem of preprocessing for data streams is challenging due to the nature of the data (continuously arriving and evolving). An analyst cannot know for sure what kind of data to expect in the future, and cannot deterministically enumerate possible actions. Therefore, not only the models, but also the preprocessing procedure itself needs to be fully automated. This research problem can be approached from several angles. One way is to look at existing predictive models for data streams and try to integrate them with selected data preprocessing methods (e.g. feature selection, outlier definition and removal). Another way is to systematically characterize the existing offline data preprocessing approaches, try to find a mapping between those approaches and problem settings in data streams, and extend preprocessing approaches for data streams in the same way as traditional predictive models have been extended for data stream settings. In either case, developing individual methods and a methodology for preprocessing of data streams would bridge an important gap in the practical applications of data stream mining.

4.2 Timing and Availability of Information
Most algorithms developed for evolving data streams make simplifying assumptions on the timing and availability of information. In particular, they assume that information is complete, immediately available, and received passively and for free. These assumptions often do not hold in real-world applications, e.g., patient monitoring, robot vision, or marketing [43]. This section is dedicated to the discussion of these assumptions and the challenges resulting from their absence. For some of these challenges, corresponding situations in offline, static data mining have already been addressed in the literature. We will briefly point out where a mapping of such known solutions to the online, evolving stream setting is easily feasible, for example by applying windowing techniques. However, we will focus on problems for which no such simple mapping exists and which are therefore open challenges in stream mining.
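As a minimal sketch of such a windowing-based mapping, consider wrapping an offline preprocessing step (here, mean imputation of missing values) in a sliding window, so that the imputed values track the evolving stream. The class and its name are our illustration, not a method from the literature cited above:

```python
from collections import deque

class WindowedImputer:
    """Imputes missing values with the mean of a recent window,
    so the imputation adapts as the stream drifts."""

    def __init__(self, window=100):
        self.window = deque(maxlen=window)  # only the most recent values are kept

    def process(self, x):
        if x is None:  # missing value: impute from recent history
            return sum(self.window) / len(self.window) if self.window else 0.0
        self.window.append(x)  # observed value: update the window, pass through
        return x

imp = WindowedImputer(window=3)
out = [imp.process(v) for v in [1.0, 2.0, 3.0, None, 10.0, 11.0, 12.0, None]]
# out == [1.0, 2.0, 3.0, 2.0, 10.0, 11.0, 12.0, 11.0]
```

Note how the same missing-value marker is imputed as 2.0 early in the stream but as 11.0 after the distribution has shifted; an offline mean imputer would average over the whole history instead.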
4.2.1 Handling Incomplete Information
Completeness of information assumes that the true values of all variables, that is, of the features and of the target, are eventually revealed to the mining algorithm. The problem of missing values, which corresponds to incompleteness of features, has been discussed extensively for offline, static settings. A recent survey is given in [45]. However, only few works address data streams, and in particular evolving data streams. Thus several open challenges remain, some of which are pointed out in the review [29]: how to address the problem that the frequency at which missing values occur is unpredictable, but largely affects the quality of imputations? How to (automatically) select the best imputation technique? How to proceed in the trade-off between speed and statistical accuracy? Another problem is that of missing values of the target variable. It has been studied extensively in the static setting as semi-supervised learning (SSL, see [11]). A requirement for applying SSL techniques to streams is the availability of at least some labeled data from the most recent distribution. While first attempts at this problem have been made, e.g. the online manifold regularization approach in [19] and the ensemble-based approach suggested by [11], improvements in speed and the provision of performance guarantees remain open challenges. A special case of incomplete information is “censored data” in Event History Analysis (EHA), which is described in section 5.2. A related problem discussed below is active learning (AL, see [38]).

4.2.2 Dealing with Skewed Distributions
Class imbalance, where the class prior probability of the minority class is small compared to that of the majority class, is a frequent problem in real-world applications like fraud detection or credit scoring. This problem has been well studied in the offline setting (see e.g.
[22] for a recent book on that subject), and has also been studied to some extent in the online, stream-based setting (see [23] for a recent survey). However, among the few existing stream-based approaches, most do not pay attention to drift of the minority class, and as [23] pointed out, a more rigorous evaluation of these algorithms on real-world data is yet to be done.

4.2.3 Handling Delayed Information
Latency means that information becomes available with significant delay. For example, in the case of so-called verification latency, the value of the preceding instance's target variable does not become available before the subsequent instance has to be predicted. On evolving data streams, this is more than a mere problem of streaming data integration between feature and target streams, as due to concept drift patterns show temporal locality [2]. It means that feedback on the current prediction is not available to improve the subsequent predictions, but will only eventually become available for much later predictions. Thus, there is no recent sample of labeled data at all that corresponds to the most recent unlabeled data, and semi-supervised learning approaches are not directly applicable. A related problem in static, offline data mining is that addressed by unsupervised transductive transfer learning (or unsupervised domain adaptation): given labeled data from a source domain, a predictive model is sought for a related target domain in which no labeled data is available. In principle, ideas from transfer learning could be used to address latency in evolving data streams, for example by employing them in a chunk-based approach, as suggested in [43]. However, adapting them for use in evolving data streams has not been tried yet and constitutes a non-trivial, open task, as adaptation in streams must be fast and fully automated and thus cannot rely on iterated careful tuning by human experts.
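To make the effect of verification latency concrete, the following toy harness runs a test-then-train loop in which each label arrives only `delay` steps after its instance. The harness, the `Majority` baseline, and all names are our illustration, not an algorithm from the works cited above:

```python
from collections import deque

class Majority:
    """Toy baseline: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def delayed_prequential(stream, model, delay):
    """Prequential (test-then-train) accuracy when each label arrives
    `delay` time steps after its instance had to be predicted."""
    pending = deque()  # instances whose labels have not arrived yet
    correct = total = 0
    for x, y in stream:
        total += 1
        correct += int(model.predict(x) == y)  # predict without recent feedback
        pending.append((x, y))
        if len(pending) > delay:  # the label from `delay` steps ago arrives now
            model.update(*pending.popleft())
    return correct / total

acc = delayed_prequential([(None, 1)] * 10, Majority(), delay=2)
# acc == 0.7: the first three predictions are made before any feedback arrives
```

Even on this trivially stationary stream, latency alone costs accuracy; under drift, the model additionally keeps training on labels from an outdated distribution.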
Furthermore, consecutive chunks constitute several domains, thus the transitions between several subsequent chunks might provide exploitable patterns of systematic drift. This idea has been introduced in [27], and a few so-called drift-mining algorithms that identify and exploit such patterns have been proposed since then. However, the existing approaches cover only a very limited set of possible drift patterns and scenarios. • uncertainty regarding convergence: in contrast to learning in static contexts, due to drift there is no guarantee that with additional labels the difference between model and reality narrows down. This leaves the formulation of suitable stop criteria a challenging open issue. • necessity of perpetual validation: even if there has been convergence due to some temporary stability, the learned hypotheses can get invalidated at any time by subsequent drift. This can affect any part of the feature space and is not necessarily detectable from unlabeled data. Thus, without perpetual validation the mining algorithm might lock itself to a wrong hypothesis without ever noticing. • temporal budget allocation: the necessity of perpetual validation raises the question of optimally allocating the labeling budget over time. • performance bounds: in the case of drifting posteriors, no theoretical work exists that provides bounds for errors and label requests. However, deriving such bounds will also require assuming some type of systematic drift. The task of active feature acquisition, where one has to actively select among costly features, constitutes another open challenge on evolving data streams: in contrast to the static, offline setting, the value of a feature is likely to change with its drifting distribution. 5. MINING ENTITIES AND EVENTS Conventional stream mining algorithms learn over a single stream of arriving entities. 
In subsection 5.1, we introduce the paradigm of entity stream mining, where the entities constituting the stream are linked to instances (structured pieces of information) from further streams. Model learning in this paradigm involves the incorporation of the streaming information into the stream of entities; learning tasks include cluster evolution, migration of entities from one state to another, classifier adaptation as entities re-appear with another label than before. Then, in subsection 5.2, we investigate the special case where entities are associated with the occurrence of events. Model learning then implies identifying the moment of occurrence of an event on an entity. This scenario might be seen as a special case of entity stream mining, since an event can be seen as a degenerate instance consisting of a single value (the event’s occurrence). 5.1 Entity Stream Mining 4.2.4 Active Selection from Costly Information The challenge of intelligently selecting among costly pieces of information is the subject of active learning research. Active streambased selective sampling [38] describes a scenario, in which instances arrive one-by-one. While the instances’ feature vectors are provided for free, obtaining their true target values is costly, and the definitive decision whether or not to request this target value must be taken before proceeding to the next instance. This corresponds to a data stream, but not necessarily to an evolving one. As a result, only a small subset of stream-based selective sampling algorithms is suited for non-stationary environments. To make things worse, many contributions do not state explicitly whether they were designed for drift, neither do they provide experimental evaluations on such evolving data streams, thus leaving the reader the arduous task to assess their suitability for evolving streams. A first, recent attempt to provide an overview on the existing active learning strategies for evolving data streams is given in [43]. 
The challenges for active learning posed by evolving data streams are: SIGKDD Explorations Let T be a stream of entities, e.g. customers of a company or patients of a hospital. We observe entities over time, e.g. on a company’s website or at a hospital admission vicinity: an entity appears and re-appears at discrete time points, new entities show up. At a time point t, an entity e ∈ T is linked with different pieces of information - the purchases and ratings performed by a customer, the anamnesis, the medical tests and the diagnosis recorded for the patient. Each of these information pieces ij (t) is a structured record or an unstructured text from a stream Tj , linked to e via the foreign key relation. Thus, the entities in T are in 1-to-1 or 1-to-n relation with entities from further streams T1 , . . . , Tm (stream of purchases, stream of ratings, stream of complaints etc). The schema describing the streams T, T1 , . . . , Tm can be perceived as a conventional relational schema, except that it describes streams instead of static sets. In this relational setting, the entity stream mining task corresponds to learning a model ζT over T , thereby incorporating information from the adjoint streams T1 , . . . , Tm that ”feed” the entities in T . Volume 16, Issue 1 Page 4 Albeit the members of each stream are entities, we use the term ”entity” only for stream T – the target of learning, while we denote the entities in the other streams as ”instances”. In the unsupervised setting, entity stream clustering encompasses learning and adapting clusters over T , taking account the other streams that arrive at different speeds. In the supervised setting, entity stream classification involves learning and adapting a classifier, notwithstanding the fact that an entity’s label may change from one time point to the next, as new instances referencing it arrive. entities, thus the challenges pertinent to stream mining also apply here. 
One of these challenges, and one much discussed in the context of big data, is volatility. In relational stream mining, volatility refers to the entity itself, not only to the stream of instances that reference the entities. Finally, an entity is ultimately big data by itself, since it is described by multiple streams. Hence, next to the problem of dealing with new forms of learning and new aspects of drift, the subject of efficient learning and adaption in the Big Data context becomes paramount. 5.1.1 Challenges of Aggregation 5.2 Analyzing Event Data The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each time point t the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of the individual streams? How to learn over the streams efficiently? Answering these questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation. The information referencing a relational entity cannot be held perpetually for learning, hence aggregation of the arriving streams is necessary. Information aggregation over time-stamped data is traditionally practiced in document stream mining, where the objective is to derive and adapt content summaries on learned topics. Content summarization on entities, which are referenced in the document stream, is studied by Kotov et al., who maintain for each entity the number of times it is mentioned in the news [26]. In such studies, summarization is a task by itself. Aggregation of information for subsequent learning is a bit more challenging, because summarization implies information loss - notably information about the evolution of an entity. 
Hassani and Seidl monitor health parameters of patients, modeling the stream of recordings on a patient as a sequence of events [21]: the learning task is then to predict forthcoming values. Aggregation with selective forgetting of past information is proposed in [25; 42] in the classification context: the former method [25] slides a window over the stream, while the latter [42] forgets entities that have not appeared for a while, and summarizes the information in frequent itemsets, which are then used as new features for learning.

5.1.2 Challenges of Learning

Even if information aggregation over the streams T1, ..., Tm is performed intelligently, entity stream mining still calls for more than conventional stream mining methods. The reason is that entities of stream T re-appear in the stream and evolve. In particular, in the unsupervised setting, an entity may be linked to conceptually different instances at each time point, e.g. reflecting a customer's change in preferences. In the supervised setting, an entity may change its label; for example, a customer's affinity to risk may change in response to market changes or to changes in family status. This corresponds to entity drift, i.e. a new type of drift beyond the conventional concept drift pertaining to model ζT. Hence, how should entity drift be traced, and how should the interplay between entity drift and model drift be captured? In the unsupervised setting, Oliveira and Gama learn and monitor clusters as states of evolution [32], while [41] extend that work to learn Markov chains that mark the entities' evolution. As pointed out in [32], these states are not necessarily predefined – they must be the subject of learning. In [43], we report on further solutions to the entity evolution problem and to the problem of learning with forgetting over multiple streams and over the entities referenced by them. Conventional concept drift also occurs when learning a model over …

Events are an example of data that occur often yet are rarely analyzed in the stream setting. In static environments, events are usually studied through event history analysis (EHA), a statistical method for modeling and analyzing the temporal distribution of events related to specific objects in the course of their lifetime [9]. More specifically, EHA is interested in the duration before the occurrence of an event or, in the recurrent case (where the same event can occur repeatedly), the duration between two events. The notion of an event is completely generic and may indicate, for example, the failure of an electrical device. The method is perhaps even better known as survival analysis, a term that originates from applications in medicine, in which an event is the death of a patient and survival time is the time period between the beginning of the study and the occurrence of this event. EHA can also be considered as a special case of entity stream mining described in section 5.1, because the basic statistical entities in EHA are monitored objects (or subjects), typically described in terms of feature vectors x ∈ R^n, together with their survival time s. Then, the goal is to model the dependence of s on x. A corresponding model provides hints at possible cause-effect relationships (e.g., what properties tend to increase a patient's survival time) and, moreover, can be used for predictive purposes (e.g., what is the expected survival time of a patient). Although one might be tempted to approach this modeling task as a standard regression problem with input (regressor) x and output (response) s, it is important to notice that the survival time s is normally not observed for all objects. Indeed, the problem of censoring plays an important role in EHA and occurs in different facets. In particular, it may happen that some of the objects survived till the end of the study at time t_end (also called the cut-off point). They are censored or, more specifically, right censored, since t_event has not been observed for them; instead, it is only known that t_event > t_end. In snapshot monitoring [28], the data stream may be sampled multiple times, resulting in a new cut-off point for each snapshot. Unlike standard regression analysis, EHA is specifically tailored for analyzing event data of that kind. It is built upon the hazard function as a basic mathematical tool.

5.2.1 Survival function and hazard rate

Suppose the time of occurrence of the next event (since the start or the last event) for an object x is modeled as a real-valued random variable T with probability density function f(· | x). The hazard function or hazard rate h(· | x) models the propensity of the occurrence of an event, that is, the marginal probability of an event to occur at time t, given that no event has occurred so far:

h(t | x) = f(t | x) / S(t | x) = f(t | x) / (1 − F(t | x)),

where S(· | x) is the survival function and F(· | x) the cumulative distribution of f(· | x). Thus,

F(t | x) = P(T ≤ t) = ∫_0^t f(u | x) du

is the probability of an event to occur before time t. Correspondingly, S(t | x) = 1 − F(t | x) is the probability that the event did not occur until time t (the survival probability). It can hence be used to model the probability of the right-censoring of the time for an event to occur. A simple example is the Cox proportional hazard model [9], in which the hazard rate is constant over time; thus, it does depend on the feature vector x = (x1, ..., xn) but not on time t.
More specifically, the hazard rate is modeled as a log-linear function of the features xi:

h(t | x) = λ(x) = exp(x^T β).

The model is proportional in the sense that increasing xi by one unit increases the hazard rate λ(x) by a factor of αi = exp(βi). For this model, one easily derives the survival function S(t | x) = exp(−λ(x) · t) and an expected survival time of 1/λ(x).

5.2.2 EHA on data streams

Although the temporal nature of event data naturally fits the data stream model and, moreover, event data is naturally produced by many data sources, EHA has been considered in the data stream scenario only very recently. In [39], the authors propose a method for analyzing earthquake and Twitter data, namely an extension of the above Cox model based on a sliding window approach. The authors of [28] modify standard classification algorithms, such as decision trees, so that they can be trained on a snapshot stream of both censored and non-censored data. As in the case of clustering [35], where one distinguishes between clustering observations and clustering data sources, two different settings can be envisioned for EHA on data streams:

1. In the first setting, events are generated by multiple data sources (representing monitored objects), and the features pertain to these sources; thus, each data source is characterized by a feature vector x and produces a stream of (recurrent) events. For example, data sources could be users in a computer network, and an event occurs whenever a user sends an email.

2. In the second setting, events are produced by a single data source, but now the events themselves are characterized by features. For example, events might be emails sent by an email server, and each email is represented by a certain set of properties.

Statistical event models on data streams can be used in much the same way as in the case of static data.
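As a small numerical illustration of the constant-hazard Cox model above (a hedged sketch: the coefficients, feature values, and cut-off point are made up for illustration, not fitted to any data):

```python
import math

# Hypothetical constant-hazard Cox model (all numbers are illustrative).
beta = [0.5, -1.0]   # assumed regression coefficients
x = [1.0, 0.2]       # assumed feature vector of one monitored object

lam = math.exp(sum(xi * bi for xi, bi in zip(x, beta)))  # hazard rate lambda(x)

def survival(t):
    """S(t | x) = exp(-lambda(x) * t): probability of no event up to time t."""
    return math.exp(-lam * t)

expected_time = 1.0 / lam           # expected duration until the next event
p_right_censored = survival(10.0)   # P(t_event > t_end) for a cut-off t_end = 10
```

Increasing x[0] by one unit multiplies lam by exp(beta[0]), which is exactly the proportionality property described above.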
For example, they can serve predictive purposes, i.e., to answer questions such as "How much time will elapse before the next email arrives?" or "What is the probability of receiving more than 100 emails within the next hour?". What is specifically interesting, however, and indeed distinguishes the data stream setting from the static case, is the fact that the model may change over time. This is a subtle aspect, because the hazard model h(t | x) itself may already be time-dependent; here, however, t is not the absolute time but the duration time, i.e., the time elapsed since the last event. A change of the model is comparable to concept drift in classification, and means that the way in which the hazard rate depends on time t and on the features xi changes over time. For example, consider the event "increase of a stock rate" and suppose that βi = log(2) for a binary feature xi indicating the energy sector in the above Cox model (which, as already mentioned, does not depend on t). This feature thus doubles the hazard rate and hence halves the expected duration between two events. Needless to say, however, this influence may change over time, depending on how well the energy sector is doing. Dealing with model changes of that kind is clearly an important challenge for event analysis on data streams. Although the problem is to some extent addressed by the works mentioned above, there is certainly scope for further improvement, and for using these approaches to derive predictive models from censored data. Besides, there are many other directions for future work. For example, since the detection of events is a main prerequisite for analyzing them, the combination of EHA with methods for event detection [36] is an important challenge.
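To make the event detection step concrete, consider a deliberately naive online detector (our own illustrative sketch, unrelated to the specific method of [36]): flag an event whenever a reading deviates strongly from the recent window mean.

```python
from collections import deque

def detect_events(stream, window=5, threshold=3.0):
    """Flag an 'event' at index i when the value deviates from the mean of
    the previous `window` readings by more than `threshold` standard
    deviations. Both parameters are arbitrary illustrative choices."""
    recent = deque(maxlen=window)
    events = []
    for i, value in enumerate(stream):
        if len(recent) == recent.maxlen:
            mean = sum(recent) / len(recent)
            std = (sum((v - mean) ** 2 for v in recent) / len(recent)) ** 0.5
            if std > 0 and abs(value - mean) / std > threshold:
                events.append(i)
        recent.append(value)
    return events

# Steady readings with one sudden spike at index 7:
signal = [10, 10, 11, 10, 9, 10, 11, 50, 10, 10]
events = detect_events(signal)  # flags the spike at index 7
```

Even this toy detector illustrates how sensitive detection is to the window and threshold choices, and how a single outlier distorts subsequent statistics.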
Indeed, this problem is often far from trivial, and in many cases, events (such as frauds, for example) can only be detected with a certain time delay; dealing with delayed events is therefore another important topic, which was also discussed in section 4.2.

6. EVALUATION OF DATA STREAM ALGORITHMS

All of the aforementioned challenges are milestones on the road to better algorithms for real-world data stream mining systems. To verify whether these challenges are met, practitioners need tools capable of evaluating newly proposed solutions. Although such tools exist in the field of static classification, they are insufficient in data stream environments due to problems such as: concept drift, limited processing time, verification latency, multiple stream structures, evolving class skew, censored data, and changing misclassification costs. In fact, the myriad of additional complexities posed by data streams makes algorithm evaluation a highly multi-criterial task, in which optimal trade-offs may change over time. Recent developments in applied machine learning [6] emphasize the importance of understanding the data one is working with and using evaluation metrics which reflect its difficulties. As mentioned before, data streams set new requirements compared to traditional data mining, and researchers are beginning to acknowledge the shortcomings of existing evaluation metrics. For example, Gama et al. [16] proposed a way of calculating classification accuracy using only the most recent stream examples, therefore allowing for time-oriented evaluation and aiding concept drift detection. Methods which test a classifier's robustness to drifts and noise on a practical, experimental level are also starting to appear [34; 47]. However, all these evaluation techniques focus on single criteria such as prediction accuracy or robustness to drifts, even though data streams make evaluation a constant trade-off between several criteria [7].
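The time-oriented accuracy of Gama et al. [16] can be sketched roughly as follows (an illustrative sketch of the idea of evaluating on recent examples only, not their exact estimator; the window size and toy data are made up):

```python
from collections import deque

def windowed_accuracy_curve(predictions, labels, window=3):
    """Prequential-style accuracy over the most recent `window` examples,
    so that old successes stop dominating the estimate after a drift."""
    recent = deque(maxlen=window)
    curve = []
    for y_hat, y in zip(predictions, labels):
        recent.append(1 if y_hat == y else 0)
        curve.append(sum(recent) / len(recent))
    return curve

# Toy stream: the model is right at first, then a drift makes it wrong.
preds  = [1, 1, 1, 1, 1, 1]
labels = [1, 1, 1, 0, 0, 0]
curve = windowed_accuracy_curve(preds, labels, window=3)
```

The windowed estimate falls to 0 after the drift, whereas accuracy computed over the whole stream would remain at 0.5 and mask the change.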
Moreover, in data stream environments there is a need for more advanced tools for visualizing changes in algorithm predictions over time. The difficulty of creating complex evaluation methods for stream mining algorithms lies mainly in the size and evolving nature of data streams. It is much more difficult to estimate and visualize, for example, prediction accuracy if evaluation must be done online, using limited resources, while the classification task changes with time. In fact, an algorithm's ability to adapt is another aspect which needs to be evaluated, although the information needed to perform such evaluation is not always available. Concept drifts are known in advance mainly when using synthetic or benchmark data, while in more practical scenarios the occurrences and types of concepts are not directly known and only the label of each arriving instance is available. Moreover, in many cases the task is more complicated, as labeling information is not instantly available. Other difficulties in evaluation include processing complex relational streams and coping with class imbalance when class distributions evolve with time. Finally, not only do we need measures for evaluating single aspects of stream mining algorithms, but also ways of combining several of these aspects into global evaluation models, which would take into account expert knowledge and user preferences. Clearly, evaluation of data stream algorithms is a fertile ground for novel theoretical and algorithmic solutions. In terms of prediction measures, data stream mining still requires evaluation tools that are immune to class imbalance and robust to noise. In our opinion, solutions to this problem should involve not only metrics based on performance relative to baseline (chance) classifiers, but also graphical measures similar to PR-curves or cost curves. Furthermore, there is a need for integrating information about concept drifts in the evaluation process.
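When true drift points are known (e.g., on synthetic data), drift-aware criteria such as detection delay, missed drifts, and false alarms can be computed directly. A minimal sketch (the matching rule and tolerance are our own illustrative choices, not an established protocol):

```python
def drift_detection_stats(true_drifts, detections, tolerance=100):
    """Match each true drift point (instance index) to the first detection
    within `tolerance` instances; report delays, misses, and false alarms."""
    unmatched = sorted(detections)
    delays, missed = [], 0
    for drift in sorted(true_drifts):
        hit = next((d for d in unmatched if drift <= d <= drift + tolerance), None)
        if hit is None:
            missed += 1
        else:
            delays.append(hit - drift)
            unmatched.remove(hit)
    return {"delays": delays, "missed": missed, "false_alarms": len(unmatched)}

stats = drift_detection_stats(true_drifts=[1000, 5000],
                              detections=[1020, 3000, 5090])
# One detection per drift (delays 20 and 90), plus one false alarm at 3000.
```

Such counts could then feed the graphical, online drift-versus-accuracy views discussed below.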
As mentioned earlier, possible ways of considering concept drifts will depend on the information that is available. If true concepts are known, algorithms could be evaluated based on: how often they detect drift, how early they detect it, how they react to it, and how quickly they recover from it. Moreover, in this scenario, the evaluation of an algorithm should depend on whether it takes place during drift or during times of concept stability. A possible way of tackling this problem would be to propose graphical methods, similar to ROC analysis, which would work online and visualize concept drift measures alongside prediction measures. Additionally, these graphical measures could take into account the state of the stream, for example, its speed, number of missing values, or class distribution. Similar methods could be proposed for scenarios where concepts are not known in advance; however, in these cases measures should be based on drift detectors or label-independent stream statistics. Above all, due to the number of aspects which need to be measured, we believe that the evaluation of data stream algorithms requires a multi-criterial view. This could be done by drawing inspiration from multiple criteria decision analysis, where trade-offs between criteria are achieved using user feedback. In particular, a user could express his or her preferences between criteria (for example, memory consumption, accuracy, reactivity, self-tuning, and adaptability) by deciding between alternative algorithms for a given data stream. It is worth noticing that such a multi-criterial view on evaluation is difficult to encapsulate in a single number, as is usually done in traditional offline learning. This might suggest that researchers in this area should turn towards semi-qualitative and semi-quantitative evaluation, for which systematic methodologies should be developed. Finally, a separate research direction involves rethinking the way we test data stream mining algorithms.
The traditional train, cross-validate, test workflow of classification is not applicable to sequential data, which makes, for instance, parameter tuning much more difficult. Similarly, ground truth verification in unsupervised learning is practically impossible in data stream environments. With these problems in mind, it is worth noting that there is still a shortage of real and synthetic benchmark datasets. Such a situation might be a result of non-uniform standards for testing algorithms on streaming data. As a community, we should decide on such matters as: What characteristics should benchmark datasets have? Should they have prediction tasks attached? Should we move towards online evaluation tools rather than datasets? These questions should be answered in order to solve evaluation issues in controlled environments before we create measures for real-world scenarios.

7. FROM ALGORITHMS TO DECISION SUPPORT SYSTEMS

While a lot of algorithmic methods for data streams are already available, their deployment in real applications with real streaming data presents a new dimension of challenges. This section points out two such challenges: making models simpler and dealing with legacy systems.

7.1 Making models simpler, more reactive, and more specialized

In this subsection, we discuss aspects like the simplicity of a model, its proper combination of offline and online components, and its customization to the requirements of the application domain. As an application example, consider the French Orange Portal (www.orange.fr), which registers millions of visits daily. Most of these visitors are only known through anonymous cookie IDs. For all of these visitors, the portal has the ambition to provide specific and relevant contents as well as displaying ads for targeted audiences.
Using information about visits to the portal, the questions are: which parts of the portal does each cookie visit, and when; which contents did it consult; which advertisements were sent; and when (if at all) were they clicked? All this information generates hundreds of gigabytes of data each week. A user profiling system needs a back-end part that preprocesses the information required at the input of a front-end part, which computes appetency to advertising (for example) using stream mining techniques (in this case a supervised classifier). Since the ads to display change regularly, based on marketing campaigns, extensive parameter tuning is infeasible, as one has to react quickly to change. Currently, these tasks are either solved using bandit methods from game theory [8], which impairs adaptation to drift, or done offline in big data systems, resulting in slow reactivity.

7.1.1 Minimizing parameter dependence

Adaptive predictive systems are intrinsically parametrized. In most cases, setting or tuning these parameters is a difficult task, which in turn negatively affects the usability of these systems. Therefore, it is strongly desirable for the system to have as few user-adjustable parameters as possible. Unfortunately, the state of the art does not produce methods with trustworthy or easily adjustable parameters. Moreover, many predictive modeling methods use a lot of parameters, rendering them particularly impractical for data stream applications, where models are allowed to evolve over time, and input parameters often need to evolve as well. The process of predictive modeling encompasses fitting of parameters on a training dataset and subsequently selecting the best model, either by heuristics or principled methods. Recently, model selection methods have been proposed that do not require internal cross-validation, but rather use the Bayesian machinery to design regularizers with data-dependent priors [20].
However, they are not yet applicable to data streams, as their computational time complexity is too high and they require all examples to be kept in memory.

7.1.2 Combining offline and online models

Online and offline learning are mostly considered as mutually exclusive, but it is their combination that might enhance the value of data the most. Online learning, which processes instances one-by-one and builds models incrementally, has the virtue of being fast, both in the processing of data and in the adaptation of models. Offline (or batch) learning has the advantage of allowing the use of more sophisticated mining techniques, which might be more time-consuming or require a human expert. While the first allows the processing of "fast data" that requires real-time processing and adaptivity, the second allows processing of "big data" that requires longer processing time and larger abstraction. Their combination can take place in many steps of the mining process, such as the data preparation and preprocessing steps. For example, offline learning on big data could extract fundamental and sustainable trends from the data using batch processing and massive parallelism. Online learning could then take real-time decisions from online events to optimize an immediate pay-off. In the online advertisement application mentioned above, the user-click prediction is done within a context, defined for example by the currently viewed page and the profile of the cookie. The decision which banner to display is made online, but the context can be preprocessed offline. By deriving meta-information such as "the profile is a young male, the page is from the sport cluster", the offline component can ease the online decision task.

7.1.3 Solving the right problem

Domain knowledge may help to solve many issues raised in this paper, by systematically exploiting particularities of application domains. However, this is seldom considered, as typical data stream methods are created to deal with a large variety of domains. For instance, in some domains the learning algorithm receives only partial feedback upon its prediction, i.e. a single bit of right-or-wrong, rather than the true label. In the user-click prediction example, if a user does not click on a banner, we do not know which one would have been correct, but solely that the displayed one was wrong. This is related to the issues on timing and availability of information discussed in section 4.2. However, building predictive models that systematically incorporate domain knowledge or domain-specific information requires choosing the right optimization criteria. As mentioned in section 6, the data stream setting requires optimizing multiple criteria simultaneously, as optimizing only predictive performance is not sufficient. We need to develop learning algorithms which minimize an objective function that intrinsically and simultaneously includes: memory consumption, predictive performance, reactivity, self-monitoring and tuning, and (explainable) auto-adaptivity. Data streams research is lacking methodologies for forming and optimizing such criteria. Therefore, models should be simple so that they do not depend on a set of carefully tuned parameters. Additionally, they should combine offline and online techniques to address the challenges of big and fast data, and they should solve the right problem, which might consist in solving a multi-criteria optimization task. Finally, they have to be able to learn from a small amount of data and with low variance [37], to react quickly to drift.

7.2 Dealing with Legacy Systems

In many application environments, such as financial services or health care systems, business-critical applications are in operation for decades. Since these applications produce massive amounts of data, it becomes very promising to process these amounts of data by real-time stream mining approaches. However, it is often impossible to change existing infrastructures in order to introduce fully fledged stream mining systems. Rather than changing existing infrastructures, approaches are required that integrate stream mining techniques into legacy systems. In general, problems concerning legacy systems are domain-specific and encompass both technical and procedural issues. In this section, we analyze challenges posed by a specific real-world application with legacy issues — the ISS Columbus spacecraft module.

7.2.1 ISS Columbus

Spacecraft are very complex systems, exposed to very different physical environments (e.g. space), and associated to ground stations. These systems are under constant and remote monitoring by means of telemetry and commands. The ISS Columbus module has been in operation for more than 5 years. For some time, it has been pointed out that the monitoring process is not as efficient as previously expected [30]. However, we assume that data stream mining can make a decisive contribution to enhance and facilitate the required monitoring tasks. We are currently planning to use the ISS Columbus module as a technology demonstrator for integrating data stream processing and mining into the existing monitoring processes [31]. Figure 2 exemplifies the failure management system (FMS) of the ISS Columbus module, comprising: 1. the ISS Columbus module; 2. the ground control centre; 3. the engineering support centre; 4. the mission archive; and 5. the assembly, integration, and test facility. While it is impossible to simply redesign the FMS from scratch, we can outline the following challenges.

Figure 2: ISS Columbus FMS

7.2.2 Complexity

Even though spacecraft monitoring is very challenging by itself, it becomes increasingly difficult and complex due to the integration of data stream mining into such legacy systems. However, this integration is expected to enhance and facilitate the current monitoring processes. Thus, appropriate mechanisms are required to integrate data stream mining into the current processes while decreasing complexity.

7.2.3 Interlocking

As depicted in Figure 2, the ISS Columbus module is connected to ground instances. Real-time monitoring must be applied aboard, where computational resources are restricted (e.g. processor speed, memory, or power consumption). Near real-time monitoring or long-term analysis must be applied on-ground, where the downlink suffers from latencies because of the long transmission distance, is subject to bandwidth limitations, and is continuously interrupted due to loss of signal. Consequently, new data stream mining mechanisms are necessary which ensure a smooth interlocking of aboard and ground instances.

7.2.4 Reliability and Balance

The reliability of spacecraft is indispensable for the astronauts' health and mission success. Accordingly, spacecraft pass very long and expensive planning and testing phases. Hence, potential data stream mining algorithms must ensure reliability, and the integration of such algorithms into legacy systems must not cause critical side effects. Furthermore, data stream mining is an automatic process which neglects interactions with human experts, while spacecraft monitoring is a semi-automatic process in which human experts (e.g. the flight control team) are responsible for decisions and consequent actions. This problem poses the following question: how can data stream mining be integrated into legacy systems when automation needs to be increased but the human expert needs to be kept in the loop? Abstract discussions on this topic are provided by expert systems [44] and the MAPE-K reference model [24].
Expert systems aim to combine human expertise with artificial expertise, and the MAPE-K reference model aims to provide an autonomic control loop. A balance must be struck which considers both aforementioned aspects appropriately. Overall, the Columbus study has shown that extending legacy systems with real-time data stream mining technologies is feasible and an important area for further stream mining research.

8. CONCLUDING REMARKS

In this paper, we discussed research challenges for data streams, originating from real-world applications. We analyzed issues concerning privacy, availability of information, relational and event streams, preprocessing, model complexity, evaluation, and legacy systems. The discussed issues were illustrated by practical applications including GPS systems, Twitter analysis, earthquake predictions, customer profiling, and spacecraft monitoring. The study of real-world problems highlighted shortcomings of existing methodologies and showcased previously unaddressed research issues. Consequently, we call on the data stream mining community to consider the following action points for data stream research:

• developing methods for ensuring privacy with incomplete information as data arrives, while taking into account the evolving nature of data;

• considering the availability of information by developing models that handle incomplete, delayed and/or costly feedback;

• taking advantage of relations between streaming entities;

• developing event detection methods and predictive models for censored data;

• developing a systematic methodology for streamed preprocessing;

• creating simpler models through multi-objective optimization criteria, which consider not only accuracy, but also computational resources, diagnostics, reactivity, and interpretability;

• establishing a multi-criteria view towards evaluation, dealing with the absence of ground truth about how data changes;

• developing online monitoring systems, ensuring the reliability of any updates, and balancing the distribution of resources.

As our study shows, there are challenges in every step of the CRISP data mining process. To date, modeling over data streams has been viewed and approached as an extension of traditional methods. However, our discussion and application examples show that in many cases it would be beneficial to step aside from building upon existing offline approaches, and to start from scratch by considering what is required in the stream setting.

Acknowledgments

We would like to thank the participants of the RealStream2013 workshop at ECMLPKDD2013 in Prague, and in particular Bernhard Pfahringer and George Forman, for suggestions and discussions on the challenges in stream mining. Part of this work was funded by the German Research Foundation, projects SP 572/11-1 (IMPRINT) and HU 1284/5-1, the Academy of Finland grant 118653 (ALGODAN), and the Polish National Science Center grants DEC-2011/03/N/ST6/00360 and DEC-2013/11/B/ST6/00963.

9. REFERENCES

[1] C. Aggarwal, editor. Data Streams: Models and Algorithms. Springer, 2007.

[2] C. Aggarwal and D. Turaga. Mining data streams: Systems and algorithms. In Machine Learning and Knowledge Discovery for Engineering Systems Health Management, pages 4–32. Chapman and Hall, 2012.

[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD Rec., 29(2):439–450, 2000.

[4] C. Anagnostopoulos, N. Adams, and D. Hand. Deciding what to observe next: Adaptive variable selection for regression in multivariate data streams. In Proc. of the 2008 ACM Symp. on Applied Computing, SAC, pages 961–965, 2008.

[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS, pages 1–16, 2002.

[6] C. Brodley, U. Rebbapragada, K. Small, and B. Wallace. Challenges and opportunities in applied machine learning. AI Magazine, 33(1):11–24, 2012.

[7] D. Brzezinski and J. Stefanowski. Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Trans. on Neural Networks and Learning Systems, 25:81–94, 2014.

[8] D. Chakrabarti, R. Kumar, F. Radlinski, and E. Upfal. Mortal multi-armed bandits. In Proc. of the 22nd Conf. on Neural Information Processing Systems, NIPS, pages 273–280, 2008.

[9] D. Cox and D. Oakes. Analysis of Survival Data. Chapman & Hall, London, 1984.

[10] T. Dietterich. Machine-learning research. AI Magazine, 18(4):97–136, 1997.

[11] G. Ditzler and R. Polikar. Semi-supervised learning in nonstationary environments. In Proc. of the 2011 Int. Joint Conf. on Neural Networks, IJCNN, pages 2741–2748, 2011.

[12] W. Fan and A. Bifet. Mining big data: current status, and forecast to the future. SIGKDD Explorations, 14(2):1–5, 2012.

[13] M. Gaber, J. Gama, S. Krishnaswamy, J. Gomes, and F. Stahl. Data stream mining in ubiquitous environments: state-of-the-art and current directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(2):116–138, 2014.

[14] M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review. SIGMOD Rec., 34(2):18–26, 2005.

[15] J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 2010.

[16] J. Gama, R. Sebastiao, and P. Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3):317–346, 2013.

[17] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept-drift adaptation. ACM Computing Surveys, 46(4), 2014.

[18] J. Gantz and D. Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east, December 2012.

[19] A. Goldberg, M. Li, and X. Zhu. Online manifold regularization: A new learning setting and empirical study. In Proc. of the European Conf. on Machine Learning and Principles of Knowledge Discovery in Databases, ECMLPKDD, pages 393–407, 2008.

[20] I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the bayesian/frequentist divide. Journal of Machine Learning Research, 11:61–87, 2010.

[21] M. Hassani and T. Seidl. Towards a mobile health context prediction: Sequential pattern mining in multiple streams. In Proc. of the IEEE Int. Conf. on Mobile Data Management, MDM, pages 55–57, 2011.

[22] H. He and Y. Ma, editors. Imbalanced Learning: Foundations, Algorithms, and Applications. IEEE, 2013.

[23] T. Hoens, R. Polikar, and N. Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89–101, 2012.

[24] IBM. An architectural blueprint for autonomic computing. Technical report, IBM, 2003.

[25] E. Ikonomovska, K. Driessens, S. Dzeroski, and J. Gama. Adaptive windowing for online learning from multiple inter-related data streams. In Proc. of the 11th IEEE Int. Conf. on Data Mining Workshops, ICDMW, pages 697–704, 2011.

[26] A. Kotov, C. Zhai, and R. Sproat. Mining named entities with temporally correlated bursts from multilingual web news streams. In Proc. of the 4th ACM Int. Conf. on Web Search and Data Mining, WSDM, pages 237–246, 2011.

[27] G. Krempl. The algorithm APT to classify in concurrence of latency and drift. In Proc. of the 10th Int. Conf. on Advances in Intelligent Data Analysis, IDA, pages 222–233, 2011.

[28] M. Last and H. Halpert. Survival analysis meets data stream mining. In Proc. of the 1st Worksh. on Real-World Challenges for Data Stream Mining, RealStream, pages 26–29, 2013.

[29] F. Nelwamondo and T. Marwala. Key issues on computational intelligence techniques for missing data imputation – a review. In Proc. of World Multi Conf. on Systemics, Cybernetics and Informatics, volume 4, pages 35–40, 2008.

[30] E. Noack, W. Belau, R. Wohlgemuth, R. Müller, S. Palumberi, P. Parodi, and F. Burzagli. Efficiency of the Columbus failure management system. In Proc. of the AIAA 40th Int. Conf. on Environmental Systems, 2010.

[31] E. Noack, A. Luedtke, I. Schmitt, T. Noack, E. Schaumlöffel, E. Hauke, J. Stamminger, and E. Frisk. The Columbus module as a technology demonstrator for innovative failure management. In German Air and Space Travel Congress, 2012.

[32] M. Oliveira and J. Gama. A framework to monitor clusters evolution applied to economy and finance problems. Intelligent Data Analysis, 16(1):93–111, 2012.

[33] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., 1999.

[34] T. Raeder and N. Chawla. Model monitor (M2): Evaluating, comparing, and monitoring models. Journal of Machine Learning Research, 10:1387–1390, 2009.

[35] P. Rodrigues and J. Gama. Distributed clustering of ubiquitous data streams. WIREs Data Mining and Knowledge Discovery, pages 38–54, 2013.

[36] T. Sakaki, M. Okazaki, and Y. Matsuo. Tweet analysis for real-time event detection and earthquake reporting system development. IEEE Trans. on Knowledge and Data Engineering, 25(4):919–931, 2013.

[37] C. Salperwyck and V. Lemaire. Learning with few examples: An empirical study on leading classifiers. In Proc. of the 2011 Int. Joint Conf. on Neural Networks, IJCNN, pages 1010–1019, 2011.

[38] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers, 2012.

[39] A. Shaker and E. Hüllermeier. Survival analysis on data streams: Analyzing temporal events in dynamically changing environments. Int. Journal of Applied Mathematics and Computer Science, 24(1):199–212, 2014.

[40] C. Shearer. The CRISP-DM model: the new blueprint for data mining. J Data Warehousing, 2000.

[41] Z. Siddiqui, M. Oliveira, J. Gama, and M. Spiliopoulou. Where are we going? Predicting the evolution of individuals. In Proc. of the 11th Int. Conf. on Advances in Intelligent Data Analysis, IDA, pages 357–368, 2012.

[42] Z. Siddiqui and M. Spiliopoulou. Classification rule mining for a stream of perennial objects. In Proc. of the 5th Int. Conf. on Rule-based Reasoning, Programming, and Applications, RuleML, pages 281–296, 2011.

[43] M. Spiliopoulou and G. Krempl. Tutorial "Mining multiple threads of streaming data". In Proc. of the Pacific-Asia Conf. on Knowledge Discovery and Data Mining, PAKDD, 2013.

[44] D. Waterman. A Guide to Expert Systems. Addison-Wesley, 1986.

[45] W. Young, G. Weckman, and W. Holland. A survey of methodologies for the treatment of missing values within datasets: limitations and benefits. Theoretical Issues in Ergonomics Science, 12, January 2011.

[46] B. Zhou, Y. Han, J. Pei, B. Jiang, Y. Tao, and Y. Jia. Continuous privacy preserving publishing of data streams. In Proc. of the 12th Int. Conf. on Extending Database Technology, EDBT, pages 648–659, 2009.

[47] I. Zliobaite. Controlled permutations for testing adaptive learning models. Knowledge and Information Systems, in press, 2014.

[48] I. Zliobaite, A. Bifet, M. Gaber, B. Gabrys, J. Gama, L. Minku, and K. Musial. Next challenges for adaptive learning systems. SIGKDD Explorations, 14(1):48–55, 2012.

[49] I. Zliobaite and B. Gabrys. Adaptive preprocessing for streaming data. IEEE Trans. on Knowledge and Data Engineering, 26(2):309–321, 2014.
SIGKDD Explorations, Volume 16, Issue 1

Twitter Analytics: A Big Data Management Perspective

Oshini Goonetilleke†, Timos Sellis†, Xiuzhen Zhang†, Saket Sathe§
† School of Computer Science and IT, RMIT University, Melbourne, Australia
§ IBM Melbourne Research Laboratory, Australia
{oshini.goonetilleke, timos.sellis, xiuzhen.zhang}@rmit.edu.au, [email protected]

ABSTRACT
With the inception of the Twitter microblogging platform in 2006, a myriad of research efforts have emerged studying different aspects of the Twittersphere. Each study exploits its own tools and mechanisms to capture, store, query and analyze Twitter data. Inevitably, platforms have been developed to replace this ad-hoc exploration with a more structured and methodological form of analysis. Another body of literature focuses on developing languages for querying tweets. This paper addresses issues around the big data nature of Twitter and emphasizes the need for new data management and query language frameworks that address the limitations of existing systems. We review existing approaches that were developed to facilitate Twitter analytics, followed by a discussion on research issues and technical challenges in developing integrated solutions.
1. INTRODUCTION
The massive growth of data generated from social media sources has resulted in a growing interest in efficient and effective means of collecting, analyzing and querying large volumes of social data. The existing platforms exploit several characteristics of big data: large volumes of data, velocity due to the streaming nature of the data, and variety due to the integration of data from the web and other sources. Hence, social network data presents an excellent testbed for research on big data.

In particular, the online social networking and microblogging platform Twitter has seen exponential growth in its user base since its inception in 2006, with now over 200 million monthly active users producing 500 million tweets (the Twittersphere, the postings made to Twitter) daily¹. A wide research community has been established since then with the hope of understanding interactions on Twitter. For example, studies have been conducted in many domains exploring different perspectives of understanding human behavior. Prior research focuses on a variety of topics including opinion mining [12, 14, 38], event detection [46, 65, 76], spread of pandemics [26, 58, 68], celebrity engagement [74] and analysis of political discourse [28, 40, 70]. These types of efforts have enabled researchers to understand interactions on Twitter related to the fields of journalism, education, marketing, disaster relief, etc.

The systems that perform analysis in the context of these interactions typically involve the following major components: focused crawling, data management and data analytics. Here, data management comprises information extraction, pre-processing, data modeling and query processing components. Figure 1 shows a block diagram of such a system and depicts the interactions between its various components.

¹ http://tnw.to/s0n9u
Until now, there has been a significant amount of prior research on improving each of the components shown in Figure 1, but to the best of our knowledge there have been no frameworks that propose a unified approach to Twitter data management that seamlessly integrates all these components. Following these observations, in this paper we extensively survey the techniques that have been proposed for realising each of the components shown in Figure 1, and then motivate the need for, and the challenges of, a unified framework for managing Twitter data.

Figure 1: An abstraction of a Twitter data management platform

In our survey of existing literature we observed ways in which researchers have tried to develop general platforms to provide a repeatable foundation for Twitter data analytics. We review the tweet analytics space primarily focusing on the following key elements:

• Data Collection. Researchers have several options for collecting a suitable data set from Twitter. In Section 2 we briefly describe mechanisms and tools that focus primarily on facilitating the initial data acquisition phase. These tools systematically capture the data using any of Twitter's publicly accessible APIs.
• Data management frameworks. In addition to providing a module for crawling tweets, these frameworks provide support for pre-processing, information extraction and/or visualization capabilities. Pre-processing deals with preparing tweets for data analysis. Information extraction aims to derive more insight from the tweets than is directly reported by the Twitter API, e.g. a sentiment for a given tweet. Several frameworks provide functionality to present the results in many output forms of visualizations, while others are built exclusively to search over a large tweet collection. In Section 3 we review existing data management frameworks.

• Languages for querying tweets. A growing body of literature proposes declarative query languages as a mechanism for extracting structured information from tweets. Languages present end users with a set of primitives beneficial in exploring the Twittersphere along different dimensions. In Section 4 we investigate declarative languages and similar systems developed for querying a variety of tweet properties.

We have identified the essential ingredients for a unified Twitter data management solution, with the intention that an analyst will easily be able to extend its capabilities for specific types of research. Such a solution will allow the data analyst to focus on the use cases of the analytics task by conveniently using the functionality provided by an integrated framework. In Section 5 we present our position and emphasize the need for integrated solutions that address the limitations of existing systems. Section 6 outlines research issues and challenges associated with the development of integrated platforms. Finally, we conclude in Section 7.
Section 6 outlines research Researchers have several optionswith when an API issues and challenges associated thechoosing development of for indata collection, i.e. Finally the Search, Streaming and the7.REST tegrated platforms. we conclude in Section API. Each API has varying capabilities with respect to the type and the amount of information that can be retrieved. 2. OPTIONS FOR DATA COLLECTION The Search API is dedicated for running searches against an Researchers have several when choosing an with API the for index of recent tweets. It options takes keywords as queries data collection, i.e. the Search, Streaming the REST possibility of multiple queries combined as aand comma sepaAPI. API has varying capabilities with respect to the rated Each list. A request to the search API returns a collection type and the amount of information that can be retrieved. of relevant tweets matching a user query. The APIAPI is dedicated runningtosearches against an The Search Streaming providesfor a stream continuously capindex of recent tweets. It takes keywords as queries with the ture the public tweets where parameters are provided to possibility of multiple as keywords, a comma sepafilter the results of the queries stream combined by hashtags, twitrated list. A request to the search API returns collection ter user ids, usernames or geographic regions. aThe REST of relevant matching query.of the most recent API can betweets used to retrievea auser fraction The Streaming stream continuously tweets publishedAPI by aprovides Twitter auser. All to three APIs limitcapthe ture the public tweets where parameters are provided are to number of requests within a time window and rate-limits filter the the stream by hashtags, posed at results the userof and the application level.keywords, Responsetwitobter userfrom ids, Twitter usernames regions. Theformat. 
REST tained APIorisgeographic generally in the JSON 2retrieve a fraction of the most recent API can be used to are available in many programming Third party libraries tweets published by a Twitter user. API. All three limitprothe languages for accessing the Twitter TheseAPIs libraries number of requests time window and rate-limitsand are vide wrappers and within providea methods for authentication posed at the user and the application level. Response obother functions to conveniently access the API. tained from Twitter API generally in the JSONcoverage format. Publicly available APIs doisnot guarantee complete 2 are available in many programming Third party libraries of the data for a given query as the feeds are not designed languages for accessing theexample, Twitter API. These libraries profor enterprise access. For the streaming API only vide wrappers and provide for authentication and provides a random sample methods of 1% (known as the Spritzer other functions to conveniently access the API. 2 https://dev.twitter.com/docs/twitter-libraries Publicly available APIs do not guarantee complete coverage of the data for a given query as the feeds are not designed for enterprise access. For example, the streaming API only provides a random sample of 1% (known as the Spritzer 2 https://dev.twitter.com/docs/twitter-libraries SIGKDD Explorations stream) of the public Twitter stream in real-time. Applications where this rate limitation is too restrictive rely on third party resellers like GNIP, DataSift or Topsy 3 , who provides access to the entire collection of the tweets known as the Twitter FireHose. At a cost, resellers can provide unlimstream) of the public Twitter streamdata, in real-time. ited access to archives of historical real-timeApplicastreamtions where this rate limitation is too restrictive rely on third ing data or both. 
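Since every API response arrives as JSON, the first step in any collection tool is flattening the returned objects into the fields an analysis needs. A minimal Python sketch follows; the payload below is a simplified, hypothetical tweet object, not the full schema returned by the API.

```python
import json

# A simplified, hypothetical tweet payload; real API responses
# carry many more fields than are shown here.
raw = '''{
  "id_str": "123",
  "text": "Monitoring #flood reports near Brisbane @example_agency",
  "user": {"screen_name": "example_user"},
  "entities": {
    "hashtags": [{"text": "flood"}],
    "user_mentions": [{"screen_name": "example_agency"}]
  }
}'''

def parse_tweet(payload: str) -> dict:
    """Flatten the fields an analysis task typically needs."""
    tweet = json.loads(payload)
    return {
        "id": tweet["id_str"],
        "text": tweet["text"],
        "author": tweet["user"]["screen_name"],
        "hashtags": [h["text"] for h in tweet["entities"]["hashtags"]],
        "mentions": [m["screen_name"] for m in tweet["entities"]["user_mentions"]],
    }

record = parse_tweet(raw)
```

Downstream components can then work with `record` without re-parsing the raw JSON.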
In order to obtain a dataset sufficient for an analysis task, it is necessary to efficiently query the respective API methods within the bounds of the imposed rate limits. The requests to the API may have to run continuously, spanning several days or weeks. Creating the users' social graph for a community of interest requires additional modules that crawl user accounts iteratively. Large crawls with more complete coverage were made possible with the use of whitelisted accounts [21, 45] and by using the computation power of cloud computing [55]. Due to Twitter's current policy, whitelisted accounts are discontinued and are no longer an option as a means of large-scale data collection. Distributed systems have been developed [16, 45] to make continuously running, large scale crawls feasible. There are other solutions that provide extended features to process the incoming Twitter data, as discussed next.
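Such an iterative user-account crawl can be sketched as a breadth-first expansion over follower lists. The `fetch_followers` callable below stands in for a rate-limited REST API call; the names and pacing are illustrative, not any particular system's implementation.

```python
import time
from collections import deque

def crawl_followers(seed_users, fetch_followers, max_users=100, pause=0.0):
    """Breadth-first user expansion: repeatedly pull follower lists
    and enqueue newly seen accounts. `pause` spaces out requests to
    stay within imposed rate limits."""
    seen, queue = set(seed_users), deque(seed_users)
    edges = []  # (user, follower) pairs forming the social graph
    while queue and len(seen) < max_users:
        user = queue.popleft()
        for follower in fetch_followers(user):
            edges.append((user, follower))
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
        time.sleep(pause)
    return edges

# Toy stand-in for the REST API call:
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
edges = crawl_followers(["a"], lambda u: graph.get(u, []))
```

In a real crawler `fetch_followers` would page through the follower-list endpoint and back off when the rate-limit window is exhausted.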
Due Twitters policy, whitelisted extended to to process the current incoming Twitter data as accounts discontinued and are no longer an option as discussed are next. means of large data collection. Distributed systems have been developed [16, 45] to make continuously running, large 3. MANAGEMENT scale DATA crawls feasible. There are otherFRAMEWORKS solutions that provide extended features to process the incoming Twitter data as 3.1 Focused discussed next. Crawlers The focus in studies like TwitterEcho [16] and Byun et al. [19] is data collection, where the primary contributions 3. DATA MANAGEMENT FRAMEWORKS are driven by crawling strategies for effective retrieval and better coverage. TwitterEcho describes an open source dis3.1 Focused Crawlers tributed crawler for Twitter. Data can be collected from The focus in studies like TwitterEcho anda Byun et a focused community of interest and it [16] adapts centralal. [19] is data collection, where the primary contributions ized distributed architecture in which multiple thin clients are deployed driven byto crawling effective retrieval and are create astrategies scalable for system. TwitterEcho debetter TwitterEcho open source disvises a coverage. user expansion strategydescribes by whichanthe user’s follower tributed crawler iteratively for Twitter. Data be collected lists are crawled using thecan REST API. Thefrom sysa focused community of interest andfor it controlling adapts a centraltem also includes modules responsible the exized distributed in whichfeature multiple thin clients pansion strategy.architecture The user selection identifies user are deployed to monitored create a scalable TwitterEcho accounts to be by the system. system with modules defor vises a user expansion strategy by which the user’sThe follower user profile analysis and language identification. user lists are crawled usingtothe REST Thecomsysselection feature iteratively is customized crawl the API. 
focused tem alsoofincludes modules responsible exmunity Portuguese tweets but can for be controlling adapted to the target pansion strategy. The user identifies user other communities. Byun et selection al. [19] infeature their work propose a accounts be monitored by the system with rule-basedtodata collection tool for Twitter with modules the focusfor of user profilesentiment analysis of and language identification. The user analysing Twitter messages. It is a java-based 4focused comselection feature is customized to crawl the open source tool developed using the Drools rule engine. munity of Portuguese tweetsofbut be adapted target They stress the importance an can automated data to collector other communities. Byun et al. data [19] in their work messages. propose a that also filters out unnecessary such as spam rule-based data collection tool for Twitter with the focus of analysing sentiment of Twitter It is a java-based 3.2 Pre-processing andmessages. Information Extracopen source tion tool developed using the Drools4 rule engine. They stress an automated data collector Apart from the dataimportance collection, of several frameworks implement that also filters out unnecessary data such as spam methods to perform extensive pre-processing andmessages. information extraction of the tweets. Pre-processing tasks of Trend3.2 Pre-processing and Information ExtracMiner [61] take into account the challenges posed by the tion noisy genre of tweets. Tokenization, stemming and POS Apart from data collection, several frameworks implement 3 http://gnip.com/, http://datasift.com/, http: methods to perform extensive pre-processing and informa//about.topsy.com/ tion extraction of the tweets. Pre-processing tasks of Trend4 http://drools.jboss.org/ Miner [61] take into account the challenges posed by the noisy genre of tweets. 
Tokenization, stemming and POS tagging are some of the text processing tasks that better prepare tweets for the analysis task. The platform provides separate built-in modules to extract information such as location, language, sentiment and named entities, which are deemed very useful in data analytics. The creation of a pipeline of these tools allows the data analyst to extend and reuse each component with relative ease.

TwitIE [17] is another open-source information extraction (IE) NLP pipeline customized for microblog text. For the purpose of information extraction, the general purpose IE pipeline ANNIE is used; it consists of components such as a sentence splitter, a POS tagger and gazetteer lists (for location prediction). Each step of the pipeline addresses drawbacks of traditional NLP systems by addressing the inherent challenges of microblog text. As a result, the individual components of ANNIE are customized. Language identification, tokenisation, normalization, POS tagging and named entity recognition are performed, with each module reporting its accuracy on tweets.

Baldwin [11] presents a system designed for event detection on Twitter, with functionality for pre-processing. JSON results returned by the Streaming API are parsed and piped through language filtering and lexical normalisation components. Messages that do not have location information are geo-located using probabilistic models, since identifying where an event occurs is a critical issue. Information extraction modules require knowledge from external sources and are generally more expensive tasks than language processing. Platforms that support real-time analysis [11, 76] require processing tasks to be conducted on-the-fly, where the speed of the underlying algorithms is a crucial consideration.
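The on-the-fly style of processing can be sketched with lazy generator stages, where each tweet flows through the whole chain as it arrives instead of being collected into a batch first. This is a toy illustration, not the architecture of any of the cited systems.

```python
def annotate(tweet):
    """Toy extraction stage; real systems plug in geolocation or
    sentiment models here (the expensive, knowledge-heavy steps)."""
    tweet["mentions"] = [w.lstrip("@") for w in tweet["text"].split()
                         if w.startswith("@")]
    return tweet

def on_the_fly(stream, stages):
    """Apply stages lazily, tweet by tweet, so nothing is batched and
    per-tweet latency depends only on the stages themselves."""
    for tweet in stream:
        for stage in stages:
            tweet = stage(tweet)
        yield tweet

processed = list(on_the_fly(iter([{"text": "quake felt @user1"},
                                  {"text": "all quiet"}]),
                            [annotate]))
```

Because the pipeline is a generator, memory use stays constant regardless of how long the stream runs.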
Twitter Zombie [15] is a platform to unify the data gathering and analysis methods 3.3 Generic Platforms by presenting a candidate architecture and methodological There are several proposals in which have tried approach for examining specific parts researchers of the Twittersphere. to develop generic platforms to provide a repeatable founIt outlines architecture for standard capture, transformadation foranalysis Twitterofdata analytics. Twitter Zombie [15] is a tion and Twitter interactions using the Twitter’s platform to unify the data gathering and analysis methods Search API. This tool is designed to gather data from Twitby presenting a candidate architecture and methodological ter by executing a series of independent search jobs on a approach examining specific parts of and the their Twittersphere. continual for basis and the collected tweets metadata It outlines architecture for standard capture, transformais kept in a relational DBMS. One of the interesting features tion and analysis ofisTwitter interactions using the Twitter’s of TwitterZombie its ability to capture hierarchical relaSearch API. Thisdata toolreturned is designed to gatherAdata from transTwittionships in the by Twitter. network ter executing a series of independentonsearch jobs on a latorbymodule performs post-processing the tweets and continual basis and the collected tweets separately and their metadata stores hashtags, mentions and retweets, from the is kepttext. in a relational DBMS. One of the interesting features tweet Raw tweets are transformed into a representaof TwitterZombie ability to capture hierarchical relation of interactionsistoitscreate networks of retweets, mentions tionships the data returned by Twitter. A network transand usersinmentioning hashtags. 
This feature captured by lator module performs post-processing theattention tweets and TwitterZombie, which other studies pay on little to, stores hashtags, mentions and retweets, from the is helpful in answering different types ofseparately research questions tweet text. Raw transformed intoina the representawith relative ease.tweets Socialare graphs are created form of tion of interactions tonetwork create networks mentions a retweet or mention and theyofdoretweets, not crawl for the and hashtags. Thisrelationships. feature captured by user users graphmentioning with traditional following It also TwitterZombie, which other studies pay little attention to, draws discussion on how multi-byte tweets in languages like is helpful answering different of researchtransliteraquestions Arabic or in Chinese can be storedtypes by performing with tion. relative ease. Social graphs are created in the form of a retweet or mention network they do not crawl of forsupthe More recently, TwitHoard [69]and suggests a framework user graph with traditional It also porting processors for data following analytics relationships. on Twitter with emdraws discussion on how multi-byte tweets in languages like Arabic or Chinese can be stored by performing transliteration. More recently, TwitHoard [69] suggests a framework of supporting processors for data analytics on Twitter with em- SIGKDD Explorations phasis on selection of a proper data set for the definition of a campaign. The platform consists of three layers; campaign crawling layer, integrated modeling layer and the data analysis layer. In the campaign crawling layer, a configuration module follows an iterative approach to ensure the camphasis on selection a proper set for(keywords). the definition of paign converges to of a proper setdata of filters Cola campaign. 
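The translation of raw tweets into interaction networks, in the spirit of TwitterZombie's network translator, can be sketched as below; the field names and the retweet heuristic are illustrative assumptions, not the tool's actual code.

```python
import re
from collections import Counter

def to_interactions(tweets):
    """Turn raw tweets into typed edges, keeping hashtags, mentions
    and retweets apart from the tweet text."""
    edges = []
    for t in tweets:
        author, text = t["author"], t["text"]
        for mention in re.findall(r"@(\w+)", text):
            # Crude heuristic: a leading "RT @name" marks a retweet.
            kind = "retweet" if text.startswith("RT @" + mention) else "mention"
            edges.append((author, mention, kind))
        for tag in re.findall(r"#(\w+)", text):
            edges.append((author, tag, "hashtag"))
    return edges

tweets = [
    {"author": "alice", "text": "RT @bob: interesting #bigdata result"},
    {"author": "carol", "text": "agreed @alice #bigdata"},
]
edges = to_interactions(tweets)
counts = Counter(kind for _, _, kind in edges)
```

The resulting edge list can be loaded directly into a graph library or a relational table, one row per interaction.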
More recently, TwitHoard [69] suggests a framework of supporting processors for data analytics on Twitter, with emphasis on the selection of a proper data set for the definition of a campaign. The platform consists of three layers: the campaign crawling layer, the integrated modeling layer and the data analysis layer. In the campaign crawling layer, a configuration module follows an iterative approach to ensure the campaign converges to a proper set of filters (keywords). Collected tweets, metadata and the community data (relationships among Twitter users) are stored in a graph database. This study should be highlighted for its distinctive support for a flexible querying mechanism on top of a data model built on the raw data. The model is generated in the integrated modeling layer and comprises a representation of associations between terms (e.g. hashtags) used in tweets and their evolution in time. Their approach is interesting as it captures the often overlooked temporal dimension. In the third layer, the data analysis layer, a query language is used to design a target view of the campaign data that corresponds to a set of tweets that contain, for example, the answer to an opinion mining question.

While including components for the capture and storage of tweets, additional tools have been developed to search through the collected tweets. The architecture of CoalMine [72] presents a social network data mining system, demonstrated on Twitter, designed to process large amounts of streaming social data. Its ad-hoc query tool provides an end user with the ability to access one or more data files through a Google-like search interface. Appropriate support is provided for a set of boolean and logical operators for ease of querying on top of a standard Apache Lucene index. The data collection and storage component is responsible for establishing connections to the REST API and storing the returned JSON objects in compressed formats.

In building support platforms it is necessary to make provisions for practical considerations such as processing big data. TrendMiner [61] facilitates real-time analysis of tweets and takes into consideration the scalability and efficiency of processing large volumes of data.
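The distinction between batch-mode and online processing that such platforms must accommodate can be illustrated with a minimal sketch; both modes apply the same hypothetical per-tweet stage, and neither represents any cited system's implementation.

```python
def enrich(tweet):
    """Stand-in for any per-tweet processing stage."""
    return {**tweet, "n_words": len(tweet["text"].split())}

def batch_mode(tweets):
    """Batch mode: materialise the whole collection, then process."""
    return [enrich(t) for t in tweets]

def online_mode(stream):
    """Online mode: process tweets one at a time as they arrive,
    keeping memory use independent of stream length."""
    for t in stream:
        yield enrich(t)

data = [{"text": "hello world"}, {"text": "streaming all the way"}]
batch_result = batch_mode(data)
online_result = list(online_mode(iter(data)))
```

The two modes produce identical outputs; they differ only in when work happens and how much data is held in memory at once.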
consideration scalability efficiency processlions of They envision the and system to be of developed ing large batch-mode volumes of data. TrendMiner makes an effort to for both and online processing. TwitIE [17] as unify someinofthe the existingsection text processing for Online discussed previous is anothertools open-source inSocial Networking (OSN) data, with emphasisfor onmicroblog adapting formation extraction NLP pipeline customized to real-life scenarios that include processing batches of miltext. lions of data. They envision the system to be developed for both batch-mode and online Platforms processing. TwitIE [17] as 3.4 Application-specific discussed in the previous section is another open-source inApart from the above mentioned general purpose platforms, formation extraction NLP pipeline customized for microblog there are many frameworks targeted at conducting specific text. types of analysis with Twitter data. Emergency Situation Awareness (ESA) [76] is a platform developed to detect, as3.4 Application-specific Platforms sess, summarise and report messages of interest published Apart from the mentioned general platforms, on Twitter for above crisis coordination tasks.purpose The objective of there are many targeted specific their work is to frameworks convert large streamsatofconducting social media data typesuseful of analysis with Twitterinformation data. Emergency Situation into situation awareness in real-time. The Awareness (ESA) [76] isofa modules platformto developed to detect,conasESA platform consists detect incidents, sess, summarise and report messages of interest published dense and summarise messages, classify messages of high on Twitter forand crisis coordination The objective of value, identify track issues and tasks. finally to conduct forentheir work isoftohistorical convert large streams of socialare media data sic analysis events. The modules enriched into situation awareness information in real-time. 
The by a useful suite of visualisation interfaces. Baldwin et al. [11] proESA another platformsupport consistsplatform of modules to detect incidents, conpose focused on detecting events dense and summarise messages, on Twitter. The Twitter stream isclassify queried messages with a setof of high keyvalue, specified identify and track issues to conduct forenwords by the user withand the finally objective of filtering the sic analysis of historical events. The modules are enriched by a suite of visualisation interfaces. Baldwin et al. [11] propose another support platform focused on detecting events on Twitter. The Twitter stream is queried with a set of keywords specified by the user with the objective of filtering the Volume 16, Issue 1 Page 13 stream on a topic of interest. The results are piped through text processing components and the geo-located tweets are visualised on a map for better interaction. Clearly, platforms of this nature that deal with incident exploration need to make provisions for real-time analysis of the incoming Twitstream on aand topic of interest. The visualizations results are piped through ter stream produce suitable of detected text processing components and the geo-located tweets are incidents. visualised on a map for better interaction. Clearly, platforms of natureModel that deal with incidentMechanisms exploration need to 3.5this Data and Storage make provisions for real-time analysis of the incoming TwitData models are not discussed in detail in most studies, ter stream and produce suitable visualizations of detected as a simple data model is sufficient to conduct basic form incidents. of analysis. When standard tweets are collected, flat files [11, 72] is the preferred choice. 
Several studies that cap3.5 Data Model and Storage Mechanisms ture the social relationships [15, 19] of the Twittersphere, Data models are not discussed in detail most studies, employs the relational data model but doinnot necessarily as a simple data model inis asufficient to conductAs basic form store the relationships graph database. a conseof analysis. standard tweets are collected, flat files quence, manyWhen analyses that can be performed conveniently [11, is theare preferred choice. bySeveral studies that Only capon a72]graph not captured these platforms. ture the social relationships [15, 19] of the Twittersphere, TwitHoard [69] in their paper models co-occurrence of terms employs thewith relational data model butproperties. do not necessarily as a graph temporally evolving Twitter store the[15] relationships in a graph database. As a conseZombie and TwitHoard [69] should be highlighted for quence, many analyses including that can be conveniently capturing interactions theperformed retweets and term ason a graphapart are from not captured by these platforms. social Only sociations the traditional follower/friend TwitHoard [69] in their paper models co-occurrence of terms relationships. TrendMiner [61] draws explicit discussion on as a graph with temporally evolving properties. Twitter making provisions for processing millions of data and takes Zombie [15] TwitHoard [69] should be highlighted for advantage of and Apache Hadoop MapReduce framework to percapturing interactions including the retweets and term asform distributed processing of the tweets stored as key-value sociations apart from the traditional pairs. CoalMine [72] also has Apachefollower/friend Hadoop at thesocial core relationships. TrendMiner [61] drawsresponsible explicit discussion on of their batch processing component for efficient making provisions processing millions of data and takes processing of large for amount of data. 
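TwitHoard's idea of a term graph with temporally evolving properties can be approximated in a few lines. The sketch below is our own illustration, not TwitHoard's actual API (class and method names are invented): hashtags are nodes, and each co-occurrence edge keeps a per-day count so that the evolution of an association over time remains queryable.

```python
from collections import defaultdict
from datetime import datetime

class TermGraph:
    """Hashtag co-occurrence graph with temporally evolving edge weights.
    Illustrative sketch of a TwitHoard-style model, not its real interface."""

    def __init__(self):
        # (term_a, term_b) -> {ISO day -> co-occurrence count}
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_tweet(self, hashtags, timestamp):
        day = timestamp.date().isoformat()
        tags = sorted(set(hashtags))          # deduplicate within one tweet
        for i in range(len(tags)):
            for j in range(i + 1, len(tags)):
                self.edges[(tags[i], tags[j])][day] += 1

    def weight(self, a, b, day=None):
        """Edge weight on one day, or total weight over all time."""
        series = self.edges[tuple(sorted((a, b)))]
        return series.get(day, 0) if day else sum(series.values())

g = TermGraph()
g.add_tweet(["#fire", "#sydney"], datetime(2014, 1, 5, 10, 0))
g.add_tweet(["#fire", "#sydney", "#help"], datetime(2014, 1, 5, 12, 0))
g.add_tweet(["#fire", "#help"], datetime(2014, 1, 6, 9, 0))
print(g.weight("#fire", "#sydney"))              # 2 (total co-occurrences)
print(g.weight("#fire", "#help", "2014-01-06"))  # 1 (count on one day)
```

Keeping counts per day rather than a single scalar is what makes the temporal dimension, which most surveyed platforms drop, cheap to retain.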
3.6 Support for Visualization Interfaces
There are many platforms designed with integrated tools, predominantly for visualization, to analyse data from spatial, temporal and topical perspectives. One tool is TweetTracker [44], which is designed to aid monitoring of tweets for humanitarian and disaster relief. TweetXplorer [54] also provides useful visualization tools to explore Twitter data. For a particular campaign, visualizations in TweetXplorer help analysts to view the data along different dimensions: the most interesting days in the campaign (when), important users and their tweets (who/what) and important locations in the dataset (where). Systems like TwitInfo [48], Twitcident [8] and Torettor [65] also provide a suite of visualisation capabilities to explore tweets in different dimensions relating to specific applications like fighting fire and detecting earthquakes. Web-mashups like Trendsmap [5] and Twitalyzer [6] provide a web interface and enterprise business solutions to gain real-time trends and insights about user groups.

Table 1 illustrates an overview of related approaches and features of different platforms. Pre-processing in Table 1 indicates whether any form of language processing tasks such as POS tagging or normalization are conducted. Information extraction refers to the types of post-processing performed to infer additional information, such as sentiment or named entities (NEs). Multiple ticks correspond to a task that is carried out extensively. In addition to collecting tweets, some studies also capture the user's social graph, while others propose the need to regard interactions of hashtags, retweets and mentions as separate properties.
Backend data models supported by the platform shape the types of analysis that can be conveniently done on each framework. From the summary in Table 1, we can observe that each study on data management frameworks concentrates on some challenges more than others. We aim to recognize the key ingredients of an integrated framework that takes into account the shortcomings of existing systems.

4. LANGUAGES FOR QUERYING TWEETS
The goal of proposing declarative languages and systems for querying tweets is to put forward a set of primitives or an interface for analysts to conveniently query specific interactions on Twitter, exploring the user, time, space and topical dimensions. High level languages for querying tweets extend capabilities of existing languages such as SQL and SPARQL. Queries are either executed on the Twitter stream in real-time or on a stored collection of tweets.

4.1 Generic Languages
TweeQL [49] provides a streaming SQL-like interface to the Twitter API and a set of user defined functions (UDFs) to manipulate data. The objective is to introduce a query language to extract the structure and useful information that is embedded in unstructured Twitter data. The language exploits both relational and streaming semantics. UDFs allow for operations such as location identification, string processing, sentiment prediction, named entity extraction and event detection. In the spirit of streaming semantics, it provides SQL constructs to perform aggregations over the incoming stream on a user-specified time window. The result of a given query can be stored in a relational fashion for subsequent querying.

Models for representing any social network in RDF have been proposed by Martin and Gutierrez [50], allowing queries in SPARQL. The work explores the feasibility of adoption of this model by demonstrating their idea with an illustrative prototype, but does not focus on a single social network like Twitter in particular. TwarQL [53] extracts content from tweets and encodes it in RDF format using shared and well known vocabularies (FOAF, MOAT, SIOC), enabling querying in SPARQL. The extraction facility processes plain tweets and expands their description by adding sentiment annotations, DBPedia entities, hashtag definitions and URLs. The annotation of tweets using different vocabularies enables querying and analysis along different dimensions such as location, users, sentiment and related named entities. The infrastructure of TwarQL enables subscription to a stream that matches a given query and returns streaming annotated data in real-time.

Temporal and topical features are of paramount importance in an evolving microblogging stream like Twitter. In the languages above, the time and topic of a tweet (a topic can be represented simply by a hashtag) are considered meta-data of the tweet and are not treated any differently from other reported metadata. Topics are regarded as part of the tweet content, or as what drives the data filtering task from the Twitter API. There have been efforts to exploit features that go well beyond a simple filter based on time and topic.
Plachouras and Stavrakas [59] stress the need for temporal modelling of terms in Twitter to effectively capture changing trends. A term refers to any word or short phrase of interest in a tweet, including hashtags or the output of an entity recognition process. Their proposed query operators can express complex queries for associations between terms over varying time granularities, to discover the context of collected data. Operators also allow retrieving a subset of tweets satisfying these complex conditions on term associations. This enables the end user to select a good set of terms (hashtags) that drive the data collection, which has a direct impact on the quality of the results generated from the analysis.

Table 1: Overview of related approaches in data management frameworks.

Platform            | Examples of extracted information | Social and/or other interactions captured? | Data store
TwitterEcho [16]    | Language                          | Yes | Not given
Byun et al. [19]    | Location                          | Yes | Relational
Twitter Zombie [15] | -                                 | Yes | Relational
TwitHoard [69]      | -                                 | Yes | Graph DB
CoalMine [72]       | -                                 | No  | Files
TrendMiner [61]     | Location, Sentiment, NEs          | No  | Key-value pairs
TwitIE [17]         | Language, Location, NEs           | No  | Not given
ESA [76]            | Location, NEs                     | No  | Not given
Baldwin et al. [11] | Language, Location                | No  | Flat files

Spatial features are another property of tweets often overlooked in complex analysis. Previously discussed work uses the location attribute as a mechanism to filter tweets in space. To complete our discussion we briefly outline two studies that use geo-spatial properties to perform complex analysis using the location attribute. Doytsher et al. [31] introduced a model and query language suited for integrated data connecting a social network of users with a spatial network, to identify places visited frequently. Edges named life-patterns are used to associate the social and spatial networks. Different time granularities can be expressed for each visited location represented by the life-pattern edge. Even though the implementation employs a partially synthetic dataset, it will be interesting to investigate how the socio-spatial networks, and the life-pattern edges that associate the spatial and social networks, can be represented in a real social network dataset with location information, such as Twitter. GeoScope [18] finds information trends by detecting significant correlations among trending location-topic pairs in a sliding window. This gives rise to the importance of capturing the notion of spatial information trends in social network analysis tasks. Real-time detection of crisis events from a location in space exhibits the possible value of GeoScope. In one of the experiments, Twitter is used as a case study to demonstrate its usefulness, where a hashtag is chosen to represent the topic and the city from which the tweet originates is chosen to capture the location.
tweet Indexing Relational, RDF the andtweet Graphs are the chosen most common choices city from which originates to capture the tent is suggested as a way to enhance the meaning. The goal mechanisms are discussed in [23] as they directly impact efof data representation. There is a close affiliation in these location. of suchretrieval systems of is to express a user’s need track in the5 ficient tweets. The TRECinformation micro-blogging data models observing that, for instance, a graph can easily form of a simple text query, much like searchreal-time engines, and is dedicated to calling participants to in conduct adcorrespond to a set of RDF triples or vice versa. In fact, 4.2 Data Model for theand Languages return a tweet in areal-time withcollection. effective strategies for hoc search taskslist over given tweet Publications some studies like Plachouras Stavrakas [59] have put ranking and relevance measurements [33, 34, 71]. Indexing Relational, RDF and Graphs are the most common choices of TREC [56], documents the findings of all systems in the forward their data model as a labeled multi digraph and have mechanisms are discussed [23] as tweets they directly impact efof data arepresentation. Therefor is its a close affiliation in None these task of ranking the most in relevant matching a prechosen relational database implementation. 5 ficient of tweets. data models observing for Twitter instance,social a graph can easily definedretrieval set of user queries.The TREC micro-blogging track of these query systems that, models network with is dedicated to calling participants to conduct real-time adcorrespond a set relationships of RDF triples or vice versa. In fact, following or to retweet among users. Doytsher et Table 2 illustrates an overview of related approaches in syshoc tasks over a given tweet collection. Publications some like Plachouras and query Stavrakas [59] have al. 
[31]studies implement their algebraic operators with put the temssearch for querying tweets. Data models and dimensions inof TREC [56], documents the findings of all systems in the forward their dataand model as a labeled multi digraph and have use of both graph a relational database as the underlying 5 task of ranking the most relevant tweets matching a prechosen a relational for its compare implementation. http://trec.nist.gov/ data storage. They database experimentally relationalNone and defined set of user queries. of these query systems models Twitter social network with following or retweet relationships among users. Doytsher et Table 2 illustrates an overview of related approaches in sysal. [31] implement their algebraic query operators with the tems for querying tweets. Data models and dimensions inuse of both graph and a relational database as the underlying 5 http://trec.nist.gov/ data storage. They experimentally compare relational and SIGKDD Explorations Volume 16, Issue 1 Page 15 Table 2: Overview of approaches in systems for querying tweets. Data Model Explored dimensions Relational RDF Graph Text Time Space Social Network Real-Time TweeQL [49] Yes TwarQL [53] Yes Table 2: Overview of approaches in systems for querying tweets. Plachouras et al. [60] No Explored dimensions Doytsher et al. [31]∗ Data Model No Relational RDF Graph Text Time Space GeoScope et al. [18]∗ Social Network Real-Time Yes TweeQL [49] Yes ∗ Languages on social networks No TwarQL [53] Yes Tweet search systems Yes Plachouras et al. [60] No Doytsher et al. [31]∗ No ∗ GeoScope et al. [18] Yes vestigated in each system are depicted in the Table 2. Sysmation extraction considering the inherent peculiarities of ∗ on social networks tweets and not all No tems Languages that have made provision for the real-time streaming frameworks we discussed provided this Tweet Pre- processing components Yes nature of thesearch tweetssystems are indicated in the Real-time column. functionality. 
for example, norMultiple ticks () correspond to a dimension explored in demalization and tokenization should be implemented, with tail. Note that the systems marked with an asterisk (*) are the option for the end users to customize the modules to suit vestigated in eachspecifically system aretargeting depicted tweets, in the Table 2. their Sysmation extraction considering inherentmodules peculiarities of not implemented though their requirements. Informationthe extraction such as tems that have made provision forbethe real-time tweets and not all frameworks we discussedand provided application is meaningful and can extended to streaming the Twitlocation prediction, sentiment classification named this ennature of the are indicated the Real-time column. functionality. Precomponentsanalysis for example, nortersphere. Wetweets observe potential forin developing languages for tity recognition areprocessing useful in conducting on tweets Multiple ticks () correspond to a dimension explored in demalization and tokenization should be implemented, with querying tweets that include querying by dimensions that and attempt to derive more information from plain tweet tail. Note that thebysystems marked with an asterisk are the option for the end usersIdeally, to customize modules to suit are not captured existing systems, especially the(*) social text and their metadata. a userthe should be able to not implemented specifically targeting tweets, though their their requirements. Information modules such as graph. integrate any combination of the extraction components into their own application is meaningful and can be extended to the Twitlocation prediction, sentiment classification and named enapplications. tersphere. 
5. THE NEED FOR INTEGRATED SOLUTIONS

There is a need to assimilate individual efforts with the goal of providing a unified framework that can be used by researchers and practitioners across many disciplines. Integrated solutions should ideally handle the entire workflow of the data analysis life cycle, from collecting the tweets to presenting the results to the user. The literature reviewed in previous sections outlines efforts that support different parts of this workflow. In this section, we present our position with the aim of outlining the significant components of an integrated solution that addresses the limitations of existing systems.

According to a review of literature conducted on the microblogging platform [25], the majority of published work on Twitter concentrates on the user domain and the message domain. The user domain explores properties of Twitter users in the microblogging environment, while the message domain deals with properties exhibited by the tweets themselves. In comparison to the extent of work done on the microblogging platform itself, only a few studies investigate the development of data management frameworks and query languages that describe and facilitate the processing of online social networking data. Consequently, there is opportunity for future research addressing the challenges in data management. We elicit the following high-level components and envisage a platform for Twitter that encompasses such capabilities:
Focused crawler: Responsible for the retrieval and collection of Twitter data by crawling the publicly accessible Twitter APIs. A focused crawler should allow the user to define a campaign with suitable filters, monitor its output, and iteratively crawl Twitter for large volumes of data until its coverage of relevant tweets is satisfactory.

Pre-processor: As highlighted in Section 3.2, this stage usually comprises modules for pre-processing and information extraction that consider the inherent peculiarities of tweets. For example, normalization and tokenization should be implemented, with the option for end users to customize the modules to suit their requirements. Information extraction modules such as location prediction, sentiment classification and named entity recognition are useful in conducting analysis on tweets and attempt to derive more information from plain tweet text and its metadata. Ideally, a user should be able to integrate any combination of the components into their own applications.

Data model: Much of the literature presented (Sections 3.5 and 4.2) does not emphasize or draw explicit discussions on the data model in use. The logical data model greatly influences the types of analysis that can be done with relative ease on collected data. A physical representation of the model, involving suitable indexing and storage mechanisms for large volumes of data, is an important consideration for efficient retrieval. We notice that current research pays little attention to queries on Twitter interactions, the social graph in particular. A graph view of the Twittersphere is consistently overlooked, and we recognize great potential in this area. Graph construction on Twitter is not limited to considering users as nodes and following relationships as links; embracing useful characteristics such as retweet, mention and hashtag (co-occurrence) networks in the data model will create the opportunity to conduct complex analysis on these structural properties of tweets. We envision a new data model that proactively captures structural relationships while taking into consideration efficient retrieval of the data relevant to the queries.

Query language: The languages described in Section 4 define both simple operators to be applied on tweets and advanced operators that extract complex patterns, which can be manipulated in different types of applications. Some languages provide support for continuous queries on the stream or queries on a stored collection, while others offer the flexibility of both. The advent of a graph view makes crucial contributions to analyzing the Twittersphere, allowing us to query Twitter data in novel and varying forms. It will be interesting to investigate how typical functionality [73] provided by graph query languages can be adapted to Twitter networks. In developing query languages, one could also investigate the distinction between real-time and batch mode processing. Visualizing the data retrieved as the result of a query in a suitable manner is another important concern. A set of pre-defined output formats would be useful, for example to provide an informative visualization over a map for a query that returns an array of locations. Another interesting avenue to explore is the introduction of a ranking mechanism on the query result. Ranking criteria may involve relevance, timeliness or network attributes such as the reputation of users in the case of a social graph. Ranking functions are a standard requirement in the field of information retrieval [23, 39], and studies like SociQL [30] report the use of visibility and reputation metrics to rank results generated from a social graph. A query language with a graph view of the Twittersphere, along with capabilities for visualization and ranking, will certainly benefit upcoming Twitter data analysis efforts.

SIGKDD Explorations Volume 16, Issue 1 Page 16
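To make the ranking idea concrete, the sketch below scores query results by a weighted combination of keyword relevance, timeliness and author reputation, in the spirit of the visibility and reputation metrics reported for SociQL [30]. The tweet fields (`text`, `created_at`, `follower_count`) and the weights are illustrative assumptions, not the schema or scoring of any surveyed system.

```python
import math
import time

def score(tweet, query_terms, now=None, w_rel=0.5, w_time=0.3, w_rep=0.2):
    """Score a tweet by relevance, timeliness and author reputation.
    Field names and weights are hypothetical, for illustration only."""
    now = now or time.time()
    words = tweet["text"].lower().split()
    # Relevance: fraction of query terms present in the tweet text.
    relevance = sum(t in words for t in query_terms) / len(query_terms)
    # Timeliness: exponential decay with a one-hour half-life.
    age_hours = (now - tweet["created_at"]) / 3600.0
    timeliness = 0.5 ** age_hours
    # Reputation: damped follower count as a crude visibility proxy.
    reputation = math.log1p(tweet["follower_count"]) / 20.0
    return w_rel * relevance + w_time * timeliness + w_rep * reputation

def rank(tweets, query_terms):
    """Return query results ordered by descending score."""
    return sorted(tweets, key=lambda t: score(t, query_terms), reverse=True)
```

A real system would replace the bag-of-words relevance term with a proper information retrieval model and calibrate the weights per application.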
Here, we focused on the key ingredients required for a fully developed solution and discussed improvements that can be made on the existing literature. In the next section, we identify the challenges and research issues involved.

6. CHALLENGES AND RESEARCH ISSUES

To complete our discussion, in this section we summarize key research issues in data management and present technical challenges that need to be addressed in the context of building a data analytics platform for Twitter.

6.1 Data Collection Challenges

Once a suitable Twitter API has been identified, we can define a campaign with a set of parameters. The focused crawler can be programmed to retrieve all tweets matching the query of the campaign. If a social graph is necessary, separate modules would be responsible for creating this network iteratively. Exhaustively crawling all the relationships between Twitter users is prohibitive given the restrictions set by the Twitter API. Hence, the focused crawler is required to prioritize the relationships to crawl based on the impact and importance of specific Twitter accounts. In case the platform handles multiple campaigns in parallel, there is a need to optimize access to the API. Typically, the implementation of a crawler should aim to minimize the number of API requests, considering the restrictions, while fetching data for many campaigns in parallel. Hence, building an effective crawling strategy that optimizes the use of the available API requests is a challenging task.

Appropriate coverage of the campaign is another significant concern and denotes whether all the relevant information has been collected. When specifying the parameters that define the campaign, a user needs very good knowledge of the relevant keywords. Depending on the specified keywords, a collection may miss relevant tweets, in addition to the tweets removed due to API restrictions. The work of Plachouras and Stavrakas [60] is an initial step in this direction, as it investigates this notion of coverage and proposes mechanisms to automatically adapt the campaign to evolving hashtags.
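As an illustration of budgeting API requests across parallel campaigns, the sketch below splits a fixed per-window request budget proportionally to campaign weights. The budget value and the weighting scheme are hypothetical; actual Twitter API limits vary by endpoint and over time.

```python
from collections import deque

class CrawlScheduler:
    """Allocate a per-window API request budget across campaigns.
    The default budget of 180 requests per window is an illustrative
    assumption, not an actual Twitter API limit."""
    def __init__(self, budget_per_window=180):
        self.budget = budget_per_window
        self.campaigns = deque()  # (name, weight) pairs

    def add_campaign(self, name, weight=1):
        self.campaigns.append((name, weight))

    def plan_window(self):
        """Split the window's request budget proportionally to weight."""
        total = sum(w for _, w in self.campaigns)
        plan = {name: (self.budget * w) // total for name, w in self.campaigns}
        # Hand out any integer-division remainder round-robin.
        left = self.budget - sum(plan.values())
        for name, _ in list(self.campaigns)[:left]:
            plan[name] += 1
        return plan
```

A production crawler would additionally reprioritize within each campaign, e.g. by the impact of specific accounts, before spending its share of requests.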
6.2 Pre-processing Challenges

Many problems associated with summarization, topic detection and part-of-speech (POS) tagging have been extensively studied in the literature for well-formed documents, e.g. news articles. Traditional named entity recognizers (NERs) heavily depend on local linguistic features [62] of well-formed documents, such as capitalization and the POS tags of previous words. None of these characteristics hold for tweets: short utterances limited to 140 characters which make use of informal language, undoubtedly making even the simple task of POS tagging more challenging. Besides the length limit, heavy and inconsistent usage of abbreviations, capitalization and uncommon grammar constructions pose additional challenges to text processing. With the volume of tweets being orders of magnitude larger than that of news articles, most conventional methods cannot be applied directly to the noisy genre of Twitter data. Any effort that uses Twitter data needs to employ appropriate Twitter-specific pre-processing strategies that address the challenges associated with the intrinsic properties of tweets.

Similarly, information extraction from tweets is not straightforward, as it is difficult to derive context and topics from a tweet that is a scattered part of a conversation. There is separate literature on identifying entities (references to organizations, places, products, persons) [47, 63], languages [20, 36] and sentiment [57] present in the tweet text for a richer source of information. Location is another vital property, representing spatial features either of the tweet or of the user. The location of each tweet may optionally be recorded if a GPS-enabled device is used. A user can also specify his or her location as part of the user profile, which is often reported in varying granularities. The drawback is that only a small portion, about 1%, of tweets are geo-located [24]. Since analysis almost always requires the location property, when it is absent, studies devise their own mechanisms to infer the location of the user, the tweet, or both. There are two major approaches to location prediction: content analysis with probabilistic language models [24, 27, 37] or inference from social and other relations [22, 29, 67].
6.3 Data Management Challenges

In Sections 3.5 and 4.2 we outlined several alternative approaches in the literature for a data model to characterise the Twittersphere. The relational and RDF models are frequently chosen, while graph-based models are acknowledged but not realized concretely at the implementation phase. The ways in which we can apply graph data management to Twitter are extremely diverse and interesting; different types of networks can be constructed apart from the traditional social graph, as outlined in Section 5. The advent of a graph view to model the Twittersphere gives rise to a range of queries that can be performed on the structure of the tweets, essentially capturing a wider range of use case scenarios in typical data analytics tasks.

As discussed in Section 4.3, there are already SQL-like languages adapted to social networks. Many of the techniques in the literature target generic social networks under a number of specific assumptions, for example that the social networks satisfy properties such as a power-law degree distribution, sparsity and small diameters [52]. We envision queries that take a step further and execute on Twitter graphs. Simple query languages such as FQL [1] and YQL [7] provide features to explore properties of the Facebook and Yahoo APIs, but are limited to querying only part (usually a single user's connections) of the large social graph. As pointed out in the report on the Databases and Web 2.0 panel at VLDB 2007 [9], understanding and analyzing trust, authority, authenticity and other quality measures in social networks poses major research challenges. While there are investigations of these quality measures in Twitter, it is about time that we enable declarative querying of networks using such measurements.
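To illustrate querying beyond the follower graph, the sketch below derives a mention graph and a hashtag co-occurrence graph directly from tweet text, then runs a simple structural query over them. The tweet dict schema (`user`, `text`) is an assumption made for illustration, not a surveyed system's data model.

```python
from collections import defaultdict
from itertools import combinations

def build_networks(tweets):
    """Derive a mention graph and a hashtag co-occurrence graph
    from raw tweets (toy tokenization by whitespace)."""
    mentions = defaultdict(set)   # user -> set of users they mention
    cooccur = defaultdict(int)    # (tag, tag) -> co-occurrence count
    for t in tweets:
        tokens = t["text"].split()
        ats = [w[1:].lower() for w in tokens if w.startswith("@")]
        tags = sorted({w[1:].lower() for w in tokens if w.startswith("#")})
        for m in ats:
            mentions[t["user"]].add(m)
        for a, b in combinations(tags, 2):
            cooccur[(a, b)] += 1
    return mentions, cooccur

def popular_mentions(mentions, k=2):
    """Example structural query: users mentioned by at least k others."""
    indeg = defaultdict(int)
    for src, dsts in mentions.items():
        for d in dsts:
            indeg[d] += 1
    return {u for u, n in indeg.items() if n >= k}
```

A declarative query language over such networks would express `popular_mentions` as a short pattern rather than imperative code.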
One of the predominant challenges is the management of the large graphs that inevitably result from modeling users, tweets and their properties as graphs. With the large volume of data involved in any practical task, a data model should be information rich, yet a concise representation that enables the expression of useful queries. Queries on graphs should be optimized for large networks and should ideally run independently of the size of the graph. There are already approaches that investigate efficient algorithms on very large graphs [41-43, 66]. Efficient encoding and indexing mechanisms should be in place, taking into account variations of the indexing systems already proposed for tweets [23] and for graphs [75] in general. We need to consider maintaining indexes for tweets, keywords, users and hashtags for efficient access to data in advance of queries. In certain situations it may be impractical to store the entire raw tweet and the complete user graph, and it may be desirable to either compress or drop portions of the data. It is important to investigate which properties of the tweets and the graph should be compressed.
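As a concrete toy version of the indexing point, the sketch below builds an in-memory inverted index over tweet keywords and hashtags supporting conjunctive queries. Real-time designs such as TI [23] additionally perform incremental maintenance and pruning of postings, which this sketch omits.

```python
from collections import defaultdict

class TweetIndex:
    """Toy inverted index mapping terms and hashtags to tweet ids."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> {tweet_id}

    def add(self, tweet_id, text):
        for token in text.lower().split():
            self.postings[token].add(tweet_id)
            if token.startswith("#"):
                # Also index the tag without '#', so plain-keyword
                # queries match hashtags too.
                self.postings[token[1:]].add(tweet_id)

    def query(self, *terms):
        """Conjunctive query: ids of tweets containing every term."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()
```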
Besides the above challenges, tweets impose general research issues related to big data, which should be addressed in the same spirit as in any other big data analytics task. In the face of the challenges posed by the large volumes of data being collected, the NoSQL paradigm should be considered as an obvious choice for dealing with them. Developed solutions should be extensible to upcoming requirements and should scale well. When collecting data, a user needs to consider scalable crawling, as large volumes of tweets are received, processed and indexed per second. With respect to implementation, it is necessary to investigate paradigms that scale well, such as MapReduce, which is optimized for offline analytics on large data partitioned over hundreds of machines. Depending on the complexity of the queries supported, it might be difficult to express graph algorithms intuitively in MapReduce, and consequently graph databases such as Titan [4], DEX [3] and Neo4j [2] should be compared for graph implementations.

7. CONCLUSION

In this paper we addressed issues around the big data nature of Twitter analytics and the need for new data management and query language frameworks.
By conducting a careful and extensive review of the existing literature we observed 7. ways CONCLUSION in which researchers have tried to develop general platIn this to paper we addressed issues around thefor bigTwitter data nature forms provide a repeatable foundation data of Twitter We analytics andthe thetweet need for new data management analytics. reviewed analytics space by explorand mechanisms query language frameworks. conducting careful ing primarily for data By collection, data amanageand reviewfor of querying the existing observed mentextensive and languages andliterature analyzingwe tweets. We ways in which researchers have tried to develop plathave identified the essential ingredients requiredgeneral for a unified forms to provide a repeatable foundation for Twitter data framework that address the limitations of existing systems. analytics. reviewed the tweet analytics space bysome explorThe paperWe outlines research issues and identifies of ing mechanisms primarily with for data data the challenges associated the collection, development of managesuch inment andplatforms. languages for querying and analyzing tweets. We tegrated have identified the essential ingredients required for a unified framework that address the limitations of existing systems. 8. REFERENCES The paper outlines research issues and identifies some of the challenges associated with the development of such in[1] FaceBook Query Language(FQL) overview. tegrated platforms. https://developers.facebook.com/docs/ technical-guides/fql. 8. REFERENCES [2] Neo4j: The world’s leading graph database. http:// www.neo4j.org/. [1] FaceBook Query Language(FQL) overview. https://developers.facebook.com/docs/ technical-guides/fql. [2] Neo4j: The world’s leading graph database. http:// www.neo4j.org/. SIGKDD Explorations [3] Sparksee: Scalable high-performance graph database. http://www.sparsity-technologies.com/. [4] Titan: distributed graph database. http: //thinkaurelius.github.io/titan. 
[5] TrendsMap: Realtime local twitter trends. http://trendsmap.com/.

[6] Twitalyzer: Serious analytics for social business. http://twitalyzer.com.

[7] Yahoo! Query Language guide on YDN. https://developer.yahoo.com/yql/.

[8] F. Abel, C. Hauff, and G. Houben. Twitcident: fighting fire with information from social web streams. In WWW, pages 305-308, 2012.

[9] S. Amer-Yahia, V. Markl, A. Halevy, A. Doan, G. Alonso, D. Kossmann, and G. Weikum. Databases and Web 2.0 panel at VLDB 2007. In SIGMOD Record, volume 37, pages 49-52, Mar. 2008.

[10] S. Amer-Yahia, L. V. Lakshmanan, and C. Yu. SocialScope: Enabling information discovery on social content sites. In CIDR, 2009.

[11] T. Baldwin, P. Cook, and B. Han. A support platform for event detection using social intelligence. In Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 69-72, 2012.

[12] L. Barbosa and J. Feng. Robust sentiment detection on Twitter from biased and noisy data. pages 36-44, Aug. 2010.

[13] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In 23rd annual ACM symposium on User interface software and technology - UIST, pages 303-312, Oct. 2010.

[14] A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. In Discovery Science, pages 1-15. Springer Berlin Heidelberg, Oct. 2010.

[15] A. Black, C. Mascaro, M. Gallagher, and S. P. Goggins. Twitter Zombie: Architecture for capturing, socially transforming and analyzing the Twittersphere. In International conference on Supporting group work, pages 229-238, 2012.

[16] M. Boanjak and E. Oliveira. TwitterEcho - A distributed focused crawler to support open research with twitter data. In International conference companion on World Wide Web, pages 1233-1239, 2012.

[17] K. Bontcheva and L. Derczynski. TwitIE: an open-source information extraction pipeline for microblog text. In International Conference on Recent Advances in Natural Language Processing, 2013.
[18] C. Budak, T. Georgiou, and D. E. Abbadi. GeoScope: Online detection of geo-correlated information trends in social networks. PVLDB, 7(4):229-240, 2013.

[19] C. Byun, H. Lee, Y. Kim, and K. K. Kim. Twitter data collecting tool with rule-based filtering and analysis module. International Journal of Web Information Systems, 9(3):184-203, 2013.

[20] S. Carter, W. Weerkamp, and M. Tsagkias. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195-215, June 2012.

[21] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence in twitter: The million follower fallacy. In ICWSM, pages 10-17, 2010.

[22] S. Chandra, L. Khan, and F. B. Muhaya. Estimating twitter user location using social interactions - a content based approach. In IEEE Conference on Privacy, Security, Risk and Trust, pages 838-843, Oct. 2011.

[23] C. Chen, F. Li, C. Ooi, and S. Wu. TI: An efficient indexing mechanism for real-time search. In SIGMOD, pages 649-660, 2011.

[24] Z. Cheng, J. Caverlee, and K. Lee. A content-driven framework for geo-locating microblog users. ACM Transactions on Intelligent Systems and Technology, 2012.

[25] M. Cheong and S. Ray. A literature review of recent microblogging developments. Technical report, Clayton School of Information Technology, Monash University, 2011.

[26] C. Chew and G. Eysenbach. Pandemics in the age of twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE, 5(11), 2010.

[27] B. O'Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In Conference on Empirical Methods in Natural Language Processing, pages 1277-1287, 2010.

[28] M. Conover, J. Ratkiewicz, M. Francisco, B. Gonçalves, F. Menczer, and A. Flammini. Political polarization on Twitter. In ICWSM, 2011.

[29] J. David. Thats what friends are for: inferring location in online social media platforms based on social relationships. In ICWSM, 2013.

[30] D. Serrano, E. Stroulia, D. Barbosa, and V. Guana. SociQL: A query language for the social Web. In E. Kranakis, editor, Advances in Network Analysis and its Applications, chapter 17, pages 381-406. 2013.

[31] Y. Doytsher and B. Galon. Querying geo-social data by bridging spatial networks and social networks. In 2nd

[34] S. Frénot and S. Grumbach. An in-browser microblog ranking engine. In International conference on Advances in Conceptual Modeling, volume 7518, pages 78-88, 2012.

[35] G. Golovchinsky and M. Efron. Making sense of Twitter search. In CHI, 2010.

[36] M. Graham, S. A. Hale, and D. Gaffney. Where in the world are you? Geolocation and language identification in Twitter. In ICWSM, pages 518-521, 2012.

[37] B. Hecht, L. Hong, B. Suh, and E. Chi. Tweets from Justin Bieber's heart: the dynamics of the location field in user profiles. In Conference on Human Factors in Computing Systems, pages 237-246, 2011.
Kranakis, Advances in Network ACM SIGSPATIAL International Workshop on LocaAnalysis and its Applications, chapter 17, pages 381– tion 406. Based 2013. Social Networks, pages 39–46, 2010. [32] A. Dries, S. Nijssen, and L. De Raedt.geo-social A query language [31] Y. Doytsher and B. Galon. Querying data by for analyzing networks. In CIKM, pages 485–494,In2009. bridging spatial networks and social networks. 2nd ACM SIGSPATIAL International Workshop on Loca[33] tion M. Efron. retrieval inpages a microblogging BasedHashtag Social Networks, 39–46, 2010.environment. pages 787–788, 2010. [32] A. Dries, S. Nijssen, and L. De Raedt. A query language for analyzing networks. In CIKM, pages 485–494, 2009. [33] M. Efron. Hashtag retrieval in a microblogging environment. pages 787–788, 2010. SIGKDD Explorations [38] Jansen, M. Zhang, K. Sobel, Chowdury. [37] B. J. Hecht, L. Hong, B. Suh, and E. and Chi.A.Tweets from Twitter power: heart: Tweets electronic word of mouth. Justin Bieber’s theas dynamics of the location field Journal of the American SocietyonforHuman Information in user profiles. In Conference FactorsSciin ence and Technology, 60(11):2169–2188, Nov. 2009. Computing Systems, pages 237–246, 2011. [39] L. Hidayah, T.K. Elsayed, H. Chowdury. Ramadan. [38] J. B. Jiang, J. Jansen, M. Zhang, Sobel, and A. BEST KAUST at TREC-2011 : Building Twitterofpower: Tweets as electronic word of effective mouth. search TREC, Society 2011. for Information SciJournalinofTwitter. the American ence and Technology, 60(11):2169–2188, Nov. 2009. [40] P. Jürgens, A. Jungherr, and H. Schoen. Small worlds a difference: new gatekeepers of [39] with J. Jiang, L. Hidayah, T. Elsayed, and and the H. filtering Ramadan. political Twitter. In: International Web BEST ofinformation KAUST at on TREC-2011 Building effective Science pages 1–5, June 2011. search inConference-WebSci, Twitter. TREC, 2011. [41] Kang, D.A. H. Jungherr, Chau, andand C. Faloutsos. and [40] U. P. Jürgens, H. 
Schoen.Managing Small worlds mining large graphsnew : Systems and implementations. with a difference: gatekeepers and the filtering In of SIGMOD, volume 1, on pages 589–592, 2012. political information Twitter. In International Web Science Conference-WebSci, pages 1–5, June 2011. [42] U. Kang and C. Faloutsos. Big graph mining : andChau, discoveries. SIGKDD Managing Explorations, [41] Algorithms U. Kang, D. H. and C. Faloutsos. and 14(2):29–36, 2013. : Systems and implementations. In mining large graphs SIGMOD, volume 1, pages 589–592, 2012. [43] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos. An and efficient platform for large graphs.: [42] Gbase: U. Kang C. analysis Faloutsos. Big graph mining VLDB Journal, June 2012.Explorations, Algorithms and21(5):637–650, discoveries. SIGKDD 14(2):29–36, 2013. [44] S. Kumar, G. Barbier, M. Abbasi, and H. Liu. Tweethumanitarian and disaster [43] Tracker: U. Kang, An H. analysis Tong, J. tool Sun,for C.-Y. Lin, and C. Faloutsos. relief. ICWSM, 661–662, 2011. Gbase:InAn efficientpages analysis platform for large graphs. VLDB Journal, 21(5):637–650, June 2012. [45] H. Kwak, C. Lee, H. Park, and S. Moon. What is twita socialG. network or aM. news media? pages [44] ter, S. Kumar, Barbier, Abbasi, andInH.WWW, Liu. Tweet591–600, 2010. Tracker: An analysis tool for humanitarian and disaster relief. In ICWSM, pages 661–662, 2011. [46] C.-H. Lee, H.-C. Yang, T.-F. Chien, and W.-S. Wen. approach for Park, event and detection by mining spatio[45] A H.novel Kwak, C. Lee, H. S. Moon. What is twittemporal information microblogs. International ter, a social network or on a news media? In WWW, pages Conference on Advances in Social Networks Analysis 591–600, 2010. and Mining, pages 254–259, July 2011. [46] C.-H. Lee, H.-C. Yang, T.-F. Chien, and W.-S. Wen. [47] A C. novel Li, J. approach Weng, Q.for He,event Y. Yao, and A.by Datta. TwiNER: detection mining spationamed entity recognition targeted twitter stream. 
In temporal information on in microblogs. In International SIGIR, pages 2012. Conference on721–730, Advances in Social Networks Analysis and Mining, pages 254–259, July 2011. [48] A. Marcus, M. Bernstein, and O. Badar. Tweets as data: TweeQL and In SIG[47] C. Li, demonstration J. Weng, Q. He,ofY. Yao, and A.Twitinfo. Datta. TwiNER: MOD, pages named entity 1259–1261, recognition2011. in targeted twitter stream. In SIGIR, pages 721–730, 2012. [49] A. Marcus, M. Bernstein, and O. Badar. Processing and in tweets. SIGMOD Record, 40(4), [48] visualizing A. Marcus,the M.data Bernstein, and O. Badar. Tweets as 2012. data: demonstration of TweeQL and Twitinfo. In SIGMOD, pages 1259–1261, 2011. [49] A. Marcus, M. Bernstein, and O. Badar. Processing and visualizing the data in tweets. SIGMOD Record, 40(4), 2012. Volume 16, Issue 1 Page 19 [50] M. S. Martı́n and C. Gutierrez. Representing, querying and transforming social networks with RDF/SPARQL. In European Semantic Web Conference, pages 293–307, 2009. [63] A. Ritter, S. Clark, and O. Etzioni. Named entity recognition in tweets : an experimental study. In Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, 2011. [51] P. T. Mauro Martı́n, Claudio Gutierrez. SNQL [50] M. S. W. Martı́n andSan C. Gutierrez. Representing, querying :and A social networksocial querynetworks and transformation language. transforming with RDF/SPARQL. In European 5th Alberto Mendelzon Workshop on Semantic Web International Conference, pages 293–307, Foundations of Data Management, 2011. 2009. [63] A. S. Clark, and O. SoQL: Etzioni.ANamed entity recog[64] R. Ritter, Ronen and O. Shmueli. language for querynition in tweets : an experimental In Conference ing and creating data in social study. networks. In ICDE, on Empirical Methods Natural Language Processing, pages 1595–1602, Mar.in 2009. 1524–1534, 2011.shakes twitter users : Real-time [65] pages T. Sakaki. Earthquake event detection by social sensors. 
In WWW, pages 851– [64] R. Ronen 860, 2010.and O. Shmueli. SoQL: A language for querying and creating data in social networks. In ICDE, pages 1595–1602, 2009.GPS : A graph processing [66] S. Salihoglu and J.Mar. Widom. [65] T. Sakaki. shakes twitter on users : Real-time system. In Earthquake International Conference Scientific and event detection by social sensors. Inpages WWW, pages 851– Statistical Database Management, 1–31, 2013. 860, 2010. [67] A. Schulz, A. Hadjakos, and H. Paulheim. A multi[66] S. Salihoglu and J. Widom. GPS : A graph processing indicator approach for geolocalization of tweets. In system. Inpages International Conference on Scientific and ICWSM, 573–582, 2013. Statistical Database Management, pages 1–31, 2013. [68] A. Signorini, A. M. Segre, and P. M. Polgreen. The [67] A. A. Hadjakos, and H. Paulheim. A multiuse Schulz, of Twitter to track levels of disease activity and indicator approach of tweets. In public concern in the for U.S.geolocalization during the influenza A H1N1 ICWSM, pages pandemic. PloS 573–582, one, 6(5),2013. Jan. 2011. [52] M.T. Mcglohon and Faloutsos. Statistical properties [51] P. W. Mauro SanC.Martı́n, Claudio Gutierrez. SNQL of In C.and C. transformation Aggarwal, editor, Social : Asocial social networks. network query language. Network Data Analytics, 2, pagesWorkshop 17–42. 2011. In 5th Alberto Mendelzonchapter International on Foundations of Data Management, 2011. [53] P. Mendes, A. Passant, and P. Kapanipathi. Twarql: into the wisdom of the crowd. In Proceedings of [52] tapping M. Mcglohon and C. Faloutsos. Statistical properties the 6th International on Semantic Systems, of social networks. InConference C. C. Aggarwal, editor, Social pages 3–5, 2010. Network Data Analytics, chapter 2, pages 17–42. 2011. [54] Morstatter, S. Kumar, and R. Maciejew[53] F. P. Mendes, A. Passant, andH.P.Liu, Kapanipathi. Twarql: ski. Understanding Twitter datacrowd. with TweetXplorer. 
tapping into the wisdom of the In Proceedings In of SIGKDD, pages 1482–1485, 2013.on Semantic Systems, the 6th International Conference pages 3–5, 2010. [55] P. Noordhuis, M. Heijkoop, and A. Lazovik. Mining in the cloud: A caseH.study. IEEE 3rd Inter[54] Twitter F. Morstatter, S. Kumar, Liu, In and R. Maciejewnational ConferenceTwitter on Cloud pages 107– ski. Understanding dataComputing, with TweetXplorer. In 114, July 2010. SIGKDD, pages 1482–1485, 2013. [56] C. Macdonald, Lin, I. Soboroff. [55] I. P. Ounis, Noordhuis, M. Heijkoop,J.and A. and Lazovik. Mining Overview the TREC-2011 Microblog Track. 20th Twitter inofthe cloud: A case study. In IEEE 3rdInInterText REtrieval Conference (TREC), 2011. pages 107– national Conference on Cloud Computing, 114, July 2010. [57] A. Pak and P. Paroubek. Twitter as a corpus for senanalysis and opinionJ.mining. In International [56] timent I. Ounis, C. Macdonald, Lin, and I. Soboroff. Conference Resources and Evaluation, Overview of on the Language TREC-2011 Microblog Track. In 20th pages 1320–1326, 2010. Text REtrieval Conference (TREC), 2011. [58] Paul, M.and J, and M. Dredze.Twitter In ICWSM, pages 265–272. [57] A. Pak P. Paroubek. as a corpus for sentiment analysis and opinion mining. In International [59] Conference V. Plachouras Y. Stavrakas. Querying term assoon and Language Resources and Evaluation, ciations and their 2010. temporal evolution in social data. In pages 1320–1326, International VLDB Workshop on Online Social Sys2012. [58] tems, Paul, M. J, and M. Dredze. In ICWSM, pages 265–272. [60] V. Plachouras, Y. Stavrakas, and Querying A. Andreou. Assess[59] Plachouras and Y. Stavrakas. term assoing the coverage data collection campaigns Twitciations and theiroftemporal evolution in socialon data. In ter: A case study. On the Move to Meaningful InInternational VLDBInWorkshop on Online Social Systernet 2012. Systems: OTM 2013 Workshops, pages 598–607. tems, 2013. [60] V. Plachouras, Y. Stavrakas, and A. Andreou. 
Assess[61] ing D. the Preotiuc-Pietro, S. collection Samangooei, and T. coverage of data campaigns on Cohn. TwitTrendminer : An architecture for real analysisInof ter: A case study. In On the Move to time Meaningful social text. In Workshop on Real-Time Analysis ternet media Systems: OTM 2013 Workshops, pages 598–607. and 2013.Mining of Social Streams, pages 4–7, 2012. [62] L. Ratinov and D. Roth. challenges [61] D. Preotiuc-Pietro, S. Design Samangooei, andand T.misconCohn. ceptions in named entity recognition. Conference Trendminer : An architecture for realIntime analysis on of Computational Natural Language social media text. In Workshop on Learning Real-Time(CoNLL), Analysis number June, and Mining of pages Social147–155, Streams,2009. pages 4–7, 2012. [62] L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. In Conference on Computational Natural Language Learning (CoNLL), number June, pages 147–155, 2009. SIGKDD Explorations [68] A. Signorini, and A. M. and P.A M. Polgreen. The [69] Y. Stavrakas V. Segre, Plachouras. platform for supuse of Twitter to track of challenges disease activity and porting data analytics onlevels twitter and objecpublicIntl. concern in the U.S. during theExtraction influenza A&H1N1 tives. Workshop on Knowledge Conpandemic. from PloS Social one, 6(5), Jan.(Ict 2011. solidation Media, 270239), 2013. [69] Y. Stavrakas and V. Plachouras. A platform for sup[70] Tumasjan, Andranik, T. O. Sprenger, P. G. Sandner, porting data analytics on twitter challenges objecand I. M. Welpe. Predicting Elections withand Twitter: tives. Intl. Workshop on Knowledge Extraction & ConWhat 140 Characters Reveal about Political Sentiment. solidation from Social Media, (Ict 270239), 2013. In ICWSM, pages 178–185, 2010. [70] Andranik, T.J.O. Sprenger, P. G. Sandner, [71] Tumasjan, J. Weng, E.-p. Lim, and Jiang. TwitterRank : Findand I. M. Welpe. 
Predicting ing topic-sensitive influential Elections twitterers.with In Twitter: WSDM, What Characters pages 140 261–270, 2010. Reveal about Political Sentiment. In ICWSM, pages 178–185, 2010. [72] J. S. White, J. N. Matthews, and J. L. Stacy. Coalmine: [71] an J. Weng, E.-p. in Lim, and J.aJiang. : Findexperience building systemTwitterRank for social media aning topic-sensitive influential In WSDM, alytics. In I. V. Ternovskiy andtwitterers. P. Chin, editors, Propages 261–270, 2010. ceedings of SPIE, volume 8408, 2012. [72] J. N. Matthews, J. L. Stacy. Coalmine: [73] P. S. T.White, Wood. J. Query languagesand for graph databases. SIGan experience building a Apr. system for social media anMOD Record, in 41(1):50–60, 2012. alytics. In I. V. Ternovskiy and P. Chin, editors, Pro[74] S. Wu, J.ofM. Hofman, W.8408, A. Mason, ceedings SPIE, volume 2012. and D. J. Watts. Who says what to whom on twitter. In WWW, pages [73] 705–714, P. T. Wood. Query Mar. 2011.languages for graph databases. SIGMOD Record, 41(1):50–60, Apr. 2012. [75] X. Yan, P. S. Yu, and J. Han. Graph indexing : A [74] S. Wu, J.structure-based M. Hofman, W.approach. A. Mason, D. J. Watts. frequent In and SIGMOD, pages Who says2004. what to whom on twitter. In WWW, pages 335–346, 705–714, Mar. 2011. [76] J. Yin, S. Karimi, B. Robinson, and M. Cameron. ESA: [75] X. Yan, P. situation S. Yu, and J. Han. via Graph indexing : In A emergency awareness microbloggers. frequentpages structure-based In SIGMOD, pages CIKM, 2701–2703, approach. 2012. 335–346, 2004. [76] J. Yin, S. Karimi, B. Robinson, and M. Cameron. ESA: emergency situation awareness via microbloggers. In CIKM, pages 2701–2703, 2012. 
SIGKDD Explorations, Volume 16, Issue 1

What is Tumblr: A Statistical Overview and Comparison

Yi Chang†, Lei Tang§, Yoshiyuki Inagaki†, Yan Liu‡
† Yahoo Labs, Sunnyvale, CA 94089
§ @WalmartLabs, San Bruno, CA 94066
‡ University of Southern California, Los Angeles, CA 90089
[email protected], [email protected], [email protected], [email protected]

ABSTRACT
Tumblr, one of the most popular microblogging platforms, has gained momentum recently. It is reported to have 166.4 million users and 73.4 billion posts as of January 2014. While many articles about Tumblr have appeared in the major press, there is little scholarly work on it so far. In this paper, we provide a pioneering analysis of Tumblr from a variety of aspects. We study the social network structure among Tumblr users, analyze its user-generated content, and describe reblogging patterns to characterize user behavior. We aim to provide a comprehensive statistical overview of Tumblr and compare it with other popular social services, including the blogosphere, Twitter, and Facebook, to answer two key questions: What is Tumblr? How does Tumblr differ from other social media networks? In short, we find that Tumblr has richer content than other microblogging platforms, and that it combines characteristics of social networking, the traditional blogosphere, and social media. This work serves as an early snapshot of Tumblr that later work can leverage.

1. INTRODUCTION
Tumblr, one of the most prevalent microblogging sites, has become phenomenal in recent years, and it was acquired by Yahoo! in 2013. By mid-January 2014, Tumblr had 166.4 million users and 73.4 billion posts (http://www.tumblr.com/about). It is reported to be the most popular social site among the young generation, as half of Tumblr's visitors are under 25 years old (http://www.webcitation.org/64UXrbl8H). Tumblr is ranked the 16th most popular site in the United States, making it the 2nd most dominant blogging site, the 2nd largest microblogging service, and the 5th most prevalent social site (http://www.alexa.com/topsites/countries/US).
In contrast to the momentum Tumblr has gained in the recent press, little academic research has been conducted on this burgeoning social service. Naturally, questions arise: What is Tumblr? What is the difference between Tumblr and other blogging or social media sites? Traditional blogging sites, such as Blogspot (http://blogspot.com) and LiveJournal (http://livejournal.com), have high-quality content but little social interaction. Nardi et al. [17] investigated blogging as a form of personal communication and expression, and showed that the vast majority of blog posts are written by ordinary people for a small audience. On the contrary, popular social networking sites like Facebook (http://facebook.com) have richer social interactions, but lower-quality content compared with the blogosphere. Since most social interactions are either unpublished or less meaningful to the majority of the public audience, it is natural for Facebook users to form distinct communities or social circles. Microblogging services, in between traditional blogging and online social networking, have intermediate-quality content and intermediate social interactions. Twitter (http://twitter.com), the largest microblogging site, limits each post to 140 characters, and the Twitter following relationship is not reciprocal: a Twitter user does not need to follow back if followed by another. As a result, Twitter is considered a new social medium [11], and short messages can be broadcast to a Twitter user's followers in real time. Tumblr is also positioned as a microblogging platform. Tumblr users can follow another user without following back, which forms a non-reciprocal social network; a Tumblr post can be re-broadcast by a user to his own followers via reblogging.
But unlike Twitter, Tumblr imposes no length limit on posts, and it also supports multimedia posts, such as images, audio, and video. With these differences in mind, are the social network, user-generated content, or user behavior on Tumblr dramatically different from other social media sites? In this paper, we provide a statistical overview of Tumblr from assorted aspects. We study the social network structure among Tumblr users and compare its network properties with those of other commonly used services. Meanwhile, we study the content generated on Tumblr and examine content generation patterns. Going one step further, we also analyze how a blog post is reblogged and propagated through the network, both topologically and temporally. Our study shows that Tumblr provides a hybrid microblogging service: it has dual characteristics of both social media and traditional blogging. Meanwhile, surprising patterns surface. We describe these intriguing findings and provide insights, which hopefully can be leveraged by other researchers to understand more about this new form of social media.

2. TUMBLR AT FIRST SIGHT
Tumblr is ranked the second largest microblogging service, right after Twitter, with over 166.4 million users and 73.4 billion posts by January 2014. Tumblr is easy to register for: one can sign up with a valid email address within 30 seconds. Once signed in, a user can follow other users. Unlike on Facebook, connections on Tumblr do not require mutual confirmation; hence the social network in Tumblr is unidirectional. Both Twitter and Tumblr are considered microblogging platforms.
Compared with Twitter, Tumblr exhibits several differences:

• There is no length limit on posts;
• Tumblr supports multimedia posts, such as images, audio, and video;
• Similar to hashtags on Twitter, bloggers can also tag their blog posts, as is commonplace in traditional blogging. But tags in Tumblr are separate from the blog content, whereas on Twitter a hashtag can appear anywhere within a tweet;
• Tumblr recently (Jan. 2014) allowed users to mention and link to specific users inside posts; this @user mechanism needs more time to be adopted by the community;
• Tumblr does not differentiate verified accounts.

[Figure 1: Post Types in Tumblr]

Specifically, Tumblr defines 8 types of posts: photo, text, quote, audio, video, chat, link, and answer. As shown in Figure 1, one has the flexibility to start a post in any type except answer. Text, photo, audio, video, and link allow one to post, share, and comment on any multimedia content. Quote and chat, which are not available on most other social networking platforms, let Tumblr users share quotes or chat histories from iChat or MSN. Answer occurs only when one interacts with other users: when a user posts a question, in particular a post ending with a question mark, the user can enable an option for others to answer the question, which is disabled automatically after 7 days. A post can also be reblogged by another user to broadcast it to his own followers. The reblogged post quotes the original post by default and allows the reblogger to add additional comments.

[Figure 2: Distribution of Posts — Photo: 78.11%, Text: 14.13%, Quote: 2.27%, Audio: 2.01%, Video: 1.35%, Chat: 0.85%, Answer: 0.82%, Link: 0.46%]

Figure 2 demonstrates the distribution of Tumblr post types, based on 586.4 million posts we collected. As seen in the figure, even though all kinds of content are supported, photo and text dominate the distribution, accounting for more than 92% of the posts.
Therefore, we will concentrate on these two types of posts for our content analysis later. Since Tumblr has a strong presence of photos, it is natural to compare it to other photo- or image-based social networks like Flickr (http://flickr.com) and Pinterest (http://pinterest.com). Flickr is mainly an image-hosting website whose users can add contacts and comment on or like others' photos. Yet, unlike on Tumblr, one cannot reblog another's photo on Flickr. Pinterest is designed for curators, allowing one to share photos or videos of her taste with the public. Pinterest links a pin to the commercial website where the product presented in the pin can be purchased, which accounts for its stronger e-commerce character. The target audiences of Tumblr and Pinterest are therefore quite different: the majority of Tumblr users are under age 25, while Pinterest is heavily used by women aged 25 to 44 [16].

We directly sample a sub-graph snapshot of the Tumblr social network from August 2013, which contains 62.8 million nodes and 3.1 billion edges. Though this graph is not fully up-to-date, we believe that many network properties are well preserved given its scale. Meanwhile, we sample about 586.4 million Tumblr posts from August 10 to September 6, 2013. Unfortunately, Tumblr does not require users to fill in basic profile information, such as gender or location, so it is impossible for us to conduct the kind of user profile analysis done in other works. In order to handle such a large volume of data, most statistical patterns are computed on a MapReduce cluster, with some of the algorithms requiring careful engineering. We will skip the implementation details and concentrate solely on the derived patterns.
Most statistical patterns can be presented in three different forms: the probability density function (PDF), the cumulative distribution function (CDF), or the complementary cumulative distribution function (CCDF), describing Pr(X = x), Pr(X ≤ x), and Pr(X ≥ x) respectively, where X is a random variable and x is a particular value. Due to space limits, it is impossible to include all of them. Hence, we decide which form(s) to include based on convenience of presentation and of comparison with other relevant papers. That is, if a CCDF is reported in a relevant paper, we try to also report a CCDF here so that a rigorous comparison is possible. Next, we study the properties of Tumblr through different lenses, in particular as a social network, a content generation website, and an information propagation platform, respectively.

3. TUMBLR AS SOCIAL NETWORK
We begin our analysis of Tumblr by examining its social network topology. Numerous social networks have been analyzed in the past, such as the traditional blogosphere [21], Twitter [10; 11], Facebook [22], and an instant messenger communication network [13]. Here we run an array of standard network analyses to compare with other networks, with the results summarized in Table 1.

Degree Distribution. Since Tumblr does not require mutual confirmation when one user follows another, we represent the follower-followee network in Tumblr as a directed graph: the in-degree of a user represents how many followers the user has attracted, while the out-degree indicates how many other users that user follows. Our sampled sub-graph contains 62.8 million nodes and 3.1 billion edges. Even though we would like to include results for other popular social media networks such as Pinterest, Sina Weibo, and Instagram, analyses of those websites are either unavailable or only small-scale case studies that are difficult to generalize to a comprehensive scale for a fair comparison.
Indeed, in Table 1 we observe quite a discrepancy between the numbers reported over a small Twitter data set and those from a comprehensive snapshot.

Table 1: Comparison of Tumblr with other popular social networks. The numbers for Blogosphere, Twitter-small, Twitter-huge, Facebook, and MSN are obtained from [21; 10; 11; 22; 13], respectively. In the table, – means the corresponding statistic is not available or not applicable; GCC denotes the giant connected component; the symbols in parentheses (m, d, e, r) respectively denote the mean, the median, the 90% effective diameter, and the diameter (the maximum shortest path in the network).

Metric                  | Tumblr          | Blogosphere | Twitter-small | Twitter-huge    | Facebook     | MSN
#nodes                  | 62.8M           | 143,736     | 87,897        | 41.7M           | 721M         | 180M
#links                  | 3.1B            | 707,761     | 829,467       | 1.47B           | 68.7B        | 1.3B
in-degree distr         | ∝ k^-2.19       | ∝ k^-2.38   | ∝ k^-2.4      | ∝ k^-2.276      | –            | –
degree distr in r-graph | ≠ power-law     | –           | –             | –               | ≠ power-law  | ∝ k^0.8 e^-0.03k
direction               | directed        | directed    | directed      | directed        | undirected   | undirected
reciprocity             | 29.03%          | 3%          | 58%           | 22.1%           | –            | –
degree correlation      | 0.106           | –           | –             | > 0             | 0.226        | –
avg distance            | 4.7(m), 5(d)    | 9.3(m)      | –             | 4.1(m), 4(d)    | 4.7(m), 5(d) | 6.6(m), 6(d)
diameter                | 5.4(e), ≥29(r)  | 12(r)       | 6(r)          | 4.8(e), ≥18(r)  | < 5(e)       | 7.8(e), ≥29(r)
GCC coverage            | 99.61%          | 75.08%      | 93.03%        | –               | 99.91%       | 99.90%

Within this social graph, 41.40% of nodes have in-degree 0, and the maximum in-degree of a node is 4.06 million. By contrast, 12.74% of nodes have out-degree 0, and the maximum out-degree of a node is 155.5k. The most popular Tumblr users include equipo (equipo.tumblr.com), instagram (instagram.tumblr.com), and woodendreams (woodendreams.tumblr.com). This indicates the media characteristic of Tumblr: the most popular user has an audience of more than 4 million, while more than 40% of users are purely audience, since they have no followers.
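As a concrete illustration of the CCDF form used throughout this section, the following sketch computes an empirical CCDF, P(K ≥ k), from a list of observed degrees. The degree values here are toy data for illustration only, not the Tumblr sample, and fitting the power-law exponent itself would require a proper estimator (e.g. maximum likelihood), which is omitted.

```python
from collections import Counter

def empirical_ccdf(degrees):
    """Map each observed degree k to P(K >= k)."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = {}
    tail = n  # number of observations with degree >= current k
    for k in sorted(counts):
        ccdf[k] = tail / n
        tail -= counts[k]
    return ccdf

# toy degree sequence (hypothetical, for illustration only)
degrees = [0, 0, 1, 1, 2, 3, 5, 8]
ccdf = empirical_ccdf(degrees)
# e.g. ccdf[2] is the fraction of nodes with degree >= 2
```

Plotting `ccdf` on log-log axes is the usual way to eyeball a power-law tail, as in Figure 3.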
Figure 3(a) shows the distribution of in-degrees (blue curve) and out-degrees (red curve), where the y-axis is the complementary cumulative distribution function (CCDF): the probability that an account has at least k in-degrees or out-degrees, i.e., P(K ≥ k). We observe that Tumblr users' in-degree follows a power-law distribution with exponent −2.19, quite similar to the power-law exponent of Twitter at −2.28 [11] or that of traditional blogs at −2.38 [21]. This also confirms the earlier empirical observation that most social networks have a power-law exponent between −2 and −3 [6]. Regarding the out-degree distribution, we notice that the red curve drops sharply at an out-degree of around 5,000, since there was a limit that ordinary Tumblr users could follow at most 5,000 other users. Tumblr users' out-degree does not follow a power-law distribution, which is similar to the blogosphere of traditional blogging [21]. Exploring in-degree and out-degree together yields the normalized 3-D histogram in Figure 3(b). As both in-degree and out-degree follow heavy-tail distributions, we zoom in only on those users who have fewer than 2^10 in-degrees and out-degrees. Apparently, there is a positive correlation between in-degree and out-degree, as indicated by the dominance of the diagonal bars. In aggregate, a user with low in-degree tends to have low out-degree as well, even though some nodes, especially the most popular ones, have very imbalanced in-degree and out-degree.

Reciprocity. Since Tumblr is a directed network, we would like to examine the reciprocity of the graph. We derive the backbone of the Tumblr network by keeping only the reciprocal connections, i.e., where user a follows b and vice versa. Let r-graph denote the corresponding reciprocal graph.
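The r-graph derivation just described (keep a directed edge only if its reverse also exists) and the pair-level reciprocity measure can be sketched as follows. The edge list is a toy example; the actual computation over 3.1 billion edges ran on a MapReduce cluster.

```python
def reciprocal_backbone(edges):
    """Keep only directed edges (a, b) whose reverse (b, a) is also present."""
    edge_set = set(edges)
    return {(a, b) for (a, b) in edge_set if (b, a) in edge_set}

def reciprocity(edges):
    """Fraction of connected (unordered) user pairs that follow each other."""
    edge_set = {e for e in edges if e[0] != e[1]}  # ignore self-loops
    pairs = {frozenset(e) for e in edge_set}       # unordered connected pairs
    mutual = 0
    for pair in pairs:
        a, b = tuple(pair)
        if (a, b) in edge_set and (b, a) in edge_set:
            mutual += 1
    return mutual / len(pairs)

# toy follower graph: a<->b, a->c, c<->d
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")]
r_graph = reciprocal_backbone(edges)   # only the mutual edges survive
rate = reciprocity(edges)              # 2 of the 3 connected pairs are mutual
```

Note that reciprocity is defined over unordered pairs, which is why the toy graph above yields 2/3 rather than 4/5.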
We found that 29.03% of connected Tumblr user pairs have a reciprocal relationship, which is higher than the 22.1% reciprocity on Twitter [11] and the 3% on Blogosphere [21], indicating stronger interaction between users in the network. Figure 3(c) shows the distribution of degrees in the r-graph. There is a turning point due to the Tumblr limit of 5,000 followees for ordinary users. The reciprocal relationships on Tumblr do not follow a power-law distribution, since the curve is mostly convex, similar to the pattern reported for Facebook [22]. Meanwhile, it has been observed that a user's degree is correlated with the degrees of his or her friends. This is also called degree correlation or degree assortativity [18; 19]. Over the derived r-graph, we obtain a correlation of 0.106 between the terminal nodes of reciprocal connections, reconfirming the positive degree assortativity reported for Twitter [11]. Nevertheless, compared with the strong social network Facebook, Tumblr's degree assortativity is weaker (0.106 vs. 0.226).

11 http://equipo.tumblr.com
12 http://instagram.tumblr.com
13 http://woodendreams.tumblr.com

Degree of Separation. The small-world phenomenon is almost universal among social networks. With this huge Tumblr network, we are able to validate the well-known "six degrees of separation" as well. Figure 4 displays the distribution of the shortest paths in the network. To approximate the distribution, we randomly sample 60,000 nodes as seeds and calculate, for each node, the shortest paths to other nodes. We observe that the distribution of path lengths reaches its mode at 4 hops and has a median of 5 hops. On average, the distance between two connected nodes is 4.7. Even though the longest shortest path in the approximation has 29 hops, 90% of shortest paths are within 5.4 hops.
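The sampling procedure just described, random seed nodes with a breadth-first search from each, can be sketched as below. This is an illustrative approximation, not the authors' code, and the toy path graph is hypothetical:

```python
import random
from collections import defaultdict, deque

def bfs_distances(adj, seed):
    """Hop distances from `seed` to every reachable node (unweighted BFS)."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sample_path_lengths(adj, n_seeds, rng=None):
    """Approximate the shortest-path-length distribution from random seeds."""
    rng = rng or random.Random(0)
    lengths = []
    nodes = list(adj)
    for seed in rng.sample(nodes, min(n_seeds, len(nodes))):
        lengths.extend(d for d in bfs_distances(adj, seed).values() if d > 0)
    return lengths

# Toy undirected path graph a-b-c-d (edges symmetrized).
adj = defaultdict(list)
for a, b in [("a", "b"), ("b", "c"), ("c", "d")]:
    adj[a].append(b)
    adj[b].append(a)

lengths = sample_path_lengths(adj, n_seeds=4)
print(sum(lengths) / len(lengths))              # average sampled distance
print(sorted(lengths)[len(lengths) * 9 // 10])  # rough 90th-percentile hop count
```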
All these numbers are close to those reported for Facebook and Twitter, yet significantly smaller than those obtained over the blogosphere and the instant-messenger network [13]. Component Size. The previous result shows that connected users have a small average distance. It relies on the assumption that most users are connected to each other, which we confirm next. Because the Tumblr graph is directed, we compute all weakly connected components by ignoring edge direction. It turns out that the giant connected component (GCC) encompasses 99.61% of the nodes in the graph. Over the derived r-graph, 97.55% of nodes reside in the corresponding GCC. This finding suggests that the whole graph is almost a single connected component, and almost all users can reach each other within a few hops. To give a palpable overview, we summarize commonly used network statistics in Table 1. The numbers from other popular social networks (blogosphere, Twitter, Facebook, and MSN) are also included for comparison. From this compact view, it is obvious that traditional blogs yield a significantly different network structure. Tumblr, even though originally proposed for blogging, yields a network structure that is more similar to Twitter and Facebook.
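The weakly-connected-component computation behind the GCC coverage figure can be sketched as follows (our illustration; the toy edges are hypothetical):

```python
from collections import defaultdict, deque

def giant_component_coverage(edges):
    """Share of nodes in the largest weakly connected component.

    Edge direction is ignored, matching the paper's WCC computation."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            comp += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, comp)
    return best / len(adj)

# Toy graph: one 4-node chain plus an isolated pair; the GCC covers 4 of 6 nodes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
print(giant_component_coverage(edges))
```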
SIGKDD Explorations, Volume 16, Issue 1, Page 23

[Figure 3: Degree Distribution of Tumblr Network. (a) in/out-degree distribution (CCDF); (b) in/out-degree correlation; (c) degree distribution in the r-graph.]

Table 2: Statistics of User Generated Contents
| | Text Post Dataset | Photo Caption Dataset |
| #Posts | 21.5 M | 26.3 M |
| Mean Post Length | 426.7 Bytes | 64.3 Bytes |
| Median Post Length | 87 Bytes | 29 Bytes |
| Max Post Length | 446.0 K Bytes | 485.5 K Bytes |

[Figure 4: Shortest Path Distribution (PDF and CDF of shortest path length).]

4. TUMBLR AS BLOGOSPHERE FOR CONTENT GENERATION

As Tumblr was initially proposed for the purpose of blogging, here we analyze its user-generated content. As described earlier, photo and text posts account for more than 92% of all posts; hence, we concentrate only on these two types. A text post may contain a URL, a quote, or a raw message. In this study, we are mainly interested in the authentic content generated by users; hence, we extract the raw message as the content of each text post by removing quotes and URLs. Similarly, photo posts contain three categories of information: the photo URL, a quoted photo caption, and a raw photo caption. While the photo URL might lead to a lot of additional meta-information, it would require tremendous effort to analyze all images on Tumblr; hence, we focus on raw photo captions as the content of each photo post. We end up with two content datasets: one of text posts and one of photo captions. What is the effect of having no length limit on posts?
Both Tumblr and Twitter are considered microblogging platforms, yet there is one key difference: Tumblr has no length limit, while Twitter enforces a strict limit of 140 bytes on each tweet. How does this key difference affect user posting behavior? It has been reported that the average length of posts on Twitter is 67.9 bytes and the median is 60 bytes [14]. The corresponding statistics for Tumblr are shown in Table 2. For the text post dataset, the average length is 426.7 bytes and the median is 87 bytes, both of which, as expected, are longer than Twitter's. Keep in mind that Tumblr's numbers are obtained after removing all quotes, photos, and URLs, which, if anything, understates the discrepancy between Tumblr and Twitter. The big gap between mean and median is due to a small percentage of extremely long posts; for instance, the longest text post in our sampled dataset is 446K bytes. As for photo captions, we naturally expect them to be much shorter than text posts: the average length is around 64.3 bytes, but the median is only 29 bytes. Although photo posts are dominant in Tumblr, the numbers of text posts and photo captions in Table 2 are comparable, because the majority of photo posts do not contain any raw photo caption. A further related question: is the 140-byte limit sensible? We plot the post length distribution of the text post dataset, zoomed into lengths below 280 bytes, in Figure 5. About 24.48% of posts exceed 140 bytes, which indicates that at least around one quarter of posts would have to be rewritten in a more compact form if such a limit were enforced on Tumblr. Putting all of these numbers together, we can see at least two types of posts: one resembles posting a reference (a URL or photo) with brief added information or short comments, while the other is authentic user-generated content as in traditional blogging.
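The length statistics discussed above reduce to a few aggregates per dataset. A minimal sketch, with hypothetical byte counts chosen so that a few very long posts pull the mean far above the median, as observed for Tumblr text posts:

```python
from statistics import mean, median

def length_profile(post_lengths, limit=140):
    """Mean and median post length, plus the share of posts over `limit` bytes."""
    over = sum(1 for n in post_lengths if n > limit)
    return mean(post_lengths), median(post_lengths), over / len(post_lengths)

# Hypothetical byte lengths for a handful of posts.
lengths = [30, 60, 87, 90, 120, 150, 300, 5000]
avg, med, frac_over = length_profile(lengths)
print(avg, med, frac_over)  # mean >> median; 3 of 8 posts exceed 140 bytes
```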
In other words, Tumblr is a mix of both types of posts, and its no-length-limit policy encourages its users to directly post longer, higher-quality content.

[Figure 5: Post Length Distribution (CCDF of text-post length, zoomed into 0 to 300 bytes).]

14 http://www.quora.com/Twitter-1/What-is-the-average-length-ofa-tweet

What are people talking about? Because there is no length limit on Tumblr, blog posts tend to be more meaningful, which allows us to run topic analysis over the two datasets to obtain an overview of the content. We run LDA [4] with 100 topics on both datasets and showcase several topics and their corresponding keywords in Tables 3 and 4, which also clearly show the high quality of the textual content on Tumblr.

Table 3: Topical Keywords from Text Post Dataset
| Topic | Topical Keywords |
| Pop Music | music song listen iframe band album lyrics video guitar |
| Sports | game play team win video cookie ball football top sims fun beat league |
| Internet | internet computer laptop google search online site facebook drop website app mobile iphone |
| Pets | big dog cat animal pet animals bear tiny small deal puppy |
| Medical | anxiety pain hospital mental panic cancer depression brain stress medical |
| Finance | money pay store loan online interest buying bank apply card credit |

Medical, Pets, Pop Music, and Sports are shared interests across the two datasets, although the representative topical keywords may differ even for the same topic. Finance and Internet attract enough attention only in text posts, while only photo posts show significant interest in the Photography and Scenery topics. We want to emphasize that most of these keywords are semantically meaningful and representative of their topics. Who are the major contributors of content? There are two potential hypotheses. 1) One supposes that socially popular users post more.
This is derived from the result that popular users are followed by many users; blogging is thus one way to attract a larger audience of followers. Meanwhile, blogging may also be an incentive for celebrities to interact with or reward their followers. 2) The other assumes that long-term users (in terms of registration time) post more, since they are accustomed to the service and are more likely to have their own focused communities or social circles. These peer interactions encourage them to generate more authentic content to share with others. Do socially popular users or long-term users generate more content? In order to answer this question, we choose a fixed time window of two weeks in August 2013 and examine how frequently each user blogs on Tumblr. We sort all users based on their in-degree (or duration since registration) and then partition them into 10 equi-width bins. For each bin, we calculate the average blogging frequency. For easy comparison, we normalize the bin values so that the maximum over all bins equals 1. The results are displayed in Figure 6, where the x-axis from left to right indicates increasing in-degree (or decreasing duration).

Table 4: Topical Keywords from Photo Caption Dataset
| Topic | Topical Keywords |
| Pets | cat dog cute upload kitty batch puppy pet animal kitten adorable |
| Scenery | summer beach sun sky sunset sea nature ocean island clouds lake pool beautiful |
| Pop Music | music song rock band album listen lyrics punk guitar dj pop sound hip |
| Photography | photo instagram pic picture check daily shoot tbt photography |
| Sports | team world ball win football club round false soccer league baseball |
| Medical | body pain skin brain depression hospital teeth drugs problems sick cancer blood |

For brevity, we show only the result for the text post dataset, as similar patterns were observed over photo captions. The patterns are strong in both figures: users with higher in-degree tend to post more, in terms of both mean and median.
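The binning procedure just described can be sketched as follows. Note that this is our interpretation of the paper's "10 equi-width bins" over sorted users: the sketch uses equal-sized groups in sorted order, and the toy in-degree and frequency data are hypothetical.

```python
def binned_normalized_means(values, keys, n_bins=10):
    """Sort items by `keys`, split them into `n_bins` equal-size groups
    (in sorted order), and return each group's mean of `values`,
    normalized so that the largest group mean equals 1."""
    order = sorted(range(len(values)), key=lambda i: keys[i])
    size = len(order) // n_bins
    chunks = [order[b * size:(b + 1) * size] for b in range(n_bins)]
    means = [sum(values[i] for i in c) / len(c) for c in chunks]
    top = max(means)
    return [m / top for m in means]

# Toy data: 8 users whose posting frequency grows with in-degree.
in_degree = [7, 6, 5, 4, 3, 2, 1, 0]
post_freq = [4, 4, 3, 3, 2, 2, 1, 1]
print(binned_normalized_means(post_freq, in_degree, n_bins=4))
# → [0.25, 0.5, 0.75, 1.0]
```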
One caveat is that what we observe and report here is merely correlation; it does not imply causality. We therefore draw the conservative conclusion that social popularity is highly positively correlated with blogging frequency. A similar positive correlation has also been observed on Twitter [11]. In contrast, the pattern with respect to user registration time was unexpected until we drew the figure. Surprisingly, users who registered either earliest or latest tend to post less frequently, while those in between are inclined to post more frequently. Clearly, our initial hypothesis about the incentive for newer users to blog more is invalid. There could be various explanations in hindsight; rather than guessing at the underlying mechanism, we leave this phenomenon as an open question for future researchers. For reference, we also look at users' average post length, because it has been adopted as a simple metric to approximate the quality of blog posts [1]. The corresponding correlations are plotted in Figure 7. In terms of post length, the tail users of the social network are the winners. Meanwhile, long-term or recently joined users tend to write longer posts. This pattern is exactly the opposite of post frequency: the more frequently one blogs, the shorter the posts are, and less frequent bloggers tend to write longer posts. This is plausible, considering that each individual has limited time and resources. We even replaced each user's average post length with the maximum, but the pattern remains. In summary, without a post length limitation, Tumblr users are inclined to write longer blogs, leading to higher-quality user-generated content that can be leveraged for topic analysis. Social celebrities (those with a large number of followers) are the main contributors of content, similar to Twitter [24].
Surprisingly, long-term users and recently registered users tend to blog less frequently. Post length in general has a negative correlation with post frequency: the more frequently one posts, the shorter those posts tend to be.

[Figure 6: Correlation of Post Frequency with User In-degree or Duration Time since Registration]
[Figure 7: Correlation of Post Length with User In-degree or Duration Time since Registration]

5. TUMBLR FOR INFORMATION PROPAGATION

Tumblr offers one feature that is missing in traditional blog services: the reblog. Once a user publishes a post, other Tumblr users can reblog it to comment on it or broadcast it to their own followers. This enables information to be propagated through the network. In this section, we examine reblogging patterns in Tumblr. We examine all blog posts uploaded within the first two weeks and count reblog events in the two weeks right after each blog is posted, so that there is no bias due to the time-window selection in our blog data. Who is reblogging? First, we would like to understand which users tend to reblog more. People who reblog frequently serve as information transmitters. Similar to the previous section, we examine the correlation of reblogging behavior with users' in-degree. As shown in Figure 8, social celebrities, who are the major source of content, reblog much more than other users.
This reblogging is propagated further through their huge numbers of followers. Hence, they serve as both content contributors and information transmitters. On the other hand, users who registered earlier also reblog more. The socially popular and long-term users are the backbone of the Tumblr network, making it a vibrant community for information propagation and sharing. Reblog size distribution. Once a blog is posted, it can be reblogged by others, and those reblogs can be reblogged even further. This leads to a tree structure, called a reblog cascade, with the first author as the root node. The reblog cascade size is the number of reblog actions involved in the cascade. Figure 9 plots the distribution of reblog cascade sizes. Not surprisingly, it follows a power-law distribution, with the majority of reblog cascades involving few reblog events. Yet, within a time window of two weeks, the maximum cascade size reaches 116.6K. In order to gain a more detailed understanding of reblog cascades, we zoom into the short head and plot the CCDF up to cascade size 20 in Figure 9. We observe that about 19.32% of reblog cascades have a size greater than 10; by contrast, only 1% of retweet cascades have a size larger than 10 [11]. Reblog cascades in Tumblr thus tend to be larger than retweet cascades in Twitter. Reblog depth distribution. As shown in previous sections, almost any pair of users is connected through a few hops. How many hops does a blog actually travel to reach another user? To answer this, we look at the reblog cascade depth: the maximum number of nodes one must pass to reach a leaf node from the root node in the cascade structure. Note that reblog depth and size are different: a cascade of depth 2 can involve hundreds of nodes if every other node in the cascade reblogs the same root node.
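Cascade size and depth can be computed from a cascade's parent-child reblog edges. The sketch below is ours, under one plausible reading of the definitions above (size counts reblog actions, depth counts nodes on the longest root-to-leaf path, so the star-shaped cascade in the text has depth 2); the toy cascade is hypothetical:

```python
def cascade_stats(reblog_edges, root):
    """Cascade size (number of reblog actions) and depth (number of nodes on
    the longest root-to-leaf path), from (rebloggee, reblogger) edges."""
    children = {}
    for parent, child in reblog_edges:
        children.setdefault(parent, []).append(child)
    depth = 0
    stack = [(root, 1)]  # the root alone counts as depth 1
    while stack:
        node, d = stack.pop()
        depth = max(depth, d)
        for c in children.get(node, []):
            stack.append((c, d + 1))
    return len(reblog_edges), depth

# Hypothetical cascade: r is reblogged by a and b; a is reblogged by c.
edges = [("r", "a"), ("r", "b"), ("a", "c")]
print(cascade_stats(edges, "r"))  # 3 reblog actions, depth 3 (r -> a -> c)
```

With this convention, a star in which 100 users all reblog the root directly has size 100 but depth only 2, matching the size/depth distinction drawn in the text.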
Figure 10 plots the distribution of the number of hops. Again, the reblog cascade depth distribution follows a power law according to the PDF; zooming into the CCDF, we observe that only 9.21% of reblog cascades have a depth larger than 6. That is, the majority of cascades reach only a few hops, which is consistent with the findings reported over Twitter [3]. In fact, 53.31% of cascades in Tumblr have depth 2. Nevertheless, the maximum depth among all cascades reaches 241 in our two-week data. This looks unlikely at first glance, considering that any two users are just a few hops apart. Indeed, this is because users can add a comment while reblogging, so one user may be involved in the same reblog cascade multiple times. We notice that some Tumblr users adopt reblogging as a form of conversation or chat.

[Figure 8: Correlation of Reblog Frequency with User In-degree or Duration Time since Registration]
[Figure 9: Distribution of Reblog Cascade Size (PDF, with the CCDF zoomed into sizes up to 25)]

The larger the time gap since a blog is posted, the less likely it is to be reblogged: 75.03% of first reblogs arrive within the first hour after a blog is posted, and 95.84% appear within one day. Comparatively, it has been reported that "half of retweeting occurs within an hour and 75% under a day" [11] on Twitter. In short, Tumblr reblogging has a strong bias toward recency, and information propagation on Tumblr is fast.

Reblog Structure Distribution.
Since most reblog cascades span only a few hops, we show the cascade tree structure distribution up to size 5 in Figure 11, with structures sorted by their coverage. A substantial percentage of cascades (36.05%) are of size 2, i.e., a post being reblogged merely once. Generally speaking, a reblog cascade with a flat structure tends to have a higher probability than a cascade of the same size with a deep structure. For instance, a reblog cascade of size 3 has two variants, of which the flat one covers 9.42% of cascades while the deep one drops to 5.85%. The same pattern applies to reblog cascades of sizes 4 and 5. In other words, it is generally easier to spread a message widely than deeply. This implies that, when one tries to maximize influence or information propagation, it may be acceptable to consider only the cascade effect within a few hops and to focus on nodes with larger audiences. Temporal pattern of reblog. Having investigated information propagation spatially in terms of network topology, we now study how fast a blog is reblogged. Figure 12 displays the distribution of the time gap between a post and its first reblog. There is a strong bias toward recency.

6. RELATED WORK

There is a rich literature on both existing and emerging online social network services. Statistical patterns across different types of social networks have been reported, including the traditional blogosphere [21], user-generated content platforms such as Flickr, YouTube, and LiveJournal [15], Twitter [10; 11], the instant-messenger network [13], Facebook [22], and Pinterest [7; 20]. The majority of them observe shared patterns such as a long-tail distribution of user degrees (power law, or power law with exponential cut-off), a small (90%-quantile effective) diameter, positive degree association, and a homophily effect in terms of user profiles (age or location), but not with respect to gender.
Indeed, people are more likely to talk to the opposite sex [13]. A recent study of Pinterest observed that women tend to be more active and engaged than men [20], and that women and men have different interests [5]. We have compared Tumblr's patterns with those of other social networks in Table 1 and observed that most of these trends hold for Tumblr, up to some numerical differences. Lampe et al. [12] conducted a set of survey studies of Facebook users and showed that people use Facebook to maintain existing offline connections. Java et al. [10] presented one of the earliest research papers on Twitter and found that users leverage Twitter to talk about their daily activities and to seek or share information. Schwartz [7] is one of the early studies of Pinterest, showing from a statistical point of view that female users repin more but have fewer followers than male users. Hochman and Schwartz [8] published an early paper using Instagram data, indicating differences in local color usage and cultural production rate in the analysis of location-based visual information flows.

[Figure 11: Cascade Structure Distribution up to Size 5. The percentage at the top of each structure is its coverage: 36.05%, 9.42%, 5.85%, 3.58%, 1.69%, 1.44%, 2.78%, 1.20%, 1.15%, 0.58%, 0.51%, 0.42%, 0.33%, 0.31%, 0.24%, 0.21%.]
[Figure 12: Distribution of Time Lag between a Blog and its First Reblog]
[Figure 10: Distribution of Reblog Cascade Depth]

Existing studies on user influence are based on social networks or content analysis. McGlohon et al. [14] found that topology features can help distinguish blogs, and that the temporal activity of blogs is very non-uniform and bursty, but self-similar. Bakshy et al.
[3] investigated attributes and relative influence based on the Twitter follower graph, and concluded that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potential influencers, thereby capturing average effects. Hopcroft et al. [9] studied Twitter user influence based on two-way reciprocal relationship prediction. Weng et al. [23] extended the PageRank algorithm to measure the influence of Twitter users, taking both the topical similarity between users and the link structure into account. Kwak et al. [11] studied the topological and geographical properties of the entire Twittersphere and observed some notable properties of Twitter, such as a non-power-law follower distribution, a short effective diameter, and low reciprocity, marking a deviation from known characteristics of human social networks. However, due to data access limitations, the majority of existing scholarly papers are based on either Twitter data or traditional blogging data. This work closes the gap by providing the first overview of Tumblr, which others can leverage as a stepping stone to further investigate this evolving social service or to compare it with related services.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we provide a statistical overview of Tumblr in terms of social network structure, content generation, and information propagation. We show that Tumblr serves simultaneously as a social network, a blogosphere, and a social medium. It provides high-quality content with rich multimedia information, which offers unique characteristics that attract young users. Meanwhile, we summarize and offer as rigorous a comparison as possible with other social services, based on numbers reported in other papers. Below we highlight some key findings:

• With multimedia support in Tumblr, photos and text account for the majority of blog posts, while audio and video posts are still rare.
• Tumblr, though initially proposed for blogging, yields a significantly different network structure from the traditional blogosphere. Tumblr's network is much denser and better connected. Close to 29.03% of connections on Tumblr are reciprocal, while the blogosphere has only 3%. The average distance between two users on Tumblr is 4.7, which is roughly half of that in the blogosphere. The giant connected component covers 99.61% of nodes, compared to 75% in the blogosphere.
• The Tumblr network is highly similar to Twitter and Facebook, with a power-law in-degree distribution, a non-power-law out-degree distribution, positive degree assortativity for reciprocal connections, small distances between connected nodes, and a dominant giant connected component.
• Without a post length limitation, Tumblr users tend to post longer. Approximately 1/4 of text posts have authentic content beyond 140 bytes, implying a substantial portion of high-quality blog posts for other tasks like topic analysis and text mining.
• Social celebrities tend to be more active. They post and reblog more frequently, serving as both content generators and information transmitters. Moreover, frequent bloggers like to write short posts, while infrequent bloggers spend more effort writing longer posts.
• In terms of duration since registration, long-term users and recently registered users post less frequently. Yet, long-term users reblog more.
• The majority of reblog cascades are tiny in terms of both size and depth, though extreme ones are not uncommon. It is relatively easier to propagate a message wide and shallow rather than deep, which suggests priorities for influence maximization or information propagation.
• Compared with Twitter, Tumblr is more vibrant and faster in terms of reblogs and interactions. Tumblr reblogging has a strong bias toward recency: approximately 3/4 of first reblogs occur within the first hour, and 95.84% appear within one day.
This snapshot research is by no means complete. There are several directions in which to extend this work. First, some patterns described here are correlations; they do not explain the underlying mechanism. It is imperative to differentiate correlation from causality [2] so that we can better understand user behavior. Secondly, Tumblr is very popular among young users, with half of Tumblr's visitor base being under 25 years old. Why is this so? Answering this requires combining content analysis, social network analysis, and user profiles. In addition, since more than 70% of Tumblr posts are images, it is necessary to go beyond photo captions and analyze the image content together with other meta-information.

8. REFERENCES

[1] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. In Proceedings of WSDM, 2008.
[2] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social networks. In Proceedings of KDD, 2008.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of WSDM, 2011.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] S. Chang, V. Kumar, E. Gilbert, and L. Terveen. Specialization, homophily, and gender in a social curation site: Findings from Pinterest. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), 2014.
[6] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. arXiv, 706, 2007.
[7] E. Gilbert, S. Bakhshi, S. Chang, and L. Terveen. 'I need to try this!': A statistical overview of Pinterest. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2013.
[8] N. Hochman and R. Schwartz. Visualizing Instagram: Tracing cultural visual rhythms.
In Proceedings of the Workshop on Social Media Visualization (SocMedVis) in conjunction with ICWSM, 2012.
[9] J. E. Hopcroft, T. Lou, and J. Tang. Who will follow you back?: Reciprocal relationship prediction. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pages 1137–1146, 2011.
[10] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In WebKDD/SNA-KDD '07, pages 56–65, New York, NY, USA, 2007. ACM.
[11] H. Kwak, C. Lee, H. Park, and S. B. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International World Wide Web Conference (WWW), 2010.
[12] C. Lampe, N. Ellison, and C. Steinfield. A familiar face(book): Profile elements as signals in an online social network. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2007.
[13] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, pages 915–924, New York, NY, USA, 2008. ACM.
[14] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. S. Glance. Finding patterns in blog shapes and blog evolution. In Proceedings of ICWSM, 2007.
[15] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC '07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 29–42, New York, NY, USA, 2007. ACM.
[16] S. Mittal, N. Gupta, P. Dewan, and P. Kumaraguru. The pin-bang theory: Discovering the Pinterest world. arXiv preprint arXiv:1307.4952, 2013.
[17] B. Nardi, D. J. Schiano, S. Gumbrecht, and L. Swartz. Why we blog. Commun. ACM, 47(12):41–46, 2004.
[18] M. E. J. Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.
[19] M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.
[20] R. Ottoni, J. P.
Pesce, D. Las Casas, G. Franciscani, P. Kumaraguru, and V. Almeida. Ladies first: Analyzing gender roles and behaviors in Pinterest. In Proceedings of ICWSM, 2013.
[21] X. Shi, B. Tseng, and L. A. Adamic. Looking at the blogosphere topology through different lenses. In Proceedings of ICWSM, 2007.
[22] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. The anatomy of the Facebook social graph. arXiv preprint arXiv:1111.4503, 2011.
[23] J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: finding topic-sensitive influential twitterers. In Proceedings of WSDM, pages 1137–1146, 2010.
[24] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of WWW, 2011.

Change Detection in Streaming Data in the Era of Big Data: Models and Issues

Dang-Hoan Tran
Vietnam Maritime University
484 Lach Tray, Ngo Quyen, Haiphong, Vietnam
[email protected]

Mohamed Medhat Gaber
School of Computing Science and Digital Media, Robert Gordon University
Riverside East, Garthdee Road, Aberdeen AB10 7GJ, UK
[email protected]

ABSTRACT

Big Data is identified by its three Vs, namely velocity, volume, and variety. The area of data stream processing has long dealt with the former two Vs, velocity and volume. Over a decade of intensive research, the community has provided many important
The third V ofmany Big important Data has research in the area.and Thethethird of Big Data data has been the discoveries result of social media largeV unstructured been the result of socialtechniques media andhave the large unstructured it generates. Streaming also been proposeddata reit generates. Streaming techniques have also been proposed recently addressing this emerging need. However, a hidden factor cently addressing this emerging need. However, a hidden factor can represent an important fourth V, that is variability or change. can important fourthand V, that is variability or change. Our represent world is an changing rapidly, accounting to variability is Our world is changing rapidly, and accounting to variability is a crucial success factor. This paper provides a survey of change adetection crucial success factor. This paper provides data. a survey change techniques as applied to streaming Theofreview is detection techniques as applied to streaming data. The review is timely with the rise of Big Data technologies, and the need to timely with the rise of Big Data technologies, and the need to have this important aspect highlighted and its techniques categohave this important aspect highlighted and its techniques categorized and detailed. rized and detailed. 1. 1. INTRODUCTION INTRODUCTION Today’s world is changing very fast. The changes occur in every Today’s world is changing very fast. The changes occur in every aspects aspects of of life. life. Therefore, Therefore, the the ability ability to to detect, detect, adapt, adapt, and and react react to the change play an important role in all aspects of to the change play an important role in all aspects of life. life. The The physical physical world world is is often often represented represented in in some some model model or or some some inforinformation mation system. system. 
The The changes changes in in the the physical physical world world are are reflected reflected in in terms of the changes in data or model built from data. terms of the changes in data or model built from data. Therefore, Therefore, the the nature nature of of data data is is changing. changing. The advance of technology The advance of technology results results in in the the data data deluge. deluge. The The data data volume volume is is increasing increasing with with an an estimated estimated rate rate of of 50% 50% per per year year [39]. [39]. Data Data flood flood makes makes traditional traditional methods methods including including traditional traditional disdistributed framework and parallel models inappropriate tributed framework and parallel models inappropriate for for proprocessing, cessing, analyzing, analyzing, storing, storing, and and understanding understanding these these massive massive data data sets. sets. Data Data deluge deluge needs needs aa new new generation generation of of computing computing tools tools that that th Jim paradigm in in scientific scientific computing computing [25]. [25]. ReReJim Gray Gray calls calls the the 4 4th paradigm cently, there have been some emerging computing paradigms that meet the requirements of Big Data as follows. Parallel batch processing model only deals with the stationary massive data [17]. However, evolving data continuously arrives with high speed. In fact, online data stream processing is the main approach to dealing with the problem of three characteristics of Big Data including big volume, big velocity, and big variety. Streaming data processing is a model of Big Data processing. Streaming data is temporal data in nature. In addition to the temporal nature, streaming data may include spatial characteristics. For example, geographic information systems can produce spatial-temporal data stream. 
Streaming data processing and mining have been deploying in SIGKDD Explorations Explorations SIGKDD Kai-Uwe Sattler Ilmenau University Kai-Uwe Sattlerof Ilmenau University of Technology POTechnology Box 100565 PO Box Germany 100565 Ilmenau, Ilmenau, Germany [email protected] [email protected] real-world systems such as InforSphere Streams (IBM)1 , Rapid1 , Rapid5 . In real-world systems such2 ,asStreamBase InforSphere3 ,Streams MOA 4(IBM) , AnduIN miner Streams Plugin 2 3 4 5 . In , StreamBase MOA a, AnduIN miner Streams Plugin order to deal with the high-speed data ,streams, hybrid model order to deal with the high-speed data streams, a hybrid model that combines the advantages of both parallel batch processing that combines the advantages of both model parallelis batch processing model and streaming data processing proposed. Some 7 , and Grok model data processing is proposed. Some8 projectsand for streaming such hybrid model include model S4 66 , Storm 7 , and Grok 8 , Storm projects for such hybrid model include S4 . .One of these challenges facing data stream processing and minOne these challenges data stream ing isofthe changing naturefacing of streaming data.processing Therefore,and the minabiling is the changing nature of streaming data. Therefore, the ability to identify trends, patterns, and changes in the underlying proity to identify trends, patterns, and changes in the underlying processes generating data contributes to the success of processing cesses generating data contributes to the success of processing and mining massive high-speed data streams. and mining massive high-speed data streams. A model of continuous distributed monitoring has been recently A model of continuous distributed monitoring has been recently proposed to deal with streaming data coming from multiple proposed to deal with streaming data coming from multiple sources. This model has many observers where each observer sources. 
This model has many observers where each observer monitors a single data stream. The goal of continuous distributed monitors a single data stream. The goal of continuous distributed monitoring is to perform some tasks that need to aggregate the inmonitoring is to perform some tasks that need to aggregate the incoming data from the observers. The continuous distributed moncoming data from the observers. The continuous distributed monitoring is applied to monitor networks such as sensor networks, itoring is applied to monitor networks such as sensor networks, social networks, networks of ISP [11]. social networks, networks of ISP [11]. Change Change detection detection is is the the process process of of identifying identifying differences differences in in the the state state of of an an object object or or phenomenon phenomenon by by observing observing it it at at different different times times or or different different locations locations in in space. space. In In the the streaming streaming context, context, change detection is the process of segmenting change detection is the process of segmenting aa data data stream stream into into different different segments segments by by identifying identifying the the points points where where the the stream stream dynamics dynamics change change [53]. [53]. A A change change detection detection method method consists consists of of the the following following tasks: tasks: change change detection detection and and localization localization of of change. change. Change Change detection detection identifies identifies whether whether aa change change occurs, occurs, and and reresponds sponds to to the the presence presence of of such such change. change. Besides Besides change change detecdetection, tion, localization localization of of changes changes determines determines the the location location of of change. change. 
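The observer/coordinator structure of the continuous distributed monitoring model described above can be sketched as follows. This is a minimal illustration assuming a running sum as the local aggregate; the class and function names are our own, and real monitoring protocols additionally trade communication cost against accuracy.

```python
class Observer:
    """Monitors a single data stream and maintains a local aggregate
    (here a running sum; in general, any local statistic)."""

    def __init__(self):
        self.local_sum = 0.0

    def observe(self, x):
        self.local_sum += x


def coordinator_total(observers):
    """Aggregate the incoming data from all observers, as required by
    the monitoring tasks described in the text."""
    return sum(obs.local_sum for obs in observers)


# Example: three observers, each watching its own stream.
streams = [[1, 2], [3], [4, 5, 6]]
observers = [Observer() for _ in streams]
for obs, s in zip(observers, streams):
    for x in s:
        obs.observe(x)
assert coordinator_total(observers) == 21
```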
The problem of locating the change has been studied in statistics as the problem of change point detection.

This paper presents the background issues and notation relevant to the problem of change detection in data streams.

2. CHANGE DETECTION IN STREAMING DATA

^1 http://www-01.ibm.com/software/data/infosphere/streams/
^2 http://www-ai.cs.uni-dortmund.de/auto?self=$eit184kc
^3 http://www.streambase.com/
^4 http://moa.cs.waikato.ac.nz/
^5 http://www.tu-ilmenau.de/dbis/research/anduin/
^6 http://incubator.apache.org/s4/
^7 https://github.com/nathanmarz/storm/wiki/Tutorial
^8 https://www.numenta.com/grok_info.html

The streaming computational model is considered one of the widely used models for processing and analyzing massive data. Streaming data processing helps the decision-making process in real time. A data stream is defined as follows.

DEFINITION 1. A data stream is an infinite sequence of elements

    S = (X_1, T_1), ..., (X_j, T_j), ...        (1)

Each element is a pair (X_j, T_j), where X_j is a d-dimensional vector X_j = (x_1, x_2, ..., x_d) arriving at the time stamp T_j. Time stamps are defined over a discrete domain with a total order. There are two types of time stamps: an explicit time stamp is generated when the data arrive; an implicit time stamp is assigned by some data stream processing system.

Streaming data has the following fundamental characteristics. First, data arrive continuously. Second, streaming data evolve over time. Third, streaming data are noisy and may be corrupted. Fourth, timely processing is important. From these characteristics and the data stream model, data stream processing and mining pose the following challenges. First, as streaming data arrive rapidly, the techniques for processing and analyzing streaming data must keep up with the data rate, both to prevent the loss of important information and to avoid data redundancy. Second, as the speed of streaming data is very high, the data volume can exceed the processing capacity of existing systems. Third, since the value of data decreases over time, the recent streaming data are sufficient for many applications. Therefore, one can only capture and process the data as soon as they are generated.
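Definition 1 and the explicit/implicit time-stamp distinction can be illustrated with a small sketch. The function name and the use of random vectors are illustrative assumptions, not part of the definition; the implicit time stamp is modeled as a counter assigned by the processing system.

```python
import itertools
import random


def data_stream(dim=3, explicit_ts=None):
    """Yield elements of a data stream S = (X_1, T_1), (X_2, T_2), ...
    as in Definition 1: each element is a pair (X_j, T_j), where X_j is
    a d-dimensional vector and T_j a time stamp from a discrete,
    totally ordered domain. If `explicit_ts` is supplied, it generates
    explicit time stamps at the source; otherwise an implicit time
    stamp (a counter assigned by the processing system) is used."""
    for j in itertools.count(1):
        x = tuple(random.random() for _ in range(dim))
        t = explicit_ts() if explicit_ts is not None else j
        yield x, t


# The stream is conceptually infinite; consume only a finite prefix.
prefix = list(itertools.islice(data_stream(dim=2), 5))
```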
2.1 Change Detection: Definitions and Notation

This section presents concepts and classifications of changes and change detection methods. To develop a change detection method, we should first understand what a change is.

DEFINITION 2. Change is defined as the difference in the state of an object or phenomenon over time and/or space [52; 1].

In the systems view, change is the process of transition from one state of a system to another. In other words, a change can be defined as the difference between an earlier state and a later state. An important distinction between change and difference is that a change refers to a transition in the state of an object or phenomenon over time, while a difference means the dissimilarity in the characteristics of two objects. A change can reflect a short-term trend or a long-term trend. For example, a stock analyst may be interested in the short-term change of a stock price.

Change detection is defined as the process of identifying differences in the state of an object or phenomenon by observing it at different times [54]. In the above definition, a change is detected on the basis of differences of an object at different times, without considering the differences of an object at different locations in space. In many real-world applications, however, changes can occur in terms of both time and space. For example, multiple spatio-temporal data streams representing the triple (latitude, longitude, time) are created in traffic information systems using GPS [23]. Hence, change detection can be defined as follows.

DEFINITION 3. Change detection is the process of identifying differences in the state of an object or phenomenon by observing it at different times and/or different locations in space.

A distinction between concept drift detection and change detection is that concept drift detection focuses on labeled data, while change detection can deal with both labeled and unlabeled data. Change analysis both detects and explains the change; Hido et al. [26] proposed a method for change analysis using supervised learning.

DEFINITION 4. Change point detection is identifying the time points at which the properties of time series data change [32].

Depending on the specific application, change detection may be referred to by different terms such as burst detection, outlier detection, or anomaly detection. Burst detection is a special kind of change detection: a burst is a period on a stream with an aggregated sum exceeding a threshold [31]. Outlier detection is a special kind of change detection, and anomaly detection can be seen as a special type of change detection in streaming data.

To find a solution to the problem of change detection, we should consider the aspects of change of the system in which we want to detect it. As shown in [52], the aspects of change that must be considered include the subject of change, the type of change, the cause of change, the effect of change, the response to change, temporal issues, and spatial issues. In particular, to design an algorithm for detecting changes in sensor streaming data, the major questions we need to answer include: What is the system in which the changes need to be detected? What are the principles used to model the problem? What is the data type? What are the constraints of the problem? What is the physical subject of change? What is the meaning of the change to the user? How should one respond and react to this change? How should this change be visualized?

A change detection method can fall into one of two types: batch change detection and sequential change detection. Given a sequence of N observations x_1, ..., x_N, where N is fixed, the task of a batch change detection method is to decide whether a change occurs at some point in the sequence by using all N available observations. When the arrival speed of the data is too high, batch change detection is applied over windows; in other words, a change detection method using a model of two adjacent windows will be used. However, the drawback of batch change detection methods is that their running time becomes very large when detecting changes in a large amount of data. In contrast, the sequential change detection problem is based on the observations seen so far. If no change is detected, the next observation is processed; whenever a change is detected, the change detector is reset.

Change detection methods can be classified into the following approaches: threshold-based change detection methods, state-based change detection methods, and trend-based change detection methods. A change detection algorithm should meet three main requirements [37]: accuracy, promptness, and online operation. The algorithm should detect as many actual change points as possible and generate as few false alarms as possible, and it should detect a change point as early as possible.
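The sequential scheme just described, in which observations are processed one at a time and the detector is reset after every detection, can be illustrated with a minimal CUSUM-style detector. CUSUM is one classical instance of sequential change detection; the drift and threshold values used here are illustrative assumptions, not taken from this survey.

```python
class SequentialChangeDetector:
    """A minimal CUSUM-style sequential detector. It accumulates
    deviations of the observations from a reference mean; when the
    cumulative sum exceeds a threshold, a change is signalled and the
    detector is reset."""

    def __init__(self, reference_mean, drift=0.25, threshold=5.0):
        self.mean = reference_mean
        self.drift = drift          # tolerated deviation per observation
        self.threshold = threshold  # detection threshold
        self.cusum = 0.0

    def update(self, x):
        """Process the next observation; return True iff a change is detected."""
        self.cusum = max(0.0, self.cusum + (x - self.mean - self.drift))
        if self.cusum > self.threshold:
            self.cusum = 0.0  # reset the detector after each detection
            return True
        return False


# A mean shift from 0 to 1 at position 20 triggers alarms shortly after.
detector = SequentialChangeDetector(reference_mean=0.0)
data = [0.0] * 20 + [1.0] * 20
alarms = [i for i, x in enumerate(data) if detector.update(x)]
```

Note the promptness/accuracy trade-off mentioned above: a lower threshold detects the change earlier but generates more false alarms on noisy data.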
The algorithm shouldthebetimeeffievolving trends,for and time-evolving patterns. Research issues on cient sufficient a real time environment. mining in in data streams include and representaChangechanges detection data stream allowsmodeling us to identify the timetion of changes, mining method, and interactive evolving trends, change-adaptive and time-evolving patterns. Research issues on exploration of changes Change detection plays important mining changes in data [19]. streams include modeling andanrepresentarole in changes, the field change-adaptive of data stream analysis. Since change in model tion of mining method, and interactive may conveyofinteresting time-dependent information knowlexploration changes [19]. Change detection plays anand important role inthe thechange field of stream analysis. change in model edge, of data the data stream can beSince used for understanding maynature convey time-dependent information andresearch knowlthe of interesting several applications. Basically, interesting edge, the change of thechanges data stream can streams be used for problems on mining in data canunderstanding be classified the several applications. Basically, interesting research intonature three of categories: modeling and representation of changes, problems on mining changes in exploration data streams be classified mining methods, and interactive of can changes. Change into threealgorithm categories: representationinof changes, detection canmodeling be used asand a sub-procedure many other mining methods, and interactiveinexploration ofwith changes. Change data stream mining algorithms order to deal the changing detection algorithm as a sub-procedure many other data in data streamscan [28;be4].used A definition of changeindetection for data streamdata mining algorithms in order to deal with the changing streaming is given as follows data in data streams [28; 4]. 
A definition of change detection for Change detection is the process of segmentD EFINITION streaming data is5. given as follows D EFINITION 5. Change detection is the process of segment- Volume 16, Issue 1 Volume 16, Issue 1 Page 31 Page 31 the changes of the generated model Incoming Data stream the changes of Data Stream the generated model Processing/Mining Incoming Data stream Data Stream the changes of Processing/Mining data generating process Model Model the changes of Figure 1: general process diagram for detecting changes in data stream dataAgenerating ing a data into different identifying thestream points Figure 1: Astream general diagram forsegments detectingbychanges in data where the stream dynamics changes [53]. • • As streams overtime in nature, there is growing eming data a data streamevolve into different segments by identifying the points phasis on detecting changeschanges not only[53]. in the underlying data diswhere the stream dynamics tribution, but also in the models generated by data stream process As data is growing emand datastreams stream evolve mining.overtime As can in benature, seen inthere Figure 1, a change phasis on detecting changes notoronly the underlying discan occur in the data stream, theinstreaming model.data Theretribution, in types the models stream process fore, therebut arealso two of thegenerated problemsbyofdata change detection: and datadetection stream mining. Asgenerating can be seen in Figure 1, a change change in the data process and change deteccan occur in the data stream, or the streaming model. Theretion in the model generated by a data stream processing, or minfore, there are two types ofofthe problems of change detection: ing. The fundamental issues detecting changes in data streams change in the data process and change detecincludesdetection characterizing andgenerating quantifying of changes and detecttion in the model generated by a data stream processing, or mining changes. 
A change detection method in streaming data needs ing. The fundamental issues of detecting changes in data streams a trade-off among space-efficiency, detection performance, and includes characterizing and quantifying of changes and detecttime-efficiency. ing changes. A change detection method in streaming data needs a trade-off among Detection space-efficiency, detection in performance, and 2.2 Change Methods Streaming time-efficiency. Data Over last 50 years, change detection has been widely studied 2.2 theChange Detection Methods in Streaming and applied in both academic research and industry. For examData ple, it has been studied for a long time in the following fields: Over the last 50 years, change detection has been widely studied statistics, signal processing, and control theory. In recent years and applied in both academic research and industry. For exammany detection have been for streample, it change has been studiedmethods for a long time in proposed the following fields: ing data. The approaches to detecting changes in data streamyears can statistics, signal processing, and control theory. In recent be classified follows. methods have been proposed for streammany changeasdetection ing •data. Thestream approaches changes in data Data model:toAdetecting data stream can fall intostream one of can the be classified as follows. following models: time series model, cash register model, model [41]. Onstream the basis of the • and Dataturnstile stream model: A data can fall intodata onestream of the model, there are change algorithms developed for following models: time detection series model, cash register model, the corresponding data stream model. Krishnamurthy et al and turnstile model [41]. On the basis of the data stream presented a sketch-based change detection for the model, there are change detection algorithmsmethod developed for most general streaming model model. Turnstile model [35]. 
et al the corresponding data stream Krishnamurthy presented a sketch-based change detection method for the • Data characteristics: Change detection methods can be most general streaming model Turnstile model [35]. classified on the basis of the data characteristics of streamsuch as data dimensionality, datamethods label, and • ing Datadata characteristics: Change detection candata be type. A data coming from thecharacteristics data stream can be uniclassified on item the basis of the data of streamvariate multi-dimensional. It would data be great if we ing dataorsuch as data dimensionality, label, andcould data develop a general algorithm able detect changes type. A data item coming from thetodata stream can in be both uniunivariate and multidimensional data streams. devariate or multi-dimensional. It would be great Change if we could tection in streaming multivariate data have developalgorithms a general algorithm able to detect changes in been both presented 34; 36]. Data streams be classified univariate [14; and multidimensional datacan streams. Changeinto decategorial data stream and numerical data stream. Webeen can tection algorithms in streaming multivariate data have presentedthe [14; 34; 36]. Data streams canfor becategorial classified data into develop change detection algorithm categorial data stream and numerical stream. We can stream or numerical data stream. In realdata world applications, develop theitem change detection algorithm for categorial each data in data stream may include multipledata atstream orofnumerical data stream. In real world tributes both numerical and categorial data. applications, In such situeach data item data stream may include multiple atations, these datainstreams can be projected by each attribute tributes numerical and categorial In such or groupofofboth attributes. Change detectiondata. 
methods cansitube ations, these streams can be projected by streams each attribute applied to thedata corresponding projected data afteror group of streams attributes. detection methods be wards. Data areChange classified into labeled data can stream applied to the corresponding and unlabeled data streams. Aprojected labeled data data streams stream isafterone wards. Data streams are classified into labeled data stream and unlabeled data streams. A labeled data stream is one SIGKDD Explorations SIGKDD Explorations • • • • whose individual example is associated with a given class label, otherwise, it is unlabeled data stream. A change detection algorithm that identifies changes in the labeled data stream is supervised change detection [34; whileclass one whose individual example is associated with5], a given detecting changesit in the unlabeled stream is called label, otherwise, is unlabeled data data stream. A change deunsupervised change algorithm The advantection algorithm that detection identifies changes in [7]. the labeled data tage of is thesupervised supervisedchange approach is that the stream detection [34;detection 5], whileaccuone racy is high. However, theunlabeled ground truth mustisbecalled gendetecting changes in the datadata stream erated. Thus achange unsupervised change detection is unsupervised detection algorithm [7]. approach The advantage of thetosupervised approach detection preferred the supervised one isinthat casethethe ground accutruth racy is unavailable. high. However, the ground truth data must be gendata erated. Thus a unsupervised change detection approach is Completeness statistical information: Onground the basis of preferred to theofsupervised one in case the truth the datacompleteness is unavailable.of statistical information, a change detection algorithm can fall into one of three following catCompleteness of statistical On thearebasis of egories. 
Parametric change information: detection schemes based the knowing completeness of statistical information, a change deon the full prior information before and after tection algorithm can in fall one of three following catchange. For example, theinto distributional change detection egories. change detection schemes are based methods,Parametric the data distributions before and after change are on knowing theAfull prior introduced information beforeto and after known [41; 42]. recently method detecting change. example, the distributional changemethod detection changes For in order stockin streams is a parametric in methods, data distributions andorders after change which thethe distribution of streambefore of stock confideare to known [41; 42]. A recently introduced method of to detecting the Poisson distribution [37]. The advantage parametchanges in order stock streams is a parametric ric change detection approaches is that they can method produceina which distribution stream of stock orders to higher the accurate result of than semi-parametric andconfide nonparathe Poisson distribution [37]. The advantage of parametmetric methods. However, in many real-time applications, ric change detection approaches is that they can produce data may not confine to any standard distribution, thusa higher accurate result than semi-parametric and nonparaparametric approaches are inapplicable. Semi-parametric metric methods. However, in many real-time methods are based on the assumption that theapplications, distribution data may not confine thus of observations belongstoto any somestandard class of distribution, distribution funcparametric approaches are inapplicable. Semi-parametric tion, and parameters of the distribution function change methods are based on the assumption that the distribution in disorder moments. 
Recently, Kuncheva [36] has proof observations belongs to some class of distribution funcposed a semi-parametric method using a semi-parametric tion, and parameters of the distribution function change log-likelihood for testing a change. Nonparametric methin disorder moments. Recently, Kuncheva [36] has proods make no distribution assumptions on the data. Nonposed a semi-parametric method using a semi-parametric parametric methods for detecting changes in the underlylog-likelihood for testing a change. Nonparametric mething data distribution includes Wilcoxon, kernel method, ods make no distribution assumptions on the data. NonKullback-Leiber distance, and Kolmogorov-Smirnov test. parametric methods for detecting changes in the underlyNonparametric methods can be classified into two cateing data distribution includes Wilcoxon, kernel method, gories: nonparametric methods using window [33]; nonKullback-Leiber distance, and Kolmogorov-Smirnov test. parametric methods without using window into [27].two We catehave Nonparametric methods can be classified paid particular attentionmethods to the nonparametric degories: nonparametric using window change [33]; nontection methods usingwithout windowusing because in many parametric methods window [27].real-world We have applications, theattention distributions both null hypothesis paid particular to theofnonparametric change and dealternative hypothesis are unknown ininadvance. Furthertection methods using window because many real-world more, we arethe only interested of in both recent data. A common applications, distributions null hypothesis and approach identifyingaretheunknown change is comparing two alternativetohypothesis in to advance. Furthersamples order to interested find out theindifference between them, more, weinare only recent data. A common which is called two-sample change detection, or windowapproach to identifying the change is to comparing two based change detection. 
stream is infinite, slidsamples in order to find As out data the difference betweenathem, ing window is often used to detect Window based which is called two-sample changechanges. detection, or windowchange detection incurs the [37]. based change detection. Ashigh datadelay stream is Window-based infinite, a slidchange detection scheme the dissimilarity meaing window is often used is to based detecton changes. Window based sure between twoincurs distributions synopses extracted from change detection the highordelay [37]. Window-based the reference window andisthe current window. change detection scheme based on the dissimilarity measure between two distributions or synopses extracted from Velocity of data change: Aggarwal proposes a framework the reference window and the current window. that can deal with the changes in both spatial velocity profile and temporal velocityAggarwal profile [1;proposes 2]. In this approach, Velocity of data change: a framework thatchanges can dealin with changes in both spatial the datathedensity occurring at eachvelocity locationproare file and temporal velocityvelocity profile [1; 2]. In in thissome approach, estimated by estimating density userthe changes in data densityAn occurring at advantage each location are defined temporal window. important of this estimated estimating velocity density This in some userapproach isbythat it visualizes the changes. visualizadefined temporal helps window. important the advantage this tion of changes userAn understand changesofintuapproach itively. is that it visualizes the changes. This visualization of changes helps user understand the changes intuSpeed itively. 
of response: If a change detection method needs to react to the detected changes as fast as possible, the Speed of response: If a change detection method needs to react to the detected changes as fast as possible, the Volume 16, Issue 1 Volume 16, Issue 1 Page 32 Page 32 quickest quickest detection detection of of change change should should be be proposed. proposed. Quickest Quickest change change detection detection can can help help aa system system make make aa timely timely alarm. alarm. Timely Timely alarm alarm warning warning is is benefit benefit for for economical. economical. In In some some cases, cases, it it may may save save the the human human life life such such as as in in fire-fighting fire-fighting system. system. Change Change detection detection methods methods using using two two overlapping overlapping windows windows can can quickly quickly react react to to the the changes changes in in streaming streaming data data while while methods methods using using adjacent adjacent windows windows model model may may incur incur the the high high delay. delay. As As change change can can be be abrupt abrupt change change or or gradual gradual change, change, there there exists exists the the abrupt abrupt change change detection detection algorithm algorithm and gradual change detection algorithm [46; 40]. and gradual change detection algorithm [46; 40]. • Decision making methodology: Based on the decision • Decision making methodology: Based on the decision making methodology, a change detection method can fall making methodology, a change detection method can fall into one of the following categories: rank-based method into one of the following categories: rank-based method [33], density-based method [55], information-theoretic [33], density-based method [55], information-theoretic method [15]. A change detection problem can be also clasmethod [15]. 
A change detection problem can be also classified into batch change detection and sequential change sified into Based batch on change detection change detection. detection delayand that sequential a change detector detection. Based on detection delay that a can change detector suffers from, a change detection methods fall into one suffers from, a change can fall into of two following types: detection real-time methods change detection, and one retof two following real-time change retrospective changetypes: detection. Based on thedetection, spatial orand temporospective change detection. Based on the spatial or temporal characteristics of data, change detection algorithm can ral characteristics of data, algorithmtemcan fall into one of three kinds:change spatialdetection change detection; fall one ofdetection; three kinds: spatial change detection; temporalinto change or spatio-temporal change detecporal[6]. change detection; or spatio-temporal change detection tion [6]. • Application: On the basis of applications that generate data • streams, Application: the basis of applications data data On streams can be classified as that into generate transactional streams, data streams classified as into data transactional data stream, sensor can databestream, network stream, data stream, sensor data stream, data network data stream, stock order data stream, astronomy stream, video data stock order stream, data stream,there video stream, etc. data Based on theastronomy specific applications, aredata the stream, etc. Based methods on the specific thereapplicaare the change detection for theapplications, corresponding change detection methods for methods the corresponding tions such as change detection for sensor applicastreamtionsdata such[56], as change streaming changedetection detectionmethods methodsfor forsensor transactional ing data [56], methodsvan for Leeuwen transactional streaming datachange [45; 57;detection 8]. For example, and streaming [45; 57; 8]. 
For example, van Leeuwen Siebes [57]data have presented a change detection methodand for Siebes [57] have presented change method for transactional streaming dataabased on detection the principle of Mintransactional streaming data based on the principle of Minimum Description Length. imum Description Length. • Stream processing methodology: Based on methodology processing data stream, a data stream canmethodology be classified • for Stream processing methodology: Based on into online data stream anda off-line datacan stream [38]. In for processing data stream, data stream be classified some work, data an online called a live[38]. stream into online streamdata andstream off-lineis data stream In while off-line data stream is called some an work, an online data stream is archived called a data live stream [18]. datadata stream needs to be processed online bewhileOnline an off-line stream is called archived data stream cause of its high data streams include [18]. Online data speed. stream Such needsonline to be processed online bestreams of network measurements, cause ofofitsstock highticker, speed.streams Such online data streams include and streams of sensor data,etc.ofOff-line is a sestreams of stock ticker, streams networkstream measurements, quence of updates to warehouses or backup devices. and streams of sensor data,etc. Off-line stream is a The sequeries over the off-line streams orcan be processed offquence of updates to warehouses backup devices. The line. However, as off-line it is insufficient off-line queries over the streams time can to be process processed offstreams, techniques summarizing are necessary. line. However, as it isforinsufficient timedata to process off-line In off-linetechniques change detection method, the entire set is streams, for summarizing data are data necessary. available the analysis process to detect the change. 
The In off-linefor change detection method, the entire data set is online method theprocess changetoincrementally basedThe on available for thedetects analysis detect the change. the recently incoming item. An important distinction online method detects data the change incrementally based on between off-line method and online is that distinction the online the recently incoming data item. An one important method constrained byand the online detection reaction time betweenisoff-line method oneand is that the online due to the of real-time applications whiletime the method is requirement constrained by the detection and reaction off-line is free from the detection time, and reaction time. due to the requirement of real-time applications while the Methods for detecting changes can be useful for streamoff-line is free from the detection time, and reaction time. ing data warehouses where both live streams of data and Methods for detecting changes can be useful for streamarchived data streams are available [24; 29]. In this work, ing data warehouses where both live streams of data and we focus on developing the methods for detecting changes archived data streams are available [24; 29]. In this work, in online data streams, in particular, sensor data streams. we focus on developing the methods for detecting changes in work onlineondata streams, inchange particular, sensorproposed data streams. The first model-based detection by [21; 22] is FOCUS. The central idea behind FOCUS is that the models The first work on model-based change detection proposed by [21; 22] is FOCUS. The central idea behind FOCUS is that the models SIGKDD Explorations SIGKDD Explorations can can be be divided divided into into structural structural and and measurement measurement components. components. To To detect detect deviation deviation between between two two models, models, they they compare compare specific specific parts parts of of these these corresponding corresponding models. models. 
The first work on model-based change detection, proposed by [21; 22], is FOCUS. The central idea behind FOCUS is that models can be divided into structural and measurement components. To detect deviation between two models, specific parts of the corresponding models are compared. The models obtained by data mining algorithms include frequent item sets, decision trees, and clusters. A change in a model may convey interesting information or knowledge about an event or phenomenon. Model change is defined in terms of the difference between the two sets of parameters of two models and the quantitative characteristics of the two models. As such, model change detection is finding the difference between the two sets of parameters of two models and the quantitative characteristics of these two models. We should distinguish between detection of changes in the data distribution by using models and detection of changes in a model built from streaming data. While model change detection aims to identify the difference between two models, change detection in the underlying data distribution by using models infers the changes in two data sets from the difference between the two models constructed from those data sets. Changes in the underlying data distribution can induce corresponding changes in the model produced from the data generating process.

As models can be generated by statistical methods or data mining methods, change detection in models can be classified into change detection in data mining models and in statistical models. Two kinds of models we are interested in detecting changes in are predictive models and explanatory models. A predictive model is used to predict changes in the future; detecting changes in a pattern can be beneficial for many applications. In an explanatory model, a change that occurred is both detected and explained. There are several approaches to change detection: the one-model approach, the two-model approach, and the multiple-model approach.

A model-based change detection algorithm consists of two phases: model construction and change detection. First, a model is built by using some stream mining method, such as a decision tree, clustering, or frequent patterns. Second, a difference measure between two models is computed based on the characteristics of the model; this step is also called the quantification of model difference. Therefore, one fundamental issue here is to quantify the changes between two models and to determine criteria for deciding whether and when a change in the model occurs. Recently, some change detection methods for streaming data based on clustering have been proposed [10; 3]. Based on the data stream mining model, we may have the corresponding problems of detecting changes in models, as follows. Ikonomovska et al. [30] have presented an algorithm for learning regression trees from streaming data in the presence of concept drifts. Their change detection method is based on sequential statistical tests that monitor the changes of the local error at each node of the tree and inform the learning process of the local changes.
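The general idea of monitoring a model's error stream with a sequential statistical test can be sketched with the classical Page-Hinkley test below. This is a hedged illustration of the principle, not the exact method of [30]; the parameters delta and lam, and the example error stream, are illustrative assumptions.

```python
# Sketch of sequential change detection on a model's prediction-error
# stream: the Page-Hinkley test accumulates deviations of the errors
# from their running mean and signals when the mean increases.
# The parameters delta (tolerated drift) and lam (alarm threshold)
# are illustrative, not values from the text.

class PageHinkley:
    def __init__(self, delta=0.005, lam=5.0):
        self.delta = delta    # magnitude of change tolerated without alarm
        self.lam = lam        # detection threshold
        self.mean = 0.0       # running mean of the errors
        self.n = 0
        self.cum = 0.0        # cumulative deviation m_t
        self.cum_min = 0.0    # minimum M_t observed so far

    def update(self, error):
        """Feed one prediction error; return True if a change is signaled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cum += error - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam

# Example: the model's error jumps from ~0.1 to ~1.0 at step 300.
ph = PageHinkley()
errors = [0.1] * 300 + [1.0] * 100
alarms = [t for t, e in enumerate(errors) if ph.update(e)]
```

Because the test is sequential, it fits the real-time setting described above: each error is processed once, in constant time and memory, and the alarm is raised only a few observations after the shift.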
changes of stream model has beena received intracking evolution of et clusters over sliding windows by using creasing the attention. Zhou al. [59] have presented a method for temporal cluster features and the over exponential histogram, tracking the evolution of clusters sliding windows bywhich using called exponential histogram cluster features.histogram, Chen and Liu [9] temporal cluster features andofthe exponential which have a framework detecting the changes in clustercalledpresented exponential histogram for of cluster features. Chen and Liu [9] ing from datachanges streamsinby using havestructures presentedconstructed a framework for categorial detecting the clusterhierarchial entropy trees to capture the entropy characteristics of ing structures constructed from categorial data streams by using clusters, andentropy then detecting in clustering structures based hierarchial trees to changes capture the entropy characteristics of on these and entropy clusters, thencharacteristics. detecting changes in clustering structures based Based onentropy the data stream mining model, we may have the coron these characteristics. responding problems of detecting changes in model as follows Based on the data stream mining model, we may have the cor[14]. Recently Ng and Dash [44] have introduced an algorithm responding problems of detecting changes in model as follows for mining frequent patterns from evolving data streams. Their [14]. Recently Ng and Dash [44] have introduced an algorithm algorithm is capable of updating the frequent patterns based on for mining frequent patterns from evolving data streams. Their the algorithms for detecting changes in the underlying data disalgorithm is capable of updating the frequent patterns based on tributions. Two windows are used for change detection: the refthe algorithms changes in the data diserence windowfor anddetecting the current window. At underlying the initial stage, the tributions. 
Two windows are used for change detection: the reference window and the current window. At the initial stage, the Volume 16, Issue 1 Volume 16, Issue 1 Page 33 Page 33 reference is initialized with the first batch of transactions from reference is initialized the first batch data stream. The currentwith window moves onof thetransactions data streamfrom and captures the The next current batch ofwindow transactions. frequent item sets data stream. moves Two on the data stream and are constructed two windows by using the captures the nextfrom batch ofcorresponding transactions. Two frequent item sets Apriori algorithm. A statistical test is performed on by twousing absolute are constructed from two corresponding windows the support values thatAare computed Apriorion from referApriori algorithm. statistical testby is the performed twothe absolute ence window window.byBased on the from statistical test, support valuesand thatcurrent are computed the Apriori the referthe significant or insignificant. the deviation encedeviation window can and be current window. Based on theIfstatistical test, is a change in theordata stream is reported. Chang thesignificant deviation then can be significant insignificant. If the deviation and Lee [8] have method for monitoring the Chang recent is significant then presented a change ina the data stream is reported. change item setsa from dataforstream by using and Leeof[8]frequent have presented method monitoring the sliding recent window. change of frequent item sets from data stream by using sliding window. 2.3 Design Methodology 2.3 Design Methodology There are two design methodologies for developing the change detection streaming data. first methodology There are algorithms two design in methodologies for The developing the change is to adaptalgorithms the existing detection methods streaming detection in change streaming data. The first for methodology data. 
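The two-window scheme for frequent patterns can be sketched as follows. For brevity, the sketch counts only single-item supports (a stand-in for the Apriori-mined item sets of [44]) and uses a standard two-proportion z-test as a stand-in for their statistical test; the windows, items, and 1.96 cutoff (5% level) are illustrative assumptions.

```python
# Sketch of two-window change detection over frequent patterns:
# supports counted in a reference window and a current window are
# compared with a two-proportion z-test. Only single-item supports
# are counted here, as a simplified stand-in for mined item sets.

from collections import Counter
from math import sqrt

def item_supports(window):
    """Fraction of transactions in the window containing each item."""
    counts = Counter(item for txn in window for item in set(txn))
    return {item: c / len(window) for item, c in counts.items()}

def changed_items(reference, current, z_cut=1.96):
    """Items whose support differs significantly between the windows."""
    ref_s, cur_s = item_supports(reference), item_supports(current)
    n, m = len(reference), len(current)
    changed = []
    for item in set(ref_s) | set(cur_s):
        p1, p2 = ref_s.get(item, 0.0), cur_s.get(item, 0.0)
        p = (p1 * n + p2 * m) / (n + m)          # pooled proportion
        se = sqrt(p * (1 - p) * (1 / n + 1 / m))  # standard error
        if se > 0 and abs(p1 - p2) / se > z_cut:
            changed.append(item)
    return changed

# Example: item "b" disappears and "c" becomes more frequent in the
# current window, while the support of "a" stays the same.
reference = [("a", "b"), ("a", "b", "c"), ("b",), ("a",)] * 25   # 100 txns
current   = [("a",), ("a", "c"), ("c",), ("a",)] * 25            # 100 txns
print(sorted(changed_items(reference, current)))
```

If the deviation is significant for any pattern, a change in the data stream would be reported and the reference window re-initialized, mirroring the scheme described above.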
However, many traditional change detection methods canis to adapt the existing change detection methods for streaming not extendedmany for streaming because of the methods high compudata.beHowever, traditionaldata change detection cantational complexity as some kernel-based not be extended for such streaming data because of change the highdetection compumethods, and density-based change detection methods. The sectational complexity such as some kernel-based change detection ond methodology is to develop new change detection methods for methods, and density-based change detection methods. The secstreaming data. is to develop new change detection methods for ond methodology There are two streaming data.common approaches to the problem of change detection in streaming dataapproaches distributions: distance-based change deThere are two common to the problem of change detectors and predictive model-based change detectors. change In the fortection in streaming data distributions: distance-based demer, two windows are used to extractchange two data segments the tectors and predictive model-based detectors. Infrom the fordata The change is to quantified by data usingsegments some dissimilarmer, stream. two windows are used extract two from the ity If thechange dissimilarity measure is greater a given datameasure. stream. The is quantified by using somethan dissimilarthreshold then a change is detected. Similar to distance-based ity measure. If the dissimilarity measure is greater than a given change detectors, two windows are used for detecting changes. threshold then a change is detected. Similar to distance-based Instead of comparing the dissimilarity measure between two winchange detectors, two windows are used for detecting changes. 
dows a given threshold, a change is detected by using the Insteadwith of comparing the dissimilarity measure between two winprediction error of the model built from the current window and dows with a given threshold, a change is detected by using the the predictive model from thethe reference prediction error of theconstructed model built from current window. window and the predictive model constructed from the reference window. erance. The scalability refers to the ability to extend the erance. Thenetwork scalability refers to the ability to extend the size of the without significantly reducing the performance the framework. As faults may occurthe dueperto size of theof network without significantly reducing the transmission and the As effects of noisy channels formance of the error framework. faults may occur duebeto tween local sensors and fusion center,ofa noisy distributed change the transmission error and the effects channels bedetection method should be able to tolerate these faults in tween local sensors and fusion center, a distributed change order to assure theshould function thetosystem. detection method be of able tolerate these faults in order to assure the function of the system. • Distributed change detection using the local approach is direlevant to the problemusing of multiple • rectly Distributed change detection the localhypotheses approach istestdiing and data fusion because of each local change detector rectly relevant to the problem multiple hypotheses testneeds to data perform a hypothesis test local to determine ing and fusion because each change whether detector aneeds change occurs. Therefore, the deto perform a hypothesisbesides test toconsidering determine whether tection performance of local change algorithms a change occurs. 
Therefore, besides detection considering the deincluding probabilityof oflocal detection anddetection probability of false tection performance change algorithms alarm at the node level,ofthe detection performance disincluding probability detection and probability of ofafalse tributed at the fusion center alarm at change the nodedetection level, themethod detection performance of amust disbe takenchange into account. tributed detection method at the fusion center must be taken into account. Distributed detection and data fusion have been widely studied for many decades. However, recently, Distributed detection and dataonly fusion have distributed been widelydetection studied in data has receivedonly attention. forstreaming many decades. However, recently, distributed detection in streaming data has received attention. 3.1 3.1 Distributed Detection: One-shot versus Continuous Distributed Detection: One-shot versus Distributed detection of changes can be classified into two types Continuous 3. 3. of models asdetection follows. of changes can be classified into two types Distributed of models as follows. • One-shot distributed detection of changes: Figure 2 shows models of one-shot distributed change detection. One• two One-shot distributed detection of changes: Figure 2 shows shot change of detection method means a change detector detwo models one-shot distributed change detection. Onetects and reacts to themethod detected change once detector a changedeis shot change detection means a change detected. distributed have retects and One-shot reacts to the detectedchange changedetection once a change is ceived great deal of attention forchange a long time. One-shot detected. One-shot distributed detection havedisretributed change include two models: distributed ceived great dealdetection of attention for a long time. 
One-shot disdetection with decision with decision fusion as shown in tributed change detection include two models: distributed Figure 2(a); distributed without decision fusion detection with decision detection with decision fusion as shown in as illustrated in Figure 2(b). What are the differences beFigure 2(a); distributed detection without decision fusion tween one-shot and What continuous as illustrated in detection Figure 2(b). are thedetection. differences be- can be achieved only when could amount develop of thestreaming change detecKnowledge discovery from we massive data tion frameworks that monitor streaming data created by multiple can be achieved only when we could develop the change detecsources such as sensor networks, WWWdata [13].created The objectives of tion frameworks that monitor streaming by multiple designing a distributed detection areobjectives maximizing sources such as sensor change networks, WWWscheme [13]. The of the lifetime of the network, maximizing the detection capability, designing a distributed change detection scheme are maximizing and minimizing the communication cost [58]. the lifetime of the network, maximizing the detection capability, There are two approaches to the problem of change detection in and minimizing the communication cost [58]. streaming data that is created from multiple sources. In the cenThere are two approaches to the problem of change detection in tralized approach: all remote sites send raw data to the coordistreaming data that is created from multiple sources. In the cennator. The coordinator aggregates all the raw streaming data that tralized approach: all remote sites send raw data to the coordiis received from the remote sites. Detection of changes is pernator. The coordinator aggregates all the raw streaming data that formed on the aggregated streaming data. In most cases, comis received from the remote sites. Detection of changes is permunication consumes the largest amount of energy. 
The lifetime formed on the aggregated streaming data. In most cases, comof sensors therefore drastically reduces when they communicate munication consumes the largest amount of energy. The lifetime raw measurements to a centralized server for analysis. Centralof sensors therefore drastically reduces when they communicate ized approaches suffer from the following problems: communiraw measurements to a centralized server for analysis. Centralcation constraint, power consumption, robustness, and privacy. ized approaches suffer from the following problems: communiDistributed detection of changes in streaming data addresses the cation constraint, power consumption, robustness, and privacy. challenges that come from the problem of change detection, data Distributed detectionand of changes in streaming data addresses the stream processing, the problem of distributed computing. challenges that come from thethe problem of change detection, data The challenges coming from distributed computing environstream ment areprocessing, as follows and the problem of distributed computing. The challenges coming from the distributed computing environment as follows • are Distributed change detection in streaming data is a problem of distributed computing in nature. Therefore, a distributed • Distributed change detection in streaming data isthe a problem framework for detecting changes should meet properof distributed computing in nature. Therefore, a distributed ties of distributed computing such scalability, and fault tolframework for detecting changes should meet the properties of distributed computing such scalability, and fault tol- As of the properties of distributed Computing computational systems 3.2oneLocality in Distributed is locality [43], a distributed algorithm for detecting changes in As one of the properties of distributed computational systems streaming data should meet the locality. 
A local algorithm is deis locality [43], a distributed algorithm for detecting changes in fined as one whose resource consumption is independent of the streaming data should meet the locality. A local algorithm is desystem size. The scalability of distributed stream mining algofined onebewhose resource consumption is independent the rithmsascan achieved by using the local change detectionof algosystem rithms size. The scalability of distributed stream mining algorithms can be achieved by using the local change detection algoLocal algorithms can fall into one of two categories [16]:Exact rithms local algorithms are defined as ones that produce the same results Local algorithmsalgorithm; can fall into one of twolocal categories [16]:Exact as a centralized Approximate algorithms are allocal algorithms are defined as ones that produce the same results gorithms that produce approximations of the results that centralas a centralized Approximate local algorithms alized algorithms algorithm; would produce. Two attractive properties are of logorithms that produce approximations of the results that centralcal algorithms are scalability and fault tolerance. A distributed ized algorithms would produce. Two attractive properties of local algorithms are scalability and fault tolerance. A distributed DISTRIBUTED CHANGE DETECTION IN STREAMINGCHANGE DATA DETECTION DISTRIBUTED Knowledge discovery from massive amount of streaming data IN STREAMING DATA SIGKDD Explorations SIGKDD Explorations one-shot detection and continuous detection. • tween Continuous distributed detection of changes: In this chapter, we propose two continuous distributed detection mod• Continuous distributed detection of changes: In this chapels as shown in Figure 3. 
An important distinction between ter, we propose two continuous distributed detection modcontinuous of changes and one-shot els as showndistributed in Figure 3.detection An important distinction between distributed detection of changes is that the inputs to the continuous distributed detection of changes and one-shot one-shot distributed change detection are batches of data distributed detection of changes is that the inputs to the while the inputs to the continuous distributed detection of one-shot distributed change detection are batches of data changes are the data streams in which data items continuwhile the inputs to the continuous distributed detection of ously arrive. changes are the data streams in which data items continuously detection arrive. model without fusion is a truly distributed Distributed detection model in which The decision-making process occurs at Distributed detection model without fusion is a truly distributed each sensor. detection model in which The decision-making process occurs at each 3.2 sensor. 
3.2 Locality in Distributed Computing

Figure 2: One-shot distributed change detection models. (a) Distributed detection without decision fusion; (b) distributed detection with decision fusion.

Figure 3: Continuous distributed change detection models. (a) Distributed continuous detection without decision fusion; (b) distributed continuous detection with decision fusion.

A distributed framework for mining streaming data should be robust to network partitions and node failures. The advantage of local approaches is their ability to preserve privacy [20]. A drawback of the local approach to the problem of distributed change detection is the synchronization problem. For example, the local change approach can meet the principle of localized algorithms in wireless sensor networks, in which data processing is performed at the node level as much as possible in order to reduce the amount of information sent over the network.

3.3 Distributed Detection of Changes in Streaming Data

Over the last decades, the problem of decentralized detection has received much attention. There are two directions of research on decentralized detection. The first approach focuses on aggregating measurements from multiple sensors to test a single hypothesis. The second focuses on dealing with multiple dependent testing/estimation tasks from multiple sensors [51]. Distributed change detection usually involves a set of sensors that receive observations from the environment and then transmit those observations back to a fusion center in order to reach the final detection consensus. Decentralized detection and data fusion are therefore two closely related tasks that arise in the context of sensor networks [48; 47].
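The two one-shot models of Figure 2 can be sketched as follows. This is a toy illustration under our own assumptions (a known baseline mean, a fixed threshold, and majority voting at the fusion center); none of these specifics come from the paper:

```python
import statistics

def local_decision(window, baseline_mean, threshold=1.0):
    """A sensor's local decision u_i: has its window mean drifted from the baseline?"""
    return abs(statistics.mean(window) - baseline_mean) > threshold

def decision_fusion(sensor_windows, baseline_mean):
    """With decision fusion: each sensor transmits a single bit, and the
    fusion center forms the global decision by majority vote."""
    votes = [local_decision(w, baseline_mean) for w in sensor_windows]
    return sum(votes) > len(votes) / 2

def data_fusion(sensor_windows, baseline_mean, threshold=1.0):
    """Without local decisions: (quantized) observations are shipped to the
    center, which tests the pooled data directly."""
    pooled = [x for w in sensor_windows for x in w]
    return abs(statistics.mean(pooled) - baseline_mean) > threshold

# Three sensors observe a phenomenon whose mean has shifted from 0 to about 2.
windows = [[2.1, 1.9, 2.2], [1.8, 2.0, 2.3], [2.2, 2.1, 1.7]]
print(decision_fusion(windows, 0.0), data_fusion(windows, 0.0))  # True True
```

Note the communication asymmetry: decision fusion ships one bit per sensor, while data fusion ships entire observation windows to the center.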
Two traditional approaches to decentralized change detection are data fusion and decision fusion. In data fusion, each node detects a change and sends a quantized version of its observation to a fusion center responsible for making a decision on the detected changes and further relaying information. In contrast, in decision fusion, each node performs local change detection using some local change detection algorithm, updates its decision based on the received information, and broadcasts its new decision again. This process repeats until consensus among the nodes is reached. Compared to data fusion, decision fusion can reduce the communication cost because sensors only need to transmit local decisions represented by small data structures.

Although there is a great deal of work on distributed detection and data fusion, most of it focuses on one-time change detection solutions. A one-time query is defined as a query that needs to process the data once in order to provide the answer [12]. Likewise, a one-time change detection method is one that needs to process the data only once in response to the change that occurred. In real-world applications, we need approaches capable of continuously monitoring the changes in the events occurring in the environment. Recently, work on continuous detection and monitoring of changes has started receiving attention, e.g., [49; 13; 50]. Das et al. [13] have presented a scalable distributed framework for detecting changes in astronomy data streams using local, asynchronous eigen monitoring algorithms. Palpanas et al. [49] proposed a distributed framework for outlier detection in real-time data streams. In their framework, each sensor estimates and maintains a model of its underlying distribution using kernel density estimators. However, they did not show how to reach the global detection decision.
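A per-sensor density model of the kind used by Palpanas et al. [49] can be sketched as follows. This is a loose toy version with our own Gaussian kernel, bandwidth, window size, and low-density outlier rule, not the authors' actual estimator:

```python
import math

class SensorKDE:
    """Each sensor maintains a kernel density estimate of its recent readings
    and flags a new reading that falls in a low-density region."""
    def __init__(self, bandwidth=0.5, window=100, density_floor=0.05):
        self.h, self.window, self.floor = bandwidth, window, density_floor
        self.samples = []

    def density(self, x):
        """Gaussian-kernel density estimate at x over the retained samples."""
        if not self.samples:
            return 0.0
        phi = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return sum(phi((x - s) / self.h) for s in self.samples) / (len(self.samples) * self.h)

    def observe(self, x):
        """Flag x as deviating if it is improbable under the current estimate,
        then fold it into the sliding sample window."""
        is_outlier = len(self.samples) >= self.window and self.density(x) < self.floor
        self.samples = (self.samples + [x])[-self.window:]
        return is_outlier

sensor = SensorKDE()
for i in range(100):
    sensor.observe(math.sin(i / 10.0))  # normal readings stay within [-1, 1]
print(sensor.observe(0.5), sensor.observe(8.0))  # False True
```

Each sensor can run such an estimator autonomously; combining the per-sensor flags into a global decision is exactly the part left open in [49].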
4. CONCLUDING REMARKS

We argued in this paper that variability, or simply change, is crucial in a world full of factors that alter the behavior of the data and, consequently, the underlying model. The ability to detect such changes in centralized as well as distributed systems plays an important role in assessing the validity of data models. The paper presented the state of the art in this area of paramount importance. Techniques, in some cases, are tightly coupled with application domains. However, most of the techniques reviewed in this paper are generic and could be adapted to different application domains.

With Big Data technologies reaching a mature stage, future work in change detection is expected to exploit such scalable data processing tools to efficiently detect, localize, and classify occurring changes. For example, distributed change detection models can make use of the MapReduce framework to accelerate their respective processes.

5. REFERENCES

[1] C. Aggarwal. A framework for diagnosing changes in evolving data streams.
In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 575–586. ACM, 2003.

[2] C. Aggarwal. On change diagnosis in evolving data streams. IEEE Transactions on Knowledge and Data Engineering, pages 587–600, 2005.

[3] C. Aggarwal. A segment-based framework for modeling and mining data streams. Knowledge and Information Systems, 30(1):1–29, 2012.

[4] C. Aggarwal and P. Yu. A survey of synopsis construction in data streams. Data Streams: Models and Algorithms, page 169, 2007.

[5] A. Bondu and M. Boullé. A supervised approach for change detection in data streams. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 519–526. IEEE, 2011.

[6] S. Boriah, V. Kumar, M. Steinbach, C. Potter, and S. Klooster. Land cover change detection: a case study. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 857–865. ACM, 2008.

[7] G. Cabanes and Y. Bennani. Change detection in data streams through unsupervised learning. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1–6. IEEE, 2012.

[8] J. Chang and W. Lee. estWin: adaptively monitoring the recent change of frequent itemsets over online data streams. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 536–539. ACM, 2003.

[9] K. Chen and L. Liu. HE-Tree: a framework for detecting changes in clustering structure for categorical data streams. The VLDB Journal, pages 1–20.

[10] T. Chen, C. Yuan, A. Sheikh, and C. Neubauer. Segment-based change detection method in multivariate data stream, Apr. 9 2009. WO Patent WO/2009/045,312.

[11] G. Cormode. The continuous distributed monitoring model. SIGMOD Record, 42(1):5, 2013.

[12] G. Cormode and M. Garofalakis. Efficient strategies for continuous distributed tracking tasks. IEEE Data Engineering Bulletin, 28(1):33–39, 2005.

[13] K. Das, K. Bhaduri, S. Arora, W. Griffin, K. Borne, C. Giannella, and H. Kargupta. Scalable distributed change detection from astronomy data streams using local, asynchronous eigen monitoring algorithms. In SIAM International Conference on Data Mining, Nevada, 2009.

[14] T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi. Change (detection) you can believe in: Finding distributional shifts in data streams. In Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII, page 34. Springer, 2009.

[15] T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In 38th Symposium on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2005.

[16] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, pages 18–26, 2006.

[17] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[18] N. Dindar, P. M. Fischer, M. Soner, and N. Tatbul. Efficiently correlating complex events over live and archived data streams. In ACM DEBS Conference, 2011.

[19] G. Dong, J. Han, L. Lakshmanan, J. Pei, H. Wang, and P. Yu. Online mining of changes from data streams: Research problems and preliminary results. Citeseer.

[20] A. R. Ganguly, J. Gama, O. A. Omitaomu, M. M. Gaber, and R. R. Vatsavai. Knowledge Discovery from Sensor Data, volume 7. CRC, 2008.

[21] V. Ganti, J. Gehrke, and R. Ramakrishnan. A framework for measuring changes in data characteristics. In Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 126–137. ACM, 1999.

[22] V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Loh. A framework for measuring differences in data characteristics. Journal of Computer and System Sciences, 64(3):542–578, 2002.

[23] S. Geisler, C. Quix, and S. Schiffer. A data stream-based evaluation framework for traffic information systems. In Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, pages 11–18. ACM, 2010.

[24] L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. Stream warehousing with DataDepot. In Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 847–854. ACM, 2009.

[25] A. J. Hey, S. Tansley, and K. M. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research Redmond, WA, 2009.

[26] S. Hido, T. Idé, H. Kashima, H. Kubo, and H. Matsuzawa. Unsupervised change analysis using supervised learning. In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 148–159. Springer-Verlag, 2008.

[27] S. Ho and H. Wechsler. Detecting changes in unlabeled data streams using martingale. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, pages 1912–1917. Morgan Kaufmann Publishers Inc., 2007.

[28] W. Huang, E. Omiecinski, and L. Mark. Evolution in data streams. 2003.

[29] W. Huang, E. Omiecinski, L. Mark, and M. Nguyen. History guided low-cost change detection in streams. Data Warehousing and Knowledge Discovery, pages 75–86, 2009.

[30] E. Ikonomovska, J. Gama, R. Sebastião, and D. Gjorgjevik. Regression trees from data streams with drift detection. In Discovery Science, pages 121–135. Springer, 2009.

[31] M. Karnstedt, D. Klan, C. Pölitz, K.-U. Sattler, and C. Franke. Adaptive burst detection in a stream engine. In Proceedings of the 2009 ACM Symposium on Applied Computing, pages 1511–1515. ACM, 2009.

[32] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In Proceedings of the 2009 SIAM International Conference on Data Mining (SDM2009), pages 389–400, 2009.

[33] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, page 191. VLDB Endowment, 2004.

[34] A. Kim, C. Marzban, D. Percival, and W. Stuetzle. Using labeled data to evaluate change detectors in a multivariate streaming environment. Signal Processing, 89(12):2529–2536, 2009.

[35] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pages 234–247. ACM, 2003.

[36] L. Kuncheva. Change detection in streaming multivariate data using likelihood detectors. IEEE Transactions on Knowledge and Data Engineering, (99):1–1, 2011.

[37] X. Liu, X. Wu, H. Wang, R. Zhang, J. Bailey, and K. Ramamohanarao. Mining distribution change in stock order streams. In Proc. of ICDE, pages 105–108, 2010.

[38] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 346–357. VLDB Endowment, 2002.

[39] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. Big data: The next frontier for innovation, competition and productivity. McKinsey Global Institute, May 2011.

[40] A. Maslov, M. Pechenizkiy, T. Kärkkäinen, and M. Tähtinen. Quantile index for gradual and abrupt change detection from CFB boiler sensor data in online settings. In Proceedings of the Sixth International Workshop on Knowledge Discovery from Sensor Data, pages 25–33. ACM, 2012.

[41] S. Muthukrishnan. Data Streams: Algorithms and Applications. Now Publishers Inc, 2005.

[42] S. Muthukrishnan, E. van den Berg, and Y. Wu. Sequential change detection on data streams. ICDM Workshops, 2007.

[43] M. Naor and L. Stockmeyer. What can be computed locally? pages 184–193, 1993.

[44] W. Ng and M. Dash. A change detector for mining frequent patterns over evolving data streams. In Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on, pages 2407–2412. IEEE, 2008.

[45] W. Ng and M. Dash. A test paradigm for detecting changes in transactional data streams. In Database Systems for Advanced Applications, pages 204–219. Springer, 2008.

[46] D. Nikovski and A. Jain. Fast adaptive algorithms for abrupt change detection. Machine Learning, 79(3):283–306, 2010.

[47] R. Niu and P. K. Varshney. Performance analysis of distributed detection in a random sensor field. Signal Processing, IEEE Transactions on, 56(1):339–349, 2008.

[48] R. Niu, P. K. Varshney, and Q. Cheng. Distributed detection in a large wireless sensor network. Information Fusion, 7(4):380–394, 2006.

[49] T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Distributed deviation detection in sensor networks. ACM SIGMOD Record, 32(4):77–82, 2003.

[50] D.-S. Pham, S. Venkatesh, M. Lazarescu, and S. Budhaditya. Anomaly detection in large-scale data stream networks. Data Mining and Knowledge Discovery, 28(1):145–189, 2014.

[51] R. Rajagopal, X. Nguyen, S. C. Ergen, and P. Varaiya. Distributed online simultaneous fault detection for multiple sensors. In Information Processing in Sensor Networks, 2008. IPSN'08. International Conference on, pages 133–144. IEEE, 2008.

[52] J. Roddick, L. Al-Jadir, L. Bertossi, M. Dumas, H. Gregersen, K. Hornsby, J. Lufter, F. Mandreoli, T. Mannisto, E. Mayol, et al. Evolution and change in data management: issues and directions. ACM Sigmod Record, 29(1):21–25, 2000.

[53] G. Ross, D. Tasoulis, and N. Adams. Online annotation and prediction for regime switching data streams. In Proceedings of the 2009 ACM Symposium on Applied Computing, pages 1501–1505. ACM, 2009.

[54] A. Singh. Review article: Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing, 10(6):989–1003, 1989.

[55] X. Song, M. Wu, C. Jermaine, and S. Ranka. Statistical change detection for multi-dimensional data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 667–676. ACM, 2007.

[56] D.-H. Tran and K.-U. Sattler. On detection of changes in sensor data streams. In Proceedings of the 9th International Conference on Advances in Mobile Computing and Multimedia, pages 50–57. ACM, 2011.

[57] M. van Leeuwen and A. Siebes. StreamKrimp: Detecting change in data streams. Machine Learning and Knowledge Discovery in Databases, pages 672–687, 2008.

[58] V. Veeravalli and P. Varshney. Distributed inference in wireless sensor networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 370(1958):100–117, 2012.

[59] A. Zhou, F. Cao, W. Qian, and C. Jin. Tracking clusters in evolving data streams over sliding windows. Knowledge and Information Systems, 15(2):181–214, 2008.

Contextual Crowd Intelligence

Beng Chin Ooi†, Kian-Lee Tan†, Quoc Trung Tran†, James W. L. Yip§, Gang Chen#, Zheng Jye Ling§, Thi Nguyen†, Anthony K. H.
Tung†, Meihui Zhang†
† National University of Singapore
§ National University Health System
# Zhejiang University
† {ooibc, tankl, tqtrung, thi, atung, zmeihui}@comp.nus.edu.sg
§ {james yip, zheng jye ling}@nuhs.edu.sg
# [email protected]

ABSTRACT
Most data analytics applications are industry/domain specific, e.g., predicting patients at high risk of being admitted to the intensive care unit in the healthcare sector, or predicting malicious SMSs in the telecommunication sector. Existing solutions are based on "best practices", i.e., the systems' decisions are knowledge-driven and/or data-driven. However, there are rules and exceptional cases that can only be precisely formulated and identified by subject-matter experts (SMEs) who have accumulated many years of experience. This paper envisions a more intelligent database management system (DBMS) that captures such knowledge to effectively address industry/domain specific applications. At its core, the system is a hybrid human-machine database engine where the machine interacts with the SMEs as part of a feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. We discuss the challenges towards building such a system through examples in healthcare predictive analysis, a popular area for big data analytics.

1. INTRODUCTION
Most data analytics applications are industry or domain specific. For example, many prediction tasks in healthcare require prior medical knowledge, such as identifying patients at high risk of being admitted to the intensive care unit, or predicting the probability of patients being readmitted to the hospital within 30 days after discharge. Another example, from the telecommunication sector, is the identification of malicious SMSs, which requires input from security experts. Building competent tools to effectively address these problems is important, as industrial organizations face increasing pressure to improve outcomes while reducing costs [3].
Existing solutions to industry or domain specific tasks are based on “best practices”. These solutions are knowledge-driven (i.e., utilizing general guidelines such as existing clinical guidelines or literature from medical journals) and/or data-driven (i.e., deriving rules from observational data) [31]. Let us consider the task of identifying the risk factors related to heart failure. The knowledge-driven solution uses risk factors identified from existing clinical knowledge or literature, such as age, hypertension and diabetes status. However, it may miss other unknown risk factors specific to the population of interest. The reason is that the guidelines are generic and based on existing knowledge, which results in models that may not adequately represent the underlying complex disease processes in the population with a comprehensive list of risk factors [31]. The data-driven solution employs machine learning algorithms to derive risk factors solely from observational data. An alternative approach combines the knowledge-driven and data-driven approaches in data analytics applications [31]. However, there are exceptional situations that are not easy to capture or formalize, for which neither general guidelines are available nor rules can be derived from data (e.g., rare conditions). Instead, it is only through many years of experience that subject-matter experts (SMEs) can formulate and identify these situations. The challenge then is to capture and utilize such knowledge to effectively support industry/domain specific applications, e.g., improving the accuracy of prediction tasks. This paper proposes building the next generation of intelligent database management systems (DBMSs) that exploit contextual crowd intelligence. Crowd intelligence here refers to the knowledge and experience of subject-matter experts (SMEs).
Although such knowledge is an important component in transforming data into information, it is currently not captured by any structured system. The participants in an intelligent crowd are domain experts, rather than the “unknown” lay-persons in existing systems that use crowdsourcing as part of database query processing (e.g., CrowdDB [13], Deco [24], Qurk [23], CDAS [12; 22]) and information extraction or knowledge acquisition (e.g., HIGGINS [21] and CASTLE [28]). For applications where data confidentiality and privacy are important (e.g., healthcare analytics), the intelligent crowd may consist of only experts from within the organization, since the tasks cannot be outsourced to external parties. Given that the crowd is known a priori, there is an assurance of user accountability, which translates to an assurance of the quality of the answers. A recent system, called Data Tamer [30], also proposed to leverage an expert crowdsourcing system to enhance machine computation, but in the context of data curation. Our proposition differs from Data Tamer in several aspects. First, the target applications of our work (i.e., data analytics) are different from those of Data Tamer (i.e., data curation). Thus, each system needs to address a unique, different set of challenges. Second, the domain experts in our context are also users/reviewers of the system. Thus, the experts are likely to take ownership and hence are motivated to improve the accuracy of the analytics and the usability of the applications. This reduces the need to localize/customize the system, since the experts/users are continuously interacting with the system; these experts define the “best practices” for the system. For example, doctors in a particular department may use a different convention or notation from another department; e.g., when doctors in the orthopedic department write “PID”, the acronym refers only to “Prolapsed Intervertebral Disc” and not to “Pelvic Inflammatory Disease”.
Clearly, such knowledge can only be provided by internal domain experts. In contrast, experts in Data Tamer are not the users of the system, and hence there is a need to customize/localize the system for different use-cases. In order to entrench crowd intelligence in the DBMS, the system needs to keep SMEs as part of the feedback loop. The system can then further utilize feedback provided by the SMEs to infer, ascertain and enhance its processing, thus continuously improving the effectiveness of the system. For example, when predicting the risk of unplanned patient readmissions, the system asks the doctors to label patients for whom the system has low confidence in predicting readmission, as well as the rules/hypotheses that the doctors used to do the labeling. One example of such an expert rule is that an elderly patient who lives alone and has had several severe diseases is likely to be readmitted into the hospital frequently. The system would then verify or adjust these rules/hypotheses and return to the doctors with evidence to support or reject them. Such interactions are beneficial to both the system and the doctors. Eventually, the application system evolves over time. SMEs become part of this evolving process by sharing their domain knowledge and rich experience, thereby contributing to the improvement and development of the system. Hence, the experts are more willing and comfortable using the system to alleviate the burden of their duties. This work is part of our CIIDAA project on building a large scale, Comprehensive IT Infrastructure for Data-intensive Applications and Analysis [2]. Our collaborators are clinicians in the National University Health System (NUHS) [5]. The project aims to harness the power of cloud computing to solve big data problems in the real world, with healthcare predictive analytics being a popular area for big data analytics [26].

Organization.
The remainder of this paper is organized as follows. Section 2 presents motivating examples in healthcare predictive analytics. Section 3 discusses the architecture of an intelligent DBMS that aims to embed contextual crowd intelligence. Section 4 elaborates on the research problems that we need to address in order to build an intelligent DBMS. Section 5 presents our preliminary results on the problem of predicting the risk of unplanned patient readmissions. Section 6 presents the related work. Finally, Section 7 concludes our work.

2. MOTIVATING EXAMPLES

Let us consider a hospital that has an integrated view of the medical care records of patients as shown in Table 1. The table contains two types of information:

• Structured information, including the case identifier, patient’s name, age, gender, race, the number of days that the patient stayed at the hospital during a particular visit (LengthOfStay), and the number of days before the patient was readmitted into the hospital after discharge (Readmission); and

• Unstructured information, i.e., free text from a doctor’s note that contains additional and useful information about a patient’s healthcare profile, such as past medical history, social factors, previous medications, patient complaints based on a doctor’s investigations, major lab results, issues and progress, etc.

The tuples in this table are extracted from real cases of patients admitted to the National University Hospital (NUH) in Singapore. Healthcare professionals often have queries relating to predicting the severity of patients’ conditions, such as identifying patients at high risk of being admitted to the intensive care unit, or predicting the probability of patients being readmitted into the hospital soon after discharge. There are also queries that monitor real-time data of patients in critical conditions for unusual events, such as whether patients are at high risk of collapsing.
With correct predictions, doctors can intervene early to alleviate the deterioration of a patient’s health. This can potentially reduce the burden on limited healthcare resources in primary and acute care facilities. For instance, if a patient is at high risk of unplanned post-discharge readmission, he can potentially benefit from close follow-up after discharge, e.g., the hospital sends a case manager or nurse to examine him once every three days. In addition, important queries related to public health surveillance can be answered in a timely fashion. For example, it is critical to provide real-time, early information to alert decision-makers of emerging threats that need to be addressed in a particular population. The ultimate goal of these predictive queries is to predict, pre-empt and prevent for better healthcare outcomes.

3. AN INTELLIGENT DBMS FOR BIG DATA ANALYTICS

In this section, we discuss the challenges of big data analytics and present an overview of a hybrid human-machine system for these tasks.

3.1 Challenges of Big Data Analytics

Essentially, many big data analytics tasks can be viewed as conventional data mining problems, such as classifying patients into different class labels (high or low risk of being admitted to intensive care units). There are, however, three important aspects that differentiate big data analytics from traditional machine learning problems.

• First, many valuable features for the analytics tasks are stored in unstructured data, for example, doctor’s notes [25]. We cannot simply treat these notes as traditional “bag-of-words” documents. Instead, we need powerful tools to extract from these documents the right entities (such as diseases, medications, laboratory tests) and domain-specific relationships (such as the relationship between a disease and a laboratory test).
The text in unstructured data has to be contextualized to each organization’s practice, e.g., doctors in a particular department may use a different convention or notation from another department.

• Second, there is usually a lack of training samples with well-defined class labels. For instance, when predicting the risk of committing suicide for each patient, the total number of patients known to have committed suicide (i.e., class 1) is very small. However, it does not mean that all the remaining patients did not commit suicide (i.e., class 0). Hence, we need to infer the correct class labels for these patients. This problem also occurs in other domains such as home security and banking. For example, one important task that many national security agencies need to perform is identifying persons or groups of people who are likely to commit a crime [4]. In this setting, the agency maintains a very small set of people who have committed crimes. However, we cannot simply assume that the remaining people are not likely to commit crimes. As before, we need to infer the correct class labels for these people. Another example is in telecommunication, where a service provider wants to predict whether an SMS is malicious. In this case, we do not have any predefined class labels and might need to ask security experts to provide the class labels for some sample cases.

CaseID | Name      | Age | Gender | Race      | LengthOfStay | Readmission | Doctor’s note
Case 1 | Patient 1 | 71  | Female | Chinese   | 5            | 20          | PMH: 1 IHD - on GTN 0.5mg prn 2 DM - on Metformin 750mg - HbA1c 7.5% 09/12 3 HL. Stays with son ...
Case 2 | Patient 2 | 60  | Male   | Malaysian | 10           | 20          | Social issues: Single, no child. Used to live with friend in a shophouse. Now at sheltered home since Sept 2011. No next-of-kin or visitor. ...

Table 1: Medical care table

• Lastly, data in different domains (e.g., healthcare, telecommunication, home security) is expected to grow dramatically in the years ahead [26].
For instance, patients in intensive care units are constantly being monitored, and their historical records have to be retained. This can easily result in hundreds of millions of (historical) records of patients. As another example, during a mass casualty disaster (e.g., SARS, H5N1), there is an overwhelming number of patients who have to be monitored and tracked, and the information about each patient is huge by itself. Furthermore, streaming data arrive continuously, e.g., new data from the real-time data feed are constantly being inserted. Hence, a system in a healthcare setting must provide real-time predictions, e.g., predicting the survival of patients in the next 6 hours.

The three aspects mentioned above call for a new generation of intelligent DBMSs that can provide effective solutions for big data analytics. Our proposition of exploiting contextual crowd intelligence is, we believe, a big step towards this goal.

3.2 Contextual Data Management

The central theme of crowd intelligence is to get domain experts engaged both as participants who fine-tune the system and as its end-users. Figure 1 presents an intelligent system that exploits contextual crowd intelligence for big data analytics. The system first builds a knowledge base that will subsequently be used for the analytics tasks, based on historical data, domain knowledge from SMEs (e.g., doctors), and other sources such as general clinical guidelines. Each source contributes to building some “weak classifiers”. The system needs to combine these classifiers to derive a final classifier that achieves a high level of accuracy for prediction purposes. The system also needs to go through several iterations of interaction with the experts to refine, for example, the final classifier. As such, the experts participate in the entire process of fine-tuning the system and decide on the “best practices”.
When real-time data or a feed arrives, the system performs the prediction on-the-fly and alerts the experts immediately. Hence, the experts become the end-users of the system. We have developed the epiC system [1; 10; 19] to support large scale data processing, and are extending it to support healthcare analytics. Figure 2 shows the software stack of epiC. At the bottom, the storage layer supports different storage systems (e.g., the Hadoop Distributed File System (HDFS) and a key-value storage system, ES2 [8]) for both unstructured and structured data. The next layer (the security layer) enables users to protect data privacy by encryption. The third layer (the distributed processing layer) provides a distributed processing infrastructure called E3 [9] that supports different parallel processing logics such as MapReduce [11], Directed Acyclic Graph (DAG) and SQL. The top layer (the analytics layer) exploits contextual crowd intelligence for big data analytics. The details of this layer are shown in Figure 1. In Figure 2, KB is the knowledge base and iCrowd is the component that interacts with the domain experts. Different components of the analytics layer (e.g., scalable machine learning algorithms) can process their data with the most appropriate data processing model, and their computations will be automatically executed in parallel by the lower layers. In the remainder of this paper, we focus only on the analytics layer. For more details on the other layers of the epiC system, please refer to [1; 10; 19].

4. RESEARCH PROBLEMS

In this section, we elaborate on the research problems that we need to address in order to build an intelligent system for big data analytics.

4.1 Asking Experts the Right Questions

Given a large volume of data and a limited amount of time that domain experts can spend participating in building the system, we need to ask the experts the right questions.
In the context of healthcare analytics, we plan to elicit the following domain knowledge from doctors.

• Labelings. The system asks doctors to label tuples for which the system has low confidence in performing the prediction task. There are two important issues here. First, doctors have different levels of confidence when answering different questions, i.e., doctors are reluctant to assess patient profiles outside their specialties. Second, since there is so much information about patients, selecting the relevant features of each patient to present to the doctors, so as not to overwhelm them, is also a major issue. In essence, what we need is a diverse set of labeled patients that covers the whole data space as much as possible. One possible solution is to group similar patient profiles together and show these groups to doctors. The purpose is to let the doctors select the groups of patients for which they are comfortable providing labels. In addition, for each group, we present only the features for which the patients in the group have similar values. In this way, we can avoid overwhelming the doctors with information. Note that, in some cases, we need to perform hierarchical clustering to reduce the number of patients shown to the doctors each time. Selecting the right clustering algorithms and developing effective visualization tools to present patients’ profiles are important here.

Figure 1: Contextual crowd intelligence for big data analytics.

• Rules/Hypotheses. The system collects the expert rules/hypotheses that the doctors used to do the labeling.
For example, to predict the risk of unplanned patient readmissions, the doctors suggested the hypothesis that social factors and disease status are important risk indicators for readmission. The system would then verify or adjust these hypotheses and return to the doctors with evidence to support or reject them. Such interactions are beneficial to both the system and the doctors.

• Inferred implicit knowledge. The system can also infer implicit and valuable knowledge based on the answers/reactions of the domain experts. For instance, if the doctors label two patients who belong to a given cluster differently, then the system can adjust the distance function used to compute the similarity between two patients, and thus infer which features are more important. Such knowledge is implicit, as the doctors themselves may not be aware of it.

We can also ask the same kinds of questions for analytics tasks in other domains. For instance, to predict malicious SMSs, we need to select a small set of messages (by utilizing some clustering algorithm) and ask the experts to provide labels for these samples. We also collect rules and heuristics that the experts utilize to label the SMSs.

4.2 Extracting Domain Entities From Unstructured Data

Feature selection is very important for any machine learning task and can greatly affect an algorithm’s quality. Processing doctor’s notes to extract important features is an essential step for healthcare analytics problems. There are several state-of-the-art Natural Language Processing (NLP) engines for processing clinical documents, such as MedLEE [14] and cTAKES [27]. These engines process clinical notes, identifying types of clinical entities (e.g., medications, diseases, procedures, lab tests) from various medical dictionaries (a.k.a. knowledge bases), such as the Unified Medical Language System (UMLS) [6].

Figure 2: The software stack of epiC for big data analytics.
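The core dictionary lookup such engines perform can be sketched as follows. This is a minimal illustration only, not the actual cTAKES or MedLEE pipeline, and the tiny dictionary (entity types and candidate senses) is a made-up stand-in for UMLS:

```python
# Minimal sketch of dictionary-based clinical entity extraction.
# The dictionary below is a toy stand-in for a real knowledge base such
# as UMLS; real engines (cTAKES, MedLEE) do far more than token lookup.

# mention -> (entity type, candidate senses); ambiguous mentions map to several
DICTIONARY = {
    "IHD": ("disease", ["Ischemic Heart Disease"]),
    "DM": ("disease", ["Diabetes Mellitus", "Dystrophy Myotonic"]),
    "GTN": ("medication", ["Glyceryl Trinitrate"]),
    "Metformin": ("medication", ["Metformin"]),
    "HbA1c": ("lab_test", ["Hemoglobin A1c"]),
}

def extract_entities(note: str):
    """Return (mention, entity type, candidate senses) for each dictionary hit."""
    hits = []
    for token in note.replace("-", " ").split():
        if token in DICTIONARY:
            etype, senses = DICTIONARY[token]
            hits.append((token, etype, senses))
    return hits

note = "PMH: 1 IHD - on GTN 0.5mg prn 2 DM - on Metformin 750mg - HbA1c 7.5%"
for mention, etype, senses in extract_entities(note):
    print(mention, etype, senses)
```

Note that an ambiguous mention such as “DM” deliberately returns several candidate senses; resolving such ambiguity is exactly the problem the hybrid human-machine approach has to address.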
We now discuss several problems raised by the nature of the unstructured data and the incompleteness of the knowledge base, and subsequently discuss a hybrid human-machine approach to solve these problems. The discussion uses the following running example. We run cTAKES on the doctor’s note of patient 1 (in Table 1), and obtain the following clinical entities: (1) diseases: IHD (Ischemic Heart Disease) and DM; (2) medications: GTN and Metformin; and (3) laboratory test: HbA1c.

Ambiguous mentions. In many cases, a mention in the free text may refer to different domain entities. For instance, in the running example, “DM” refers to two different diseases, “Dystrophy Myotonic” and “Diabetes Mellitus”. We note that this problem is not uncommon, as doctors tend to use abbreviations in their notes. For example, “CCF” refers to either the “Congestive Heart Failure” or the “Carotid-Cavernous Fistula” disease; “PID” refers to either “Prolapsed Intervertebral Disc” or “Pelvic Inflammatory Disease”. There are also cases where only a human, but not the machine, can understand the meaning of some mentions in the text. For example, suppose we are extracting the social factors of the patients in Table 1. It is rather easy to extract the social factor for patient 1, since the text contains the phrase “stays with son”. However, it is challenging, if not impossible, for the machine to extract the social factor for patient 2. The reason is that the paragraph contains several different keywords relating to the social factor, such as “single”, “no child”, “live with friend”, “sheltered home”, “next-of-kin”.

Incomplete knowledge base. The knowledge base is incomplete for the following reasons. First, the terms used in the doctor’s notes could be specific to a country or a particular hospital, whereas the existing knowledge bases may only cover the universal ones. Thus, these terms do not exist in the dictionary.
One example is the term “HL” in our running example, which refers to the “Hyperlipidemia” disease but is not captured in UMLS. Second, the relationships between entities covered in existing medical knowledge bases (like UMLS) are far from complete. In the running example, the fact that the medication Metformin is used to treat Diabetes Mellitus (DM) is also missing in UMLS. The relationships that exist between domain entities can be used to derive implicit and useful information. For instance, from the result of the lab test HbA1c, we can infer whether the DM condition is well-controlled (i.e., the relationship between a disease and a lab test).

A hybrid human-machine approach. To infer the correct entities from unstructured data, a hybrid human-machine solution should be employed. The system can leverage the information from the knowledge base (e.g., UMLS) together with the implicit information (signals) inherent in the unstructured data (e.g., doctor’s notes) to improve the accuracy of its inference process and to enhance the knowledge base as well. The system will pose questions to the healthcare professionals for verification. Based on the answers from the experts, the system adjusts its inference results. The inference process gets more accurate and complete as the system runs more iterations. Meanwhile, the knowledge base becomes more comprehensive and customized to each organization’s practice. More specifically, in our running example:

• Since “DM” is attached to the laboratory test “HbA1c” in the paragraph, the machine conjectures that “DM” refers to the “Diabetes Mellitus” disease only. The reason is that HbA1c is a laboratory test that monitors the control of diabetes, and HbA1c does not have any relationship with the other disease “DM” could refer to (i.e., “Dystrophy Myotonic”).
• To correctly infer the disease “Hyperlipidemia” for “HL”, the machine infers a pattern of “num d”, where num is an enumeration index and d is a disease. (“1 IHD” and “2 DM” are two examples.) The machine then infers that “HL” may refer to a disease, since the phrase “3 HL” follows the pattern. The machine then poses a question to a doctor: which disease does “HL” represent? In this case, the doctor confirms that “HL” represents the “Hyperlipidemia” disease. Based on the answer, the machine adds the mapping between the mention “HL” and the disease “Hyperlipidemia” to the knowledge base. Hence, the knowledge base becomes more comprehensive and customized to NUH’s practice.

• To identify the missing relationship between the medication Metformin and the disease DM, the machine infers a pattern of “d on med”, where d is a disease, med is a medication, and med is used to treat d. (“IHD on GTN” is an example.) The machine conjectures that there should be a relationship between DM and Metformin, since the phrase “DM on Metformin” follows the pattern. The machine then verifies this inference with the doctors. The doctors confirm that they typically write the medications that are used to treat a disease right next to the disease, and connect these relationships with the preposition “on”. Clearly, such a rule is very useful: the machine can then infer other missing relationships using this expert rule, with fewer questions being posed to the doctors.

• To derive the social factor for patient 2, the machine can first attempt to derive the information using a simple strategy, such as analyzing the NLP structure of sentences containing patterns like “stay with” or “live with”. For complicated cases where the machine cannot find the information, we need to tap into the knowledge of the experts.
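The “d on med” pattern inference described above can be sketched as follows. This is a toy illustration; the regular expression and the dictionaries are our assumptions, not the system’s actual implementation:

```python
import re

# Sketch of "d on med" pattern inference: if a known disease is followed
# by "on" and a known medication, conjecture a treats-relationship;
# pattern matches involving unknown tokens become questions for the
# doctors. The dictionaries are toy stand-ins for a real knowledge base.

DISEASES = {"IHD", "DM"}
MEDICATIONS = {"GTN", "Metformin"}

def infer_treats(note: str):
    """Return (confirmed, to_verify) relation lists from 'd - on med' patterns."""
    confirmed, to_verify = [], []
    for d, med in re.findall(r"(\w+)\s*-\s*on\s+(\w+)", note):
        if d in DISEASES and med in MEDICATIONS:
            confirmed.append((d, med))   # both tokens known: conjecture relation
        else:
            to_verify.append((d, med))   # unknown token: ask an expert
    return confirmed, to_verify

note = "1 IHD - on GTN 0.5mg prn 2 DM - on Metformin 750mg"
confirmed, to_verify = infer_treats(note)
print(confirmed)  # [('IHD', 'GTN'), ('DM', 'Metformin')]
```

Once the doctors confirm the convention, the conjectured pairs can be promoted into the knowledge base, exactly as in the Metformin/DM example above.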
4.3 Combining Multiple Weak Classifiers

We can obtain different classifiers from multiple sources, such as classifiers built from the observational data, rules used by the doctors, and general clinical guidelines. Each source of knowledge can be considered a “weak classifier”, and the task is to combine these classifiers to derive a final classifier that achieves a very high prediction accuracy.

Figure 3: An example of several rounds of learning for healthcare predictive analytics.

There are many ways to achieve this goal. Figure 3 shows an example of a process consisting of three rounds of learning for the task of predicting the severity of patients’ conditions. In the first round, the system computes four classifiers: C1 and C2 are the classifiers derived from rules provided by SMEs (i.e., doctors); C3 is the classifier derived from historical data; and C4 is the classifier derived from clinical guidelines. It is essential to resolve disagreeing opinions from the various sources. There are several ways to combine different classifiers, such as using majority voting over the outputs of the different rules/classifiers, or combining the features used in the different input classifiers. It is likely that the classifiers built after the first round will not all agree on the prediction tasks. Thus, in this example, the system performs two additional rounds of learning to improve the accuracy of the classifier. It is also possible that there is no way to reconcile the classifiers, i.e., there will be multiple different classifiers. In such situations, it may be necessary to “rank” the results of the different classifiers and pick the answer that is ranked highest. How to do this is an open question.
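As a concrete illustration of one combination strategy named above, majority voting over weak classifiers can be sketched as follows. The rules themselves are invented for illustration and are not real clinical guidance:

```python
from collections import Counter

# Minimal sketch of combining "weak classifiers" by majority voting.
# Each classifier is a function from a patient record to a class label
# (1 = high risk, 0 = low risk); the three toy rules stand in for
# classifiers derived from doctors, data, and guidelines.

def rule_from_doctor(patient):       # expert rule: elderly and living alone
    return 1 if patient["age"] > 70 and patient["lives_alone"] else 0

def rule_from_data(patient):         # data-derived rule (toy threshold)
    return 1 if patient["prior_admissions"] >= 3 else 0

def rule_from_guidelines(patient):   # guideline-based rule (toy threshold)
    return 1 if patient["num_diseases"] >= 4 else 0

def majority_vote(classifiers, patient):
    """Return the label predicted by the most classifiers."""
    votes = Counter(c(patient) for c in classifiers)
    return votes.most_common(1)[0][0]

patient = {"age": 76, "lives_alone": True,
           "prior_admissions": 1, "num_diseases": 5}
label = majority_vote([rule_from_doctor, rule_from_data,
                       rule_from_guidelines], patient)
print(label)  # 2 of 3 classifiers vote high risk -> 1
```

Weighted voting, stacking, or the ranking scheme mentioned above would replace `majority_vote` without changing the overall structure.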
4.4 Scalable Processing

Big data analytics is characterized by the so-called 3V features: Volume, a huge amount of data; Velocity, a high data ingestion rate; and Variety, a mix of structured, semi-structured and unstructured data. These requirements force us to rethink the whole software stack to address big data analytics efficiently and effectively, ranging from the storage layer, which should manipulate both structured and unstructured data, to the application layer, which should support scalable machine learning algorithms. To illustrate these points, let us reconsider the problem of predicting malicious SMSs. The collection of SMSs is huge, e.g., on the order of hundreds of terabytes. As discussed in Section 4.1, we need to pick a set of SMSs for the domain experts to label. Conventional clustering algorithms may not work well here, as we need to handle such a large amount of data. The problem is even more challenging in our context, as we need to frequently get the domain experts involved in building the system; the delay of human reaction may significantly affect the system’s latency. Scalability is also an issue in terms of the high dimensionality of the data space. Our data set inherently contains a large number of features. For instance, there is a wide variety of information about patients, such as thousands of different diseases and lab tests. One solution to reduce the dimensionality is to group these attributes semantically, e.g., grouping together different diseases that share the same “root”. For instance, the Hypertension, Hypotension and Ischaemic Heart diseases can be grouped together under the category of Cardiovascular disease. Clearly, to perform such tasks, we need to consult the domain experts, as different hospitals/doctors may have different opinions/reasoning when performing this task. This is, again, an example of getting the domain experts involved in building the system.
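The semantic grouping of diseases described above can be sketched as follows; the category mapping is a toy stand-in for the expert-curated taxonomy the text calls for:

```python
# Sketch of reducing feature dimensionality by grouping attributes
# semantically: diseases sharing a common "root" collapse into one
# category-level feature. The mapping is a toy stand-in for a real
# disease taxonomy that domain experts would curate and validate.

DISEASE_CATEGORY = {
    "Hypertension": "Cardiovascular",
    "Hypotension": "Cardiovascular",
    "Ischaemic Heart Disease": "Cardiovascular",
    "Diabetes Mellitus": "Endocrine",
}

def group_features(diseases):
    """Collapse a patient's disease list into sorted category-level features."""
    return sorted({DISEASE_CATEGORY.get(d, "Other") for d in diseases})

print(group_features(["Hypertension", "Ischaemic Heart Disease",
                      "Diabetes Mellitus"]))
# ['Cardiovascular', 'Endocrine']
```

Three disease-level features collapse to two category-level ones here; over thousands of diseases, the reduction is far larger.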
4.5 Engaging Expert Users

As the system needs to interact with SMEs frequently, it is important to engage the experts throughout the process of building and using the system. The system should provide several functionalities for this purpose:

• A user-friendly interface for the experts to provide their inputs, such as rules, hypotheses, labels, etc.

• Presenting feedback to the experts. For instance, the system can explain how well an expert performs compared to other colleagues. As another example, the system can reveal comments and annotations by other experts to see whether an expert would change her decision. It is also interesting to present new patterns of knowledge that an expert may lack, and thereby potentially educate her.

• Providing not only the final outcome (e.g., whether the patient is at high/low risk of being sent to the ICU) but also the reasons that drive the decision. Therefore, keeping track of the provenance of the knowledge is important. For instance, when the system makes a decision that differs from the experts’ opinions, the system should be able to trace back whether the mismatch is mainly due to the use of some general guidelines or due to other experts’ opinions.

5. PRELIMINARY RESULTS

We are studying the problem of predicting the probability of patients being readmitted into the hospital within 30 days after discharge. We refer to this task as readmission prediction for short. We use clinical data drawn from the National University Hospital’s Computerized Clinical Data Repository (CCDR) and focus only on elderly patients (i.e., patients older than 60) admitted to the hospital in 2012. The table used for the prediction task is the medical care table, which has a schema similar to the one presented in Table 1. (To derive the medical care table, we joined information from various relations in CCDR, including: Discharge Summary, Patient Demographics, Visit and Encounter, Lab Results and Emergency Department.) In total, 29049 elderly patients were admitted to NUH in 2012, of whom 5658 were readmitted within 30 days, i.e., the proportion of patients who were readmitted (class label 1) is 0.188.

(a) Using only structured features
                     | # actual class 1 | # actual class 0
# predicted class 1  |       1071       |       1321
# predicted class 0  |       4587       |      22070

(b) Using both structured and derived features
                     | # actual class 1 | # actual class 0
# predicted class 1  |       2679       |       4250
# predicted class 0  |       2979       |      19141

Table 2: The accuracy of our classifier.

5.1 Interacting with Domain Experts

We have been getting the doctors involved in the following tasks.

Hypothesis/Rules. Our clinician collaborators have suggested the hypothesis that the following features (indicators) might be important for readmission prediction:

• Social-economic factors, e.g., who the care-givers are and the patient’s economic status.

• Lab findings. We should extract the lab findings that the doctors mention in their notes instead of using the labs recorded in the structured data in CCDR. The reason is that patients typically have hundreds of lab tests, but only a small number of them are important, and these are captured in the doctor’s notes. As a result, selecting the lab findings mentioned by doctors naturally reduces the dimensionality of the data set.

• Comorbidity influence, i.e., we should take into account the past medical history of the patient together with the disease status (whether the disease has been well-controlled).

Participants in a crowd-sourcing system. We adopted a hybrid human-machine approach to extract the social factors and lab findings from the doctors’ free-text notes. To extract the social factors, we use an NLP technique to analyze sentences containing phrases related to the social factor, such as “live (with)”, “stay (with)” and “main care-giver”, to pinpoint keywords such as “daughter”, “family”, “spouse”, etc.
The system then asks the doctors to handpick a set of predefined categories of social factors. For instance, living with family and being taken care of by professional helpers (e.g., maids, domestic helpers) are in the same group. As another example, living alone and living in a community nursing home are in the same group. The system also performs a postprocessing step to pull out cases that can be assigned more than one category of social factors. The system then asks the doctors to label these cases manually. (There are about 200 cases that need to be manually labeled.)

To extract the lab findings, the system first uses a simple pattern matching technique to extract all possible lab tests mentioned in the note. For instance, if the note contains a pattern of the form "word num", where word is some word and num is a number, then word is a candidate lab test. A word is a correct lab test if it exists in the medical dictionary under the category of lab tests. For the "false" lab tests that are currently not present in the dictionary and appear frequently in the notes, the system asks the doctors to verify them. As a result, some actual lab tests that were missing from the dictionary have been identified, such as "TW", which is a local convention used inside NUH.

Extracting medical concepts. We run the cTAKES NLP engine over the UMLS dictionary to extract the past medical history of a patient. We are in the process of developing algorithms to improve the accuracy of extraction (to resolve the problems mentioned in Section 4.2). For now, we use the number of diseases that the patient has as an indicator instead of the actual diseases.
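The "word num" pattern-matching step above can be sketched with a regular expression. This is an illustrative sketch only: the mini-dictionary, the example note, and the function name are invented here; the real system checks candidates against a full medical dictionary and has doctors verify frequent unknowns (which is how local conventions like "TW" surface).

```python
import re

# Hypothetical mini-dictionary of known lab-test names; stands in for
# the medical dictionary's lab-test category.
LAB_DICTIONARY = {"hb", "creatinine", "tw"}

def candidate_lab_tests(note):
    """Find "word num" patterns: a word immediately followed by a number
    is a candidate lab test; keep it only if the dictionary knows it."""
    candidates = re.findall(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\b", note)
    return [(word, float(value)) for word, value in candidates
            if word.lower() in LAB_DICTIONARY]

note = "Patient stable. Hb 10.2 on admission, creatinine 89, TW 12.3; BP 120"
print(candidate_lab_tests(note))  # [('Hb', 10.2), ('creatinine', 89.0), ('TW', 12.3)]
```

Here "BP 120" matches the pattern but is filtered out by the dictionary lookup, mirroring the two-stage candidate-then-verify design described above.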
5.2 Results

After interacting with the doctors to extract relevant features, we obtained two sets of features for the prediction task:
• Structured features: patients' demographics (age, gender, race), the number of days that the patient stayed at the hospital, the number of previous hospitalizations, and the number of prior emergency visits in the last six months before admission.
• Derived features from free-texts (we refer to these features as derived features for short): social factors, lab findings, and past medical history (i.e., diseases).

We used WEKA [15] to run a 10-fold cross-validation with the Bayesian Network classifier to construct a readmission classifier². Table 2 reports the accuracy of the prediction across all the 10 validation folds. If only structured features are used to build the classifier (Table 2(a)), the resulting classifier can correctly predict 1071 cases that are readmitted (within 30 days). The precision and recall in this case are 0.448 and 0.189, respectively. Meanwhile, if both structured and derived features are used to build the classifier (Table 2(b)), the resulting classifier can correctly predict 2679 cases that are readmitted. The precision and recall are 0.387 and 0.473, respectively. Clearly, the recall has been improved significantly with the usage of the derived features from the free-text doctor's notes. The result is also very promising when compared to predictions made manually by domain experts such as physicians, case managers, and nurses [7]. The recall reported in [7] is in the range [0.149, 0.306]. The conclusion in [7] is that care-providers were not able to accurately predict which patients were at highest risk of readmission. However, we believe that a hybrid machine-human solution would greatly alleviate the problem.

Volume 16, Issue 1 Page 44
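The precision and recall figures quoted above can be re-derived directly from the confusion-matrix counts in Table 2 using the standard definitions; nothing here is specific to the WEKA pipeline.

```python
def precision_recall(tp, fp, fn):
    """Standard precision = tp/(tp+fp) and recall = tp/(tp+fn)."""
    return tp / (tp + fp), tp / (tp + fn)

# Table 2(a): structured features only
p_a, r_a = precision_recall(tp=1071, fp=1321, fn=4587)
# Table 2(b): structured + derived (free-text) features
p_b, r_b = precision_recall(tp=2679, fp=4250, fn=2979)

print(round(p_a, 3), round(r_a, 3))  # 0.448 0.189
print(round(p_b, 3), round(r_b, 3))  # 0.387 0.473
```

Note how the derived features trade a little precision (0.448 to 0.387) for a large gain in recall (0.189 to 0.473), which matters most for catching at-risk patients.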
We would like to emphasize that there is much room to further improve the accuracy of the prediction, such as enhancing the feature extraction process, employing additional features (e.g., disease status, specific diagnoses, medications), and using classifiers designed for highly imbalanced data sets.

6. RELATED WORK

Work related to our proposition can be broadly classified into the following three categories.

Existing solutions for industry/domain specific applications. Existing solutions are currently built based on "best practices". One direction is a knowledge-driven approach based on general guidelines such as clinical guidelines, e.g., IBM Watson [3]. Another direction is a data-driven approach based on "rules" extracted from observational data, e.g., [16; 18; 20]. Recently, IBM proposed to combine the strengths of the two directions [31]. However, these solutions have not explored the exceptionally complicated rules/patterns that can only be provided by internal domain experts with years of working experience. Our research aims to fill this gap: we seek to engage the experts as users of the system, and tap their expertise to enhance the database knowledge and processing. There are several benefits of employing internal domain experts. First, we do not need to customize/localize the system for different use-cases; the experts themselves define the "best practices" for the system. Second, in terms of the data used to build the knowledge base, our system is mainly based on observational data and knowledge provided by domain experts, whereas others (e.g., IBM Watson) need to process a much larger amount of inputs such as medical journals, white papers, medical policies and practices, information on the web, etc. Third, the system should become more "intelligent" over time as the expert users continuously enhance the system with their expert knowledge.
² We also used other classifiers such as decision trees, rule-based classifiers, SVM, etc., and observed that the Bayesian Network classifier provides the best result.

Crowdsourcing in databases. There has been a lot of recent interest in the database community in using crowdsourcing as part of database query processing (e.g., CrowdDB [13], Deco [24], Qurk [23], CDAS [12; 22]). As discussed, the intelligent crowds in our context are domain experts (rather than the lay-persons in existing crowds) who are also users/reviewers of the system. Furthermore, exploiting an intelligent crowd can be much more collaborative in nature. In typical crowdsourcing, the crowds are not aware of each other's answers. But in our context, we can actually go through several iterations and see whether the experts will change their decisions when they are provided with comments and annotations by other experts. A recent system, called Data Tamer [30], also leveraged expert crowdsourcing to enhance machine computation, but in the context of data curation. As discussed in Section 1, the key difference between our proposition and Data Tamer lies in the fact that the domain experts in our context are also users/reviewers of the system. Thus, the experts are likely to take ownership and hence are motivated to improve the accuracy of the analytics and the usability of the applications. This would reduce the need to localize/customize the system. Also, each system needs to address a different set of challenges, since the targeted applications are different.

Active learning. In the active learning model, the data come unlabeled but the goal is to ultimately learn a classifier (e.g., [17; 29; 32]). The idea is to query the labels of just a few points that are especially informative in order to obtain an accurate classifier. The labels are obtained from highly-trained experts (e.g., doctors).
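One common way to operationalize "especially informative" is uncertainty sampling, which queries the unlabeled points closest to the decision boundary. A minimal sketch follows; the function name and toy probabilities are ours, not from any of the cited systems.

```python
def pick_queries(probs, k=3):
    """Uncertainty sampling: return the indices of the k unlabeled points
    whose predicted positive-class probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# Toy predicted readmission probabilities for six unlabeled patients
probs = [0.95, 0.52, 0.10, 0.45, 0.30, 0.61]
print(pick_queries(probs))  # [1, 3, 5] -- the most ambiguous cases
```

As the discussion below notes, these boundary cases are exactly the ones experts find hardest, which is why such a system must surface supporting information alongside each query.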
The scope of our proposition is much more general than active learning in the following respects. First, we would like to exploit as much domain knowledge from experts as possible, not restricting ourselves to only the class labels as in active learning. For instance, rules and hypotheses provided by experts with many years of experience should be exploited in many cases. Second, active learning focuses on getting a better classifier, so the query points presented to the crowd are usually those data points that lie at the boundary of the separating plane. However, these are also the data points that the experts are usually not very clear about. As such, we need to be able to identify additional information that should be provided for the experts to be able to make an informed decision. Lastly, we need to handle a large amount of data, whereas existing solutions on active learning usually deal with small data sets.

7. CONCLUSION

Each of us is a subject-matter expert (SME) of our profession, and we carry with us a vast amount of knowledge and insights not captured by a structured system. This might explain the emergence of Knowledge Management systems. However, there are many rules and exceptional cases that can only be formulated by experts with many years of experience. Such rules, when properly coded, can help in facilitating contextual decision making. This paper envisions a more intelligent DBMS that captures such information or knowledge. At the core, the system is a hybrid human-machine database processing engine where the machine keeps the SMEs as part of the feedback loop to gather, infer, ascertain and enhance the database knowledge and processing. This paper discussed many open challenges that we need to tackle in order to build such a system.

8. ACKNOWLEDGEMENTS

This work was supported by the National Research Foundation, Prime Minister's Office, Singapore under Grant No. NRF-CRP8-2011-08. We thank Associate Professor Gerald C.H.
Koh and Dr. Chuen Seng Tan (Saw Swee Hock School of Public Health, National University Health System) for sharing with us domain knowledge in healthcare.

9. REFERENCES

[1] http://www.comp.nus.edu.sg/∼epic.
[2] The comprehensive IT infrastructure for data-intensive applications and analysis project. http://www.comp.nus.edu.sg/∼ciidaa/.
[3] IBM big data for healthcare. http://www.ibm.com.
[4] The minority report: Chicago's new police computer predicts crimes, but is it racist? http://www.theverge.com/2014/2/19/5419854/the-minority-report-this-computer-predicts-crime-but-is-it-racist.
[5] National University Health System. http://www.nuhs.edu.sg/.
[6] Unified Medical Language System. http://www.nlm.nih.gov/research/umls/.
[7] N. Allaudeen, J. L. Schnipper, E. J. Orav, R. M. Wachter, and A. R. Vidyarthi. Inability of providers to predict unplanned readmissions. J Gen Intern Med, 26(7):771–776, 2011.
[8] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. C. Ooi, H. T. Vo, S. Wu, and Q. Xu. ES2: A cloud data storage system for supporting both OLTP and OLAP. In ICDE, pages 291–302, 2011.
[9] G. Chen, K. Chen, D. Jiang, B. C. Ooi, L. Shi, H. T. Vo, and S. Wu. E3: An elastic execution engine for scalable data processing. JIP, 20(1):65–76, 2012.
[10] G. Chen, H. Jagadish, D. Jiang, D. Maier, B. Ooi, K. Tan, and W. Tan. Federation in cloud data management: Challenges and opportunities. TKDE, 2014.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1), Jan. 2008.
[12] J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976–987, 2014.
[13] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: Answering queries with crowdsourcing. In SIGMOD Conference, pages 61–72, 2011.
[14] C. Friedman, P. O. Alderson, J. H. Austin, J. J. Cimino, and S. B. Johnson. A general natural-language text processor for clinical radiology. JAMIA, 1(2):161–174, 1994.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[16] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[18] A. Hosseinzadeh, M. T. Izadi, A. Verma, D. Precup, and D. L. Buckeridge. Assessing the predictability of hospital readmission using machine learning. In IAAI, 2013.
[19] D. Jiang, G. Chen, B. C. Ooi, K.-L. Tan, and S. Wu. epiC: An extensible and scalable system for processing big data. In PVLDB, 2014.
[20] P. S. Keenan, S.-L. T. Normand, Z. Lin, E. E. Drye, K. R. Bhat, J. S. Ross, J. D. Schuur, B. D. Stauffer, S. M. Bernheim, A. J. Epstein, Y. Wang, J. Herrin, J. Chen, J. J. Federer, J. A. Mattera, Y. Wang, and H. M. Krumholz. An administrative claims measure suitable for profiling hospital performance on the basis of 30-day all-cause readmission rates among patients with heart failure. Circ Cardiovasc Qual Outcomes, 1(1):29–37, 2008.
[21] K. S. Kumar, P. Triantafillou, and G. Weikum. Human computing games for knowledge acquisition. In CIKM, pages 2513–2516, 2013.
[22] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: A crowdsourcing data analytics system. PVLDB, 5(10):1040–1051, 2012.
[23] A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, pages 211–214, 2011.
[24] A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: Declarative crowdsourcing. In CIKM, pages 1203–1212, 2012.
[25] S. Perera, A. Sheth, K. Thirunarayan, S. Nair, and N. Shah. Challenges in understanding clinical notes: Why NLP engines fall short and where background knowledge can help. In CIKM Workshop, 2013.
[26] W. Raghupathi and V. Raghupathi. Big data analytics in healthcare: Promise and potential. Health Information Science and Systems, 2014.
[17] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu.
Batch mode active learning and its application to medical image classification. In ICML, pages 417–424, 2006. SIGKDD Explorations [27] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. K. Schuler, and C. G. Chute. Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications. JAMIA, 17(5):507–513, 2010. [28] T. K. Sean Goldberg, Daisy Zhe Wang. Castle: Crowdassisted system for textual labeling & extraction. HCOM, 2013. [29] B. Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2010. [30] M. Stonebraker, D. Bruckner, I. Ilyas, G. Beskales, M. Cherniack, S. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013. [31] J. Sun, J. Hu, D. Luo, M. Markatou, F. Wang, S. Edabollahi, S. E. Steinhubl, Z. Daar, and W. F. Stewart. Combining knowledge and data driven insights for identifying risk factors using electronic health records. In AMIA, 2012. [32] J. Wiens and J. Guttag. Active learning applied to patientadaptive heartbeat classification. In NIPS, pages 2442–2450, 2010. Volume 16, Issue 1 Page 46 On Power Law Distributions in Large-scale Taxonomies Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier and Massih-Reza On Power Law Distributions in Large-scale Taxonomies Amini Université Grenoble Alpes, CNRS F-38000 Grenoble, France [email protected], [email protected], [email protected], Rohit Babbar, Cornelia Metzig, Ioannis Partalas, Eric Gaussier and Massih-Reza [email protected],Amini [email protected] Université Grenoble Alpes, CNRS F-38000 Grenoble, France ABSTRACT erarchy tree in large-scale taxonomies with the goal of [email protected], [email protected], [email protected], elling the process of their evolution. 
This is undertaken In many of the large-scale physical and social complex [email protected], [email protected] by a quantitative study of the evolution of large-scale taxtems phenomena fat-tailed distributions occur, for which different generating mechanisms have been proposed. In this paper, we study models of generating power law distribuABSTRACT tions in the evolution of large-scale taxonomies such as Open In many ofProject, the large-scale physical and social complex Directory which consist of websites assigned to sysone tems phenomena fat-tailed distributions for which difof tens of thousands of categories. Theoccur, categories in such ferent generating mechanisms haveorbeen In conthis taxonomies are arranged in tree DAGproposed. structured paper, we study of generating lawthem. distribufigurations havingmodels parent-child relationspower among We tions quantitatively in the evolutionanalyse of large-scale taxonomies such as first the formation process of Open such Directory Project, of websites assigned as to one taxonomies, whichwhich leads consist to power law distribution the of tens of thousands of categories. The ofcategories such stationary distributions. In the context designinginclassitaxonomies are arranged in treewhich or DAG structuredassign confiers for large-scale taxonomies, automatically figurations having to parent-child relations among them. how We unseen documents leaf-level categories, we highlight firstfat-tailed quantitatively formation can process of such the natureanalyse of thesethe distributions be leveraged taxonomies, which leads power law distribution the to analytically study the to space complexity of such as classistationary distributions. of designingonclassifiers. Empirical evaluationInofthe thecontext space complexity pubfiers available for large-scale taxonomies, which licly datasets demonstrates theautomatically applicability assign of our unseen documents to leaf-level categories, we highlight how approach. 
the fat-tailed nature of these distributions can be leveraged to analytically study the space complexity of such classi1. INTRODUCTION fiers. Empirical evaluation of the space complexity on pubWith the tremendous of datathe on applicability the web fromof varlicly available datasets growth demonstrates our ious sources such as social networks, online business serapproach. vices and news networks, structuring the data into conceptual taxonomies leads to better scalability, interpretability 1. INTRODUCTION and visualization. Yahoo! directory, the open directory With the(ODP) tremendous growth ofare data on the web from varproject and Wikipedia prominent examples of ious sources such as social networks, online business sersuch web-scale taxonomies. The Medical Subject Heading vices and news data into concephierarchy of thenetworks, Nationalstructuring Library of the Medicine is another tual taxonomies leads to better scalability, instance of a large-scale taxonomy in the interpretability domain of life and visualization. Yahoo! directory, the open directory sciences. These taxonomies consist of classes arranged in project (ODP)structure and Wikipedia are prominent examples of a hierarchical with parent-child relations among such web-scale taxonomies. Subject them and can be in the formThe of a Medical rooted tree or a Heading directed hierarchy of the National Librarywhich of Medicine another acyclic graph. ODP for instance, is in theisform of a instancetree, of alists large-scale taxonomy in the domain among of life rooted over 5 million websites distributed sciences. consist of classes arranged in close to 1 These milliontaxonomies categories and is maintained by close to a hierarchical with parent-child relations among 100,000 humanstructure editors. Wikipedia, on the other hand, repthem and can be in the form of a rooted or a directed resents a more complicated directed graphtree taxonomy strucacyclic graph. 
ODP fora instance, which is in of a ture consisting of over million categories. Inthe thisform context, rooted tree,hierarchical lists over 5 classification million websites distributed among large-scale deals with the task of close to 1 million categories is maintained by close automatically assigning labelsand to unseen documents fromto a 100,000 human editors. Wikipedia, on the other repset of target classes which are represented by thehand, leaf level resentsina the more complicated directed graph taxonomy strucnodes hierarchy. ture consisting over athe million categories. In this In this work, weofstudy distribution of data andcontext, the hilarge-scale hierarchical classification deals with the task of automatically assigning labels to unseen documents from a set of target classes which are represented by the leaf level nodes in the hierarchy. In this work, we study the distribution of data and the hi- SIGKDD Explorations onomy using models of preferential attachment, based on the famous model proposed by Yule [33] and showing that erarchy treethe in large-scale taxonomies with theexhibits goal of amodthroughout growth process, the taxonomy fatelling process ofWe their evolution. This isto undertaken tailed the distribution. apply this reasoning both cateby a quantitative study of the evolution of large-scale taxgory sizes and tree connectivity in a simple joint model. onomy using modelsvariable of preferential attachment, on Formally, a random X is defined to followbased a power the model proposed by Yuleconstant [33] anda,showing that law famous distribution if for some positive the complethroughout the growth process, the exhibits a fatmentary cumulative distribution is taxonomy given as follows: tailed distribution. We apply this reasoning to both cateP (X > x) ∝ in x−aa simple joint model. 
gory sizes and tree connectivity Formally, a random variable X is defined to follow a power Power law distributions, more generally fat-tailed dislaw distribution if for someorpositive constant a, the completributions that decaydistribution slower thanisGaussians, are found in a mentary cumulative given as follows: wide variety of physical and social complex systems, ranging −a (X > x) ∝ xof from city population,Pdistribution wealth to citations of scientific articles [23]. It is also found in network connectivPower lawthe distributions, more generally fat-tailed disity, where internet andorWikipedia are prominent examtributions slower Gaussians, are foundwebin a ples [27; 7].that Ourdecay analysis in than the context of large-scale wide variety leads of physical and social complex systems, taxonomies to a better understanding of suchranging largefrom data, city population, distribution of wealth to citations of scale and also leveraged in order to present a concrete scientific articles It is alsofor found in network connectivanalysis of space[23]. complexity hierarchical classification ity, where Due the internet and increasing Wikipediascale are prominent schemes. to the ever of trainingexamdata ples [27;terms 7]. Our analysis in the context of large-scale size in of the number of documents, feature setwebsize taxonomies to a classes, better understanding of such of largeand numberleads of target the space complexity the scale data, and also leveraged in order present a concrete trained classifiers plays a crucial role intothe applicability of analysis of space complexity hierarchical classification classification systems in manyfor applications of practical imschemes. Due to the ever increasing scale of training data portance. 
size in terms of the number of presented documents, set prosize The space complexity analysis in feature this paper and of target classes, the space complexity of the videsnumber an analytical comparison of the trained model for hitrained classifiers a crucialwhich role incan thebeapplicability of erarchical and flat plays classification, used to select classification systems many applications of practical imthe appropriate modelina-priori for the classification probportance. lem at hand, without actually having to train any modThe Exploiting space complexity analysis presented in this paper proels. the power law nature of taxonomies to study vides an analytical comparisonforofhierarchical the trained Support model forVechithe training time complexity erarchical andhas flat been classification, which can19]. be used select tor Machines performed in [32; The to authors the appropriate model a-priori for the classification probtherein justify the power law assumption only empirically, lem at our hand, without to we train any modunlike analysis in actually Section 3having wherein describe the els. Exploiting the of power law nature taxonomiesmore to study generative process large-scale web of taxonomies conthe training complexity for processes hierarchical Support Veccretely, in thetime context of similar studied in other tor Machines has the beenimportant performedinsights in [32; of 19].[32; The models. Despite 19],authors space therein justify law assumption only complexity has the not power been treated formally so far.empirically, unlike our analysis in paper Section 3 wherein we describe The remainder of this is as follows. Related workthe on generative process of large-scale web taxonomies more conreporting power law distributions and on large scale hierarcretely,classification in the context of similar in processes in other chical is presented Section studied 2. In Section 3, models. 
thegrowth important insights of [32; 19], space we recall Despite important models and quantitatively juscomplexity has not of been treated far. tify the formation power lawsformally as they so are found in hiThe remainder of this paper is as follows. Related work on reporting power law distributions and on large scale hierarchical classification is presented in Section 2. In Section 3, we recall important growth models and quantitatively justify the formation of power laws as they are found in hi- Volume 16, Issue 1 Page 47 erarchical large-scale web taxonomies by studying the evolution dynamics that generate them. More specifically, we present a process that jointly models the growth in the size of categories, as well as the growth of the hierarchical tree structure. We derive from this growth model why the class erarchical large-scale taxonomies by hierarchy studying the size distribution at a web given level of the alsoevoexlution dynamics generate them. More we specifically, we hibits power law that decay. Building on this, then appeal present a process jointly models thethe growth in the size to Heaps’ law in that Section 4, to explain distribution of of categories, as categories well as thewhich growth theexploited hierarchical tree features among is of then in Secstructure. derive the fromspace this growth model the class tion 5 for We analysing complexity forwhy hierarchical size distribution at a given level of isthe hierarchyvalidated also exclassification schemes. The analysis empirically hibits poweravailable law decay. Building on from this, we on publicly DMOZ datasets the then Largeappeal Scale 1 to Heaps’ lawText in Section 4, to explain the (LSHTC) distribution of and Hierarchical Classification Challenge 2 featuresdata among categories which Intellectual is then exploited in Secfrom World Property Orpatent (IPC) tion 5 for analysing the space complexity hierarchical ganization. Finally, Section 6 concludes thisfor work. classification schemes. 
The analysis is empirically validated on publicly available DMOZ datasets from the Large Scale 2. RELATED WORK Challenge (LSHTC)1 and Hierarchical Text Classification 2 Power law distributions reported in a wide varietyOrof World Intellectual Property patent data (IPC) fromare physical andFinally, social complex [22], such as in interganization. Section 6systems concludes this work. net topologies. For instance [11; 7] showed that internet topologies exhibit power laws with respect to the in-degree 2. of theRELATED nodes. AlsoWORK the size distribution of website catePower law distributions reportedofinwebsites, a wide exhibits variety of gories, measured in termsare of number a physical and social complex systems [22], such as inininterfat-tailed distribution, as empirically demonstrated [32; net topologies. instanceProject [11; 7] (ODP). showed Various that internet 19] for the OpenFor Directory modtopologies exhibit powerfor laws respectpower to thelaw in-degree els have been proposed thewith generation distriof the nodes. Also the that size may distribution catebutions, a phenomenon be seen ofas website fundamental gories, measured in terms number of websites,inexhibits a in complex systems as theofnormal distribution statistics fat-tailed distribution, as to empirically demonstrated in [32; [25]. However, in contrast the straight-forward derivation 19] for thedistribution Open Directory Variousmodels modof normal via theProject central (ODP). limit theorem, els have been proposed for the generation power law distriexplaining power law formation all rely on an approximabutions, a phenomenon may be as fundamental tion. Some explanationsthat are based onseen multiplicative noise in systems as thegroup normal distribution in statistics or complex on the renormalization formalism [28; 30; 16]. 
For [25].growth However, in contrast to the straight-forward derivation the process of large-scale taxonomies, models based of distribution via the limit theorem, models on normal preferential attachment arecentral most appropriate, which are explaining formation on on an the approximaused in thispower paper.law These modelsall arerely based seminal tion. Some explanations are based on multiplicative noise model by Yule [33], originally formulated for the taxonomy or on the renormalization group formalism 30; 16]. For of biological species, detailed in section 3. It[28; applies to systhe growth based tems where process elementsofoflarge-scale the systemtaxonomies, are groupedmodels into classes, on preferential appropriate, which and are and the systemattachment grows bothare in most the number of classes, used this number paper. These models are based on the seminal in theintotal of elements (which are here documents model by Yule In [33], formulated the taxonomy or websites). itsoriginally original form, Yule’sfor model serves as of biological for species, in section 3. Ittaxonomy, applies to irresysexplanation powerdetailed law formation in any tems where elements of hierarchy the systemamong are grouped into classes, spective of an eventual categories. Similar and the system grows bothtoinexplain the number and dynamics have been applied scalingofinclasses, the connecin the of total numberwhich of elements (which here documents tivity a network, grows in termsare of nodes and edges or websites). its original[2]. form, Yule’s modelgeneralizaserves as via preferentialInattachment Recent further explanation for power law formation taxonomy, tions apply the same growth processintoany trees [17; 14; irre29]. spective of an eventual among categories. 
In this paper, describe hierarchy the approximate power-lawSimilar in the dynamics have been applied to explain scaling in the child-to-parent category relations by the model by connecKlemm tivity a network, whichwe grows in terms nodes and edges et al. of [17]. Furthermore, combine thisofformation process viaa preferential attachment Recent in simple manner with the [2]. original Yulefurther model generalizain order to tions apply same law growth process sizes, to trees 29]. explain also the a power in category i.e. [17; we 14; provide In this paper, describe the approximate power-law in the a comprehensive explanation for the formation process of child-to-parent relations byDMOZ. the model by the Klemm large-scale web category taxonomies such as From secet al. we [17]. Furthermore, we combine this for formation process ond, infer a third scaling distribution the number of in a simple with theisoriginal model in order to features permanner category. This done viaYule the empirical Heaps’s explain a power law in i.e. we between provide law [10],also which describes thecategory scaling sizes, relationship a comprehensive explanation for the formation process of text length and the size of its vocabulary. large-scale taxonomies such as DMOZ. From the secSome of theweb earlier works on exploiting hierarchy among tarond, we infer a third scaling distribution for the number of 1 http://lshtc.iit.demokritos.gr/ features per category. This is done via the empirical Heaps’s 2 http://web2.wipo.int/ipcpub/ law [10], which describes the scaling relationship between text length and the size of its vocabulary. Some of the earlier works on exploiting hierarchy among tar- get classes for the purpose of text classification have been studied in [18; 6] and [8] wherein the number of target classes were limited to a few hundreds. However, the work by [19] is among the pioneering studies in hierarchical classification towards addressing web-scale directories such as Yahoo! 
directory, consisting of over 100,000 target classes. The authors analyse the performance with respect to accuracy and training time complexity for flat and hierarchical classification. More recently, other techniques for large-scale hierarchical text classification have been proposed. Prevention of error propagation by applying Refined Experts trained on a validation set was proposed in [4]. In this approach, bottom-up information propagation is performed by utilizing the output of the lower level classifiers in order to improve classification at the top level. The deep classification method proposed in [31] first applies hierarchy pruning to identify a much smaller subset of target classes.
Prediction of a test instance is then performed by re-training a Naive Bayes classifier on the subset of target classes identified in the first step. More recently, Bayesian modelling of large-scale hierarchical classification has been proposed in [15], in which hierarchical dependencies between the parent-child nodes are modelled by centring the prior of the child node at the parameter values of its parent.

In addition to prediction accuracy, other metrics of performance such as prediction and training speed, as well as the space complexity of the model, have become increasingly important. This is especially true in the context of challenges posed by problems in the space of Big Data, wherein an optimal trade-off among such metrics is desired. The significance of prediction speed in such scenarios has been highlighted in recent studies such as [3; 13; 24; 5]. Prediction speed is directly related to the space complexity of the trained model, as it may not be possible to load a large trained model in main memory due to its sheer size. Despite its direct impact on prediction speed, no earlier work has focused on the space complexity of hierarchical classifiers.
Additionally, while the existence of power law distributions has been used for analysis purposes in [32; 19], no thorough justification is given for the existence of such a phenomenon. Our analysis in Section 3 attempts to address this issue in a quantitative manner. Finally, power law semantics have been used for model selection and evaluation of large-scale hierarchical classification systems [1]. Unlike problems studied in the classical machine learning sense, which deal with a limited number of target classes, this application forms a blueprint for extracting hidden information in big data.

3. POWER LAW IN LARGE-SCALE WEB TAXONOMIES

We begin by introducing the complementary cumulative size distribution for category sizes. Let N_i denote the size of category i (in terms of number of documents); then the probability that N_i > N is given by
P(N_i > N) ∝ N^(−β)     (1)

where β > 0 denotes the exponent of the power law distribution.³ Empirically, it can be assessed by plotting the rank of a category's size against its size (see Figure 1). The derivative of this distribution, the category size probability density p(N_i), then also follows a power law, with exponent (β + 1), i.e. p(N_i) ∝ N_i^(−(β+1)).

Two of our empirical findings are a power law for both the complementary cumulative category size distribution and the counter-cumulative in-degree distribution, shown in Figures 1 and 2, for the LSHTC2-DMOZ dataset, which is a subset of ODP. The dataset⁴ contains 394,000 websites and 27,785 categories. The number of categories at each level of the hierarchy is shown in Figure 3.

Figure 1: Category size vs rank distribution for the LSHTC2-DMOZ dataset (β = 1.1).

Figure 2: Indegree vs rank distribution for the LSHTC2-DMOZ dataset (γ = 1.9).

Figure 3: Number of categories at each level in the hierarchy of the LSHTC2-DMOZ database.

We explain the formation of these two laws via models by Yule [33] and a related model by Klemm [17], detailed in sections 3.1 and 3.2, which are then related in section 3.3.

3.1 Yule's model

Yule's model describes a system that grows in two quantities: in elements, and in classes to which the elements are assigned. It assumes that for a system having κ classes, the probability that a new element will be assigned to a certain class i is proportional to its current size,

p(i) = N_i / Σ_{i′=1}^{κ} N_{i′}     (2)

3 To avoid confusion, we denote the power law exponents for the in-degree distribution and the feature size distribution γ and δ.
4 http://lshtc.iit.demokritos.gr/LSHTC2 datasets

SIGKDD Explorations, Volume 16, Issue 1
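The empirical reading of Equation 1 (plotting rank against size in log-log space and taking the slope as −β) can be sketched in a few lines. The sketch below is an illustration on synthetic Pareto-distributed sizes, not the authors' actual estimation pipeline; the sample size and fitting range are arbitrary choices of ours.

```python
import math
import random

def ccdf_exponent(sizes):
    """Estimate beta in P(N_i > N) ~ N^(-beta): one point per unique size,
    using the empirical CCDF (fraction of categories strictly larger than N),
    then a least-squares fit of log P(> N) against log N."""
    sizes = sorted(sizes)
    total = len(sizes)
    pts = []
    i = 0
    while i < total:
        j = i
        while j < total and sizes[j] == sizes[i]:
            j += 1                       # group ties: one CCDF point per unique size
        if total - j > 0 and sizes[i] > 0:
            pts.append((math.log(sizes[i]), math.log((total - j) / total)))
        i = j
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    return -slope

# synthetic category sizes with a true tail exponent of 1.1, as in Figure 1
random.seed(0)
sample = [int(random.paretovariate(1.1)) for _ in range(20000)]
beta_hat = ccdf_exponent(sample)
```

On such a sample the fitted exponent lands in the vicinity of the true value 1.1; small sizes deviate from the asymptotic line, which is one reason empirical plots such as Figure 1 are never perfectly straight.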
It further assumes that for every m elements that are added to the pre-existing classes in the system, a new class of size 1 is created.⁵ The described system is constantly growing in terms of elements and classes, so strictly speaking, a stationary state does not exist [20]. However, a stationary distribution, the so-called Yule distribution, has been derived using the approach of the master equation, with similar approximations, by [26; 23; 17]. Here, we follow Newman [23], who considers as one time-step the duration between the creation of two consecutive classes. From this it follows that the average number of elements per class is always m + 1, and that the system contains κ(m + 1) elements at a moment when the number of classes is κ. Let p_{N,κ} denote the fraction of classes having N elements when the total number of classes is κ.
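Before deriving the stationary distribution, note that this growth rule (Equation 2 plus one new class for every m added elements) is straightforward to simulate; the sketch below checks that the average class size approaches m + 1. The function name and parameters are ours, for illustration only.

```python
import random

def simulate_yule(m, steps, seed=0):
    """Grow a Yule system: in each step, m elements join existing classes with
    probability proportional to class size (Equation 2), after which one new
    class of size 1 is created."""
    rng = random.Random(seed)
    sizes = [1]      # size of each class; start with a single class of size 1
    members = [0]    # one entry per element holding its class index, so that
                     # a uniform draw from this list is size-proportional
    for _ in range(steps):
        for _ in range(m):
            i = rng.choice(members)
            sizes[i] += 1
            members.append(i)
        sizes.append(1)
        members.append(len(sizes) - 1)
    return sizes

sizes = simulate_yule(m=9, steps=3000)
avg_size = sum(sizes) / len(sizes)        # approaches m + 1 = 10
frac_ones = sizes.count(1) / len(sizes)   # compare with the stationary (m+1)/(2m+1)
```

The flat `members` list is the usual trick for size-proportional sampling: drawing an element uniformly and attaching to its class implements Equation 2 exactly.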
Between two successive time instances, the probability for a pre-existing class i of size N_i to gain a new element is given by mN_i/(κ(m + 1)). Since there are κ p_{N,κ} classes of size N, the expected number of such classes which gain a new element (and grow to size (N + 1)) is given by:

κ p_{N,κ} · mN/(κ(m + 1)) = m/(m + 1) · N p_{N,κ}     (3)

The number of classes with N websites is thus fewer by the above quantity, but some classes which had (N − 1) websites prior to the addition of a new class now have one more website. This step, depicting the change of the state of the system from κ classes to (κ + 1) classes, is shown in Figure 4. Therefore, the expected number of classes with N documents when the number of classes is (κ + 1) is given by the following equation:

(κ + 1) p_{N,(κ+1)} = κ p_{N,κ} + m/(m + 1) · [(N − 1) p_{(N−1),κ} − N p_{N,κ}]     (4)

The first term on the right hand side of Equation 4 corresponds to classes with N documents when the number of classes is κ.
The second term corresponds to the contribution from classes of size (N − 1) which have grown to size N; this is shown by the left arrow (pointing rightwards) in Figure 4. The last term corresponds to the decrease resulting from classes which have gained an element and have become of size (N + 1); this is shown by the right arrow (pointing rightwards) in Figure 4. The equation for the class of size 1 is given by:

(κ + 1) p_{1,(κ+1)} = κ p_{1,κ} + 1 − m/(m + 1) · p_{1,κ}     (5)

As the number of classes (and therefore the number of elements κ(m + 1)) in the system increases, the probability that a new element is classified into a class of size N, given by Equation 3, is assumed to remain constant and independent of κ. Under this hypothesis, the stationary distribution for class sizes can be determined by solving Equation 4, using Equation 5 as the initial condition. This is given by

p_N = (1 + 1/m) B(N, 2 + 1/m)     (6)

5 The initial size may be generalized to other small sizes; for instance, Tessone et al. [29] consider entrant classes with sizes drawn from a truncated power law.
where B(·, ·) is the beta function. Equation 6 has been termed the Yule distribution [26]. Written for a continuous variable N, it has a power law tail:

p(N) ∝ N^(−2−1/m)

From the above equation, the exponent of the density function is between 2 and 3. Its cumulative size distribution P(N_k > N), as given by Equation 1, has an exponent given by

β = 1 + 1/m     (7)

which is between 1 and 2. The higher the frequency 1/m at which new classes are introduced, the bigger β becomes, and the lower the average class size. This exponent is stable over time, although the taxonomy is constantly growing.
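The derivation can be double-checked numerically: iterating the master equation (Equations 4 and 5) from a single class of size 1 settles onto the Yule distribution of Equation 6, whose log-log tail slope approaches −(2 + 1/m). The sketch below is ours; the truncation bounds and m = 9 are arbitrary, and the beta function is evaluated through log-gamma to avoid overflow at large N.

```python
from math import exp, lgamma, log

def log_yule_pmf(N, m):
    """Log of Equation 6, p_N = (1 + 1/m) B(N, 2 + 1/m), via log-gamma,
    since B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b) overflows for large N."""
    b = 2 + 1 / m
    return log(1 + 1 / m) + lgamma(N) + lgamma(b) - lgamma(N + b)

def master_equation_pmf(m, kappa_max=5000, n_max=100):
    """Iterate Equations 4 and 5 from a single class of size 1; the fractions
    p_N converge to the stationary Yule distribution (truncated at n_max)."""
    p = [0.0] * (n_max + 1)
    p[1] = 1.0
    c = m / (m + 1)
    for kappa in range(1, kappa_max):
        q = [0.0] * (n_max + 1)
        q[1] = (kappa * p[1] + 1 - c * p[1]) / (kappa + 1)    # Equation 5
        for N in range(2, n_max + 1):
            q[N] = (kappa * p[N]
                    + c * ((N - 1) * p[N - 1] - N * p[N])) / (kappa + 1)  # Equation 4
        p = q
    return p

m = 9
p = master_equation_pmf(m)
closed_form = [exp(log_yule_pmf(N, m)) for N in range(1, 21)]
# local log-log slope of Equation 6 at large N; close to -(2 + 1/m)
tail_slope = (log_yule_pmf(2000, m) - log_yule_pmf(1000, m)) / (log(2000) - log(1000))
```

At the fixed point, Equation 5 gives p_1 = (m + 1)/(2m + 1), which agrees with Equation 6 evaluated at N = 1; the iteration reproduces this and the rest of the distribution.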
Figure 4: Illustration of Equation 4. Individual classes grow constantly, i.e., move to the right over time, as indicated by arrows. A stationary distribution means that the height of each bar remains constant.

Table 1: Summary of notation used in Section 3.

Variables:
  N_i: number of elements in class i
  dg_i: number of subclasses of class i
  d_i: number of features of class i
  κ: total number of classes
  DG: total number of in-degrees (= subcategories)
  p_{N,κ}: fraction of classes having N elements when the total number of classes is κ
Constants:
  m: number of elements added to the system after which a new class is added
  w ∈ [0, 1]: probability that attachment of subcategories is preferential
Indices:
  i: index for the class

3.2 Preferential attachment models for networks and trees

A similar model has been formulated for network growth by Barabási and Albert [2], which explains the formation of a power law distribution in the connectivity degree of nodes. It assumes that networks grow in terms of nodes and edges, and that every newly added node connects with a fixed number of edges to existing nodes. Attachment is again preferential, i.e. the probability for a newly added node i to connect to a certain existing node j is proportional to the number of existing edges of node j.
A node in the Barabási-Albert (BA) model corresponds to a class in Yule's model, and a new edge to a newly assigned element. Every added edge counts both towards the degree of an existing node j and towards that of the newly added node i. For this reason the existing node j and the newly added node i always grow by the same number of edges, implying m = 1 and consequently β = 2 in the BA-model, independently of the number of edges that each new node creates.

The seminal BA-model has been extended in many ways. For hierarchical taxonomies, we use a preferential attachment model for trees by [17]. The authors considered growth via directed edges, and explain power law formation in the in-degree, i.e. in the edges directed from children to parent in a tree structure. In contrast to the BA-model, newly added nodes and existing nodes do not increase their in-degree by the same amount, since new nodes start with an in-degree of 0. Leaf nodes thus cannot attract attachment of nodes, and preferential attachment alone cannot lead to a power law. A small random term ensures that some nodes attach to existing ones independently of their degree, which is analogous to the start of a new class in the Yule model.
The probability v that a new node attaches as a child to the existing node i with in-degree dg_i becomes

v(i) = w (dg_i − 1)/DG + (1 − w) (1/DG),     (8)

where DG is the size of the system measured in the total number of in-degrees. w ∈ [0, 1] denotes the probability that the attachment is preferential, and (1 − w) the probability that it is random to any node, independently of its number of in-degrees. As has been done for the Yule process [26; 23; 14; 29], the stationary distribution is again derived via the master Equation 4. The exponent of the asymptotic power law in the in-degree distribution is β = 1 + 1/w. This model is suitable to explain scaling properties of the tree or network structure of large-scale web taxonomies, which have also been analysed empirically, for instance for subcategories of Wikipedia [7]. It has also been applied to directory trees in [14].
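A common way to implement this mixed mechanism is to draw from a pool holding one entry per existing child edge (the preferential term of Equation 8) with probability w, and uniformly over all nodes otherwise. The sketch below is our illustrative reading of the tree model of [17], not the authors' code; the size and w = 0.5 are arbitrary.

```python
import random

def grow_tree(n_nodes, w, seed=1):
    """Grow a directed tree: each new node becomes the child of an existing
    node chosen preferentially by in-degree with probability w, and uniformly
    at random with probability 1 - w."""
    rng = random.Random(seed)
    indegree = [0]   # number of children per node; start from a single root
    edge_pool = []   # parent index repeated once per existing child edge
    for _ in range(1, n_nodes):
        if edge_pool and rng.random() < w:
            parent = rng.choice(edge_pool)          # preferential term
        else:
            parent = rng.randrange(len(indegree))   # random term
        indegree[parent] += 1
        edge_pool.append(parent)
        indegree.append(0)
    return indegree

indegree = grow_tree(30000, w=0.5)
leaf_fraction = indegree.count(0) / len(indegree)
```

As the text notes, the uniform term is essential: without it, every node other than the root would keep an in-degree of 0 forever, and no heavy tail could form.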
3.3 Model for hierarchical web taxonomies

We now apply these models to large-scale web taxonomies like DMOZ. Empirically, we uncovered two scaling laws: (a)
We grows suggest a combination of but a tree structure, which itself in two quantithe processes in subsections 3.2 to deties:two in the numberdetailed nodes (categories) and3.1 in and the number of scribe the growth websites are continuously added in-degrees of nodesprocess: (child-to-parent links, i.e. subcategoryto the system, and Based classified intorules categories by human refto-category links). on the for voluntary referees erees. the same the categories a mere set, of the At DMOZ how time, to classify websites, are we not propose a simbut combined form a treedescription structure, which itselfAltogether, in two quantiple of the grows process. the ties: in the number nodes (categories) and in the number of database grows in three quantities: in-degrees of nodes (child-to-parent links, i.e. subcategoryto-category Based onNew the rules for voluntary referees (i) Growthlinks). in websites. websites are assigned into of thecategories DMOZ how to classify websites, a sim5). i, with probability p(i) we ∝ propose Ni (Figure ple combined description of theindependently process. Altogether, the This assignment happens of the hierdatabase grows quantities: archy levelinofthree category. However, only leaf categories may receive documents. (i) Growth in websites. New websites are assigned into categories i, with probability p(i) ∝ Ni (Figure 5). This assignment happens independently of the hierarchy level of category. However, only leaf categories may receive documents. Figure 5: (i): A website is assigned to existing categories with p(i) ∝ Ni . (ii) Growth in categories. With probability 1/m, the refFigureerees 5: (i): A website is assigned to existing assign a website into a newly createdcategories category, . with p(i) ∝ N i at any level of the hierarchy (Figure 6). This assumption would suffice to create a power law in category size distribution, but since a 1/m, tree-structure (ii) the Growth in categories. 
This assumption would suffice to create a power law in the category size distribution, but since a tree-structure among categories exists, we also assume that the event of category creation also attaches at particular places to the tree structure. The probability v(i) that a category is created as the child of a certain parent category i can depend in addition on the in-degree dg_i of that category (see Equation 9).

Figure 6: (ii): Growth in categories is equivalent to growth of the tree structure in terms of in-degrees.

(iii) Growth in children categories. Finally, the hierarchy may also grow in terms of levels, since with a certain probability (1 − w), new children categories are assigned independently of the number of children, i.e. of the in-degree dg_i of the category i (Figure 7). Like in [17], the attachment probability to a parent i is

v(i) = w (dg_i − 1)/DG + (1 − w) η_i/DG.     (9)

Figure 7: (iii): Growth in children categories.
Equation 8, where η_i = 1, would suffice to explain the power law in in-degrees dg_i and in category sizes N_i. To link the two processes more plausibly, it can be assumed that the second term in Equation 9, denoting the assignment of new 'first children' categories, depends on the size N_i of the parent,

η_i = N_i / N,     (10)

since this is closer to the rules by which the referees create new categories, but it is not essential for the explanation of the power laws. It reflects that the bigger a leaf category, the higher the probability that referees create a child category when assigning a new website to it.

To summarize, the central idea of this joint model is to consider two measures for the size of a category: the number of its websites N_i (which governs the preferential attachment of new websites), and its in-degree, i.e. the number of its children dg_i, which governs the preferential attachment of new categories. To explain the power law in the category sizes, assumptions (i) and (ii) are the requirements. For the power law in the number of in-degrees, assumptions (ii) and (iii) are the requirements.
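A minimal simulation can make the joint model concrete. The sketch below is ours and simplifies in two ways that we flag explicitly: it ignores the leaf-only restriction of rule (i), and it implements the size-dependent term of Equations 9 and 10 as a size-proportional draw; names and parameter values are arbitrary.

```python
import random
from collections import Counter

def grow_taxonomy(n_sites, m, w, seed=2):
    """Sketch of rules (i)-(iii): websites attach to categories in proportion
    to size; every m-th website instead founds a new category of size 1, whose
    parent is chosen preferentially by in-degree with probability w and in
    proportion to category size otherwise (Equations 9 and 10). The leaf-only
    restriction of rule (i) is ignored here for brevity."""
    rng = random.Random(seed)
    size = [1]        # websites per category; one root category with one site
    site_cat = [0]    # category index of each website (size-proportional draws)
    child_pool = []   # parent index per child edge (in-degree-proportional draws)
    for t in range(1, n_sites):
        if t % m == 0:                                   # rules (ii) and (iii)
            if child_pool and rng.random() < w:
                parent = rng.choice(child_pool)
            else:
                parent = site_cat[rng.randrange(len(site_cat))]
            child_pool.append(parent)
            size.append(1)
            site_cat.append(len(size) - 1)
        else:                                            # rule (i)
            c = site_cat[rng.randrange(len(site_cat))]
            size[c] += 1
            site_cat.append(c)
    return size, Counter(child_pool)

size, indegree = grow_taxonomy(50000, m=10, w=0.9)
```

Both heavy tails emerge jointly in such runs: a few categories accumulate most websites, and a few parents accumulate most subcategories, mirroring the empirical Figures 1 and 2.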
The empirically found exponents governs preferential attachment of children i , which β = 1.1 dg and γ = 1.9 yield the a frequency of new categories new categories. To explain law (1 in − the 1/m=0.1 and a frequency of the newpower indegrees w)category = 0.9. sizes, assumptions (i) and (ii) are the requirements. For the power in the number of indegrees, assumptions (ii) and 3.4 law Other interpretations (iii) are of theassuming requirements. The empirically exponents Instead in Equations 9 and 10 found that referees deβ = to 1.1open and aγ single = 1.9child yieldcategory, a frequency new realistic categories cide it isofmore to 1/m=0.1 andan a frequency of new indegrees (1 − w) 0.9.or assume that existing category is restructured, i.e.=one several child categories are created, and websites are moved 3.4 Other interpretations into these new categories such that the parent category conInstead of assuming Equations 9 and detains less websites orin even none at all. 10If that one referees of the new cide to open a single child category, it isofmore realisticcatto children categories inherits all websites the parent assume thatFigure an existing category is restructured, i.e. one or egory (see 8), the Yule model applies directly. If several child categories are created, and websites arecontains moved the websites are partitioned differently, the model into theseshrinking new categories such thatThis the parent effective of categories. is not category describedconby tains less model, websites orthe even none Equation at all. If 4one of the only new the Yule and master considers children categories. categories inherits all it websites of the parent growing However, has been shown [29; cat21] egory (see Figure 8), the Yule model applies If that models including shrinking categories also directly. lead to the the websites partitioned differently, the model compaticontains formation of are power laws. Further generalizations effective shrinking of categories. 
This is not described by the Yule model, and the master Equation 4 considers only growing categories. However, it has been shown [29; 21] that models including shrinking categories also lead to the formation of power laws. Further generalizations compatible with power law formation are that new categories do not necessarily start with one document, and that the frequency of new categories does not need to be constant.

Figure 8: Model without and with shrinking categories. In the left figure, a child category inherits all the elements of its parent and takes its place in the size distribution.

Figure 9: Category size distribution for each level (levels 2-5) of the LSHTC2-DMOZ dataset.

3.5 Limitations

However, Figures 1 and 2 do not exhibit perfect power law decay, for several reasons. Firstly, the dataset is limited. Secondly, the hypothesis that the assignment probability (Equation 2) depends uniquely on the size of a category might be too strong for web directories, neglecting the change in importance of topics. In reality, big categories can exist which receive only few new documents or none at all.
Dorogovtsev and Mendes [9] have studied this problem by introducing an assignment probability that decays exponentially with age. They show that the stronger this decay, the steeper the power law; for strong decay, no power law forms. A last reason might be that referees re-structure categories in ways strongly deviating from the rules (i)-(iii).

3.6 Statistics per hierarchy level

The tree-structure of a database also allows one to study the sizes of the classes belonging to a given level of the hierarchy. As shown in Figure 3, the DMOZ database contains 5 levels of different size.
If only classes on a given level l of the hierarchy are considered, we equally found a power law in the category size distribution, as shown in Figure 9. Per-level power law decay has also been found for the in-degree distribution. This result may equally be explained by the model introduced above: Equations 2 and 9, respectively, are valid also if, instead of p(k), one considers the conditional probability p(l)p(i|l), where

    p(l) = Σ_{i'=1}^{κ_l} N_{i',l} / Σ_{i'=1}^{κ} N_{i'}

is the probability of assignment to a given level l, and

    p(i|l) = N_{i,l} / Σ_{i'=1}^{κ_l} N_{i',l}

is the probability of being assigned to a given class within that level. The formation process may be seen as a Yule process within a level if Σ_{i'=1}^{κ_l} N_{i',l} is used for the normalization in Equation 2, and this formation happens with the probability p(l) that a website gets assigned into level l.

Figure 11: Number of features vs rank distribution (power-law fit with δ = 1.9).
Thereby, the rate m_l at which new classes are created need not be the same for every level, and therefore the exponent of the power law fit may vary from level to level. Power law decay for the per-level class size distribution is a straightforward corollary of the described formation process, and will be used in Section 5 to analyse the space complexity of hierarchical classifiers.

4. RELATION BETWEEN CATEGORY SIZE AND NUMBER OF FEATURES

Having explained the formation of two scaling laws in the database, a third one has been found for the number of features d_i in each category, G(d) (see Figures 11 and 12). This is a consequence of the category size distribution (shown in Figure 1) in combination with another power law, termed Heaps' law [10]. This empirical law states that the number of distinct words R in a document is related to the length n of a document as follows

    R(n) = K n^α,   (11)

where the empirical α is typically between 0.4 and 0.6. For the LSHTC2-DMOZ dataset, Figure 10 shows that for the collection of words and the collection of websites, similar exponents are found. An interpretation of this result is that the total number of words in a category can be measured approximately by the number of websites in a category, although not all websites have the same length.
Figure 10 shows that bigger categories also contain more features, but this increase is weaker than the increase in websites. This implies that very 'feature-rich' categories exist, which is also reflected in the high decay exponent δ = 1.9 of a power-law fit in Figure 11 (compared to the slower decay of the category size distribution shown in Figure 1, where β = 1.1). Concatenating the size distribution measured in features with Heaps' law yields again the size distribution measured in websites: P(i) = R(G(d_i)), i.e. multiplication of the exponents yields δ · α ≈ 1.1, which confirms our empirically found value β = 1.1.

Figure 10: Heaps' law: number of distinct words vs. number of words (α = 0.59), and vs. number of documents in the collection (α = 0.53).
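The exponent-multiplication argument can be checked on synthetic data. The sketch below assumes Pareto-distributed category sizes with CCDF exponent β and Heaps-style scaling d ∝ N^α; all names and the fitting window are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, alpha = 1.1, 0.59        # exponents reported for DMOZ in the text

# Category sizes (in websites) with a Pareto tail: P(N > x) = x**(-beta)
N = (1.0 - rng.random(100_000)) ** (-1.0 / beta)
# Heaps-style scaling per category: distinct features d grow as N**alpha
d = N ** alpha

# Slope of the empirical CCDF of d on a log-log scale (mid-range only,
# to avoid the noisy extremes of the tail)
d_sorted = np.sort(d)[::-1]
ccdf = np.arange(1, d_sorted.size + 1) / d_sorted.size
mask = (d_sorted > 2) & (d_sorted < 100)
slope, _ = np.polyfit(np.log(d_sorted[mask]), np.log(ccdf[mask]), 1)
delta = -slope

# Multiplying the exponents should recover beta: delta * alpha ~ 1.1
print(round(delta, 2), round(delta * alpha, 2))
```

The fitted δ comes out close to β/α ≈ 1.86, so δ · α recovers β ≈ 1.1, mirroring the empirical consistency check made in the text.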
5. SPACE COMPLEXITY OF LARGE-SCALE HIERARCHICAL CLASSIFICATION

Fat-tailed distributions in large-scale web taxonomies highlight the underlying structure and semantics which are useful to visualize important properties of the data, especially in big data scenarios. In this section we focus on the applications in the context of large-scale hierarchical classification, wherein the fit of a power law distribution to such taxonomies can be leveraged to concretely analyse the space complexity of large-scale hierarchical classifiers, in the context of a generic linear classifier deployed in a top-down hierarchical cascade.

In the following sections we first present formally the task of hierarchical classification and then we proceed to the space complexity analysis for large-scale systems. Finally, we empirically validate the derived bounds.

5.1 Hierarchical Classification

In single-label multi-class hierarchical classification, the training set can be represented by S = {(x^(i), y^(i))}_{i=1}^N. In the context of text classification, x^(i) ∈ X denotes the vector representation of document i in an input space X ⊆ R^d. The hierarchy in the form of a rooted tree is given by G = (V, E), where V ⊇ Y denotes the set of nodes of G, and E denotes the set of edges with parent-to-child orientation. The leaves of the tree, which usually form the set of target classes, are given by Y = {u ∈ V : ∄v ∈ V, (u, v) ∈ E}. Assuming that there are K classes, the label y^(i) ∈ Y represents the class associated with the instance x^(i). The hierarchical relationship among categories implies a transition from generalization to specialization as one traverses any path from the root towards the leaves. This implies that the documents which are assigned to a particular leaf also belong to the inner nodes on the path from the root to that leaf node.
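The definitions above can be made concrete in a few lines. The hierarchy below is a toy example; all node names and the feature dimension are invented for illustration:

```python
# Toy rooted hierarchy G = (V, E) with parent-to-child edges,
# mirroring the definitions of Section 5.1 (illustrative names/values).
V = {"root", "arts", "science", "music", "painting", "physics", "biology"}
E = {("root", "arts"), ("root", "science"),
     ("arts", "music"), ("arts", "painting"),
     ("science", "physics"), ("science", "biology")}

# Leaves: nodes u with no outgoing edge (u, v) -- the target classes Y
parents = {u for (u, v) in E}
Y = sorted(V - parents)
K = len(Y)            # number of classes
d = 10_000            # assumed feature-space dimension

size_flat = d * K     # Equation 12: one d-dimensional weight vector per class
print(Y, size_flat)   # K = 4 leaves, size_flat = 40000
```

This also makes explicit the quantity Size_flat = d × K of the next subsection: a flat classifier stores one full-dimensional weight vector per leaf class.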
5.2 Space Complexity

The prediction speed for large-scale classification is crucial for its application in many scenarios of practical importance. It has been shown in [32; 3] that hierarchical classifiers are usually faster to train and test as compared to flat classifiers. However, given the large physical memory of modern systems, what also matters in practice is the size of the trained model with respect to the available physical memory. We therefore compare the space complexity of hierarchical and flat methods, which governs the size of the trained model in large scale classification. The goal of this analysis is to determine the conditions under which the size of the hierarchically trained linear model is lower than that of the flat model.

As a prototypical classifier, we use a linear classifier of the form w^T x, which can be obtained using standard algorithms such as Support Vector Machines or Logistic Regression. In this work, we apply one-vs-all L2-regularized L2-loss support vector classification, as it has been shown to yield state-of-the-art performance in the context of large scale text classification [12]. For flat classification one stores weight vectors w_y, ∀y, and hence, in a K class problem in a d dimensional feature space, the space complexity for flat classification is:

    Size_flat = d × K,   (12)

which represents the size of the matrix consisting of K weight vectors, one for each class, spanning the entire input space. We need a more sophisticated analysis for computing the space complexity of hierarchical classification. In this case, the total number of weight vectors is much larger, since these are computed for all the nodes in the tree and not only for the leaves as in flat classification. In spite of this, the size of the hierarchical model can be much smaller than that of the flat model in large scale classification. Intuitively, when the feature set size is high (top levels in the hierarchy), the number of classes is low, and on the contrary, when the number of classes is high (at the bottom), the feature set size is low.

In order to analytically compare the relative sizes of hierarchical and flat models in the context of large scale classification, we assume power law behaviour with respect to the number of features across levels in the hierarchy. More precisely, if the categories at a level in the hierarchy are ordered with respect to the number of features, we observe a power law behaviour. This has also been verified empirically, as illustrated in Figure 12 for various levels in the hierarchy, for one of the datasets used in our experiments. More formally, the feature size d_{l,r} of the r-th ranked category, according to the number of features, at level l, 1 ≤ l ≤ L − 1, is given by:

    d_{l,r} ≈ d_{l,1} r^(−β_l),   (13)

where d_{l,1} represents the feature size of the category ranked 1 at level l and β_l > 0 is the parameter of the power law. Using this ranking as above, let b_{l,r} represent the number of children of the r-th ranked category at level l (b_{l,r} is the branching factor for this category), and let B_l represent the total number of categories at level l. Then the size of the entire hierarchical classification model is given by:

    Size_hier = Σ_{l=1}^{L−1} Σ_{r=1}^{B_l} b_{l,r} d_{l,r} ≈ Σ_{l=1}^{L−1} Σ_{r=1}^{B_l} b_{l,r} d_{l,1} r^(−β_l).   (14)

Here, level l = 1 corresponds to the root node, with B_1 = 1.

Figure 12: Power-law variation for features in different levels for the LSHTC2-a dataset; the Y-axis represents the feature set size, plotted against the rank of the categories on the X-axis.

We now state a proposition that shows that, under some conditions on the depth of the hierarchy, its number of leaves, its branching factors and power law parameters, the size of a hierarchical classifier is below that of its flat version.
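Equation 14 can be evaluated numerically. The sketch below assumes, purely for illustration, a complete b-ary tree, a single exponent β for every level, and d_{l,1} = d_1 (the upper-bound-style instantiation also used in the proof of Proposition 1):

```python
# Illustrative evaluation of Equations 12-14: hierarchical vs flat model
# size for a complete b-ary tree of depth L with power-law feature decay.
L, b = 4, 10          # depth and uniform branching factor (assumed)
d1 = 100_000          # feature size of the top-ranked category (assumed)
beta = 1.6            # power-law exponent of Eq. 13, same for all levels

K = b ** (L - 1)      # number of leaves
size_flat = d1 * K    # Equation 12

# Equation 14 with b_{l,r} = b and d_{l,1} = d1:
# sum over internal levels l and ranks r of b * d1 * r**(-beta)
size_hier = sum(b * d1 * r ** (-beta)
                for l in range(1, L)               # levels 1 .. L-1
                for r in range(1, b ** (l - 1) + 1))

print(size_hier < size_flat, round(size_flat / size_hier, 1))
```

With these values, β = 1.6 exceeds K/(K − b(L − 1)) ≈ 1.03, so Condition 15 of the proposition below holds and the hierarchical model comes out more than an order of magnitude smaller than the flat one.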
Proposition 1. For a hierarchy of categories of depth L and K leaves, let β = min_{1≤l≤L} β_l and b = max_{l,r} b_{l,r}. Denoting the space complexity of a hierarchical classification model by Size_hier and the one of its corresponding flat version by Size_flat, one has:

    For β > 1, if β > K / (K − b(L − 1))  (> 1), then Size_hier < Size_flat.   (15)

    For 0 < β < 1, if (b / (b^(1−β) − 1)) · ((b^((L−1)(1−β)) − 1) / (1 − β)) < K, then Size_hier < Size_flat.   (16)

Proof. As d_{l,1} ≤ d_1 and B_l ≤ b^(l−1) for 1 ≤ l ≤ L, one has, from Equation 14 and the definitions of β and b:

    Size_hier ≤ b d_1 Σ_{l=1}^{L−1} Σ_{r=1}^{b^(l−1)} r^(−β).

One can then bound Σ_{r=1}^{b^(l−1)} r^(−β) using ([32]):

    Σ_{r=1}^{b^(l−1)} r^(−β) < (b^((l−1)(1−β)) − β) / (1 − β)  for β ≠ 0, 1,   (17)

leading to, for β ≠ 0, 1:

    Size_hier < b d_1 Σ_{l=1}^{L−1} (b^((l−1)(1−β)) − β) / (1 − β)
              = b d_1 [ (b^((L−1)(1−β)) − 1) / ((b^(1−β) − 1)(1 − β)) − (L − 1) β / (1 − β) ],   (18)

where the last equality is based on the sum of the first terms of the geometric series (b^(1−β))^l.

If β > 1, since b > 1, it implies that (b^((L−1)(1−β)) − 1) / ((b^(1−β) − 1)(1 − β)) < 0. Therefore, Inequality 18 can be re-written as:

    Size_hier < b d_1 (L − 1) β / (β − 1).

Using our notation, the size of the corresponding flat classifier is Size_flat = K d_1, where K denotes the number of leaves. Thus:

    If β > K / (K − b(L − 1))  (> 1), then Size_hier < Size_flat,

which proves Condition 15.

The proof for Condition 16 is similar: assuming 0 < β < 1, it is this time the second term in Equation 18 (−(L − 1) β / (1 − β)) which is negative, so that one obtains:

    Size_hier < b d_1 (b^((L−1)(1−β)) − 1) / ((b^(1−β) − 1)(1 − β)),

and then:

    If (b / (b^(1−β) − 1)) · ((b^((L−1)(1−β)) − 1) / (1 − β)) < K, then Size_hier < Size_flat,

which concludes the proof of the proposition.

It can be shown, but this is beyond the scope of this paper, that Condition 16 is satisfied for a range of values of β ∈ ]0, 1[. However, as is shown in the experimental part, it is Condition 15 of Proposition 1 that holds in practice.
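The two bounds used in the proof can be sanity-checked numerically; the values below are illustrative:

```python
# Numeric check of the bound used in the proof (Inequality 17):
# sum_{r=1}^{n} r**(-beta) < (n**(1-beta) - beta) / (1 - beta), beta != 0, 1
def lhs(n, beta):
    return sum(r ** (-beta) for r in range(1, n + 1))

def rhs(n, beta):
    return (n ** (1 - beta) - beta) / (1 - beta)

for beta in (0.5, 1.6, 2.0):
    for n in (10, 100, 1000):
        assert lhs(n, beta) < rhs(n, beta), (beta, n)

# Final bound of Condition 15: for beta > 1,
# Size_hier < b * d1 * (L - 1) * beta / (beta - 1) must fall below K * d1
b, L, d1, K, beta = 10, 4, 1.0, 1000, 1.6
bound = b * d1 * (L - 1) * beta / (beta - 1)
assert beta > K / (K - b * (L - 1))   # Condition 15 holds here
assert bound < K * d1                 # so the hierarchical model is smaller
print("bounds verified")
```

The check covers both regimes of the proof (0 < β < 1 and β > 1), and the final two assertions trace exactly the step from the bound on Size_hier to Condition 15.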
The previous proposition complements the analysis presented in [32], in which it is shown that the training and test time of hierarchical classifiers is importantly decreased with respect to the ones of their flat counterpart. In this work we show that the space complexity of hierarchical classifiers is also better, under a condition that holds in practice, than the one of their flat counterparts. Therefore, for large scale taxonomies whose feature size distribution exhibits power law decay, hierarchical classifiers should be better in terms of speed than flat ones, due to the following reasons:

1. As shown above, the space complexity of a hierarchical classifier is lower than that of a flat classifier.

2. For K classes, only O(log K) classifiers need to be evaluated per test document, as against O(K) classifiers in flat classification.

In order to empirically validate the claim of Proposition 1, we
measured the trained model sizes of a standard top-down hierarchical scheme (TD), which uses a linear classifier at each parent of the hierarchy, and of the flat one. We use the publicly available DMOZ data of the LSHTC challenge, which is a subset of Directory Mozilla. More specifically, we used the large dataset of the LSHTC-2010 edition, and two datasets were extracted from the LSHTC-2011 edition. These are referred to as LSHTC1-large, LSHTC2-a and LSHTC2-b respectively in Table 2. The fourth dataset (IPC) comes from the patent collection released by the World Intellectual Property Organization. The datasets are in the LibSVM format and have been preprocessed by stemming and stopword removal. Various properties of interest for the datasets are shown in Table 2.

Dataset        #Tr./#Test     #Classes  #Feat.
LSHTC1-large   93,805/34,880  12,294    347,255
LSHTC2-a       25,310/6,441   1,789     145,859
LSHTC2-b       36,834/9,605   3,672     145,354
IPC            46,324/28,926  451       1,123,497

Table 2: Datasets for hierarchical classification with the properties: number of training/test examples, target classes and size of the feature space. The depth of the hierarchy tree for the LSHTC datasets is 6 and for the IPC dataset is 4.

Table 3 shows the difference in trained model size (actual value of the model size on the hard drive) between the two classification schemes for the four datasets, along with the values defined in Proposition 1.

Dataset        TD    Flat  β     b    K/(K − b(L − 1))
LSHTC1-large   2.8   90.0  1.62  344  1.12
LSHTC2-a       0.46  5.4   1.35  55   1.14
LSHTC2-b       1.1   11.9  1.53  77   1.09
IPC            3.6   10.5  2.03  34   1.17

Table 3: Model size (in GB) for flat and hierarchical models along with the corresponding values defined in Proposition 1; the last column is the quantity K/(K − b(L − 1)) of condition 15.
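Plugging the reported values into Condition 15 is straightforward. One caveat: with the depths stated in the caption of Table 2, K/(K − b(L − 1)) comes out slightly above the values printed in Table 3, whereas counting L − 2 internal levels reproduces them almost exactly, which suggests a different depth convention in the table; the sketch below uses the latter:

```python
# Check Condition 15 (beta > K / (K - b * n_int)) on the values of
# Tables 2 and 3; n_int = L - 2 internal levels reproduces the
# thresholds reported in Table 3 (assumed depth convention).
datasets = {
    #                K       beta   b    L   reported threshold
    "LSHTC1-large": (12_294, 1.62,  344, 6,  1.12),
    "LSHTC2-a":     (1_789,  1.35,  55,  6,  1.14),
    "LSHTC2-b":     (3_672,  1.53,  77,  6,  1.09),
    "IPC":          (451,    2.03,  34,  4,  1.17),
}

for name, (K, beta, b, L, reported) in datasets.items():
    threshold = K / (K - b * (L - 2))
    assert abs(threshold - reported) < 0.01
    assert beta > threshold        # Condition 15 holds for every dataset
    print(f"{name}: beta = {beta} > {threshold:.2f}")
```

In every case β exceeds the threshold, consistent with the observation that it is Condition 15 that holds in practice.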
As shown for the three DMOZ datasets, the trained model for flat classifiers can be an order of magnitude larger than for hierarchical classification. This results from the sparse and high-dimensional nature of the problem, which is quite typical in text classification. For flat classifiers, the entire feature set participates for all the classes, but for top-down classification, the number of classes and the number of features participating in classifier training are inversely related when traversing the tree from the root towards the leaves. As shown in Proposition 1, the power law exponent β plays a crucial role in reducing the model size of the hierarchical classifier.

6. CONCLUSIONS

In this work we presented a model in order to explain the dynamics that exist in the creation and evolution of large-scale taxonomies such as the DMOZ directory, where the categories are organized in a hierarchical form. More specifically, the presented process jointly models the growth in the size of the categories (in terms of documents) as well as the growth of the taxonomy in terms of categories, which to our knowledge had not been addressed in a joint framework.
From one of them, the power law in category size ically, the presented process theofgrowth in distribution, we derived powermodels laws at jointly each level the hierthe sizeand of the terms oflaw documents) as welllaw as archy, withcategories the help (in of Heaps’s a third scaling thethe growth of the in ofterms of categories, in features size taxonomy distribution categories which wewhich then to our knowledge have not been addressed in a joint framework. From one of them, the power law in category size distribution, we derived power laws at each level of the hierarchy, and with the help of Heaps’s law a third scaling law in the features size distribution of categories which we then SIGKDD Explorations exploit for performing an analysis of the space complexity of linear classifiers in large-scale taxonomies. We provided a grounded analysis of the space complexity for hierarchical and flat classifiers and proved that the complexity of the former is always lower than that of the latter. The analysis exploit performing an analysis of thelarge-scale space complexity has beenfor empirically validated in several datasets of linear classifiers in large-scale taxonomies. We provided showing that the size of the hierarchical models can be siga groundedsmaller analysis of the the space complexity nificantly that ones created by a for flathierarchical classifier. and classifiers andanalysis proved can thatbe theused complexity The flat space complexity in order of to the esformer is always lower than that of the latter. The analysis timate beforehand the size of trained models for large-scale has been empirically validated in several large-scale datasets data. This is of importance in large-scale systems where the showing that the size of the hierarchical models can time. be sigsize of the trained models may impact the inference nificantly smaller that the ones created by a flat classifier. The space complexity analysis can be used in order to es7. 
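The inverse relation between the number of classes and the number of active features along the top-down path can be made concrete with a back-of-the-envelope sketch. This is an illustrative toy model with assumed branching, depth, and feature-decay values, not the paper's exact derivation or constants (a fixed per-level `shrink` factor stands in for the Heaps'/power-law behaviour analysed in Proposition 1):

```python
# Toy comparison of flat vs. top-down hierarchical linear classifiers:
# deeper nodes see fewer classes AND fewer active features.

def flat_size(num_leaves, num_features):
    # One weight vector per leaf class over the full feature set.
    return num_leaves * num_features

def topdown_size(branching, depth, root_features, shrink=0.5):
    # Each internal node trains `branching` classifiers over the features
    # active at that node; feature counts shrink at every level
    # (`shrink` is an assumed decay, standing in for the scaling law).
    total, nodes, features = 0, 1, root_features
    for _ in range(depth):
        total += nodes * branching * features
        nodes *= branching
        features = max(1, int(features * shrink))
    return total

leaves = 3 ** 6   # branching 3, depth 6 (the LSHTC hierarchy depth)
flat = flat_size(leaves, 300_000)
td = topdown_size(branching=3, depth=6, root_features=300_000)
print(flat / td)  # the flat model comes out several times larger
```

Under these assumed values the flat parameter count exceeds the top-down one by roughly an order of magnitude, consistent with the empirical gap reported in Table 3.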
7. ACKNOWLEDGEMENTS
This work has been partially supported by the ANR project Class-Y (ANR-10-BLAN-0211), the BioASQ European project (grant agreement no. 318652), LabEx PERSYVAL-Lab (ANR-11-LABX-0025), and the Mastodons project Garguantua.

8. REFERENCES
[1] R. Babbar, I. Partalas, C. Metzig, E. Gaussier, and M.-R. Amini. Comparative classifier evaluation for web-scale taxonomies using power law. In European Semantic Web Conference, 2013.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Neural Information Processing Systems, pages 163–171, 2010.
[4] P. N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. In Proceedings of the 32nd international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11–18, 2009.
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
[6] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 78–87, 2004.
[7] A. Capocci, V. D. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E, 74(3):036116, 2006.
[8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the twenty-first international conference on Machine learning, ICML '04, pages 27–34, 2004.
[9] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks with aging of sites. Physical Review E, 62(2):1842, 2000.
[10] L. Egghe. Untangling Herdan's law and Heaps' law: Mathematical and informetric arguments. Journal of the American Society for Information Science and Technology, 58(5):702–709, 2007.
[11] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[13] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In IEEE International Conference on Computer Vision (ICCV), pages 2072–2079, 2011.
[14] M. M. Geipel, C. J. Tessone, and F. Schweitzer. A complementary view on the growth of directory trees. The European Physical Journal B, 71(4):641–648, 2009.
[15] S. Gopal, Y. Yang, B. Bai, and A. Niculescu-Mizil. Bayesian models for large-scale hierarchical classification. In Neural Information Processing Systems, 2012.
[16] G. Jona-Lasinio. Renormalization group and probability theory. Physics Reports, 352(4):439–458, 2001.
[17] K. Klemm, V. M. Eguíluz, and M. San Miguel. Scaling in the structure of directory trees in a computer cluster. Physical Review Letters, 95(12):128701, 2005.
[18] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, 1997.
[19] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD, 2005.
[20] B. Mandelbrot. A note on a class of skew distribution functions: Analysis and critique of a paper by H. A. Simon. Information and Control, 2(1):90–99, 1959.
[21] C. Metzig and M. B. Gordon. A model for scaling in firms' size and growth rate distribution. Physica A, 2014.
[22] M. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.
[23] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 2005.
[24] I. Partalas, R. Babbar, É. Gaussier, and C. Amblard. Adaptive classifier selection in large-scale hierarchical classification. In ICONIP, pages 612–619, 2012.
[25] P. Richmond and S. Solomon. Power laws are disguised Boltzmann laws. International Journal of Modern Physics C, 12(03):333–343, 2001.
[26] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.
[27] C. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433(7024):392–395, 2005.
[28] H. Takayasu, A.-H. Sato, and M. Takayasu. Stable infinite variance fluctuations in randomly amplified Langevin systems. Physical Review Letters, 79(6):966–969, 1997.
[29] C. J. Tessone, M. M. Geipel, and F. Schweitzer. Sustainable growth in complex networks. EPL (Europhysics Letters), 96(5):58005, 2011.
[30] K. G. Wilson and J. Kogut. The renormalization group and the ε expansion. Physics Reports, 12(2):75–199, 1974.
[31] G.-R. Xue, D. Xing, Q. Yang, and Y. Yu. Deep classification in large-scale text hierarchies. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 619–626, 2008.
[32] Y. Yang, J. Zhang, and B. Kisiel. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03, pages 96–103, 2003.
[33] G. U. Yule. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213:21–87, 1925.

SIGKDD Explorations, Volume 16, Issue 1

Interview: Michael Brodie, Leading Database Researcher, Industry Leader, Thinker
Gregory Piatetsky
KDnuggets
Brookline, MA
[email protected]
ABSTRACT
We discuss the most important database research advances, industry developments, the role of relational and NoSQL databases, Computing Reality, Data Curation, Cloud Computing, the Tamr and Jisto startups, what he learned as Chief Scientist of Verizon, Knowledge Discovery, privacy issues, and more.

Keywords
Data Curation, NoSQL, Cloud Computing, Verizon, Privacy, Computing Reality.

1. INTRODUCTION
I had the pleasure of working with Michael Brodie when we were both at GTE Laboratories in the 1990s, where he was already a world-famous researcher and a department manager. I recently met him at another conference, and our discussion led to this interview. Michael is still very sharp, very active, and busy: he answered these questions while flying from Boston to Doha, Qatar, where he is advising the Qatar Computing Research Institute. Parts of this interview were published in KDnuggets [1-3].

2. BACKGROUND
Dr. Michael L. Brodie [4] has served as Chief Scientist of a Fortune 20 company, an Advisory Board member of leading national and international research organizations, and an invited speaker and lecturer. In his role as Chief Scientist, Dr. Brodie has researched and analyzed challenges and opportunities in advanced technology, architecture, and methodologies for Information Technology strategies. He has guided advanced deployments of emergent technologies at industrial scale, most recently Cloud Computing and Big Data. Throughout his career Dr. Brodie has been active in both advanced academic research and large-scale industrial practice, attempting to obtain mutual benefits from the industrial deployment of innovative technologies while helping research to understand industrial requirements and constraints. He has contributed to multi-disciplinary problem solving at scale in contexts such as Terrorism and Individual Privacy, and Information Technology Challenges in Healthcare Reform.

3. INTERVIEW
Gregory Piatetsky: You started as a researcher in Databases (PhD from Toronto) and have had a very distinguished and varied career spanning academia, industry, and government in the US, Europe, Australia, and Latin America over the last 25+ years. From your unique vantage point, what were the 3 most important database research advances?

Michael Brodie: Three most important database research advances:

1. Ted Codd's Relational model of data (1970) is the most important database research advance, as it launched what is now a $28 BN/year market still growing at 11% CAGR, with over 215 RDBMSs on the market. More important to me, it launched four decades of amazing research advances, starting with query optimization (Selinger) and transactions (Gray), and innovation that has probably grown at 20% CAGR.

2. The next most important research advance or stage was a change in perspective: specific domains require their own DBMS, such as graph databases, array stores, document stores, key-value stores, NoSQL, NewSQL, and many more to come. DB-Engines.com lists twelve DBMS categories, thus bumping the database world from managing 8% of the world's data to about 12%, but, due to the growth of non-database data, back to 10%. Soon, due to the role of data in our digitized world, there will be data management systems for many more domains. While this is amazingly cool, how do we solve multi-disciplinary (multi-data-domain) problems in a consistent rather than disjoint way?

3. The next most important research advance is just emerging and is mind-blowing. I call it Computing Reality, acknowledging that every datum (every real-world observation) is not definitive but probabilistic. Unlike conventional databases and more like reality, Computing Reality has no single version of truth. How do we model such worlds, more realistic worlds, and compute over them? The simple answer is that it is already in Big Data sources.
There are many related attempts to address Computing Reality, including social computing, probabilistic computing, probabilistic databases, Open Worlds in AI, Web Science, Approximate Computing, Crowd Computing, and more. Perhaps this will be the next generation of computing.

GP: What about the most important database industry developments?

MB: Alas, the database industry, like all industries, has a legacy problem that stifles innovation. It has taken over 30 years to emerge from the relational era. The most important recent database industry development came from outside the database industry: it is Big Data and its marketing arm called MapReduce, and its data sidekicks, Hadoop and NoSQL. Frankly, the database industry has been insular and has protected its relational turf for FAR too long. Smart folks at Yahoo!, Google and other places saw value in data, non-database data, and thus emerged MapReduce, Hadoop, and NoSQL: generally crappy database ideas, but they woke up the database industry.¹ Hadoop and NoSQL are growing in demand. In time it will be seen that they are amazing for a very specific problem domain, embarrassingly parallel problems, but a money pit for everything else. The importance of MapReduce is that it forced the database industry to get out of their hammocks.

GP: What is the role of Relational Databases, NoSQL databases, Graph databases, and other databases today?

MB: Relational Databases have two extremely well established roles. Conventional row stores serve the OLTP community as the backbone of enterprise operations. These blindingly fast transaction processors are moving in-memory. OLTP stores are modest in number and size (< 1 TB), growing and declining in lock step with business growth and decline. Column stores, OLAP, are the backbone of data warehouses and, until recently, business intelligence. In general there are huge numbers of these, often of very large size, in the Petabyte and Exabyte range.
This is where the Big Data battle lines are being drawn. What fun!! This is also where we turn from polishing the relational round ball [5] and focus on the dozen or so other DBMS categories. Taking over is relative; none of the 12 other categories has more than 3% of the database market. Graph databases serve graph applications like networking in communications, telecom, social networks, and of course NSA applications! But what is wonderful about these emerging classes of data-domain-specific DBMSs is that we are only now discovering the rich use cases that they serve. The use cases define the DBMSs and the DBMSs help formulate the use cases. SciDB is a superb example of managing scientific data and computation at scale. It is awkward for both communities: database folks who don't speak linear algebra or matrices, and scientists who only speak R. Exciting times. For a little fun, look at the DB-Engines list [6].

Database Engines
DB-Engines lists 216 different database management systems, which are classified according to their database model (e.g. relational DBMS, key-value stores, etc.). This pie chart shows the number of systems in each category. Some of the systems belong to more than one category.

Popularity changes per category, April 2014, over 1 year:
- Graph DBMS: growing dramatically, 3.5X
- Wide column stores: 2X
- Document stores: 2X
- Native XML DBMS: 1.5X
- Key-value stores: 1.5X
- Search engines: 1.5X
- RDF stores: 1.5X
- Object oriented DBMS: flat
- Multivalue DBMS: flat
- Relational DBMS: flat

¹ On June 25, 2014, Google launched Cloud Dataflow, replacing MapReduce and marking the decline of MR and Hadoop, as predicted at launch in 2010 by Mike Stonebraker.

GP: You have held an amazing variety of positions in academia, industry, government organizations, VC firms, and start-ups, in the US, Brazil, Canada, Australia, and Europe. Which 3 positions were most satisfying to you and why?

MB: What a great question.
Thank you for asking, because it caused me to think about what I have really enjoyed over 40 years. Somehow CSAIL at MIT and the Faculty of Computing and Communications at EPFL jump to mind. There are scary smart people at those places. Like climbing mountains, it both scares and exhilarates me. To be frank, my jobs at big enterprises in hindsight are confusing. I guess I was window dressing, because my role did not feel like it had impact. So getting motivated and scared at MIT and EPFL are probably top, so there's number one. Why? Just look down 5,000 feet and ask: why am I here? Second is a combination of advisory roles at the US Academy of Science, DERI, STI, ERCIM, Web Science Trust, and others, because they gave me a sense of collaborating, challenging, and contributing. How cool is that? Third would be working at startups like Tamr and Jisto. Imagine waking up in the morning and thinking you might change the world. That requires that I conceive the world not just differently, but so that it solves someone's REAL problem. Even more cool.

Gregory Piatetsky: Currently you are an adviser at a startup called Tamr [7], co-founded by another leading DB researcher and serial entrepreneur, Michael Stonebraker. What can you tell us about Tamr and its product?

Michael Brodie: Consider the data universe. Since the 1980s I have said in keynotes that the database and business worlds deal with less than 10% of the world's data, most of which is structured, discrete, and conforms to some schema. With the Web and the Internet of Things in the 1990s, massive amounts of unstructured data began to emerge with a growth rate that was inconceivable, while shrinking database data to less than 8%. EMC/IDC claims [14] that our Digital Universe is 4.4 zettabytes and will double every two years until 2020, when it will be 44 zettabytes.
[If you are constantly amazed at the growth of the Digital World, you don't understand it yet. A profound, casual comment of my departed friend, Gerard Berry, Academie Francais.]

In 1988 or so you, Gregory, and a few others saw the potential of data with your knowledge discovery in databases, then a radical idea. Little did others, including me, realize the potential of this, now named Big Data. Even though Big Data is hot in 2014, almost 30 years later, its applications, tools, and technologies are in their infancy, analogous to the emergence of the Web in the early 1990s. Just as the Web has changed and is changing the world, so too will Big Data.

Compared with database data, Big Data is crazy. It is largely not understood, hence it is schema-less or model-less. Big Data is inconceivably massive, dirty, imprecise, incomplete, and heterogeneous beyond anything we've seen before. Yet it trumps finite, precise database data in many ways, hence is a treasure trove of value. Big Data is qualitatively different from database data, which is a small subset of Big Data (EMC/IDC claims 1% as of 2013). It offers far greater potential, thus value, and requires different thinking, tools, and techniques.

Database data is approached top-down. Telco billing folks know billing inside out, so they create models that they impose, top-down, on data. Data that does not comply is erroneous. Database data, like Telco bills, must be precise, with a single version of truth, so that the billing amount is justifiable. Due in part to scale, Big Data must be approached bottom-up. More fundamentally, we should let the data speak; see what models or correlations emerge from the data, e.g., to discover if adding strawberry to the popsicle line-up makes sense (a known unknown) or to discover something we never thought of (unknown unknowns).
Rather than impose a preconceived, possibly biased, model on data, we should investigate what possible models, interpretations, or correlations are in the data (possibly in the phenomena) that might help us understand it. Hence, the new paradigm is to approach Big Data bottom-up, due to the scale of the data, and to let the data speak. Big Data is a different, larger world than the database world. The database world (small data) is a small corner of the Big Data world. Correspondingly, Big Data requires new tools, e.g., Big Data Analytics, Machine Learning (the current red-haired child), Fourier transforms, statistics, visualizations; in short, any model that might help elucidate the wisdom in the data.

But how do you get Big Data, e.g., 100 data sources, 1,000, 100,000 or even 500,000, into these tools? How do you identify the 5,000 data sources that include Sally Blogs and consolidate them into a coherent, rational, consistent view of dear Sally? When questions arise in consolidating Sally's data, how do you bring the relevant human expertise, if needed, to bear, at scale, on 1 million people? Many successful Big Data projects report that this data curation process takes 80% of the project resources, leaving 20% for the problem at hand. Data curation is so costly because it is largely manual, hence error prone.

That's where Tamr comes to the rescue. It is a solution to curate data at scale. We call it collaborative data curation because it optimizes the use of indispensable human experts. Data Curation is for Big Data what Data Integration is for small data. Data Curation is bottom-up and Data Integration is top-down. It took me about a year to understand that fundamental difference. I have spent over 20 years of my professional life dealing with those amazing Data Integration platforms and some of the world's largest data integration applications.
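The division of labor behind collaborative curation can be sketched in a few lines. This is a deliberately toy illustration (hypothetical records and field names, and in no way Tamr's actual algorithm): automatic rules consolidate the records that agree, and conflicting clusters are routed to a human-expert queue instead of being merged by guesswork.

```python
# Toy sketch of collaborative data curation: cluster person records from
# many sources by a normalized key; escalate conflicts to human experts.

from collections import defaultdict

def normalize(name, email):
    # Crude canonical key; real matchers are far richer.
    return (name.strip().lower(), email.strip().lower())

def curate(records):
    clusters = defaultdict(list)
    for rec in records:
        clusters[normalize(rec["name"], rec["email"])].append(rec)
    consolidated, expert_queue = [], []
    for (name, email), recs in clusters.items():
        cities = {r["city"] for r in recs}
        if len(cities) == 1:
            # Sources agree: consolidate automatically.
            consolidated.append({"name": name, "email": email,
                                 "city": cities.pop(), "sources": len(recs)})
        else:
            # Conflicting facts: route to a human expert, don't guess.
            expert_queue.append(recs)
    return consolidated, expert_queue

records = [
    {"name": "Sally Blogs",  "email": "[email protected]", "city": "Boston"},
    {"name": " sally blogs", "email": "[email protected]", "city": "Boston"},
    {"name": "Sally Blogs",  "email": "[email protected]", "city": "Doha"},
    {"name": "Bob Smith",    "email": "[email protected]",   "city": "Boston"},
]
merged, queue = curate(records)
```

Here Bob's single consistent record is consolidated automatically, while Sally's conflicting city values land in the expert queue; the point is that expert attention is spent only where the machine cannot decide.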
Those technologies and platforms apply beautifully to database data (small data); they simply do not apply to Big Data. To emphasize what is ahead, here is a prediction. Data Integration is increasingly crucial to combining top-down data into meaningful views. Data Integration is a huge challenge and a huge market that will not go away. Big Data is orders of magnitude larger than small or database data. Correspondingly, Data Curation will be orders of magnitude larger than Data Integration. The world will need Data Curation solutions like Tamr to let data scientists focus on analytics, the essential use and value of Big Data, while containing the costs of data preparation. In addition to Tamr there are over 65 very cool data curation products contributing to addressing the growing need and creating a new software market. What is also cool about data curation is that it can be used to enrich the existing information assets that are the core of most enterprises' applications and operations. Of course, the really cool potential of data curation is that it makes Big Data analytics more efficiently available, allowing users to discover things that they never knew! How cool is that?

For more on Data Curation at scale, see Stonebraker [8].

GP: You also advise another startup, Jisto. What can you tell us about your role there?

MB: I am having a blast with Jisto [9]: some amazingly talented young engineers [PhDs actually] with lots of energy and a killer idea. Jisto is an exceptional example of the quality you ask about in the next question. Cloud computing, enabled by virtualization, is radically changing the world by reducing the cost and increasing the availability of computing resources. Can you imagine that only 50% of the world's servers are virtualized? Pop quiz [do not cheat and read ahead]: What is the average CPU utilization of physical servers, worldwide? Of virtual servers?
Answer: Virtual machine CPU utilization is typically in the 30-50% range, while physical servers are at 10-20%, due to risk and scheduling, but mostly cultural challenges. Jisto enables enterprises to transparently run more compute-intensive workloads on these paid-for but unused resources, whether on premises or in public or private clouds, thus reducing costs by 75-90% over acquiring more hardware or cloud resources. Jisto provides a high-performance, virtualized, elastic cloud-computing environment from underutilized enterprise or cloud computing resources (servers, laptops, etc.) without impacting the primary task on those resources. Organizations that will benefit most from Jisto are those that run parallelized compute-intensive applications in the data center or in private and public clouds (e.g., Amazon Web Services, Windows Azure, Google Cloud Platform, IBM SmartCloud). Jisto is currently looking for early adopters for its beta program, who will gain a significant reduction in the cost of their computing, possibly avoiding costly data center expansion.

GP: You are also a Principal at First Founders Limited. What do you look for in young business ventures? How do you determine quality?

MB: There are armies of people who evaluate the potential of startups. The professional ones are called "Venture Capitalists" (VCs). The retired ones are called Angels. Like any serious problem, there is due diligence to determine and evaluate the factors relevant to the business opportunity, the technology, the business plan, etc., as the many books [10] and formulas suggest. If you are reading a book, then you don't know. Ultimately it comes down to good taste developed over years of successful experience. Andy Palmer, a serial entrepreneur, good friend, and very smart guy, said "Do it once really well, then repeat." Andy ought to know; Tamr is about his 25th startup.
At First Founders I can do some technology, Jim can do finance, Howard can do business plans. Collectively we make a judgment. But good VCs are the wizards. They have Rolodexes. When their taste says maybe, they refer the startup to the relevant folks in their network, who essentially do the due diligence for them. Like at First Founders, the judgment is crowd-sourced, or actually, as we call it at Tamr, expert-sourced. I have a growing trust of the crowd and especially of the expert crowd.

GP: You were Chief Scientist at Verizon for over 10 years (and before that at GTE Labs, which became part of Verizon). What were some of the most interesting projects you were involved in at GTE and Verizon?

MB: The technical challenge that stays with me is that addressed by the Verizon Portal, Verizon's solution for Enterprise Telecommunications: providing telecommunication services to enterprise customers, such as Microsoft. Verizon, like all large Telcos, is the result of the merger & acquisition of 300+ smaller Telcos. Each had at least 3 billing systems; hence Verizon acquired over 1,000 billing systems. Billing is only one of over a dozen system categories, including sales, marketing, ordering, and provisioning. Providing a customer like Microsoft with a telephone bill for each Microsoft organization requires integrating data potentially from over 1,000 databases. As is the case for most enterprises, Verizon and Microsoft reorganize constantly, complicating both the sources to be integrated, like Microsoft, and the targets, Verizon's changing businesses, e.g., wireline and FiOS. Every service company faces this little-discussed massive challenge. Integrating 1,000s of operational systems is a backward-looking problem. The cool forward-looking problem was Verizon IT's Standard Operating Environment (SOE).
Prior to cloud platforms and cloud providers, Verizon IT (actually one team) sought to develop an SOE onto which Verizon's major applications (over 6,000) could be migrated to be managed virtually on an internal cloud. What a fun challenge. When the team left Verizon as a group, over 60 major corporate applications, including SAP, had been migrated. Smart folks, a good solution that failed in Verizon. In industry, challenges are 80-20: 80% political, 20% technical. The SOE is being reborn in the infrastructure of another major infrastructure corporation.

Finally, the next most interesting and yet unsolved industry challenge was getting over the legacy that Mike Stonebraker and I addressed in [11]. How do you keep a massive system up to date in terms of the application requirements and the underlying technology, or migrate it to a modern, efficient, more cost-effective platform? Enterprises tend to invest only in new revenue-generating opportunities, often leaving the legacy problem to grow and grow. So existing systems like billing languish and accumulate. It's like a teenager never tidying their room for 60 years. Now where are my blue shoes? I suggested to Mike Stonebraker that we rewrite our 1995 book. He did not even respond to the email, suggesting that it is largely a political problem and not technical, no matter the brilliant technical solution provided.

Volume 16, Issue 1 Page 60

Lesson: If you are a CIO, clean up your goddamn room; you're not going out until you do!

GP: Around 1989, when you were a manager at GTE Labs and I was a member of technical staff there, you were somewhat skeptical of the idea I proposed for research into Knowledge Discovery in Databases (then called KDD or Data Mining, and more recently Predictive Analytics and Data Science). The field has progressed significantly since then. From your point of view, what are the main successes and disappointments of KDD/Data Mining/Predictive Analytics, and can Data Science become an actual science?
MB: My current research concerns the scientific and philosophical underpinnings of Big Data and Data Science. With Big Data we are undergoing a fundamental shift in thinking and in computing. Big Data is a marvelous tool to investigate What – correlations or patterns that suggest that things might have occurred or will occur. Big Data's weakness is that it says nothing about Why – causation, or why a phenomenon occurred or will occur. A pernicious aspect of What is the biases that we bring to it.

On a personal note, my biased recall of 1989 was how marvelous your ideas were and the amazing potential of data mining. I accept your view that I was skeptical rather than enthusiastic, as I recall. You see, I modified reality to fit my desire to be on the winning side, which I was not then. Hence, what we think that we thought may bear little resemblance to reality or, more precisely, to other people's reality. As Richard Feynman said, "The first principle is that you must not fool yourself - and you are the easiest person to fool."

That said, I see the main successes of this trend as a nascent trajectory along the lines of Big Data, Data Analytics, Business Intelligence, Data Science, and whatever the current trendy term is. The World of What is phenomenal – machines proposing potential correlations that are beyond our ability to identify. Humans consider seven plus or minus two variables at a time, a rather simple model, while models, such as machine learning models, can consider millions or billions of variables at a time. Yet 95% (or even 99.99999%) of the resulting correlations may be meaningless. For example, ~99% of credit card transactions are legitimate and less than 1% are fraudulent, yet that 1% can kill the profits of a bank. So precision and outlier cases, called anomalies in science, can matter. So it pays to search for apparently anomalous behavior – as it is happening!
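The base-rate arithmetic behind this point can be sketched as follows. This is a minimal illustration with hypothetical detector rates (the function name and numbers are ours, not from the interview, beyond its ~99% legitimate / ~1% fraudulent split): even a detector that is right 99% of the time on each class produces fraud flags that are only about half correct, which is why precision on the rare class matters so much.

```python
def flag_precision(fraud_rate, tpr, tnr):
    """Precision of a detector's 'fraud' flags under class imbalance.

    fraud_rate: fraction of transactions that are fraudulent
    tpr: true-positive rate (fraud correctly flagged)
    tnr: true-negative rate (legitimate transactions correctly passed)
    """
    true_flags = fraud_rate * tpr                # fraudulent and flagged
    false_flags = (1 - fraud_rate) * (1 - tnr)   # legitimate but mistakenly flagged
    return true_flags / (true_flags + false_flags)

# With 1% fraud and a detector that is 99% right on each class:
# 0.0099 true flags vs. 0.0099 false flags, so precision is only 0.5.
print(flag_precision(0.01, 0.99, 0.99))  # 0.5
```

The rarer the anomaly, the more false alarms swamp true detections for any fixed error rate, so a system that looks excellent by overall accuracy can still be useless for catching the 1% that kills the bank's profits.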
We have already seen massive benefits of Big Data in the stock market, electoral predictions, marketing success, and many more that underlie the Big Data explosion. Yet there is a potential Big Data Winter ahead if people blindly apply Big Data and, more specifically, Machine Learning. The failures concern limited models of phenomena and the human tendency toward bias. People can and do use What (Big Data, etc.) to support their biases and limited models, e.g., to support the claim of the absence of climate change or the lack of human impact on climate change, rather than letting the data speak to suggest directions and models that we may never have thought of. As it has always been, it takes courage to change from a discrete world of top-down models [I know how this works!] to an ambiguous, probabilistic world [What possible ways does this work?]. Those are natural successes and limitations of an emerging field. The direction, opportunities, and changes are profound. I experience a mix of fear and tingles thinking of asking the data to speak, hoping that I can be open to what it says and distinguishing s..t from Shinola. I call the vision Computing Reality. It may be the Next Generation of Computing.

GP: In your very insightful report on the White House-MIT Big Data Privacy Workshop [12] you have a quote: "Big data has rendered obsolete the current approach to protecting privacy and civil liberties". Will people get used to much less privacy (as the digitally-savvy younger people seem to be), or will government regulation and/or technology be able to protect privacy? How will this play out in the US vs. Europe vs. other regions of the world?

MB: As an undergraduate at the University of Toronto, I was extremely fortunate to have had Kelly Gotlieb, the Father of Computing in Canada, as a mentor. I was a student in his 1971 course, Computers and Society, later to become the first book on the topic.
Kelly and the issues, including privacy, have resonated with me throughout my career. Kelly observed that privacy, like many other cultural norms, varies over time. So yes, privacy will fluctuate between Alan Westin's notion of determining how your personal information is communicated and the Facebook-esque "Get over it". While personal privacy is undergoing significant change, disclosure of information assets that are part of the digital economy or of government or corporate strategy may have very significant impacts on our economy and democracy. Hence, this raises issues of security, protection, and cultural and social issues too complex to be treated here. However, there are a number of very smart people looking at various aspects. The quote you cite is from Craig Mundy [13], who explores the changes that Big Data brings, debating the balance between economic and privacy issues. Very smart folks, like Butler Lampson and Mike Stonebraker, are commenting on practical solutions to this age-old problem. Their arguments are along the following lines. Due to the massive scale of Big Data, and what I call Computing Reality, previously top-down solutions for security, such as anticipating and preventing security breaches, will simply not scale to Big Data. They must be augmented with new approaches, including bottom-up solutions such as Stonebraker's logging to detect and stem previously unanticipated security breaches and Weitzner's accountable systems. To beat the Heartbleed bug and others like it, "Organizations need to be able to detect attackers and issues well after they have made it through their gates, find them, and stop them before damage can occur," Gazit, a leading cyber security expert, said recently. "The only way to achieve such a laser-precision level of detection is through the use of hyper-dimensional big data analytics, deploying it as part of the very core of the defense mechanisms.
Big Data has rendered obsolete the current approach to protecting privacy and civil liberties." Hence, Big Data requires a shift from a focus on top-down methods of controlling data generation and collection to a focus on data usage. Not only do top-down methods not scale, "Tightly restricting data collection and retention could rob society of a hugely valuable resource" [13]. Adequate, let alone complete, solutions will take years to develop.

GP: What interesting technical developments do you expect in Database and Cloud Technology in the next 5 years?

MB: I call the Big Picture Computing Reality, in which we model the world from whatever reasonable perspectives emerge from the data and are appropriate, e.g., have veracity, and make decisions symbiotically, with machines and people collaborating to optimize resources while achieving measures of veracity for each result. One subspace of this world is what we currently know with high levels of confidence, the type of information that we store in relational databases. Another encompassing space is what we know but forgot or don't want to remember (unknown knowns), and a third is what we speculate but do not know (known unknowns); these are all the hypotheses that we make but do not know in science, business, and life. The rest of the data space – unknown unknowns – is infinite; otherwise learning would be at an end. That is the space of discovery. I am investigating Computing Reality to investigate the entire space with the objective of accelerating Scientific Discovery. This is practically interesting because very little of our world is discrete, bounded, finite, or involves a single version of truth, yet that is the world of most computing. With Computing Reality we hope to be far more pragmatic and realistic. This is technically and theoretically interesting because we have almost no mathematical or computing models in these areas. Those that exist are just emerging or are massively complex. How cool is that? You see what old retired guys get to do?

GP: What do you like to do in your free time? What recent book did you like?

MB: Free time – what a concept! My yoga teacher, Lynne, recommended that I should try to do nothing one day, and I will. I will. Soon. Really. Life is such a blast; it's hard to keep still. My activities include the gym (4 times a week); hiking/climbing ~75 mountains in the USA, Nepal, Greece, Italy, France, Switzerland, and even Australia, including 42 of the 48 4,000-footers in NH (most with Mike Stonebraker); cooking (daily and on special occasions with my son Justin, an amazing chef and brewer, when he's not doing his PhD); travel; and my garden; all of these – except the gym and garden – with family and close friends.

(Photo: Michael Brodie and his son on a peak in New Hampshire)

Very cool Big Data books: Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schonberger and Kenneth Cukier, Houghton Mifflin Harcourt, confused and inspired me; then The Signal and the Noise: Why So Many Predictions Fail-but Some Don't, by Nate Silver, Penguin Press, inspired me.

Real books:
- Ken Follett's The Pillars of the Earth and the Century Trilogy (Fall of Giants, Winter of the World, and Edge of Eternity)
- Henning Mankell's The Fifth Woman (A Kurt Wallander Mystery)

GP: You just returned from Doha, Qatar, where you were advising the Qatar Computing Research Institute (QCRI), quite far from Silicon Valley, New York, or Boston. What is happening there and what computing research are they doing?

MB: This was my first visit to Qatar, and it was remarkable culturally and intellectually. Culturally, I saw the spectacular result of hydrocarbon wealth and vision, e.g., amazing architecture emerging from the desert.
Intellectually, I saw the beginnings of Qatar's National Vision 2030 to transform Qatar's economy from hydrocarbon-based to knowledge-based. One step in this direction by the Qatar Foundation was to create the Qatar Computing Research Institute (QCRI). In less than three years, QCRI has established the beginnings of a world-class computer science research group, seeded with world-class researchers in strategically important areas such as Social Computing, Data Analysis, Cyber Security, and Arabic Language Technologies (e.g., Machine Learning and Translation), amongst others. Each group already has multiple publications over several years in the leading conferences in their areas, e.g., SIGMOD and VLDB for Data Analysis. I spent my time reviewing with them what I consider to be some of the most challenging issues in Big Data.

4. REFERENCES

[1] Gregory Piatetsky, Exclusive Interview: Michael Brodie, Leading Database Researcher, Industry Leader, Thinker, KDnuggets, April 2014. http://www.kdnuggets.com/2014/04/michael-brodie-database-researcher-leader-thinker.html
[2] Gregory Piatetsky, Interview (part 2): Michael Brodie on Data Curation, Cloud Computing, Startup Quality, Verizon, KDnuggets, May 2014. http://www.kdnuggets.com/2014/04/interview-michael-brodie-2-data-curation-cloud-computing-verizon.html
[3] Gregory Piatetsky, Interview (part 3): Michael Brodie on Industry Lessons, Knowledge Discovery, and Future Trends, KDnuggets, May 2014. http://www.kdnuggets.com/2014/05/interview-michael-brodie-3-industry-lessons-knowledge-discovery-trends-qcri.html
[4] www.michaelbrodie.com/michael_brodie.asp
[5] Michael Stonebraker, Are We Polishing a Round Ball? Panel Abstract, ICDE, page 606, IEEE Computer Society, 1993.
[6] Database Engines List, http://db-engines.com/en/blog_post/23
[7] Gregory Piatetsky, Exclusive: Tamr at the New Frontier of Big Data Curation, KDnuggets, May 2014. http://www.kdnuggets.com/2014/05/tamr-new-frontier-big-data-curation.html
[8] M. Stonebraker et al., Data Curation at Scale: The Data Tamer System, in CIDR 2013 (Conference on Innovative Data Systems Research).
[9] http://jisto.com/
[10] R. Field, "Disciplined Entrepreneurship: 24 Steps to a Successful Startup by Bill Aulet", Journal of Business & Finance Librarianship, vol. 19, no. 1, pp. 83-86, Jan. 2014.
[11] M. Brodie and M. Stonebraker, Legacy Information Systems Migration: The Incremental Strategy, Morgan Kaufmann Publishers, San Francisco, CA, 1995. ISBN 1-55860-330-1.
[12] M. Brodie, White House-MIT Big Data Privacy Workshop Report, KDnuggets, Mar 27, 2014. http://www.kdnuggets.com/2014/03/white-house-mit-big-data-privacy-workshop-report.html
[13] Craig Mundy, Privacy Pragmatism: Focus on Data Use, Not Data Collection, Foreign Affairs, March/April 2014.
[14] The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, IDC/EMC, April 2014.

About the authors: Gregory Piatetsky is a Data Scientist, co-founder of KDD conferences and ACM SIGKDD society for Knowledge Discovery and Data Mining, and President and Editor of KDnuggets. He tweets about Analytics, Big Data, Data Science, and Data Mining at @kdnuggets.