Intelligent Reduction Techniques for Big Data

Florin Pop*, Catalin Negru, Sorin Ciolofan, Mariana Mocanu, and Valentin
Cristea
Computer Science Department, Faculty of Automatic Control and Computers, University
Politehnica of Bucharest, Romania
e-mail: [email protected], [email protected], [email protected],
[email protected], [email protected]
* Corresponding Author
Abstract: Working with big volumes of data collected through many applications in multiple storage locations is both challenging and rewarding. Extracting valuable information from data means combining qualitative and quantitative analysis techniques. One of the main promises of analytics is data reduction, with the primary function of supporting decision-making. The motivation of this chapter comes from the new age of applications (social media, smart cities, cyber-infrastructures, environment monitoring and control, healthcare, etc.), which produce big data and many new mechanisms for data creation, rather than new mechanisms for data storage. The goal of this chapter is to analyze existing techniques for data reduction at scale, in order to facilitate Big Data manipulation and understanding. The chapter covers the following subjects: data manipulation, analytics and Big Data reduction techniques, considering descriptive analytics, predictive analytics and prescriptive analytics. The CyberWater case study is presented, referring to the monitoring, analysis and control of natural resources, especially water resources, in order to preserve water quality.
1 Introduction
There are a lot of applications that generate Big Data: social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce scientific experiments and simulations, data warehouses, monitoring technologies, e-government services, etc. Data grows rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. Decision-making is critical in real-time systems and mobile systems [33] and has an important role in business [16]. This process uses data as input, but not the whole data. So, a representative and relevant data set must be extracted from the data. This is the subject of data reduction. On the other hand, recognizing crowd-data significance is another challenge with respect to making sense of Big Data: it means distinguishing "wrong" information from "disagreeing" information and finding metrics to determine certainty [26].
Thomas H. Davenport and Jill Dyche, in their report "Big Data in Big Companies" [16], name "Analytics 3.0" the new approach that well-established big companies had to take in order to integrate Big Data infrastructure into their existing IT infrastructure (for example, Hadoop clusters that have to coexist with IBM mainframes). The "variety" aspect of Big Data is the main concern for companies (the ability to analyze new types of data, possibly unstructured, and not necessarily a focus on large volumes).
The most evident benefit of switching to Big Data is cost reduction. Big Data currently represents a research frontier, having impact in many areas, such as business, scientific research, public administration, and so on. Large datasets are produced by multiple and diverse sources (the web, sensor networks, scientific experiments, high-throughput instruments) and increase at an exponential rate [38, 48], as shown in Figure 1. UPS stores over 16 petabytes of data, tracking 16 million packages per day for 9 million customers. Due to an innovative optimization of road navigation, it saved 8.4 million gallons of fuel in 2011. A bank saved an order of magnitude in cost by buying a Hadoop cluster with 50 servers and 800 processors compared with a traditional warehouse. The second motivation for companies to use Big Data is time reduction. Macy's was able to reduce the time needed to optimize pricing for 73 million items on sale from 27 hours to 1 hour. Also, companies can run many more analytics models (100,000) than before (10, 20 or 100). Time reduction also allows reacting in real time to customer habits. The third objective is to create new Big Data specific offerings. LinkedIn launched a set of new services and features such as Groups You May Like, Jobs You May Be Interested In, and Who's Viewed My Profile. Google developed Google Plus and Google Apps. Verizon, Sprint and T-Mobile deliver services based on location data provided by mobile devices. The fourth advantage offered by Big Data to business is the support for internal decision making, considering that a lot of data coming from customer interactions is unstructured or semi-structured (web site clicks, voice recordings from call center calls, notes, video, emails, web logs, etc.). With the aid of natural language processing tools, voice can be translated into text and calls can be analyzed.
The Big Data stack is composed of storage, infrastructure (e.g. Hadoop), data (e.g. the human genome), applications, views (e.g. Hive) and visualization. The majority of big companies already have a warehouse and analytics solution in place, which now needs to be integrated with the Big Data solution. The challenge is to integrate the legacy ERP, CRM, 3rd-party apps and the data warehouse with Hadoop and new types of data (social media, images, videos, web logs, PDFs, etc.) in a way that allows efficient modeling and reporting.
Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for data reduction
models, methods and algorithms used in Big Data is needed [7, 8]. This chapter
presents the state of the art of existing solutions and creates an overview of current and near-future trends; it uses a case study as proof of concept for the presented techniques.
Fig. 1 Data deluge: the increase of data size has surpassed the capabilities of computation [38, 48].
The chapter is organized as follows. Section 2 presents the data manipulation challenges, focusing on spatial and temporal databases, key-value stores and NoSQL, data handling and data cleaning, the Big Data processing stack and processing techniques. Section 3 describes the reduction techniques: descriptive analytics, predictive analytics and prescriptive analytics. In Section 4 we present a case study focused on CyberWater, a research project aiming to create a prototype platform that uses advanced computational and communications technology to implement new frameworks for managing water and land resources in a sustainable and integrative manner. The chapter ends with Section 5, presenting conclusions.
2 Data Manipulation Challenges
This section describes the data manipulation challenges: spatial and temporal databases [39], parallel query processing, key-value stores and NoSQL, data cleaning, MapReduce, Hadoop and HDFS [12]. The processing techniques used in data manipulation, together with the Big Data stack, are presented in this section [15, 45, 42]. This part also establishes the most important aspects of data handling used in analytics.
2.1 Spatial and Temporal Databases
Spatial and temporal database systems are in close relation with other research areas of information technology. These systems integrate with other disciplines such as medicine, CAD/CAM, GIS, environmental science, molecular biology and genomics/bioinformatics. Also, they use large, real databases to store and handle large amounts of data [46]. Spatial databases are designed to store and process spatial information efficiently. Temporal databases represent attributes of objects that change over time. There are different models for spatial, temporal and spatio-temporal systems [11] (a minimal sketch of the simple time-stamping model follows the list):
• Snapshot Model - temporal aspects of data are represented as time-stamped layers;
• Space-Time Composite - every line in space and time is projected onto a spatial plane and intersected with the others;
• Simple Time Stamping - each object consists of a pair of time stamps representing the creation and deletion time of the object;
• Event-Oriented Model - events and changes made to the objects are logged into a transaction log;
• History Graph Model - each object version is identified by two timestamps describing an interval of time;
• Object-Relationship (O-R) Model - a conceptual-level representation of spatio-temporal databases;
• Spatio-temporal Object-Oriented Data Model - based on object-oriented technology;
• Moving Object Data Model - objects are viewed as 3D elements.
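As a minimal illustration of the Simple Time Stamping model above (the class and field names are our own, invented for this sketch, not part of any of the cited systems), a spatio-temporal object can be represented as a geometry together with its creation and deletion timestamps:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

@dataclass
class TimeStampedObject:
    """Simple time-stamping model: each object carries its creation time
    and an optional deletion time."""
    object_id: str
    geometry: Tuple[float, float]          # e.g. a point (longitude, latitude)
    created_at: datetime                   # creation timestamp
    deleted_at: Optional[datetime] = None  # None while the object still exists

    def alive_at(self, t: datetime) -> bool:
        """True if the object existed at time t."""
        return self.created_at <= t and (self.deleted_at is None or t < self.deleted_at)

# Example: a monitoring station created in 2012 and removed in 2014.
station = TimeStampedObject("ST-01", (26.10, 44.43),
                            datetime(2012, 3, 1), datetime(2014, 7, 15))
print(station.alive_at(datetime(2013, 1, 1)))  # True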
The systems designed considering these models manage both space and time information for several classes of applications, like: tracking of moving people or objects, management of wireless communication networks, GIS applications, traffic jam prevention, weather prediction, electronic services (e.g. e-commerce), etc.
Current research focuses on dimension reduction, which targets spatio-temporal data and processes and is achieved in terms of parameters, grouping or state¹. Guhaniyogi et al. [19] address dimension reduction in both parameter and data space. Johnson et al. [31] create clusters of sites based upon their temporal variability. Leininger et al. [34] propose methods to model extensive classification data (land use, census, and topography). Wu et al. [50], considering the problem of predicting migratory bird settling, propose a threshold vector-autoregressive model for the Conway-Maxwell Poisson (CMP) intensity parameter that allows for regime switching based on climate conditions. The goal of Dunstan et al. [17] is to study how communities of species respond to environmental changes; in this respect they classify species into one of a few archetypal forms of environmental response using regression models. Hooten et al. [27] are concerned with ecological diffusion partial differential equations (PDEs) and propose an optimal approximation of the PDE solver that dramatically improves efficiency. Yang et al. [52] are concerned with prediction in ecological studies based on high-frequency time signals (from sensing devices) and in this respect they develop nonlinear multivariate time-frequency functional models.
¹ These solutions were grouped in the Special Issue on "Modern Dimension Reduction Methods for Big Data Problems in Ecology", edited by Christopher K. Wikle, Scott H. Holan, and Mevin B. Hooten, in the Journal of Agricultural, Biological, and Environmental Statistics.
2.2 Key-Value Stores and NoSQL
The NoSQL term may refer to two different things: either the data management system is not SQL-compliant, or the accepted meaning is "Not only SQL", referring to environments that combine the SQL query language with other types of querying and access. Because NoSQL databases do not have a fixed schema model, they come with a big advantage for developers and data analysts in the process of development and analysis, as there is no need to cast every query as a relational table. Moreover, there are different NoSQL frameworks specific to different types of analysis, such as key-value stores, document stores, tabular stores, object data stores, and graph databases [36].
Key-value stores are schema-less NoSQL data stores, where values are associated with keys represented by character strings. There are four basic operations when dealing with this type of data store (a minimal sketch follows the list):
1. Put(key, value) - associates a value with the corresponding key;
2. Delete(key) - removes all the values associated with the supplied key;
3. Get(key) - returns the values for the provided key;
4. MultiGet(key1, key2, ..., keyn) - returns the list of values for the provided keys.
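To make the four operations concrete, the following is a minimal in-memory sketch of a key-value store written in Python (a plain dictionary of lists; production stores such as those discussed in [36] add persistence, distribution and replication):

from collections import defaultdict
from typing import Any, Dict, List

class SimpleKeyValueStore:
    """Toy key-value store: each key maps to a list of appended values."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Any]] = defaultdict(list)

    def put(self, key: str, value: Any) -> None:
        """Associate a value with the corresponding key (values accumulate)."""
        self._data[key].append(value)

    def delete(self, key: str) -> None:
        """Remove all the values associated with the supplied key."""
        self._data.pop(key, None)

    def get(self, key: str) -> List[Any]:
        """Return the values for the provided key (empty list if absent)."""
        return self._data.get(key, [])

    def multi_get(self, *keys: str) -> List[List[Any]]:
        """Return the list of value lists for the provided keys."""
        return [self.get(k) for k in keys]

store = SimpleKeyValueStore()
store.put("sensor:42", {"pH": 7.1})
store.put("sensor:42", {"pH": 7.3})
print(store.get("sensor:42"))                    # both readings for the key
print(store.multi_get("sensor:42", "sensor:43")) # second key is absent -> []
store.delete("sensor:42")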
Due to the simplicity of representation, a large amount of data values can be stored in this type of data store, indexed and appended to the same key. Moreover, tables can be distributed across storage nodes [36].
Although key-value stores can be used efficiently to store data resulting from algorithms such as phrase counts, they have some drawbacks. First, it is hard to maintain unique values as keys, as it becomes more and more difficult to generate unique character strings for keys. Second, this model does not provide capabilities such as consistency for multiple transactions executed simultaneously; these must be provided at the application level.
Document stores are data stores similar to key-value stores, the main difference being that values are represented by "documents" that have some structure and encoding model (e.g. XML, JSON, BSON, etc.) for the managed data. Document stores also usually provide an API for retrieving the data.
Tabular stores are stores for the management of structured data based on tables, descending from Google's Bigtable design. HBase from Hadoop is such a NoSQL data management system. In this type of data store, data is stored in a three-dimensional table that is indexed by a row key (used in a fashion similar to the key-value and document stores), a column key that indicates the specific attribute for which a data value is stored, and a timestamp that may refer to the time at which the row's column value was stored.
Madden defines Big Data in [40] as data which is "too big (e.g. petabyte order), too fast (must be quickly analyzed, e.g. fraud detection), too hard (requires analysis software to be understood)" to be processed by tools in the context of relational databases. Relational databases, especially commercial ones (Teradata, Netezza, Vertica, etc.), can handle the "too big" problem but fail to address the other two. "Too fast" means handling streaming data, which cannot be done efficiently by relational engines. In-database statistics and analytics implemented for relational databases do not parallelize efficiently for large amounts of data (the "too hard" problem). MapReduce- or Hadoop-based databases (Hive, HBase) do not solve the Big Data problems but seem to recreate DBMSs. They do not provide data management and show poor results for the "too fast" problem, since they work with big blocks of replicated data over distributed storage, thus making it difficult to achieve low latency.
Currently we face a gap between analytics tools (R, SAS, Matlab) and Hadoop or RDBMSs that can scale. The challenge is to build a bridge either by extending the relational model (the Oracle Data Mining [20] and Greenplum MAD Skills [22] efforts to include data mining, machine learning and statistical algorithms), by extending the MapReduce model (the ongoing efforts of Apache Mahout to implement machine learning on top of MapReduce), or by creating something new and different from these two (GraphLab from Carnegie Mellon [37], which is a new model tailored for machine learning, or SciDB, which aims to integrate R and Python with large data sets on disks, but these are not yet mature products). All of these have problems from the usability perspective [40].
2.3 Data Handling and Data Cleaning
Traditional technologies and techniques for data storage and analysis are not efficient anymore, as data is produced in high volumes, comes with high velocity, has high variety, and there is an imperative need for discovering valuable knowledge in order to help the decision-making process.
There are many challenges when dealing with big data handling, from data capture to data visualization. Regarding the process of data capture, data sources are heterogeneous, geographically distributed, and unreliable, being susceptible to errors. Current real-world storage solutions such as databases are consequently populated with inconsistent, incomplete and noisy data. Therefore, several data preprocessing techniques, such as data reduction, data cleaning, data integration, and data transformation, must be applied to remove noise, correct inconsistencies and help the decision-making process [21].
The design of NoSQL database systems highlights a series of advantages for data handling compared with relational database systems [25]. First, data storage and management are independent of each other. The data storage part, also called key-value storage, focuses on scalability and high performance. For the management part, NoSQL provides low-level access mechanisms, which make it possible for tasks related to data management to be implemented in the application layer, contrary to relational databases, which spread management logic across SQL or stored procedures [10].
So, NoSQL systems provide flexible mechanisms for data handling, and application developments and deployments can be easily updated [25]. Another important design advantage of NoSQL databases is the fact that they are schema-free, which permits modifying the structure of data in applications; moreover, the management layer has policies to enforce data integration and validation.
Data quality is very important, especially in enterprise systems. Data mining techniques are directly affected by data quality; poor data quality means no relevant results. Data cleaning in a DBMS means record matching, data deduplication, and column segmentation. Three operators are defined for data cleaning tasks: fuzzy lookup, which is used to perform record matching; fuzzy grouping, which is used for deduplication; and column segmentation, which uses regular expressions to segment input strings [3, 4, 5].
Big Data can offer benefits in many fields, from business to science, but only on condition that the challenges arising in data capture, data storage, data cleaning, data analysis and visualization are overcome.
2.4 Big Data Processing Stack
Finding the best method for a particular processing request behind a particular use remains a significant challenge. We can see Big Data processing as a big "batch" process that runs on an HPC cluster by splitting a job into smaller tasks and distributing the work to the cluster nodes. New types of applications, like social networking, graph analytics and complex business workflows, require data movement and data storage. A general view of a four-layer Big Data processing stack is proposed in [49] (see Figure 2). The Storage Engine provides storage solutions (hardware/software) for Big Data applications: HDFS, S3, Lustre, NFS, etc. The Execution Engine provides the reliable and efficient use of computational resources to execute jobs; this layer aggregates YARN-based processing solutions. The Programming Model offers support for application development and deployment. The High-Level Language allows modeling of queries and general data-processing tasks in easy and flexible languages (especially for non-experts).
The processing models must be aware of data locality and fairness when deciding whether to move data to the computation node or to create new computation nodes near the data. Workload optimization strategies are the key to a guaranteed profit for resource providers, by using resources to maximum capacity. For applications that are both computationally and data intensive, the processing models combine different techniques, such as in-memory Big Data processing or CPU+GPU processing.
Figure 3 describes a general stack used to define a Big Data processing platform.
Moreover, Big Data platforms face heterogeneous environments, where different systems, like Cluster, Grid, Cloud, and Peer-to-Peer, can offer support for advanced processing. At the confluence of Big Data with heterogeneity, scheduling solutions for Big Data platforms consider distributed applications designed for efficient problem solving and parallel data transfers (hiding transfer latency), together with techniques for failure management in highly heterogeneous computing systems. Handling heterogeneous data sets becomes a challenge for interoperability across various software systems.
Fig. 2 Big Data Processing Stack.
Fig. 3 Big Data Platforms Stack: an extended view.
2.5 Processing in Big Data Platforms
A general Big Data architecture basically consists of two parts: a job manager that coordinates processing nodes and a storage manager that coordinates storage nodes [35]. Apache Hadoop is a set of open source applications that are used together in order to provide a Big Data solution. In Hadoop, the two main components mentioned above are HDFS and YARN. Figure 4 presents the general architecture of a computing platform for Big Data. HDFS (Hadoop Distributed File System) is organized in clusters, where each cluster consists of a name node and several storage nodes. A large file is split into blocks, and the name node takes care of persisting the parts on data nodes. The name node maintains metadata about the files and commits updates to a file from a temporary cache to the permanent data node. The data node does not have knowledge about the full logical HDFS file; it handles each block locally as a separate file. Fault tolerance is achieved through replication, optimizing communication by considering the location of the data nodes (the ones located on the same rack are preferred). A high degree of reliability is realized using the "heartbeat" technique (for monitoring), snapshots, metadata replication, checksums (for data integrity), and rebalancing (for performance).
Fig. 4 Typical organization of resources in a big-data platform.
YARN is the name for MapReduce v2.0. It implements a master/slave execution of processes, with a JobTracker master node and a pool of TaskTrackers that do the work. The two main responsibilities of the JobTracker, resource management and job scheduling/monitoring, are split: there is a global Resource Manager (RM) and a per-application Application Master (AM). The slave is a per-node entity named Node Manager (NM), which does the computations. The AM negotiates with the RM for resources and monitors task progress. Other components are added on top of Hadoop in order to create a Big Data ecosystem capable of configuration management (ZooKeeper [29]), columnar organization (HBase [30]), data warehouse querying (Hive [28]), easier development of MapReduce programs (Pig [47]), and machine learning algorithms (Mahout [43]).
3 Big Data Reduction Techniques
This section highlights the analytics for Big Data, focusing on reduction techniques. The main reduction techniques are based on: statistical models used in data analysis (e.g. kernel estimation) [6, 23]; machine learning techniques: supervised and unsupervised learning, classification and clustering, k-means [18], multi-dimensional scaling [1]; ranking techniques: PageRank, recursive queries, etc. [44]; latent semantic analysis; filtering techniques: collaborative filtering, multi-objective filtering; self-* techniques (self-tuning, self-configuring, self-adaptive, etc.) [24]; and data mining techniques [32]. All these techniques are used for descriptive analytics, predictive analytics and prescriptive analytics.
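As a minimal sketch of clustering-based reduction (our own illustration using scikit-learn's KMeans, not the coreset construction of [18]), a large set of points can be replaced by a much smaller set of centroids weighted by cluster size:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(20_000, 4))        # large synthetic data set

# Reduce 20,000 points to 50 weighted representatives (the centroids).
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(data)
centroids = kmeans.cluster_centers_                     # reduced data set
weights = np.bincount(kmeans.labels_, minlength=50)     # points per centroid

print(centroids.shape, weights.sum())  # (50, 4) 20000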
According to [16], we can see three important stages in the evolution of analytics methods:
• Analytics 1.0 (1954-2009) - data sources were small and data was mostly internal; analytics was a batch process that took months;
• Analytics 2.0 (2005-2012) - the most important actors are Internet-based companies such as Google, eBay and Yahoo. Data is mostly external, unstructured, of huge volume, and it required parallel processing with Hadoop. It is what was named Big Data. The flow of data is much faster than in Analytics 1.0. A new category of skilled employees, called "data scientists", emerged;
• Analytics 3.0 (2012-present) - the best of traditional analytics and Big Data techniques are mixed together in order to tackle large volumes of data, both internal and external, both structured and unstructured, obtained in a continuously increasing number of different formats (new sensors are added). Hadoop and Cloud technologies are intensively used not only by online firms but also by various companies such as banks, retailers, healthcare providers, etc.
3.1 Intelligent Reduction Techniques
In Data Analysis, as part of Qualitative Research for large datasets, Content Analysis (counting the number of occurrences of a word in a text, but without considering the context) and Thematic Analysis (themes are patterns that occur repeatedly in data sets and which are important to the research question) were proposed in past decades. The first form of data reduction is to decide which data from the initial set is going to be analyzed (since not all data may be relevant, some of it can be eliminated). In this respect, some methods for categorizing data should be defined [41]:
• Structural coding - a code related to a question is applied to the responses to that question in the text. Data can then be sorted using these codes (structural coding acting as labeling).
• Frequencies - word counting can be a good method to determine repeated ideas in a text. It requires prior knowledge about the text, since one should know beforehand the keywords that will be searched. An improvement is to count not words but code applications (themes).
• Co-occurrence - more codes exist inside a segment of text. This allows Boolean queries (e.g. find segments with code A AND code B).
• Hierarchical Clustering - uses co-occurrence matrices (or code similarity matrices) as input. The goal is to derive natural groupings (clusters) in large datasets. A matrix element value v(i,j) = n means that code i and code j co-occur in n participant files (see the sketch after this list).
• Multidimensional Scaling - the input is also a similarity matrix, and ideas that are considered close to each other are represented as points with a small distance between them. In this way it is intuitive to visualize the clusters graphically.
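A minimal sketch of the hierarchical clustering step described above, using SciPy (the code names and the small co-occurrence matrix are invented for the example): co-occurrence counts are turned into distances and fed to an agglomerative linkage.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

codes = ["water_quality", "pollution", "alerts", "costs"]
# v[i, j] = number of participant files in which code i and code j co-occur.
cooccurrence = np.array([
    [9, 7, 1, 0],
    [7, 9, 2, 1],
    [1, 2, 9, 6],
    [0, 1, 6, 9],
], dtype=float)

# Convert similarity (co-occurrence) into a distance matrix in [0, 1].
similarity = cooccurrence / cooccurrence.max()
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)

# Agglomerative clustering on the condensed distance matrix.
tree = linkage(squareform(distance, checks=False), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
for code, label in zip(codes, labels):
    print(code, "-> cluster", label)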
Big Data does not raise only engineering concerns (how to effectively manage the large volume of data) but also semantic concerns (how to extract meaningful information regardless of implementation- or application-specific aspects).
A meaningful data integration process requires the following stages, not necessarily in this order [8]:
• Define the problem to be resolved;
• Search the data to find the candidate datasets that meet the problem criteria;
• ETL (Extract, Transform and Load) the appropriate parts of the candidate data for future processing;
• Entity Resolution checks whether data is unique, comprehensive and relevant;
• Answer the problem: perform computations to give a solution to the initial problem.
Using the Web of Data, which according to some statistics contains 31 billion RDF triples, it is possible to find all data about people and their creations (books, films, musical creations, etc.), translate the data into a single target vocabulary, discover all resources about a specific entity and then integrate this data into a single coherent representation. RDF and Linked Data (such as the pre-crawled web data sets BTC 2011, with 2 billion RDF triples, or Sindice 2011, with 11 billion RDF triples extracted from 280 million web pages annotated with RDF) are schema-less models that suit Big Data, considering that less than 10% is genuinely relational data. The challenge is to combine DBMSs with reasoning (the next smart databases) that goes beyond OWL, RIF or SPARQL, and for this reason use cases are needed from the community in order to determine exactly what requirements the future DB must satisfy. A web portal should allow people to search keywords in ontologies, in the data itself and in mappings created by users [8].
3.2 Descriptive analytics
Descriptive analytics is oriented on descriptive statistics (counts, sums, averages, percentages, min, max and simple arithmetic) that summarize certain groupings or filtered subsets of the data; these are typically simple counts of some functions, criteria or events, for example the number of posts on a forum, the number of likes on Facebook or the number of sensors in a specific area. The techniques behind descriptive analytics are standard aggregation in databases, filtering techniques and basic statistics. Descriptive analytics applies filters on the data before applying specific statistical functions. We can use geo-filters to get metrics for a geographic region (a country) or temporal filters to extract data only for a specific period of time (a week). More complex descriptive analytics are dimensionality reduction or stochastic variation.
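A minimal pandas sketch of such descriptive analytics (the column names and values are invented for the example): a geo-filter and a temporal filter are applied before the basic aggregations are computed.

import pandas as pd

# Toy sensor readings: region, timestamp and a measured indicator.
readings = pd.DataFrame({
    "region": ["RO", "RO", "DE", "RO", "DE"],
    "timestamp": pd.to_datetime([
        "2014-05-01", "2014-05-03", "2014-05-04", "2014-06-10", "2014-05-20"]),
    "ph": [7.1, 7.4, 6.9, 7.8, 7.0],
})

# Geo-filter (one country) and temporal filter (one week), then summarize.
week = (readings["timestamp"] >= "2014-05-01") & (readings["timestamp"] < "2014-05-08")
subset = readings[(readings["region"] == "RO") & week]
print(subset["ph"].agg(["count", "mean", "min", "max"]))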
Dimensionality reduction is an important tool in information analysis. Scaling down data dimensions is also important in recognition and classification processes. It is important to notice that sparse local operators, which imply less than quadratic complexity, and faithful multi-scale models make the design of a dimension reduction procedure a delicate balance between modeling accuracy and efficiency. Moreover, the efficiency of dimension reduction tools is measured in terms of memory and computational complexity. The authors of [2] provide theoretical support and demonstrate that by working in the natural eigen-space of the data one can reduce the process complexity while maintaining the model fidelity.
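A minimal NumPy sketch of eigen-space dimensionality reduction in the spirit of this observation (plain PCA on a synthetic matrix, not the spectral MDS method of [2]): the data is projected onto the leading eigenvectors of its covariance matrix.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 20))           # 5,000 samples, 20 features

# Eigen-decomposition of the covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort eigenpairs, largest first

k = 3                                      # keep the 3 leading directions
X_reduced = Xc @ eigvecs[:, order[:k]]
print(X_reduced.shape)                     # (5000, 3)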
Stochastic variational inference is used for Gaussian process models in order to enable the application of Gaussian process (GP) models to data sets containing millions of data points. The key finding of the paper is that GPs can be decomposed so that they depend on a set of globally relevant inducing variables, which factorize the model in the manner necessary to perform variational inference. These expressions allow the transfer of a multitude of Gaussian process techniques to big data [23].
3.3 Predictive analytics
Predictive analytics, which relies on probabilistic techniques, refers to: (i) temporal predictive models that can be used to summarize existing data and then to extrapolate to the future, where data does not exist; (ii) non-temporal predictive models (e.g. a model that, based on someone's existing social media activity data, will predict his/her potential to influence [9]; or sentiment analysis). The most challenging aspect here is to validate the model in the context of Big Data analysis. One example of such a model, based on clustering, is presented in the following.
A novel technique for effectively processing big graph data on the cloud overcomes the challenges raised when data is processed in heterogeneous environments, such as parallel memory bottlenecks, deadlocks and inefficiency. The data is compressed based on spatial-temporal features, exploring correlations that exist in spatial data. Taking those correlations into consideration, graph data is partitioned into clusters where the workload can be shared by inference based on time series similarity. The clustering algorithm compares the data streams according to the topology of real-world streaming data graphs. Furthermore, because the data items in streaming big data sets are heterogeneous and carry very rich order information themselves, an order compression algorithm is developed to further reduce the size of big data sets. The clustering algorithm runs on the cluster head. It takes a time series set X and a similarity threshold as inputs; the output is a clustering result specifying each cluster-head node and its related leaf nodes [51].
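The following is a small sketch of the general idea of threshold-based time series clustering (our own simplified illustration, not the algorithm of [51]): each series either joins the first cluster head it is sufficiently correlated with or becomes a new cluster head.

import numpy as np

def cluster_time_series(series, threshold=0.9):
    """Greedy clustering: a series joins the first cluster head whose
    Pearson correlation with it exceeds the threshold, otherwise it
    becomes a new cluster head. Returns {head_index: [leaf indices]}."""
    clusters = {}
    for i, x in enumerate(series):
        for head, leaves in clusters.items():
            if np.corrcoef(series[head], x)[0, 1] >= threshold:
                leaves.append(i)
                break
        else:
            clusters[i] = []
    return clusters

rng = np.random.default_rng(2)
base = np.sin(np.linspace(0, 6 * np.pi, 200))
X = [base + 0.05 * rng.normal(size=200),      # similar to the base signal
     base + 0.05 * rng.normal(size=200),
     rng.normal(size=200)]                    # unrelated noise
print(cluster_time_series(X, threshold=0.9))  # e.g. {0: [1], 2: []}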
The prediction models used by predictive analytics should have the following properties: simplicity (a simple mathematical model for a time series), flexibility (the possibility to configure and extend the model), visualization (the evolution of the predicted values can be seen in parallel with the real measured values) and computation speed (considering full vectorization techniques for array operations). Let us consider a data series V_1, V_2, \ldots, V_n extracted by a descriptive analytics technique. For the prediction problem, let P(V_{t+1}) denote the predicted value for the moment t+1 (the next value). This value is:

P(V_{t+1}) = f(V_t, V_{t-1}, \ldots, V_{t-window}),

where window represents a specific interval with window+1 values and f can be a linear function, such as the mean, median or standard deviation, or a complex function that uses bio-inspired techniques (an adaptive one or a method based on neural networks).
The linear prediction can be expressed as follows:

P(V_{t+1}) = f_w = \frac{1}{window+1} \sum_{i=0}^{window} w_i V_{t-i},

where w = (w_i)_{0 \le i \le window} is a vector of weights. If w_i = 1 for all i, then we obtain the mean function. It is possible to consider a specific distribution of weights, for example w_i = \alpha^{t-i}.
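A minimal NumPy sketch of the linear predictor above (our own implementation of the formula; the sample series, the window size and the geometric weighting are invented for the example):

import numpy as np

def predict_next(values, weights=None, window=4):
    """P(V_{t+1}) = (1 / (window + 1)) * sum_{i=0}^{window} w_i * V_{t-i}.
    With unit weights this reduces to the mean of the last window+1 values."""
    recent = np.asarray(values[-(window + 1):], dtype=float)[::-1]  # V_t, V_{t-1}, ...
    w = np.ones_like(recent) if weights is None else np.asarray(weights, dtype=float)
    return float(np.dot(w, recent) / (window + 1))  # division follows the formula above

series = [7.0, 7.2, 7.1, 7.4, 7.3, 7.5]               # e.g. pH measurements
print(predict_next(series))                            # unit weights: plain mean
alpha = 0.8                                            # illustrative geometric weights
print(predict_next(series, weights=alpha ** np.arange(5)))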
Predictive analytics is very useful for estimating future behavior, especially when the data is not accessible (it is not possible to obtain it or to predict it) or is too expensive (e.g. in money or time) to measure or to compute. The main challenge is to validate the predicted data. One solution is to wait for the real value (in the future) to measure the error, and then to propagate it in the system in order to improve future behavior. Another solution is to measure the impact of the predicted data in the applications that use it.
3.4 Prescriptive analytics
Prescriptive analytics predicts multiple futures based on the decision maker's actions. A predictive model of the data is created with two components: an actionable one (decision-making support) and a feedback system (which tracks the outcome of the decisions made). Prescriptive analytics can be used for recommendation systems, because it is possible to predict the consequences based on the predictive models used. A self-tuning database system is an example that we present in the following.
Starfish is a self-tuning system for big data analytics, built on Hadoop. This system is designed in the spirit of self-tuning database systems [24]. Cohen et al. proposed the acronym MAD (Magnetism, Agility, and Depth) in order to express the features that users expect from a system for big data analytics [14]. Magnetism is the property of a system that attracts all sources of data, regardless of different issues (e.g. the possible presence of outliers, unknown schema, lack of structure, missing values) that keep many data sources out of conventional data warehouses. Agility is the property of adaptation of systems in sync with rapid data evolution. Depth is the property of a system that supports analytics needs going far beyond conventional rollups and drilldowns, to complex statistical and machine-learning analysis. Hadoop is a MAD system that is very popular for big data analytics. This type of system poses new challenges on the path to self-tuning, such as: data opacity until processing, file-based processing, and heavy use of programming languages.
Furthermore, three more features in addition to MAD are becoming important in analytics systems: data-lifecycle awareness, elasticity, and robustness. Data-lifecycle awareness means optimizing the movement, storage, and processing of big data during its entire lifecycle, going beyond query execution. Elasticity means adjusting resource usage and operational costs to the workload and user requirements. Robustness means that this type of system continues to provide service, possibly with graceful degradation, in the face of undesired events like hardware failures, software bugs, and data corruption.
The Starfish system provides tuning at multiple levels: job-level tuning, workflow-level tuning, and workload-level tuning. The novelty in Starfish's approach comes from how it focuses simultaneously on different workload granularities (overall workload, workflows, and jobs, both procedural and declarative) as well as across various decision points (provisioning, optimization, scheduling, and data layout). This approach enables Starfish to handle the significant interactions arising among choices made at different levels [24].
To evaluate a prescriptive analytics model we need a feedback system (to track the adjusted outcome based on the actions taken) and a model for taking actions (based on the predicted outcome and on the feedback). We define several metrics for evaluating the performance of prescriptive analytics. Precision is the fraction of the retrieved data that is relevant to the user's information need. Recall is the fraction of the data relevant to the query that is successfully retrieved. Fall-out is the proportion of non-relevant data that is retrieved, out of all non-relevant data available. The F-measure is the weighted harmonic mean of precision and recall:
F_{measure} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.

The general formula for this metric is:

F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}.
This metric measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision. As a general conclusion, we can summarize the actions performed by the three types of analytics as follows: descriptive analytics summarizes the data (data reduction, sums, counts, aggregation, etc.), predictive analytics predicts data that we do not have (influence scoring, trends, social analysis, etc.), and prescriptive analytics guides the decision making towards a specific outcome.
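A small sketch of these retrieval metrics in Python (our own helper function; the sets in the example are illustrative):

def retrieval_metrics(retrieved: set, relevant: set, total: int, beta: float = 1.0):
    """Compute precision, recall, fall-out and F_beta for a retrieval result.
    `total` is the size of the whole collection (relevant + non-relevant)."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    non_relevant = total - len(relevant)
    fall_out = (len(retrieved - relevant) / non_relevant) if non_relevant else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fall_out, f_beta

# 8 retrieved items, 10 relevant items in a collection of 100.
print(retrieval_metrics(set(range(8)), set(range(5, 15)), total=100, beta=1.0))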
4 CyberWater Case Study
In this section we present the case study on CyberWater [13], a research project aiming to create a prototype platform that uses advanced computational and communications technology to implement new frameworks for managing water and land resources in a sustainable and integrative manner. The main focus of this effort is on acquiring diverse data from various data sources into a common digital platform, which is closely related to Big Data; the data is subsequently used for routine decision making in normal conditions and for providing assistance in critical situations related to water, such as accidental pollution or flooding, which is an analytics subject.
The CyberWater system monitors natural resources and water-related events, shares data compliant with the INSPIRE Directive, and alerts relevant stakeholders about critical situations. In the area where we conduct the measurements (the Somes and Dambovita rivers in Romania), certain types of chemical and physical indicators are of interest: pH, turbidity, alkalinity, conductivity, total phenols, dissolved oxygen, N-NH4, N-NO2, N-NO3, total N, P-PO4, and magnesium.
The multi-tier system presented in Figure 5 is composed of a data layer, a processing layer and a visualization layer. Water resources management requires the processing of a huge amount of information with different levels of accessibility and availability and in various formats. The data collected in CyberWater is derived from various sources: (i) measured data (from sensors) gathered through the Sensor Observation Service (SOS) standard, whose schema is predefined by the OGC standard; information about sensors is transmitted in the XML-based format SensorML, and the actual observational data is sent in the O&M format; (ii) predicted data; (iii) modeled data from the propagation module; (iv) subscriber data, which holds information about subscribers and their associated notification services.
Fig. 5 Multi-tier architecture of CyberWater system.
Figure 6 describes the data reduction in the CyberWater system. Data collected from various sources is used to describe the model of propagation of pollutants (a geo-processing tool based on the river bed profile and fluid dynamics equations). This is the Descriptive Analytics phase, where only relevant data from the system repository is used. Phase (2), Predictive Analytics, is represented by the prediction of the next value of a monitored chemical/physical indicator. Then, the decision support that triggers alerts to relevant actors of the system is the subject of Prescriptive Analytics. The main application that generates alerts is a typical Publish/Subscribe application and is the place where the alert services are defined and implemented.
Fig. 6 CyberWater: from Descriptive Analytics to Prescriptive Analytics.
5 Conclusion
Qualitative data used to produce relevant information is obtained by reducing the big amount of data collected and aggregated over time, in different locations. The main role of reduction techniques is to extract as much relevant data as possible that characterizes all the analyzed data. In this chapter we gave an overview of data manipulation challenges and of reduction techniques: descriptive analytics, predictive analytics and prescriptive analytics. We presented a case study on water resources management that explains the use of reduction techniques for Big Data. As a recommendation about the use of reduction techniques for Big Data, we can draw the following line: collect and clean data, extract the relevant information using descriptive analytics, estimate the future using predictive analytics and take decisions to move on using prescriptive analytics; the loop closes by adding valuable information back into the system.
Acknowledgments: The research presented in this paper is supported by the following projects: CyberWater grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, project number 47/2012; clueFarm: Information system based on cloud services accessible through mobile devices, to increase product quality and business development farms - PN-II-PT-PCCA-2013-4-0870; "SideSTEP - Scheduling Methods for Dynamic Distributed Systems: a self-* approach", ID: PN-II-CT-RO-FR-2012-1-0084; MobiWay: Mobility Beyond Individualism: an Integrated Platform for Intelligent Transportation Systems of Tomorrow - PN-II-PT-PCCA-2013-4-0321.
References
1. Yonathan Aflalo and Ron Kimmel. Spectral multidimensional scaling. Proceedings of the National Academy of Sciences, 2013.
2. Yonathan Aflalo and Ron Kimmel. Spectral multidimensional scaling. Proceedings of the National Academy of Sciences, 110(45):18052-18057, 2013.
3. Parag Agrawal, Arvind Arasu, and Raghav Kaushik. On indexing error-tolerant set containment. In Proceedings of the 2010 ACM SIGMOD International Conference on Management
of Data, SIGMOD '10, pages 927-938, New York, NY, USA, 2010. ACM.
4. Arvind Arasu, Michaela Gotz, and Raghav Kaushik. On active learning of record matching
packages. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 783-794, New York, NY, USA, 2010. ACM.
5. Arvind Arasu, Christopher Re, and Dan Suciu. Large-scale deduplication with constraints using dedupalog. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE '09, pages 952-963, Washington, DC, USA, 2009. IEEE Computer Society.
6. Sudipto Banerjee, Alan E Gelfand, Andrew O Finley, and Huiyan Sang. Gaussian predictive
process models for large spatial data sets. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 70(4):825-848, 2008.
7. Ewan Birney. The making of encode: lessons for big-data projects. Nature, 489(7414):49-51,
2012.
8. Christian Bizer, Peter Boncz, Michael L. Brodie, and Orri Erling. The meaningful use of big
data: Four perspectives-four challenges. SIGMOD Rec., 40(4):56-60, January 2012.
9. Erik Cambria, Dheeraj Rajagopal, Daniel Olsher, and Dipankar Das. Big social data analysis.
Big Data Computing, pages 401-414, 2013.
10. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4,
2008.
11. Cindy X. Chen. Spatio-temporal databases. In Encyclopedia of GIS, pages 1121-1121. Springer, 2008.
12. Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. Proc. VLDB Endow., 5(12):1802-1813, August 2012.
13. Sorin Nicolae Ciolofan, Mariana Mocanu, and Anca Ionita. Distributed cyberinfrastructure
for decision support in risk related environments. In Parallel and Distributed Computing
(ISPDC), 2013 IEEE 12th International Symposium on, pages 109-115, June 2013.
14. Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. Mad
skills: New analysis practices for big data. Proc. VLDB Endow., 2(2):1481-1492, August
2009.
15. Alfredo Cuzzocrea, Il-Yeol Song, and Karen C. Davis. Analytics over large-scale multidimensional data: The big data revolution! In Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, DOLAP '11, pages 101-104, New York, NY, USA,
2011. ACM.
16. Thomas H Davenport and Jill Dyche. Big data in big companies. International Institute for
Analitycs, 2013.
17. Piers K. Dunstan, Scott D. Foster, Francis K. C. Hui, and David I. Warton. Finite mixture of regression modeling for high-dimensional count and biomass data in ecology. Journal of Agricultural, Biological, and Environmental Statistics, 18(3):357-375, 2013.
18. Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '13, pages 1434-1453. SIAM, 2013.
19. Rajarshi Guhaniyogi, Andrew Finley, Sudipto Banerjee, and Richard Kobe. Modeling complex spatial dependencies: Low-rank spatially varying cross-covariances with application to
soil nutrient data. Journal of Agricultural, Biological, and Environmental Statistics,
18(3):274-298, 2013.
20. Carolyn Hamm and Donald K. Burleson. Oracle Data Mining: Mining Gold from Your
Warehouse (Oracle In-Focus Series). Rampant TechPress, 2006.
21. Jiawei Han and Micheline Kamber. Data Mining, Southeast Asia Edition: Concepts and
Techniques. Morgan kaufmann, 2006.
22. Joseph M. Hellerstein, Christopher Re, Florian Schoppmann, Daisy Zhe Wang, Eugene
Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun
Kumar. The madlib analytics library: Or mad skills, the sql. Proc. VLDB Endow.,
5(12):1700-1711, August 2012.
23. James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv
preprint arXiv:1309.6835, 2013.
24. Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen
Cetin, and Shivnath Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11, pages 261-272, 2011.
25. Martin Hilbert and Priscila Lopez. The world's technological capacity to store, communicate,
and compute information. Science, 332(6025):60-65, 2011.
26. Diem Ho, Charles Snow, Børge Obel, Pernille Dissing Sørensen, and Pernille Kallehave. Unleashing the potential of big data. Technical report, Organizational Design Community, 2013.
27. Mevin B. Hooten, Martha J. Garlick, and James A. Powell. Computationally efficient statistical
differential equation modeling using homogenization. Journal of Agricultural, Biological, and
Environmental Statistics, 18(3):405-428, 2013.
28. Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen
O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. Major technical
advancements in apache hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1235-1246, New York, NY, USA,
2014. ACM.
29. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: Waitfree coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference
on USENIX Annual Technical Conference, USENIXATC'10, pages 11-11, Berkeley, CA,
USA, 2010. USENIX Association.
30. Yifeng Jiang. HBase Administration Cookbook. Packt Publishing, 2012.
31. Devin S. Johnson, Rolf R. Ream, Rod G. Towell, Michael T. Williams, and Juan D. Leon Guerrero. Bayesian clustering of animal abundance trends for inference and dimension reduction.
Journal of Agricultural, Biological, and Environmental Statistics, 18(3):299-313, 2013.
32. Mehmed Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. WileyIEEE Press, 2nd edition, 2011.
33. Juha K Laurila, Daniel Gatica-Perez, Imad Aad, Jan Blom, Olivier Bornet, T. Do, Olivier
Dousse, Julien Eberle, and Markus Miettinen. The mobile data challenge: Big data for mobile
computing research. In Mobile Data Challenge by Nokia Workshop, 2012.
34. Thomas J. Leininger, Alan E. Gelfand, Jenica M. Allen, and John A. Silander, Jr. Spatial regression modeling for compositional data with many zeros. Journal of Agricultural, Biological, and Environmental Statistics, 18(3):314-334, 2013.
35. David Loshin. Chapter 7 - big data tools and techniques. In David Loshin, editor, Big Data
Analytics, pages 61-72. Morgan Kaufmann, Boston, 2013.
36. David Loshin. Chapter 9 - nosql data management for big data. In David Loshin, editor, Big
Data Analytics, pages 83-90. Morgan Kaufmann, Boston, 2013.
37. Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph
M. Hellerstein. Distributed graphlab: A framework for machine learning and data mining in
the cloud. Proc. VLDB Endow., 5(8):716-727, April 2012.
38. Clifford Lynch. Big data: How do your data grow? Nature, 455(7209):28-29, 2008.
39. Sam Madden. From databases to big data. IEEE Internet Computing, 16(3):4-6, 2012.
40. Sam Madden. From databases to big data. IEEE Internet Computing, 16(3):4-6, May 2012.
41. Emily Namey, Greg Guest, Lucy Thairu, and Laura Johnson. Data reduction techniques for
large qualitative data sets. Handbook for team-based qualitative research, pages 137-162,
2007.
42. Catalin Negru, Florin Pop, Valentin Cristea, Nik Bessis, and Jing Li. Energy efficient cloud
storage service: Key issues and challenges. In Proceedings of the 2013 Fourth International
Conference on Emerging Intelligent Data and Web Technologies, EIDWT '13, pages 763766, Washington, DC, USA, 2013. IEEE Computer Society.
43. Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning
Publications Co., Greenwich, CT, USA, 2011.
44. Florin Pop, Radu-Ioan Ciobanu, and Ciprian Dobre. Adaptive method to support social-based
mobile networks using a pagerank approach. Concurrency and Computation: Practice and
Experience, 2013.
45. Sriram Rao, Raghu Ramakrishnan, Adam Silberstein, Mike Ovsiannikov, and Damian
Reeves. Sailfish: A framework for large scale data processing. In Proceedings of the Third
ACM Symposium on Cloud Computing, SoCC '12, pages 4:1-4:14, New York, NY, USA,
2012. ACM.
46. John F. Roddick, Erik Hoel, Max J. Egenhofer, Dimitris Papadias, and Betty Salzberg. Spatial, temporal and spatio-temporal databases - hot issues and directions for phd research.
SIGMOD Rec., 33(2):126-131, June 2004.
47. Weiyi Shang, Bram Adams, and Ahmed E. Hassan. Using pig as a data preparation language
for large-scale mining software repositories studies: An experience report. J. Syst. Softw.,
85(10):2195-2204, October 2012.
48. A. Szalay and J. Gray. 2020 computing: Science in an exponential world. Nature, 440:413-414, 2006.
49. Ana Lucia Varbanescu and Alexandru Iosup. On many-task big data processing: from gpus
to clouds. In MTAGS Workshop, held in conjunction with ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1-8.
ACM, 2013.
50. Guohui Wu, Scott H. Holan, and Christopher K. Wikle. Hierarchical Bayesian spatio-temporal Conway-Maxwell Poisson models with dynamic dispersion. Journal of Agricultural, Biological, and Environmental Statistics, 18(3):335-356, 2013.
51. Chi Yang, Xuyun Zhang, Changmin Zhong, Chang Liu, Jian Pei, Kotagiri Ramamohanarao,
and Jinjun Chen. A spatiotemporal compression based approach for efficient big data processing
on cloud. Journal of Computer and System Sciences, 2014.
52. Wen-Hsi Yang, Christopher K. Wikle, Scott H. Holan, and Mark L. Wildhaber. Ecological
prediction with nonlinear multivariate time-frequency functional data models. Journal of
Agricultural, Biological, and Environmental Statistics, 18(3):450-474, 2013.