Standard PowerPoint Presentation

Data Warehouse design
Design of Enterprise Systems
University of Pavia
10/12/2013 2h for the first; 2h for hadoop
- 1-
Table of Contents
Big Data Overview
Big Data DW & BI
Big Data Market
Hadoop & Mahout
- 2-
BIG DATA OVERVIEW
- 3-
Big Data Overview: Table of Contents
 Data Growth
 Definition
 Big Data vs. Relational Data
 Its Value
 Big Data Benefits
 Big Data Usage
 Challenges
- 4-
Big Data Overview: Data Growth
[Two charts, both titled "Data Storage Growth"; y-axis: Exabytes, x-axis: Years]
 Storage capacity increases 23% on average annually
 Exponential growth over a decade, starting from 2010
 It is no longer possible to store all the available information
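The 23% figure compounds from year to year. A minimal sketch of that arithmetic, assuming a constant annual rate; the starting capacity of 100 exabytes is a made-up value for illustration, not a figure read from the charts.

```python
import math

ANNUAL_GROWTH = 0.23  # average annual increase in storage capacity (from the slide)


def projected_capacity(start_exabytes: float, years: int,
                       rate: float = ANNUAL_GROWTH) -> float:
    """Capacity after `years` of compound growth at `rate` per year."""
    return start_exabytes * (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for capacity to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    start = 100.0  # hypothetical starting capacity, in exabytes
    for year in (1, 5, 10):
        print(f"after {year:2d} years: {projected_capacity(start, year):8.1f} EB")
    print(f"doubling time: {doubling_time():.2f} years")
```

At a steady 23% per year, capacity doubles roughly every three and a third years, which is slower than an exponentially growing data volume can be absorbed.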
- 5-
Big Data Overview: Definition
 Gartner definition (2012): "Big data is high volume,
high velocity, and/or high variety information assets
that require new forms of processing to enable
enhanced decision making, insight discovery and
process optimization."
- 6-
Big Data Overview: Big Data vs. Relational Data
Application | Relation-Based Data | Big Data
Data processing | Single-computer platform that scales with better CPUs; centralized processing. | Cluster platforms that scale to thousands of nodes; distributed processing.
Data management | Relational database (SQL); centralized storage. | Non-relational databases that manage varied data types and formats (NoSQL); distributed storage.
Analytics | Batched, descriptive, centralized. | Real-time, predictive and prescriptive, distributed analytics.
- 7-
Big Data Overview: Its Value 1/3
Several classes of company head the revenue
chart ($11.59 billion):
 broad-portfolio tech giants
(IBM, HP, Oracle, EMC)
 leading software houses
(Teradata, SAP, Microsoft)
 professional services
companies (PwC, Accenture)
Source: Wikibon, Big Data
Vendor Revenue and Market
Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 8-
Big Data Overview: Its Value 2/3
 Pure play: vendors who
derive 100 percent of
their revenue from this
market
Source: Wikibon, Big Data
Vendor Revenue and
Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 9-
Big Data Overview: Its Value 3/3
 IDC: Big data will become a
$17 billion business by
2015 ($23.8 billion by
2016)
 Big data storage will
account for 6.8% of the
entire worldwide storage
market by 2015
Source: Worldwide Big Data Technologies and
Services: 2012-2015 Forecast (IDC, 2012)
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 10-
Big Data Overview: Big Data Benefits
Business benefits received by implementing an effective Big Data
methodology. The survey is based on 1153 responses from 325 respondents
- 11-
Big Data Overview: Big Data Usage 1/2
 E-Commerce and Market Intelligence
– Recommender system
– Social media monitoring and analysis
– Crowd-sourcing systems
– Social and virtual games
 E-Government and Politics 2.0
– Ubiquitous government services
– Equal access and public services
– Citizen engagement
 Science & Technology
– S&T innovation
– Hypothesis testing
– Knowledge discovery
 Smart Health and Wellbeing
– Human and plant genomics
– Healthcare decision support
– Patient community analysis
 Security and Public Safety
– Crime analysis
– Computational criminology
– Terrorism informatics
– Open-source intelligence
– Cyber security
- 12-
Big Data Overview: Big Data Usage 2/2
Survey of European companies from Steria's Business Intelligence Maturity Audit (biMA)
- 13-
Big Data Overview: Challenges 1/2
Main challenges that companies face with Big Data. The survey is based on
1153 responses from 325 respondents
- 14-
Big Data Overview: Challenges 2/2
A Survey of European
companies from Steria's
Business Intelligence Maturity
Audit (biMA)
 Technical
– 38% have data quality
problems
– A lack of data
governance; no master
data management
system (38%)
 Organizational
– 72% have no BI strategy;
70% have no BI governance
– Only 7% rate big data as
very relevant
Source: http://www.steria.com/uk/media-centre/press-releases/press-releases/article/survey-suggests-only-7of-european-companies-rate-big-data-as-very-relevant-to-their-business/
- 15-
BIG DATA, DW & BI
- 16-
Big Data, DW & BI: Table of Contents
 Evolution
 Techniques
 Cost
 Best Practices
- 17-
BI Evolution
BI and Analytics: evolution and characteristics
BI&A 1.0
 Key characteristics: DBMS-based, structured content
– RDBMS & data warehousing
– ETL & OLAP
– Dashboards & scorecards
– Data mining & statistical analysis
 Gartner BI Platforms Core Capabilities
– Ad hoc query & search-based BI
– Reporting, dashboards & scorecards
– OLAP
– Interactive visualization
– Predictive modeling & data mining
 Gartner Hype Cycle
– Column-based DBMS
– In-memory DBMS
– Real-time decision
– Data mining workbenches
BI&A 2.0
 Key characteristics: Web-based, unstructured content
– Information retrieval and extraction
– Opinion mining
– Question answering
– Web analytics and web intelligence
– Social media analytics
– Social network analysis
– Spatial-temporal analysis
 Gartner Hype Cycle
– Information semantic services
– Natural language question answering
– Content & text analytics
BI&A 3.0
 Key characteristics: Mobile and sensor-based content
– Location-aware analysis
– Person-centered analysis
– Context-relevant analysis
– Mobile visualization & HCI
 Gartner Hype Cycle
– Mobile BI
- 18-
Big Data Overview: Techniques 1/2
McKinsey Global Institute in 2011 provided a list of the top 10 common
techniques applicable across a range of industries, particularly in response to
the need to analyze new amounts of data and their combination.
List of the top 10 techniques which require Big Data (1/2)
A/B Testing
A technique in which a control group is compared with a variety of test groups
in order to determine what treatments will improve a given objective. An example
application is determining what copy text, layouts, images, or colors will
improve conversion rates on an e-commerce Web site. Big Data enables huge
numbers of tests to be executed and analyzed.
Cluster Analysis
A statistical method for grouping the items of a huge data set, in particular
to identify common behavior.
Classification
A set of techniques to identify the categories in which new data points belong,
based on a training set containing data points that have already been categorized.
Data Mining
A set of techniques and technologies for extracting patterns from large datasets
by combining methods from statistics and algorithms. These techniques include
association rule learning, cluster analysis, classification, and regression.
- 19-
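The A/B testing technique listed above can be illustrated with a small significance check. This is a minimal sketch, assuming a single control group (A) and a single test group (B) compared with a two-proportion z-test; the function name and the conversion counts in the usage note are hypothetical.

```python
import math


def ab_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_a = conv_a / n_a  # conversion rate of the control group
    p_b = conv_b / n_b  # conversion rate of the test group
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value of the standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 200 of 10,000 visitors convert on A, 260 of 10,000 on B.
    z, p = ab_test(200, 10_000, 260, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

With many treatments tested at once, as Big Data makes possible, the p-value threshold would normally be adjusted for multiple comparisons; the sketch leaves that out.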
Big Data Overview: Techniques 2/2
McKinsey Global Institute in 2011 provided a list of the top 10 common
techniques applicable across a range of industries, particularly in response to
the need to analyze new amounts of data and their combination.
List of the top 10 techniques which require Big Data (2/2)
Network analysis
A set of techniques used to characterize relationships among discrete
nodes in a graph or a network. In social network analysis, connections
between individuals in a community or organization are analyzed.
Predictive modeling
A set of techniques in which a mathematical model is created or
chosen to best predict the probability of an outcome.
Sentiment analysis
Application of natural language processing and other analytic
techniques to identify and extract subjective information from source text
material.
Statistics
The science of the collection, organization, and interpretation of data,
including the design of surveys and experiments. Statistical techniques
are often used to understand the relationships between all the variables.
Visualization
Techniques used to create images, diagrams or animations, usually
integrated in more complex dashboards.
- 20-
Big Data: Cost 1/2
 ESG (Enterprise Strategy Group) provides an analysis on the costs of Big Data, in
particular a comparison between a “build” and “buy” solution.
Item | Cost | Notes
Servers | $400,000 | @$22k each; enterprise class with dual power supplies, 36TB of serial attached SCSI (SAS) storage, 48-64 gigabytes memory, 1 rack
Server support | $60,000 | @15% of server cost
Switches | $15,000 | 3 @ $5k for InfiniBand; older network switches will run at least 3x the cost of InfiniBand
Distribution/systems management software | $90,000 | Cloudera: 18 nodes @ $5k each
Integration
Information Management Tools
$100,000
Licenses and dedicated hardware
$20,000
320 hours @ $100/hour human cost
Node Configuration and Implementation | $16,000 | 8 hours/node, 20 nodes = 160 hours, $100/hour
Build Project Costs | $733,000 | Those project items where a "buy" option exists
Build Versus Buy Elements (Using Build Pricing)
- 21-
Big Data: Cost 2/2
 ESG (Enterprise Strategy Group) provides an analysis on the costs of Big Data, in
particular a comparison between a “build” and “buy” solution.
Item | Cost | Notes
Build Total | $733,000 |
Buy (Oracle Big Data Appliance) | $450,000 | Cost of Oracle Big Data Appliance for same infrastructure and tasks costs (list)
Buy (Oracle Big Data Appliance) Savings | $283,000 | Not lifecycle costs, just for initial project
ESG Estimated Savings | ~39% | Oracle Big Data Appliance lowers costs versus do-it-yourself
Build Versus Buy Elements (Using Buy Pricing)
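The savings figures follow directly from the two totals, and several line items of the build estimate can be recomputed from their notes. A minimal sketch of that arithmetic, using only numbers stated in the two cost tables; the variable and function names are illustrative.

```python
BUILD_TOTAL = 733_000  # "build" project cost (USD)
BUY_TOTAL = 450_000    # Oracle Big Data Appliance list cost (USD)


def savings(build: int, buy: int) -> tuple[int, float]:
    """Return absolute savings and savings as a fraction of the build cost."""
    saved = build - buy
    return saved, saved / build


# Line items whose notes give enough detail to recompute them.
server_support = 400_000 * 15 // 100  # 15% of the server cost
switches = 3 * 5_000                  # 3 InfiniBand switches @ $5k
distribution = 18 * 5_000             # Cloudera: 18 nodes @ $5k each
node_configuration = 8 * 20 * 100     # 8 hours/node, 20 nodes, $100/hour

if __name__ == "__main__":
    saved, fraction = savings(BUILD_TOTAL, BUY_TOTAL)
    print(f"savings: ${saved:,} ({fraction:.1%} of the build cost)")
    print(server_support, switches, distribution, node_configuration)
```

The computed fraction is slightly under 39%, consistent with the "~39%" reported by ESG.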
- 22-
Big Data: Best Practices 1/3
First of all, we need to consider when it is suitable to use Big Data
technologies:
 Analyzing a huge quantity of data, not only structured but also semi-structured and
unstructured, from a wide variety of sources;
 All of the gathered data must be analyzed rather than a sample, or sampling is
not as effective as analysis of a large amount of data;
 Iterative and exploratory analysis, when business measures on the data are not
determined a priori;
 Solving information and business challenges that are not properly addressed by a
traditional relational database approach.
- 23-
Big Data: Best Practices 2/3
The best practices described here cover management, organizational, and
technological aspects.
 Muting the HiPPOs: the most important decisions on how to retrieve and analyze
data depend on the highest-paid person's opinion (HiPPO). Today these people rely
too much on intuition and experience rather than on the pure rationality of data,
so this behavior needs to change;
 Start with initiatives that lead to customer-centric outcomes. For
customer-oriented organizations it is very important to begin with customer
analytics, which enable better services through a deep understanding of
customers' needs and future behaviors;
 Develop an enterprise schema that includes the vision, the strategies, and the
requirements for Big Data, and that helps align business users' needs with
the implementation roadmap of information technologies;
 To achieve near-term results, it is crucial to adopt a pragmatic approach,
starting from the most logical and cost-effective place to look for insight,
which is within the enterprise;
- 24-
Big Data: Best Practices 3/3
 The effectiveness of Big Data analytics strictly depends on analytical skills and analytics
tools, so enterprises should invest in acquiring both;
 The Big Data strategy and the business analytics should encompass an evaluation of the
decision-making processes of the organization as well as an evaluation of the groups
and types of decision makers;
 Try to uncover new metrics, key performance indicators, and new analytics techniques to
look at new and existing data in a different way in order to find new opportunities. This
could require setting up a separate Big Data team with the purpose of experimenting and
innovating;
 The final goal of a Big Data project is not to collect as much data as possible but to
support concrete business needs and provide new, reliable information to decision
makers;
 No single technology can meet all the Big Data requirements. Different workloads,
data types, and user types should each be served by the most suitable
technology. For example, Hadoop could be the best choice for large-scale Web log
analysis but is not suitable for real-time streaming at all. Multiple Big Data technologies
must coexist and address the use cases for which they are optimized.
- 25-
BIG DATA MARKET
- 26-
Big Data Market Definition
IDC (2012) defines the big data market as an aggregation of storage, server,
networking, software, and services market segments, each with several
subsegments.
Big Data Technology Stack
- 27-
Big Data Market Segments
 Infrastructure
– External storage systems
– Servers (including internal storage,
memory, network cards) and supporting
system software as well as spending for
self-built servers by large cloud service
providers
– Datacenter networking infrastructure
used in support of Big Data server and
storage infrastructure
 Services
– Business consulting, business process
outsourcing, plus IT project-based
services, IT outsourcing, and IT support,
and training services related to Big Data
implementations
 Software
– Data organization and management
software, including parallel and
distributed file systems and others
– Analytics and discovery software,
including search engines used for Big
Data applications, data mining, text
mining, rich media analysis, data
visualization, and others
- 28-
Big Data Market Analysis
MarketsandMarkets
– Big Data Market By Types (Hardware; Software;
Services; BDaaS - HaaS; Analytics; Visualization as
Service); By Software (Hadoop, Big Data Analytics
and Databases, System Software (IMDB, IMC)):
Worldwide Forecasts & Analysis (2013 – 2018)
- 29-
HADOOP & MAHOUT
- 30-
Hadoop & Mahout: Table of Contents
 Hadoop Overview
 HDFS
– Structure
– File Write
– File Read
 Map Reduce
– Structure
– Job Submission
– Job Execution
 Hadoop Ecosystem
– HBase
– Pig
– Hive
 Mahout
– Overview
– Algorithms
- 31-
Hadoop: Overview
[Diagram: Hadoop Overview. A Master Node and Slave Node 1 ... Slave Node K ... Slave Node N; each slave node has a Storage layer (HDFS) and a Computing layer (Map-Reduce).]
 The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models
– Open source
– Scalable
– Distributed
 Master Node controls everything!
- 32-
Hadoop: HDFS Structure
[Diagram: HDFS Structure. The Name Node holds the metadata; a file is split into chunks 1, 2, 3, and copies of the chunks are stored on Data Node 1 ... Data Node K ... Data Node N.]
 Name node controls almost everything about storage
 Large files are partitioned into chunks and stored across multiple nodes
 File chunks are replicated to mitigate the node failure problems
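The partitioning and replication described above can be sketched in a few lines. This is an illustration only, not HDFS code: the chunk size, the node names, and the round-robin placement are assumptions made for the example (HDFS itself uses a rack-aware placement policy).

```python
def split_into_chunks(file_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of `file_size` bytes."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks: int, data_nodes: list[str],
                   replication: int) -> dict[int, list[str]]:
    """Assign each chunk to `replication` distinct data nodes, round-robin."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    return {
        chunk: [data_nodes[(chunk + r) % len(data_nodes)] for r in range(replication)]
        for chunk in range(num_chunks)
    }


def survives_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every chunk still has at least one replica on a live node."""
    return all(any(node != failed_node for node in nodes)
               for nodes in placement.values())


if __name__ == "__main__":
    chunks = split_into_chunks(file_size=300, chunk_size=128)  # toy sizes
    placement = place_replicas(len(chunks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
    print(chunks)
    print(placement)
    print(survives_failure(placement, "dn3"))
```

The dictionary returned by `place_replicas` plays the role of the Name Node metadata: it records which data nodes hold each chunk, while the chunks themselves live on the data nodes.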
- 34-
Hadoop: HDFS Write
 Sequence of operations when writing a file
- 35-
Hadoop: HDFS Read
 Sequence of operations when reading a file
- 36-
Hadoop: Map-Reduce Structure
[Diagram: Map-Reduce Structure. A Job Tracker coordinates Task Tracker 1 ... Task Tracker K ... Task Tracker N; each task tracker runs a Mapper and a Reducer.]
 Job tracker controls almost everything about computing
 Key concepts of Map-Reduce
– Computation goes with data
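The map, shuffle, and reduce phases can be illustrated with a word count run in a single process. This is a sketch of the programming model only, not the Hadoop API: each inner list of lines stands for an input split held by one task tracker, and all function names are made up for the example.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle phase: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def run_job(splits: list[list[str]]) -> dict[str, int]:
    """Run the mapper over every split, then shuffle and reduce."""
    mapped = [pair for split in splits for line in split for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job([["big data big"], ["data warehouse"]]))
```

In a real cluster the mapper runs on the node that stores the split ("computation goes with data"), and only the much smaller intermediate pairs travel over the network during the shuffle.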
- 38-
Hadoop: Job submission
 Initialization takes some time
 Job execution is monitored by the Job Tracker through heartbeats
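Heartbeat-based monitoring can be sketched as follows. This is an illustrative model, not Hadoop's implementation: the class name, the timeout value, and the explicit `now` timestamps are assumptions chosen so that the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of each task tracker and reports silent ones."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                # seconds of silence tolerated
        self.last_seen: dict[str, float] = {}  # tracker name -> last heartbeat time

    def heartbeat(self, tracker: str, now: float) -> None:
        """Record that `tracker` reported in at time `now`."""
        self.last_seen[tracker] = now

    def lost_trackers(self, now: float) -> list[str]:
        """Trackers whose last heartbeat is older than the timeout."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=10.0)  # hypothetical timeout
    monitor.heartbeat("tracker-1", now=0.0)
    monitor.heartbeat("tracker-2", now=0.0)
    monitor.heartbeat("tracker-1", now=8.0)
    print(monitor.lost_trackers(now=12.0))
```

A job tracker that detects a lost tracker in this way can reschedule that tracker's tasks on another node.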
- 39-
Hadoop: Map-Reduce Execution
 Bandwidth required in the copy process
- 40-
Hadoop Ecosystem: HBase
 HDFS
– Structured/semi-structured/unstructured data
– Write only once, read many
 HBase is an open-source, distributed, versioned, column-oriented store
modeled after Google's Bigtable
 Column-based database. It supports
– Insert
– Delete
– Update
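The "versioned, column-oriented" model can be pictured as a nested map from row key to column to timestamped values. The sketch below is an in-memory illustration of that idea, not the HBase client API; the class and method names are hypothetical, and the delete is simplified (HBase itself writes a delete marker rather than removing the cell immediately).

```python
from typing import Optional


class VersionedColumnStore:
    """Toy store: row key -> "family:qualifier" -> list of (timestamp, value)."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, list[tuple[int, str]]]] = {}

    def put(self, row: str, column: str, value: str, timestamp: int) -> None:
        """Insert or update: every write adds a new timestamped version."""
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])  # oldest first

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[str]:
        """Newest value, or the newest value written at or before `timestamp`."""
        versions = self._rows.get(row, {}).get(column, [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        return versions[-1][1] if versions else None

    def delete(self, row: str, column: str) -> None:
        """Remove every version of one cell."""
        self._rows.get(row, {}).pop(column, None)


if __name__ == "__main__":
    store = VersionedColumnStore()
    store.put("user1", "info:city", "Pavia", timestamp=1)
    store.put("user1", "info:city", "Milan", timestamp=2)
    print(store.get("user1", "info:city"))               # newest version
    print(store.get("user1", "info:city", timestamp=1))  # version as of t=1
```

Insert and update are the same operation here, which mirrors the slide's point that the store keeps versions instead of overwriting data in place.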
- 42-
Hadoop Ecosystem: Hbase Storage model 1/3
 Hbase is a column-oriented database
- 43-
Hadoop Ecosystem: Hbase Storage model 2/3
 Hbase storage system
- 44-
Hadoop Ecosystem: Hbase Storage model 3/3
 Hbase storage system
- 45-
Hadoop Ecosystem: Pig
 Hadoop
– Analysis requires a lot of Java code
– No scripting
 Pig is a platform for analyzing large
data sets that consists of a high-level language for expressing data
analysis programs
 Pig generates and compiles
Map/Reduce programs on the fly.
- 46-
Hadoop Ecosystem: Pig Sample Scripts
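-- Load the raw records with a custom loader and schema
RawInput = LOAD '$INPUT' USING
    com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
-- Keep only the needed fields (alias renamed from "input", a reserved word in Pig Latin)
Projected = FOREACH RawInput GENERATE ContextCategoryId AS Category,
    DefLevelId, TagId, URL, Impressions;
-- Keep rows whose DefLevelId is 8 or 12
defFilter = FILTER Projected BY (DefLevelId == 8) OR (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
-- After GROUP, the bag of rows is named after the grouped alias (defFilter)
result = FOREACH GroupedInput GENERATE group,
    SUM(defFilter.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING
    com.contextweb.pig.CWHeaderStore();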
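For readers unfamiliar with Pig Latin, the filter, group, and sum steps of the script above correspond to the following single-machine Python. This is only an illustration of what the script computes, assuming each record is a dictionary with the field names used in the script; the sample records are made up.

```python
from collections import defaultdict


def sum_impressions(records: list[dict]) -> dict[tuple, int]:
    """Total impressions per (Category, TagId, URL) for DefLevelId 8 or 12."""
    totals: dict[tuple, int] = defaultdict(int)
    for rec in records:
        if rec["DefLevelId"] in (8, 12):                       # FILTER ... BY
            key = (rec["Category"], rec["TagId"], rec["URL"])  # GROUP ... BY
            totals[key] += rec["Impressions"]                  # SUM(...)
    return dict(totals)


if __name__ == "__main__":
    sample = [  # hypothetical records
        {"Category": 1, "DefLevelId": 8, "TagId": 7, "URL": "a.com", "Impressions": 10},
        {"Category": 1, "DefLevelId": 12, "TagId": 7, "URL": "a.com", "Impressions": 5},
        {"Category": 1, "DefLevelId": 3, "TagId": 7, "URL": "a.com", "Impressions": 99},
    ]
    print(sum_impressions(sample))
```

Pig compiles the same logic into Map/Reduce jobs, so it can run over data far larger than a single machine could hold in memory.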
- 47-
Hadoop Ecosystem: Hive
 Hive is a data warehouse infrastructure built on top of Hadoop
 Supports analysis of large datasets stored in Hadoop-compatible file systems such as
HDFS and the Amazon S3 file system
 Provides a SQL-like query language called HiveQL
 Provides indexes to accelerate queries
- 48-
Hadoop Ecosystem: HiveQL
 DML
– Select
 DDL
– SHOW TABLES
– CREATE TABLE
– ALTER TABLE
– DROP TABLE
- 49-
Mahout: Overview
 A scalable machine learning library built on Hadoop, written in Java
 Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”
- 51-
Mahout: Algorithms
 Classification
– Logistic Regression
– Bayesian
– SVM
– NN
– Hidden Markov Models
 Clustering
– Kmeans
– Mean Shift Clustering
– Spectral Clustering
– Top Down Clustering
 Pattern Mining
– Parallel FP Growth Algorithm
 Regression
– Locally Weighted Linear Regression
 Dimension reduction
– SVD
– PCA
– GDA
 Collaborative filtering
– Non-distributed recommenders
– Distributed Item-Based Collaborative Filtering
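As an example of one algorithm in this list, k-means can be written in a few lines. This is a single-machine sketch of the standard iterative procedure (Lloyd's algorithm), not Mahout's distributed implementation or its API; the function names, the seed, and the sample points are illustrative.

```python
import math
import random

Point = tuple[float, ...]


def closest(point: Point, centroids: list[Point]) -> int:
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points: list[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points: list[Point], k: int, iterations: int = 20,
           seed: int = 0) -> tuple[list[Point], list[int]]:
    """Return (centroids, labels) after at most `iterations` refinement steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            updated.append(mean(members) if members else centroids[i])
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = kmeans(data, k=2)
    print(centroids, labels)
```

The assignment step is independent for every point and the update step is a sum per cluster, which is why the algorithm maps naturally onto a Map/Reduce job.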
- 52-
EXERCISE
- 53-
Mobility Analyzer: A Show Case
[Diagram with three columns: Site, Data Flow, Modules]
Data flow: HANA DB -> CSV Files -> Sequence Files -> Cluster Info. -> Cluster Info. -> HANA DB
Modules: ExportTweetsInfo, CSVConverter, Run.sh, Clusterdump, ImportClusterInfo
Sites: Local, Hadoop / Mahout, Local
- 54-