Standard PowerPoint presentation
Transcription
Data Warehouse design
Design of Enterprise Systems, University of Pavia
10/12/2013
(2h for the first part; 2h for Hadoop)
- 1-

Table of Contents
Big Data Overview
Big Data, DW & BI
Big Data Market
Hadoop & Mahout
- 2-

Data Warehouse design
BIG DATA OVERVIEW
- 3-

Big Data Overview: Table of Contents
Data Growth
Definition
Big Data vs. Relational Data
Its Value
Big Data Benefits
Big Data Usage
Challenges
- 4-

Big Data Overview: Data Growth
[Chart: data storage growth in exabytes per year]
Storage capacity increases by 23% per year on average.
Exponential growth during the decade starting from 2010 marks the end of the ability to store all the available information.
- 5-

Big Data Overview: Definition
Gartner definition (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
- 6-

Big Data Overview: Big Data vs. Relational Data
Application | Relation-Based Data | Big Data
Data processing | Single-computer platform that scales with better CPUs; centralized processing | Cluster platforms that scale to thousands of nodes; distributed processing
Data management | Relational databases (SQL); centralized storage | Non-relational databases (NoSQL) that manage varied data types and formats; distributed storage
Analytics | Batched, descriptive, centralized | Real-time, predictive and prescriptive, distributed analytics
- 7-

Big Data Overview: Its Value 1/3
Several classes of company head the revenue chart ($11.59 billion):
broad-portfolio tech giants (IBM, HP, Oracle, EMC)
leading software houses (Teradata, SAP, Microsoft)
professional services companies (PwC, Accenture)
Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 8-

Big Data Overview: Its Value 2/3
Pure play: vendors who derive 100 percent of their revenue from this market.
Source: Wikibon, Big Data Vendor Revenue and Market Forecast 2012-2017
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 9-

Big Data Overview: Its Value 3/3
IDC: Big Data will become a $17 billion business by 2015 ($23.8 billion by 2016).
Big Data storage will account for 6.8% of the entire worldwide storage market by 2015.
Source: Worldwide Big Data Technologies and Services: 2012-2015 Forecast (IDC, 2012)
Source: http://www.zdnet.com/big-data-an-overview_p2-7000020785/
- 10-

Big Data Overview: Big Data Benefits
Business benefits received by implementing an effective Big Data methodology.
The survey is based on 1153 responses from 325 respondents.
- 11-

Big Data Overview: Big Data Usage 1/2
E-Commerce and Market Intelligence
– Recommender systems
– Social media monitoring and analysis
– Crowd-sourcing systems
– Social and virtual games
Smart Health and Wellbeing
– Human and plant genomics
– Healthcare decision support
– Patient community analysis
E-Government and Politics 2.0
– Ubiquitous government services
– Equal access and public services
– Citizen engagement
Security and Public Safety
– Crime analysis
– Computational criminology
– Terrorism informatics
– Open-source intelligence
– Cyber security
Science & Technology
– S&T innovation
– Hypothesis testing
– Knowledge discovery
- 12-

Big Data Overview: Big Data Usage 2/2
Survey of European companies from Steria's Business Intelligence Maturity Audit (biMA).
- 13-

Big Data Overview: Challenges 1/2
Main challenges companies face with Big Data.
The survey is based on 1153 responses from 325 respondents.
- 14-

Big Data Overview: Challenges 2/2
A survey of European companies from Steria's Business Intelligence Maturity Audit (biMA):
Technical
– 38% have data quality problems
– A lack of data governance; no master data management system (38%)
Organizational
– 72% have no BI strategy; 70% have no BI governance
– Only 7% rate Big Data as very relevant
Source: http://www.steria.com/uk/media-centre/press-releases/press-releases/article/survey-suggests-only-7of-european-companies-rate-big-data-as-very-relevant-to-their-business/
- 15-

Data Warehouse design
BIG DATA, DW & BI
- 16-

Big Data, DW & BI: Table of Contents
Evolution
Techniques
Cost
Best Practices
- 17-

BI Evolution
BI&A 1.0
– Key characteristics: DBMS-based, structured content; RDBMS & data warehousing; ETL & OLAP; dashboards & scorecards; data mining & statistical analysis.
– Gartner BI Platforms core capabilities: ad hoc query & search-based BI; reporting, dashboards & scorecards; OLAP; interactive visualization; predictive modeling & data mining.
– Gartner Hype Cycle: column-based DBMS; in-memory DBMS; real-time decision; data mining workbenches.
BI&A 2.0
– Key characteristics: web-based, unstructured content; information retrieval and extraction; opinion mining; question answering; web analytics and web intelligence; social media analytics; social network analysis; spatial-temporal analysis.
– Gartner Hype Cycle: information semantic services; natural language question answering; content & text analytics.
BI&A 3.0
– Key characteristics: mobile and sensor-based content; location-aware analysis; person-centered analysis; context-relevant analysis; mobile visualization & HCI.
– Gartner Hype Cycle: mobile BI.
BI and Analytics: evolution and characteristics.
- 18-

Big Data Overview: Techniques 1/2
In 2011 the McKinsey Global Institute provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combinations. List of the top 10 techniques which require Big Data (1/2):
A/B Testing: A technique in which a control group is compared with a variety of test groups in order to determine which treatments will improve a given objective. An example application is determining which copy text, layouts, images, or colors will improve conversion rates on an e-commerce web site. Big Data enables huge numbers of tests to be executed and analyzed. (A minimal sketch follows this slide.)
Cluster Analysis: A statistical method aimed at classifying a huge data set, in particular at identifying groups with a common behavior.
Classification: A set of techniques to identify the categories to which new data points belong, based on a training set containing data points that have already been categorized.
Data Mining: A set of techniques and technologies whose purpose is to extract patterns from large datasets by combining statistical and algorithmic methods. These techniques include association rule learning, cluster analysis, classification, and regression.
- 19-
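To make the A/B testing entry above concrete, here is a minimal sketch in Python of the statistical check behind such a test: a two-proportion z-test comparing the conversion rate of a control variant with that of a test variant. The function name, visitor counts, and conversion counts are illustrative assumptions, not figures from the slides.

from math import sqrt
from statistics import NormalDist

def ab_test(conv_a, visitors_a, conv_b, visitors_b):
    # Two-proportion z-test: is variant B's conversion rate different from A's?
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return p_a, p_b, z, p_value

# Hypothetical campaign: 10,000 visitors per variant.
print(ab_test(conv_a=230, visitors_a=10_000, conv_b=285, visitors_b=10_000))

At Big Data scale the statistics stay the same; what changes is the plumbing needed to run and analyze huge numbers of such tests in parallel.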
Big Data Overview: Techniques 2/2
In 2011 the McKinsey Global Institute provided a list of the top 10 common techniques applicable across a range of industries, particularly in response to the need to analyze new amounts of data and their combinations. List of the top 10 techniques which require Big Data (2/2):
Network analysis: A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed. (A minimal sketch follows this slide.)
Predictive modeling: A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome.
Sentiment analysis: Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material.
Statistics: The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to understand the relationships between variables.
Visualization: Techniques used to create images, diagrams, or animations, usually integrated into more complex dashboards.
- 20-
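As an illustration of the network analysis entry above, the following Python sketch computes degree centrality, the simplest relationship measure used in social network analysis, over a made-up edge list; the names and edges are purely illustrative, not data from the slides.

from collections import Counter

# Hypothetical "who is connected to whom" edge list of a tiny social network.
edges = [("ann", "bob"), ("ann", "carla"), ("bob", "carla"),
         ("carla", "dave"), ("dave", "ed")]

# Degree centrality: count how many connections each node has.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

for node, links in degree.most_common():
    print(node, links)

Real social-network analyses run the same kind of per-node aggregation over millions of edges, which is exactly the workload addressed by the Hadoop material later in the deck.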
Big Data: Cost 1/2
ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a "build" and a "buy" solution.
Item | Cost | Notes
Servers | $400,000 | @$22k each; enterprise class with dual power supplies, 36 TB of serial attached SCSI (SAS) storage, 48-64 gigabytes of memory, 1 rack
Server support | $60,000 | @15% of server cost
Switches | $15,000 | 3 @ $5k for InfiniBand; older network switches will run at least 3x the cost of InfiniBand
Distribution/systems management software | $90,000 | Cloudera: 18 nodes @ $5k each
Integration | $100,000 | 320 hours @ $100/hour human cost
Information management tools | $20,000 | Licenses and dedicated hardware
Node configuration and implementation | $16,000 | 8 hours/node, 20 nodes = 160 hours @ $100/hour
Build project costs | $733,000 | Those project items for which a "buy" option exists
Build Versus Buy Elements (Using Build Pricing)
- 21-

Big Data: Cost 2/2
ESG (Enterprise Strategy Group) provides an analysis of the costs of Big Data, in particular a comparison between a "build" and a "buy" solution.
Item | Cost | Notes
Build total | $733,000 | Not lifecycle costs, just the initial project
Buy (Oracle Big Data Appliance) | $450,000 | Cost of the Oracle Big Data Appliance for the same infrastructure and tasks (list)
Buy (Oracle Big Data Appliance) savings | $283,000 |
ESG estimated savings | ~39% | The Oracle Big Data Appliance lowers costs versus do-it-yourself
Build Versus Buy Elements (Using Buy Pricing)
- 22-

Big Data: Best Practices 1/3
First of all, however, we need to consider when it is suitable to use Big Data technologies:
– Analyzing a huge quantity of data, not only structured but also semi-structured and unstructured, coming from a wide variety of sources;
– All of the gathered data must be analyzed rather than a sample, or sampling of the data is not as effective as analysis over the full amount of data;
– Iterative and exploratory analysis, when business measures on the data are not determined a priori;
– Solving information and business challenges that are not properly addressed by a traditional relational database approach.
- 23-

Big Data: Best Practices 2/3
The best practices described below cover management aspects as well as organizational and technological ones.
– Muting the HiPPOs: the highest-paid persons' opinions are those on which the most important decisions on how to retrieve and analyze data depend. Today these people rely too much on intuition and experience rather than on the pure rationality of data, so this behavior needs to change;
– Start with initiatives that lead to customer-centric outcomes. For customer-oriented organizations it is very important to begin with customer analytics that enable better services as a result of a deep understanding of customers' needs and future behaviors;
– Develop an enterprise schema that includes the vision, the strategies, and the requirements for Big Data and that helps align business users' needs with the implementation roadmap of information technologies;
– In order to achieve near-term results, it is crucial to adopt a pragmatic approach, starting from the most logical and cost-effective place to look for insight, that is, within the enterprise;
- 24-

Big Data: Best Practices 3/3
– The effectiveness of Big Data analytics strictly depends on analytical skills and analytics tools, so enterprises should invest in acquiring both tools and skills;
– The Big Data strategy and the business analytics should encompass an evaluation of the decision-making processes of the organization as well as of the groups and types of decision makers;
– Try to uncover new metrics, key performance indicators, and new analytics techniques to look at new and existing data in a different way in order to find new opportunities. This could require setting up a separate Big Data team with the purpose of experimenting and innovating;
– The final goal of a Big Data project is not to collect as much data as possible but to support concrete business needs and provide new, reliable information to decision makers;
– No single technology can meet all Big Data requirements. Different workloads, data types, and user types should be served by the most suitable technology. For example, Hadoop could be the best choice for large-scale web log analysis but is not suitable for real-time streaming at all. Multiple Big Data technologies must coexist and address the use cases for which they are optimized.
- 25-
Data Warehouse design
BIG DATA MARKET
- 26-

Big Data Market Definition
IDC (2012) defines the Big Data market as an aggregation of the storage, server, networking, software, and services market segments, each with several subsegments.
[Figure: Big Data technology stack]
- 27-

Big Data Market Segments
Infrastructure
– External storage systems
– Servers (including internal storage, memory, network cards) and supporting system software, as well as spending on self-built servers by large cloud service providers
– Datacenter networking infrastructure used in support of Big Data server and storage infrastructure
Services
– Business consulting, business process outsourcing, plus IT project-based services, IT outsourcing, IT support, and training services related to Big Data implementations
Software
– Data organization and management software, including parallel and distributed file systems and others
– Analytics and discovery software, including search engines used for Big Data applications, data mining, text mining, rich media analysis, data visualization, and others
- 28-

Big Data Market Analysis
MarketsandMarkets – Big Data Market by Types (Hardware; Software; Services; BDaaS - HaaS; Analytics; Visualization as Service); by Software (Hadoop, Big Data Analytics and Databases, System Software (IMDB, IMC)): Worldwide Forecasts & Analysis (2013 - 2018)
- 29-

Data Warehouse design
HADOOP & MAHOUT
- 30-

Hadoop & Mahout: Table of Contents
Hadoop Overview
HDFS (Structure, File Write, File Read)
Map Reduce (Structure, Job Submission, Job Execution)
Hadoop Ecosystem (HBase, Pig, Hive)
Mahout (Overview, Algorithms)
- 31-

Hadoop: Overview
[Diagram: a Master Node coordinating Slave Nodes 1..N, each slave providing storage (HDFS) and computing]
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
– Open source
– Map-Reduce
– Scalable
– Distributed
The Master Node controls everything!
- 32-

Hadoop & Mahout: Table of Contents
Hadoop Overview
HDFS (Structure, File Write, File Read)
Map Reduce (Structure, Job Submission, Job Execution)
Hadoop Ecosystem (HBase, Pig, Hive)
Mahout (Overview, Algorithms)
- 33-

Hadoop: HDFS Structure
[Diagram: a Name Node holding file metadata; Data Nodes 1..N holding replicated, numbered file chunks]
The Name Node controls almost everything about storage.
Large files are partitioned into chunks and stored across multiple nodes.
File chunks are replicated to mitigate node-failure problems.
- 34-

Hadoop: HDFS Write
Sequence of operations when writing a file.
- 35-

Hadoop: HDFS Read
Sequence of operations when reading a file.
- 36-

Hadoop & Mahout: Table of Contents
Hadoop Overview
HDFS (Structure, File Write, File Read)
Map Reduce (Structure, Job Submission, Job Execution)
Hadoop Ecosystem (HBase, Pig, Hive)
Mahout (Overview, Algorithms)
- 37-

Hadoop: Map-Reduce Structure
[Diagram: a Job Tracker coordinating Task Trackers 1..N, each running Mapper and Reducer tasks]
The Job Tracker controls almost everything about computing.
Key concept of Map-Reduce:
– Computation goes to the data. (A minimal word-count sketch follows this slide.)
- 38-
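To show what the mapper and reducer roles actually do, here is a minimal word-count sketch in Python written in the style of Hadoop Streaming, where any executable can serve as mapper or reducer; the two phases are chained locally here so the example runs on its own, and the sample sentences are made up for illustration.

import itertools

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word, as a Mapper task would.
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key (the shuffle/sort step), so each
    # word's counts are consecutive and can be summed group by group.
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs distributed processing",
              "hadoop distributes both data and processing"]
    intermediate = sorted(mapper(sample))  # stands in for the shuffle & sort
    for word, total in reducer(intermediate):
        print(word, total)

On a real cluster the mappers run on the nodes that already hold the file chunks (computation goes to the data), and the framework performs the shuffle and sort between the two phases.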
Hadoop: Job Submission
The initialization takes some time.
Job execution is monitored by the Job Tracker through heartbeats.
- 39-

Hadoop: Map-Reduce Execution
Bandwidth is required in the copy process.
- 40-

Hadoop & Mahout: Table of Contents
Hadoop Overview
HDFS (Structure, File Write, File Read)
Map Reduce (Structure, Job Submission, Job Execution)
Hadoop Ecosystem (HBase, Pig, Hive)
Mahout (Overview, Algorithms)
- 41-

Hadoop Ecosystem: HBase
HDFS
– Structured/semi-structured/unstructured data
– Write only once, read many
HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. A column-based database, it supports
– Insert
– Delete
– Update
- 42-

Hadoop Ecosystem: HBase Storage Model 1/3
HBase is a column-oriented database.
- 43-

Hadoop Ecosystem: HBase Storage Model 2/3
HBase storage system.
- 44-

Hadoop Ecosystem: HBase Storage Model 3/3
HBase storage system.
- 45-

Hadoop Ecosystem: Pig
Hadoop alone:
– A lot of Java code for any analysis
– No scripting
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Pig generates and compiles Map/Reduce programs on the fly.
- 46-

Hadoop Ecosystem: Pig Sample Script
-- Load raw records, keep five fields, filter by DefLevelId,
-- then sum impressions per (Category, TagId, URL) group.
RawInput = LOAD '$INPUT' USING com.contextweb.pig.CWHeaderLoader('$RESOURCES/schema/wide.xml');
input = FOREACH RawInput GENERATE ContextCategoryId AS Category, DefLevelId, TagId, URL, Impressions;
defFilter = FILTER input BY (DefLevelId == 8) OR (DefLevelId == 12);
GroupedInput = GROUP defFilter BY (Category, TagId, URL);
result = FOREACH GroupedInput GENERATE group, SUM(defFilter.Impressions) AS Impressions;
STORE result INTO '$OUTPUT' USING com.contextweb.pig.CWHeaderStore();
- 47-

Hadoop Ecosystem: Hive
Hive is a data warehouse infrastructure built on top of Hadoop.
Supports analysis of large datasets stored in Hadoop-compatible file systems such as HDFS and the Amazon S3 file system.
Provides an SQL-like query language called HiveQL.
Provides indexes to accelerate queries.
- 48-

Hadoop Ecosystem: HiveQL
DML
– SELECT
DDL
– SHOW TABLES
– CREATE TABLE
– ALTER TABLE
– DROP TABLE
- 49-

Mahout
Hadoop Overview
HDFS (Structure, File Write, File Read)
Map Reduce (Structure, Job Submission, Job Execution)
Hadoop Ecosystem (HBase, Pig, Hive)
Mahout (Overview, Algorithms)
- 50-

Mahout: Overview
A scalable machine learning library built on Hadoop, written in Java.
Driven by Chu et al.'s paper "Map-Reduce for Machine Learning on Multicore" (with Andrew Ng among the authors).
- 51-

Mahout: Algorithms
Classification
– Logistic Regression
– Bayesian
– SVM
– NN
– Hidden Markov Models
Clustering
– K-means
– Mean Shift Clustering
– Spectral Clustering
– Top-Down Clustering
Pattern Mining
– Parallel FP-Growth Algorithm
Regression
– Locally Weighted Linear Regression
Dimension reduction
– SVD
– PCA
– GDA
Collaborative filtering
– Non-distributed recommenders
– Distributed Item-Based Collaborative Filtering
- 52-

Data Warehouse design
EXERCISE
- 53-

Mobility Analyzer: A Showcase
Sites, data flow, and modules:
HANA DB -> ExportTweetsInfo (local) -> CSV files -> CSVConverter -> sequence files -> Hadoop/Mahout (Run.sh) -> cluster info -> Clusterdump -> cluster info -> ImportClusterInfo (local) -> HANA DB
(A single-machine sketch of the k-means clustering step follows.)
- 54-
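The clustering step in the exercise is what Mahout's k-means implementation performs at scale over the sequence files. As a point of reference, here is a plain single-machine k-means sketch in Python; the sample points, the function name, and the parameter values are made-up illustrations, not part of the exercise. Mahout runs the same assign-and-recompute loop as a series of Map-Reduce jobs, with mappers assigning points to the nearest centroid and reducers recomputing the centroids.

import random

def kmeans(points, k, iterations=20, seed=0):
    # Plain single-machine k-means over a list of numeric tuples.
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids

# Hypothetical 2-D feature vectors (e.g., coordinates derived from tweet data).
pts = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
print(kmeans(pts, k=2))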