Introducing Apache Hadoop - Stanford Computer Systems
Transcription
Introducing Apache Hadoop - Stanford Computer Systems
11/16/2011, Stanford EE380 Computer Systems Colloquium Introducing Apache Hadoop: The Modern Data Operating System Dr. Amr Awadallah | Founder, CTO, VP of Engineering [email protected], twitter: @awadallah Limitations of Existing Data Analytics Architecture BI Reports + Interactive Apps Can’t Explore Original High Fidelity Raw Data RDBMS (aggregated data) ETL Compute Grid Moving Data To Compute Doesn’t Scale Storage Only Grid (original raw data) Mostly Append Collection Instrumentation 2 ©2011 Cloudera, Inc. All Rights Reserved. Archiving = Premature Data Death So What is Apache Hadoop ? • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license). • Core Hadoop has two main systems: – Hadoop Distributed File System: self-healing high-bandwidth clustered storage. – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction. 3 ©2011 Cloudera, Inc. All Rights Reserved. The Key Benefit: Agility/Flexibility Schema-on-Write (RDBMS): Schema-on-Read (Hadoop): • Schema must be created before any data can be loaded. • Data is simply copied to the file store, no transformation is needed. • An explicit load operation has to take place which transforms data to DB internal structure. • A SerDe (Serializer/Deserlizer) is applied during read time to extract the required columns (late binding) • New columns must be added explicitly before new data for such columns can be loaded into the database. • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it. • Read is Fast • Standards/Governance Pros • Load is Fast • Flexibility/Agility 4 ©2011 Cloudera, Inc. All Rights Reserved. Innovation: Explore Original Raw Data Data Committee 5 ©2011 Cloudera, Inc. All Rights Reserved. Data Scientist Flexibility: Complex Data Processing 1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop). 2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled After Google’s FlumeJava) 4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads. 5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above. 6 ©2011 Cloudera, Inc. All Rights Reserved. Scalability: Scalable Software Development Grows without requiring developers to re-architect their algorithms/application. AUTO SCALE 7 ©2011 Cloudera, Inc. All Rights Reserved. Scalability: Data Beats Algorithm Smarter Algos More Data A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009 8 ©2011 Cloudera, Inc. All Rights Reserved. Scalability: Keep All Data Alive Forever Archive to Tape and Never See It Again Extract Value From All Your Data 9 ©2011 Cloudera, Inc. All Rights Reserved. Use The Right Tool For The Right Job Relational Databases: Use when: Hadoop: Use when: • Interactive OLAP Analytics (<1sec) • Structured or Not (Flexibility) • Multistep ACID Transactions • Scalability of Storage/Compute • 100% SQL Compliance • Complex Data Processing 10 ©2011 Cloudera, Inc. All Rights Reserved. HDFS: Hadoop Distributed File System A given file is broken down into blocks (default=64MB), then blocks are replicated across cluster (default=3). Optimized for: • Throughput • Put/Get/Delete • Appends Block Replication for: • Durability • Availability • Throughput Block Replicas are distributed across servers and racks. 11 ©2011 Cloudera, Inc. All Rights Reserved. MapReduce: Computational Framework cat *.txt | mapper.pl | sort | reducer.pl > out.txt Split 1 (docid, text) (words, counts) Map 1 (sorted words, counts) Be, 5 Reduce 1 “To Be Or Not To Be?” (sorted words, sum of counts) Output File 1 Be, 30 Be, 12 Split i (docid, text) Map i Be, 7 Be, 6 Split N (docid, text) Map M Reduce i (sorted words, sum of counts) Reduce R (sorted words, sum of counts) Shuffle (words, counts) Map(in_key, in_value) => list of (out_key, intermediate_value) (sorted words, counts) Output File i Output File R Reduce(out_key, list of intermediate_values) => out_value(s) 12 ©2011 Cloudera, Inc. All Rights Reserved. MapReduce: Resource Manager / Scheduler A given job is broken down into tasks, then tasks are scheduled to be as close to data as possible. Three levels of data locality: • Same server as data (local disk) • Same rack as data (rack/leaf switch) • Wherever there is a free slot (cross rack) Optimized for: • Batch Processing • Failure Recovery System detects laggard tasks and speculatively executes parallel tasks on the same slice of data. 13 ©2011 Cloudera, Inc. All Rights Reserved. But Networks Are Faster Than Disks! Yes, however, core and disk density per server are going up very quickly: • • • • • • 1 Hard Disk = 100MB/sec (~1Gbps) Server = 12 Hard Disks = 1.2GB/sec (~12Gbps) Rack = 20 Servers = 24GB/sec (~240Gbps) Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps) Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps) Scanning 4.8TB at 100MB/sec takes 13 hours. 14 ©2011 Cloudera, Inc. All Rights Reserved. Hadoop High-Level Architecture Hadoop Client Contacts Name Node for data or Job Tracker to submit jobs Name Node Job Tracker Maintains mapping of file names to blocks to data node slaves. Tracks resources and schedules jobs across task tracker slaves. Data Node Task Tracker Stores and serves blocks of data Runs tasks (work units) within a job Share Physical Node 15 ©2011 Cloudera, Inc. All Rights Reserved. Changes for Better Availability/Scalability Hadoop Client Federation partitions out the name space, High Availability via an Active Standby. Contacts Name Node for data or Job Tracker to submit jobs Name Node Each job has its own Application Manager, Resource Manager is decoupled from MR. Job Tracker Data Node Task Tracker Stores and serves blocks of data Runs tasks (work units) within a job Share Physical Node 16 ©2011 Cloudera, Inc. All Rights Reserved. Build/Test: APACHE BIGTOP CDH: Cloudera’s Distribution Including Apache Hadoop Data Mining UI Framework/SDK File System Mount FUSE-DFS Workflow HUE APACHE MAHOUT Scheduling APACHE OOZIE Metadata APACHE HIVE APACHE OOZIE Languages / Compilers Data Integration APACHE PIG, APACHE HIVE APACHE FLUME, APACHE SQOOP Fast Read/Write Access APACHE HBASE Coordination APACHE ZOOKEEPER SCM Express (Installation Wizard for CDH) 17 ©2011 Cloudera, Inc. All Rights Reserved. Books 18 ©2011 Cloudera, Inc. All Rights Reserved. Conclusion • The Key Benefits of Apache Hadoop: – Agility/Flexibility (Quickest Time to Insight). – Complex Data Processing (Any Language, Any Problem). – Scalability of Storage/Compute (Freedom to Grow). – Economical Storage (Keep All Your Data Alive Forever). • The Key Systems for Apache Hadoop are: – Hadoop Distributed File System: self-healing highbandwidth clustered storage. – MapReduce: distributed fault-tolerant resource management coupled with scalable data processing. 19 ©2011 Cloudera, Inc. All Rights Reserved. Appendix BACKUP SLIDES 20 ©2011 Cloudera, Inc. All Rights Reserved. Unstructured Data is Exploding Complex, Unstructured Relational • 2,500 exabytes of new information in 2012 with Internet as primary driver • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year 21 Source: IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009. ©2011 Cloudera, Inc. All Rights Reserved. Hadoop Creation History • Fastest sort of a TB, 62secs over 1,460 nodes • Sorted a PB in 16.25hours over 3,658 nodes 22 ©2011 Cloudera, Inc. All Rights Reserved. Hadoop in the Enterprise Data Stack Data Scientists IDEs Analysts ETL Tools Enterprise Reporting BI, Analytics Development Tools System Operators Business Users Business Intelligence Tools ODBC, JDBC, NFS, Native Cloudera Mgmt Suite Enterprise Data Warehouse Sqoop Data Architects Flume Flume Flume Sqoop Logs Files Web Data Relational Databases 23 ©2011 Cloudera, Inc. All Rights Reserved. Low-Latency Serving Systems Customers Web Application MapReduce Next Gen Main idea is to split up the JobTracker functions: • Cluster resource management (for tracking and allocating nodes) • Application life-cycle management (for MapReduce scheduling and execution) Enables: • High Availability • Better Scalability • Efficient Slot Allocation • Rolling Upgrades • Non-MapReduce Apps 24 ©2011 Cloudera, Inc. All Rights Reserved. Application Industry Application Social Network Analysis Web Clickstream Sessionization Content Optimization Media Clickstream Sessionization Network Analytics Telco Mediation Loyalty & Promotions Retail Data Factory Fraud Analysis Financial Trade Reconciliation Entity Analysis Federal SIGINT Sequencing Analysis Bioinformatics Genome Mapping Product Quality Manufacturing Mfg Process Tracking 25 ©2011 Cloudera, Inc. All Rights Reserved. Use Case DATA PROCESSING Use Case ADVANCED ANALYTICS Two Core Use Cases Common Across Many Industries What is Cloudera Enterprise? Cloudera Enterprise makes open source Apache Hadoop enterprise-easy Simplify and Accelerate Hadoop Deployment Reduce Adoption Costs and Risks Lower the Cost of Administration Increase the Transparency & Control of Hadoop Leverage the Experience of Our Experts CLOUDERA ENTERPRISE COMPONENTS Cloudera Management Suite ProductionLevel Support Comprehensive Toolset for Hadoop Administration Our Team of Experts On-Call to Help You Meet Your SLAs 3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera EFFECTIVENESS EFFICIENCY Ensuring Repeatable Value from Apache Hadoop Deployments Enabling Apache Hadoop to be Affordably Run in Production 26 ©2011 Cloudera, Inc. All Rights Reserved. Hive vs Pig Latin (count distinct values > 0) • Hive Syntax: SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0; • Pig Latin Syntax: mytable = LOAD ‘myfile’ AS (col1, col2, col3); mytable = FOREACH mytable GENERATE col1; mytable = FILTER mytable BY col1 > 0; mytable = DISTINCT col1; mytable = GROUP mytable BY col1; mytable = FOREACH mytable GENERATE COUNT(mytable); DUMP mytable; 27 ©2011 Cloudera, Inc. All Rights Reserved. Apache Hive Key Features • • • • • • • • • • • A subset of SQL covering the most common statements JDBC/ODBC support Agile data types: Array, Map, Struct, and JSON objects Pluggable SerDe system to work on unstructured files directly User Defined Functions and Aggregates Regular Expression support MapReduce support Partitions and Buckets (for performance optimization) Microstrategy/Tableau Compatibility (through ODBC) In The Works: Indices, Columnar Storage, Views, Explode/Collect More details: http://wiki.apache.org/hadoop/Hive 28 ©2011 Cloudera, Inc. All Rights Reserved. Hive Agile Data Types • STRUCTS: – SELECT mytable.mycolumn.myfield FROM … • MAPS (Hashes): – SELECT mytable.mycolumn[mykey] FROM … • ARRAYS: – SELECT mytable.mycolumn[5] FROM … • JSON: – SELECT get_json_object(mycolumn, objpath) FROM … 29 ©2011 Cloudera, Inc. All Rights Reserved.