Introducing Apache Hadoop - Stanford Computer Systems

Transcription

Introducing Apache Hadoop - Stanford Computer Systems
11/16/2011, Stanford EE380 Computer Systems Colloquium
Introducing Apache Hadoop:
The Modern Data Operating System
Dr. Amr Awadallah | Founder, CTO, VP of Engineering
[email protected], twitter: @awadallah
Limitations of Existing Data Analytics Architecture
BI Reports + Interactive Apps
Can’t Explore Original
High Fidelity Raw Data
RDBMS (aggregated data)
ETL Compute Grid
Moving Data To
Compute Doesn’t Scale
Storage Only Grid (original raw data)
Mostly Append
Collection
Instrumentation
2
©2011 Cloudera, Inc. All Rights Reserved.
Archiving =
Premature
Data Death
So What is Apache
Hadoop ?
• A scalable fault-tolerant distributed system for
data storage and processing (open source
under the Apache license).
• Core Hadoop has two main systems:
– Hadoop Distributed File System: self-healing
high-bandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource
management and scheduling coupled with a
scalable data programming abstraction.
3
©2011 Cloudera, Inc. All Rights Reserved.
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
Schema-on-Read (Hadoop):
•
Schema must be created before
any data can be loaded.
•
Data is simply copied to the file
store, no transformation is needed.
•
An explicit load operation has to
take place which transforms
data to DB internal structure.
•
A SerDe (Serializer/Deserlizer) is
applied during read time to extract
the required columns (late binding)
•
New columns must be added
explicitly before new data for
such columns can be loaded
into the database.
•
New data can start flowing anytime
and will appear retroactively once
the SerDe is updated to parse it.
•
Read is Fast
•
Standards/Governance
Pros
•
Load is Fast
•
Flexibility/Agility
4
©2011 Cloudera, Inc. All Rights Reserved.
Innovation: Explore Original Raw Data
Data Committee
5
©2011 Cloudera, Inc. All Rights Reserved.
Data Scientist
Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious
development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java
(modeled After Google’s FlumeJava)
4. Pig Latin: A high-level language out of Yahoo, suitable for batch
data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a
workflow of jobs composed of any of the above.
6
©2011 Cloudera, Inc. All Rights Reserved.
Scalability: Scalable Software Development
Grows without requiring developers to
re-architect their algorithms/application.
AUTO SCALE
7
©2011 Cloudera, Inc. All Rights Reserved.
Scalability: Data Beats Algorithm
Smarter Algos
More Data
A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009
8
©2011 Cloudera, Inc. All Rights Reserved.
Scalability: Keep All Data Alive Forever
Archive to Tape and
Never See It Again
Extract Value From
All Your Data
9
©2011 Cloudera, Inc. All Rights Reserved.
Use The Right Tool For The Right Job
Relational Databases:
Use when:
Hadoop:
Use when:
•
Interactive OLAP Analytics (<1sec)
•
Structured or Not (Flexibility)
•
Multistep ACID Transactions
•
Scalability of Storage/Compute
•
100% SQL Compliance
•
Complex Data Processing
10
©2011 Cloudera, Inc. All Rights Reserved.
HDFS: Hadoop Distributed File System
A given file is broken down into blocks
(default=64MB), then blocks are
replicated across cluster (default=3).
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block Replication for:
• Durability
• Availability
• Throughput
Block Replicas are distributed
across servers and racks.
11
©2011 Cloudera, Inc. All Rights Reserved.
MapReduce: Computational Framework
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
Split 1
(docid, text)
(words, counts)
Map 1
(sorted words, counts)
Be, 5
Reduce 1
“To Be
Or Not
To Be?”
(sorted words,
sum of counts)
Output
File 1
Be, 30
Be, 12
Split i
(docid, text)
Map i
Be, 7
Be, 6
Split N
(docid, text)
Map M
Reduce i
(sorted words,
sum of counts)
Reduce R
(sorted words,
sum of counts)
Shuffle
(words, counts)
Map(in_key, in_value) => list of (out_key, intermediate_value)
(sorted words, counts)
Output
File i
Output
File R
Reduce(out_key, list of intermediate_values) => out_value(s)
12
©2011 Cloudera, Inc. All Rights Reserved.
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks,
then tasks are scheduled to be as
close to data as possible.
Three levels of data locality:
• Same server as data (local disk)
• Same rack as data (rack/leaf switch)
• Wherever there is a free slot (cross rack)
Optimized for:
• Batch Processing
• Failure Recovery
System detects laggard tasks and
speculatively executes parallel tasks
on the same slice of data.
13
©2011 Cloudera, Inc. All Rights Reserved.
But Networks Are Faster Than Disks!
Yes, however, core and disk density per server
are going up very quickly:
•
•
•
•
•
•
1 Hard Disk = 100MB/sec (~1Gbps)
Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
Rack = 20 Servers = 24GB/sec (~240Gbps)
Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
Scanning 4.8TB at 100MB/sec takes 13 hours.
14
©2011 Cloudera, Inc. All Rights Reserved.
Hadoop High-Level Architecture
Hadoop Client
Contacts Name Node for data
or Job Tracker to submit jobs
Name Node
Job Tracker
Maintains mapping of file names
to blocks to data node slaves.
Tracks resources and schedules
jobs across task tracker slaves.
Data Node
Task Tracker
Stores and serves
blocks of data
Runs tasks (work units)
within a job
Share Physical Node
15
©2011 Cloudera, Inc. All Rights Reserved.
Changes for Better Availability/Scalability
Hadoop Client
Federation partitions
out the name space,
High Availability via
an Active Standby.
Contacts Name Node for data
or Job Tracker to submit jobs
Name Node
Each job has its own
Application Manager,
Resource Manager is
decoupled from MR.
Job Tracker
Data Node
Task Tracker
Stores and serves
blocks of data
Runs tasks (work units)
within a job
Share Physical Node
16
©2011 Cloudera, Inc. All Rights Reserved.
Build/Test: APACHE BIGTOP
CDH: Cloudera’s Distribution Including Apache Hadoop
Data Mining
UI Framework/SDK
File System Mount
FUSE-DFS
Workflow
HUE
APACHE MAHOUT
Scheduling
APACHE OOZIE
Metadata
APACHE HIVE
APACHE OOZIE
Languages / Compilers
Data
Integration
APACHE PIG, APACHE HIVE
APACHE FLUME,
APACHE SQOOP
Fast
Read/Write
Access
APACHE HBASE
Coordination
APACHE ZOOKEEPER
SCM Express (Installation Wizard for CDH)
17
©2011 Cloudera, Inc. All Rights Reserved.
Books
18
©2011 Cloudera, Inc. All Rights Reserved.
Conclusion
• The Key Benefits of Apache Hadoop:
– Agility/Flexibility (Quickest Time to Insight).
– Complex Data Processing (Any Language, Any Problem).
– Scalability of Storage/Compute (Freedom to Grow).
– Economical Storage (Keep All Your Data Alive Forever).
• The Key Systems for Apache Hadoop are:
– Hadoop Distributed File System: self-healing highbandwidth clustered storage.
– MapReduce: distributed fault-tolerant resource
management coupled with scalable data processing.
19
©2011 Cloudera, Inc. All Rights Reserved.
Appendix
BACKUP SLIDES
20
©2011 Cloudera, Inc. All Rights Reserved.
Unstructured Data is Exploding
Complex, Unstructured
Relational
• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2
“zettabytes” this year
21
Source: IDC White Paper - sponsored by EMC.
As the Economy Contracts, the Digital Universe Expands. May 2009.
©2011 Cloudera, Inc. All Rights Reserved.
Hadoop Creation History
• Fastest sort of a TB,
62secs over 1,460 nodes
• Sorted a PB in 16.25hours
over 3,658 nodes
22
©2011 Cloudera, Inc. All Rights Reserved.
Hadoop in the Enterprise Data Stack
Data Scientists
IDEs
Analysts
ETL Tools
Enterprise
Reporting
BI, Analytics
Development Tools
System
Operators
Business Users
Business Intelligence Tools
ODBC, JDBC,
NFS, Native
Cloudera
Mgmt Suite
Enterprise
Data
Warehouse
Sqoop
Data
Architects
Flume
Flume
Flume
Sqoop
Logs
Files
Web Data
Relational
Databases
23
©2011 Cloudera, Inc. All Rights Reserved.
Low-Latency
Serving
Systems
Customers
Web
Application
MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and
allocating nodes)
• Application life-cycle management (for MapReduce
scheduling and execution)
Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
24
©2011 Cloudera, Inc. All Rights Reserved.
Application
Industry
Application
Social Network Analysis
Web
Clickstream Sessionization
Content Optimization
Media
Clickstream Sessionization
Network Analytics
Telco
Mediation
Loyalty & Promotions
Retail
Data Factory
Fraud Analysis
Financial
Trade Reconciliation
Entity Analysis
Federal
SIGINT
Sequencing Analysis
Bioinformatics
Genome Mapping
Product Quality
Manufacturing
Mfg Process Tracking
25
©2011 Cloudera, Inc. All Rights Reserved.
Use Case
DATA PROCESSING
Use Case
ADVANCED ANALYTICS
Two Core Use Cases Common Across Many Industries
What is Cloudera Enterprise?
Cloudera Enterprise makes open
source Apache Hadoop enterprise-easy
 Simplify and Accelerate Hadoop Deployment
 Reduce Adoption Costs and Risks
 Lower the Cost of Administration
 Increase the Transparency & Control of Hadoop
 Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
Cloudera
Management
Suite
ProductionLevel Support
Comprehensive
Toolset for Hadoop
Administration
Our Team of Experts
On-Call to Help You
Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense &
intelligence, banking, media and retail organizations depend on Cloudera
EFFECTIVENESS
EFFICIENCY
Ensuring Repeatable Value from
Apache Hadoop Deployments
Enabling Apache Hadoop to be
Affordably Run in Production
26
©2011 Cloudera, Inc. All Rights Reserved.
Hive vs Pig Latin (count distinct values > 0)
• Hive Syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig Latin Syntax:
mytable = LOAD ‘myfile’ AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT col1;
mytable = GROUP mytable BY col1;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
27
©2011 Cloudera, Inc. All Rights Reserved.
Apache Hive Key Features
•
•
•
•
•
•
•
•
•
•
•
A subset of SQL covering the most common statements
JDBC/ODBC support
Agile data types: Array, Map, Struct, and JSON objects
Pluggable SerDe system to work on unstructured files directly
User Defined Functions and Aggregates
Regular Expression support
MapReduce support
Partitions and Buckets (for performance optimization)
Microstrategy/Tableau Compatibility (through ODBC)
In The Works: Indices, Columnar Storage, Views, Explode/Collect
More details: http://wiki.apache.org/hadoop/Hive
28
©2011 Cloudera, Inc. All Rights Reserved.
Hive Agile Data Types
• STRUCTS:
– SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
– SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
– SELECT mytable.mycolumn[5] FROM …
• JSON:
– SELECT get_json_object(mycolumn, objpath) FROM …
29
©2011 Cloudera, Inc. All Rights Reserved.