HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM

Transcription

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM
APACHE HADOOP WORKSHOP
AGENDA
• Introduction
• What is Hadoop and the rationale behind it
• Hadoop Distributed File System (HDFS) and MapReduce
• Common Hadoop use cases
• How Hadoop integrates with other systems such as relational databases and data warehouses
• The other components in a typical Hadoop “stack”, such as Hive, Pig, HBase, Sqoop, Flume and Oozie
• Conclusion
ABOUT TRIFORCE
Triforce provides critical, reliable IT infrastructure solutions and services to Australian and New Zealand listed corporations and government agencies. Triforce has qualified and experienced technical and sales consultants, and demonstrated experience in designing and delivering enterprise Apache Hadoop solutions.
TRIFORCE BIG DATA PARTNERSHIP
• NetApp: The NetApp Open Solution for Hadoop provides customers with flexible choices for delivering enterprise-class Hadoop.
• Cloudera: Cloudera is the market leader in Hadoop enterprise solutions. Cloudera’s 100% open-source distribution including Apache Hadoop (CDH), combined with Cloudera Enterprise, comprises the most reliable and complete Hadoop solution available.
WHAT IS HADOOP?
• “a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models.”
(http://hadoop.apache.org/)
• “Apache Hadoop is a software framework that supports data-intensive
distributed applications under a free license. It enables applications to
work with thousands of nodes and petabytes of data.”
(http://en.wikipedia.org/wiki/Hadoop/)
THE RATIONALE FOR HADOOP
• “Hadoop enables distributed parallel processing of huge amounts of data
across inexpensive, industry-standard servers that both store and process
the data, and can scale without limits. With Hadoop, no data is too big.”
(http://www.cloudera.com)
• Hadoop processes petabytes of unstructured data in parallel across
potentially thousands of commodity boxes using an open source filesystem and related tools
• Hadoop has been all about innovative ways to process, store, and
eventually analyse huge volumes of multi-structured data.
EXAMPLES
• 2.7 Zettabytes of data exist in the digital universe today.
(Gigabyte, Terabyte, Petabyte, Exabyte, Zettabyte)
• Facebook stores, accesses, and analyses 30+
Petabytes of user generated data.
• Decoding the human genome originally took 10 years to
process; now it can be achieved in one week.
• YouTube users upload 48 hours of new video every
minute of the day.
• 100 terabytes of data uploaded daily to Facebook
HADOOP
• Handles all types of data
– structured, unstructured, log files, pictures, audio files, communications records, email
• No prior need for a schema
– you don’t need to know how you intend to query your data before you store it
• Makes all of your data usable
– By making all of your data usable, not just what’s in your databases, Hadoop lets you see relationships that were hidden before and reveal answers that have always been just out of reach. You can start making more decisions based on hard data instead of hunches, and look at complete data sets, not just samples.
• Two parts to Hadoop
– MapReduce
– Hadoop Distributed File System (HDFS); a quick HDFS command-line sketch follows below
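A minimal command-line sketch of putting data into HDFS (the directory and file names are illustrative, not from the slides):

$ hadoop fs -mkdir /data/weblogs                    # create a directory in HDFS
$ hadoop fs -put access_log /data/weblogs/          # copy a local file into HDFS
$ hadoop fs -ls /data/weblogs                       # list the directory
$ hadoop fs -cat /data/weblogs/access_log | head    # read the first lines back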
What is this Big Elephant? HADOOP
Geever Paul Pulikkottil
BigData Solutions Architect (CCAH, CCDH)
CASE FOR BIGDATA
Databases:
– Here for more than 20 years
– Continue to store structured transactional data
• Large server(s)
• Multiple CPUs
• Huge memory buffers
• SAN disks
– Relatively low-latency queries, indexed data
CASE FOR BIGDATA
TYPICAL WORKLOADS – DATABASE
OLTP (online transaction processing)
• Typical use: e-commerce, banking
• Nature: user-facing, real-time, low latency, highly concurrent
• Job: relatively small set of “standard” transactional queries
• Data access pattern: random reads, updates, writes (relatively small data)
OLAP (online analytical processing)
• Typical use: BI, data mining
• Nature: back-end processing, batch workloads
• Job: complex analytical queries, often ad hoc
• Data access: table scans, large queries
CASE FOR BIGDATA
Data warehouse:
– Consolidated database loaded from CRM, ERP and OLTP systems
– Process: staging, cleansing, loading
– Purpose: BI reporting, forecasts, quarterly reporting
– Size: larger servers, multiple CPUs, SAN disks; many TBs
• Challenges:
• As the data grows over time, things get slower
• Batch loads should fit within the daily or weekly loading cycle
• Relatively expensive to license, store and manage
CASE FOR BIGDATA
New objective: businesses want to “connect” with the customer
• We are generating lots of data; most of it is discarded
• Likes and dislikes: Facebook, Twitter, LinkedIn
• Predictable outcomes: you can predict them when you know the customer
• React quickly: time missed = opportunity lost!
Question: can a DW provide that?
• Where can you store TBs or PBs of unstructured data more economically?
• How can you scale out easily, rather than doing forklift upgrades?
• How can I finish batch jobs when the data grows beyond TBs?
• We need a scalable, distributed system that can store and process large amounts of data
CASE FOR BIGDATA
• Distributed systems are not NEW:
– Common frameworks include MPI and PVM
– Focus on distributing the processing workload
– Powerful compute nodes with separate systems for data storage
– Fast network connections, e.g. InfiniBand
• Typical processing pattern:
– Step 1: copy input data from storage to a compute node
– Step 2: perform the necessary processing
– Step 3: copy output data back to storage
– Often hundreds to thousands of nodes, sometimes with GPUs
CASE FOR BIGDATA
Distributed HPC
– relatively small amounts of data
– doesn’t scale with large amounts of data
– more time is spent copying data than actually processing it
– getting data to the processors is the bottleneck
– gets worse as more compute nodes are added
– each node competes for the same bandwidth
– compute nodes become starved for data
“Distributed systems pay for compute scalability by adding complexity: CUDA Fortran, PGI programming?”
BIGDATA SOLUTION: HADOOP
What is Hadoop?
– An open source distributed computing platform
– Its filesystem is based on Google’s GFS
– Commodity hardware: no SAN, no InfiniBand
– Scales from single servers up to thousands of machines
– Each machine offers local computation and storage
– Designed to detect and handle failures at the application layer
– Adding more nodes increases “performance” and “capacity” with no penalty
– Commodity hardware is prone to failures, and Hadoop knows that!
HADOOP CLUSTER STACK
Master Nodes (1st rack)
- Name Node
- Standby Name Node
- Job Tracker
Slave Nodes (all racks)
- Data Nodes with direct attached large capacity disks (SATA)
Plus:
- Management or Admin Node
- Hadoop Client Node(s)
- Typical setup
MAPREDUCE PROGRAMMING
Hadoop is great for large-data processing!
- MapReduce code requires you to write Java classes and driver code
- It is complicated to write MapReduce jobs, so we need a simpler method (see the mapper sketch after this list)
- Develop a higher-level language to facilitate large-data processing
- Hive: a SQL-like language for Hadoop, called HQL
- Pig: Pig Latin is a scripting language, a bit like Perl
- Both translate into and run a series of Map-only or MapReduce jobs
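To illustrate that verbosity, here is a minimal sketch of just the mapper for the classic word-count job; the reducer and driver classes would still need to be written, and the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word of every input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

The Hive and Pig equivalents on the next slide express the same kind of job in a few lines.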
ECOSYSTEM TOOLS: HIVE AND PIG
Hive:
- Data warehousing application on Hadoop
- Query language is HQL, a variant of SQL
- Tables are stored on HDFS as flat files
- Developed by Facebook, now open source
Pig:
- Large-scale data processing system
- Scripts are written in Pig Latin, a dataflow language
- Developed by Yahoo!, now open source
Objective:
- A higher-level language to facilitate large-data processing
- The higher-level language “compiles down” to Hadoop jobs
HIVE AND PIG EXAMPLE CODE
Hive example:
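A minimal illustrative Hive (HQL) sketch; the weblogs table and its columns are hypothetical, not taken from the original slide:

-- define a table over tab-delimited files and load data into it
CREATE TABLE weblogs (ip STRING, status INT, bytes INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/weblogs/access_log' INTO TABLE weblogs;

-- count requests per HTTP status code
SELECT status, COUNT(*) AS hits
FROM weblogs
GROUP BY status;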
Pig example:
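The equivalent Pig Latin sketch over the same hypothetical data:

-- load tab-delimited records, group by status code and count
logs    = LOAD '/data/weblogs/access_log'
          USING PigStorage('\t') AS (ip:chararray, status:int, bytes:int);
by_code = GROUP logs BY status;
hits    = FOREACH by_code GENERATE group AS status, COUNT(logs) AS n;
DUMP hits;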
ECOSYSTEM TOOLS: SQOOP
- Imports data from an RDBMS into Hadoop
- Individual tables, portions (WHERE clause) or entire databases
- Stored in HDFS as delimited text files or SequenceFiles
- Provides the ability to import from SQL databases straight into your Hive data warehouse
- Uses JDBC to connect to the RDBMS; additional connectors are available for BI/DW systems
- Sqoop automatically generates a Java class to import the data into Hadoop
- Sqoop provides an incremental import mode
- Exports tables from Hadoop back to an RDBMS
SQOOP IMPORT EXAMPLES
> Importing Data into HDFS as Hive table using SQOOP
user@dbserver$> sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local \
--hive-import
> Importing Data to HDFS as compressed sequence files (No Hive) using SQOOP
user@dbserver$> sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
--as-sequencefile
> Importing Data into HBase using SQOOP:
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--hbase-create-table --hbase-table ORDERS --column-family mysql
> Exporting Data to RDBMS using SQOOP:
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--export-dir /user/arvind/ORDERS
• This would connect to the MySQL database on the given server and import the USERS table into HDFS.
• The --local option instructs Sqoop to take advantage of a local MySQL connection.
• With the --hive-import option, after reading the data into HDFS, Sqoop connects to the Hive metastore, creates a table named USERS with the same columns and types (translated into their closest analogues in Hive), and loads the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).
SQOOP CUSTOM CONNECTORS
Sqoop works with a standard JDBC connection to common databases; custom, faster, tuned connectors are available for:
– Cloudera Connector for Teradata
– Cloudera Connector for Netezza
– Cloudera Connector for MicroStrategy
– Cloudera Connector for Tableau
– Quest Data Connector for Oracle and Hadoop
ECOSYSTEM TOOLS: FLUME
Flume:
Gathers data and logs from multiple systems, inserting them into HDFS as they are generated. Typically used to ingest log files from real-time systems such as web servers, firewalls and mail servers into HDFS.
Each Flume agent has a source and a sink:
• Source
– Tells the node where to receive data from
• Sink
– Tells the node where to send data to
• Channel
– A queue between the source and sink
– Can be in-memory only or ‘durable’
– Durable channels will not lose data if power is lost
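A minimal sketch of a Flume agent configuration wired source -> channel -> sink; the agent name, host names and paths are illustrative:

# one exec source, one in-memory channel, one HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# source: tail a web-server log as it is written
agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

# channel: in-memory queue (use a durable file channel if events must survive a power loss)
agent1.channels.ch1.type = memory

# sink: write the events into HDFS
agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode.example.com/flume/weblogs
agent1.sinks.sink1.channel   = ch1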
ECOSYSTEM TOOLS: FUSE
FUSE: “Filesystem in Userspace”
– Allows HDFS to be mounted as a UNIX file system
– Users can run 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard POSIX calls such as open, write, read and close
– You can even export a FUSE mount using NFS
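A minimal sketch, assuming the hadoop-fuse-dfs helper shipped with CDH; the host name and paths are illustrative:

# mount HDFS at /mnt/hdfs
$ mkdir -p /mnt/hdfs
$ hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

# then use ordinary POSIX tools against it
$ ls /mnt/hdfs/user
$ grep ERROR /mnt/hdfs/data/weblogs/access_log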
ECOSYSTEM TOOLS: OOZIE
Oozie:
– Oozie is a ‘workflow engine’
– Runs workflows of Hadoop jobs
– Pig, Hive and Sqoop jobs
– Jobs can be run at specific times, one-off or recurring
– Jobs can also be run when data is present in a directory
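A minimal workflow sketch that wraps the Sqoop import from the earlier slide in a single action; treat it as illustrative, since schema versions and property names can differ between Oozie releases:

<workflow-app name="nightly-import" xmlns="uri:oozie:workflow:0.2">
  <start to="import-users"/>
  <action name="import-users">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect jdbc:mysql://db.example.com/website --table USERS --hive-import</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Sqoop import failed</message>
  </kill>
  <end name="end"/>
</workflow-app>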
ECOSYSTEM TOOLS: MAHOUT
Mahout:
- Mahout is a machine learning library
- Contains many pre-written ML algorithms
- R is another open source library used by data scientists
ECOSYSTEM TOOLS: IMPALA <CDH4.1>
IMPALA:
– Brings real-time, ad hoc query to Hadoop
– Queries data stored in HDFS or HBase
– SELECT, JOIN and aggregate functions in real time
– Uses the same Hive metadata
– SQL syntax (Hive SQL), ODBC driver
– Same user interface (Hue Beeswax) as Hive, plus an Impala shell
– Released 26th Oct 2012 with CDH4.1
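A minimal interactive sketch through the Impala shell; the impalad host and the weblogs table are the same hypothetical examples used earlier:

$ impala-shell -i impalad-host.example.com
> SELECT status, COUNT(*) AS hits
  FROM weblogs
  GROUP BY status;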
HBASE – REAL TIME DATA WITH UPDATE
HBase is a distributed, sparse, column-oriented data store
– Real-time read/write access to data on HDFS
– Modeled after Google’s BigTable data store
– Designed to use multiple machines to store and serve data
– Leverages HDFS to store data
– Each row may or may not have values for all columns
– Data is stored grouped by column, rather than by row
– Columns are grouped into ‘column families’, which define which columns are physically stored together
– Scales to provide very high write throughput: hundreds of thousands of inserts per second
– Has a constrained access model: no SQL
• Insert a row, retrieve a row, do a full or partial table scan
• Only one column (the ‘row key’) is indexed
– Based on a key/value store: [rowkey, column family, column qualifier, timestamp] -> cell value
• [TheRealMT, info, password, 1329088818321] -> abc123
• [TheRealMT, info, password, 13290888321289] -> newpass123
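A minimal HBase shell sketch that matches the key/value example above; 'users' is an assumed table name holding the 'info' column family:

hbase> create 'users', 'info'
hbase> put 'users', 'TheRealMT', 'info:password', 'abc123'
hbase> put 'users', 'TheRealMT', 'info:password', 'newpass123'   # newer timestamp wins
hbase> get 'users', 'TheRealMT'
hbase> scan 'users'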
HBASE
HBase:
– Indexed by [rowkey + column qualifier + timestamp]
• HBase is not a relational database
– No SQL query language (GET/PUT/SCAN)
– No joins, no secondary indexing, no transactions
– A table is split into Regions
– Regions are served by Region Servers
– Region Servers are Java processes running on DataNodes
– Two special tables: -ROOT- and .META.
– MemStore and HFiles
– Every MemStore flush creates one HFile per column family
– Major/minor compactions consolidate and reduce the number of HFiles
DATA HAS CHANGED
HADOOP USE CASES:
• What do we know today?
• We love to be connected and to collaborate
• We love to share emotions, likes and dislikes
• Digital marketing is focused on social media
• Get more insights across collections of data
• Need to store and analyse all sorts of data
• Real-time recommendation engines
• Predictive modelling with data science
COMMON HADOOP USE CASES
• Financial Services –
– Consumer & market risk modelling
– Personalization & recommendations
– Fraud detection & anti-money laundering
– Portfolio valuations
COMMON HADOOP USE CASES
• Government –
– Cyber security & fraud detection,
– Geospatial image & video processing
COMMON HADOOP USE CASES
• Media & Entertainment –
– Search & recommendation optimization,
– User engagement & digital content analysis,
– Ad/offer targeting,
– Sentiment & social media analysis
HADOOP USE CASES: DATA STORES
OLTP database
• For user-facing transactions; retains records
Extract-Transform-Load (ETL)
• Periodic ETL (e.g., nightly); extracts records from the source
• Transform: clean data, check integrity, aggregate, etc.
• Load into the OLAP database
OLAP database for Data Warehousing (DW)
• Business Intelligence: reporting, ad hoc queries, data mining
HADOOP USE CASES: REPLACE DW ?
Reporting is often a nightly task
• ETL is often slow, and runs after the day ends
• What happens if processing 24 hours of data takes longer than 24 hours?
Hadoop is perfect
• Most likely, you already have some DW
• Ingest is limited only by the speed of HDFS
• Scales out with more nodes
• Massively parallel
• Ability to use any processing tool
• Much cheaper than parallel databases
• ETL is a batch process anyway!
CLOUDERA DISTRIBUTION HADOOP 4.1
Cloudera Enterprise Subscription
Options:
• Cloudera Enterprise Core
• Cloudera Enterprise RTD
(Real-Time Delivery)
• Cloudera Enterprise RTQ
(Real-Time Query)
WHERE TO FROM HERE?
• Understand use cases
• Deploy Hadoop infrastructure
• Confirm data sources
• Build a business case
• Use Hadoop to answer questions
• Design a solution
CONTACT TRIFORCE
• Call 1300 664 667
• Email: [email protected]
• View our Big Data Resources page at
www.triforce.com.au
• Follow us on LinkedIn:
http://www.linkedin.com/company/triforceaustralia