How to Leverage Data Lakes

How to Leverage Data Lakes (Hive, Hortonworks,
Cloudera, Impala, Spark SQL)
Presented by: Trishla Maru
#mstrworld
Agenda
Data Lakes and Hadoop Ecosystem
Various Ways to Access Big Data
Demo
Top 5 Reasons to Use MicroStrategy
Customer Success Stories
Summary and Q&A
Typical Big Data Lake
[Diagram: sources of data (OLTP, ERP, and CRM systems; documents and emails; web logs and click streams; social networks; machine-generated data; sensor data; geolocation data) flow into the data lake (Hadoop, HDFS). Hive, Spark, Impala, Pig, and Drill connect the data lake to the MicroStrategy Analytics Platform.]
Data Warehouse vs Data Lake
A Data Lake is NOT a Data Warehouse
Data Warehouse | Data Lake
Structured and processed data | Structured, semi-structured, unstructured, and raw data
Schema-on-write processing | Schema-on-read processing
Expensive for large data storage volumes | Designed for low-cost storage
Less agile, fixed configuration | Highly agile, configure and reconfigure as needed
More for business professionals | Data scientists et al.
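The schema-on-write versus schema-on-read distinction can be illustrated with a minimal sketch. It is not tied to any product: the schema, the sample records, and every function name below are hypothetical and exist only for illustration.

```python
import json

# Hypothetical schema for illustration: field name -> Python type.
SCHEMA = {"user_id": int, "event": str}


def validate(record, schema):
    """Return the record cast to the schema, or raise ValueError."""
    try:
        return {name: cast(record[name]) for name, cast in schema.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not fit schema: {record!r}") from exc


def write_with_schema(store, raw_line):
    """Schema on write: parse and validate before storing (warehouse style)."""
    store.append(validate(json.loads(raw_line), SCHEMA))


def write_raw(store, raw_line):
    """Schema on read: store the raw line untouched (data lake style)."""
    store.append(raw_line)


def read_with_schema(store, schema):
    """Apply a schema only at read time; skip lines that do not fit it."""
    rows = []
    for raw_line in store:
        try:
            rows.append(validate(json.loads(raw_line), schema))
        except ValueError:
            continue
    return rows


# Hypothetical input: the second record is missing user_id.
lines = ['{"user_id": "7", "event": "click"}', '{"event": "view"}']

warehouse, lake = [], []
rejected = 0
for line in lines:
    write_raw(lake, line)  # the lake accepts everything as-is
    try:
        write_with_schema(warehouse, line)
    except ValueError:
        rejected += 1  # the warehouse rejects records that do not fit
```

Because the lake keeps the raw lines, a different schema such as `{"event": str}` can be applied later with `read_with_schema(lake, {"event": str})` without reloading anything. That is the sense in which the lake can be configured and reconfigured as needed.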
MicroStrategy/Hadoop Ecosystem
[Diagram: the MicroStrategy Analytics Platform sits above the Hadoop distributions (Cloudera, MapR, Hortonworks, Amazon EMR). Components shown: Spark SQL; Hive and Pig on MapReduce; Impala; Drill; Spark; the MSTR Hadoop Gateway; Phoenix/JDBC with HBase and Cassandra; YARN; and the Hadoop Distributed File System (HDFS).]
Support for Big Data Sources in the Data Lake
Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single Database
Bring All Relevant Data to Decision Makers
[Diagram: source categories shown include MapReduce and NoSQL databases, columnar databases, data warehouse appliances, relational databases, multidimensional databases, SaaS-based app data, generic web services (SOAP, REST, and web services with OAuth), and user/departmental data. Named examples include Elastic MapReduce, BigInsights, HDFS, Redshift, HANA, Parallel Data Warehouse, Analysis Services, Google Analytics, Zendesk, clipboard data, and MicroStrategy datasets, among many more.]
Various Ways to Connect to Hadoop
Generalized Big Data Analytics Workflow
[Diagram: sources (unstructured, semi-structured, structured) pass through models to outputs (HDFS, RDBMS, Hive, HBase, Cassandra, MongoDB, SOLR, Elastic, other NoSQL).]
• In general, big data environments are distributed transformation and calculation frameworks that take in sources and generate outputs.
• Sources tend to be unstructured or semi-structured, whereas outputs tend to be semi-structured and structured.
• Hadoop-based workflows may incorporate several components tied together with a dataflow or workflow manager such as Pig or Oozie, or with custom programs.
• Spark can use Spark APIs and components throughout the workflow for a more uniform experience.
Ways to Connect and Query Hadoop
#1 Access and Query Hadoop via Hive
[Diagram: MicroStrategy Analytics Platform -> Hive ODBC Connector -> Hive -> HDFS, on any Hadoop distribution.]
• Querying through Hive is the most popular way of accessing Hadoop.
• Hive allows users who aren’t familiar with programming to access and analyze big data in a less technical way, using a SQL-like syntax called Hive Query Language (HiveQL). Hive is used for complex, long-running tasks and analyses on large sets of data.
• Hive uses the MapReduce framework to query data from HDFS. It is fault tolerant but has high latency.
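To make the MapReduce point concrete, here is a minimal single-process sketch of how an aggregation such as `SELECT page, COUNT(*) FROM logs GROUP BY page` breaks down into map, shuffle, and reduce steps. The `logs` table, the sample rows, and the function names are hypothetical. Hive's real query planning and distributed execution are far more involved, so this only illustrates the programming model.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table stored in HDFS.
ROWS = [
    {"page": "/home", "user": "a"},
    {"page": "/cart", "user": "a"},
    {"page": "/home", "user": "b"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, like a mapper for COUNT(*) GROUP BY page."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(ROWS)))
```

On a real cluster each phase runs as separate tasks across many nodes, with intermediate results written to disk between stages. That design is consistent with the fault tolerance and the high latency noted above.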
Ways to Connect and Query Hadoop
#2 Access and Query Hadoop via Impala (only with Cloudera distribution)
[Diagram: MicroStrategy Analytics Platform -> Hive ODBC Connector -> Impala -> HDFS, on the Cloudera distribution.]
• Impala also uses SQL syntax to query Hadoop, but it does NOT use the MapReduce framework. It has its own processing engine to query HDFS.
• Impala is used for analysis when the user wants a query to run and return quickly on a small subset of data, e.g. analyzing company finances for a daily or weekly report. It is not ideal for complex data manipulation, data preparation, etc.
• Hive is a screwdriver and Impala is a drill bit.
Hive and Impala
Hive
• Hive is ‘SQL on Hadoop’
• Hive uses MapReduce to process its queries
• Hive cannot perform in-memory query operations
• Hive is fault tolerant. Recommended for ETL-type jobs
• Hive has high latency
• Use case: if you have batch-processing needs, use Hive
Impala
• Impala is ‘SQL on HDFS’
• Impala uses its own processing engine
• Impala can perform in-memory query operations
• Impala is not fault tolerant
• Impala has faster response times, great for small ad-hoc queries
• Use case: if you need real-time, ad-hoc queries over a subset of data, use Impala
Ways to Connect and Query Hadoop
#3 Access and Query Hadoop via Apache Spark
[Diagram: MicroStrategy Analytics Platform -> Spark ODBC Connector -> Apache Spark -> HDFS.]
• Apache Spark is one of the fastest-evolving open source communities, popular for its MPP capabilities on top of Hadoop and for providing advanced analytical workflows.
• MicroStrategy connects to Spark via the Spark SQL/ODBC connector.
Hive on Spark
• Replaces the Hive SQL engine with the Spark SQL engine, starting with Hive 1.2
• Set hive.execution.engine=spark
Key Points for Spark
• Spark doesn’t store information; it uses common Hadoop and NoSQL storage for input and output.
• Internally, Spark uses in-memory stores to pass data from one processing step to another.
• Spark SQL in Spark applications works on the in-memory stores to simplify Spark programming against the Resilient Distributed Dataset (RDD) structures. This is not accessible via ODBC.
• In-memory RDDs are not available after a Spark application completes. Spark SQL ODBC can only access RDDs that have been persisted in Hive.
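The last point, that results must be persisted in Hive to remain reachable over ODBC, can be sketched in PySpark. The sketch uses the DataFrame API rather than raw RDDs. The helper names, application name, and table name are hypothetical, and `run_example` is deliberately not invoked here because it assumes a pyspark installation with Hive support.

```python
def persist_for_odbc(df, table_name):
    """Save a DataFrame as a table so it outlives the Spark application.

    With Hive support enabled, saveAsTable registers the table in the Hive
    metastore instead of keeping the data only in application memory.
    """
    df.write.mode("overwrite").saveAsTable(table_name)
    return table_name


def run_example():
    """Not called here: needs pyspark and a Hive-enabled Spark environment."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("persist-example")  # hypothetical app name
        .enableHiveSupport()
        .getOrCreate()
    )
    # Hypothetical in-memory result produced by the application.
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
    persist_for_odbc(df, "example_results")  # hypothetical table name
    spark.stop()
```

Once the application stops, the in-memory DataFrame is gone, but the saved table remains in storage and can be queried by later sessions.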
Ways to Connect and Query Hadoop
#4 Tap into Hadoop Natively with MicroStrategy’s Hadoop Gateway
[Diagram, marked NEW: MicroStrategy Analytics Platform -> Big Data Engine/Hadoop Gateway -> HDFS, inside Hadoop.]
• We launched this connectivity with v10. Big Data Engine (BDE) is a native YARN-based application that enables direct access to HDFS. None of the core BI competitors have this capability.
• YARN (Yet Another Resource Negotiator) is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
• Use case: faster loading of data from HDFS, leveraging our in-memory layer for analytics.
Demo
Top Five Reasons to Use MicroStrategy for Your Next Big Data Project
#1 Leverage varied flavors of Hadoop access to fulfill business needs
• MicroStrategy is the only core BI tool to provide native connectivity to HDFS, enabling fast data loads into its in-memory layer.
• MicroStrategy certifies Cloudera Impala, Presto, Google BigQuery, Pivotal HAWQ, and Apache Drill as data sources.
• MicroStrategy optimizes and certifies Hadoop/Hive as a data source.
• MicroStrategy certifies Spark via Spark SQL/ODBC.
• MicroStrategy also provides a connector to execute freeform Pig Latin reports.
[Diagram labels: Hadoop Gateway; SQL on Hadoop/Big Data; Apache Hive; Apache Spark; Apache Pig]
#2 Analyze large volumes of data with a platform that is scalable and performant
• Access the database with high throughput
• Create and publish the cube with high data scalability
• Analyze data with fast response times
Sample Applications
• CRM analysis across a large customer base
• Interactive analysis on high-volume clickstream data
• Merchant analytics for a credit card issuer
• Store manager application for a large national chain
#3: Cleanse your Hadoop data ‘on the fly’ with MicroStrategy’s Data Wrangler
• Cleanse and refine your data with ‘Data Wrangler’
• Designed for analysts and business users
• Cleanse your data without spending any extra $$
• Get suggestions from our Data Wrangler as you prepare your data!
• Extract, Split Patterns
• Generate Facets
• Rename, Delete, and many more..
• Predictive Interactions
• Work on a sample and “Apply” to the entire dataset
• Undo/Redo without penalty
[Diagram labels: Data Preparation; Visualization & Analysis]
#4: Perform ‘on the fly’ dynamic searches on semi-structured data
Business Case: Analyzing Product Catalog
• Perform faceted search by dynamically clustering the items.
• Users can then “drill down” by applying specific constraints to the search results.
• Manufacturers can get superior feedback.
Business Case: Analyzing Diagnostic Notes
• Needs a search engine’s capabilities to facilitate semi-structured data analysis in the health care domain.
• Helps health care professionals make informed decisions in situations that are time sensitive.
Steps:
• Configure the file in Apache Solr
• Import the file via data import
• Dynamically search for keywords via Visual Insight
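The faceted search and drill-down described in the product catalog case can be sketched in a few lines of plain Python. In the deck this work is done by Apache Solr and Visual Insight, so the catalog records and both helper functions below are hypothetical stand-ins that only illustrate the idea.

```python
from collections import Counter

# Hypothetical product catalog records.
CATALOG = [
    {"name": "Trail Shoe", "brand": "Acme", "color": "red"},
    {"name": "Road Shoe", "brand": "Acme", "color": "blue"},
    {"name": "Rain Jacket", "brand": "Zenith", "color": "red"},
]


def facets(items, fields):
    """Count how many items fall under each value of each facet field."""
    return {field: Counter(item[field] for item in items) for field in fields}


def drill_down(items, **constraints):
    """Keep only the items that match every field=value constraint."""
    return [
        item
        for item in items
        if all(item.get(field) == value for field, value in constraints.items())
    ]


all_facets = facets(CATALOG, ["brand", "color"])  # facet counts over everything
red_items = drill_down(CATALOG, color="red")      # apply one constraint
red_facets = facets(red_items, ["brand"])         # recompute facets on the subset
```

Recomputing the facet counts after each constraint is what lets a user keep narrowing the result set interactively.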
#5: Get federated data on dashboards leveraging MicroStrategy’s integrated architecture
[Diagram: dashboards draw on relational databases (Oracle), multi-dimensional sources (Analysis Services), MapReduce databases (Hadoop), and personal data sources (Excel).]
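As a rough illustration of what federating data means in practice, the sketch below blends rows from a relational source with rows from a personal data source on a shared attribute. It is not MicroStrategy's mechanism. An in-memory SQLite database stands in for a relational database, a list of dictionaries stands in for spreadsheet rows, and all table, column, and function names are hypothetical.

```python
import sqlite3

# Source 1: a relational database (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 80.0)],
)

# Source 2: hypothetical personal data, e.g. rows read from a spreadsheet.
targets = [
    {"region": "East", "target": 100.0},
    {"region": "West", "target": 90.0},
]


def blend(connection, target_rows):
    """Join revenue per region from the database with targets from the other source."""
    target_by_region = {row["region"]: row["target"] for row in target_rows}
    query = "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
    return [
        {"region": region, "revenue": revenue, "target": target_by_region.get(region)}
        for region, revenue in connection.execute(query)
    ]


dashboard_rows = blend(conn, targets)
```

The shared `region` attribute plays the role of the linking key, so a dashboard can show revenue and target side by side even though they come from different systems.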
Customer Success Stories
Self Service Big Data Analytics with over 30 Applications
Key BI Characteristics:
INDUSTRY: Wireless Telecom
BI COMPONENTS: Multiple Applications
DATABASE: Hive and Redshift
HADOOP DISTRIBUTION: Impala
VOLUME OF DATA: 1 TB Daily
APPLICATIONS: 30 Applications
Business Use and Benefits
• The largest telecom company in Korea, SK Planet, provides mobile services such as navigation, mobile payment, online shopping, memberships, etc. They have 30+ services, and they have all of their data in Hadoop (basically no relational databases for analysis).
• MicroStrategy provides a centralized environment for analyzing their Hadoop data in a managed self-service manner.
• Self-service data discovery while fully leveraging MicroStrategy’s in-memory technology.
Making Better Sales Decisions by Getting Insights from Web Logs in Hadoop
Key BI Characteristics:
INDUSTRY: Entertainment
BI COMPONENTS: 1 Application; Traditional Reports
USERS: ~200
DATABASE: Hadoop, Teradata
HADOOP DISTRIBUTION: Amazon EMR
VOLUME OF DATA: Petabytes
TYPE OF DATA: Log and Events data
APPLICATIONS: Sales Analysis
Business Use and Benefits
• Sales analysis, generally around a new launch in a new region: quick report analysis to understand the new accounts, number of hours of viewing, etc.
• Directly querying and reporting from MicroStrategy on logs via Hive
• Able to make better sales decisions
Analyzing Online Behavior with Live Queries to Hadoop
Key BI Characteristics:
INDUSTRY: E-commerce
BI COMPONENTS: 1 Application; Reports, Dashboards, VI
USERS: ~200
DATABASE: Hadoop, Oracle
HADOOP DISTRIBUTION: Apache
VOLUME OF DATA: Petabytes
TYPE OF DATA: Web Logs, Online behavior
APPLICATIONS: Sales Analysis
Business Use and Benefits
• Analyzing web logs/online behavior stored in Hadoop. Dashboards and VI analysis run against our in-memory cubes, and ad-hoc reports run live against Hive.
• End users do not need to code with MapReduce.
• Developers are more productive delivering self-service BI through a tool instead of coding a custom user interface.
Smart TV Analytics on User Behavior and App Usage
Leading Electronics Manufacturing
Key BI Characteristics:
INDUSTRY: Electronics Manufacturing
BI COMPONENTS: Multiple Applications
DATABASE: Hive and Redshift
HADOOP DISTRIBUTION: S3
VOLUME OF DATA: 20-30 TB Daily
APPLICATIONS: Smart TV Analytics, Mobile App Analytics
Business Use and Benefits
• This electronics company is collecting data from all of their smart TVs around the world to analyze user behavior and app usage. It includes remote controller click stream and log analysis for more than 30 smart TV services.
• The data is actually not stored in HDFS but in S3. By fully leveraging the elastic computing capability of AWS for MapReduce, with the CPU-time-based costing of EC2, this takes a big cut out of the infrastructure cost.
• Using MicroStrategy to build applications to analyze click streams and logs.
Precise Targeting of Digital Ads
Multi-Channel Digital Distribution Provider
Key BI Characteristics:
INDUSTRY: Electronics and Media
BI COMPONENTS: 1 Application; Reports, VI, Dashboards
DATABASE: Hadoop, Hive
HADOOP DISTRIBUTION: Cloudera Impala
VOLUME OF DATA: Over 1 Billion traffic attribute combinations
APPLICATIONS: Traffic Attribute Multiplier
Business Use and Benefits
• The Traffic Attribute Multiplier application is helping Adconion precisely target their digital ads, shorten the time to prepare and tune models, and improve ad delivery ROI for their customers.
• Leveraging MicroStrategy’s integration with Impala and the rich visualizations library makes the results easy for business users to consume.
• Achieved a 2.4% improvement in ad budget spending efficiency
Summary
• Data Lakes are Powerful: MicroStrategy is ready to support your hybrid architecture
• Various Ways to Access Big Data: MicroStrategy offers it all and is committed to being at par
• A Robust Platform to Analyze Big Data: highly scalable in-memory engine
• Get Federated Data: effectively navigate data across multiple data sources with MicroStrategy multi-source and data blending
• End User Data Preparation: perform ‘on the fly’ cleansing on Hadoop data
Questions?
Q&A