How to Leverage Data Lakes
(Hive, Hortonworks, Cloudera, Impala, Spark SQL)
Presented by: Trishla Maru
#mstrworld

Agenda
• Data Lakes and Hadoop Ecosystem
• Various Ways to Access Big Data
• Demo
• Top 5 Reasons to Use MicroStrategy
• Customer Success Stories
• Summary and Q&A

Typical Big Data Lake
[Diagram: MicroStrategy Analytics Platform; Hive, Spark, Impala, Pig, Drill; DATA LAKE (HDFS, Hadoop)]
SOURCES OF DATA
• OLTP, ERP, CRM systems
• Documents
• Emails
• Web logs, Click Streams
• Social Networks
• Machine Generated
• Sensor Data
• Geolocation Data

Data Warehouse vs Data Lake
A Data Lake is NOT a Data Warehouse
Data Warehouse | Data Lake
Structured and Processed Data | Structured, Semi-Structured/Unstructured/Raw Data
Schema on write processing | Schema on read processing
Expensive for large data storage volumes | Designed for low cost storage
Less agile, fixed configuration | Highly agile, configure and reconfigure as needed
More for business professionals | Data scientists et al.

MicroStrategy/Hadoop Ecosystem
[Diagram: MicroStrategy Analytics Platform; Cloudera, MapR, Hortonworks, Amazon EMR; Spark SQL; Hive, Pig; MapReduce; Impala; Drill; Spark; MSTR Hadoop Gateway; Phoenix/JDBC; HBase, Cassandra; YARN; Hadoop Distributed File System (HDFS)]

Support for Big Data Sources in the Data Lake
Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single Database
Bring All Relevant Data to Decision Makers
[Diagram labels: MapReduce & NoSQL Databases; Columnar Databases; Elastic Map Reduce; BigInsights; Distribution; HDFS; Redshift; Data Warehouse Appliances; HANA; Parallel Data Warehouse; Relational Databases; Multidimensional Databases; Analysis Services; Google Analytics; SaaS-Based App Data; Generic Web Services; User / Departmental Data; SOAP; REST; Generic Web Services with OAuth; Clipboard; Zendesk; ..many more..]
MicroStrategy Dataset

Various Ways to Connect to Hadoop

Generalized Big Data Analytics Workflow
[Diagram: Big Data Analytics Workflow. Sources: Unstructured, Semi-structured, Structured. Outputs: HDFS, RDBMS, Hive, HBase, Cassandra, MongoDB, SOLR, Elastic, Other NoSQL. Models.]
• In general, big data environments are distributed transformation and calculation frameworks that take in sources and generate outputs.
• Sources tend to be unstructured or semi-structured, whereas outputs tend to be semi-structured and structured.
• Hadoop-based workflows may incorporate several components tied together with a dataflow or workflow manager such as Pig, Oozie, or custom programs.
• Spark is able to use Spark APIs and components in the workflow for a more uniform experience.

Ways to Connect and Query Hadoop
#1 Access and Query Hadoop via Hive
[Diagram: MicroStrategy Analytics Platform; Hive ODBC Connector; Hadoop (Any Hadoop Distribution): Hive, HDFS]
• Querying via Hive is the most popular way of querying Hadoop.
• Hive allows users who aren’t familiar with programming to access and analyze big data in a less technical way, using a SQL-like syntax called Hive Query Language (HiveQL). Hive is used for complex, long-running tasks and analyses on large sets of data.
• Hive uses the MapReduce framework to query data from HDFS. It is fault tolerant, but has high latency.

Ways to Connect and Query Hadoop
#2 Access and Query Hadoop via Impala (only with Cloudera distribution)
[Diagram: MicroStrategy Analytics Platform; Hive ODBC Connector; Hadoop (Cloudera Distribution): Impala, HDFS]
• Impala also uses SQL syntax to query Hadoop, but it does NOT use the MapReduce framework. It has its own processing engine to query HDFS.
• Impala is used for analysis when the user wants a query to run and return quickly on a small subset of data, e.g. analyzing company finances for a daily or weekly report. It is not ideal for complex data manipulation, data preparation, etc.
• Hive is a screwdriver and Impala is a drill bit.
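Both options above expose Hadoop data through SQL-like queries over an ODBC connector, so client code can be sketched once against a generic database connection. The following is a minimal sketch under stated assumptions, not the presenter's method: the table `web_logs`, its columns, and the helper names `top_pages` and `demo_connection` are hypothetical, and an in-memory SQLite database stands in for a Hive or Impala connection so the example runs without a cluster. In a real deployment the connection would come from an ODBC driver for Hive or Impala instead.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
# The statement sticks to plain SELECT / GROUP BY / ORDER BY / LIMIT,
# which SQL-like dialects such as HiveQL also accept.
TOP_PAGES_SQL = """
SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page
ORDER BY hits DESC, page ASC
LIMIT 3
"""


def top_pages(conn):
    """Run the aggregate on any DB-API 2.0 connection and return the rows."""
    cur = conn.cursor()
    try:
        cur.execute(TOP_PAGES_SQL)
        return [tuple(row) for row in cur.fetchall()]
    finally:
        cur.close()


def demo_connection():
    """Build an in-memory SQLite database standing in for a Hive/Impala table.

    Assumption: a real setup would open an ODBC connection to the cluster
    here; SQLite is used only so the sketch is self-contained.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE web_logs (user_id TEXT, page TEXT)")
    conn.executemany(
        "INSERT INTO web_logs VALUES (?, ?)",
        [
            ("u1", "/home"), ("u2", "/home"), ("u1", "/cart"),
            ("u3", "/home"), ("u2", "/cart"), ("u3", "/search"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    # Prints [('/home', 3), ('/cart', 2), ('/search', 1)]
    print(top_pages(demo_connection()))
```

Because `top_pages` depends only on the connection object, the same query could in principle be pointed at Hive for a long-running batch job or at Impala for a quick look at a small subset, which is the trade-off the two slides describe.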
Hive and Impala
Hive | Impala
Hive is ‘SQL on Hadoop’ | Impala is ‘SQL on HDFS’
Hive uses MapReduce to process its queries | Impala uses its own processing engine
Hive cannot perform in-memory query operations | Impala can perform in-memory query operations
Hive is fault tolerant. Recommended for ETL type of jobs | Impala is not fault tolerant
Hive has high latency | Impala has faster response times, great for small ad-hoc queries
Use Case: If you have batch processing kind of needs, use Hive | Use Case: If you need real-time, ad-hoc queries over a subset of data, use Impala

Ways to Connect and Query Hadoop
#3 Access and Query Hadoop via Apache Spark
[Diagram: MicroStrategy Analytics Platform; Spark ODBC Connector; Hadoop: Apache Spark, HDFS]
• Apache Spark is one of the fastest evolving open source communities, popular for its MPP capabilities on top of Hadoop and for providing advanced analytical workflows.
• MicroStrategy connects to Spark via the Spark SQL/ODBC connector.

Hive on Spark
• Replaces the Hive SQL Engine with the Spark SQL Engine, starting with Hive 1.2
• Set hive.execution.engine=spark

Key Points for Spark
• Spark doesn’t store information. It uses common Hadoop and NoSQL storage for input and output.
• Internally, Spark uses in-memory stores to pass data from one processing step to another.
• Spark SQL in Spark applications works on the in-memory stores to simplify Spark programming against the Resilient Distributed Datasets (RDD) structures. This is not accessible via ODBC.
• In-memory RDDs are not available after a Spark application completes. Spark SQL ODBC can only access RDDs that have been persisted in Hive.

Ways to Connect and Query Hadoop
#4 Tap into Hadoop Natively with MicroStrategy’s Hadoop Gateway (NEW)
[Diagram: MicroStrategy Analytics Platform; Big Data Engine/Hadoop Gateway; Hadoop: HDFS]
• We launched this connectivity with v10. Big Data Engine (BDE) is a native YARN-based application that enables direct access to HDFS. None of the core BI competitors have this capability.
• YARN (Yet Another Resource Negotiator) is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
• Use Case: Load data from HDFS faster and leverage our in-memory layer for analytics.

Demo

Top Five Reasons to Use MicroStrategy for Your Next Big Data Project

#1: Leverage varied flavors to access Hadoop that fulfill the business needs
• MicroStrategy is the only core BI tool to provide native connectivity to HDFS that enables fast data loads into its in-memory layer.
• MicroStrategy certifies Cloudera Impala, Presto, Google BigQuery, Pivotal HAWQ, and Apache Drill as data sources.
• MicroStrategy optimizes and certifies Hadoop/Hive as a data source.
• MicroStrategy certifies Spark via Spark SQL/ODBC.
• MicroStrategy also provides a connector to execute freeform Pig Latin reports.
[Diagram labels: Hadoop Gateway; SQL on Hadoop/Big Data; Apache Hive; Apache Spark; Apache Pig]

#2: Analyze large volumes of data with a platform that is scalable and performant
Sample Applications
• CRM analysis across a large customer base
• Interactive analysis on high-volume clickstream data
• Merchant analytics for a credit card issuer
• Store manager application for a large national chain
[Diagram labels: Access the Database With High Throughput; Create and Publish the Cube With High Data Scalability; Analyze Data With Fast Response Times]

#3: Cleanse your Hadoop data ‘on the fly’ with MicroStrategy’s Data Wrangler
• Cleanse and refine your data with ‘Data Wrangler’
• Designed for analysts and business users without spending any extra $$
Cleanse your data
• Extract, Split Patterns
• Generate Facets
• Rename, Delete, and many more..
Predictive Interactions
• Get suggestions from our Data Wrangler as you prepare your data!
• Work on a sample and “Apply” to the entire dataset
• Undo/Redo without penalty
[Diagram labels: Data Preparation; Visualization & Analysis]

#4: Perform ‘on the fly’ dynamic searches on semi-structured data
Business Case:
• Analyzing Product Catalog
  • Perform faceted search by dynamically clustering the items.
  • Users can then “drill down” by applying specific constraints to the search results.
  • Manufacturers can get superior feedback.
• Analyzing Diagnostic Notes
  • Needs a search engine’s capabilities to facilitate semi-structured data analysis in the health care domain.
  • Helps health care professionals make informed decisions in situations that are time sensitive.
Steps:
• Configure the file in Apache Solr
• Import the file via data import
• Dynamically search for keywords via Visual Insight

#5: Get federated data on dashboards leveraging MicroStrategy’s integrated architecture
[Diagram labels: Oracle; Personal Data Sources; Relational Databases; Multi-Dimensional Sources; Analysis Services; Map Reduce Databases; Hadoop; Excel]

Customer Success Stories

Self Service Big Data Analytics with over 30 Applications
Key BI Characteristics:
INDUSTRY: Wireless Telecom
BI COMPONENTS: Multiple Applications
DATABASE: Hive and Redshift
HADOOP DISTRIBUTION: Impala
VOLUME OF DATA: 1 TB Daily
APPLICATIONS: 30 Applications
Business Use and Benefits
• The largest telecom company in Korea, SK Planet provides mobile services such as navigation, mobile payment, online shopping, memberships, etc. They have 30+ services, and they have all of their data in Hadoop (basically no relational databases for analysis).
• MicroStrategy provides a centralized environment for analyzing their Hadoop data in a managed self-service manner.
• Self-service data discovery while fully leveraging MicroStrategy’s in-memory technology.
Making Better Sales Decisions by Getting Insights from Web Logs in Hadoop
Key BI Characteristics:
INDUSTRY: Entertainment
BI COMPONENTS: 1 Application; Traditional Reports
USERS: ~200
DATABASE: Hadoop, Teradata
HADOOP DISTRIBUTION: Amazon EMR
VOLUME OF DATA: Petabytes
TYPE OF DATA: Log and Events data
APPLICATIONS: Sales Analysis
Business Use and Benefits
• Sales analysis, generally around a new launch in a new region: quick report analysis to understand the new accounts, number of hours of viewing, etc.
• Directly querying and reporting from MicroStrategy on logs via Hive
• Able to make better sales decisions

Analyzing Online Behavior with Live Queries to Hadoop
Key BI Characteristics:
INDUSTRY: E-commerce
BI COMPONENTS: 1 Application; Reports, Dashboards, VI
USERS: ~200
DATABASE: Hadoop, Oracle
HADOOP DISTRIBUTION: Apache
VOLUME OF DATA: Petabytes
TYPE OF DATA: Web Logs, Online behavior
APPLICATIONS: Sales Analysis
Business Use and Benefits
• Analyzing web logs/online behavior stored in Hadoop. Dashboards and VI analysis run against our in-memory cubes, and ad-hoc reports run live against Hive.
• End users do not need to code with MapReduce.
• Developers are more productive delivering self-service BI through a tool instead of coding a custom user interface.

Smart TV Analytics on User Behavior and App Usage
Key BI Characteristics: Leading Electronics Manufacturing
INDUSTRY: Electronics Manufacturing
BI COMPONENTS: Multiple Applications
DATABASE: Hive and Redshift
HADOOP DISTRIBUTION: S3
VOLUME OF DATA: 20-30 TB Daily
APPLICATIONS: Smart TV Analytics, Mobile App Analytics
Business Use and Benefits
• This electronics company is collecting data from all of their smart TVs around the world to analyze user behavior and app usage. It includes remote controller click stream and log analysis for more than 30 smart TV services.
• The data is actually not stored in HDFS but in S3.
Fully leveraging the elastic computing capability of AWS for MapReduce, with the CPU-time-based costing of EC2, saves a large share of the infrastructure cost.
• Using MicroStrategy to build applications to analyze click streams and logs.

Precise Targeting of Digital Ads
Key BI Characteristics: Multi-Channel Digital Distribution Provider
INDUSTRY: Electronics and Media
BI COMPONENTS: 1 Application; Reports, VI, Dashboards
DATABASE: Hadoop, Hive
HADOOP DISTRIBUTION: Cloudera Impala
VOLUME OF DATA: Over 1 Billion traffic attribute combinations
APPLICATIONS: Traffic Attribute Multiplier
Business Use and Benefits
• The Traffic Attribute Multiplier application is helping Adconion precisely target their digital ads, shorten the time to prepare and tune models, and improve ad delivery ROI for their customers.
• Leveraging MicroStrategy’s integration with Impala and the rich visualizations library makes the results easy for business users to consume.
• Achieved a 2.4% improvement in ad budget spending efficiency.

Summary
• Data Lakes are Powerful: MicroStrategy is ready to support your hybrid architecture
• Various Ways to Access Big Data: MicroStrategy offers it all and is committed to be at par
• A Robust Platform to Analyze Big Data: Highly Scalable In-Memory engine
• Get Federated Data: Effectively Navigate Data Across Multiple Data Sources with MicroStrategy multi-source and data blending
• End User Data Preparation: Perform ‘on the fly’ cleansing on Hadoop data

Questions?
Q&A