Analytics Drives Big Data Drives Infrastructure
Confessions of Storage turned Analytics Geeks
Dr. Aloke Guha
29th IEEE Conference on Massive Data Storage
May 8th, 2013
[email protected]

What’s Common Between a Sensor that could Distinguish a fine Cognac, and Predicting Movies You’d Like on Netflix?

Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

The Sommelier “Robot”

Predicting What Movies You’d Watch

(Analytics, BigData, DataStore)+

Many Analytics Techniques . . .
[Diagram labels: AI (McCarthy) 1956; SNARC (Minsky) 1951; Dendral (Feigenbaum) 1965; Neural Networks; Expert Systems; SVM; LDA . . .; Vapnik (1992); Linear; Machine Learning; Naïve Bayes; Decision Trees; Regression; Random Forests; Time-Series; Statistics; Random Forests; Genetic Algorithms; K-nearest neighbor ...; Fraser and Burnell (1970); R; Ihaka and Gentleman (1993)]

Common Analytics Processing pre-2000
• Sources: Local
• Data: Numeric, Homogeneous
• Processing: Local
• Consumer: Local
• Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . .
Flavor Predictor – Neural Networks
USPTO #5,373,452 (1994)
1988

Pattern Recognition – Genetic Algorithms
USPTO #5,140,530 (1992)

Small to Big
http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news

Typical Analytics: 2000-2006
• Sources: Global, Social Networks
• Data: Heterogeneous, Numeric, Text
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.

2007- : Internet Data Analytics

Financial Risk Scoring: Detect
Risk Scoring: detect incremental change in # occurrences where corporate officers mention “risk” (or equivalent terms) during earnings call

Financial Risk Scoring: Listen
*Risk Scoring: detect incremental change in occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call

Banking: Credit Worthiness – remember 2008?
Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks

Share of Voice: Online Buzz

Sentiment Analysis

Analytics Processing: 2007-
• Sources: Global, Mobile, New Social (Instagram, . . )
• Data: Multi-Dimensional, Heterogeneous, Audio/Video
• Processing: Hosted/Scale
• Consumer: Global
• Analytics: Batch, Streaming, . . .

2008 - : Real-Time/Streaming Analytics

Brand Marketing

Brand Management

Customer Support

Customer Support

Lead Generation
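The Financial Risk Scoring slides describe detecting an incremental change in how often corporate officers mention “risk” (or equivalent terms) across earnings calls. A minimal Python sketch of that idea is given here, assuming plain-text transcripts supplied in chronological order. The term list, the function names, and the sample sentences are hypothetical, and simple keyword matching stands in for whatever semantic matching the original system used.

```python
import re
from collections import Counter

# Hypothetical term list: a stand-in for the "semantically equivalent terms"
# mentioned on the slide; a real system would likely use an NLP model instead.
RISK_TERMS = {"risk", "risks", "uncertainty", "exposure", "headwind", "headwinds"}


def count_risk_mentions(transcript: str) -> int:
    """Count tokens in one earnings-call transcript that match RISK_TERMS."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    counts = Counter(tokens)
    return sum(counts[term] for term in RISK_TERMS)


def incremental_change(transcripts: list[str]) -> list[int]:
    """Change in risk-mention count from each call to the next one."""
    counts = [count_risk_mentions(t) for t in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Invented sample text, in chronological order, for illustration only.
calls = [
    "We see limited risk this quarter.",
    "Currency risk and regulatory uncertainty are rising; risk is elevated.",
]
print(incremental_change(calls))
```

A positive value means the later call mentioned risk terms more often than the one before it; how large a change should count as a signal is left open here, since the slides do not give a threshold.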
. . . More Data, Faster
http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA

“Internet of Things”
Machine-to-Machine
Message Queuing Telemetry Transport
http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-form2m-technology-to-drive-connected-smarter-cities/

AumniData: Batch Processing
[Architecture diagram labels: Dashboard Configuration; Twitter; YouTube; TomCat; Blog/Web Site; RSS/ATOM Feed; Custom Analytics Display; Dashboard Application; Ad-Hoc Query; Summary (3rd party App); Content / Metadata Index (MySQL); Dashboard Store (SQL Server); Requestor/URL Scanner; Data Collector (Batch Scheduled); Content Store; NLP+ Cruxly Intent Detection (AWS); NLP Stack + AumniData Analytics* Classifier (RackSpace VM)]

Cruxly: Stream Processing
[Architecture diagram labels: Twitter; Reports / Dashboard; Tracker Editor (web app - Heroku); Request (Keywords); Tweets (Keywords); Streaming API Client (Heroku Worker) (24x7); Tweets; Content Store (DynamoDB); Tweet ID + Intent Signal (Heroku PostgreSQL); NLP (NER, etc.) + Cruxly Intent Detection (AWS)]

Data Analytics Demands . . .
[Diagram labels: Dashboards; View; Chart; Report; Query/RT Query; Ad Hoc/Search/SQL; Analyze; Custom Analytics; Process; NLP; Classify; Index; Store; Machine Learning Library; Stats Library; R; Data Collector; Text / Sensor Data / Stream . . .; Storm; Yarn]

Storage Implications: Back to the Future
IOPS – Stream
MB/s – Batch
Both?

Storage Implications: Back to the Future II, III
[Diagram labels: MapReduce; Master; Slave #1 through Slave #N; Task tracker; Mgmt Node; Zookeeper; Hive; Job Tracker; Pig; HDFS; Oozie; Name Node; HUE; Data Node; HDFS client]
Storage Capacity Scaling?
Import/Export Data?
Storage Tiering?

A More General Data Analytics Framework?
[Diagram labels: Data Ingesters; Processing; Stream and Batch; Metadata / In-Mem Store; SAN; Local Storage / Flash / DAS; Visualization Library / Interactive Query; Analytics Processing; Data Ingesters (Smart); Map Reduce / Distributed Data Store; Data Ingesters (Basic); Sensor Processing: Data Integration; Content Store]

Conclusion
[Diagram labels: Data Analytics; Big Data; Scale-Out Infrastructure; Variety; Volume; Bandwidth Support; Velocity; Streaming Support]
• We Solved the Processing Problem
• We Need to Solve the Larger Storage Problem

Grateful Acknowledgements
• Kapil Tundwal
• Venky Madireddy
• Dr. Kirill Kireyev
• Dr. Shumin Wu
• Dr. Andrew Lampert
• Joan Wrabetz