Big Data Analytics
Transcription
Big Data Analytics
Big Data Analytics Big Data Analytics Prof. Dr. Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim [based on original slides by Lucas Rego Drumond, ISMLL 2014] Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25 Big Data Analytics Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25 Big Data Analytics 1. What is Big Data? Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25 Big Data Analytics 1. What is Big Data? What is Big Data? Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 1 / 25 Big Data Analytics 1. What is Big Data? What is Big Data? Some definitions: I “A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” http://en.wikipedia.org/wiki/Big data I “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” www.gartner.com/it-glossary/big-data/ Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 2 / 25 Big Data Analytics 1. What is Big Data? Big Data — Dimensions (the “4 Vs”) Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 3 / 25 Big Data Analytics 1. What is Big Data? What is Big Data? Big Data is about: I Storing and accessing large amounts of (unstructured/complex) data I querying I Processing high volume data streams I Decision support based on large data I I I I visualization, reporting navigation / query interfaces contextualization / sense making Building predictive models trained on large data Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 4 / 25 Big Data Analytics 2. Where to Find Big Data? Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 5 / 25 Big Data Analytics 2. Where to Find Big Data? Big Data in Physics (CERN) I Large Hadron Collider has collected data from over 300 trillion proton-proton collisions I Approx. 25 Petabytes per year Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 5 / 25 Big Data Analytics 2. Where to Find Big Data? Big Data in Genetics (Ensembl) I I Ensembl database contains the genome of humans and 50 other species “only” 250 GB source: http://www.ensembl.org/ Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 6 / 25 Big Data Analytics 2. Where to Find Big Data? Big Data in the Web (Google) I 3.3 billion searches per day (on average) I 30 trillion unique URLs identified on the Web I 20 billion sites crawled a day I In 2008 Google processed more than 20 Petabytes of data per day Source: – http://searchengineland.com/google-search-press-129925 – Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113. Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 7 / 25 Big Data Analytics 2. Where to Find Big Data? Big Data in Social Media (Facebook) I I I I 1.28 billion users (1.23 billion monthly active in January 2014) Size of user data stored by Facebook: 300 Petabytes Average amount of data that Facebook takes in daily: 600 Terabytes Size of Facebook’s Graph Search database: 700 Terabytes Source: http://allfacebook.com/orcfile b130817 Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 8 / 25 Big Data Analytics 2. Where to Find Big Data? Big Data in Social Media (Twitter) I I I Average number of tweets per day: 58 million Number of Twitter search engine queries every day: 2.1 billion Total number of active registered Twitter users: 645,750,000 Source: http://www.statisticbrain.com/twitter-statistics/ Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 9 / 25 Big Data Analytics 2. Where to Find Big Data? Big Data — Public Datasets 1000 Genomes Project Common Crawl Corpus Wikipedia / Freebase Million Song Dataset OpenStreetMap 2000 US Census PubChem library NCDC weather data Open Library Twitter DNA of 1700 humans 5G web pages 1.9G subject/predicate/object triples audio features of 1M songs a map of earth US census data biological activities of small molecules daily measurements from 9000 stations metadata of 20M books 1.6G tweets 200 81 250 280 90 200 230 20 7 0.6 TB TB GB GB GB GB GB GB GB GB CD 700 MB, DVD 4,7–17 GB, Blu-ray 25–100 GB, hard disc: 3–4 TB. Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 10 / 25 Big Data Analytics 2. Where to Find Big Data? How Large is 1 Petabyte (PB)? I 35.7M years counted one byte per second, I 254 years listening to music stored in CD quality (500MB/h) 25 years watching DVDs I I I 223.000 DVDs (a 4.7 GB) but there are only 74.000 TV movies on IMDB ! (1.8M including TV episodes) I can be stored on 341 harddisks à 3 TB/90 e (30,000 e) I 96 days to read from standard harddisks sequentially (1030 MBits/s) Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 11 / 25 Big Data Analytics 3. Big Data Applications Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 12 / 25 Big Data Analytics 3. Big Data Applications What to do with Big Data? Application examples: I Test complex scientific hypotheses (physics, genetics) I Index the web, including relevance feedback by users (web) I Online personalized advertising (social media, esp. Google) I Recommender systems (e-commerce, esp. Amazon) I Media analysis, sentiment analysis, market research (social media) I e.g., Obama campaign 2012 Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 12 / 25 Big Data Analytics 3. Big Data Applications More Applications & Case Studies I T-Mobile USA: integrated Big Data across multiple IT systems to combine customer transaction and interactions data in order to better predict customer defections I I US Xpress: collects data elements ranging from fuel usage to tire condition to truck engine operations to GPS information I I By leveraging social media data along with transaction data from CRM and billing systems, customer defections is said to have been cut in half in a single quarter. Optimal fleet management McLaren’s Formula One racing team: real-time car sensor data during car races I Real time identification of issues with its racing cars Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 13 / 25 Big Data Analytics 4. How to analyze Big Data? Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 14 / 25 Big Data Analytics 4. How to analyze Big Data? How to handle Big Data? — The BI Approach Data Warehouse I Static databases (snapshots) I Structured data I Centralized approaches Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 14 / 25 Big Data Analytics 4. How to analyze Big Data? How to handle Big Data? — The Distributed Approach I Massive Parallelism I Heterogeneous data sources I Unstructured data I Data streams Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 15 / 25 Big Data Analytics 4. How to analyze Big Data? Challenges Challenges to deal with large volumes of data: I Store and query large amounts of data in a distributed environment efficiently I I I Process distributed large data efficiently I I distributed file systems nosql databases, distributed databases execution environments Scale / distribute machine learning techniques I I distributed learning algorithms (message passing) ML execution environments Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 16 / 25 Big Data Analytics 4. How to analyze Big Data? Execution Environments: MapReduce Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 17 / 25 Big Data Analytics 4. How to analyze Big Data? Execution Environments: GraphLab Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 18 / 25 Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 19 / 25 Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim Usage of Big Data in Cooperative Research Projects I transportation I I e-commerce / recommender systems I I I NetFlix dataset: 100 million transactions Rossmann Online / Compra technology enhanced learning I I REDUCTION: Danish Taxi fleet (14.000 vehicles, 1.5 million trips, 2.2 billion GPS measurements) Whizz Education (1200 exercises, 250,000 students, 30 million interactions) engineering data mining I I Rolls Royce: jet engine vibration Detectino: ground penetrating radar Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 19 / 25 Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim ISMLL Compute Cluster I 61 compute nodes I 840 cores, 1288 threads I 2.2 TB RAM I 183 TB hard disk capacity I 10 TFlops special nodes: I I I I database server coprocessor compute node (240 cores, 960 threads, 4 TFlops) software: I I simple scheduler (sun grid engine) Map–Reduce (hadoop) Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 20 / 25 Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim Data Analytics — A New Study Programme in 2015 term 1 2 3 4 Total module Advanced Machine Learning Modern Optimization Techniques Programming Machine Learning Seminar Data Analytics I application module I Advanced Database Technologies Data and Privacy Protection methodological specialization I Distributed Machine Learning Seminar Data Analytics II Project (part I) Planning and Optimal Control methodological specialization II Project (part II) Seminar Data Analytics III application module II Master thesis and colloquium type lecture lecture lab seminar misc. lecture lecture lecture lab seminar project lecture lecture project seminar misc. thesis ECTS 6 6 6 4 6 6 3 6 6 4 6 6 6 9 4 6 30 120 Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 21 / 25 Big Data Analytics 5. Big Data at ISMLL, University of Hildesheim Data Analytics — A New Study Programme in 2015 I solid education on all fundamental aspects of data analytics at the state-of-the-art I I I I several methodological specializations: I I I I Bayesian Networks, Computer Vision, etc. integrated application area I I machine learning analytical database technology & execution models planning and control media systems, software engineering, environmental sciences, computer linguistics, information sciences, psychology (several still requested) hands-on lab courses a deep and fun, two term integrated group project fully internationally targeted (completely in English) Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 22 / 25 Big Data Analytics 6. Conclusions Outline 1. What is Big Data? 2. Where to Find Big Data? 3. Big Data Applications 4. How to analyze Big Data? 5. Big Data at ISMLL, University of Hildesheim 6. Conclusions Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 23 / 25 Big Data Analytics 6. Conclusions Big Data — Chances . . . Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 23 / 25 Big Data Analytics 6. Conclusions Big Data — . . . and Risks Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 24 / 25 Big Data Analytics 6. Conclusions Conclusions I Big Data Analytics addresses the analysis of large and complex data (volume, variety, velocity, veracity) I at Petabyte scale (1024 Terabytes) I Big Data emerges naturally in many different domains (science, web, e-commerce, social media, robotics) I Big Data requires massively parallel infrastructure to be analyzed timely I I I I data centers with hundreds and thousands of compute nodes distributed databases, nosql databases execution environments (MapReduce) To exploit big data in a principled and optimal way, machine learning methods have to be scaled and distributed, I I requiring innovations at the level of the learning algorithms, also requiring special ML execution environments. Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim 25 / 25