Big Data Analytics

Transcription

Big Data Analytics
Big Data Analytics
Big Data Analytics
Prof. Dr. Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
[based on original slides by Lucas Rego Drumond, ISMLL 2014]
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
1 / 25
Big Data Analytics
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
1 / 25
Big Data Analytics
1. What is Big Data?
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
1 / 25
Big Data Analytics
1. What is Big Data?
What is Big Data?
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
1 / 25
Big Data Analytics
1. What is Big Data?
What is Big Data?
Some definitions:
I
“A collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or
traditional data processing applications.”
http://en.wikipedia.org/wiki/Big data
I
“Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making.”
www.gartner.com/it-glossary/big-data/
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
2 / 25
Big Data Analytics
1. What is Big Data?
Big Data — Dimensions (the “4 Vs”)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
3 / 25
Big Data Analytics
1. What is Big Data?
What is Big Data?
Big Data is about:
I
Storing and accessing large amounts of (unstructured/complex) data
I
querying
I
Processing high volume data streams
I
Decision support based on large data
I
I
I
I
visualization, reporting
navigation / query interfaces
contextualization / sense making
Building predictive models trained on large data
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
4 / 25
Big Data Analytics
2. Where to Find Big Data?
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
5 / 25
Big Data Analytics
2. Where to Find Big Data?
Big Data in Physics (CERN)
I
Large Hadron Collider has collected data from over 300 trillion
proton-proton collisions
I
Approx. 25 Petabytes per year
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
5 / 25
Big Data Analytics
2. Where to Find Big Data?
Big Data in Genetics (Ensembl)
I
I
Ensembl database contains the genome of humans and 50 other
species
“only” 250 GB
source: http://www.ensembl.org/
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
6 / 25
Big Data Analytics
2. Where to Find Big Data?
Big Data in the Web (Google)
I
3.3 billion searches per day (on average)
I
30 trillion unique URLs identified on the Web
I
20 billion sites crawled a day
I
In 2008 Google processed more than 20 Petabytes of data per day
Source:
– http://searchengineland.com/google-search-press-129925
– Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM
51, 1 (January 2008), 107-113.
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
7 / 25
Big Data Analytics
2. Where to Find Big Data?
Big Data in Social Media (Facebook)
I
I
I
I
1.28 billion users (1.23 billion monthly active in January 2014)
Size of user data stored by Facebook: 300 Petabytes
Average amount of data that Facebook takes in daily: 600 Terabytes
Size of Facebook’s Graph Search database: 700 Terabytes
Source: http://allfacebook.com/orcfile b130817
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
8 / 25
Big Data Analytics
2. Where to Find Big Data?
Big Data in Social Media (Twitter)
I
I
I
Average number of tweets per day: 58 million
Number of Twitter search engine queries every day: 2.1 billion
Total number of active registered Twitter users: 645,750,000
Source: http://www.statisticbrain.com/twitter-statistics/
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
9 / 25
Big Data Analytics
2. Where to Find Big Data?
Big Data — Public Datasets
1000 Genomes Project
Common Crawl Corpus
Wikipedia / Freebase
Million Song Dataset
OpenStreetMap
2000 US Census
PubChem library
NCDC weather data
Open Library
Twitter
DNA of 1700 humans
5G web pages
1.9G subject/predicate/object triples
audio features of 1M songs
a map of earth
US census data
biological activities of small molecules
daily measurements from 9000 stations
metadata of 20M books
1.6G tweets
200
81
250
280
90
200
230
20
7
0.6
TB
TB
GB
GB
GB
GB
GB
GB
GB
GB
CD 700 MB, DVD 4,7–17 GB, Blu-ray 25–100 GB, hard disc: 3–4 TB.
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
10 / 25
Big Data Analytics
2. Where to Find Big Data?
How Large is 1 Petabyte (PB)?
I
35.7M years counted one byte per second,
I
254 years listening to music stored in CD quality (500MB/h)
25 years watching DVDs
I
I
I
223.000 DVDs (a 4.7 GB)
but there are only 74.000 TV movies on IMDB !
(1.8M including TV episodes)
I
can be stored on 341 harddisks à 3 TB/90 e (30,000 e)
I
96 days to read from standard harddisks sequentially (1030 MBits/s)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
11 / 25
Big Data Analytics
3. Big Data Applications
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
12 / 25
Big Data Analytics
3. Big Data Applications
What to do with Big Data?
Application examples:
I
Test complex scientific hypotheses (physics, genetics)
I
Index the web, including relevance feedback by users (web)
I
Online personalized advertising (social media, esp. Google)
I
Recommender systems (e-commerce, esp. Amazon)
I
Media analysis, sentiment analysis, market research (social media)
I
e.g., Obama campaign 2012
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
12 / 25
Big Data Analytics
3. Big Data Applications
More Applications & Case Studies
I
T-Mobile USA:
integrated Big Data across multiple IT systems to combine customer
transaction and interactions data in order to better predict customer
defections
I
I
US Xpress:
collects data elements ranging from fuel usage to tire condition to
truck engine operations to GPS information
I
I
By leveraging social media data along with transaction data from CRM
and billing systems, customer defections is said to have been cut in half
in a single quarter.
Optimal fleet management
McLaren’s Formula One racing team:
real-time car sensor data during car races
I
Real time identification of issues with its racing cars
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
13 / 25
Big Data Analytics
4. How to analyze Big Data?
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
14 / 25
Big Data Analytics
4. How to analyze Big Data?
How to handle Big Data? — The BI Approach
Data
Warehouse
I
Static databases (snapshots)
I
Structured data
I
Centralized approaches
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
14 / 25
Big Data Analytics
4. How to analyze Big Data?
How to handle Big Data? — The Distributed Approach
I
Massive Parallelism
I
Heterogeneous data
sources
I
Unstructured data
I
Data streams
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
15 / 25
Big Data Analytics
4. How to analyze Big Data?
Challenges
Challenges to deal with large volumes of data:
I
Store and query large amounts of data in a distributed environment
efficiently
I
I
I
Process distributed large data efficiently
I
I
distributed file systems
nosql databases, distributed databases
execution environments
Scale / distribute machine learning techniques
I
I
distributed learning algorithms (message passing)
ML execution environments
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
16 / 25
Big Data Analytics
4. How to analyze Big Data?
Execution Environments: MapReduce
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
17 / 25
Big Data Analytics
4. How to analyze Big Data?
Execution Environments: GraphLab
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
18 / 25
Big Data Analytics
5. Big Data at ISMLL, University of Hildesheim
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
19 / 25
Big Data Analytics
5. Big Data at ISMLL, University of Hildesheim
Usage of Big Data in Cooperative Research Projects
I
transportation
I
I
e-commerce / recommender systems
I
I
I
NetFlix dataset: 100 million transactions
Rossmann Online / Compra
technology enhanced learning
I
I
REDUCTION: Danish Taxi fleet
(14.000 vehicles, 1.5 million trips, 2.2 billion GPS measurements)
Whizz Education
(1200 exercises, 250,000 students, 30 million interactions)
engineering data mining
I
I
Rolls Royce: jet engine vibration
Detectino: ground penetrating radar
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
19 / 25
Big Data Analytics
5. Big Data at ISMLL, University of Hildesheim
ISMLL Compute Cluster
I
61 compute nodes
I
840 cores, 1288 threads
I
2.2 TB RAM
I
183 TB hard disk capacity
I
10 TFlops
special nodes:
I
I
I
I
database server
coprocessor compute node
(240 cores, 960 threads,
4 TFlops)
software:
I
I
simple scheduler
(sun grid engine)
Map–Reduce (hadoop)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
20 / 25
Big Data Analytics
5. Big Data at ISMLL, University of Hildesheim
Data Analytics — A New Study Programme in 2015
term
1
2
3
4
Total
module
Advanced Machine Learning
Modern Optimization Techniques
Programming Machine Learning
Seminar Data Analytics I
application module I
Advanced Database Technologies
Data and Privacy Protection
methodological specialization I
Distributed Machine Learning
Seminar Data Analytics II
Project (part I)
Planning and Optimal Control
methodological specialization II
Project (part II)
Seminar Data Analytics III
application module II
Master thesis and colloquium
type
lecture
lecture
lab
seminar
misc.
lecture
lecture
lecture
lab
seminar
project
lecture
lecture
project
seminar
misc.
thesis
ECTS
6
6
6
4
6
6
3
6
6
4
6
6
6
9
4
6
30
120
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
21 / 25
Big Data Analytics
5. Big Data at ISMLL, University of Hildesheim
Data Analytics — A New Study Programme in 2015
I
solid education on all fundamental aspects of data analytics at the
state-of-the-art
I
I
I
I
several methodological specializations:
I
I
I
I
Bayesian Networks, Computer Vision, etc.
integrated application area
I
I
machine learning
analytical database technology & execution models
planning and control
media systems, software engineering, environmental sciences, computer
linguistics, information sciences, psychology (several still requested)
hands-on lab courses
a deep and fun, two term integrated group project
fully internationally targeted (completely in English)
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
22 / 25
Big Data Analytics
6. Conclusions
Outline
1. What is Big Data?
2. Where to Find Big Data?
3. Big Data Applications
4. How to analyze Big Data?
5. Big Data at ISMLL, University of Hildesheim
6. Conclusions
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
23 / 25
Big Data Analytics
6. Conclusions
Big Data — Chances . . .
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
23 / 25
Big Data Analytics
6. Conclusions
Big Data — . . . and Risks
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
24 / 25
Big Data Analytics
6. Conclusions
Conclusions
I
Big Data Analytics addresses the analysis of large and complex data
(volume, variety, velocity, veracity)
I
at Petabyte scale (1024 Terabytes)
I
Big Data emerges naturally in many different domains
(science, web, e-commerce, social media, robotics)
I
Big Data requires massively parallel infrastructure to be analyzed
timely
I
I
I
I
data centers with hundreds and thousands of compute nodes
distributed databases, nosql databases
execution environments (MapReduce)
To exploit big data in a principled and optimal way, machine learning
methods have to be scaled and distributed,
I
I
requiring innovations at the level of the learning algorithms,
also requiring special ML execution environments.
Prof. Dr. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
33. Sitzung des Arbeitskreises Informationstechnologie, Hildesheim
25 / 25

Similar documents