USER-CENTRIC DATA MANAGEMENT IN THE ERA OF BIG DATA
Alexandros Labrinidis
Advanced Data Management Technologies Lab
Department of Computer Science
University of Pittsburgh
http://labrinidis.cs.pitt.edu
© 2014 Alexandros Labrinidis, University of Pittsburgh
October 8, 2014
Data-Intensive Science
[ Diagram: data-intensive science; branches labeled Observational, Simulation, One (virtual) instrument, Multiple instruments ]
✤ SDSS: Sloan Digital Sky Survey (2000 - ), 200 GB/night
✤ LSST: Large Synoptic Survey Telescope (2015 - ), 30 TB/night -- 1.28PB/year
✤ LHC: Large Hadron Collider, 15 PB/year
✤ SKA: Square Kilometer Array (2019 - ), 10 PB/hour
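To compare the rates quoted above on one scale, the following minimal Python sketch converts each of them to terabytes per day. It assumes decimal units (1 PB = 1000 TB), one observing night per day, and a 365-day year; none of these conventions are stated on the slide, and the function and dictionary names are made up for illustration.

```python
# Conversion factors to terabytes (decimal units; an assumption, not from the slide).
TB_PER_UNIT = {"GB": 1e-3, "TB": 1.0, "PB": 1e3}

# How many of each period fit in one day
# (assumes one observing night per day and a 365-day year).
PERIODS_PER_DAY = {"night": 1.0, "hour": 24.0, "year": 1.0 / 365.0}


def tb_per_day(amount, unit, period):
    """Convert a rate such as (200, "GB", "night") to terabytes per day."""
    return amount * TB_PER_UNIT[unit] * PERIODS_PER_DAY[period]


# Rates as quoted on the slide.
RATES = {
    "SDSS": (200, "GB", "night"),
    "LSST": (30, "TB", "night"),
    "LHC": (15, "PB", "year"),
    "SKA": (10, "PB", "hour"),
}

if __name__ == "__main__":
    for name, (amount, unit, period) in RATES.items():
        print(f"{name}: {tb_per_day(amount, unit, period):,.1f} TB/day")
```

Under these assumptions, the quoted SKA rate works out to about 240,000 TB/day, roughly a million times the 0.2 TB/day quoted for SDSS.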
Data-Intensive Science
[ Diagram: data-intensive science; branches labeled Observational, Simulation, One (virtual) instrument, Multiple instruments ]
✤ Gene Sequencing
✤ Personalized Medicine
Data-Intensive Science
[ Diagram: data-intensive science; branches labeled Observational, Simulation, One (virtual) instrument, Multiple instruments ]
✤ Climate Modeling
✤ Turbulent Combustion Flow
What’s the Big Deal with Big Data?
•  Featured on the cover of Nature and the Economist!
What’s the Big Deal with Big Data?
•  And even has a Dilbert Cartoon!
Big Data Definition - The three Vs
•  Volume - size does matter!
•  Velocity - data at speed, i.e., the data “fire-hose”
•  Variety - heterogeneity is the rule
Five more Vs
•  Variability - rapid change of data characteristics over time
•  Veracity - ability to handle uncertainty, inconsistency, etc.
•  Visibility - protect privacy and provide security
•  Value - usefulness & ability to find the right hay-colored needle in the haystack
•  Voracity - strong appetite for data!
Enter Moore’s Law
Moore's law is the observation that, over the
history of computing hardware, the number of
transistors in a dense integrated circuit doubles
approximately every two years. The law is
named after Gordon E. Moore, co-founder of
Intel Corporation, who described the trend in his
1965 paper.
Source: http://en.wikipedia.org/wiki/Moore's_law
[ Wikipedia Image ]
Enter Bezos’ Law
Bezos' law is the observation that, over
the history of cloud computing, the price of a
unit of computing power drops by 50%
approximately every 3 years.
Source: http://blog.appzero.com/blog/futureofcloud
Photo: http://www.slashgear.com/google-data-center-hd-photos-hit-where-the-internet-lives-gallery-17252451/
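Both laws quoted above describe exponential trends, so their cumulative effect over a number of years is easy to compute. The following is a minimal Python sketch that treats each law as a smooth exponential (a simplification, since both are empirical observations); the function names and default periods are illustrative and simply restate the two-year doubling and three-year halving quoted on the slides.

```python
def moore_transistor_factor(years, doubling_period=2.0):
    """Growth factor in transistor count after `years`,
    assuming a doubling every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


def bezos_price_factor(years, halving_period=3.0):
    """Fraction of the original unit price remaining after `years`,
    assuming a 50% reduction every `halving_period` years."""
    return 0.5 ** (years / halving_period)


if __name__ == "__main__":
    for years in (2, 6, 12):
        growth = moore_transistor_factor(years)
        price = bezos_price_factor(years)
        print(f"after {years} years: {growth:g}x transistors, "
              f"{price:.4f} of the original price")
```

For example, over 12 years this model gives a 64-fold increase in transistor count and a unit price of 1/16 of the original.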
Storage capacity increase
[ Chart: HDD Capacity (GB), vertical axis from 0 to 7000 ]
Insert other exponentially increasing graphs here
(e.g., data generation rates, world-wide smartphone access rates,
Internet of Things, …)
[ Wikipedia Data ]
But
•  Human processing capacity remains roughly the same!
We refer to this as the “Big Data – Same Humans” problem.
About the ADMT Lab
•  Directed by:
    •  Panos K. Chrysanthis
    •  Alexandros Labrinidis
•  Established in 1995
•  Currently: 5 PhD students
•  Our “slogan”:
User-centric data management for network-centric applications
Look at the entire data lifecycle
AQSIOS - A DSMS Architecture
AQSIOS is the DSMS prototype developed at our ADMT Lab. It is built on top of the
STREAM prototype from Stanford.
[ Diagram: AQSIOS architecture, with the following labeled components ]
•  Data stream sources
•  Stream applications
•  Continuous queries
•  Query optimizer
•  Query networks: Q1, Q2, Q3
•  Scheduler
•  Load manager
•  Statistics collector
•  Administrator: sets the delay targets and priorities for queries
DILoS evaluation – QoS and QoD
                            Average response time (ms)         Average data loss (%)
                           Class 1   Class 2   Class 3   Class 1   Class 2   Class 3
No load manager               3.40      3.53  56541.69         0         0         0
Common load manager           3.00      3.13    517.07     11.42     11.43     11.60
Per-class load manager        3.55      3.75    492.84         0         0     35.95
DILoS                         4.28      4.38     42.95         0         0         0
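One way to read the results above is as ratios against the no-load-manager baseline. The short Python sketch below copies the Class 3 numbers from the slide and computes how many times lower the average response time is under each scheme; the dictionary and function names are illustrative and not part of AQSIOS or DILoS.

```python
# Class 3 results copied from the slide:
# (average response time in ms, average data loss in %).
CLASS3 = {
    "No load manager": (56541.69, 0.0),
    "Common load manager": (517.07, 11.60),
    "Per-class load manager": (492.84, 35.95),
    "DILoS": (42.95, 0.0),
}


def speedup_vs_baseline(results, baseline="No load manager"):
    """Return the baseline response time divided by each scheme's response time."""
    base_time = results[baseline][0]
    return {name: base_time / time for name, (time, _loss) in results.items()}


if __name__ == "__main__":
    for name, factor in speedup_vs_baseline(CLASS3).items():
        loss = CLASS3[name][1]
        print(f"{name}: {factor:,.1f}x lower response time, {loss}% data loss")
```

With these numbers, DILoS lowers the Class 3 response time by roughly three orders of magnitude with no data loss, while Classes 1 and 2 see slightly higher response times (4.28 ms and 4.38 ms versus 3.40 ms and 3.53 ms without a load manager).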
Style of research
•  Emphasis on systems and algorithms
•  Building real systems
    •  Often based on academic prototypes (e.g., STREAM from Stanford) or on top of well-known open-source software (e.g., Storm)
•  Experimenting using real systems and simulation
•  Comparing alternatives
    •  Should we group queries in way A or way B?
    •  If we do 4 different optimizations, what is the relative benefit of each one?
    •  In which cases would a certain algorithm be better than another?
Types of projects for undergrads
•  Upcoming:
    •  web-based user interface to visualize run-time behavior of a real system
•  Past:
    •  clustering of tweets
    •  web-based interfaces to different database back-ends
    •  REST APIs for remote data access
    •  application to coordinate supernovae observations
    •  monitoring application for transient astronomical events
More info
•  [1] The Beckman Report on Database Research, by Abadi et al., October 2013
   http://beckman.cs.wisc.edu
•  [2] Big Data and Its Technical Challenges, by Jagadish, Gehrke, Labrinidis, Papakonstantinou, Patel, Ramakrishnan, and Shahabi, Communications of the ACM, July 2014
   http://bit.ly/bigdatachallenges (over 4,500 downloads)
•  [3] Contact me: http://labrinidis.cs.pitt.edu/contact