What are we doing???
Faculty of Computer Science – Institute for System Architecture, Database Technology Group
Talk, Faculty Colloquium – DB Group
Wolfgang Lehner
November 17, 2008
… famous Past
… the golden present …
Bright Future … ???
[Chart: data volume stored in DBMS over time]
Bright Future … ???
Java
SQL:1999
Object orientation
Multimedia
Information Retrieval
Internet
XML
Structures within tables
Structures across tables
Content management
Data Warehouse
OLAP / Data Mining
Bright Future … ???
Search!!!
Information Retrieval
Bright Future … ???
• Where do I find my information, and how do I get at the information I am looking for?
• How far can I trust my data, and how can I control my data flow in a targeted way?
© 2005 IBM Corporation
Bright Future … ???
B. Lindsay
Ontology
XQuery
XML
Model Management
Events
Data Analytics
Data Integration
Web 2.0
Info 2.0
Knowledge
Discovery
Sampling
Data Quality
Personal
Information
Management
DWH-Models
WebServices
Data Streams
P2P-Systems
Semantic Web
Query Optimization
Self-Tuning
[Research map: Data(base) Analytics]
Math & Models – System Architecture – DB Infrastructures
Open Service Process Platform
Approximate Query Processing (Sampling)
XML Stream Optimization
Theseus
Data Process Orchestration Research
Data Stream Quality Research
Data Mining Process Control
GCIP – Generation of Integration Processes
QStream (trading time for space)
Data Mining Support for SAP BIA Netweaver
Planning Support for SAP BIA Netweaver
Large-Scale Reporting Landscape
Model-aware Physical DB Design
Application Level (external)
• Clustering
  – Find similar groups
  – Often superlinear in input size
• Procedure
  – Run k-means
  – Estimate mean and variance
  – 99% confidence interval under a normal distribution
• Run on a sample
  – 5%
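The procedure above — k-means on a small uniform sample, then a 99% confidence interval for each cluster mean under a normality assumption — can be sketched as follows. This is a minimal illustration only: the function name, the 1-D data, and the quantile-based initialization are assumptions, not the project's actual implementation.

```python
import numpy as np

def kmeans_on_sample(data, k, sample_rate=0.05, iters=20, seed=0):
    """Run plain 1-D k-means on a uniform sample and return, per cluster,
    (mean, CI lower, CI upper) as a 99% interval under a normal assumption."""
    rng = np.random.default_rng(seed)
    sample = data[rng.random(len(data)) < sample_rate]   # ~5% Bernoulli sample
    # deterministic, well-spread initialization via quantiles
    centers = np.quantile(sample, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = np.abs(sample[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([sample[labels == j].mean() for j in range(k)])
    out = []
    for j in range(k):
        c = sample[labels == j]
        half = 2.576 * c.std(ddof=1) / np.sqrt(len(c))   # z_{0.995} ≈ 2.576
        out.append((c.mean(), c.mean() - half, c.mean() + half))
    return out
```

Running on the sample keeps the superlinear clustering cost bounded while the confidence interval quantifies the loss in accuracy.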
System Level (internal)
• Selectivity Estimation
  – Determine the percentage of tuples that satisfy a query
  – Key to effective query optimization
• Procedure
  – Exact computation vs. a 5% sample
• How good is this?
  – Arbitrary dataset
  – 1% absolute error, 95% confidence
  – ≈20k items
Example results (exact vs. sample-based estimate):
  Exact: 1.1%  – Sample: ≈1.2%
  Exact: 83.8% – Sample: ≈83.6%
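The ≈20k figure is consistent with the distribution-free Hoeffding bound n ≥ ln(2/δ)/(2ε²) ≈ 18,445 samples for ε = 1% absolute error and δ = 5%, which holds for an arbitrary dataset. A minimal sketch of sample-based selectivity estimation (function and parameter names are illustrative):

```python
import math
import random

def selectivity_estimate(table, predicate, sample_rate=0.05, seed=0):
    """Estimate the fraction of tuples satisfying `predicate` from a
    Bernoulli sample, with a distribution-free Hoeffding error bound."""
    rng = random.Random(seed)
    sample = [t for t in table if rng.random() < sample_rate]
    p_hat = sum(predicate(t) for t in sample) / len(sample)
    # Hoeffding: P(|p_hat - p| >= eps) <= 2 * exp(-2 * n * eps^2)
    eps_95 = math.sqrt(math.log(2 / 0.05) / (2 * len(sample)))
    return p_hat, eps_95

# sample size needed for 1% absolute error at 95% confidence,
# regardless of the data distribution
n_needed = math.ceil(math.log(2 / 0.05) / (2 * 0.01 ** 2))  # ≈ 18,445 tuples
```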
Option 1: Query Sampling
[Diagram: approximate queries → sampling step over the base data → estimation step → approximate results; updates hit the base data directly]
• Advantages
  – No impact on traditional query processing
  – No storage requirements
• Disadvantages
  – Sampling step is expensive
  – Supports only simple queries
  – Cannot handle data skew
Option 2: Materialized Sampling
[Diagram: a sampling step maintains sample data next to the base data; approximate queries go through the estimation step over the sample data to produce approximate results; updates hit the base data]
• Advantages
  – Quick access to the sample
  – Sophisticated preprocessing feasible
• Disadvantages
  – Storage space
  – Impact on updates
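Keeping a materialized sample fresh under inserts is commonly done with reservoir sampling; the sketch below is a hypothetical illustration of that idea (the class name is invented), not the project's actual maintenance technique.

```python
import random

class MaterializedSample:
    """Fixed-size uniform sample maintained incrementally under inserts
    (Vitter's reservoir sampling). Deletions need extra machinery, which
    is one reason updates are listed as a disadvantage above."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.reservoir = []
        self.seen = 0
        self.rng = random.Random(seed)

    def insert(self, tuple_):
        """Each of the `seen` tuples ends up in the reservoir with equal
        probability capacity / seen."""
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(tuple_)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = tuple_
```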
Sampling – Problem Dimensions
AMTC
• Development of high-end photomasks for immersion and extreme-ultraviolet (EUV) lithography for sub-90 nm semiconductor technologies
Dresden, 17.06.08, TransConnect® Infotag
Overview of Project Goals
• Data mining: additional control data for the production process
• Transition from traditional pull-based analysis systems to push scenarios for production models
• Production process and process control
• Data preparation
• Product manufacturing
• Capturing metrics for meta mining
• Online update of process models
• Integration of production data with minimal latency at the time of data creation
• Measurement data capture and integration
• Quality assurance
ADAMAS
Clustering of Time Series
• Problem
  – capture the human notion of similarity despite unequal length, amplitude scale, and phase
• Synopsis for a series
  – relative frequency of subsequences of length n
• Preprocessing (discretization)
  – represent equal-length segments by their mean (PAA)
  – transform each mean to a symbolic value (SAX)
• Clustering using
  – standard algorithms
  – Euclidean distance
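The PAA/SAX preprocessing above can be sketched as follows; this minimal version is fixed to a 4-letter alphabet, whose breakpoints are the equiprobable quantiles of the standard normal (general alphabet sizes need the full breakpoint table).

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of equal-length segments."""
    segs = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([s.mean() for s in segs])

def sax(series, n_segments, alphabet="abcd"):
    """SAX: z-normalize (removes amplitude scale and offset), reduce with
    PAA, then map each segment mean to a symbol via N(0,1) breakpoints."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    means = paa(x, n_segments)
    breakpoints = [-0.674, 0.0, 0.674]   # quartiles of N(0,1), 4 symbols
    idx = np.searchsorted(breakpoints, means)
    return "".join(alphabet[i] for i in idx)
```

Two series of different amplitude and offset but similar shape map to the same short word, which standard clustering algorithms can then compare cheaply.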
• Current work
  – extension to map data
Mining on Complex Objects using Monte Carlo Database Ideas
• Traditional systems perform mining on scalar data sets
• Complex objects
  – Uncertain data: annotated data points with a PDF
  – 3D maps
  – Tree structures
(from [Japni08])
MQO Techniques in DSMS on Modern Hardware
• Data Stream Management Systems (DSMS)
  – Many applications (sensor networks, financial analysis, …)
  – Key requirements: low-latency and high-throughput processing
• Multi-Query Optimization in DSMS
  – Huge number of queries over the same data sources
  – Opportunity to share work across operator trees with complex operator semantics (e.g., data mining)
• Exploiting modern hardware (Cell BE)
  – Heterogeneous multi-core architecture, plenty of parallelism
  – Taking advantage of SIMD
Flash and Databases
• Flash has asymmetric performance
  [Chart: throughput for 1 GB of data – Read: 10.314 MB/s, Write: 0.392 MB/s]
• Idea: trade writes for reads
  – Example: sorting a data set larger than available memory
  – Repeatedly read the data set instead of writing intermediate results
  [Chart: sorting 64 MB of integer values with 16 MB of available memory – External Merge Sort: 264.641 s, Read-Only Sort: 79.61 s]
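The read-only sort idea — re-reading the input on every pass instead of writing sorted runs — can be sketched as below. This is an illustration under simplifying assumptions (distinct keys, result collected in a list rather than streamed out), not the slide's actual flash implementation.

```python
import heapq

def read_only_sort(read_dataset, mem_items):
    """Sort a dataset larger than memory without writing intermediate runs:
    each pass re-reads the input and selects the next `mem_items` smallest
    keys above the last emitted key (trading extra reads for zero writes).
    `read_dataset` must return a fresh iterator over the data each call.
    Assumes distinct keys for brevity."""
    out = []        # in a real system this would be streamed to the consumer
    last = None
    while True:
        heap = []   # max-heap of the current candidates, via negation
        for x in read_dataset():
            if last is not None and x <= last:
                continue                      # already emitted in a prior pass
            if len(heap) < mem_items:
                heapq.heappush(heap, -x)
            elif -heap[0] > x:
                heapq.heapreplace(heap, -x)   # evict current maximum
        if not heap:
            return out                        # nothing left above `last`
        chunk = sorted(-v for v in heap)
        out.extend(chunk)
        last = chunk[-1]
```

Each pass costs one full read of the input, so sorting N items with M in memory takes ⌈N/M⌉ scans — a good trade on media where reads are an order of magnitude faster than writes.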
SAP BIA: Main-Memory DBMS
• SAP NetWeaver BIA (TREX)
  – main-memory-based decision support engine
  – supports aggregation queries only
  – uses massive parallelism
• Project subjects
  – adaptation of data mining techniques for BIA technology
  – support of next-generation planning scenarios
[Diagram: BIA within SAP Business Intelligence – any tool / Business Explorer (BEx) on top of the analytic engine and InfoCubes fed by ETL; merging and result preparation for BI queries; parallel in-memory processing of query results; aggregation on the fly; parallel indexing of InfoCube data via standard BI processes; vertical decomposition & compression of indexes; parallelism for the column store; speed and flexibility via attribute search technology; scalability via adaptive computing]
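The diagram's "vertical decomposition & compression of indexes" and "aggregation on the fly" can be illustrated in miniature; the function names are invented for this sketch and TREX's actual implementation differs.

```python
from collections import defaultdict

def dictionary_encode(column):
    """Vertical decomposition stores each attribute as its own column;
    dictionary encoding compresses it to small integer codes."""
    dictionary, codes = {}, []
    for v in column:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def aggregate_on_the_fly(codes, measure, dictionary):
    """Aggregation on the fly: a single scan over the encoded group-by
    column, summing the measure per group without materializing a cube."""
    sums = defaultdict(float)
    for code, m in zip(codes, measure):
        sums[code] += m
    decode = {c: v for v, c in dictionary.items()}
    return {decode[c]: s for c, s in sums.items()}
```

Because the scan touches only the two columns involved, ranges of the encoded column can also be processed by parallel workers and merged, which is the parallelism the slide alludes to.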
Exploit, monitor, improve…
• Exploitation loop (optimization time): query → sample-view matching → probe-query generation → probe-query execution → injection of the new estimate into the plan
• Feedback loop (execution time): report actual cardinalities → monitor quality of estimates → sample-view refresh (control chart per sample view, e.g. SV2)
[Diagram: query plans with joins (⋈), selections (σ), and group-by (GB) over relations R, S, T, U]
Estimation Process when using Sample Views
(1) View matching: match the query plan (σ, GB, ⋈ over R, S, T, U) against the sample views
(2) Probe-query generation: derive a probe query over the matched sample view (here SV2)
(3) Probe-query execution: evaluate the predicates and compute the aggregates on the sample
(4) Report the sample measures to the cardinality estimator
(5) Inject the new cardinalities into the plan
Internal synopses: sample views SV1(R,S), SV2(R,S,T), SV3(U,V), histograms, and database statistics; regular cardinality estimation is used where no view matches
Quality Control Process for Sample Views
• Optimization time: match & compensation yields an SV-based cardinality estimate; the regular cardinality estimator provides an alternative estimate
• Execution time: the actual cardinality is observed and fed into statistical quality control
• Per sample view, an InfoPack records: SVId and VersionId, sample cardinality n, estimated cardinality K', effective sampling rate, alternative estimate, and actual cardinality K
• If an estimate falls outside the control bound: execute a probe query and refresh the sample view (delete/insert)
[Diagram: the same query plans (⋈, σ, GB over R, S, T, U) at optimization time and execution time]
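The statistical quality control step can be sketched as a simple control-band check: scale the sample counts to an estimated cardinality K' and flag the view for refresh when the actual cardinality K leaves a z-sigma band. This is an illustrative stand-in using a binomial variance model, not the project's exact control chart.

```python
import math

def needs_refresh(n, k_sample, actual_K, total_N, z=3.0):
    """Control-chart check for one sample view: the sample of size n saw
    k_sample qualifying tuples, giving the estimate K' = (k_sample/n) * N.
    Returns (flag, K'), where flag is True when the actual cardinality K
    falls outside a z-sigma binomial control band around K'."""
    p_hat = k_sample / n
    est_K = p_hat * total_N
    # binomial std. dev. of the scaled estimate; floor avoids sigma = 0
    sigma = math.sqrt(max(p_hat * (1 - p_hat), 1 / n) / n) * total_N
    return abs(actual_K - est_K) > z * sigma, est_K
```

When the flag fires, the loop above executes a probe query and refreshes the sample view via delete/insert.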
Data Transitions
• Extended SOA framework
  – Data-Grey-Box Web Service technology
  – BPELDT – data transitions in BPEL (process execution language)
• Architecture: Open Service Process Platform (OSPP)
  – Based on SOA and database technologies
  – Extensible to include semantic schema matching technology
Streams in SOA – towards a CEP framework
• Stream-based execution – Web Service level
  – Stream-based processing of requests
  – Intermediate results, lower resource consumption
• Stream-based execution – process level
  – Standing processes
  – Pipelined execution of incoming messages
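Pipelined execution of incoming messages by a standing process can be sketched with generators; the stage names and message format are invented for illustration only.

```python
def parse(messages):
    """Stage 1: parse raw messages as they arrive."""
    for raw in messages:
        yield raw.strip().split(",")

def enrich(records):
    """Stage 2: annotate each record; runs interleaved with stage 1."""
    for fields in records:
        yield {"id": fields[0], "payload": fields[1:], "valid": len(fields) > 1}

def standing_process(messages):
    """A 'standing process': the pipeline stays instantiated and each
    incoming message flows through all stages immediately, so intermediate
    results never accumulate (lower resource consumption)."""
    for record in enrich(parse(messages)):
        if record["valid"]:
            yield record
```

Because each stage pulls one message at a time, latency per message stays low and memory use is independent of the stream length.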
GCIP Overview
• Vision ("invisible deployment")
  – Transparent use of integration systems
  – Optimality decision
  – Heterogeneous load balancing
• Model-Driven Generation and Optimization of Integration Processes
  – Platform-independent modeling
  – Rule- and cost-based optimization
  – (Bi-directional) model transformations
  – Code generation for different integration systems
• GCIP project as foundation for the investigation of optimization concepts in integration processes
GCIP Optimization Perspectives – Results and Challenges
• Workload-Based Optimization
  – Cost-based optimization techniques
  – Statistics maintenance
  – Workload shift detection
  – Inter-operator optimization
• Throughput Maximization
  – Vectorization of integration processes (pipelining)
  – Multi-Process Optimization (MPO): message batch processing, horizontal partitioning
  – Dynamic throughput maximization (workload-based)
• Heterogeneous Load Balancing
• Efficiency Comparison
  – DIPBench: benchmark for integration systems
  – DIPBench toolsuite
• Specific Problems
  – Message indexing for document-oriented integration processes
  – Dynamic adapter / wrapper generation
  – SIR transaction model: transaction model for integration processes (compensation-based)
  – Optimization under transactional execution restrictions
Real-time Data Warehouse Scheduling
• Real-time data warehouse
  – continuous stream of write-only updates and read-only queries / push principle
  – query (currency) vs. update (freshness) contention
• Multi-objective scheduling
  – for hundreds of transactions
  – offline and online
  – stable schedules required
• Two utility functions: QoS and QoD objectives
• Two user groups, both trying to minimize/maximize their own utility functions
[Diagram: a scheduler fed by a query queue (q1 … qi) and an update queue (u1 … uj), each item carrying QoS/QoD weights such as 0.3/0.7 or 0.9/0.1, feeding the DWH]
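One scheduling step over the two queues can be sketched as a weighted trade-off between QoS (serving queries) and QoD (applying updates for freshness). This is an illustrative stand-in, not the project's actual scheduling algorithm; the queue representation is assumed.

```python
def schedule_next(query_queue, update_queue, qos_weight=0.5):
    """Pick and dequeue the next transaction by trading off the global
    QoS/QoD balance (qos_weight) against each item's own (qos, qod)
    weights, as in the 0.3/0.7 and 0.9/0.1 annotations on the slide."""
    def score(item, is_query):
        qos, qod = item["weights"]
        return qos_weight * qos if is_query else (1 - qos_weight) * qod

    candidates = [(score(q, True), "query", q) for q in query_queue]
    candidates += [(score(u, False), "update", u) for u in update_queue]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c[0])
    (query_queue if best[1] == "query" else update_queue).remove(best[2])
    return best[1], best[2]
```

Shifting `qos_weight` moves the schedule between the two user groups' objectives, which is where the currency/freshness contention shows up.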
Precision Dairy Farming
[Diagram: circle of requirements around Precision Dairy Farming – market demands, consumer protection, quality demands, quality-assurance systems, husbandry demands and husbandry systems, efficiency demands, process control, bio- and gene technology, mathematical models and methods, information systems, environmental demands, social demands, sustainability, animal welfare]
"Precision Dairy Farming is an integrative approach for a sustainable production of milk with assured quality and a high degree of consumer and animal protection."
(J. Spilke, J. Büscher, R. Doluschitz, R. Fahr, W. Lehner, 2003)
Data Relationships between Farm Enterprises and Information Partners
[Diagram: the farm enterprise (Landwirtschaftsbetrieb) exchanges data with the computing centers ZWS and HIT, the breeding association (Zuchtverband), the milk recording organization (LKV), and the feed-quality laboratory]
What’s Next ???
• Math & Models
– Prediction – FlashForward Queries
– Monte Carlo DB as core concept for DM applications
• System Architecture
– Investigate Flash and Vectorization Potential for
Main Memory DB
• DB Infrastructures
– Standing Processes
– Process Deployment
The Team
Maik Thiele, Thomas Legler, Rainer Gemulla, Philipp Rösch, Dirk Habich, Simone Linke, Benjamin Schlegel, Hannes Voigt, Peter B. Volk, Martin Hamann, Frank Rosenthal, Steffen Preissler, Anja Klein, Matthias Böhm, Ines Funke, Bernd Keller
