Towards Intensional Answers to OLAP Queries for Analytical Sessions

Consider the following
scenario…
The cube (portion)

2. THE FRAMEWORK
This section motivates and then introduces the general framework we propose, starting with the definition of cubes and queries (based on a simplified formalization used in [1]).

2.1 Motivating Example
This simple example will give an intuition of our approach. We consider a cube of sales per city, product, and month. Three hierarchies are defined, namely LOCATION, PRODUCT, and TIME (see Figure 1). For example, a product (Redtab) belongs to a group (Levi’s). A portion of the cube is shown below.
              Redtab             Silvertab
              Jan.11   Feb.11    Jan.11   Feb.11
Queens          50       40        30       40
Brooklyn        10       20        10        0
Toronto          0       10         0       10
Ottawa           0       10         0       10
The first query:
•  Total of sales for all products, all years,
all locations?
•  System answers: 640
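The actual (extensional) answer to the second query (2011 monthly sales per product per state) is as follows:

                    Redtab   Silvertab   Loose   Lowrise
Ontario   Jan.11       0         0         10       10
          Feb.11      20        20         10       10
NY        Jan.11      60        40         20       20
          Feb.11      60        40         20       20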
The second query:
•  Drill-down and slice: 2011 monthly sales per product per state?
•  System answers: (Ontario, Levi’s, Feb.11) and (NY, CK, 2011) as expected, but:

                    Redtab   Silvertab   Loose   Lowrise
Ontario   Jan.11       0         0         10       10
          Feb.11                           10       10
NY        Jan.11      60        40
          Feb.11      60        40

Parts of the extensional answer match the user’s current understanding of data while others do not; to reduce the overall size of the answer, the system compares the extensional answer with the data in the expected cube, and it only returns the “unexpected” facts shown above, integrated with an intensional answer that summarizes the “expected” facts:
⟨Ontario, Levi’s, Feb.11⟩: as expected
⟨NY, CK, 2011⟩: as expected
This tells the user that the facts not reported in the extensional answer do not deviate at all from her expectation. A more sophisticated form of intensional answer is discussed in Section 3. Finally, the system uses the complete extensional answer to update the expected cube.
Let the third query in the session be a drill-down to city. The complete extensional answer is twice the size of the previous one (because in this example we only have 2 cities per state), but again the system may represent it concisely using intensional information.

The third query:
•  Drill down to cities
•  System answers: (Ontario, All, 2011) and (NY, CK, 2011) as expected, but:

                      Redtab   Silvertab
Jan.11   Queens         50        30
         Brooklyn       10        10
Feb.11   Queens         40        40
         Brooklyn       20         0

⟨Ontario, All, 2011⟩: as expected
⟨NY, CK, 2011⟩: as expected
Importantly, the “as expected” value is now to be interpreted with respect to the user’s understanding of data after the previous answer. This means, for instance, that the sales …
Towards Intensional Answers to
OLAP Queries for Analytical
Sessions
Patrick Marcel, Rokia Missaoui, Stefano Rizzi
DOLAP 2012
Outline
•  Motivation
•  The approach
•  An instance of the approach
•  Future directions
Motivation
•  Intensional Answers (IA)?
–  Concise description of the answer
•  OLAP Queries (OQ)?
–  Known :-)
•  Analytical Sessions (AS)?
–  Sequence of OLAP queries
•  IA2OQ4AS
–  Leveraging past queries to reduce the size of
the answer
Approach overview
[Diagram: the query Q, the cube C, and the expected cube EC are connected to EA, EEA, IA, and EAC through four steps: 0. Execute, 1. Predict, 2. Improve, 3. Build]
Q: query
C: data cube
EC: expected cube
EA: extensional answer
EEA: expected ext. answer
IA: intensional answer
EAC: ext. answer complement
Startup
Expected cube initialized with the first query of the session.
Execute
Queries are first evaluated over the actual cube…
Predict
… and then evaluated over the expected cube.
Improve
The expected cube is updated with the extensional answer…
Build
The intensional answer is built from the extensional answer and the predicted one.
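The four steps above can be sketched as a small loop. This is only an illustration under strong assumptions, not the authors' implementation: cubes are plain dictionaries, a query is modelled as a cell filter, "Improve" simply overwrites expectations with observed values (the talk uses maximum entropy estimation instead), and "Build" treats a fact as expected only when it matches exactly. All function names and the sample values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]            # hypothetical (state, product) coordinates
Facts = Dict[Cell, float]
Query = Callable[[Cell], bool]    # assumption: a query is a filter over cells

def execute(q: Query, cube: Facts) -> Facts:
    """0. Execute: evaluate q over the actual cube C to obtain EA."""
    return {c: v for c, v in cube.items() if q(c)}

def predict(q: Query, ec: Facts) -> Facts:
    """1. Predict: evaluate q over the expected cube EC to obtain EEA."""
    return {c: v for c, v in ec.items() if q(c)}

def improve(ec: Facts, ea: Facts) -> Facts:
    """2. Improve: simplified update that overwrites EC with the values in EA."""
    return {**ec, **ea}

def build(ea: Facts, eea: Facts) -> Tuple[List[Cell], Facts]:
    """3. Build: split EA into 'as expected' cells and unexpected facts."""
    as_expected = [c for c, v in ea.items() if eea.get(c) == v]
    unexpected = {c: v for c, v in ea.items() if eea.get(c) != v}
    return as_expected, unexpected

def answer(q: Query, cube: Facts, ec: Facts):
    ea = execute(q, cube)
    eea = predict(q, ec)          # predicted before EC is updated
    ia, shown = build(ea, eea)
    return ia, shown, improve(ec, ea)

# Hypothetical toy data, not taken from the talk.
cube = {("Ontario", "Redtab"): 20.0, ("NY", "Redtab"): 120.0, ("NY", "Loose"): 40.0}
ec = {("Ontario", "Redtab"): 20.0, ("NY", "Redtab"): 60.0, ("NY", "Loose"): 40.0}
ia, shown, new_ec = answer(lambda c: c[1] == "Redtab", cube, ec)
```

In this toy run only the NY Redtab fact is shown to the user, and the expected cube absorbs the observed value for the next query of the session.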
An instance of the framework
•  Relying on past contributions
–  Cube modeling
•  Using maximum entropy principle
–  like in [Sarawagi, VLDB’00], [Palpanas & al., TKDE05]
–  Intensional answers:
•  Information theoretic characterization
–  like in [Chum & Muntz, VLDB’88]
•  Using hierarchies to build the IA
–  like in [Park & Yoon, HICSS’99]
Improve the expected cube
•  The estimated values are those that:
–  maximize the uniformity of data values
–  maintain the value of known aggregates
•  Estimates are scored:
–  A confidence score is computed from a
distance in the group by lattice
–  the more precise the aggregate used for
estimation, the more accurate the estimate,
and the higher the score
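When the grand total is the only known aggregate, maximizing uniformity while preserving that total amounts to spreading it evenly over the cells. The sketch below assumes the expected cube is kept at the (state, product, month) level and uses the members of the example hierarchies; the function name is hypothetical.

```python
from itertools import product

def uniform_estimates(total, *dimension_members):
    """Spread a known aggregate evenly over all cells it covers."""
    cells = list(product(*dimension_members))
    share = total / len(cells)
    return {cell: share for cell in cells}

states = ["Ontario", "NY"]
products = ["Redtab", "Silvertab", "Loose", "Lowrise"]
months = ["Jan.11", "Feb.11", "Jan.12", "Feb.12"]

# Grand total 640 over 2 x 4 x 4 = 32 cells gives 20 per cell.
expected = uniform_estimates(640, states, products, months)
print(expected[("Ontario", "Redtab", "Jan.11")])  # 20.0
```

The per-cell value of 20 agrees with the expected 2011 monthly sales used in the running example.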
Figure 1: Dimension hierarchies
LOCATION: All > State (Ontario, NY) > City (Toronto, Ottawa; Queens, Brooklyn)
PRODUCT: All > Group (Levi’s, CK) > Product (Redtab, Silvertab; Loose, Lowrise)
TIME: All > Year (2012, 2011) > Month (Jan.12, Feb.12; Jan.11, Feb.11)

Consider an OLAP session consisting of three queries that investigate monthly sales of products. The first query simply asks for the grand total, i.e., the total sales for all products, all locations, all years. The system returns the total, say 640, to the user and updates the expected cube accordingly. The user then combines a drill-down and a slice operator to ask for the 2011 monthly sales per product per state. Knowing the grand total, she might expect that the distribution of sales is the following:

Example
•  System answers: grand total is 640
•  Expected 2011 monthly sales per product per state is:

                    Redtab   Silvertab   Loose   Lowrise
Ontario   Jan.11      20        20         20       20
          Feb.11      20        20         20       20
NY        Jan.11      20        20         20       20
          Feb.11      20        20         20       20

•  Confidence of estimates is: avg(0.5, 0, 0) = 0.17
Predict the extensional
answer
•  Run the query over the expected cube:
–  Not only the slice requested by the query
–  But also slices with the same schema that
have the best confidence score
… closer group-by sets. In particular, we propose to define Ext_q(EC) by using, besides the slice of EC identified by Σ, g_sel, the slice(s) of EC with highest confidence among those having schema Σ.

Example 3.4. Consider the following sample of EC, where each fact is associated with the aggregate used to estimate it as well as with the fact’s confidence.

Example
•  If the facts and their scores are:

f                          Agg(f)                    conf(f)
⟨⟨Ontario,CK,2012⟩, 80⟩     ⟨⟨All,All,2012⟩, 320⟩      0.4
⟨⟨NY,CK,2012⟩, 80⟩          ⟨⟨All,All,2012⟩, 320⟩      0.4
⟨⟨Ontario,CK,2011⟩, 80⟩     ⟨⟨All,CK,2011⟩, 120⟩       0.9
⟨⟨NY,CK,2011⟩, 80⟩          ⟨⟨NY,All,2011⟩, 280⟩       0.9

•  If the query asks for 2012 sales of CK by state, then slice (All,CK,2011) will be used to adjust the prediction

The first two facts form the slice identified by Σ, ⟨All,CK,2012⟩ (the 2012 slice), where Σ = ⟨⟨State, Group, Year⟩, ⟨All, Group, Year⟩⟩, while the last two facts form the slice identified by Σ, ⟨All,CK,2011⟩ (the 2011 slice). Now let q ask for the 2012 sales for CK by states. It turns out that the estimates for 2011 have higher confidence than those for 2012. Therefore, the former will be used to adjust the latter.

•  Expected value for (Ontario,CK,2012) is:

The expected extensional answer will include fact ⟨⟨Ontario,CK,2012⟩, 172.3⟩ because

\[ \frac{320 \times \frac{80}{120} \times 0.9 + 320 \times \frac{80}{320} \times 0.4}{0.9 + 0.4} = 172.3 \]

Finding the slices of EC to compute the expected extensional answer to q is done with Algorithm 2, whose explanation is given below. Let …
[Algorithm 2: listing cropped in the transcript]
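The arithmetic behind the 172.3 value can be checked with a short sketch. It follows one reading of the formula on the slide, namely a confidence-weighted average in which each slice contributes the target aggregate (320) scaled by that slice's fact-to-aggregate ratio; the function and parameter names are hypothetical and the general definition in the paper may differ.

```python
def adjusted_estimate(target_aggregate, slices):
    """Confidence-weighted average of rescaled slice proportions.

    slices: list of (fact_value, aggregate_value, confidence) triples.
    """
    weighted = sum(
        target_aggregate * (value / aggregate) * conf
        for value, aggregate, conf in slices
    )
    total_conf = sum(conf for _, _, conf in slices)
    return weighted / total_conf

# 2011 slice: fact 80, aggregate 120, confidence 0.9
# 2012 slice: fact 80, aggregate 320, confidence 0.4
estimate = adjusted_estimate(320, [(80, 120, 0.9), (80, 320, 0.4)])
print(round(estimate, 1))  # 172.3
```

The higher-confidence 2011 slice dominates the result, which is why the estimate moves well above the initial value of 80.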
Build the intensional
answer
•  EA and EEA are compared
•  Regions are labeled "as expected" if
–  Deviations in the region are low
–  Deviations in the region are homogeneous
•  Regions are delimited using granularity
levels coarser than the ones of the EA
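The comparison of EA and EEA can be illustrated on the second query of the running example. The sketch below is a simplification under stated assumptions: a region counts as "as expected" only when none of its facts deviates at all (the talk allows low, homogeneous deviations), and regions are tried from coarser to finer granularity, first (state, group, year) and then (state, group, month). The function name and the search order are hypothetical.

```python
GROUP = {"Redtab": "Levi's", "Silvertab": "Levi's", "Loose": "CK", "Lowrise": "CK"}
YEAR = {"Jan.11": "2011", "Feb.11": "2011"}
PRODUCTS = ["Redtab", "Silvertab", "Loose", "Lowrise"]

# Extensional answer of the second query, one row per (state, month).
ROWS = {
    ("Ontario", "Jan.11"): [0, 0, 10, 10],
    ("Ontario", "Feb.11"): [20, 20, 10, 10],
    ("NY", "Jan.11"): [60, 40, 20, 20],
    ("NY", "Feb.11"): [60, 40, 20, 20],
}
EA = {(s, p, m): v for (s, m), vals in ROWS.items() for p, v in zip(PRODUCTS, vals)}
EEA = {cell: 20 for cell in EA}  # expected answer: 20 everywhere

def build_intensional_answer(ea, eea):
    pending = set(ea)
    as_expected = []
    # Coarser time level first (year), then the finer one (month).
    for time_level in (lambda m: YEAR[m], lambda m: m):
        regions = {}
        for (s, p, m) in sorted(pending):
            regions.setdefault((s, GROUP[p], time_level(m)), []).append((s, p, m))
        for region, cells in sorted(regions.items()):
            if all(ea[c] == eea[c] for c in cells):   # no deviation at all
                as_expected.append(region)
                pending -= set(cells)
    unexpected = {c: ea[c] for c in pending}
    return as_expected, unexpected

as_expected, unexpected = build_intensional_answer(EA, EEA)
```

On this data the sketch labels (NY, CK, 2011) and (Ontario, Levi's, Feb.11) as expected and leaves ten unexpected facts, matching the answer shown for the second query.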
Example
•  Expected answer:

                    Redtab   Silvertab   Loose   Lowrise
Ontario   Jan.11      20        20         20       20
          Feb.11      20        20         20       20
NY        Jan.11      20        20         20       20
          Feb.11      20        20         20       20

These values are indeed those currently stored in the expected cube. However, the actual (extensional) answer to the query is as follows:

•  Extensional answer:

                    Redtab   Silvertab   Loose   Lowrise
Ontario   Jan.11       0         0         10       10
          Feb.11      20        20         10       10
NY        Jan.11      60        40         20       20
          Feb.11      60        40         20       20

Parts of this answer match the user’s current understanding of data while others do not; to reduce the overall size of the answer, the system compares the extensional answer with the data in the expected cube, and it only returns the “unexpected” facts.

•  (Ontario, Levi’s, Feb.11) and (NY, CK, 2011) as expected
Conclusion
•  A generic and flexible framework for
computing intensional answers to OLAP
queries
–  Leverages the previous queries of the session
–  Helps reduce the answer’s size
•  An instance of the framework
–  Assumes uniformly distributed values
–  Specific details and procedures
Future directions
•  Different instantiations of the 3 steps
–  Alternatives to uniform assumption
–  A different kind of intensional answer
–  User’s profile to generate the IA
•  Coupling with a recommendation engine
–  Among various possible queries, recommend
the one whose answer diverges the most
from the user’s expectation
Thanks for your attention
?