Towards Intensional Answers to OLAP Queries for Analytical Sessions
Patrick Marcel, Rokia Missaoui, Stefano Rizzi
DOLAP 2012

Outline
• Motivation
• The approach
• An instance of the approach
• Future directions

Motivation
• Intensional Answers (IA)?
  – Concise description of the answer
• OLAP Queries (OQ)?
  – Known :)
• Analytical Sessions (AS)?
  – Sequence of OLAP queries
• IA2OQ4AS
  – Leveraging past queries to reduce the size of the answer

Consider the following scenario…

2. THE FRAMEWORK
This section motivates and then introduces the general framework we propose, starting with the definition of cubes and queries (based on a simplified formalization used in [1]).

2.1 Motivating Example
This simple example will give an intuition of our approach. We consider a cube of sales per city, product, and month. Three hierarchies are defined, namely LOCATION, PRODUCT, and TIME (see Figure 1). For example, a product (Redtab) belongs to a group (Levi’s). A portion of the cube is shown below.

The cube (portion)

              Redtab             Silvertab
              Jan.11  Feb.11     Jan.11  Feb.11
Queens          50      40         30      40
Brooklyn        10      20         10       0
Toronto          0      10          0      10
Ottawa           0      10          0      10

Figure 1: Dimension hierarchies
LOCATION: All > State > City, with Ontario (Toronto, Ottawa) and NY (Queens, Brooklyn)
PRODUCT: All > Group > Product, with Levi’s (Redtab, Silvertab) and CK (Loose, Lowrise)
TIME: All > Year > Month, with 2011 (Jan.11, Feb.11) and 2012 (Jan.12, Feb.12)

The first query:
• Total of sales for all products, all years, all locations?
• System answers: 640

The second query:
• Drill-down and slice: 2011 monthly sales per product per state?
• System answers: (Ontario, Levi’s, Feb.11) and (NY, CK, 2011) as expected, but:

              Ontario            NY
              Jan.11  Feb.11     Jan.11  Feb.11
Redtab           0                 60      60
Silvertab        0                 40      40
Loose           10      10
Lowrise         10      10

The actual (extensional) answer to the query is as follows:

              Ontario            NY
              Jan.11  Feb.11     Jan.11  Feb.11
Redtab           0      20         60      60
Silvertab        0      20         40      40
Loose           10      10         20      20
Lowrise         10      10         20      20

Parts of this answer match the user’s current understanding of data while others do not; to reduce the overall size of the answer, the system compares the extensional answer with the data in the expected cube, and it only returns the “unexpected” facts (the first table above), integrated with an intensional answer that summarizes the “expected” facts:

⟨Ontario, Levi’s, Feb.11⟩: as expected
⟨NY, CK, 2011⟩: as expected

In this way, the system tells the user that the facts not reported in the extensional answer do not deviate at all from her expectation. A more sophisticated form of intensional answer is discussed in Section 3. Finally, the system uses the complete extensional answer to update the expected cube.

The third query:
• Drill down to cities
• System answers: (Ontario, All, 2011) and (NY, CK, 2011) as expected, but:

              Jan.11              Feb.11
              Queens  Brooklyn    Queens  Brooklyn
Redtab          50      10          40      20
Silvertab       30      10          40       0

Let the third query in the session be a drill-down to city. The complete extensional answer is twice the size of the previous one (because in this example we only have 2 cities per state), but again the system may represent it concisely using intensional information:

⟨Ontario, All, 2011⟩: as expected
⟨NY, CK, 2011⟩: as expected

Importantly, the “as expected” value is now to be interpreted with respect to the user understanding of data after the previous answer. This means, for instance, that the sales …

Approach overview
[Diagram: the query Q, the data cube C, the expected cube EC, and the answers EA, EEA, IA, and EAC, connected by four steps: 0. Execute, 1. Predict, 2. Improve, 3. Build]
Q: query
C: data cube
EC: expected cube
EA: extensional answer
EEA: expected ext. answer
IA: intensional answer
EAC: ext. answer complement

Startup
Expected cube initialized with the first query of the session.

Execute
Queries are first evaluated over the actual cube…

Predict
… and then evaluated over the expected cube.

Improve
The expected cube is updated with the extensional answer…

Build
The intensional answer is built from the extensional answer and the predicted one.

An instance of the framework
• Relying on past contributions
  – Cube modeling:
    • Using maximum entropy principle
      – like in [Sarawagi, VLDB’00], [Palpanas & al., TKDE05]
  – Intensional answers:
    • Information theoretic characterization
      – like in [Shum & Muntz, VLDB’88]
    • Using hierarchies to build the IA
      – like in [Park & Yoon, HICSS’99]

Improve the expected cube
• The estimated values are those that:
  – maximize the uniformity of data values
  – maintain the value of known aggregates
• Estimates are scored:
  – A confidence score is computed from a distance in the group-by lattice
  – The more precise the aggregate used for estimation, the more accurate the estimate, and the higher the score

Consider an OLAP session consisting of three queries that investigate monthly sales of products. The first query simply asks for the grand total, i.e., the total sales for all products, all locations, all years. The system returns the total, say 640, to the user and updates the expected cube accordingly. The user then combines a drill-down and a slice operator to ask for the 2011 monthly sales per product per state.
Knowing the grand total, she might expect that the distribution of sales is the following:

Example
• System answers: grand total is 640
• Expected 2011 monthly sales per product per state is:

              Ontario            NY
              Jan.11  Feb.11     Jan.11  Feb.11
Redtab          20      20         20      20
Silvertab       20      20         20      20
Loose           20      20         20      20
Lowrise         20      20         20      20

• Confidence of estimates is: avg(0.5, 0, 0) = 0.17

Predict the extensional answer
• Run the query over the expected cube:
  – Not only the slice requested by the query
  – But also slices with the same schema that have the best confidence score

In particular, we propose to define Ext_q(EC) by using, besides the slice of EC with schema Σ selected by g_sel, the slice(s) of EC with highest confidence among those having schema Σ.

Example
• If the facts and their scores are:

f                          Agg(f)                   conf(f)
⟨⟨Ontario,CK,2012⟩, 80⟩    ⟨⟨All,All,2012⟩, 320⟩    0.4
⟨⟨NY,CK,2012⟩, 80⟩         ⟨⟨All,All,2012⟩, 320⟩    0.4
⟨⟨Ontario,CK,2011⟩, 80⟩    ⟨⟨All,CK,2011⟩, 120⟩     0.9
⟨⟨NY,CK,2011⟩, 80⟩         ⟨⟨NY,All,2011⟩, 280⟩     0.9

• If the query asks for the 2012 sales of CK by state, then the slice (All,CK,2011) will be used to adjust the prediction
• Expected value for (Ontario,CK,2012) is:

\[ \frac{320 \times \frac{80}{120} \times 0.9 + 320 \times \frac{80}{320} \times 0.4}{0.9 + 0.4} = 172.3 \]

Example 3.4. Consider the sample of EC shown above, where each fact f is associated with the aggregate Agg(f) used to estimate it as well as with the fact’s confidence conf(f). The first two facts form slice_2012, the slice with schema Σ selected by ⟨All,CK,2012⟩, where Σ = ⟨⟨State, Group, Year⟩, ⟨All, Group, Year⟩⟩, while the last two facts form slice_2011, the slice selected by ⟨All,CK,2011⟩. Now let q ask for the 2012 sales for CK by states. It turns out that the estimates for 2011 (slice_2011) have higher confidence than those for 2012 (slice_2012). Therefore, the former will be used to adjust the latter. The expected extensional answer will include fact ⟨⟨Ontario,CK,2012⟩, 172.3⟩, as computed above.

Finding the slices of EC to compute the expected extensional answer to q is done with Algorithm 2.
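The expected value for (Ontario, CK, 2012) in this example can be read as a confidence-weighted average of two proportional estimates, one from each slice. The following is a minimal Python sketch under that reading; the function names `proportional_estimate` and `combine` are illustrative assumptions, not names from the paper, and the sketch does not reproduce Algorithm 2.

```python
def proportional_estimate(fact_value, fact_aggregate, target_aggregate):
    """Rescale a fact's share of its own aggregate to the target aggregate."""
    return target_aggregate * fact_value / fact_aggregate


def combine(estimates):
    """Confidence-weighted average of (value, confidence) pairs."""
    total_confidence = sum(conf for _, conf in estimates)
    return sum(value * conf for value, conf in estimates) / total_confidence


# Aggregate that the 2012 estimates are based on: <All, All, 2012> = 320.
target_aggregate = 320

# Estimate taken from the 2011 slice: fact value 80, aggregate 120, confidence 0.9.
from_2011 = proportional_estimate(80, 120, target_aggregate)
# Estimate taken from the 2012 slice: fact value 80, aggregate 320, confidence 0.4.
from_2012 = proportional_estimate(80, 320, target_aggregate)

expected = combine([(from_2011, 0.9), (from_2012, 0.4)])
print(round(expected, 1))  # 172.3
```

The higher-confidence 2011 slice dominates the result, which is why the estimate moves from 80 towards the value suggested by the 2011 proportions.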
Build the intensional answer
• EA and EEA are compared
• Regions are labeled "as expected" if
  – Deviations in the region are low
  – Deviations in the region are homogeneous
• Regions are delimited using granularity levels coarser than the ones of the EA

Example
• Expected answer:

              Ontario            NY
              Jan.11  Feb.11     Jan.11  Feb.11
Redtab          20      20         20      20
Silvertab       20      20         20      20
Loose           20      20         20      20
Lowrise         20      20         20      20

These values are indeed those currently stored in the expected cube. However, the actual (extensional) answer to the query is as follows:

• Extensional answer:

              Ontario            NY
              Jan.11  Feb.11     Jan.11  Feb.11
Redtab           0      20         60      60
Silvertab        0      20         40      40
Loose           10      10         20      20
Lowrise         10      10         20      20

• (Ontario, Levi’s, Feb.11) and (NY, CK, 2011) as expected

Conclusion
• A generic and flexible framework for computing intensional answers to OLAP queries
  – Leverages the previous queries of the session
  – Helps reduce the answer’s size
• An instance of the framework
  – Assumes uniformly distributed values
  – Specific details and procedures

Future directions
• Different instantiations of the 3 steps
  – Alternatives to uniform assumption
  – A different kind of intensional answer
  – User’s profile to generate the IA
• Coupling with a recommendation engine
  – Among various possible queries, recommend the one whose answer diverges the most from the user’s expectation

Thanks for your attention
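As a recap of the Build step, the comparison of EA and EEA in the motivating example can be sketched in Python. This is a simplified illustration, not the paper's procedure: it assumes that regions are (state, group, month) cells, one level coarser than EA on PRODUCT only, and that a region is "as expected" when every deviation in it is within a tolerance. The names `build`, `GROUP_OF`, and `tolerance` are hypothetical.

```python
from collections import defaultdict

# PRODUCT hierarchy from the example: product -> group.
GROUP_OF = {"Redtab": "Levi's", "Silvertab": "Levi's", "Loose": "CK", "Lowrise": "CK"}

# Column order of the answer tables: (state, month).
COLUMNS = [("Ontario", "Jan.11"), ("Ontario", "Feb.11"), ("NY", "Jan.11"), ("NY", "Feb.11")]

# Extensional answer (EA) to the second query, one row per product.
EA_ROWS = {
    "Redtab": [0, 20, 60, 60],
    "Silvertab": [0, 20, 40, 40],
    "Loose": [10, 10, 20, 20],
    "Lowrise": [10, 10, 20, 20],
}
ea = {(state, product, month): value
      for product, row in EA_ROWS.items()
      for (state, month), value in zip(COLUMNS, row)}

# Expected extensional answer (EEA): the grand total 640 spread uniformly over
# 2 states x 4 products x 4 months (2011 and 2012), i.e. 20 per cell.
uniform = 640 / (2 * 4 * 4)
eea = {cell: uniform for cell in ea}


def build(ea, eea, tolerance=0.0):
    """Return (regions labeled as expected, unexpected facts)."""
    regions = defaultdict(list)
    for state, product, month in ea:
        regions[(state, GROUP_OF[product], month)].append((state, product, month))
    as_expected = set()
    unexpected = {}
    for region, cells in regions.items():
        # Assumed criterion: every deviation in the region is within tolerance.
        if all(abs(ea[c] - eea[c]) <= tolerance for c in cells):
            as_expected.add(region)
        else:
            unexpected.update((c, ea[c]) for c in cells)
    return as_expected, unexpected


as_expected, unexpected = build(ea, eea)
print(sorted(as_expected))
print(len(unexpected), "unexpected facts")
```

With a zero tolerance, the sketch labels (Ontario, Levi's, Feb.11), (NY, CK, Jan.11), and (NY, CK, Feb.11) as expected and returns the remaining ten facts. The slides report the last two regions as the single region (NY, CK, 2011); merging regions along TIME is not attempted here.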