Windowing Functions on MySQL with InfiniDB

Dipti Joshi, Data Architect, InfiniDB
Copyright © 2014 InfiniDB. All Rights Reserved.
What is InfiniDB?
• Massively parallel MySQL storage engine for fast analytics
• Linear scale to handle exponential growth
• Open source
• Runs on premises, on the AWS cloud, or on a Hadoop HDFS cluster
• Standard ANSI SQL compliance
• First MySQL storage engine to support ANSI SQL-compliant windowing functions
InfiniDB Parallelism
• User Module: processes SQL requests
• Performance Module: executes the queries
• Single server or MPP
Analytic Use Cases: With Traditional MySQL and With InfiniDB
Example: Website Visitor Tracking Database
• Website visitor tracking database
• Website visit table

Id_visit  server_date  visitor_id  visit_total_time
1         02-02-2014   John_A_001  600
2         02-02-2014   Tom_J_002   400
3         03-28-2014   Dan_K_001   200
4         02-03-2014   John_A_001  70
5         03-04-2014   Jane_M_009  340
6         03-05-2014   Jane_M_009  660
7         03-06-2014   John_A_001  800

Columns: Id_visit (unique visit identifier), server_date (date of visit), visitor_id (visitor identifier), visit_total_time (time spent during the visit)

• Find the top visitor of the site every month, based on time spent on the site
Top Visitor by time spent on the site
Simple!

visitor_id  SUM(visit_total_time)
John_A_001  1470
Jane_M_009  1000
Tom_J_002   400
Dan_K_001   200

SELECT visitor_id, SUM(visit_total_time) AS total_time
FROM log_visit
GROUP BY visitor_id
ORDER BY total_time DESC
LIMIT 1;
Top visitor for each month
Totals for each visitor by month:

Month     visitor_id  Monthly_Visitor_time = SUM(visit_total_time)
February  John_A_001  670
February  Tom_J_002   400
March     Dan_K_001   200
March     Jane_M_009  1000
March     John_A_001  800

MAX of total visitor time for each month:

Month     MAX(Monthly_Visitor_time)
February  670
March     1000
Top Visitor for each month: MySQL
Getting Complicated!

SELECT t1.visitor_id, t1.visit_month, t2.m_time
FROM
  -- Totals for each visitor by month
  (SELECT visitor_id, MONTH(visit_server_date) AS visit_month,
          SUM(visit_total_time) AS total_time
   FROM log_visit
   GROUP BY visitor_id, visit_month) t1
JOIN
  -- MAX of total visitor time for each month
  (SELECT visit_month, MAX(total_time) AS m_time
   FROM (SELECT visitor_id, MONTH(visit_server_date) AS visit_month,
                SUM(visit_total_time) AS total_time
         FROM log_visit
         GROUP BY visitor_id, visit_month) subq
   GROUP BY visit_month) t2
ON t1.total_time = t2.m_time AND t1.visit_month = t2.visit_month;
Top 2 visitors for each month
Totals for each visitor by month:

Month     visitor_id  Monthly_Visitor_time = SUM(visit_total_time)
February  John_A_001  670
February  Tom_J_002   400
March     Dan_K_001   200
March     Jane_M_009  1000
March     John_A_001  800

Top 2 total visitor time for each month:

Month     Top 2 Monthly_Visitor_time
February  670
February  400
March     1000
March     800
Top N Visitors for each month
With MySQL: Getting Complicated!
- Totals for each visitor by month
- Top N total visitor time for January
- Top N total visitor time for February
- ...
- Top N total visitor time for December
With Windowing Function: Simple!
- Totals for each visitor by month
- Rank of each visitor within each month
Simplified: How do we do that?

SELECT visitor_id, total_time, visit_month,
       -- Windowing function
       RANK() OVER
         (PARTITION BY visit_month
          ORDER BY t1.total_time DESC) AS time_rank
FROM
  -- Totals for each visitor by month
  (SELECT visitor_id, MONTH(visit_server_date) AS visit_month,
          SUM(visit_total_time) AS total_time
   FROM log_visit
   GROUP BY visitor_id, visit_month) t1

Month     visitor_id  Total_time  Time_rank
February  John_A_001  670         1
February  Tom_J_002   400         2
March     Jane_M_009  1000        1
March     John_A_001  800         2
March     Dan_K_001   200         3

The February rows form one partition and the March rows another; each row is ranked within its own partition.

Top 1: Time_rank = 1
Top 2: Time_rank <= 2
Top N: Time_rank <= N
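The ranking query above can be tried end to end with a small sketch. This is an illustration under stated assumptions, not InfiniDB itself: Python's bundled SQLite stands in as the SQL engine, the dates are rewritten as ISO strings so that strftime() can extract the month, and the helper name top_n_per_month is hypothetical. Because a window function is evaluated after WHERE and GROUP BY, the filter on time_rank has to sit in an outer query.

```python
import sqlite3

# Sample rows from the website visit table; ISO dates are an assumption
# made so that SQLite's strftime() can extract the month.
VISITS = [
    (1, "2014-02-02", "John_A_001", 600),
    (2, "2014-02-02", "Tom_J_002", 400),
    (3, "2014-03-28", "Dan_K_001", 200),
    (4, "2014-02-03", "John_A_001", 70),
    (5, "2014-03-04", "Jane_M_009", 340),
    (6, "2014-03-05", "Jane_M_009", 660),
    (7, "2014-03-06", "John_A_001", 800),
]

TOP_N_SQL = """
SELECT visit_month, visitor_id, total_time, time_rank
FROM (
    SELECT visit_month, visitor_id, total_time,
           RANK() OVER (PARTITION BY visit_month
                        ORDER BY total_time DESC) AS time_rank
    FROM (
        SELECT visitor_id,
               CAST(strftime('%m', server_date) AS INTEGER) AS visit_month,
               SUM(visit_total_time) AS total_time
        FROM log_visit
        GROUP BY visitor_id, visit_month
    ) AS t1
) AS ranked
WHERE time_rank <= ?
ORDER BY visit_month, time_rank
"""


def top_n_per_month(n):
    """Return (month, visitor, total_time, rank) rows with rank <= n."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log_visit (id_visit INTEGER, server_date TEXT, "
        "visitor_id TEXT, visit_total_time INTEGER)"
    )
    conn.executemany("INSERT INTO log_visit VALUES (?, ?, ?, ?)", VISITS)
    rows = conn.execute(TOP_N_SQL, (n,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in top_n_per_month(2):
        print(row)
```

Calling top_n_per_month(1) gives the top visitor per month, and larger values of n give the top N without any change to the query text.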
Another Use Case: Running Average
• Website Sales Items Daily Revenue table

Item Id  Server_date  Revenue   Running Average
1        02-01-2014   20000.00  20000.00
2        02-01-2014   15000.00  15000.00
3        02-01-2014   17250.00  17250.00
1        02-02-2014   5001.00   12500.50
3        02-03-2014   25010.00  25010.00
3        02-04-2014   21034.00  23022.00
2        02-04-2014   34029.00  34029.00
3        02-05-2014   4120.00   12577.00
2        02-05-2014   7138.00   20583.50

• For each item, for each day, find the average of the revenue for that day and the previous day
Running Daily Average per item
MySQL: Union of two Cartesian join queries. Complex!
- Item's revenue of the previous day
- Item's revenue of the current day
Windowing Function: Simple!
Simplified: How do we do that?

SELECT item_id, server_date, daily_revenue,
       AVG(daily_revenue) OVER
         (PARTITION BY item_id ORDER BY server_date
          RANGE INTERVAL '1' DAY PRECEDING) AS running_avg
FROM web_item_sales

Item Id  Server_date  Revenue   Running Average
1        02-01-2014   20000.00  20000.00
1        02-02-2014   5001.00   12500.50
2        02-01-2014   15000.00  15000.00
2        02-04-2014   34029.00  34029.00
2        02-05-2014   7138.00   20583.50
3        02-01-2014   17250.00  17250.00
3        02-03-2014   25010.00  25010.00
3        02-04-2014   21034.00  23022.00
3        02-05-2014   4120.00   12577.00
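A minimal sketch of the running average follows, under stated assumptions: SQLite (through Python's sqlite3 module) stands in for InfiniDB, and since SQLite has no INTERVAL '1' DAY frame bound, the window is ordered by julianday(server_date) with a numeric offset of 1, which should select the same rows. The helper name running_averages and the ISO date format are illustrative choices, not part of the original.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the daily revenue table,
# with dates rewritten as ISO strings (an assumption for julianday()).
SALES = [
    (1, "2014-02-01", 20000.00),
    (2, "2014-02-01", 15000.00),
    (3, "2014-02-01", 17250.00),
    (1, "2014-02-02", 5001.00),
    (3, "2014-02-03", 25010.00),
    (3, "2014-02-04", 21034.00),
    (2, "2014-02-04", 34029.00),
    (3, "2014-02-05", 4120.00),
    (2, "2014-02-05", 7138.00),
]

# RANGE with a numeric offset needs SQLite 3.28 or newer.
RUNNING_AVG_SQL = """
SELECT item_id, server_date, daily_revenue,
       AVG(daily_revenue) OVER (
           PARTITION BY item_id
           ORDER BY julianday(server_date)
           RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS running_avg
FROM web_item_sales
ORDER BY item_id, server_date
"""


def running_averages():
    """Average of each item's revenue for the current and previous day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_item_sales "
        "(item_id INTEGER, server_date TEXT, daily_revenue REAL)"
    )
    conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", SALES)
    rows = conn.execute(RUNNING_AVG_SQL).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in running_averages():
        print(row)
```

A RANGE frame works on the values of the ORDER BY expression, so a day with no sale (for example 02-02-2014 for item 3) simply contributes no row; a ROWS frame of one preceding row would instead average across the gap.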
Complex Analytics with Traditional MySQL vs InfiniDB
• Traditional MySQL
- Complex subquery joins
- Unions of Cartesian joins
- Reduced efficiency
• InfiniDB windowing functions
- Easy
- Powerful
- Simplified syntax
- Liberating from complexity
InfiniDB Windowing Functions
Windowing functions
• Aggregate over a series of related rows
• Simplified functions for complex statistical analytics over a sliding window per row
- Cumulative, moving, or centered aggregates
- Simple statistical functions such as rank, max, min, average, median
- More complex functions such as distribution, percentile, lag, lead
- Without running complex subqueries or writing stored procedures
• Applications
- Data warehousing advanced aggregate analytics
- Business intelligence
- Mathematical time-series functions
Partition
• PARTITION BY expr1, expr2, …, exprn
- One or more columns or expressions on which rows are grouped for the windowing function calculation
• Each input row belongs to one partition
• Similar to GROUP BY
- But each row retains its identity in the output
[Figure: the same rows with columns c1, c2, c3 shown twice, comparing GROUP BY output rows with PARTITION BY output rows]
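The difference between GROUP BY and PARTITION BY can be seen by running both over the same rows. The sketch below is only an illustration: it uses SQLite through Python's sqlite3 module as a stand-in engine, reuses the visit times from the earlier example, and the helper name group_vs_partition is hypothetical.

```python
import sqlite3

# (visitor_id, visit_total_time) pairs taken from the website visit table.
VISITS = [
    ("John_A_001", 600), ("Tom_J_002", 400), ("Dan_K_001", 200),
    ("John_A_001", 70), ("Jane_M_009", 340), ("Jane_M_009", 660),
    ("John_A_001", 800),
]


def group_vs_partition():
    """Return (GROUP BY rows, PARTITION BY rows) for the same data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log_visit (visitor_id TEXT, visit_total_time INTEGER)"
    )
    conn.executemany("INSERT INTO log_visit VALUES (?, ?)", VISITS)
    # GROUP BY collapses each group into a single output row.
    grouped = conn.execute(
        "SELECT visitor_id, SUM(visit_total_time) FROM log_visit "
        "GROUP BY visitor_id ORDER BY visitor_id"
    ).fetchall()
    # PARTITION BY keeps every input row and attaches the aggregate to it.
    windowed = conn.execute(
        "SELECT visitor_id, visit_total_time, "
        "SUM(visit_total_time) OVER (PARTITION BY visitor_id) "
        "FROM log_visit ORDER BY visitor_id, visit_total_time"
    ).fetchall()
    conn.close()
    return grouped, windowed


if __name__ == "__main__":
    grouped, windowed = group_vs_partition()
    print(len(grouped), "rows from GROUP BY")
    print(len(windowed), "rows from PARTITION BY")
```

With seven input rows and four distinct visitors, the grouped query returns four rows while the windowed query returns all seven, each carrying its visitor's total.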
PARTITION BY ORDER BY
• ORDER BY
- One or more columns or functions
- The column does not need to be in the projection list
- Rows within each partition are ordered by the given columns
Frame
• The FRAME for each row is a subset of the PARTITION for that row
- Range of rows within the partition
- Range of values within the partition
- Default frame for a row is the entire partition
• The windowing function is calculated by aggregating over the FRAME
• As the row moves, the frame can move

SUM(X) OVER (PARTITION BY Y ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)

Row Number  X   Y  Frame for row  Sum(X)
1           1   1  rows 1 to 4    22
2           4   1  rows 2 to 4    21
3           7   1  rows 3 to 4    17
4           10  1  row 4          10
5           2   2  rows 5 to 7    15
6           5   2  rows 6 to 7    13
7           8   2  row 7          8
8           3   3  rows 8 to 10   18
9           6   3  rows 9 to 10   15
10          9   3  row 10         9

Partitions: rows 1 to 4 (Y = 1), rows 5 to 7 (Y = 2), rows 8 to 10 (Y = 3).
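The frame example can be reproduced with a short sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in engine. An ORDER BY on the row number is added inside the OVER clause, which the slide's expression leaves implicit, so that "following" rows are well defined; the table name t, the column name row_no, and the helper frame_sums are illustrative.

```python
import sqlite3

# (row_no, x, y) rows from the frame example.
ROWS = [
    (1, 1, 1), (2, 4, 1), (3, 7, 1), (4, 10, 1),
    (5, 2, 2), (6, 5, 2), (7, 8, 2),
    (8, 3, 3), (9, 6, 3), (10, 9, 3),
]

FRAME_SQL = """
SELECT row_no,
       SUM(x) OVER (PARTITION BY y
                    ORDER BY row_no
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sum_x
FROM t
ORDER BY row_no
"""


def frame_sums():
    """Sum x from the current row to the end of its partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (row_no INTEGER, x INTEGER, y INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    sums = conn.execute(FRAME_SQL).fetchall()
    conn.close()
    return sums


if __name__ == "__main__":
    for row_no, sum_x in frame_sums():
        print(row_no, sum_x)
```

Each row's frame shrinks as the current row advances through its partition, which is why the sums fall from 22 to 10 within the first partition.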
What is a Windowing Function?
• Provides an aggregated value based on a group of rows: the PARTITION
• Performs a calculation across a set of rows that are somehow related to the current row: the FRAME
• Rows are not grouped into a single output row; the rows retain their separate identities
• Returns multiple rows for each PARTITION
Traditional vs Windowing aggregation
Traditional aggregate function with GROUP BY compared with windowing aggregate function:
- Traditional: computes aggregates by creating groups. Windowing: computes aggregates via partitions and frames.
- Traditional: output is one row for each group. Windowing: output is one row for each input row.
- Traditional: only one way of aggregating for each group. Windowing: different rows in the same partition can have different frames.
- Traditional: only one way of grouping for each SELECT. Windowing: the same SELECT can use different partitions for each aggregate function.
Windowing Function Processing Order
• Analytic Function(arg1, ..., argn) OVER ( [PARTITION BY <...>] [ORDER BY <...>] [<FRAME_CLAUSE>] )
• The OVER clause indicates the query result set that the function operates on
• Processing order:
1. JOIN, WHERE, GROUP BY, and HAVING clauses of the main query
2. Partitions are created and ordered, and the function is applied to each row
3. Final ORDER BY of the main query
• PARTITION BY
- One or more columns or functions
    PARTITION BY server_date, item_id
    PARTITION BY MONTH(server_date)
- The column does not need to be in the projection list
- If omitted, all the input rows are in one partition
• ORDER BY
- One or more columns or functions
- The column does not need to be in the projection list
• FRAME
- ROWS [BETWEEN <start> AND <end> | <end>]
    CURRENT ROW
    UNBOUNDED PRECEDING
    UNBOUNDED FOLLOWING
    <Number of rows> PRECEDING
    <Number of rows> FOLLOWING
- RANGE [BETWEEN <start> AND <end> | <end>]
    UNBOUNDED PRECEDING
    UNBOUNDED FOLLOWING
    <value1> PRECEDING
    <value2> FOLLOWING
InfiniDB Windowing Functions
• In-database statistical windowing functions
• Distributed computation over distributed data
• Aggregate
- MAX, MIN, COUNT, SUM, AVG
- STD, STDDEV_SAMP, STDDEV_POP, VAR_SAMP, VAR_POP
• Ranking
- ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, PERCENTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN
• FIRST/LAST
- NTH_VALUE, FIRST_VALUE, LAST_VALUE
• LEAD/LAG
- LAG, LEAD
EXAMPLES
Moving Aggregate
• AVG(expression): average of the expression over the frame of the current row
• Moving centered daily average

SELECT item_id, server_date, daily_revenue,
       AVG(daily_revenue) OVER (PARTITION BY item_id ORDER BY server_date
                                RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND
                                              INTERVAL '1' DAY FOLLOWING) AS centered_avg
FROM web_item_sales

Item Id  Server_date  Revenue  AVG(daily_revenue) OVER(…)
1        02-01-2014   400.00   1200.00
1        02-02-2014   2000.00  1300.00
1        02-03-2014   1500.00  2500.00
1        02-04-2014   4000.00  2750.00
2        02-01-2014   500.00   750.00
2        02-02-2014   1000.00  900.00
2        02-03-2014   1200.00  1400.00
2        02-04-2014   2000.00  1600.00
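The centered average can be checked with a small sketch under the same kind of assumptions as before: SQLite via Python's sqlite3 module stands in for InfiniDB, and the INTERVAL '1' DAY bounds are approximated by a numeric offset of 1 over julianday(server_date). The helper name centered_averages and the ISO dates are illustrative.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the moving aggregate table,
# with ISO dates assumed so that julianday() can parse them.
SALES = [
    (1, "2014-02-01", 400.00), (1, "2014-02-02", 2000.00),
    (1, "2014-02-03", 1500.00), (1, "2014-02-04", 4000.00),
    (2, "2014-02-01", 500.00), (2, "2014-02-02", 1000.00),
    (2, "2014-02-03", 1200.00), (2, "2014-02-04", 2000.00),
]

# RANGE with numeric offsets needs SQLite 3.28 or newer.
CENTERED_SQL = """
SELECT item_id, server_date,
       AVG(daily_revenue) OVER (
           PARTITION BY item_id
           ORDER BY julianday(server_date)
           RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) AS centered_avg
FROM web_item_sales
ORDER BY item_id, server_date
"""


def centered_averages():
    """Average over the previous, current, and next day for each item."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_item_sales "
        "(item_id INTEGER, server_date TEXT, daily_revenue REAL)"
    )
    conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", SALES)
    rows = conn.execute(CENTERED_SQL).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in centered_averages():
        print(row)
```

The first and last day of each item have only two rows in their frame, which is why their averages are taken over two values rather than three.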
RANK/DENSE RANK
• RANK()
- Ranking of the row within the row's partition, with gaps after ties
• DENSE_RANK()
- Ranking of the row within the row's partition, with no gaps

SELECT item_id, server_date, daily_revenue,
       RANK() OVER (PARTITION BY server_date ORDER BY daily_revenue DESC) AS item_rank,
       DENSE_RANK() OVER (PARTITION BY server_date ORDER BY daily_revenue DESC) AS item_dense_rank
FROM web_item_sales

Item Id  Server_date  Revenue   RANK  DENSE_RANK
1        02-01-2014   20000.00  2     2
2        02-01-2014   15000.00  3     3
7        02-01-2014   15000.00  3     3
4        02-01-2014   5001.00   5     4
6        02-01-2014   21034.00  1     1
4        02-02-2014   4120.00   2     2
5        02-02-2014   7138.00   1     1
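The gap behaviour is easiest to see on the tied rows. The sketch below assumes SQLite (via Python's sqlite3 module) as a stand-in engine and uses the sample rows from the table above; the helper name ranks_by_item is hypothetical.

```python
import sqlite3

# (item_id, server_date, daily_revenue); items 2 and 7 tie on 2014-02-01.
SALES = [
    (1, "2014-02-01", 20000.00), (2, "2014-02-01", 15000.00),
    (7, "2014-02-01", 15000.00), (4, "2014-02-01", 5001.00),
    (6, "2014-02-01", 21034.00),
    (4, "2014-02-02", 4120.00), (5, "2014-02-02", 7138.00),
]

RANK_SQL = """
SELECT item_id, server_date,
       RANK() OVER (PARTITION BY server_date
                    ORDER BY daily_revenue DESC) AS item_rank,
       DENSE_RANK() OVER (PARTITION BY server_date
                          ORDER BY daily_revenue DESC) AS item_dense_rank
FROM web_item_sales
"""


def ranks_by_item():
    """Map (item_id, server_date) to (rank, dense_rank)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_item_sales "
        "(item_id INTEGER, server_date TEXT, daily_revenue REAL)"
    )
    conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", SALES)
    result = {
        (item_id, day): (rank, dense)
        for item_id, day, rank, dense in conn.execute(RANK_SQL)
    }
    conn.close()
    return result


if __name__ == "__main__":
    for key, value in sorted(ranks_by_item().items()):
        print(key, value)
```

After the two items tied at rank 3, RANK() skips to 5 for the next row while DENSE_RANK() continues with 4.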
NTILE(N)
• Bucket number of a row when the partition is divided into N buckets: PERCENTILE = NTILE(100), QUARTILE = NTILE(4)
• NTILE(3) of an item based on daily revenue

SELECT item_id, server_date, daily_revenue,
       NTILE(3) OVER (PARTITION BY server_date ORDER BY daily_revenue) AS item_ntile
FROM web_item_sales

Item Id  Server_date  Revenue   NTILE(3)
1        02-01-2014   20000.00  2
2        02-01-2014   15000.00  1
3        02-01-2014   17250.00  2
4        02-01-2014   5001.00   1
5        02-01-2014   25010.00  3
6        02-01-2014   21034.00  3
4        02-02-2014   4120.00   1
5        02-02-2014   7138.00   2
6        02-02-2014   12577.00  3
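A minimal sketch of the NTILE(3) example, assuming SQLite (via Python's sqlite3 module) as a stand-in for InfiniDB; the helper name ntile_by_item is illustrative.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the NTILE example.
SALES = [
    (1, "2014-02-01", 20000.00), (2, "2014-02-01", 15000.00),
    (3, "2014-02-01", 17250.00), (4, "2014-02-01", 5001.00),
    (5, "2014-02-01", 25010.00), (6, "2014-02-01", 21034.00),
    (4, "2014-02-02", 4120.00), (5, "2014-02-02", 7138.00),
    (6, "2014-02-02", 12577.00),
]

NTILE_SQL = """
SELECT item_id, server_date,
       NTILE(3) OVER (PARTITION BY server_date
                      ORDER BY daily_revenue) AS item_ntile
FROM web_item_sales
"""


def ntile_by_item():
    """Map (item_id, server_date) to its bucket number (1 to 3)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_item_sales "
        "(item_id INTEGER, server_date TEXT, daily_revenue REAL)"
    )
    conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", SALES)
    result = {
        (item_id, day): bucket
        for item_id, day, bucket in conn.execute(NTILE_SQL)
    }
    conn.close()
    return result


if __name__ == "__main__":
    for key, bucket in sorted(ntile_by_item().items()):
        print(key, bucket)
```

With six rows on the first day each bucket holds two rows, and with three rows on the second day each bucket holds one.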
NTH_VALUE(expression, n)
• nth value of the expression in the frame
- FIRST_VALUE = first value in the frame
- LAST_VALUE = last value in the frame

SELECT item_id, server_date, daily_revenue,
       NTH_VALUE(daily_revenue, 3) OVER (PARTITION BY server_date
                                         ORDER BY daily_revenue
                                         RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM web_item_sales

Item Id  Server_date  Revenue   NTH_VALUE(daily_revenue, 3)
1        02-01-2014   20000.00  25010.00
2        02-01-2014   15000.00  20000.00
3        02-01-2014   17250.00  21034.00
4        02-01-2014   5001.00   17250.00
5        02-01-2014   25010.00  NULL
6        02-01-2014   21034.00  NULL
4        02-02-2014   4120.00   12577.00
5        02-02-2014   7138.00   NULL
6        02-02-2014   12577.00  NULL
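The NULL results follow directly from the frame: it starts at the current row, so rows with fewer than three rows left in their partition have no third value. The sketch below illustrates this, assuming SQLite (via Python's sqlite3 module) as a stand-in engine and an explicit frame from the current row to the end of the partition; the helper name third_values is hypothetical.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the NTH_VALUE example.
SALES = [
    (1, "2014-02-01", 20000.00), (2, "2014-02-01", 15000.00),
    (3, "2014-02-01", 17250.00), (4, "2014-02-01", 5001.00),
    (5, "2014-02-01", 25010.00), (6, "2014-02-01", 21034.00),
    (4, "2014-02-02", 4120.00), (5, "2014-02-02", 7138.00),
    (6, "2014-02-02", 12577.00),
]

NTH_SQL = """
SELECT item_id, server_date,
       NTH_VALUE(daily_revenue, 3) OVER (
           PARTITION BY server_date
           ORDER BY daily_revenue
           RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
       ) AS third_value
FROM web_item_sales
"""


def third_values():
    """Map (item_id, server_date) to the 3rd revenue in its frame."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_item_sales "
        "(item_id INTEGER, server_date TEXT, daily_revenue REAL)"
    )
    conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", SALES)
    result = {
        (item_id, day): value
        for item_id, day, value in conn.execute(NTH_SQL)
    }
    conn.close()
    return result


if __name__ == "__main__":
    for key, value in sorted(third_values().items()):
        print(key, value)
```

SQL NULL comes back as Python's None, so the two highest-revenue items of each day map to None.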
LAG/LEAD
• LAG(expression, offset): value of the expression in the row that is offset rows before the current row in the partition
• LEAD(expression, offset): value of the expression in the row that is offset rows after the current row in the partition

SELECT item_id, server_date, daily_revenue,
       LEAD(daily_revenue, 1) OVER (PARTITION BY server_date ORDER BY daily_revenue),
       LAG(daily_revenue, 1) OVER (PARTITION BY server_date ORDER BY daily_revenue)
FROM web_item_sales

Item Id  Server_date  Revenue   LEAD(daily_revenue, 1)  LAG(daily_revenue, 1)
1        02-01-2014   20000.00  21034.00                17250.00
2        02-01-2014   15000.00  17250.00                5001.00
3        02-01-2014   17250.00  20000.00                15000.00
4        02-01-2014   5001.00   15000.00                NULL
5        02-01-2014   25010.00  NULL                    21034.00
6        02-01-2014   21034.00  25010.00                20000.00
4        02-02-2014   4120.00   7138.00                 NULL
5        02-02-2014   7138.00   12577.00                4120.00
6        02-02-2014   12577.00  NULL                    7138.00
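A short sketch of LEAD and LAG with an offset of 1, assuming SQLite (via Python's sqlite3 module) as a stand-in for InfiniDB; the helper name lead_lag_by_item is illustrative.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the LAG/LEAD example.
SALES = [
    (1, "2014-02-01", 20000.00), (2, "2014-02-01", 15000.00),
    (3, "2014-02-01", 17250.00), (4, "2014-02-01", 5001.00),
    (5, "2014-02-01", 25010.00), (6, "2014-02-01", 21034.00),
    (4, "2014-02-02", 4120.00), (5, "2014-02-02", 7138.00),
    (6, "2014-02-02", 12577.00),
]

LEAD_LAG_SQL = """
SELECT item_id, server_date,
       LEAD(daily_revenue, 1) OVER (PARTITION BY server_date
                                    ORDER BY daily_revenue) AS next_revenue,
       LAG(daily_revenue, 1) OVER (PARTITION BY server_date
                                   ORDER BY daily_revenue) AS prev_revenue
FROM web_item_sales
"""


def lead_lag_by_item():
    """Map (item_id, server_date) to (lead, lag) revenue values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_item_sales "
        "(item_id INTEGER, server_date TEXT, daily_revenue REAL)"
    )
    conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", SALES)
    result = {
        (item_id, day): (lead, lag)
        for item_id, day, lead, lag in conn.execute(LEAD_LAG_SQL)
    }
    conn.close()
    return result


if __name__ == "__main__":
    for key, value in sorted(lead_lag_by_item().items()):
        print(key, value)
```

The lowest-revenue item of each day has no preceding row and the highest has no following row, so those positions come back as None (SQL NULL).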
More Analytics Use Cases
• Ranking
- Top N or bottom N items by monthly sales revenue
- Top N or bottom N visitors per month by monthly spending on the site
- Items in the top N-tile range of monthly sales revenue
• Reporting aggregates
- Report for each page: total revenue per month that resulted from clicks on the page, average monthly revenue per page, and the page's percentage contribution to the site's monthly revenue
• Moving aggregates
- Running total revenue per item over the previous 7 days
- Year-to-date revenue per visitor
- Sliding standard deviation of hourly stock price over the day
• Lead and lag analytics
- Report for each page: current and previous order
- Report for each page: how much the page lags behind the best performer by revenue
Summary
• InfiniDB is the first MySQL storage engine to support ANSI SQL-compliant windowing analytic functions
• Windowing analytic functions simplify complex analytics
Thanks!
• More at http://infinidb.co
• Download InfiniDB at http://infinidb.co/download
• Follow us on Twitter: @InfiniDB
• Follow the presenter: @dipti_smg
• Visit our booth: InfiniDB
Questions?