Nanocubes for Real-Time Exploration of Spatiotemporal Datasets

Online Submission ID: 276
Category: Research
Fig. 1. Example visualizations of 210 million public geolocated Twitter posts over the course of a year. The data structure we
propose enables real-time (these images above were rendered faster than the typical screen refresh rate) visual exploration of large,
spatiotemporal, multidimensional datasets. The visual encodings built using nanocubes are within a controllable difference to ones
rendered by a traditional linear scan over the dataset. They naturally support linked navigation and brushing, and include choropleth
maps, time series over arbitrary regions and scales of space and time, parallel sets, histograms, and binned scatterplots. The
color scale of the choropleth map is a diverging scale in which blue corresponds to iPhones being relatively more popular, and red
corresponds to higher relative popularity of Android devices.
Abstract—Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by
a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the
data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Data cubes
are a well-known aggregation operation in relational databases. In a sense, they precompute every possible aggregate query over
the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk
storage. In contrast, we show how to construct a data cube that fits in a modern laptop’s main memory, even for billions of entries;
we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used
to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact
visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks
to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets,
and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are
dominated by network and user-interaction latencies.
Index Terms—Data Cube, Data Structures, Interactive Exploration
1 INTRODUCTION
As datasets get larger, real-time visualization becomes more difficult.
Consider a dataset with a billion entries. If we compute a summary
of the dataset and visualize it, the natural question becomes “does the
summary represent the data well?” and the problem has simply shifted
to “how can we visualize one billion residuals?” Even drawing the
simplest scatterplot is not straightforward. If we decide to produce
the visualization by scanning the rows of a table, we will either need
non-trivial parallel rendering algorithms or significant time to produce
a drawing. Neither of these solutions scales well with dataset size.
Data cubes are structures that perform aggregations across every
possible set of dimensions of a table in a database, to support quick
exploration [14, 30]. Many visualization systems are built on top of data
cubes, concretely or conceptually. Still, only recently have researchers
started to examine data cube creation algorithms in the context of
information visualization [32, 17, 20].
Data cubes are often problematic in that they can take prohibitively
large amounts of memory as the number of dimensions increases. In
Section 4, we show how to construct a data cube that fits in the main
memory of a modern laptop computer or workstation, extending the
work of Sismanis et al. [30]. In addition, the query times to build the
visual encodings in which we are interested will be at most proportional
to the size of the output, which is bounded by the number of
screen pixels (within a small factor). This is an important observation:
the time complexity of a visualization algorithm should ideally
be bounded by the number of pixels it touches on the screen. Our technique
enables real-time exploratory visualization on datasets that are
large, spatiotemporal, and multidimensional. Because the speed of
our data cube structure hinges partly on its being small enough to fit in
main memory, we call these structures nanocubes.
By real-time, we mean extremely fast queries: query times average
under a millisecond for a single thread, on computers ranging
from a laptop to a workstation to a server-class computing
node (Section 6). By large, we mean that the datasets we support have
millions to billions of entries.
By spatiotemporal, we mean that nanocubes support queries typical
of spatial databases, such as counting events in a spatial region that
can be either a rectangle covering most of the world, or
a heatmap of activity in downtown San Francisco (Section 4.3.1).
[Figure 2: "Five Tweets: Location and Device". Five panels (steps 1 to 5) show the nanocube after inserting each of o1, ..., o5 under the indexing schema S = [[ℓspatial1, ℓspatial2], [ℓdevice]]. Legend: proper and shared links; content links (next dimension); parent-child links (same dimension); nodes updated in the current step; dimension boundary.]
Fig. 2. An illustration of how to build a nanocube for five points [o1, ..., o5] under schema S. The complete process is described in Section 4.
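The construction illustrated in Fig. 2 can be sketched in a few lines of Python. The sketch below loosely follows the ADD, TRAILPROPERPATH, and SHALLOWCOPY procedures of the paper's pseudo-code (Fig. 3), under several simplifying assumptions that are ours and not the paper's: a plain `Counter` of time bins stands in for the summed table time series, the labeling functions take the top bits of small integer coordinates, and the five sample points are hypothetical values chosen only to resemble the figure. All class and function names (`Node`, `cell`, `count`, ...) are illustrative.

```python
from collections import Counter

class Node:
    def __init__(self):
        self.children = {}           # label -> (child, is_shared)
        self.content = None          # next-dimension Node, or Counter of time bins
        self.content_shared = False

def shallow_copy(x):
    """Copy that shares x's content and children (a time series is duplicated)."""
    if isinstance(x, Counter):
        return Counter(x)
    sc = Node()
    sc.content, sc.content_shared = x.content, True
    sc.children = {v: (child, True) for v, (child, _) in x.children.items()}
    return sc

def trail_proper_path(root, labels):
    """Walk (and create) the path for labels, replacing shared children by private copies."""
    stack, node = [root], root
    for v in labels:
        child, shared = node.children.get(v, (None, False))
        if child is None:
            child = Node()
        elif shared:
            child = shallow_copy(child)
        node.children[v] = (child, False)
        stack.append(child)
        node = child
    return stack

def add(root, obj, d, schema, time_of, updated):
    """Insert obj into dimension d (0-based) and, via content links, all later ones."""
    last = d == len(schema) - 1
    stack = trail_proper_path(root, [label(obj) for label in schema[d]])
    child = None
    while stack:
        node = stack.pop()
        update = False
        if len(node.children) == 1:      # holds the same objects as its only child
            node.content, node.content_shared = child.content, True
        elif node.content is None:
            node.content, node.content_shared = (Counter() if last else Node()), False
            update = True
        elif node.content_shared and id(node.content) not in updated:
            node.content, node.content_shared = shallow_copy(node.content), False
            update = True
        elif not node.content_shared:
            update = True
        if update:
            if last:
                node.content[time_of(obj)] += 1
            else:
                add(node.content, obj, d + 1, schema, time_of, updated)
            updated[id(node.content)] = node.content
        child = node

def build(objects, schema, time_of):
    cube = Node()
    for obj in objects:
        add(cube, obj, 0, schema, time_of, {})   # fresh updated set per object
    return cube

def count(cube, paths):
    """Count objects under one label path per dimension ([] rolls that dimension up)."""
    node = cube
    for path in paths:
        for v in path:
            if v not in node.children:
                return 0
            node = node.children[v][0]
        node = node.content
    return sum(node.values())

def cell(o, level, depth=2):
    """Assumed quadtree label: the top `level` bits of x and y."""
    s = depth - level
    return f"{o['x'] >> s:0{level}b},{o['y'] >> s:0{level}b}"

schema = [[lambda o: cell(o, 1), lambda o: cell(o, 2)], [lambda o: o["device"]]]
tweets = [  # hypothetical sample points
    {"x": 1, "y": 2, "device": "Android", "t": 0},
    {"x": 1, "y": 2, "device": "iPhone", "t": 0},
    {"x": 2, "y": 1, "device": "iPhone", "t": 1},
    {"x": 0, "y": 3, "device": "Android", "t": 1},
    {"x": 2, "y": 1, "device": "iPhone", "t": 2},
]
cube = build(tweets, schema, lambda o: o["t"])
print(count(cube, [[], []]), count(cube, [["0,1"], ["Android"]]))
```

Each query follows at most one label path per dimension and then reads a single time series, which is why the cost depends on the depth of the hierarchies rather than on the number of stored points.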
By the same token, nanocubes support temporal queries at multiple
scales, such as event counts by hour, day, week, or month over a period of years (Section 4.3.3). Data cubes in general enable the Visual
Information-Seeking Mantra [28] of “Overview first, zoom and filter, then details-on-demand” by providing rollup summaries over all
projections and letting users drill down by expanding the wanted dimensions. Nanocubes also provide overviews, filters, zooming, and
details-on-demand inside the spatiotemporal dimensions themselves.
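As a rough illustration of multi-scale temporal counts, the sketch below stores cumulative event counts over hourly bins, so that the count for any run of bins (an hour, a day, a week) is a difference of two table entries. This is only one plausible reading of a summed table for time series; the function names, the one-hour bin size, and the sample timestamps are assumptions made for this example.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

HOUR = 3600  # assumed bin width in seconds

def build_summed_series(timestamps, bin_seconds=HOUR):
    """Return (sorted non-empty bins, cumulative event counts up to each bin)."""
    per_bin = Counter(t // bin_seconds for t in timestamps)
    bins = sorted(per_bin)
    cumulative = list(accumulate(per_bin[b] for b in bins))
    return bins, cumulative

def count_in_range(series, first_bin, end_bin):
    """Events whose bin lies in the half-open range [first_bin, end_bin)."""
    bins, cumulative = series
    lo = bisect_left(bins, first_bin)
    hi = bisect_left(bins, end_bin)
    upto_hi = cumulative[hi - 1] if hi else 0
    upto_lo = cumulative[lo - 1] if lo else 0
    return upto_hi - upto_lo

def counts_per_bucket(series, bins_per_bucket, first_bin, n_buckets):
    """Event counts for consecutive buckets of bins_per_bucket bins each."""
    return [
        count_in_range(series,
                       first_bin + i * bins_per_bucket,
                       first_bin + (i + 1) * bins_per_bucket)
        for i in range(n_buckets)
    ]

# Hypothetical event times in seconds since the start of the period.
timestamps = [0, 10, 3600, 7300, 90000, 90001, 172800]
series = build_summed_series(timestamps)
hourly = counts_per_bucket(series, 1, 0, 3)    # first three hours
daily = counts_per_bucket(series, 24, 0, 3)    # first three days
print(hourly, daily)
```

The same table answers hourly and daily queries; only the bucket width passed to `counts_per_bucket` changes.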
By multidimensional, we mean that besides latitude, longitude, and
time, each entry can have additional attributes (see Section 6) that can
be used in query selections and rollups.
As we will show, nanocubes support queries that lend themselves
very well to building visual encodings which are fundamental building blocks of interactive visualization systems, such as scatterplots,
histograms, parallel coordinate plots, and choropleth maps. In summary, we contribute:
• a novel data structure that improves on state-of-the-art
data cube technology to enable real-time exploratory visualization of multidimensional, spatiotemporal datasets;
• algorithms to query the nanocube and build linked and brushable
visual encodings commonly found in visualization systems; and
• case studies highlighting the strengths and weaknesses of our
technique, together with experiments to measure its utilization
of space, time, and network bandwidth.
2 RELATED WORK
Relational databases are so widespread and fundamental to the practice of computing that they were a natural target
for information visualization almost since the field’s inception [19]. Mackinlay’s Automatic
Presentation Tool is the breakthrough result that critically connected
the relational structure of the data with the graphical primitives available for display [22] and ultimately led to
data cube visualization tools like Polaris [33, 34] and Show Me [23]. Nanocubes are specifically designed to speed up
queries for spatiotemporal data cubes, and
could eventually be used as a backend for these types of applications.
In contrast, some of the work in large data visualization involves
shipping the computation and data to a cluster of processing nodes.
While parallelism is an attractive option for increasing throughput, it
does not necessarily help achieve low latency, which is essential for
fluid interactions with a visualization tool. As a result, sophisticated
techniques such as query prediction become necessary [5]. Leveraging the enormous power of graphics processing units
has also become popular [24, 20], but without algorithmic changes, linear scans through
the dataset will still be too slow for fluid interaction, even with GPUs.
Another popular way to cope with large datasets is through sampling. Statistical sampling can be performed on the
database backend [25, 1, 9, 13], or on the front-end [10]. Still, the techniques we
introduce with nanocubes can produce results quickly and exactly (to
within screen precision) without requiring approximations, which we
believe is preferable. In addition, as Liu et al. argue, sampling by itself is not sufficient to prevent overplotting,
and might actually mask important data outliers [20].
Fekete and Plaisant have proposed modifications of traditional visual encodings which use the computer screen more
efficiently, and so scale better with dataset size [12]. Still, the proposal requires a
traversal of all input data points, which is not a feasible approach at
the scale we want to address here. Carr et al. were among the
first to propose techniques replacing a scatterplot with an equivalent
density plot [4]; nanocubes can be seen as a data structure to enable
these visualizations at a variety of dataset sizes and scales.
Careful data aggregation [16], then, appears to be one of the few
scalable solutions for low-latency large data graphics. While Elmqvist
and Fekete propose variations of visualization techniques that include
aggregation as a first-class citizen [11], in this paper we show how to
issue queries such that, at the screen resolution in which the application
is operating, the result is indistinguishable from (or close to) that of a complete scan through the dataset. We note
that over-aggressive aggregation itself introduces potential discrepancies between the visualization
and the dataset, and there are proposals to understand this [8].
We are interested in bounding the difference between our visual encoding and a visual encoding that would traverse
the entirety of the data by the size of the screen, i.e. the number of pixels in it. Related to
this, pixel-oriented techniques [18] have been investigated. However,
these tend to focus on the development of new visual encodings, while
in this paper we show how to create the already well-known and established encodings with low error, high performance
and interactivity.
Our technique is most closely related to the work of Sismanis et
al. [30, 29]. Specifically, we use similar ideas to reduce the size of
a data cube from potentially exponential in the cardinality of the key
space (we define these terms in Section 4.1) to essentially linear [31].
Nanocubes improve on Sismanis et al.’s work in two fundamental directions. First, we develop a model for
spatiotemporal data cubes that
exploits unique characteristics of space and time to get a good
compromise between space usage and efficiency of queries (Sections 4.2.1
and 6). Second, we show how these structures enable the visualizations
which are common in interactive tools (Section 4.3).
There have also been recent efforts to build data cube structures
specifically suited for visualization. Crossfilter [32] is a software package
built on the clever observation that many queries in interactive visualization
are incremental: assuming that previous results are available,
the results needed for the next query can be quickly computed. Unfortunately,
we do not easily see how this would work for the multiscale
queries necessary in a spatiotemporal setting. Just as recently, Kandel
et al. have proposed Datavore, a column-oriented database that supports
fast data cube queries [17], and Liu et al. leverage graphics hardware
in imMens, achieving extremely fast queries over large data [20]. While we
defer to Section 7 a full discussion and direct comparison of nanocubes
to Datavore and imMens, we show in Figure 14 that nanocubes achieve
good performance for both sparse (the regime where Datavore’s data
structures are ideal) and dense occupation of key space (where imMens’s are).

 1: function NANOCUBE([o1, o2, ..., on], S, ℓtime)                  ▷ n > 0
 2:     nano_cube ← NODE()                                          ▷ New empty node
 3:     for i = 1 to n do
 4:         updated_nodes ← ∅
 5:         ADD(nano_cube, oi, 1, S, ℓtime, updated_nodes)
 6:     end for
 7:     return nano_cube
 8: end function

 1: function TRAILPROPERPATH(root, [v1, ..., vk])
 2:     stack ← STACK()                                             ▷ New empty stack
 3:     PUSH(stack, root)
 4:     node ← root
 5:     for i = 1 to k do
 6:         child ← CHILD(node, vi)
 7:         if child = null then
 8:             child ← NEWPROPERCHILD(node, vi, NODE())
 9:         else if ISSHAREDCHILD(node, child) then
10:             child ← REPLACECHILD(node, child, SHALLOWCOPY(child))
11:         end if
12:         PUSH(stack, child)
13:         node ← child
14:     end for
15:     return stack
16: end function

 1: function SHALLOWCOPY(node)
 2:     node_sc ← NODE()
 3:     SETSHAREDCONTENT(node_sc, CONTENT(node))
 4:     for v in CHILDRENLABELS(node) do
 5:         NEWSHAREDCHILD(node_sc, v, CHILD(node, v))
 6:     end for
 7:     return node_sc
 8: end function

 1: procedure ADD(root, o, d, S, ℓtime, updated_nodes)
 2:     [ℓ1, ..., ℓk] ← CHAIN(S, d)
 3:     stack ← TRAILPROPERPATH(root, [ℓ1(o), ..., ℓk(o)])
 4:     child ← null
 5:     while stack is not empty do
 6:         node ← POP(stack)
 7:         update ← false
 8:         if node has a single child then
 9:             SETSHAREDCONTENT(node, CONTENT(child))
10:         else if CONTENT(node) is null then
11:             SETPROPERCONTENT(node, (d = dim(S)) ? SUMMEDTABLETIMESERIES() : NODE())
12:             update ← true
13:         else if CONTENTISSHARED(node) and CONTENT(node) not in updated_nodes then
14:             SETPROPERCONTENT(node, SHALLOWCOPY(CONTENT(node)))
15:             update ← true
16:         else if CONTENTISPROPER(node) then
17:             update ← true
18:         end if
19:         if update then
20:             if d = dim(S) then
21:                 INSERT(CONTENT(node), ℓtime(o))
22:             else
23:                 ADD(CONTENT(node), o, d+1, S, ℓtime, updated_nodes)
24:             end if
25:             INSERT(updated_nodes, CONTENT(node))
26:         end if
27:         child ← node
28:     end while
29: end procedure

Fig. 3. Pseudo-code of an algorithm to build nanocubes.

[Figure 4: table of example natural language queries, including: count of all Delta flights; count of all Delta flights in the Midwest; count of early flights in 2010; time-series of all United flights in 2009; heatmap of Delta flights in 2010.]
Fig. 4. A simplified set of queries supported by the nanocube data structure.

3 DATA CUBES
Fig. 5
Country        Device    Language
US             Android   en
US             iPhone    ru
South Africa   iPhone    en
India          Android   en
Australia      iPhone    en

Following common practice, we will call the table in Figure 5 a relation,
its columns attributes, its lines records, and its entries values. An
aggregation represents the idea of selecting a certain group of records
from a relation and summarizing this group using an aggregation function
(e.g. count, sum, max, min). For example, a possible aggregation
for the relation above could be to select all its records and
summarize those using count. In this case, five would be the final
aggregation result. If we allow entries in the record to have a special
value All, we could represent this aggregation as the following relation:

Country   Device    Language   Count
All       All       All        5

We refer to records in a relation that contain the special value All as
aggregation records. Using this notation, it is easy to understand the
conventional ways of describing aggregations for a given relation: group
by, cube, and rollup. A group by operation is one in which a relation is
derived from a base relation given a list of attributes and an aggregate
function. For example, group by on attributes Device and Language
with the count aggregate function results in the relation:

Country   Device    Language   Count
All       Android   en         2
All       iPhone    en         2
All       iPhone    ru         1

Note that for every different combination of values present in the
attributes of a base relation, an aggregation record is added to the
resulting relation. In our running example (Figure 5) these combinations
are (Android, en), (iPhone, en), and (iPhone, ru). The cube operation
is the result of collecting all possible group by aggregations into a single
relation for a given list of attributes. In our running example, cube
for count on Device and Language would be the same as the union
of four group bys: (1) on no attributes; (2) on Device only; (3) on
Language only; and (4) on Device and Language (2^n group bys where
n is the number of input attributes):

Country   Device    Language   Count
All       All       All        5
All       Android   All        2
All       iPhone    All        3
All       All       en         4
All       All       ru         1
All       Android   en         2
All       iPhone    en         2
All       iPhone    ru         1
Finally, a roll up is a constrained version of the cube operation where
There
have also
been both
recent
efforts
build data
structures
aggregation
the
idea(the
ofto regime
selecting
acube
certain
group
good performance
sparse
where
Datavore’s
dataof records
Group
By on
Device,results
Language
Kandel
et al. represents
have for
proposed
Datavore,
a column-oriented
data
cube
Cunderstand
with
the count
aggregate
function
in
thetoof
relation:
gregation
records.
this
notation,
is order
easy
the con- is important. So a roll up on Device
aggregation
IfUsing
we allow
entries
init
the
record
tothe
have
a special
the
input
attributes
aggregation
as
relation
B Liu
in
Figure
4.
record
that
contains
spespecifically
suited
for
visualization.
Crossfilter
[32]
isusing
aspace
software
pack-iM- the
structures
are
ideal)
and
dense
occupation
of
key
(where
from
a relation
and
summarizing
thisAgroup
an
aggregation
func- result.
representation
[17],
and
et al.
leverage
graphics
hardware
in iM- value
Country
Device
Language
ventional
ways
of
describing
aggregations
given
relation:
groupmeans
All,
we
could
represent
this
aggreation
theaCount
following
relation:
andasfor
Language
(in
this
order)
the union
builtAll
on
the
clever
observation that
many queries
inthis
interactive
vi- it is easy
Equivalent
to Group
By onof group by’s on: (1)
mens’s
are).
cialage
value
is
an
aggregation
record.
Using
notation,
mens,
achieving
extremely
fast
queries
over
large
data
[20].
While
we
Android
en is one in
2 which a relation is
tion (e.g. count, sum, max, min). For example, a possible aggregaby, cube, and rollup.AllA group
by operation
no
attributes;
(2) Device;
and (3) Device
sualization are incremental: assuming that previous results are availall possible
subsetsand
of Language. Note that the
C
All
iPhone
en
2
defer
to
Section
7
a
full
discussion
and
direct
comparison
of
nanocubes
to understand
some
conventional
ways
of
describing
aggregations
for
tiontheforresults
the relation
A could
to select
its records
and summarize
Device
Count
derived from a Country
base relation
givenLanguage
a listgroup
of attributes
and an aggregate
by on
only is Language}
not part of the roll up. As the results
able,
needed
for
the nextbequery
can13
beall
quickly
computed.
iPhone
ru
1 Language{Device,
to
iMmens,yielding
we show
in Figure
that
nanocubes result.
achieve If function.
3 Datavore
DATA
Cand
UBES
AllAll
All
5 Device
a given
relation:
GROUP
BY,
CUBE,
and
ROLL
group
by onAllattributes
and Language
those
using
count,
five
aswould
the
aggregation
we al- For example,
Unfortunately,
we
do not
how
this
workUP.
for
the multiof group
by’s, cubes
and roll ups can be seen as relations, we can
good performance
for see
botheasily
sparse
(the
regime where
Datavore’s
data with the count aggregate function results
in values
thecompose
relation:
queries
necessary
in
spatiotemporal
setting.
Just
recently,
naturally
such
operations.
As we will describe nanocubes is
In the
relational
databases
community,
table
below
isasa (where
relation,
its represent
for every
combination
of
present
inasthe
low
a special
value
Alla dense
to
a valid
attribute
value,
we
Ascale
GROUP
BY
operation
isbeone
in the
which
a relation
is could
derived
from
WeNote
referthat
to records
in adifferent
relation that
contain
the
special
value
All
ag-atstructures
are
ideal)
and
occupation
of key
space
iMCountry
Device
Language
Count
Kandel
et are).
al.
proposed
Datavore,
a column-oriented
data
a easy
specialized
structure
to
cubes of roll ups.
columns
arehave
attributes,
its lines
areinrecords,
and
are cube
values.function.
tributesrecords.
of the
relation
an aggregation
record
isdata
added
to contheaggregation
re-store and query
Fig.base
4.
A
sample
relation
and
its
associated
operators.
this
aggregation
asa relation
B
Figure
5. its
gregation
Using
this
notation,
it
is
to
understand
the
a base
relation
given
list
of attributes
and
anentries
aggregation
mens’s
All
Android
en
2
representation [17], and Liu et al. leverage graphics hardware in iM- ventional waysFig.
5. All
A sample
relation
and
its
associated
aggregation
operators.
of describing
aggregations
for
a
given
relation:
group
A
record
that
contains
the
special
value
All
is
an
aggregation
record.
iPhone
2
4en N ANOCUBES
: A C OMPACT, S PATIOTEMPORAL R OLL -U P
3
mens, achieving extremely fast queries over large data [20]. While we by,
cube, and
operation ru
is one in which
a relation is
iPhone
3 rollup. AAllgroup by
3 toDSection
ATA C UBES
C UBE 1
defer
7 a full discussion and direct comparison of nanocubes derived from
a
base
relation
given
a
list
of
attributes
and
an
aggregate
3
to Datavore
and iMmens,
we show
in Figure
thatbelow
nanocubes
achieve its function.
Dataofvisualizations
inina computer
In the relational
databases
community,
the13table
is a relation,
Note that
every different
combination
values present
the at- are necessarily bounded by display
Forfor
example,
group by
on attributes
Device
and Language
good
performance
for
both
sparse
(the
regime
where
Datavore’s
data
andrelation:
so is
weadded
wouldtolike
columns are attributes, its lines are records, and its entries are values. with
tributes
of theaggregate
base relation
an aggregation
record
the to
re-be able to quickly collect subspaces of
the count
function
resultssize,
in the
structures are ideal) and dense occupation of key space (where iMthe dataset
Country
Device
Language
Count that would end up in the same pixel on the screen. HowSouth Africa
iPhone
function (e.g.count, sum, max, min).
example,
and 6).
Second,
we[17],
show7.
how
these
structures
enable
thelinear
visualizaof fourFor
group
by’s: aonpossible
(1) no attributes;
andFig.
imMens
until
Section
space
(we
define
these
inLiu
Section
4.1)
to essentially
[31]. tion
data
queries
and
al.
leverage
graphics
hardware
in iMIndia
Android
en
3. cube
Pseudo-code
of anterms
algorithm
toetbuild
nanocubes.
tio
sp
is
we
qu
(e.
tur
be
da
ea
it
na
cie
of
fo
qu
4.
Le
lab
ar
sio
co
the
fac
is
Natural language query                       s            c             t         URL
count of all Delta flights                   R U          R {Delta}     R U       /where/carrier=Delta
count of all Delta flights in the Midwest    R Midwest    R {Delta}     R U       /region/Midwest/where/carrier=Delta
count of all flights in 2010                 R U          D             R 2010    /field/carrier/when/2010
time-series of all United flights in 2009    R U          R {United}    D 2009    /tseries/when/2009/where/carrier=United
heatmap of Delta flights in 2010             D tile0      R {Delta}     R 2010    /tile/tile0/when/2010/where/carrier=Delta

Fig. 5. A simplified set of queries supported by nanocubes. The column s represents space; t, time; c, category. R means “rollup”, D means “drilldown”. The value next to R or D contains the subset of that dimension’s domain being selected. U represents the entire domain (“universe”).
Finally, a ROLL UP is a constrained version of the CUBE operation where the order of the input attributes is important. A ROLL UP on Device and Language (in this order) means the union of GROUP BYs on: (1) no attributes; (2) Device; and (3) Device and Language, but does not include the GROUP BY on Language only. As the results of GROUP BYs, CUBEs and ROLL UPs can be seen as relations, we can naturally compose such operators (e.g. a ROLL UP CUBE).
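The three operators can be illustrated with a brute-force sketch. The relation below is hypothetical: its Device and Language values are chosen so that the counts are 2, 2 and 1, and the country names are placeholders. The function names `group_by`, `cube` and `roll_up` are ours, not part of any nanocube API, and only the count aggregate is shown.

```python
from collections import Counter
from itertools import combinations

ALL = "All"
ATTRIBUTES = ("Country", "Device", "Language")

# Hypothetical base relation with five records (placeholder countries).
RELATION_A = [
    {"Country": "US", "Device": "Android", "Language": "en"},
    {"Country": "India", "Device": "Android", "Language": "en"},
    {"Country": "US", "Device": "iPhone", "Language": "en"},
    {"Country": "South Africa", "Device": "iPhone", "Language": "en"},
    {"Country": "Russia", "Device": "iPhone", "Language": "ru"},
]


def group_by(relation, attrs):
    """Count records per combination of `attrs`; all other attributes become All."""
    counts = Counter(
        tuple(rec[a] if a in attrs else ALL for a in ATTRIBUTES)
        for rec in relation
    )
    return dict(counts)


def cube(relation, attrs):
    """Union of the GROUP BYs on every subset of `attrs`."""
    result = {}
    for k in range(len(attrs) + 1):
        for subset in combinations(attrs, k):
            result.update(group_by(relation, subset))
    return result


def roll_up(relation, attrs):
    """Union of the GROUP BYs on every prefix of `attrs` (order matters)."""
    result = {}
    for k in range(len(attrs) + 1):
        result.update(group_by(relation, attrs[:k]))
    return result
```

With this relation, `cube(RELATION_A, ("Device", "Language"))` contains the aggregation record for Language alone, `("All", "All", "en")`, while `roll_up(RELATION_A, ("Device", "Language"))` does not.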
4 NANOCUBE: A COMPACT, SPATIOTEMPORAL DATA CUBE
Data visualizations in a computer are necessarily bounded by display size, and so we would like to be able to quickly collect subsets of the dataset that would end up in the same pixel on the screen. However, spatiotemporal navigation is inherently multiscale. The same data structure should support quick indexing for a visualization over multiple years of time series and for drilling down into one particular hour or day. Similarly, the data cube should support aggregation queries over vast spatial regions covering entire continents, as well as very narrow queries covering only a few city blocks.
The database notion of ROLL UP, in a sense, aligns nicely with the notion of Level of Detail. For example, if the records of a table (relation) contain a location attribute, one can design a ROLL UP query whose resulting relation encodes the same information as the one encoded in a level of detail data structure. More concretely, suppose $\ell_1, \ldots, \ell_k$ are attributes computed from the original location attribute and yield “quadtree addresses” of increasingly higher levels of detail (from 1 to k). A ROLL UP query on these (computed) attributes results in, essentially, the same information as the one contained in a quadtree (given that we are keeping the same summary in both, e.g. count).
The second important notion in the design of nanocubes is the idea that we want to combine aggregations of independent dimensions at independent levels of detail. For example, we might want to know, for a whole country, the spatial distribution of tweets generated by an iPhone: coarse on the spatial dimension, but specific on the device dimension. Conversely, we might want to know the distribution of tweets (coarse on device) in a small city block (fine in space). In relational database terminology, this model has a name: it is a CUBE of ROLL UP, or a ROLL UP CUBE. With the terminology set, we can state: a nanocube is a data structure to efficiently store and query a spatiotemporal ROLL UP CUBE. Besides implementation tricks, the main difference between nanocubes and previously published sparse coalesced data cubes such as Dwarf cubes [29] is in the design of aggregations across spatiotemporal dimensions (see Sections 4.3.1 and 4.3.3).
Next, we present a formal description of the components that make up our nanocube index, pseudo-code for building nanocubes, an illustrated example, and how queries are made against our index.
4.1 Definitions
Let $O$ be a set of objects. A labeling function $\ell : O \to L$ associates a label value to the objects of $O$. We can think of $\ell$ as an attribute in a relational database. In connection with the level of detail discussion above, if $\ell_1$ and $\ell_2$ are two labeling functions for $O$, we say $\ell_1$ is coarser than $\ell_2$, or that $\ell_2$ is finer than $\ell_1$, if for any two objects $o, o' \in O$ the implication $\ell_2(o) = \ell_2(o') \Rightarrow \ell_1(o) = \ell_1(o')$ holds. We denote this fact by $\ell_1 < \ell_2$.
A sequence of labeling functions $c = [\ell_1, \ell_2, \ldots, \ell_k]$ for objects $O$ is a chain for $O$ if every labeling function is coarser than the next labeling function in the sequence: $\ell_i < \ell_{i+1}$. The number of levels of a chain is defined by $\mathrm{levels}(c) = |c| + 1$. An indexing schema for objects $O$ consists of a sequence of chains $S = [c_1, c_2, \ldots, c_n]$. The dimension of an indexing schema $S$ is the length of its sequence of chains and is denoted by $\dim(S)$. The multiplicity of a schema $S$ is the product of its chains’ number of levels: $\mu(S) = \prod_{i=1}^{n} \mathrm{levels}(c_i)$.
A full assignment for a sequence of labeling functions $[\ell_1, \ell_2, \ldots, \ell_k]$ is a sequence of label values $[v_1, v_2, \ldots, v_k]$ where $v_i$ is a label value under $\ell_i$. Any prefix of a full assignment for a sequence of labeling functions, including the empty one, is referred to as a partial assignment. Note that a full assignment is also a partial assignment, since a sequence is also a prefix of itself. An address on a schema is a sequence of partial assignments for its chains. More formally, if $S = [c_1, c_2, \ldots, c_n]$ is an indexing schema, then $a = [p_1, p_2, \ldots, p_n]$ is an address of $S$ if $p_i$ is a partial assignment for chain $c_i$. The set of possible addresses of $S$ is denoted by $\mathrm{addr}(S)$.
The addresses of an object $o$ under indexing schema $S$, denoted by $\mathrm{addr}(o, S)$, are all the addresses in $\mathrm{addr}(S)$ whose partial assignments are consistent with the label values associated to $o$. It is easy to see that the size of $\mathrm{addr}(o, S)$ is always $\mu(S)$. Besides a schema $S$, the definition of a nanocube requires a separate labeling function, $\ell_{time} : O \to T$, which we refer to as the time labeling function since we use it to encode the temporal aspect of our datasets. Thus, a nanocube for objects $o_1, \ldots, o_n$ is denoted by:
$\mathrm{NANOCUBE}([o_1, \ldots, o_n], S, \ell_{time})$
A key in a nanocube is any pair $(a, t)$ where $a \in \mathrm{addr}(S)$ corresponds to a full assignment (see definition above) and $t \in T$ is a possible time label. If we remove the requirement of $a$ being a full assignment, we say that the pair $(a, t)$ is an aggregate key. Note that every key is also an aggregate key. The set of all possible keys and the set of all possible aggregate keys of a nanocube are respectively referred to as its key space, or $K^{*}$, and its aggregate key space, or $K_a^{*}$. The size of the key space, $|K^{*}|$, is referred to as its cardinality.
4.2 Building the Index
To ease the remaining exposition, we assume that a nanocube maps an aggregate key to a count. However, nanocubes support any kind of summary that is an algebra with weighted sums and subtractions, and thus work for any linear statistics, including means and variances.
The pseudo-code for building a nanocube is presented in Figure 3. The main idea of the algorithm is, for every object $o_i$, to first find the finest address of the schema $S$ hit by this object, update the time series associated with this address, and from there on update, in a deepest-first fashion, all coarser addresses also hit by $o_i$. Note that the content of the last dimension of schema $S$ is always a time series, and that is why, in line 21 of ADD, we insert the time label of the current object. The important trick used is to, when possible, allow for shared links across dimensions (dashed blue lines in Figure 2) and in the same dimension (dashed black connections). In real use cases this sharing is responsible for significant memory savings and enables exploring even larger datasets on small laptops. The semantics of the functions not defined in the pseudo-code should be clear from their names.
4.2.1 Nanocube Example
Consider the scenario where an analyst is interested in understanding
the spatiotemporal distribution of Twitter data (i.e. tweets), including which devices (e.g. iPhone, Android) people are using. Natural
questions to ask include: Which device is more popular for tweeting?
Is one device more popular in certain areas than in others? How has
this popularity changed over time? We illustrate the construction of a
nanocube built using Twitter data in Figure 2. For clarity, this example
contains only five tweets $o_1, \ldots, o_5$, all ordered in time.
[Figure 6: a time axis with event bins 0 to 13 and the query query/tseries/1/3/4 (start at bin 1, use buckets of 3 bins each, and collect 4 of these buckets). The query is solved using a sparse summed table representation for counts, stored as (bin, accum) pairs: (1, 2), (3, 3), (4, 4), (6, 7), (10, 9). The result is 3, 4, 0, 2.]
Fig. 6. An illustration of the summed-area table variant we use for our
time series indexing scheme. Every node in Figure 2 stores an array of
timestamped counts like the one in this figure.
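The example in Figure 6 can be reproduced with a few lines of code. This is a minimal sketch rather than the server's implementation: the names `count_before`, `rollup` and `tseries` are ours, and treating each bucket as a half-open interval of bins is an assumption that happens to match the figure's result.

```python
from bisect import bisect_left

# (bin, cumulative count) pairs from the Figure 6 example.
BINS = [1, 3, 4, 6, 10]
ACCUM = [2, 3, 4, 7, 9]


def count_before(bins, accum, t):
    """Number of events stored in bins strictly smaller than t."""
    i = bisect_left(bins, t)  # index of the first stored bin >= t
    return accum[i - 1] if i > 0 else 0


def rollup(bins, accum, start, end):
    """Events in the half-open bin interval [start, end): two binary searches."""
    return count_before(bins, accum, end) - count_before(bins, accum, start)


def tseries(bins, accum, start, width, n):
    """n consecutive buckets of `width` bins each: n + 1 binary searches."""
    edges = [count_before(bins, accum, start + k * width) for k in range(n + 1)]
    return [b - a for a, b in zip(edges, edges[1:])]


print(tseries(BINS, ACCUM, 1, 3, 4))  # [3, 4, 0, 2], as in Figure 6
```

Only bins that contain at least one event are stored, which is what makes the table sparse; empty buckets such as bins 7 to 9 fall out of the subtraction as zero.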
Fig. 7. Which device is more popular for tweeting: iPhone (blue) or Android (orange)? This choropleth map highlights areas in which devices
are more popular based on a sample of 210M tweets. When we zoom in
to Chicago we can observe something not seen from the overview display: south and west of the city, Android is more popular than iPhone.
As shown on the top-left map of Figure 2, the first two tweets ($o_1$ and $o_2$) were sent from the east coast of the United States; the third tweet ($o_3$), from South Africa; the fourth tweet ($o_4$), from Asia; and the fifth tweet ($o_5$), from Australia. Tweets $o_1$ and $o_4$ were sent from an Android device while $o_2$, $o_3$, and $o_5$ were sent from an iPhone device. The labeling functions $\ell_{device}$, $\ell_{spatial1}$, and $\ell_{spatial2}$, as well as the schema of this nanocube, $S$, are all defined on the left part of this figure. The labeling $\ell_{device}$ assigns a device to each tweet, and $\ell_{spatial1}$ and $\ell_{spatial2}$ assign a spatial label to each tweet. The tweet labels given by $\ell_{spatial1}$ and $\ell_{spatial2}$ are essentially addresses in a quadtree partition of a square. Note that $\ell_{spatial1}$ is coarser than $\ell_{spatial2}$. The right part of Figure 2 presents intermediate nanocubes generated by NANOCUBE (Figure 3) after each tweet is inserted.
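To make the definitions concrete for this example, the sketch below enumerates the addresses of each tweet and counts objects per address by brute force. It is not the algorithm of Figure 3: it ignores time and performs no sharing of nodes. The spatial label values are assumptions; the level-1 values follow the convention used in Section 4.3, where (0, 1) denotes the northwest quadrant, and the level-2 values are invented so that each refines its level-1 label.

```python
from collections import Counter
from itertools import product

# Hypothetical label values for the five tweets (see the assumptions above).
TWEETS = {
    "o1": {"spatial1": (0, 1), "spatial2": (1, 2), "device": "Android"},
    "o2": {"spatial1": (0, 1), "spatial2": (1, 2), "device": "iPhone"},
    "o3": {"spatial1": (1, 0), "spatial2": (2, 1), "device": "iPhone"},
    "o4": {"spatial1": (1, 1), "spatial2": (3, 2), "device": "Android"},
    "o5": {"spatial1": (1, 0), "spatial2": (3, 0), "device": "iPhone"},
}

# Indexing schema S: a spatial chain (coarse to fine) and a device chain.
SCHEMA = [["spatial1", "spatial2"], ["device"]]


def levels(chain):
    """Number of levels of a chain: |c| + 1."""
    return len(chain) + 1


def multiplicity(schema):
    """mu(S): product of the number of levels of every chain."""
    result = 1
    for chain in schema:
        result *= levels(chain)
    return result


def partial_assignments(obj, chain):
    """All prefixes, including the empty one, of the object's full assignment."""
    full = tuple(obj[label] for label in chain)
    return [full[:i] for i in range(len(full) + 1)]


def addresses(obj, schema):
    """addr(o, S): one partial assignment per chain, in every combination."""
    return list(product(*(partial_assignments(obj, c) for c in schema)))


def naive_counts(objects, schema):
    """Count objects per address by brute force (no sharing, no time)."""
    counts = Counter()
    for obj in objects.values():
        counts.update(addresses(obj, schema))
    return counts
```

Here `multiplicity(SCHEMA)` is 3 × 2 = 6, and every tweet has exactly six addresses. The address `(((0, 1),), ())`, the northwest quadrant with any device, receives a count of 2 from $o_1$ and $o_2$.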
4.3 Querying the Cube
Nanocubes support three distinct dimension types, which are always traversed in a fixed order: spatial, categorical, and finally temporal. Before describing queries for each of these specific dimension types, we first illustrate how simple queries are conducted on nanocubes using an example. Recall that the end result of the query will be to return precomputed aggregates across one or more dimensions.
In Figure 2(5), assume we are interested in the count of all tweets that occurred in the northwest quadrant of the world, regardless of the device type and time. The aggregate key $k_a = ((p_1, p_2), t)$ for this query consists of: (1) the partial assignment for the northwest quadrant in the spatial dimension, $p_1 = [0, 1]$; (2) the empty partial assignment for the device dimension, $p_2 = [\,]$, indicating any device; and (3) a time label $t$ indicating any time. Finding the precomputed aggregate for a given aggregate key is called a simple query. In this example, we start at the top-most node and traverse all black parent-child links described in the partial assignment $p_1$: in this case only the black [0,1] link. We next cross the dimension boundary line by traversing the (blue) content link of the current node. The traversal process is repeated for the device dimension using the partial assignment $p_2$. In this specific case, no restrictions are made on the device, and we can jump to the next dimension by traversing the content link. At this point, we reach a leaf node containing $\{o_1, o_2\}$. Since no time constraint is imposed, the count of elements inside the leaf (2) is the answer for the query.
Note that, for each dimension, a simple query only traverses a single path of its tree before jumping to the root node of a tree in the next dimension (or to a leaf node which encodes time and is treated differently). In general, higher level queries might traverse multiple paths of a single tree, and may also report single aggregates, multiple aggregates, or even combine aggregates from multiple branches. To abstract and classify how a general nanocube query processes a dimension, we use the terminology of rollups and drilldowns (the ROLL UP relational operation is related but has a different meaning than the one we intend here). The dimension that is the basis of a rollup should report a single aggregate value as a result. This aggregate might be a single existing aggregate in the nanocube or a combination of multiple aggregates from different branches of that dimension. A drilldown reports aggregate values for multiple branches in that dimension. In a single nanocube query, each dimension is independently set to be used as the basis for either a rollup or a drilldown. In Figure 5, we provide a set of example queries and their mapping to the server query URL.
It is worth noting that the order of the d dimensions does not impact the worst-case query run-time. For example, a marginal barchart of a categorical dimension (with k bars) requires O(kd) time, regardless of the category chosen or the ordering of the dimensions.
4.3.1 Spatial Queries
In our current implementation, the first dimension to be traversed in a
nanocube is always the spatial dimension. It is helpful to think of this
dimension as being represented by a traditional quadtree [27], where
each quadtree node is enriched by an extra pointer (content pointer)
that jumps to the next dimension of the nanocube. If a query matches
exactly the region represented by a quadtree node, then the content
pointer of that node is the gateway for all aggregates that refer precisely to that region. If the query includes categorical restrictions (or
drilldowns), then these can be found by traversing down the following
categorical dimensions, as described below. However, spatial regions
will very rarely match exactly one node in the quadtree; therefore, we
use the traditional region quadtree intersecting algorithms to compute
the minimal disjoint set of quadtree nodes that exactly cover the query
region [27], and sum the resulting rollups across the nodes.
Arbitrarily shaped regions are not currently supported for spatial
queries because of the additional complexity that is introduced, but
there is no major intrinsic barrier in the nanocubes framework which
prevents them from working. Instead, we support regions defined by
the tiling scheme of most mapping services on the WWW. For example, the widest tile in the world in OpenStreetMap [15] has coordinates
(0, 0, 0), while a tile for block-level maps of downtown Los Angeles
might have coordinates (22485, 52342, 17). The first two coordinates
are integer addresses, and the third coordinate corresponds to the zoom
level: going down a zoom level doubles the resolution in both x and
y. Our spatial drilldowns are then specified by a tile (x, y, z) address
and an additional integer resolution, which denotes how many levels
to break down space inside the tile. Traditionally, tiles from mapping
services are squares with 256 pixels on the side, which corresponds in
our case to a resolution of 8. Since our spatial drilldowns return an
array of counts broken down by latitude and longitude, they are the
basis for spatial density plots and choropleth maps.
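The relation between a tile address and a quadtree path can be sketched as follows. The encoding of each quadtree step as an (x bit, y bit) pair is an assumption made for illustration, as are the function names; the text above only fixes that each zoom level doubles the resolution in x and y, and that a resolution of 8 corresponds to 256 cells per side.

```python
def tile_to_path(x, y, z):
    """Quadtree path from the root to tile (x, y, z): one bit pair per zoom level."""
    if not (0 <= x < 2 ** z and 0 <= y < 2 ** z):
        raise ValueError("tile address out of range for this zoom level")
    # The most significant bits select the quadrant at the coarsest level.
    return [((x >> s) & 1, (y >> s) & 1) for s in range(z - 1, -1, -1)]


def path_to_tile(path):
    """Inverse of tile_to_path."""
    x = y = 0
    for x_bit, y_bit in path:
        x = (x << 1) | x_bit
        y = (y << 1) | y_bit
    return x, y, len(path)


def drilldown_grid_side(resolution):
    """Cells per side when a tile is broken down `resolution` more levels."""
    return 2 ** resolution
```

For the block-level tile (22485, 52342, 17) mentioned above, `tile_to_path` yields a path of 17 steps, and a drilldown with resolution 8 on that tile returns a 256 × 256 array of counts, that is, aggregates taken 25 levels below the root.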
Fig. 8. A perspective on the history of American Airlines (blue) and Delta (orange). The time series show the weekly percentage of the total number of commercial flights in the United States. Note the presence of aligned spikes as well as opposing spikes. After 9/11 Delta saw a positive spike where American saw a negative one. The big bump on American Airlines after 9/11 was the merger with TWA. The heatmap shows the spatial hotspots of the two companies counting all flights after 9/11 (the time bar A can be dragged and resized to change the considered time window for the heatmap).
Fig. 9. Two kinds of Customer Tickets: Type 1 (Red) and Type 2 (Blue). The heatmap on the left map corresponds to time bar A, and the one on the right to time bar B: both encode the difference between number of reports of Type 2 and Type 1 in each point of the map. Reports of Type 1 exceed reports of Type 2, but not everywhere: notice that the region of Denver is still blue. Zooming into Denver we see that the number of Type 1 reports has increased over time, but Type 2 still dominates.
4.3.2 Categorical Queries
Categorical dimensions in a nanocube are represented by flat trees, which always contain a root node with potentially as many children as there are different values in that category. To restrict the domain to a certain value of the category, the query engine simply follows the path down the child of the corresponding value. Categorical rollups are performed by simply returning the count corresponding to either the top-level node (in case of no restriction) or the child node (in case of a restriction). Categorical drilldowns are similarly simple: they are a sparse array of all children with non-zero counts.
We note that since categorical dimensions appear under spatial dimensions, answering spatial region rollups with either categorical restrictions or drilldowns requires combining the categorical rollups across all quadtree nodes that are reached by the region. An analogous phenomenon happens for nested drilldowns across multiple categories. For example, the binned scatterplot in Figure 11 can be built directly from the result of drilling down in both day of week and hour of day. The recombinable parallel set visualization of Figure 1 requires a triple breakdown of language, device and application. Single category drilldowns also trivially enable histogram plots.
4.3.3 Temporal Queries
To represent the temporal dimension, we use a variant of summed-area tables [7], namely a sparse variant of the one-dimensional analogue (see Figure 6). Each time series in a node is stored as a dense, sorted array of cumulative counts, tagged by timestamp. With this data structure, we can compute a temporal rollup of event counts along any contiguous period, using only two binary searches: one to find the array element with the least upper bound of the period’s beginning, and another to find the greatest lower bound of the period’s end. The difference between these numbers is the total number of events in the period. A temporal drilldown happens similarly, and we can compute a time series with t entries by performing t + 1 binary searches. These determine the breaking points in the cumulative array, and the final values are computed by stepwise differences.
This scheme for storing time entries is attractive for several reasons. First, it ensures that we can store time series of any granularity without requiring a nested tree structure like our spatial indexing scheme. Second, the running time is essentially optimal (up to a log n factor), and the algorithm is extremely fast in practice.
5 IMPLEMENTATION
We use a client-server architecture for the current implementation of nanocubes. The server reads the multidimensional data, builds a nanocube, and then processes queries on the nanocube from client applications. The server is a C++11 template-based implementation, which makes it easy to plug in different data structures for each dimension of the nanocube. For example, for the Twitter data, we use a 2D quadtree for the spatial dimensions (latitude and longitude), flat trees for each categorical dimension (e.g. language, device, application), and our summed-area table variant for the time dimension.
The nanocube construction algorithm has not been optimized for speed (results are included in Section 6), but there are several possible improvements that we could make: using multiple threads, or using memory pools to avoid the overhead of repeated memory allocations and deallocations. Due to the scale of the input data, most of our effort has been spent on optimizing memory usage, including optimized libraries for memory allocation (libtcmalloc) and tagged pointers, which allow us to use the 16 most significant bits in a 64-bit pointer to quickly identify different types of nodes in our data structure.
The nanocube server exposes its API for queries via HTTP. More specifically, it provides a web service by which queries can be issued [26]. After the data cube is built, the data structures are no longer mutated, and so the server is easily parallelizable (it also means that nanocubes are add-only: they cannot be updated if a record is removed from the base relation). Our implementation uses the Mongoose library for handling multiple HTTP requests in separate threads concurrently [21]. We have built two front-end visualization clients to query the nanocube server. One client is written in C++ and uses OpenGL for efficient rendering. The other client is browser-based and is written in JavaScript, HTML5, SVG, WebGL, and D3 [3]. Snapshots of the clients appear in the accompanying figures of the paper.
6 EXPERIMENTS
To study the behavior of our nanocube data structure, we collected six public and private datasets that ranged in size from four million records up to over one billion records. Each dataset includes geospatial, temporal, as well as data-specific categorical dimensions that range from only a few bins to as many as 30. For all but the synthetic dataset experiments, we included the geospatial time-series dimensions, and varied the other dimensions based on the datasets.
In the following sections, we provide a brief overview of each of the datasets, followed by an overall summary of our experimental results in Section 6.8. For each of the experiments, we paid particular attention to how much memory was required to build and store the nanocube index, as well as the overall complexity of the dataset itself, which varied greatly from one to the next. Once the nanocubes were constructed, we queried them using one or both of our front-end clients to highlight the ease with which analysts could explore the data.
Fig. 10. Highlights of a visual analysis session of the CDR dataset, with 1,043,884,027 records. We noticed the different patterns in call volume by interacting with the dataset and trying different regions and category selections. Notice the patterns occur at different spatial and temporal scales.
The query times and bandwidth usage across all experiments are consistent, so we report them in aggregate here. The mean query time was 800µs (less than 1 millisecond) with a maximum of 12 milliseconds. The output size per query averaged 5KB, with a maximum size of 50KB (geographical tiles dominated bandwidth usage). Our server currently uses no compression, although we plan to support transparent gzip stream encoding. The mean number of queries for the C++ client was 100 requests per second. The HTML5 client is much quieter, at around 1 query per second, since linked views are only updated when a brush is released. The C++ client was designed for LANs, and its bandwidth usage is around 5Mbps, well within current capacities.
6.1 Twitter
Between November 2011 and June 2012, we collected about 210 million tweets that originated in the United States using Twitter’s public feed, which provides a representative sampling of all tweets. The rate of tweets obtained averaged about one million per day. The data was streamed in the form of JSON objects, from which we extracted the following attributes: latitude and longitude of the device, the time the tweet occurred, the client application used, the type of device, and the language of the tweet. The categorical dimensions in our data (application, device, language) had respectively 4, 5, and 15 distinct values. With a nanocube built using this data, we could quickly explore the data to better understand the areas in which one device is more popular than another, where each of the languages is most prevalent, and how that information changes over time (see Figure 7).
6.2 Airline Commercial Flights History
This publicly available dataset contains data for every commercial flight in the United States over a 20 year period (1987-2008) [2, 35]. For over 120 million flights, the records include the scheduled departure and arrival times, the actual departure and arrival times, the origin and destination airports, the airline, and other fields. For this experiment, we built our index using the origin airport (for latitude and longitude), scheduled departure time, the departure delay, and the airline. This allows us to answer queries related to overall departure delays for any airports, airlines, time of day, or combinations thereof. In Figure 8 we present an overview of the weekly percentages of total commercial flights in the U.S. for Delta and American Airlines over a 20 year period.
6.3 Call Detail Records
For each cellular phone call, telecommunications companies collect information about the call including time, duration, and the sequence of cell towers that carried the call. This information is organized into what are known as Call Detail Records (CDRs). A large U.S. service provider (privately) shared with us over one billion CDRs generated from a one month period in July 2010. Due to the sensitivity of CDR data, our data has been completely anonymized. No personally identifiable information was gathered or used in conducting this study. To the extent that any data was used, it was anonymous and aggregated data. The nanocube was built using the geospatial temporal data (of first cell tower), as well as the duration (transformed to a categorical dimension) of each call (see Figure 10).
Fig. 12. By supporting multiscale time series queries, we can explore
the Brightkite checkin frequency to investigate global trends as well as
short-lived events. Searching online for news about Brightkite around the
event dates, we found that the upward spike coincides exactly with the
release of the Brightkite iOS client, and that the downward spike, which
appears to have ultimately cost the social network its users, was caused
by a global outage that lasted a few days.
Fig. 11. Selecting different geographical regions highlights how different
populations interacted with the Brightkite social network. While in the
US and UK there is no substantial difference between weekday and
weekend traffic, in Japan weekday usage is markedly lower.
6.4 Location-Based Social Networks
The next dataset is also publicly available, and consists of location-based
checkins in the Brightkite social network collected by Cho et
al. [6]. The dataset comprises all checkins from the (now-defunct)
website between April 2008 and October 2010. In addition to latitude,
longitude, and the time of each checkin, we redundantly encoded hour
of day and day of week as extra categorical dimensions, since we expected there to be interesting periodic day-to-day and weekday vs.
weekend patterns. In Figures 11 and 12, we show some results of
visualizing this dataset with our HTML5 frontend.
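The redundant encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function name is hypothetical, and interpreting timestamps in UTC is an assumption (the text does not say whether local time was used).

```python
from datetime import datetime, timezone


def periodic_dimensions(unix_seconds: int) -> tuple[int, int]:
    """Return (hour_of_day, day_of_week) for a timestamp, interpreted in UTC.

    hour_of_day lies in 0..23 and day_of_week in 0..6 (Monday = 0), so the
    two values fit in 5 and 3 bits respectively.
    """
    t = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return t.hour, t.weekday()


# A hypothetical checkin on Saturday, 4 July 2009, at 15:30 UTC.
example = int(datetime(2009, 7, 4, 15, 30, tzinfo=timezone.utc).timestamp())
print(periodic_dimensions(example))  # (15, 5)
```

Storing these two values as categorical dimensions lets a query such as "all weekend checkins" be answered by selecting categories, without scanning the time series.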
growth that quickly flattens out. In the case of SPLOM 50, the index
grew from 0 to 300MB with the first 200 million object insertions, but
grew less than 100MB larger as a result of the next 800 million object
insertions. The explanation for this behavior lies in the synthetic object
generator, which samples each dimension from a normal distribution: the
high-probability keys are generated quickly, and it then becomes harder and
harder for a new object to carry an unseen key. Thus, later in the process,
most inserted objects do not require more memory, since their keys were
already inserted into the nanocube. We refer to this phenomenon as key
saturation.
In Figure 14(B), we present curves for memory usage and number
of keys, both relative to the final nanocube numbers. The objects in
this example are calls from the CDR dataset and the dimensions are:
location of first cell tower hit by the call, call duration and initial time
of the call. To test for a key saturation effect as seen in the SPLOM
example, we excluded the time dimension present in the original data
for this experiment. Once again, we observe an initial rapid growth in
memory usage, explained by the large number of combinations of cell
locations and call durations not yet inserted. Once the bulk of the keys
corresponding to these combinations has been inserted, new keys arrive at a
relatively small but steady rate, reflecting a small but steady
growth in the cell tower infrastructure. Similarly defined curves for the
Flights dataset are shown in Figure 14(C). We included the location
of the flight’s origin airport, departure delay, and carrier, but again
excluded the time dimension to test for the saturation effect. The first
part of this experiment follows the same trend as before: rapid initial
growth, followed by a saturation of keys and a steady but much slower
growth reflecting the small rate of new airport locations and carriers.
At about 80M inserted flights (circa 1995), we again observe a regime
of rapid growth, which corresponds to a burst of new carriers.
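The key saturation effect discussed above for the synthetic SPLOM data can be reproduced in miniature. The sketch below is not the benchmark generator of Kandel et al.: the bin range, the seed, and all function names are assumptions made for illustration, and it tracks only the number of distinct keys, which is a proxy for (not a measurement of) nanocube memory.

```python
import random


def bin_index(x: float, bins: int, lo: float = -3.0, hi: float = 3.0) -> int:
    """Map a real value to one of `bins` equal-width bins, clamping outliers."""
    i = int((x - lo) / (hi - lo) * bins)
    return min(max(i, 0), bins - 1)


def distinct_keys_over_time(n_objects, dims, bins, checkpoints, seed=0):
    """Insert normally distributed objects; record distinct keys at checkpoints."""
    rng = random.Random(seed)
    seen = set()
    curve = []
    for n in range(1, n_objects + 1):
        # Each dimension is sampled independently from a standard normal.
        key = tuple(bin_index(rng.gauss(0.0, 1.0), bins) for _ in range(dims))
        seen.add(key)
        if n in checkpoints:
            curve.append((n, len(seen)))
    return curve


curve = distinct_keys_over_time(
    n_objects=20_000, dims=3, bins=10, checkpoints={5_000, 10_000, 15_000, 20_000}
)
for n, keys in curve:
    print(f"{n:>6} objects -> {keys} distinct keys")
```

Because the high-probability keys appear early, the distinct-key count rises quickly and then flattens, which is the shape the text attributes to the SPLOM memory curves.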
6.5 Customer Tickets
This dataset contains about 8 million records of customer
interactions with a large U.S. service provider over a period of 2.5 years.
The dataset contains latitude, longitude, time and report type (one of
eight categories). The same measures taken to anonymize CDR data
in Section 6.3 were used here. In Figure 9, we highlight the use of
nanocubes to detect relative changes in category in the time series plot,
and how choropleth maps restricted to different time regions show the
change in geographical distribution of the report types.
6.6 SPLOM
This is a collection of synthetic datasets (each with five dimensions)
designed by Kandel et al. [17] to exercise data cube technology
(SPLOM stands for ScatterPLOt Matrix, the visual encoding used to
explore the dataset in that work), also used by Liu et al. [20]. To
compare resource usage to that of these other proposals, we built
nanocubes using five different bin sizes per dimension, from 10 to 50.
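As a small worked example of how a bin size translates into the per-dimension bit widths reported in the schema column of Figure 13, the sketch below computes the smallest number of bits able to label a given number of bins. The function name is hypothetical. The values for 10 and 50 bins match the 4 and 6 bits listed for splom-10 and splom-50; the values for 20, 30, and 40 bins are computed under the same assumption and are not reported in the paper.

```python
def bits_per_dimension(bins: int) -> int:
    """Smallest number of bits that can label `bins` distinct bins."""
    if bins < 1:
        raise ValueError("a dimension needs at least one bin")
    return (bins - 1).bit_length()


for bins in (10, 20, 30, 40, 50):
    print(f"{bins} bins -> {bits_per_dimension(bins)} bits per dimension")
```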
6.7 Memory Usage
To understand the memory requirements to build a nanocube, it is
important to remember that objects are not inserted directly into the
nanocube, but rather through their corresponding keys (see Figure 3).
An object’s key identifies the most specific bin in the nanocube that
contains that object. Thus, depending upon the resolutions defined for
the dimensions of a nanocube, two different objects may or may not
be distinguishable. For example, if the time resolution of a nanocube
is one hour, two objects with timestamps at 20h10m and 20h50m will
both have keys with the same time label, rounded down to 20h. As
a result, new occurrences of keys that were already inserted into a
nanocube do not require additional storage space.
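The effect of resolution on keys can be illustrated with a toy example. The flat counter below is only a stand-in for the nanocube (it has none of its hierarchy or sharing), and the key layout and function name are assumptions; it shows only that objects falling in the same finest bin map to one key and therefore add no new entry.

```python
from collections import Counter


def make_key(lat_bin: int, lon_bin: int, hour: int, minute: int,
             resolution_minutes: int = 60) -> tuple:
    """Key of the finest bin containing an object (hypothetical layout)."""
    time_label = (hour * 60 + minute) // resolution_minutes
    return (lat_bin, lon_bin, time_label)


store = Counter()
store[make_key(3, 7, 20, 10)] += 1  # 20h10m -> time label 20
store[make_key(3, 7, 20, 50)] += 1  # 20h50m -> same key, no new entry
store[make_key(3, 7, 21, 5)] += 1   # 21h05m -> new key

print(len(store))         # 2 distinct keys for 3 objects
print(store[(3, 7, 20)])  # 2
```

With a finer resolution, for example `resolution_minutes=30`, the first two objects would receive different time labels and occupy separate entries.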
Figure 14(A) shows the memory usage growth for the SPLOM
dataset as we insert from zero to one billion objects into the five
nanocubes of increasing bin size. In all cases there is an initial rapid
6.8 Performance Summary
In Figure 13, we summarize the relevant information for building our
nanocubes on the previously described datasets. The number of input
objects N, the memory requirements, and the build times are reported
in the first three columns, while the exact schema used for each dataset
is included in the last column. Column “size” indicates the number of
nodes in our data structure (in the nanocube of Figure 2(5) this number
would be 41: 22 circles + 19 entries in the leaves). The “sharing”
column indicates how much larger the nanocube would be without the
sharing mechanism (dashed lines in Figure 2). For example, the twitter
dataset        objects (N)  memory    time     size     sharing  keys (|K|)  |K*|  schema
brightkite     4.5 M        1.6 GB    3.50 m   149.0 M  3.00x    3.5 M       2^74  lat(25), lon(25), time(16), weekday(3), hour(5)
customer tix   7.8 M        2.5 GB    8.47 m   213.0 M  2.93x    7.8 M       2^69  lat(25), lon(25), time(16), type(3)
flights        121.0 M      2.3 GB    31.13 m  274.0 M  16.50x   43.3 M      2^75  lat(25), lon(25), time(16), carrier(5), delay(4)
twitter-small  210.0 M      10.2 GB   1.23 h   1.2 B    3.72x    116.0 M     2^53  lat(17), lon(17), time(16), device(3)
twitter        210.0 M      46.4 GB   5.87 h   5.2 B    4.00x    136.0 M     2^60  lat(17), lon(17), time(16), lang(5), device(3), app(2)
splom-10       1.0 B        4.3 MB    4.13 h   51.2 K   5.67x    7.4 K       2^20  d1(4), d2(4), d3(4), d4(4), d5(4)
splom-50       1.0 B        166.0 MB  4.72 h   8.8 M    16.00x   1.9 M       2^30  d1(6), d2(6), d3(6), d4(6), d5(6)
cdrs           1.0 B        3.6 GB    3.08 h   271.0 M  18.60x   96.3 M      2^69  lat(25), lon(25), time(16), duration(3)
Fig. 13. Summary of resource usage for our reported experimental results (K = 10^3, M = 10^6, B = 10^9). The numbers in parentheses in the schema
column denote the number of bits necessary to refer to a value of that dimension, and their sum is the exponent of 2 in the |K*| column.
dataset nanocube would use at least 4 × 46.4GB = 185.6GB without
sharing. Column “keys” is the number of distinct keys induced by
the N objects (note here that the time dimension is included). Finally,
column |K*| reports the cardinality of the key space of each nanocube.
All but two of the datasets fit in 4GB of RAM, and only one of them
would not fit in 16GB, the amount of memory in a high-end laptop.
The multiscale nature of spatiotemporal datasets makes the cardinality
of the key space impractically large for any dense storage scheme.
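To make the last point concrete, the sketch below derives the key-space exponent from three of the schemas listed in Figure 13 and estimates the size of a dense array with one counter per finest bin. The 4-byte counter is an assumption chosen for illustration, and the function names are hypothetical; the bit widths are copied from the schema column.

```python
# Bits per dimension, copied from the schema column of Figure 13.
SCHEMAS = {
    "twitter-small": {"lat": 17, "lon": 17, "time": 16, "device": 3},
    "cdrs": {"lat": 25, "lon": 25, "time": 16, "duration": 3},
    "splom-10": {"d1": 4, "d2": 4, "d3": 4, "d4": 4, "d5": 4},
}


def key_space_exponent(schema: dict) -> int:
    """Exponent e such that the key space has 2**e finest-resolution bins."""
    return sum(schema.values())


def dense_bytes(schema: dict, bytes_per_bin: int = 4) -> int:
    """Bytes for a dense array holding one counter per finest bin.

    bytes_per_bin = 4 is an assumption; aggregates would need further space.
    """
    return bytes_per_bin << key_space_exponent(schema)


for name, schema in SCHEMAS.items():
    e = key_space_exponent(schema)
    gib = dense_bytes(schema) / 2 ** 30
    print(f"{name}: |K*| = 2^{e}, dense array = {gib:.3g} GiB")
```

Under this assumption, only the SPLOM key space yields a dense array of a few megabytes, while the spatiotemporal schemas require storage many orders of magnitude beyond main memory.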
7 DISCUSSION
Sparse vs. Dense As a sparse scheme to store aggregates,
nanocubes suffer from the same general drawbacks as any sparse data
structure when compared with a dense one. Namely, when the occupancy
(i.e., the fraction of the key space covered by the inserted objects) is large,
the extra logic and memory needed to handle pointers is wasteful.
A dense mechanism using implicit addressing on arrays has simpler
logic, faster access, and uses less memory. On the other hand, when
considering multidimensional datasets, the memory requirements of
a dense mechanism quickly become impractical (see the cardinality
column, |K*|, of Figure 13). Except for the SPLOM experiments, every
other dataset would require at least 2^53 memory locations to represent
only the finest bin summaries (e.g., counts), without even considering
the memory needed to represent aggregates. This requirement is
simply overwhelming. In contrast, a smart sparse scheme like
nanocubes handles those datasets well with present-day technology.
Obviously, we cannot expect nanocubes to solve the combinatorial
explosion that happens when an arbitrary number of dimensions is
considered, but they push the size limits of the multidimensional datasets
for which we can have a smooth interactive visualization experience.
Other data cubes for visualization It is enlightening to compare
nanocubes to recent data cube visualization proposals: Datavore [17]
and imMens [20] (see Figure 15). For this discussion, we assume input
objects inducing keys K and aggregate keys K_a. Datavore’s algorithms
behave well in the sparse regime, but cannot handle very large
datasets, because its querying time appears to be proportional to |K|.
imMens, on the other hand, has extremely fast query times (below
1ms per query, and apparently O(1)), but is designed for the dense
regime, and uses memory proportional to the cardinality of its key
space, O(|K*|). This limits the size of the key space, and we observe
that although imMens reports experiments varying N from 10^6 to 10^9,
the values of |K*| in all experiments were within a factor of 5 of one
another [20]. The hierarchical nature of a nanocube’s spatiotemporal
index provides advantages in the fidelity of resulting visualizations
for a much larger set of scales than imMens (Datavore supports exact
visual encodings across any dimensions, but cannot cope with large-scale
datasets). This same hierarchical nature provides nanocubes with
natural offscreen culling: the region visible onscreen can be interpreted
as a spatiotemporal selection, reducing the total processing necessary.
[Figure 14: three line plots. (A) Nanocube size in MB versus number of objects inserted (0 to 1B), with one curve each for SPLOM 10^5, 20^5, 30^5, 40^5, and 50^5. (B) Memory usage and number of keys, each as a fraction of its final value, versus number of inserted CDRs (0 to 1B). (C) The same two curves versus number of inserted flights (0 to 125M).]
Fig. 14. (A) Nanocube memory usage growth with number of elements, using the SPLOM benchmark by Kandel et al. [17]. Notice the
plateauing in memory usage due to key saturation. On the right, the
growth of memory usage and number of keys when inserting objects
into nanocubes for the Call Detail Records (B) and Flights (C) datasets.
Method          Space               Query Time   Constraints
Datavore [17]   O(|K| log_2 |K*|)   O(|K|)       |K| ≤ Main Mem.
imMens [20]     O(|K*|)             O(1)         |K*| ≤ GPU Mem.
Nanocubes       O(|f(K_a)|)         O(1)         |f(K_a)| ≤ Main Mem.
Fig. 15. Comparison of observed asymptotic resource usage of recent
methods. Set K corresponds to the input keys and set K_a to the aggregate
keys induced by K. The key space, K*, is a set that grows quickly with
resolution and number of dimensions. Function f reflects the nanocube’s
sharing mechanism and has an important compression effect on the
already sparse set K_a (see the sharing column in Figure 13): |f(K_a)| ≤ |K_a|.
8 LIMITATIONS AND FUTURE WORK
Nanocubes offer efficient storage and querying of large, multidimensional,
spatiotemporal datasets, but are not without limits. Unlike a traditional
database, nanocubes do not support queries down to individual records.
Our index was designed specifically to answer queries
from interactive visualization systems that explore massive datasets by
building a variety of visual encodings.
Our current server API encourages much chattier communication
than is necessary, peaking at hundreds of HTTP requests per second.
This happens when brushes are being moved in the C++ client, but
could be avoided by techniques like query queueing and prediction [5].
The current nanocube implementation allows for only one spatial
dimension and one temporal dimension. It would clearly be useful
to allow schemata that include multiple spatial dimensions so that
one could visualize, for example, the geographical distribution of the
destinations of flights leaving a given geographical area. Similarly,
phone calls have two natural geographical dimensions.
Nanocubes still take more memory than we would like. We picked
one example to clearly demonstrate this: when indexing all six dimensions,
the 210 million points from Twitter take around 45GB of
memory. This fits in the memory of a present-day server, but exceeds
that of typical laptops and workstations. We envision dynamic control
over the cardinality of dimensions, but leave that for future work. We
would also like to explore hybrid solutions that utilize both on-disk
and in-memory data structures to enable more complex nanocubes.
REFERENCES
[1] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica.
Blinkdb: Queries with bounded errors and bounded response times on
very large data. In Proceedings of EuroSys, to appear. ACM, 2013.
[2] American Statistical Association Data Expo. Flights dataset, 2009.
http://stat-computing.org/dataexpo/2009.
[3] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents.
IEEE Transactions on Visualization and Computer Graphics, 17(12),
2011.
[4] D. B. Carr, R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical
Association, 82(398), 1987.
[5] S.-M. Chan, L. Xiao, J. Gerth, and P. Hanrahan. Maintaining interactivity while exploring massive time series. In IEEE Symposium on Visual
Analytics Science and Technology, pages 59–66. IEEE, 2008.
[6] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: User
movement in location-based social networks. In Proceedings of SIGKDD.
ACM, 2011.
[7] F. C. Crow. Summed-area tables for texture mapping. SIGGRAPH Comput. Graph., 18(3):207–212, Jan. 1984.
[8] Q. Cui, M. Ward, E. Rundensteiner, and J. Yang. Measuring data abstraction quality in multiresolution visualizations. IEEE Transactions on
Visualization and Computer Graphics, 12(5):709–716, Sept. 2006.
[9] A. Das Sarma, H. Lee, H. Gonzalez, J. Madhavan, and A. Halevy. Efficient spatial sampling of large geographical tables. In Proceedings of the
2012 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’12, pages 193–204, New York, NY, USA, 2012. ACM.
[10] A. Dix and G. Ellis. by chance: enhancing interaction with large data sets
through statistical sampling. In Proceedings of the Working Conference
on Advanced Visual Interfaces, AVI ’02, pages 167–176, New York, NY,
USA, 2002. ACM.
[11] N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for information
visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May
2010.
[12] J.-D. Fekete and C. Plaisant. Interactive information visualization of a
million items. In Proceedings of the IEEE Symposium on Information
Visualization (InfoVis’02), INFOVIS ’02, pages 117–, Washington, DC,
USA, 2002. IEEE Computer Society.
[13] D. Fisher, I. Popov, S. Drucker, and m. schraefel. Trust me, i’m partially
right: incremental visualization lets analysts explore large datasets faster.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’12, pages 1673–1682, New York, NY, USA, 2012.
ACM.
[14] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals. Data Mining
and Knowledge Discovery, 1(1):29–53, 1997.
[15] M. Haklay and P. Weber. Openstreetmap: User-generated street maps.
Pervasive Computing, IEEE, 7(4):12–18, 2008.
[16] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. SIGMOD Rec., 26(2):171–182, June 1997.
[17] S. Kandel, R. Parikh, A. Paepcke, J. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment.
In Advanced Visual Interfaces, 2012.
[18] D. A. Keim. Designing pixel-oriented visualization techniques: Theory and applications. IEEE Transactions on Visualization and Computer
Graphics, 6(1):59–78, Jan. 2000.
[19] J. LeBlanc, M. O. Ward, and N. Wittels. Exploring n-dimensional
databases. In Proceedings of the 1st conference on Visualization ’90, VIS
’90, pages 230–237, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.
[20] Z. L. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of
big data. Computer Graphics Forum (Proc. EuroVis), to appear, 2013.
[21] S. Lyubka. Mongoose: a small and easy to use web server.
https://github.com/valenok/mongoose/.
[22] J. Mackinlay. Automating the design of graphical presentations of relational information. ACM Trans. Graph., 5(2):110–141, Apr. 1986.
[23] J. Mackinlay, P. Hanrahan, and C. Stolte. Show me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and
Computer Graphics, 13(6):1137–1144, Nov. 2007.
[24] B. McDonnel and N. Elmqvist. Towards utilizing gpus in information
visualization: A model and implementation of image-space operations.
IEEE Transactions on Visualization and Computer Graphics,
15(6):1105–1112, Nov. 2009.
[25] F. Olken and D. Rotem. Simple random sampling from relational
databases. In Proceedings of the 12th International Conference on Very
Large Data Bases, VLDB ’86, pages 160–169, San Francisco, CA, USA,
1986. Morgan Kaufmann Publishers Inc.
[26] L. Richardson and S. Ruby. RESTful Web Services. O’Reilly Media,
2004.
[27] H. Samet. Foundations of Multidimensional and Metric Data Structures
(The Morgan Kaufmann Series in Computer Graphics and Geometric
Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
2005.
[28] B. Shneiderman. The eyes have it: A task by data type taxonomy for
information visualizations. In Proceedings of the 1996 IEEE Symposium
on Visual Languages, VL ’96, pages 336–, Washington, DC, USA, 1996.
IEEE Computer Society.
[29] Y. Sismanis, A. Deligiannakis, Y. Kotidis, and N. Roussopoulos.
Hierarchical dwarfs for the rollup cube. In Proceedings of the 6th ACM
International Workshop on Data Warehousing and OLAP, pages 17–24.
ACM, 2003.
[30] Y. Sismanis, A. Deligiannakis, N. Roussopoulos, and Y. Kotidis. Dwarf:
Shrinking the petacube. In Proceedings of ACM International Conference
on Management of Data (SIGMOD), pages 464–475, 2002.
[31] Y. Sismanis and N. Roussopoulos. The polynomial complexity of fully
materialized coalesced cubes. In Proceedings of the Thirtieth International
Conference on Very Large Data Bases - Volume 30, VLDB ’04, pages
540–551. VLDB Endowment, 2004.
[32] I. Square. Crossfilter: Fast multidimensional filtering for coordinated
views, 2013. http://github.com/square/crossfilter.
[33] C. Stolte, D. Tang, and P. Hanrahan. Query, analysis, and visualization
of hierarchically structured data using polaris. In Proceedings of the 8th
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pages 112–122, 2002.
[34] C. Stolte, D. Tang, and P. Hanrahan. Multiscale visualization using data
cubes. IEEE Transactions on Visualization and Computer Graphics,
9(2):176–187, 2003.
[35] H. Wickham. ASA 2009 data expo. Computational and Graphical Statistics, 20(2):281–283, 2011.