Nanocubes for Real-Time Exploration of Spatiotemporal Datasets
Transcription
Nanocubes for Real-Time Exploration of Spatiotemporal Datasets
Online Submission ID: 276 Nanocubes for Real-Time Exploration of Spatiotemporal Datasets Category: Research Fig. 1. Example visualizations of 210 million public geolocated Twitter posts over the course of a year. The data structure we propose enables real-time (these images above were rendered faster than the typical screen refresh rate) visual exploration of large, spatiotemporal, multidimensional datasets. The visual encodings built using nanocubes are within a controllable difference to ones rendered by a traditional linear scan over the dataset. They naturally support linked navigation and brushing, and include choropleth maps, time series over arbitrary regions and scales of space and time, parallel sets, histograms, and binned scatterplots. The color scale of the choropleth map is a diverging scale in which blue corresponds to iPhones being relatively more popular, and red corresponds to higher relative popularity of Android devices. Abstract—Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Data cubes are a well-known aggregration operation in relational databases. In a sense, they precompute every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop’s main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies. Index Terms—Data Cube, Data Structures, Interactive Exploration 1 I NTRODUCTION work of Sismanis et al. [30]. In addition, the query times to build the visual encodings in which we are interested will be at most proportional to the size of the output, which is bounded by the number of screen pixels (within a small factor). This is an important observation: the time complexity of a visualization algorithm should ideally be bounded the number of pixels it touches on the screen. Our technique enables real-time exploratory visualization on datasets that are large, spatiotemporal, and multidimensional. Because the speed of our data cube structure hinges partly on it being small enough to fit in main memory, we call them nanocubes. By real-time, we mean extremely fast queries: query times average under a millisecond for a single thread running on a computer that ranges from a laptop, to a workstation, to a server-class computing node (Section 6). By large, we mean that the datasets we support have millions to billions of entries. By spatiotemporal, we mean that nanocubes support queries typical of spatial databases, such as counting events in a spatial region that can be either a rectangle covering most of the world, or a heatmap of activity in downtown San Francisco (Section 4.3.1). As datasets get larger, real-time visualization becomes more difficult. Consider a dataset with a billion entries. If we compute a summary of the dataset and visualize it, the natural question becomes “does the summary represent the data well?” and the problem has simply shifted to “how can we visualize one billion residuals”? Even drawing the simplest scatterplot is not straightforward. If we decide to produce the visualization by scanning the rows of a table, we will either need non-trivial parallel rendering algorithms or significant time to produce a drawing. Neither of these solutions scales well with dataset size. Data cubes are structures that perform aggregations across every possible set of dimensions of a table in a database, to support quick exploration [14, 30]. Many visualization systems are built on top of data cubes, concretely or conceptually. Still, only recently have researchers started to examine data cube creation algorithms in the context of information visualization [32, 17, 20]. Data cubes are often problematic in that they can take prohibitively large amounts of memory as the number of dimensions increases. In Section 4, we show how to construct a data cube that fits in the main memory of a modern laptop computer or workstation, extending the 1 Five Tweets: Location and Device 1. o1 01,10 o2 `device ( ) = Android `device ( ) = iPhone `spatial1 0,0 proper 01,10 content (next dimension): iPhone Android `spatial2 01,11 10,11 11,11 00,10 01,10 10,10 11,10 00,01 01,01 10,01 11,01 00,00 01,00 10,00 11,00 4. 1,0 Indexing Schema S = [ [`spatial1 , `spatial2 ], [`device ] ] 1,0 01,10 iPhone Android o2 o3 o1 0,1 1,1 Android iPhone iPhone proper shared Android iPhone o1 00,11 shared 10,01 01,10 Android o5 parent-child (same dimension): 1,0 0,1 3. 0,1 o4 o3 0,1 2. 0,1 o1 o4 o2 1,1 o1 o2 o1 o2 o3 o1 o2 o3 o4 o2 o1 o4 updated in current step o3 o1 o2 o1 o2 o3 iPhone o2 o3 o5 Android o1 o4 o1 o2 o3 o4 o5 dimension boundary 1,1 1,0 0,1 01,10 Android iPhone o2 o3 5. 10,10 10,01 Android iPhone o1 o2 10,10 10,01 11,01 Android iPhone o1 iPhone o2 o1 o2 o3 iPhone iPhone o5 Android o4 o3 o5 Fig. 2. An illustration of how to build a nanocube for five points [o1 , . . . , o5 ] under schema S. The complete process is described in Section 4. By the same token, nanocubes support temporal queries at multiple scales, such as event counts by hour, day, week, or month over a period of years (Section 4.3.3). Data cubes in general enable the Visual Information-Seeking Mantra [28] of “Overview first, zoom and filter, then details-on-demand” by providing rollup summaries over all projections and letting users drill down by expanding the wanted dimensions. Nanocubes also provide overviews, filters, zooming, and details-on-demand inside the spatiotemporal dimensions themselves. By multidimensional, we mean that besides latitude, longitude, and time, each entry can have additional attributes (see section 6) that can be used in query selections and rollups. As we will show, nanocubes support queries that lend themselves very well to building visual encodings which are fundamental building blocks of interactive visualization systems, such as scatterplots, histograms, parallel coordinate plots, and choropleth maps. In summary, we contribute: techniques such as query prediction become necessary [5]. Leveraging the enormous power of graphics processing units has also become popular [24, 20], but without algorithmic changes, linear scans through the dataset will still be too slow for fluid interaction, even with GPUs. Another popular way to cope with large datasets is through sampling. Statistical sampling can be performed on the database backend [25, 1, 9, 13], or on the front-end [10]. Still, the techniques we introduce with nanocubes can produce results quickly and exactly (to within screen precision) without requiring approximations, which we believe is preferable. In addition, as Liu et al. argue, sampling by itself is not sufficient to prevent overplotting, and might actually mask important data outliers [20]. Fekete and Plaisant have proposed modifications of traditional visual encodings which use the computer screen more efficiently, and so scale better with dataset size [12]. Still, the proposal requires a traversal of all input data points, which is not a feasible approach at the scale which we want to approach here. Carr et al. were among the first to propose techniques replacing a scatterplot with an equivalent density plot [4]; nanocubes can be seen as a data structure to enable these visualizations at a variety of dataset sizes and scales. Careful data aggregation [16], then, appears to be one of the few scalable solutions for low-latency large data graphics. While Elmqvist and Fekete propose variations of visualization techniques that include aggregation as a first-class citizen [11], in this paper we show how to issue queries such that at the screen resolution in which the application is operating, the result is indistinguishable (or close to) from a complete scan through the dataset. We note that over-aggressive aggregation itself introduces potential discrepancies between the visualization and the dataset, and there are proposals to understand this [8]. We are interested in bounding the difference between our visual encoding and a visual encoding that would traverse the entirety of the data by the size of the screen, i.e. the number of pixels in it. Related to this, pixel-oriented techniques [18] have been investigated. However, these tend to focus on the development of new visual encodings, while in this paper we show how to create the already well-known and established encodings with low error, high performance and interactivity. Our technique is most closely related to the work of Sismanis et al. [30, 29]. Specifically, we use similar ideas to reduce the size of a data cube from potentially exponential in the cardinality of the key space (we define these terms in Section 4.1) to essentially linear [31]. Nanocubes improve on Sismanis et al.’s work in two fundamental directions. First, we develop a model for spatiotemporal data cubes that exploits unique characteristics of space and time to get a good com- • a novel data structure that improves on the current state of the art data cube technology to enable real-time exploratory visualization of multidimensional, spatiotemporal datasets; • algorithms to query the nanocube and build linked and brushable visual encodings commonly found in visualization systems; and • case studies highlighting the strengths and weaknesses of our technique, together with experiments to measure its utilization of space, time, and network bandwidth. 2 R ELATED W ORK Relational databases are so widespread and fundamental to the practice of computing that they were a natural target for information visualization almost since the field’s inception [19]. Mackinlay’s Automatic Presentation Tool is the breakthrough result that critically connected the relational structure of the data with the graphical primitives available for display [22] and ultimately lead to data cube visualization tools like Polaris [33, 34] and Show Me [23]. Nanocubes are specifically designed to speed up queries for spatiotemporal data cubes, and could eventually be used as a backend for these types of applications. In contrast, some of the work in large data visualization involves shipping the computation and data to a cluster of processing nodes. While parallelism is an attractive option for increasing throughput, it does not necessarily help achieve low latency, which is essential for fluid interactions with a visualization tool. As a result, sophisticated 2 Online Submission ID: 276 1: function NANO C UBE([o1 , o2 , . . . , on ], S, `time ) .n>0 1: procedure A DD(root, o, d, S, `time , updated nodes) 2: nano cube ← N ODE( ) . New empty node 2: [`1 , . . . , `k ] ← C HAIN(S, d) 3: for i = 1 to n do Online Submission ID: 276 3: stack ← T RAIL P ROPER PATH(root, [`1 (o), . . . , `k (o)]) 4: updated nodes ← 0/ 4: child ← null 5: A DD(nano cube, oi , 1, S, `time , updated nodes) 5: while stack is not empty do 6: 1: end for NANO C UBE([o1 , o2 , . . . , on ], S, `time ) function .n>0 1: procedure o, d, S, `time , updated nodes) 6: nodeA DD ← (root, P OP(stack) nano cubecubeN ODE( ) . New empty node 7: 2: return nano 2: [`1 , .update . . , `k ] ←Cfalse HAIN(S, d) 3: function for i = 1 to n do 7: 8: end 3: stackif node T RAIL ATH (root, 4: updated nodes 0/ 8: hasPaROPER singlePchild then[`1 (o), . . . , `k (o)]) 4: child S null 5: A DD(nano cube, oi , 1, S, `time , updated nodes) 9: ET S HARED C ONTENT (node, C ONTENT(child)) 1: function T RAIL P ROPER PATH(root, [v1 , . . . , vk ]) 5: while stack not empty do is null then 6: end for 10: else if CisONTENT (node) 2: stack ← S TACK( ) . New Empty Stack 6: node S ETPPOP (stack)C ONTENT(node, 7: return nano cube 11: ROPER 3: 8: Pend USH (stack, root) Online Submission ID: 276 7: update ( d= false function dim(S) ? 4: node ← root 8: if node has a single child then S UMMED TABLE T IME S ERIES( ) : N ODE( ) ) 5: for i = 1 to k do 9: S ET S HARED C ONTENT(node, C ONTENT(child)) function NT ANO C UBE ([o1 , o2P , .ATH . . , on(root, ], S, `time 1:1: function RAIL P ROPER [v1), . . . , vk ]) . n > 0 12: A DD(root, o, d,update true 6: child ←C (node, vi ) 1: procedure ,← updated nodes) nano N ODE . New.empty 10: else ifS,C`time ONTENT (node) is null then 2:2: stackcube S HILD TACK ( )( ) Newnode Empty Stack else CPONTENT IONTENT S S HARED (node) and 2: [`1 , 13: .11: . . , `k ] C HAIN (S,Sifd) i = 1 =tonull n do then 7: 3:3: iffor child ET ROPER C (node, P USH (stack, root) 0/ 3: stack T RAIL P ROPER P ATH (root, [` (o), . . . , 4: updated nodes C ONTENT not in updated nodes then 1 (node) Submission 8: 4: child root ← N EW P ROPER C HILD(node,Online vi , N ODE ( )) ID: 276 ( d= dim(S) ? `k (o)]) node 4: child14: null 5: A DD(nano cube, oi , 1, S, `time , updated nodes) S ET P ROPER C ONTENT S UMMED TABLE(node, T IME S ERIES( ) : N ODE( ) ) 9: 5:6: else for for iif=I S1StoHARED k do C HILD(node,child) then 5: while stack is not empty do end S HALLOW C OPY (C ONTENT (node))) 12: update true 10: 6:1: child ← R EPLACE C HILD (node, N ANO C UBE ([o , o , . . . , o ], S, ` ) . n > 0 6: node P OP (stack) 7: function return nano cube 1 (node, 2 child C HILD vn i ) time 15: true 1: procedure o, d,update ,← updated 2: ( ) then C OPY (child)) . New empty node 13: A DD(root, else ifS,C`time ONTENT I S Snodes) HARED (node) and 7: update false 8: endnano function child, S HALLOW 7: ifcube child =N ODE null 2: [`1 ,if . . .node , `k ] hasCaHAIN for i = 1 to n do 8: single then (node) 16: else(S,child ifd)CCONTENT ONTENT I S P ROPER not in(node) updatedthen nodes then 11: 8:3: end if child N EW P ROPER C HILD (node, vi , N ODE( )) 3: stack STET RAIL P ROPER PATH(root, [`1 C (o), . . . , `k (o)]) 4: updated nodes / (root, [v , . . . , v ]) Online Submission ID: 9: 276 S HARED CSupdate ONTENT (node, ONTENT (child)) 1: function T RAIL P ROPER P0 ATH 17: ← true 14: ET P ROPER C ONTENT (node, 1 k else if I S Scube, HARED CS,HILD (node,child) then 12: 9:5: P USH (stack, child) 4: child null A DD (nano o , 1, ` , updated nodes) 10: else if C ONTENT (node) is null then i time 2: stack S TACK( ) . New Empty Stack S HALLOW C OPY(C ONTENT(node))) 18:stack ifdo 5: while not end empty end for 13: 10:6: node ←child child 11: S ETis P ROPER C ONTENT (node, 3: P USH (stack, root) R EPLACE C HILD(node, 15: ( Pd= update true 6: node OP (stack) 19: if update then 7: return nano cube dim(S) ? child, S HALLOW C OPY (child)) 1: function N ANO C UBE ([o , o , . . . , o ], S, ` ) . n > 0 14: for root n 1 2 time 4:end node update false 16: else if`time CABLE ONTENT SERIES P ROPER (node) 8: end function So,UMMED T T IMEISnodes) () : N ODE( ) )then 1: 7: procedure20: A DD(root, d, S,if , dim(S) updated d= then nano cube . New empty node end for istack = 1 toifN k ODE do ( ) 15: 2:11:5:return 8: [`1 , . . .if has a single child thentrue true 2:12: ,17: `node C HAIN (S, d) update 3:12: for i = 1 to n do k ]update 6: child C HILD (node, v ) 21: I NSERT (C ONTENT (node), `time (o)) P USH (stack, child) i 16: end function 9: stack else S ET SPHARED ONTENT (child)) if C ONTENT IC Sif HARED (node) and 3:13: T RAIL ROPER PSATH (root,(node, [`1 (o),C .ONTENT . . , `k (o)]) function T RAIL ROPER PATH(root, [v1 , . . . , vk ]) 4: 1: updated nodes 0/then 18: end ifnode child =Pnull 13:7: child else 10: child 22: else if C ONTENT(node) (node) is null then C ONTENT not in updated nodes then 4: null stack S TACK ( ) . New Empty Stack 5: 2: A DD (nano cube, o , 1, S, ` , updated nodes) i P ROPER time 8: child N EW C HILD(node, vi , N ODE( )) 19: S ET P ROPER if update end for 11: while stack ONTENT (node, 23: Athen DD (C ONTENT(node), o, d+1, updated nodes) S ET P ROPER CCONTENT (node, 5:14: is not empty do 3: (stack, 6:14: for 9: endP USH if I S Sroot) HARED HILD (node,child) then 1: function Selse HALLOW C OPYC(node) 20: PSOP ifend dim(S) then (HALLOW d= dim(S) ? d=(C return stack C OPY ONTENT (node))) 6: node (stack) 4: node root 7:15: return nano cube 24: if 10: child R EPLACE C HILD (node, 2: 8:16:5: node sc← ( ) S UMMED TIABLE T IME SONTENT ERIES ( ) :(node), N ODE( )`)time (o)) 21:update NSERT (C 15: true 7: update false for i = 1N toODE k do end function end function child, S HALLOW C OPY(child)) 25: update trueI NSERT(updated nodes, C ONTENT (node)) 12: 22: ifhas 3: 11: (node else C ONTENT Selse P ROPER 8:16: if node a single Ichild then(node) then 6:S ET S HARED child HILD (node, vi ) sc, C ONTENT (node)) end if CCONTENT 26: end if 13: else ifSCHARED ONTENT I S S HARED (node) and (node), 9: S ET C ONTENT (node, C ONTENT (child)) 17: update true 7: if child = null then 23: A DD (C ONTENT o, d+1, updated nodes) 4: 1: 12: for v inTPRAIL C HILDREN (node) function P ROPER PL ATH (root, [v1 , . . . , vdo USH (stack, child) k ]) 1: S HALLOW CABELS (node) C ONTENT (node) not in then updated nodes then 27: child ← node 10:18: else if C (node) is if null 8: function EW POPY ROPER C HILD(node, vEmpty ( )) end ifONTENT i , N ODE stack SSchild TACK ( ) NC . v, New Stack node child 24: end sc, C HILD (node, v)) 5: 2: 13: N EW HARED HILD (node 14: S ET Pthen ROPER C ONTENT (node, 2: node ODE (C) HILD(node,child) then 9: P USH else if root) I S SN HARED 11:19: Supdate ET P ROPER ONTENT (node, if 28: endCwhile (stack, 14: end forsc 25: S HALLOW NSERT (updated nodes, C ONTENT(node)) CIOPY (C ONTENT (node))) 6: 3:10: for child R EPLACE C HILD (node, d= dim(S) dim(S) ?then 20: if(end d= 3: end S ET Sroot HARED ONTENT (node sc, C ONTENT(node)) 4: 15: node return stack C 29: procedure 26: end if 15: update true child, S HALLOW C OPY (child)) S UMMED T ABLE T IME S ERIES ( ) : N ODE( ) ) 7: 5: 16: return node sc 21: I NSERT (C ONTENT (node), ` (o)) time 4: end inkCdoHILDREN L ABELS(node) do for for i = 1v to function 16: else if C ONTENT I S P ROPER (node) then 27: child node end ifC HILD(node, vi ) update true else 8: end 6:11: child 5: function N EW S HARED C HILD(node sc, v, C HILD(node, v))12:22: 17: update true 13:23: else28: if C ONTENT S S HARED (node)o,and A DD (CIwhile ONTENT (node), d+1, updated nodes) end 18: endend Cifend ONTENT (node) not in updated nodes then 24: if procedure 29: 19: if update then 14:25: S ET P ROPER C ONTENT (node, I NSERT (updated nodes, C ONTENT(node)) 9: 3: I S S HARED C HILD (node,child) then (node)) Selse ET Sif HARED ONTENT (node sc, C ONTENT 20: ifSifHALLOW d= dim(S)C OPY then(C ONTENT Natural language query s c stack C 8: between endreturn function 26: endexample, 10:15: child Rusage EPLACE C HILD (node, promise space and efficiency of queries (Sections 4.2.1 For GROUP BY on(node))) 4: for v in C HILDREN L ABELS (node) do 21: I NSERT (C ONTENT (node), `attributes (o)) count of all Delta Device flights and Language R with the U R { De time 16: end function 15:27: update node true child child, S HALLOW C OPY (child)) HARED C HILDthese (node sc, v, C HILD(node, v)) the visualizaand 6).5: Second,N EW weS show how structures enable count aggregation function results in the relation C in Figure Note 22: count of all Delta flights in the Midwest R 4.(Midwest) R { De 16:28: else if else C ONTENT I S P ROPER (node) then end while Pseudo-code of an algorithm to build nanocubes. 11:Fig. endfor if 6: 3. are end 23: DDtrue (C ONTENT (node), o, d+1, updated nodes) count of early flights in 2010 U D tions12:which common in interactive tools (Section 4.3). that for Aevery different combination of values present in theRattributes 17:29: update end procedure 1: function S HALLOW C OPY (node) P USH (stack, child) 7: return node sc 24: end if time-series of all United flights into2009 R ways U R U exploits unique characteristics ofbuild spacedata and cube time to get a good Using thisrelation, notation,an it is easy to understand conventional 18: com- end if of a Ibase aggregation record issome added the resulting renode sc child N ODE(efforts ) There have been recent to structures specif13: 2: node 8: end function NSERT (updated nodes, Cheatmap ONTENTof (node)) Deltarelation: flights in 2010 D and(tile) R { De 19:25: update then between space usage andsc, efficiency (Sections 4.2.1 iflation. of describing aggregations for a these given groupare by,(Android, cube, 3: endSfor ET C ONTENT (node C ONTENT 14:promise forS HARED In our running example, combinations en), ically suited visualization. Crossfilter [32]of(node)) isqueries a software package 26: end if 20: if d= dim(S) then 4: 6). for vstack in C HILDREN L ABELS (node) and Second, we show how these structures the visualizarollup. Anode group by(iPhone, operationru). is one in which a relation is derived from Fig. 3.return Pseudo-code of an algorithm to builddo nanocubes. childI NSERT (iPhone, en), built15: on the clever observation that many inenable interactive visualFig. 4.`time simplified set of queries supported by the nanocube data structure. T 21:27: (Cand ONTENT (o)) 5: Nare EW S HARED C HILD (node sc, v,queries C HILD (node, v))4.3). 16:tions end which function common in interactive tools (Section a while base relation given a(node), list of Aattributes and an aggregate function. For the subset of that di 28: end D means next to R or all D contains The CUBE operation is “drilldown”. the resultTheofvalue collecting possible ization6:are incremental: assuming that previous results are available, 22: else end for example, group by on attributes Device and Language with the count 29: end procedure There have been recent efforts to build data cube structures specifdomain (“universe”). Omitted here, but supported by our structure, are: the 23: A DD (C ONTENT (node), o, d+1,aupdated nodes) 7: return node sc BY aggregations into single relation for a given list at- and drilldowns GROUP the results needed for the next query can be quickly computed. UnforCountry Device Language 1:ically function S HALLOW C OPY (node)related Oursuited technique most closely to the work et package time-based drilldown;C multiple categories aggregate function results in enthe relation in Figure 5. with separateofrollups for isvisualization. Crossfilter [32] of is Sismanis a software 24: end if US n GROUP 8: end function Android 2:al. node scnot Nsee ODEeasily ( ) we how tributes (i.e. 2 BYs for n input attributes). In our running tunately, we do this would work for the multiscale [30, 29]. Specifically, use similar ideas to reduce the size of I NSERT ONTENT built on the clever observation that many queries in interactive25: visualUS for nodes, iPhone C ru (node)) Note (updated that every combination of values present in the 3:a data S ET S HARED CaONTENT (node sc, C ONTENT (node)) frominpotentially exponential in the cardinality the key 26:Kan- end CUBE fordifferent count on Device and Language is the union queries spatiotemporal setting. Justresults asof recently, Fig.necessary 3. cube Pseudo-code of an algorithm to build nanocubes. Africa iPhone en if Souththe are incremental: assuming that previous are available, example, attributes of a base relation an aggregation record is added to the re4:ization for v in C HILDREN L ABELS (node) 4.1) do space (we define these terms in Section to essentially linear [31]. India Android en sulting relation. In our running example these combinations 27: child node of four GROUP BYs: on (1) no attributes; on (2) Device only; on (3) are (An- ind del et5:the al.results proposed Datavore, a column-oriented database that supports needed for the next query can be quickly computed. UnforN EW S HARED HILD (node sc, v, C HILD (node, v)) Nanocubes improve on C Sismanis et al.’s work in two fundamental di- 28: sulting relation. IniPhone our running example (Figure these ru). combinations Australia en droid, en), (iPhone, and 5) (iPhone, Theincube operation is the a w end while Language only; and (4) on Device anden),Language, shown relation fast data cube queries [17], Liu al. would leverage graphics 6:tunately, end First, for we do see easily how this work for the multiscale rections. wenot develop aand model foret spatiotemporal data cubes thathardware Country Device Language are (Android, en), (iPhone, en), and (iPhone, ru). The cube operation result of collecting all possible group by aggregations into a single re- by procedure Our technique is most closely fast related to the work of Sismanis et 29: end 7:queries return nodecharacteristics scin D in Figure Finally, a for ROLL UPlistisgroup constrained of the the cube de in imMens, achieving extremely large data Weaggregation US 4. the Android en a given represents idea of selecting a certain of exploits unique of spacequeries andsetting. timeover to get a as good com-[20]. An necessary a spatiotemporal Just lation ofa attributes. In ourversion running example, al.end [30, 29]. Specifically, we use similar ideas to reduce therecently, size of Kandel US summarizing iPhone this group ru records from a relation and using an aggregabetween space andcomparison efficiency of queries (Sections 4.2.1 defer8:apromise fullfunction discussion andusage direct of nanocubes to Datavore P USH=(stack, child) 7:12: if child nullC then S HALLOW 6:1: function end for 13: node child OPY(node) 8: 2:Pseudo-code child Nan EW (node,nanocubes. vi , N ODE( )) node sc node N )P ROPER C HILD Fig. 3. ofODE to build 7: return sc( algorithm 14: end for eta al. Datavore, a column-oriented database support fast datahave cubeproposed from potentially exponential in the cardinality of the key for count on Device and Language would be the same as the union en on (2) Device only; on (3) n aggregation for the relation above could be to select all its records and tions which are common in interactive tools (Section 4.3). Language Nanocubes improve on Sismanis et queries al.’s workover in two fundamental di-While we Australia iPhone en only; and (4) on Device and Language (2 group bys where mens, achieving extremely fast large data [20]. summarize thoseRelation using count. In thisn case, would be the final Language There First, have also been recent efforts to build data cube structures Cube onattributes): Device, A is the five number of input rections. we7develop a model for and spatiotemporal data cubes thatnanocubes defer to Section a full discussion direct comparison of D result. represents If we allowthe entries in selecting the recordatocertain have agroup special 3 Dexploits ATA C UBES specifically suited for visualization. Crossfilter [32]toisget a software An aggregation idea of of unique characteristics of space and time a good packcom- aggregation Country Device Language Our technique isclever most closely related to work14 of Sismanis etvi- achieve to Datavore and iMmens, in the Figure nanocubes value All, we acould represent this aggreation as the following relation: Country Device Language Count age built on the observation that many interactive US and summarizing Android en group using an aggregarecords from relation this promise between space usagewe andshow efficiency ofqueries queriesinthat (Sections 4.2.1 A al.good [30, 29]. Specifically, use similar ideas to reduce the of 4 a relaFollowing common practice, wesparse willthat call the table in size Figure All All All 5 sualization are incremental: assuming results are availperformance forweboth (theprevious regime where Datavore’s US iPhone ru For example, a possible tiondata function (e.g.count, sum, max, min). and 6). Second, we show how these structures enable the visualizaAll Android All 2 cube potentially exponential in thecan cardinality of the keyvalues. An Country Language Count able, the from results needed and for the next occupation query be quickly computed. tion,a data its columns lines records, and its entries South Africa Device iPhone structures are ideal) dense of key space (where iMaggregation for the relation above couldAllbe en to select all its All records and tions which areattributes, common in its interactive tools (Section 4.3). iPhone All 3 All All 5 space (we define these terms in easily Section 4.1) to would essentially linear [31]. India Android en Unfortunately, we do not see how this work for the multisummarize those using count. In this case, five would All be the final aggregation represents therecent ideaetof selecting a certain of records Thereare). have also efforts to build cube group structures mens’s All eu 4 Nanocubes improve onbeen Sismanis al.’s work insetting. twodata fundamental diAustralia iPhone en scale queries necessary in a spatiotemporal Just as recently, aggregation result. inIfawe allowthat entries in the have a special suited for visualization. Crossfilter [32] isdata aan software pack- We refer to records relation contain therecord specialtovalue as agAllAll All ru 1 from aspecifically relation and summarizing this group using aggregation funcrections. First, wehave develop a model for spatiotemporal cubes that Kandel et al. proposed Datavore, a column-oriented data cube value All, we could represent this aggreation as the following relation: iPhone ru 1 age built on the clever observation thatand many queries ina interactive vi- gregation records. Using thisthe notation, is easy toaunderstand the of conAggregation An aggregation represents certain All group unique space time to gethardware comBidea of itselecting tionexploits (e.g. sum, For example, agood possible 3sualization Dcount, ATA are Ccharacteristics UBES representation [17], andmax, Liu of etmin). al. leverage graphics inavailiM-aggregaAll Android en 2 incremental: assuming that previous results are ventional of describing aggregations for a given group records fromways a relation and summarizing this group usingrelation: an All aggregabetween space usage and efficiency ofall queries (Sections 4.2.1 mens, achieving extremely fast queries over large data [20]. While we tionpromise for the relation A could be to select its records and summarize Country Device Language Count iPhone en 2 able, results needed for the next query canenable be computed. cube, and rollup. A group by operation is one in whichaapossible relation is Following common practice, we will call thequickly table in Figure tion 5 by, a function rela(e.g.count, max, min). and 6). the Second, we show how these structures the visualizaB All sum,All All For example, 5 defer to Section 7 ayielding full discussion and direct comparison nanocubes iPhone ru 1 those using count, five as the aggregation result. Ifvalues. we allow Unfortunately, we do not see easily how this would work of for the multiderived from a base relation given a list andrecords anAllaggregate aggregation for the relation above could beof to attributes select all its and tions which areand common in interactive tools (Section 4.3). tion, its columns attributes, its lines records, and its entries An to Datavore iMmens, we show in Figure 13 that nanocubes achieve scalevalue queries necessary in a spatiotemporal setting.we Just as recently, function. For example, group In by oncontain attributes Device and Language a special All to be a valid attribute value, could represent this We refer to records in a relation that the special value All as agsummarize those using count. this case, five would be the final Finally, a roll up is a constrained version of the cube operation where There have also been both recent efforts build data structures aggregation the idea(the ofto regime selecting acube certain group good performance sparse where Datavore’s dataof records Group By on Device,results Language Kandel et al. represents have for proposed Datavore, a column-oriented data cube Cunderstand with the count aggregate function in thetoof relation: gregation records. this notation, is order easy the con- is important. So a roll up on Device aggregation IfUsing we allow entries init the record tothe have a special the input attributes aggregation as relation B Liu in Figure 4. record that contains spespecifically suited for visualization. Crossfilter [32] isusing aspace software pack-iM- the structures are ideal) and dense occupation of key (where from a relation and summarizing thisAgroup an aggregation func- result. representation [17], and et al. leverage graphics hardware in iM- value Country Device Language ventional ways of describing aggregations given relation: groupmeans All, we could represent this aggreation theaCount following relation: andasfor Language (in this order) the union builtAll on the clever observation that many queries inthis interactive vi- it is easy Equivalent to Group By onof group by’s on: (1) mens’s are). cialage value is an aggregation record. Using notation, mens, achieving extremely fast queries over large data [20]. While we Android en is one in 2 which a relation is tion (e.g. count, sum, max, min). For example, a possible aggregaby, cube, and rollup.AllA group by operation no attributes; (2) Device; and (3) Device sualization are incremental: assuming that previous results are availall possible subsetsand of Language. Note that the C All iPhone en 2 defer to Section 7 a full discussion and direct comparison of nanocubes to understand some conventional ways of describing aggregations for tiontheforresults the relation A could to select its records and summarize Device Count derived from a Country base relation givenLanguage a listgroup of attributes and an aggregate by on only is Language} not part of the roll up. As the results able, needed for the nextbequery can13 beall quickly computed. iPhone ru 1 Language{Device, to iMmens,yielding we show in Figure that nanocubes result. achieve If function. 3 Datavore DATA Cand UBES AllAll All 5 Device a given relation: GROUP BY, CUBE, and ROLL group by onAllattributes and Language those using count, five aswould the aggregation we al- For example, Unfortunately, we do not how this workUP. for the multiof group by’s, cubes and roll ups can be seen as relations, we can good performance for see botheasily sparse (the regime where Datavore’s data with the count aggregate function results in values thecompose relation: queries necessary in spatiotemporal setting. Just recently, naturally such operations. As we will describe nanocubes is In the relational databases community, table below isasa (where relation, its represent for every combination of present inasthe low a special value Alla dense to a valid attribute value, we Ascale GROUP BY operation isbeone in the which a relation is could derived from WeNote referthat to records in adifferent relation that contain the special value All ag-atstructures are ideal) and occupation of key space iMCountry Device Language Count Kandel et are). al. proposed Datavore, a column-oriented data a easy specialized structure to cubes of roll ups. columns arehave attributes, its lines areinrecords, and are cube values.function. tributesrecords. of the relation an aggregation record isdata added to contheaggregation re-store and query Fig.base 4. A sample relation and its associated operators. this aggregation asa relation B Figure 5. its gregation Using this notation, it is to understand the a base relation given list of attributes and anentries aggregation mens’s All Android en 2 representation [17], and Liu et al. leverage graphics hardware in iM- ventional waysFig. 5. All A sample relation and its associated aggregation operators. of describing aggregations for a given relation: group A record that contains the special value All is an aggregation record. iPhone 2 4en N ANOCUBES : A C OMPACT, S PATIOTEMPORAL R OLL -U P 3 mens, achieving extremely fast queries over large data [20]. While we by, cube, and operation ru is one in which a relation is iPhone 3 rollup. AAllgroup by 3 toDSection ATA C UBES C UBE 1 defer 7 a full discussion and direct comparison of nanocubes derived from a base relation given a list of attributes and an aggregate 3 to Datavore and iMmens, we show in Figure thatbelow nanocubes achieve its function. Dataofvisualizations inina computer In the relational databases community, the13table is a relation, Note that every different combination values present the at- are necessarily bounded by display Forfor example, group by on attributes Device and Language good performance for both sparse (the regime where Datavore’s data andrelation: so is weadded wouldtolike columns are attributes, its lines are records, and its entries are values. with tributes of theaggregate base relation an aggregation record the to re-be able to quickly collect subspaces of the count function resultssize, in the structures are ideal) and dense occupation of key space (where iMthe dataset Country Device Language Count that would end up in the same pixel on the screen. HowSouth Africa iPhone function (e.g.count, sum, max, min). example, and 6). Second, we[17], show7. how these structures enable thelinear visualizaof fourFor group by’s: aonpossible (1) no attributes; andFig. imMens until Section space (we define these inLiu Section 4.1) to essentially [31]. tion data queries and al. leverage graphics hardware in iMIndia Android en 3. cube Pseudo-code of anterms algorithm toetbuild nanocubes. tio sp is we qu (e. tur be da ea it na cie of fo qu 4. Le lab ar sio co the fac is Natural language query count of all Delta flights count of all Delta flights in the Midwest count of all flights in 2010 time-series of all United flights in 2009 heatmap of Delta flights in 2010 s R R R R D U Midwest U U tile0 c R R D R R { Delta } { Delta } { United } { Delta } t R R R D R U U 2010 2009 2010 URL /where/carrier=Delta /region/Midwest/where/carrier=Delta /field/carrier/when/2010 /tseries/when/2009/where/carrier=United /tile/tile0/when/2010/where/carrier=Delta Fig. 5. A simplified set of queries supported by nanocubes. The column s represents space; t, time; c, category. R means “rollup”, D means “drilldown”. The value next to R or D contains the subset of that dimension’s domain being selected. U represents the entire domain (“universe”). CUBE operation where the order of the input attributes is important. A ROLL UP on Device and Language (in this order) means the union of GROUP BYs on: (1) no attributes; (2) Device; and (3) Device and Language, but does not include the GROUP BY on Language only. As the results of GROUP BYs, CUBEs and ROLL UPs can be seen as relations, we can naturally compose such operators (e.g. a ROLL UP CUBE). 4 N ANOCUBE : A denoted by dim(S). The multiplicity of a schema S is the product of its chains’ number of levels: µ(S) = ∏ni=1 levels(ci ). A full assignment for a sequence of labeling functions [`1 , `2 , . . . , `k ] is a sequence of label values [v1 , v2 , . . . , vk ] where vi is a label value under `i . Any prefix of a full assignment for a sequence of labeling functions, including the empty one, is referred to as a partial assignment. Note that a full assignment is also a partial assignment since a sequence is also a prefix of itself. An address on a schema is a sequence of partial assignments for its chains, more formally, if S = [c1 , c2 , . . . , cn ] is an indexing schema, then a = [p1 , p2 , . . . , pn ] is an address of S if pi is a partial assignment for chain ci . The set of possible addresses of S is denoted by addr(S). The addresses of an object o under indexing schema S, denoted by addr(o, S) are all the addresses in addr(S) whose partial assignments are consistent with the label values associated to o and it is easy to see that the size of addr(o, S) is always µ(S). Besides a schema S, the definition of a nanocube requires a separate labeling function, `time : O → T , which we refer to as the time labeling function since we use it to encode the temporal aspect of our datasets. Thus, a nanocube for objects o1 , . . . , on is denoted by: C OMPACT, S PATIOTEMPORAL DATA C UBE Data visualizations in a computer are necessarily bounded by display size, and so we would like to be able to quickly collect subsets of the dataset that would end up in the same pixel on the screen. However, spatiotemporal navigation is inherently multiscale. The same data structure should support quick indexing for a visualization over multiple years of time series and for drilling down into one particular hour or day. Similarly, the data cube should support aggregation queries over vast spatial regions covering entire continents, as well as very narrow queries covering only a few city blocks. The database notion of ROLL UP, in a sense, aligns nicely with the notion of Level of Detail. For example, if the records of a table (relation) contain a location attribute, one can design a ROLL UP query whose resulting relation encodes the same information as the one encoded in a level of detail data structure. More concretely, suppose `1 , . . . , `k are attributes computed from the original location attribute and yield “quadtree addresses” of increasingly higher levels of detail (from 1 to k). A ROLL UP query on these (computed) attributes results in, essentially, the same information as the one contained in a quadtree (given that we are keeping the same summary in both, e.g. count). The second important notion in the design of nanocubes is the idea that we want to combine aggregations of independent dimensions at independent levels of detail. For example we might want to know for a whole country, what is the spatial distribution of tweets generated by an iPhone: coarse on the spatial dimension, but specific on the device dimension. Conversely, we might want to know the distribution of tweets (coarse on device) in a small city block (fine in space). In relational database terminology, this model has a name: it is a CUBE of ROLL UP, or a ROLL UP CUBE. With the terminology set, we can state: a nanocube is a data structure to efficiently store and query spatiotemporal ROLL UP CUBE. Besides implementation tricks, the main difference between nanocubes and previously published sparse coalesced data cubes such as Dwarf cubes [29] is in the design of aggregations across spatiotemporal dimensions (see Sections 4.3.1 and 4.3.3). Next, we present a formal description of the components that make up our nanocube index, pseudo-code for building nanocubes, an illustrated example, and how queries are made against our index. 4.1 NANO C UBE([o1 , . . . , on ], S, `time ) A key in a nanocube is any pair (a,t) where a ∈ addr(S) and corresponds to a full assignment (see definition above) and t ∈ T is a possible time label. If we remove the requirement of a being a full assignment, we say that pair (a,t) is an aggregate key. Note that every key is also an aggregate key. The set of all possible keys and the set of all possible aggregate keys of a nanocube are respectively referred to as its key space, or K ? , and its aggregate key space, or Ka? . The size of the key space, |K ? |, is referred to as its cardinality. 4.2 Building the Index To ease the remaining exposition, we assume that a nanocube maps an aggregate key to a count. However, nanocubes support any kind of summary that is an algebra with weighted sums and subtractions, and thus works for any linear statistics, including means and variances. The pseudo-code for building a nanocube is presented in Figure 3. The main idea of the algorithm is for every object oi to first find the finest address of the schema S hit by this object, update the time series associated with this address and from there on update in a deepest first fashion, all coarser addresses also hit by oi . Note that the content of the last dimension of schema S is always a time series and that is why, in line 21 of A DD, we insert the time label of the current object. The important trick used is to, when possible, allow for shared links across dimensions (dashed blue lines in Figure 2) and in the same dimension (dashed black connections). In real use cases this sharing is responsible for significant memory savings and enables exploring even larger datasets on small laptops. The functions not defined in the pseudocode should have its semantics easy to understand from its name. Definitions Let O be a set of objects. A labeling function ` : O → L associates a label value to the objects of O. We can think of ` as an attribute in a relational database. In connection with the level of detail discussion above, if `1 and `2 are two labeling functions for O, we say `1 is coarser than `2 or that `2 is finer than `1 if for any two objects o, o0 ∈ O the implication `2 (o) = `2 (o0 ) ⇒ `1 (o) = `1 (o0 ) holds. We denote this fact by `1 < `2 . A sequence of labeling functions c = [`1 , `, . . . , `k ] for objects O is a chain for O if every labeling function is coarser than the next labeling function in the sequence: `i < `i+1 . The number of levels of a chain is defined by levels(c) = |c| + 1. An indexing schema for objects O consists of a sequence of chains S = [c1 , c2 , . . . , cn ]. The dimension of an indexing schema S is the length of its sequence of chains and is 4.2.1 Nanocube Example Consider the scenario where an analyst is interested in understanding the spatiotemporal distribution of Twitter data (i.e. tweets), including which devices (e.g. iPhone, Android) people are using. Natural questions to ask include: Which device is more popular for tweeting? Is one device more popular in certain areas than in others? How has this popularity changed over time? We illustrate the construction of a nanocube built using Twitter data in Figure 2. For clarity, this example contains only five tweets o1 , . . . , o5 , all ordered in time. 4 Online Submission ID: 276 event 0 1 2 3 4 5 6 7 8 9 10 11 12 13 time query/tseries/1/3/4 start at bin 1, use buckets of 3 bins each, and collect 4 of these buckets solve using... bin: 1 bin: 3 bin: 4 bin: 6 bin: 10 accum: 2 accum: 3 accum: 4 accum: 7 accum: 9 A Summed Table Sparse Representation for Counts result 3 4 0 2 Fig. 6. An illustration of the summed-area table variant we use for our time series indexing scheme. Every node in Figure 2 stores an array of timestamped counts like the one in this figure. Fig. 7. Which device is more popular for tweeting: iPhone (blue) or Android (orange)? This choropleth map highlights areas in which devices are more popular based on a sample of 210M tweets. When we zoom in to Chicago we can observe something not seen from the overview display: south and west of the city, Android is more popular than iPhone. As shown on the top-left map of Figure 2, the first two tweets (o1 and o2 ) were sent from the east coast of the United States; the third tweet (o3 ), from South Africa; the fourth tweet (o4 ) was sent from Asia, and the fifth tweet (o5 ) from Australia. Tweets o1 and o4 were sent from an Android device while o2 , o3 , and o5 were sent from an iPhone device. The labeling functions `device , `spatial1 , and `spatial2 as well as the schema of this nanocube, S, are all defined on the left part of this figure. The labeling `device assigns a device to each tweet and `spatial1 and `spatial2 assign a spatial label to each tweet. The tweet labels given by `spatial1 and `spatial2 are essentially addresses in a quadtree partition of a square. Note that `spatial1 is coarser than `spatial2 . The right part of Figure 2 presents intermediate nanocubes generated by NANO C UBE (Figure 3) after each tweet is inserted. gregate values for multiple branches in that dimension. In a single nanocube query, each dimension is independently set to be used as the basis for either a rollup or a drilldown. In Figure 5, we provide a set of example queries and their mapping to the server query URL. It is worth noting that the order of the d dimensions does not impact the worst-case query run-time. For example, a marginal barchart of a categorical dimension (with k bars), requires O(kd) time, regardless of the category chosen or the ordering of the dimensions. 4.3.1 4.3 Querying the Cube Nanocubes support three distinct dimension types, which are always traversed in a fixed order: spatial, categorical, and finally temporal. Before describing queries for each of these specific dimension types, we first illustrate how simple queries are conducted on nanocubes using an example. Recall that the end result of the query will be to return precomputed aggregates across one or more dimensions. In Figure 2(5), assume we are interested in the count of all tweets that occurred in the northwest quadrant of the world, regardless of the device type and time. The aggregate key ka = ((p1 , p2 ),t) for this query consists of: (1) the partial assignment for the northwest quadrant in the spatial dimension: p1 = [0, 1]; (2) the empty partial path for the device dimension p2 = [ ] indicating any device; and (3) a time label t indicating any time. Finding the precomputed aggregate for a given aggregate key is called a simple query. In this example, we start at the top-most node and traverse all black parent-child links described in the partial assignment p1 : in this case only the black [0,1] link. We next cross the dimension boundary line by traversing the (blue) content link of the current node. The traversal process is repeated for the device dimension using the partial assignment p2 . In this specific case, no restrictions are made on the device, and we can jump to the next dimension by traversing the content link. At this point, we reach a leaf node containing {o1 , o2 }. Since no time constraint is imposed, the count of elements inside the leaf (2) is the answer for the query. Note that, for each dimension, a simple query only traverses a single path of its tree before jumping to the root node of a tree in the next dimension (or to a leaf node which encodes time and is treated differently). In general, higher level queries might traverse multiple paths of a single tree, and may also report single aggregates, multiple aggregates, or even combine aggregates from multiple branches. To abstract and classify how a general nanocube query processes a dimension, we use the terminology of rollups and drilldowns (the ROLL UP relational operation is related but has a different meaning than the one we intend here). The dimension that is the basis of a rollup should report a single aggregate value as a result. This aggregate might be a single existing aggregate in the nanocube or a combination of multiple aggregates from different branches of that dimension. A drilldown reports ag- Spatial Queries In our current implementation, the first dimension to be traversed in a nanocube is always the spatial dimension. It is helpful to think of this dimension as being represented by a traditional quadtree [27], where each quadtree node is enriched by an extra pointer (content pointer) that jumps to the next dimension of the nanocube. If a query matches exactly the region represented by a quadtree node, then the content pointer of that node is the gateway for all aggregates that refers precisely to that region. If the query includes categorical restrictions (or drilldowns), then these can be found by traversing down the following categorical dimensions, as described below. However, spatial regions will very rarely match exactly one node in the quadtree; therefore, we use the traditional region quadtree intersecting algorithms to compute the minimal disjoint set of quadtree nodes that exactly cover the query region [27], and sum the resulting rollups across the nodes. Arbitrarily shaped regions are not currently supported for spatial queries because of the additional complexity that is introduced, but there is no major intrinsic barrier in the nanocubes framework which prevents them from working. Instead, we support regions defined by the tiling scheme of most mapping services on the WWW. For example, the widest tile in the world in OpenStreetMap [15] has coordinates (0, 0, 0), while a tile for block-level maps of downtown Los Angeles might have coordinates (22485, 52342, 17). The first two coordinates are integer addresses, and the third coordinate corresponds to the zoom level: going down a zoom level doubles the resolution in both x and y. Our spatial drilldowns are then specified by a tile (x, y, z) address and an additional integer resolution, which denotes how many levels to break down space inside the tile. Traditionally, tiles from mapping services are squares with 256 pixels on the side, which corresponds in our case to a resolution of 8. Since our spatial drilldowns return an array of counts broken down by latitude and longitude, they are the basis for spatial density plots and choropleth maps. 4.3.2 Categorical Queries Categorical dimensions in a nanocube are represented by flat trees, which always contain a root node with potentially as many children as there are different values in that category. To restrict the domain 5 A A B 9/11 Fig. 8. A perspective on the history of American Airlines (blue) and Delta (orange). The time series show the weekly percentage of the total number of commercial flights in the United States. Note the presence of aligned spikes as well as opposing spikes. After 9/11 Delta saw a positive spike where American saw a negative one. The big bump on American Airlines after 9/11 was the merger with TWA. The heatmap shows the spatial hotspots of the two companies counting all flights after 9/11 (the time bar A can be dragged and resized to change the considered time window for the heatmap). Fig. 9. Two kinds of Customer Tickets: Type 1 (Red) and Type 2 (Blue). The heatmap on the left map corresponds to time bar A, and the one on the right to time bar B: both encode the difference between number of reports of Type 2 and Type 1 in each point of the map. Reports of Type 1 exceed reports of Type 2, but not everywhere: notice that the region of Denver is still blue. Zooming into Denver we see that the number of Type 1 reports has increased over time, but Type 2 still dominates. which makes it easy to plug in different data structures for each dimension of the nanocube. For example, for the Twitter data, we use a 2d quadtree for the spatial dimensions (latitude and longitude), and flat trees for each categorical dimension (e.g. language, device, application), and our summed-area table variant for the time dimension. The nanocube construction algorithm has not been optimized for speed (results are included in section 6) but there are several possible improvements that we could make: using multiple threads, or using memory pools to avoid the overhead of repeated memory allocations and deallocations. Due to the scale of the input data, most of our effort has been spent on optimizing memory usage, including optimized libraries for memory allocation (libtcmalloc) and tagged pointers, which allow us to use the 16 most significant bits in a 64-bit pointer to quickly identify different types of nodes in our data structure. The nanocube server exposes its API for queries via HTTP. More specifically, it provides a web service by which queries can be issued [26]. After the data cube is built, the data structures are no longer mutated, and so the server is easily parallelizable (it also means that nanocubes are add-only: they cannot be updated if a record is removed from the base relation). Our implementation uses the Mongoose library for handling multiple HTTP requests in separate threads concurrently [21]. We have built two front-end visualization clients to query the nanocube server. One client is written in C++ and uses OpenGL for efficient rendering. The other client is browser-based and is written in Javascript, HTML5, SVG, WebGL, and D3 [3]. Snapshots of the clients appear in the accompanying figures of the paper. the to a certain value of the category, the query engine simply follows the path down the child of the corresponding value. Categorical rollups are performed by simply returning the count corresponding to either the top-level node (in case of no restriction) or the child node (in case of a restriction). Categorical drilldowns are also similarly simple: they are a sparse array of all children with non-zero counts. We note that since categorical dimensions appear under spatial dimensions, answering spatial region rollups with either categorical restrictions or drilldowns requires combining the categorical rollups across all quadtree nodes that are reached by the region. An analogous phenomenon happens for nested drilldowns across multiple categories. For example, the binned scatterplot in Figure 11 can be built directly from the result of drilling down in both day of week and hour of day. The recombinable parallel set visualization of Figure 1 requires a triple breakdown of language, device and application. Single category drilldowns also trivially enable histogram plots. 4.3.3 Temporal Queries To represent the temporal dimension, we use a variant of summedarea tables [7], namely a sparse variant of the one-dimensional analogue (see Figure 6). Each time series in a node is stored as a dense, sorted array of cumulative counts, tagged by timestamp. With this data structure, we can compute a temporal rollup of event counts along any contiguous period, using only two binary searches: one to find the array element with the least upper bound of the period’s beginning, and another to find the greatest lower bound of the period’s end. The difference between these numbers is the total number of events in the period. A temporal drilldown happens similarly, and we can compute a time series with t entries by performing t + 1 binary searches. Each determines the breaking points in the cumulative array, and the final value is computed by stepwise differences. This scheme for storing time entries is attractive for several reasons. First, it ensures that we can store time series of any granularity without requiring a nested tree structure like our spatial indexing scheme. Second, the running time is essentially optimal (up to a log n factor), and the algorithm is extremely fast in practice. 5 6 E XPERIMENTS To study the behavior of our nanocube data structure, we collected six public and private datasets that ranged in size from four million records up to over one billion records. Each dataset includes geospatial, temporal, as well as data-specific categorical dimensions that range from only a few bins to as many as 30. For all but the synthetic dataset experiments, we included the geospatial time-series dimensions, and varied the other dimensions based on the datasets. In the following sections, we provide a brief overview of each of the datasets, followed by an overall summary of our experimental results in section 6.8. For each of the experiments, we paid particular attention to how much memory was required to build and store the nanocube index, as well as the overall complexity of the dataset itself, which varied greatly from one to the next. Once the nanocubes were constructed, we queried them using one or both of our front-end clients to highlight the ease with which analysts could explore the data. I MPLEMENTATION We use a client-server architecture for the current implementation of nanocubes. The server reads the multidimensional data, builds a nanocube, and then processes queries on the nanocube from client applications. The server is a C++11 template-based implementation 6 Online Submission ID: 276 Fig. 10. Highlights of a visual analysis session of the CDR dataset, with 1, 043, 884, 027 records. We noticed the different patterns in call volume by interacting with the dataset and trying different regions and category selections. Notice the patterns occur at different spatial and temporal scales. 6.2 Airline Commercial Flights History This publicly available dataset contains data for every commercial flight in the United States over a 20 year period (1987-2008) [2, 35]. For over 120 million flights, the records include the scheduled departure and arrival times, the actual departure and arrival times, the origin and destination airports, the airline, and other fields. For this experiment, we built our index using the origin airport (for latitude and longitude), scheduled departure time, the departure delay, and the airline. This allows us to answer queries related to overall departure delays for any airports, airlines, time of day, or combinations thereof. In Figure 8 we present an overview on the weekly percentages of total commercial flights in the U.S. for a 20 year period of Delta and American Airlines. The query times and bandwidth usage across all experiments are consistent, so we report them in aggregate here. The mean query time was 800µs (less than 1 millisecond) with a maximum of 12 milliseconds. The output size per query averaged 5KB, with a maximum size of 50KB (geographical tiles dominated bandwidth usage). Our server currently uses no compression, although we plan to support transparent gzip stream encoding. The mean number of queries for the C++ client was 100 requests per second. The HTML5 client is much quieter, at around 1 query per second, since linked views are only updated when a brush is released. The C++ client was designed for LANs, and its bandwidth usage is around 5Mbps, well within current capacities. 6.1 Twitter 6.3 Call Detail Records For each cellular phone call, telecommunications companies collect information about the call including time, duration, and the sequence of cell towers that carried the call. This information is organized into what are known as Call Detail Records (CDRs). A large U.S. service provider (privately) shared with us over one billion CDRs generated from a one month period in July 2010. Due to the sensitivity of CDR data, our data has been completely anonymized. No personally identifiable information was gathered or used in conducting this study. To the extent that any data was used, it was anonymous and aggregated data. The nanocube was built using the geospatial temporal data (of first cell tower), as well as the duration (transformed to a catagorical dimension) of each call (see Figure 10). Between November 2011 and June 2012, we collected about 210 million tweets that originated in the United States using Twitter’s public feed which provides a representative sampling of all tweets. The rate of tweets obtained averaged about one million per day. The data was streamed in the form of JSON objects, from which we extracted the following attributes: latitude and longitude of the device, the time the tweet occurred, the client application used, the type of device, and the language of the tweet. The categorical dimensions in our data (application, device, language) had respectively 4, 5, and 15 distinct values. With a nanocube built using this data, we could quickly explore the data to better understand the areas in which one device is more popular than another, where each of the languages is most prevalent, and how that information changes over time (see Figure 7). 7 Fig. 12. By supporting multiscale time series queries, we can explore the Brightkite checkin frequency to investigate global trends as well as short-lived events. Searching online for news about Brightkite and the event dates, we found that the iOS client for Brightkite was released exactly when the upward spike happened, and that the downward spike that appears to have ultimately caused the social network to lose its users was caused by a global outage that lasted a few days. Fig. 11. Selecting different geographical regions highlights how different populations interacted with the Brightkite social network. While in the US and UK there is no substantial difference between weekday and weekend traffic, in Japan weekday usage is markedly lower. 6.4 Location-Based Social Networks The next dataset is also publicly available, and consists of locationbased checkins in the Brightkite social network collected by Cho et al. [6]. The dataset comprises all data checkins from the (now-defunct) website between April 2008 and October 2010. In addition to latitude, longitude, and the time of each checkin, we redundantly encoded hour of day and day of week as extra categorical dimensions, since we expected there to be interesting periodic day-to-day and weekday vs. weekend patterns. In Figures 11 and 12, we show some results of visualizing this dataset with our HTML5 frontend. 6.5 growth that quickly flattens out. In the case of SPLOM 50, the index grew from 0 to 300MB with the first 200 million object insertions, but grew less than 100MB larger as a result of the next 800 million object insertions. The explanation for this behavior is that, by a characteristic of the synthetic object generator (samples from a normal distribution for each dimension) a key set of high probability was quickly generated making it harder and harder for a new object with an unseen key to be generated. Thus, later in the process, most inserted objects will not require more memory since their keys were already inserted into the nanocube. We refer to this phenomena as key saturation. In Figure 14(B), we present curves for memory usage and number of keys, both relative to the final nanocube numbers. The objects in this example are calls from the CDR dataset and the dimensions are: location of first cell tower hit by the call, call duration and initial time of the call. To test for a key saturation effect as seen in the SPLOM example, we excluded the time dimension present in the original data for this experiment. Once again, we observe an initial rapid growth on memory usage explained by the large number of combinations of cell locations and call durations not yet inserted. Once the bulk of the keys corresponding to these combinations are inserted, a relatively small but steady rate of new keys are inserted reflecting a small but steady growth in the cell tower infrastructure. Similarly defined curves for the Flights dataset are shown in Figure 14(C). We included the location of the flight’s origin airport, departure delay, and carrier, but again excluded the time dimension to test for the saturation effect. The first part of this experiment follows the same trend as before: rapid initial growth, followed by a saturation of keys and a steady but much slower growth reflecting the small rate of new airport locations and carriers. At about 80M inserted flights (circa 1995), we again observe a regime of rapid growth, which corresponds to a burst of new carriers. Customer tickets This dataset contains a record of about 8 million records of customer interactions of a large U.S. service provider over a period of 2.5 years. The dataset contains latitude, longitude, time and report type (one of eight categories). The same measures taken to anonymize CDR data in Section 6.3 were used here. In Figure 9, we highlight the use of nanocubes to detect relative changes in category in the time series plot, and how choropleth maps restricted to different time regions show the change in geographical distribution of the report types. 6.6 SPLOM This is a collection of synthetic datasets (each with five dimensions) designed by Kandel et al. [17] to exercise data cube technology (SPLOM stands for ScatterPLOt Matrix, the visual encoding used to explore the dataset in that work), also used by Liu et al. [20]. To compare resource usage to that of these other proposals, we built nanocubes using five different bin sizes per dimension, from 10 to 50. 6.7 Memory Usage To understand the memory requirements to build a nanocube, it is important to remember that objects are not inserted directly into the nanocube, but rather through their corresponding keys (see Figure 3). An object’s key identifies the most specific bin in the nanocube that contains that object. Thus, depending upon the resolutions defined for the dimensions of a nanocube, two different objects may or may not be distinguishable. For example, if the time resolution of a nanocube is one hour, two objects with timestamps at 20h10m and 20h50m will both have have keys with the same time label rounded to 20h. As a result, new occurrences of keys that were already inserted into a nanocube do not require additional storage space. Figure 14(A) shows the memory usage growth for the SPLOM dataset as we insert from zero to one billion objects into the five nanocubes of increasing bin size. In all cases there is an initial rapid 6.8 Performance Summary In Figure 13, we summarize the relevant information for building our nanocubes on the previously described datasets. The number of input objects N, the memory requirements, and the build times are reported in the first three columns, while the exact schema used for each dataset is included in the last column. Column “size” indicates the number of nodes in our data structure (in the nanocube of Figure 2(5) this number would be 41: 22 circles + 19 entries in the leaves). The “sharing” column indicates how much larger the nanocube would be without the sharing mechanism (dashed lines in Figure 2). For example, the twitter 8 Online Submission ID: 276 dataset brightkite customer tix flights twitter-small twitter splom-10 splom-50 cdrs objects (N) 4.5 M 7.8 M 121.0 M 210.0 M 210.0 M 1.0 B 1.0 B 1.0 B memory 1.6 GB 2.5 GB 2.3 GB 10.2 GB 46.4 GB 4.3 MB 166.0 MB 3.6 GB time 3.50 m 8.47 m 31.13 m 1.23 h 5.87 h 4.13 h 4.72 h 3.08 h size 149.0 M 213.0 M 274.0 M 1.2 B 5.2 B 51.2 K 8.8 M 271.0 M sharing 3.00x 2.93x 16.50x 3.72x 4.00x 5.67x 16.00x 18.60x keys (|K|) 3.5 M 7.8 M 43.3 M 116.0 M 136.0 M 7.4 K 1.9 M 96.3 M |K ? | 274 269 275 253 260 220 230 269 schema lat(25), lon(25), time(16), weekday(3), hour(5) lat(25), lon(25), time(16), type(3) lat(25), lon(25), time(16), carrier(5), delay(4) lat(17), lon(17), time(16), device(3) lat(17), lon(17), time(16), lang(5), device(3), app(2) d1(4), d2(4), d3(4), d4(4), d5(4) d1(6), d2(6), d3(6), d4(6), d5(6) lat(25), lon(25), time(16), duration(3) Fig. 13. Summary of resource usage for our reported experimental results (K=103 , M=106 , B=109 ). The numbers in parentheses on the schema column denote the number of bits necessary to refer to a value of that dimension, and their sum is the exponent of 2 in the |K ? | column. dataset nanocube would use at least 4 × 46.4GB = 185.6GB without sharing. Column “keys” is the number of distinct keys induced by the N objects (note here that the time dimension is included). Finally, column |K ? | reports the cardinality of the key space of each nanocube. All but two of the datasets fit in 4GB of RAM, and only one of them would not fit in 16GB, the amount of memory in a high-end laptop. The multiscale nature of spatiotemporal datasets make the cardinality of the key space impractically large for any dense storage scheme. 7 for which we can have a smooth interactive visualization experience. Other data cubes for visualization It is enlightening to compare nanocubes to recent data cube visualization proposals: Datavore [17] and imMens [20] (see Figure 15). For this discussion, we assume input objects inducing keys K and aggregate keys Ka . Datavore’s algorithms behave well in the sparse regime, but cannot handle very large datasets, because its querying time appears to be proportional to |K|. imMens, on the other hand, has extremely fast query times (below 1ms per query, and apparently O(1)), but is designed for the dense regime, and uses memory proportional to the cardinality of its key space, O(|K ? |). This limits the size of the key space and we observe that although imMens reports experiments varying N from 106 to 109 , the value of |K ? | in all experiments were within a factor of 5 of one another [20]. The hierarchical nature of a nanocube’s spatiotemporal index provides advantages in the fidelity of resulting visualizations for a much larger set of scales than imMens (Datavore supports exact visual encodings across any dimensions, but cannot cope with largescale datasets). This same hierarchical nature provides nanocubes with natural offscreen culling: the region visible onscreen can be interpreted as a spatiotemporal selection, reducing the total processing necessary. D ISCUSSION Sparse vs. Dense As a sparse scheme to store aggregates, nanocubes suffer from the same general drawbacks of any sparse data structure, when compared with a dense one. Namely, when the occupancy (i.e. key space covered by the inserted objects) is large, the extra logic and memory needed to handle pointers is wasteful. A dense mechanism using implicit addressing on arrays has simpler logic, faster access, and uses less memory. On the other hand, when considering multidimensional datasets, the memory requirements of a dense mechanism quickly becomes impractical (see the cardinality column of Figure 13). Except for the SPLOM experiments, every other dataset would require at least 253 memory locations to represent only the finest bin summaries (e.g. counts), without even considering the memory needed to represent aggregates. This requirement is simply overwhelming. On the other hand, a smart sparse scheme like nanocubes can handle well those datasets with present day technology. Obviously, we cannot expect nanocubes to solve the combinatorial explosion that happens when an arbitrary number of dimensions is considered, but it pushes the size limits of the multidimensional datasets 0.9 300 SPLOM 50^5 SPLOM 40^5 SPLOM 30^5 SPLOM 20^5 SPLOM 10^5 Fraction of Final Value A 0.8 0.7 B 0.6 0.5 0.3 0.2 Memory Usage Number of Keys 0.0 0M 200 250M 500M 750M 1B Number of Inserted CDRs 1.0 Memory Usage Number of Keys Fraction of Final Value 0.9 100 0.8 0.7 0.6 0.5 0.4 0.3 C 0.2 0.1 0 0.0 0m 200m 600m Number of Objects Inserted 1b 0M 25M 50M 75M 100M 125M Number of Inserted Flights Fig. 14. (A) Nanocube memory usage growth with number of elements, using the SPLOM benchmark by Kandel et al [17]. Notice the plateauing in memory usage due to key saturation. On the right, the growth of memory usage and number of keys when inserting objects into nanocubes for the Call Detail Records (B) and Flights (C) datasets. Query Time O(|K|) O(1) O(1) Constraints |K| ≤ Main Mem. |K ? | ≤ GPU Mem. | f (Ka )| ≤ Main Mem. 8 L IMITATIONS AND F UTURE W ORK Nanocubes offer efficient storage and querying of large, multidimensional, spatiotemporal datasets, but are not without limits. Nanocubes do not allow queries down to any individual record, like a traditional database. Our index was designed specifically to answer queries from interactive visualization systems that explore massive datasets by building a variety of visual encodings. Our current server API encourages much chattier communication than is necessary, peaking at hundreds of HTTP requests a second. This happens when brushes are being moved in the C++ client, but could be avoided by techniques like query queueing and prediction [5]. The current nanocube implementation allows for only one spatial dimension and one temporal dimension. It would be clearly useful to allow schemata that included multiple spatial dimensions so that one could visualize, for example, the distribution of geographical locations of flights leaving a certain different geographical area. Similarly, phone calls have two natural geographical dimensions. Nanocubes still take more memory than we would like. We picked one example to clearly demonstrate this: when indexing all six dimensions, the 210 million points from Twitter take around 45GB of memory. This is enough memory for a present-day server, but above that of typical laptops and workstations. We envision dynamic control over the cardinality of dimensions, but leave that for future work. We would also like to explore hybrid solutions that utilize both on-disk and in-memory data structures to enable more complex nanocubes. 0.4 0.1 Space O(|K| log2 |K ? |) O(|K ? |) O(| f (Ka )|) Fig. 15. Comparison of observed asymptotic resource usage of recent methods. Set K correspons to the input keys and set Ka to the aggregate keys induced by K. Key space, K ? , is a set that grows quickly with resolution and number of dimensions. Function f reflects nanocube’s sharing mechanism and has an important compression effect on the already sparse set Ka (see sharing col. in Figure 13): | f (Ka )| ≤ |Ka |. 1.0 400 Nanocube Size in MB Datavore [17] imMens [20] Nanocubes 9 R EFERENCES [1] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: Queries with bounded errors and bounded response times on very large data. In Proceedings of EuroSys, to appear. ACM, 2013. [2] American Statistical Association Data Expo. Flights dataset, 2009. http://stat-computing.org/dataexpo/2009. [3] M. Bostock, V. Ogievetskey, and J. Heer. D3: Data-driven documents. IEEE Transactoins on Visualization and Computer Graphics, 17(12), 2011. [4] D. B. Carr, R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical Association, 82(398), 1987. [5] S.-M. Chan, L. Xiao, J. Gerth, and P. Hanrahan. Maintaining interactivity while exploring massive time series. In IEEE Symposium on Visual Analytics Science and Technology, pages 59–66. IEEE, 2008. [6] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of SIGKDD. ACM, 2011. [7] F. C. Crow. Summed-area tables for texture mapping. SIGGRAPH Comput. Graph., 18(3):207–212, Jan. 1984. [8] Q. Cui, M. Ward, E. Rundensteiner, and J. Yang. Measuring data abstraction quality in multiresolution visualizations. IEEE Transactions on Visualization and Computer Graphics, 12(5):709–716, Sept. 2006. [9] A. Das Sarma, H. Lee, H. Gonzalez, J. Madhavan, and A. Halevy. Efficient spatial sampling of large geographical tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 193–204, New York, NY, USA, 2012. ACM. [10] A. Dix and G. Ellis. by chance: enhancing interaction with large data sets through statistical sampling. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI ’02, pages 167–176, New York, NY, USA, 2002. ACM. [11] N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May 2010. [12] J.-D. Fekete and C. Plaisant. Interactive information visualization of a million items. In Proceedings of the IEEE Symposium on Information Visualization (InfoVis’02), INFOVIS ’02, pages 117–, Washington, DC, USA, 2002. IEEE Computer Society. [13] D. Fisher, I. Popov, S. Drucker, and m. schraefel. Trust me, i’m partially right: incremental visualization lets analysts explore large datasets faster. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’12, pages 1673–1682, New York, NY, USA, 2012. ACM. [14] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29–53, 1997. [15] M. Haklay and P. Weber. Openstreetmap: User-generated street maps. Pervasive Computing, IEEE, 7(4):12–18, 2008. [16] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. SIGMOD Rec., 26(2):171–182, June 1997. [17] S. Kandel, R. Parikh, A. Paepcke, J. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Advanced Visual Interfaces, 2012. [18] D. A. Keim. Designing pixel-oriented visualization techniques: Theory and applications. IEEE Transactions on Visualization and Computer Graphics, 6(1):59–78, Jan. 2000. [19] J. LeBlanc, M. O. Ward, and N. Wittels. Exploring n-dimensional databases. In Proceedings of the 1st conference on Visualization ’90, VIS ’90, pages 230–237, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press. [20] Z. L. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Computer Graphics Forum (Proc. EuroVis), to appear, 2013. [21] S. Lyubka. Mongoose: a small and easy to user web server. https://github.com/valenok/mongoose/. [22] J. Mackinlay. Automating the design of graphical presentations of relational information. ACM Trans. Graph., 5(2):110–141, Apr. 1986. [23] J. Mackinlay, P. Hanrahan, and C. Stolte. Show me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics, 13(6):1137–1144, Nov. 2007. [24] B. McDonnel and N. Elmqvist. Towards utilizing gpus in informa- [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] 10 tion visualization: A model and implementation of image-space operations. IEEE Transactions on Visualization and Computer Graphics, 15(6):1105–1112, Nov. 2009. F. Olken and D. Rotem. Simple random sampling from relational databases. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB ’86, pages 160–169, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc. L. Richardson and S. Ruby. RESTful Web Services. O’Reilly Media, 2004. H. Samet. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005. B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, VL ’96, pages 336–, Washington, DC, USA, 1996. IEEE Computer Society. Y. Sismanis, A. Deligiannakis, Y. Kotidis, and N. Roussopoulos. Hierarchical dwarfs for the rollup cube. In Proceedings of the 6th ACM International Workshop on Data Warehousing and OLAP, pages 17–24. ACM, 2003. Y. Sismanis, A. Deligiannakis, N. Roussopoulos, and Y. Kotidis. Dwarf: Shrinking the petacube. In Proceedings of ACM International Conference on Management of Data (SIGMOD), pages 464–475, 2002. Y. Sismanis and N. Roussopoulos. The polynomial complexity of fully materialized coalesced cubes. In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, VLDB ’04, pages 540–551. VLDB Endowment, 2004. I. Square. Crossfilter: Fast multidimensional filtering for coordinated views, 2013. http://github.com/square/crossfilter. C. Stolte, D. Tang, and P. Hanrahan. Query, analysis, and visualization of hierarchically structured data using polaris. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 112–122, 2002. C. Stolte, D. Tang, and P. Hanrahan. Multiscale visualization using data cubes. IEEE Transactions on Visualization and Computer Graphics, 9(2):176–187, 2003. H. Wickham. ASA 2009 data expo. Computational and Graphical Statistics, 20(2):281–283, 2011.