Presentation DataStorm Kick-off
Environmental Data Processing and Communication Infrastructure
DataStorm 2013 Workshop on Large-Scale Data Management
Paulo Carreira | IST & INESC-ID

Source: GREENPLUM blog

Drivers

Green Economy
Will require integrating data faster from a wider range of data sources
[Diagram labels: Green Production, Metering & Sensing, Energy-Use Optimization, BPM Optimization, User Engagement, Smart Production, Green Services, Smart Grids, Smart Transportation, Green ICT, Green Consumption, Smart Facilities, Smart Consumption]

Shift towards predictive operation
[Timeline: historical data analytics (past), real-time data analytics (present), predictive operation (future)]

Enablers

Internet-of-Things
All objects in the world are identified with minuscule devices
• We could localize and count everything
Objects can communicate over a ubiquitous infrastructure
• Coordinating with each other to improve our well-being

Open Source Electronics
After the devastating tsunami in 2011, DIYers in Japan built their own devices to detect radiation levels, then posted their findings on the Internet.
Hackerspaces worldwide
DIY Geiger Counter

A couple of motivating examples

Smart Cities Example
Query: “Which groups of users had the smallest energy footprint over the last week?”

Smart Buildings Example
[Diagram: Smart Buildings, Building Aggregators, Utility Aggregator, Smart Grid]
Query: “Which office rooms will require more energy in the next two hours?”
Answer based on a simulation model!

Real-time Information Processing

Stream Processing Systems
[Diagram: sources, filters, channels, sinks]

Applications
• Finance (real-time trading decisions)
• Fraud detection
• Business Intelligence
• Building Automation
• Assisted Living
• Monitoring
• ...

Several Challenges
• Billions of events
• Detecting very complex event conditions
• Minimizing the development effort (to set up a system that reacts to complex situations)
• ...

Engineering Stream Processing Software
• Currently, stream processing software is (still) developed ad hoc
• In the 1970s, Database Management Systems detached applications from data storage logic
• By the 2000s, Data Stream Management Systems promise to detach applications from streaming data processing logic

Database Management Systems
[Diagram: user queries, persistent store, data sinks]
Data is stored; queries are run

Data Stream Management Systems
[Diagram: user queries, data sources, data sinks, persistence]
Queries are stored; data is run

DSMSs Available
• Research Systems
  ‣ TelegraphCQ, COUGAR, Borealis, STREAMON
• Commercial
  ‣ Oracle 11g RT
  ‣ (IBM DB2, Microsoft SQL-Server soon to follow)
• Open Source
  ‣ Esper, S4, ...

DBMS vs DSMS
• DBMS
  ‣ Queries are run when submitted and terminate
  ‣ Data is pulled
  ‣ Optimization occurs upfront
• DSMS
  ‣ Queries are installed and run continuously
  ‣ Data is pushed
  ‣ Optimization is dynamic

Streaming Queries: A short introduction

Preliminaries
• Model
  ‣ Abstracted as append-only relations with transient tuples
  ‣ Streams have a continuous and infinite nature
  ‣ The events in the stream arrive online
  ‣ Events arrive out-of-order

Challenges
• Unbounded Memory: Queries can require an unbounded amount of memory to evaluate precisely.
  ‣ Approximate query processing
  ‣ Sliding window query processing
• High data-flow rate: Data arrives at a pace (multi-GB) that floods the CPU
  ‣ Sampling
  ‣ Data synopsis

Basics
• Most Relational Algebra operators extend naturally to stream processing
• Select, project, join are non-blocking
• Therefore, SQL also extends naturally
• Some new operators need to be introduced

Real-Time Query Block
    select col_expr1, ..., col_exprn
    from stream_def
    where select_cond
    group by aggr_expr
    having having_cond
    output suppression_expr
    order by ordering_expr

Window
• Batch vs Sliding
• Based on length, based on time (or both)

Windows
Last 5 withdrawals with amount >= 200
    select * from Withdrawal.win:length(5) where amount >= 200
Withdrawals over the last 5 seconds
    select * from Withdrawal.win:time(5 sec)
Withdrawals at each 5 seconds (batch)
    select * from Withdrawal.win:time_batch(5 sec)

Aggregation and Grouping
Sum of the amounts every 5 seconds
    select sum(amount) from Withdrawal.win:time_batch(5 sec)
Accounts where the average withdrawal amount per account for the last hour of withdrawals is greater than 1000
    select account, avg(amount)
    from Withdrawal.win:time(1 hour)
    group by account
    having avg(amount) > 1000

Output control
The ids of temperature sensor events every 10 seconds
    select sensorId from TemperatureSensorEvent output every 10 seconds
The first of a series of temperature drops below 21 °C at the specified output rate
    select * from TemperatureSensor where temp<21 output first every 60 seconds
The total price for all orders in the 30-minute time window, output every minute but only after 30 minutes have passed
    select sum(price) from OrderEvent.win:time(30 min) output after 30 min snapshot every 1 min

Ordering
Batches of 5 or more stock tick events that are sorted first by price ascending and then by volume descending:
    select symbol from StockTickEvent.win:time(60 sec) output every 5 events order by price, volume desc

Insert and remove streams
What should be the result of the query
that alerts all accounts that have more than 2 withdrawals within the last 5 mins (of more than 1000 euro)?
    select acct_no, count(*) as n_withdrawals
    from Withdrawal.win:time(5 min)
    where amount > 1000
    group by acct_no
    having count(*) = 2

    time     input event       query output
    1 min    <123, 1100> +
    2 min    <123, 2000> +     <123, 2> U+
    3 min    <123, 3000> +     <123, 2> U-

Insert and remove streams
[Diagram: a stream of insert (+) updates and a stream of remove (-) updates]
• Sources, channels, operators and sinks must handle positive (insert) and negative (remove) update streams
• The semantics of all operators must be defined accordingly.

Insert and remove streams
Select only the events that are leaving the 30 second time window.
    select rstream * from TempSensor.win:time(30 sec)

Research Challenges/Opportunities

Ubiquitous integration
• Embedded stream processing
  ‣ Even small devices will have to integrate data from multiple sources and in real time to perform optimally
  ‣ How can they ship parts of the processing to peers and to the cloud?

Optimization and constraint solving
• Answering optimization queries
  ‣ E.g. what is the combination of actuators that maximizes comfort and minimizes energy consumption?
  ‣ This requires incorporating MAX-SAT solving algorithms into the stream query processors

Querying persistent and streaming data
• Efficiently answering queries that join streamed with persisted data (big data)
  ‣ Some very recent research (2008- ) on active data warehousing, new join and update algorithms
  ‣ Still unclear how this will scale to big data containers

Model-Driven Engineering
• Systematic approaches to developing ‘data intensive applications’
• Creating languages tailored for different users and domains
  ‣ Applications to energy/building automation domain (e.g.
energy-awareness built into the language)
  ‣ Simplifying (already overloaded) stream query languages
  ‣ Creating development environments that assist

Open Data Infrastructure
• Enabling open access to data storage and querying
  ‣ Standard Common Data Models
  ‣ Public data infrastructures
  ‣ Policy making
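
Appendix sketch: insert and remove streams

Returning to the insert and remove streams example, the behaviour of the `having count(*) = 2` query can be replayed in a few lines of plain Python. This is only a minimal sketch, not Esper and not the speaker's implementation: the function name `count_alerts`, the tuple layout and the rule that events expire only when a later event arrives (no timer) are all assumptions made for illustration.

```python
from collections import deque


def count_alerts(events, window=5, threshold=2):
    """Replay (minute, account, amount) events through a sliding time window.

    Hypothetical helper: emits (minute, '+', account, n) when the number of
    in-window withdrawals above 1000 for an account becomes equal to
    `threshold`, and (minute, '-', account, n) when it stops being equal.
    """
    in_window = deque()  # (minute, account) pairs currently inside the window
    counts = {}          # account -> qualifying withdrawals inside the window
    updates = []

    def change(account, delta, minute):
        old = counts.get(account, 0)
        new = old + delta
        counts[account] = new
        if old == threshold:   # the row <account, threshold> is no longer valid
            updates.append((minute, '-', account, old))
        if new == threshold:   # the row <account, threshold> becomes valid
            updates.append((minute, '+', account, new))

    for minute, account, amount in events:
        # Assumption: expiry is only checked when a new event arrives.
        while in_window and in_window[0][0] <= minute - window:
            _, expired_account = in_window.popleft()
            change(expired_account, -1, minute)
        if amount > 1000:
            in_window.append((minute, account))
            change(account, +1, minute)
    return updates


trace = [(1, 123, 1100), (2, 123, 2000), (3, 123, 3000)]
print(count_alerts(trace))
# [(2, '+', 123, 2), (3, '-', 123, 2)]
```

The third withdrawal does not produce a new row; it retracts the row emitted at minute 2, which is why every downstream operator and sink has to understand remove updates as well as inserts.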