GI-Edition Norbert Ritter, Andreas Henrich, Wolfgang Lehner, Andreas Thor, Steffen Friedrich, Wolfram Wingerath (Hrsg.): BTW 2015 – Workshopband Lecture Notes in Informatics 242 Norbert Ritter, Andreas Henrich, Wolfgang Lehner, Andreas Thor, Steffen Friedrich, Wolfram Wingerath (Hrsg.) Datenbanksysteme für Business, Technologie und Web (BTW 2015) – Workshopband 02. – 03. März 2015 Hamburg Proceedings Norbert Ritter, Andreas Henrich, Wolfgang Lehner, Andreas Thor, Steffen Friedrich, Wolfram Wingerath (Hrsg.) Datenbanksysteme für Business, Technologie und Web (BTW 2015) Workshopband 02. – 03.03.2015 in Hamburg, Germany Gesellschaft für Informatik e.V. (GI) Lecture Notes in Informatics (LNI) - Proceedings Series of the Gesellschaft für Informatik (GI) Volume P-242 ISBN 978-3-88579-636-7 ISSN 1617-5468 Volume Editors Norbert Ritter Universität Hamburg Fachbereich Informatik Datenbanken und Informationssysteme 22527 Hamburg, Germany E-Mail: [email protected] Andreas Henrich Otto-Friedrich-Universität Bamberg Fakultät Wirtschaftsinformatik und Angewandte Informatik Lehrstuhl für Medieninformatik 96047 Bamberg, Germany E-Mail: [email protected] Wolfgang Lehner Technische Universität Dresden Fakultät Informatik Institut für Systemarchitektur 01062 Dresden, Germany Email: [email protected] Andreas Thor Deutsche Telekom Hochschule für Telekommunikation Leipzig Gustav-Freytag-Str. 43-45 04277 Leipzig, Germany E-Mail: [email protected] Steffen Friedrich Universität Hamburg Fachbereich Informatik Datenbanken und Informationssysteme 22527 Hamburg, Germany E-Mail: [email protected] Wolfram Wingerath Universität Hamburg Fachbereich Informatik Datenbanken und Informationssysteme 22527 Hamburg, Germany E-Mail: [email protected] Series Editorial Board Heinrich C. Mayr, Alpen-Adria-Universität Klagenfurt, Austria (Chairman, [email protected]) Dieter Fellner, Technische Universität Darmstadt, Germany Ulrich Flegel, Hochschule für Technik, Stuttgart, Germany Ulrich Frank, Universität Duisburg-Essen, Germany Johann-Christoph Freytag, Humboldt-Universität zu Berlin, Germany Michael Goedicke, Universität Duisburg-Essen, Germany Ralf Hofestädt, Universität Bielefeld, Germany Michael Koch, Universität der Bundeswehr München, Germany Axel Lehmann, Universität der Bundeswehr München, Germany Peter Sanders, Karlsruher Institut für Technologie (KIT), Germany Sigrid Schubert, Universität Siegen, Germany Ingo Timm, Universität Trier, Germany Karin Vosseberg, Hochschule Bremerhaven, Germany Maria Wimmer, Universität Koblenz-Landau, Germany Dissertations Steffen Hölldobler, Technische Universität Dresden, Germany Seminars Reinhard Wilhelm, Universität des Saarlandes, Germany Thematics Andreas Oberweis, Karlsruher Institut für Technologie (KIT), Germany Gesellschaft für Informatik, Bonn 2015 printed by Köllen Druck+Verlag GmbH, Bonn Vorwort In den letzten Jahren hat es auf dem Gebiet des Datenmanagements große Veränderungen gegeben. Dabei muss sich die Datenbankforschungsgemeinschaft insbesondere den Herausforderungen von „Big Data“ stellen, welche die Analyse von riesigen Datenmengen unterschiedlicher Struktur mit kurzen Antwortzeiten im Fokus haben. Neben klassisch strukturierten Daten müssen moderne Datenbanksysteme und Anwendungen semistrukturierte, textuelle und andere multimodale Daten sowie Datenströme in völlig neuen Größenordnungen verwalten. Gleichzeitig müssen die Verarbeitungssysteme die Korrektheit und Konsistenz der Daten sicherstellen. 
Die jüngsten Fortschritte bei Hardware und Rechnerarchitektur ermöglichen neuartige Datenmanagementtechniken, die von neuen Index- und Anfrageverarbeitungsparadigmen (In-Memory, SIMD, Multicore) bis zu neuartigen Speichertechniken (Flash, Remote Memory) reichen. Diese Entwicklungen spiegeln sich in aktuell relevanten Themen wie Informationsextraktion, Informationsintegration, Data Analytics, Web Data Management, Service-Oriented Architectures, Cloud Computing oder Virtualisierung wider. Wie auf jeder BTW-Konferenz gruppieren sich um die Tagung eine Reihe von Workshops, die spezielle Themen in kleinen Gruppen aufgreifen und diskutieren. Im Rahmen der BTW 2015 finden folgende Workshops statt: • Databases in Biometrics, Forensics and Security Applications: DBforBFS • Data Streams and Event Processing: DSEP • Data Management for Science: DMS Dabei fasst der letztgenannte Workshop DMS als Joint Workshop die beiden Initiativen Big Data in Science (BigDS) und Data Management for Life Sciences (DMforLS) zusammen. Mit seinen Schwerpunkten reflektiert das Workshopprogramm aktuelle Forschungsgebiete von hoher praktischer Relevanz. Zusätzlich präsentieren Studenten im Rahmen des Studierendenprogramms die Ergebnisse ihrer aktuellen Abschlussarbeiten im Bereich Datenmanagement. Für jeden Geschmack sollte sich somit ein Betätigungsfeld finden lassen! Die Materialien zur BTW 2015 werden auch über die Tagung hinaus unter http://www.btw-2015.de zur Verfügung stehen. VII Die Organisation einer so großen Tagung wie der BTW mit ihren angeschlossenen Veranstaltungen ist nicht ohne zahlreiche Partner und Unterstützer möglich. Sie sind auf den folgenden Seiten aufgeführt. Ihnen gilt unser besonderer Dank ebenso wie den Sponsoren der Tagung und der GI-Geschäftsstelle. Hamburg, Bamberg, Dresden, Leipzig, im Januar 2015 Norbert Ritter, Tagungsleitung und Vorsitzender des Organisationskomitees Andreas Henrich und Wolfgang Lehner, Leitung Workshopkomitee Andreas Thor, Leitung Studierendenprogramm Wolfram Wingerath, Steffen Friedrich, Tagungsband und Organisationskomitee VIII Tagungsleitung Norbert Ritter, Universität Hamburg Organisationskomitee Felix Gessert Fabian Panse Volker Nötzold Norbert Ritter Anne Hansen-Awizen Steffen Friedrich Wolfram Wingerath Studierendenprogramm Andreas Thor, HfT Leipzig Koordination Workshops Andreas Henrich, Univ. Bamberg Wolfgang Lehner, TU Dresden Tutorienprogramm Norbert Ritter, Univ. Hamburg Thomas Seidl, RWTH Aachen Andreas Henrich, Univ. Bamberg Wolfgang Lehner, TU Dresden Second Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS) Vorsitz: Jana Dittmann, Univ. Magdeburg; Veit Köppen, Univ. Magdeburg; Gunter Saake, Univ. Magdeburg; Claus Vielhauer, FH Brandenburg Ruediger Grimm, Univ. Koblenz Dominic Heutelbeck, FTK Stefan Katzenbeisser, TU Darmstadt Claus-Peter Klas, GESIS Günther Pernul, Univ. Regensburg Ingo Schmitt, BTU Cottbus Claus Vielhauer, FH Brandenburg Sviatoslav Voloshynovskiy, UNIGE, CH Edgar R. Weippl, SBA Research, Austria Data Streams and Event Processing (DSEP) Vorsitz: Marco Grawunder, Univ. Oldenburg, Daniela Nicklas Univ. Bamberg Andreas Behrend, Univ. Bonn Klemens Boehm, KIT Peter Fischer, Univ. Freiburg Dieter Gawlick, Oracle Boris Koldehofe, TU Darmstadt Wolfgang Lehner, TU Dresden Richard Lenz, Univ. Erlangen-Nürnberg Klaus Meyer-Wegener, Univ. ErlangenNürnberg Gero Mühl, Univ. 
Rostock Kai-Uwe Sattler, TU Ilmenau Thorsten Schöler, HS Augsburg IX Joint Workshop on Data Management for Science (DMS) Workshop on Big Data in Science (BigDS) Vorsitz: Birgitta König-Ries, Univ. Jena; Erhard Rahm, Univ. Leipzig; Bernhard Seeger, Univ. Marburg Jens Kattge, MPI für Biogeochemie Alfons Kemper, TU München Meike Klettke, Univ. Rostock Alex Markowetz, Univ. Bonn Thomas Nauss, Univ. Marburg Jens Nieschulze, Univ. Göttingen Kai-Uwe Sattler, TU Ilmenau Stefanie Scherzinger, OTH Regensburg Myro Spiliopoulou, Univ. Magdeburg Uta Störl, Hochschule Darmstadt Alsayed Algergawy, Univ. Jena Peter Baumann, Jacobs Univ. Matthias Bräger, CERN Thomas Brinkhoff, FH Oldenburg Michael Diepenbroeck, AWI Christoph Freytag, HU Berlin Michael Gertz, Univ. Heidelberg Frank-Oliver Glöckner, MPI-MM Anton Güntsch, BGBM Berlin-Dahlem Thomas Heinis, IC, London Thomas Hickler, Senckenberg Workshop on Data Management for Life Sciences (DMforLS) Vorsitz: Sebastian Dorok, Bayer Pharma AG; Matthias Lange, IPK Gatersleben; Gunter Saake, Univ. Magdeburg Matthias Lange, IPK Gatersleben Ulf Leser, HU Berlin Wolfgang Müller, HITS GmbH Erhard Rahm, Univ. Leipzig Gunter Saake, Univ. Magdeburg Uwe Scholz, IPK Gatersleben Can Türker, ETH Zürich Sebastian Breß, TU Dortmund Sebastian Dorok, Bayer Pharma AG Mourad Elloumi, UTM Tunisia Ralf Hofestädt, Univ. Bielefeld Andreas Keller, Saarland Univ. Jacob Köhler, DOW AgroSciences Horstfried Läpple, Bayer HealthCare X Inhaltsverzeichnis Workshopprogramm Second Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS) Jana Dittmann, Veit Köppen, Gunter Saake, Claus Vielhauer Second Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS)....................................................................................................19 Veit Köppen, Mario Hildebrandt, Martin Schäler On Performance Optimization Potentials Regarding Data Classification in Forensics.....21 Maik Schott, Claus Vielhauer, Christian Krätzer Using Different Encryption Schemes for Secure Deletion While Supporting Queries........37 Data Streams and Event Processing (DSEP) Marco Grawunder, Daniela Nicklas Data Streams and Event Processing (DSEP)......................................................................49 Timo Michelsen, Michael Brand, H.-Jürgen Appelrath Modulares Verteilungskonzept für Datenstrommanagementsysteme..................................51 Niko Pollner, Christian Steudtner, Klaus Meyer-Wegener Placement-Safe Operator-Graph Changes in Distributed Heterogeneous Data Stream Systems...........................................................................................................61 Michael Brand, Tobias Brandt, Carsten Cordes, Marc Wilken, Timo Michelsen Herakles: A System for Sensor-Based Live Sport Analysis using Private Peer-to-Peer Networks........................................................................................................71 Christian Kuka, Daniela Nicklas Bestimmung von Datenunsicherheit in einem probabilistischen Datenstrommanagementsystem...........................................................................................81 Cornelius A. 
Ludmann, Marco Grawunder, Timo Michelsen, H.-Jürgen Appelrath Kontinuierliche Evaluation von kollaborativen Recommender-Systeme in Datenstrommanagementsystemen.......................................................................................91 Sebastian Herbst, Johannes Tenschert, Klaus Meyer-Wegener Using Data-Stream and Complex-Event Processing to Identify Activities of Bats .............93 Peter M. Fischer, Io Taxidou Streaming Analysis of Information Diffusion......................................................................95 XI Henrik Surm, Daniela Nicklas Towards a Framework for Sensor-based Research and Development Platform for Critical, Socio-technical Systems........................................................................................97 Felix Beier, Kai-Uwe Sattler, Christoph Dinh, Daniel Baumgarten Dataflow Programming for Big Engineering Data...........................................................101 Joint Workshop on Data Management for Science (DMS) Sebastian Dorok, Birgitta König-Ries, Matthias Lange, Erhard Rahm, Gunter Saake, Bernhard Seeger Joint Workshop on Data Management for Science (DMS) ...............................................105 Alexandr Uciteli, Toralf Kirsten Ontology-based Retrieval of Scientific Data in LIFE .......................................................109 Christian Colmsee, Jinbo Chen, Kerstin Schneider, Uwe Scholz, Matthias Lange Improving Search Results in Life Science by Recommendations based on Semantic Information........................................................................................................115 Marc Schäfer, Johannes Schildgen, Stefan Deßloch Sampling with Incremental MapReduce ...........................................................................121 Andreas Heuer METIS in PArADISE Provenance Management bei der Auswertung von Sensordatenmengen für die Entwicklung von Assistenzsystemen .....................................131 Martin Scharm, Dagmar Waltemath Extracting reproducible simulation studies from model repositories using the CombineArchive Toolkit ...................................................................................................137 Robin Cijvat, Stefan Manegold, Martin Kersten, Gunnar W. 
Klau, Alexander Schönhuth, Tobias Marschall, Ying Zhang Genome sequence analysis with MonetDB: a case study on Ebola virus diversity...........143 Ahmet Bulut RightInsight: Open Source Architecture for Data Science ...............................................151 Christian Authmann, Christian Beilschmidt, Johannes Drönner, Michael Mattig, Bernhard Seeger Rethinking Spatial Processing in Data-Intensive Science ................................................161 Studierendenprogramm Marc Büngener CBIR gestütztes Gemälde-Browsing .................................................................................173 David Englmeier, Nina Hubig, Sebastian Goebl, Christian Böhm Musical Similarity Analysis based on Chroma Features and Text Retrieval Methods .....183 XII Alexander Askinadze Vergleich von Distanzen und Kernel für Klassifikatoren zur Optimierung der Annotation von Bildern .....................................................................................................193 Matthias Liebeck Aspekte einer automatischen Meinungsbildungsanalyse von Online-Diskussionen .........203 Martin Winter, Sebastian Goebl, Nina Hubig, Christopher Pleines, Christian Böhm Development and Evaluation of a Facebook-based Product Advisor for Online Dating Sites.......................................................................................................................213 Daniel Töws, Marwan Hassani, Christian Beecks, Thomas Seidl Optimizing Sequential Pattern Mining Within Multiple Streams......................................223 Marcus Pinnecke Konzept und prototypische Implementierung eines föderativen Complex Event Processing Systeme mit Operatorverteilung...........................................................233 Monika Walter, Axel Hahn Unterstützung von datengetriebenen Prozessschritten in Simulationsstudien durch Verwendung multidimensionaler Datenmodelle.....................................................243 Niklas Wilcke DduP – Towards a Deduplication Framework utilising Apache Spark............................253 Tutorienprogramm Christian Beecks, Merih Uysal, Thomas Seidl Distance-based Multimedia Indexing ...............................................................................265 Kai-Uwe Sattler, Jens Teubner, Felix Beier, Sebastian Breß Many-Core-Architekturen zur Datenbankbeschleunigung ...............................................269 Felix Gessert, Norbert Ritter Skalierbare NoSQL- und Cloud-Datenbanken in Forschung und Praxis .........................271 Jens Albrecht, Uta Störl Big-Data-Anwendungsentwicklung mit SQL und NoSQL .................................................275 XIII Workshopprogramm Second Workshop on Databases in Biometrics, Forensics and Security Applications Second Workshop on Databases in Biometrics, Forensics and Security Applications Jana Dittmann1 , [email protected] Veit Köppen1 , [email protected] Gunter Saake1 , [email protected] Claus Vielhauer2 , [email protected] 1 2 Otto-von-Guericke-University Magdeburg Brandenburg University of Applied Science The 1st Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS) was held as satellite workshop of the BTW 2013. The workshop series is intended for disseminating knowledge in the areas of databases in the focus for biometrics, forensics, and security complementing the regular conference program by providing a place for in-depth discussions of this specialized topic. 
The workshop will consist of two parts: First, presentation of accepted workshop papers and second, a discussion round. In the discussion round, the participants will derive research questions and goals to address important issues in the domain databases and security. We expect the workshop to facilitate cross-fertilization of ideas among key stakeholders from academia, industry, practitioners and government agencies. Theoretical and practical coverage of the topics will be considered. We also welcome software and hardware demos. Full and short papers are solicited. Motivated by today’s challenges from both disciplines several topics include but are not limited to: • approaches increasing the search speed in databases for biometrics, forensics and security, • database validation procedures for integrity verification of digital stored content • design aspects to support multimodal biometric evidence and its combination with other forensic evidence • interoperability methodologies and exchange protocols of data of large-scale operational (multimodal) databases of identities and biometric data for forensic case assessment and interpretation, forensic intelligence and forensic ID management • database security evaluation and benchmarks for forensics and biometric applications • the role of databases in emerging applications in Biometrics and Forensics • privacy, policy, legal issues, and technologies in databases of biometric, forensic and security data. 19 1 Workshop Organizers Jana Dittmann (Otto-von-Guericke-University Magdeburg) Veit Köppen (Otto-von-Guericke-University Magdeburg) Gunter Saake (Otto von Guericke University Magdeburg) Claus Vielhauer (Brandenburg University of Applied Science) 2 Program Committee Ruediger Grimm (University of Koblenz, DE) Dominic Heutelbeck (FTK, DE) Stefan Katzenbeisser (Technical University Darmstadt, DE) Claus-Peter Klas (GESIS, DE) Günther Pernul (Universität Regensburg, DE) Ingo Schmitt (Brandenburg University of Technology, DE) Claus Vielhauer (Brandenburg University of Applied Science, DE) Sviatoslav Voloshynovskiy (unige, CH) Edgar R. Weippl (sba-research, Austria) 20 On Performance Optimization Potentials Regarding Data Classification in Forensics Veit Köppen, Mario Hildebrandt, Martin Schäler Faculty of Computer Science Otto-von-Guericke-University Magdeburg Universitätsplatz 2 39106 Magdeburg [email protected] [email protected] [email protected] Abstract: Classification of given data sets according to a training set is one of the essentials bread and butter tools in machine learning. There are several application scenarios, reaching from the detection of spam and non-spam mails to recognition of malicious behavior, or other forensic use cases. To this end, there are several approaches that can be used to train such classifiers. Often, scientists use machine learning suites, such as WEKA, ELKI, or RapidMiner in order to try different classifiers that deliver best results. The basic purpose of these suites is their easy application and extension with new approaches. This, however, results in the property that the implementation of the classifier is and cannot be optimized with respect to response time. This is due to the different focus of these suites. However, we argue that especially in basic research, systematic testing of different promising approaches is the default approach. Thus, optimization for response time should be taken into consideration as well, especially for large scale data sets as they are common for forensic use cases. 
To this end, we discuss in this paper, in how far well-known approaches from databases can be applied and in how far they affect the classification result of a real-world forensic use case. The results of our analyses are points and respective approaches where such performance optimizations are most promising. As a first step, we evaluate computation times and model quality in a case study on separating latent fingerprint patterns. 1 Motivation Data are drastically increased in a given time period. This is not only true for the number of data sets (comparable to new data entries), but also with respect to dimensionality. To get into control of this information overload, data mining techniques are used to identify patterns within the data. Different application domains require for similar techniques and therefore, can be improved as the general method is enhanced. In our application scenario, we are interested in the identification of patterns in data that are acquired from latent fingerprints. Within the acquired scanned data a two-class classification is of interest, to identify the fingerprint trace and the background noise. As point 21 of origin, experts classify different cases. This supervised approach is used to learn a classification and thus, to support experts in their daily work. With a small number of scanned data sets that the expert has to check and classify, a high number of further data sets can be automatically classified. Currently, the system works in a semi-automatic process and several manual steps have to be performed. Within this paper, we investigate the influence on system response and model quality, in terms of accuracy and precision, in the context of integrating the data and corresponding processes in a holistic system. Although a complete integration is feasible, different tools are currently used, which do not fully cooperate. Therefore, the efficiency or optimization regarding computation or response time are not in the focus of this work. With this paper, we step forward to create a cooperating and integrated environment that performs efficient with respect to model quality. This paper is structured as follows: In the next section, we briefly present some background regarding classification and database technologies for accessing multi-dimensional data. In Section 3, we describe the case study that is the motivation for our analysis. Within Section 4, we present our evaluation on the case study data regarding optimization due to feature and data space reduction. Finally, we conclude our work in Section 5. 2 Background In this section, we give background on classification algorithms in general. Then, we explain one of these algorithms that we apply in the remainder of this paper in more details. Finally, we introduce promising optimization approaches known from databases. We use these approaches in the remainder to discuss their optimization potential with respect to classification. 2.1 Classification Algorithms In the context of our case study in Section 3, several classification algorithms can be utilized, see, e.g., [MKH+ 13]. Each of those algorithms is used for supervised learning. Such type of learning consists of a model generation based on training data, which are labeled according to a ground-truth. The utilized classification algorithms in [MKH+ 13] partition the feature space to resemble the distribution of each instance (data point) in this space. 
Afterward, the quality of the model can be evaluated using an independent set of labeled test data by comparing the decision of the classifier with the assigned label. The utilized classification schemes from the WEKA data mining software [HFH+ 09] in [MKH+ 13] include support vector machines, multilayer perceptrons, rule based classifiers, decision trees, and ensemble classifiers. The latter ones combine multiple models in their decision process. 22 C4.5 decision tree In this paper, we use the classifier J48, WEKA’s [HFH+ 09] implementation of the fast C4.5 decision tree [Qui93], which is an improvement of the ID 3 algorithm [Qui86] and one of the most widely known decision tree classifiers for such problems. The advantage of decision trees is their comprehensiveness: the classifier’s decision is a leaf reached by a path of single feature thresholds. The size of the tree is reduced by a pruning algorithm which replaces subtrees. Furthermore, this particular implementation is able to deal with missing values. In order to do that, the distribution of the available values for this particular feature is taken into account. 1 build_tree ( Data (R:{r_1,..,,r_n},C) R: non-categorical_attributes r_1 to r_n, 2 3 C: categorical attribute, 4 S: training set in same schema as Data) 5 returning decision_tree; 6 7 begin 8 -- begin exceptions 9 If is_empty(S) 10 return FAILURE; If only_one_category(DATA) 11 return single_node_tree(value(C)); 12 If is_empty(R) 13 return single_node_tree(most_frequent_value(C)); 14 15 -- end excpetions 16 Attribute r_d (elem of R) := largest_Gain(R,S); 17 {d_i| i=1,2, .., m} := values_of_attribute(r_d); 18 {S_i| i=1,2, .., k} := subsets(S) where in each subset value(r_d) = d_i holds; decision_tree := tree(r_d) with nodes { d_1, d_2, .., d_m} pointing to trees 19 call ID3((R-{r_d}, C), S1), ID3((R-{r_d}), C, S2), .., ID3((R-{r_d}, C), S_k); 20 return decision_tree; 21 22 end build_tree; Figure 1: Algorithm to build a C4.5 decision tree, adapted from [Qui86] In Figure 1, we depict the general algorithm to build a C4.5 decision tree. The argument for the algorithm is a training set consisting of: (1) n non-categorical attributes R reaching from r1 to rn , (2) the categorical attribute (e.g., spam or not spam), and (3) a training set with the same schema. In Lines 8 to 15, the exception handling is depicted, for instance if there are only spam mails (Line 11). The actual algorithm tries to find the best attribute rd and distributes the remaining tuples in S according to their value in rd . For each subtree that is created in that way the algorithm is called recursively. 2.2 Approaches for Efficient Data Access Data within databases have to be organized in such a way that they are efficiently accessed. In the case of multi-dimensional data, an intuitive order does not exist. This is even more apparent for the identification of unknown patterns, where an ordering in a multi-dimensional space always dominates some dimensions. For these reasons, differ- 23 ent approaches have been proposed. They can be differentiated into storage and index structures. Typical storage improvements within the domain of Data Warehousing [KSS14a] are column-oriented storage [AMH08], Iceberg-Cube [FSGM+ 98], and Data Dwarf [SDRK02]. Whereas the Iceberg-Cube reduces computational effort, column-oriented storage improves the I/O with respect to the application scenario, where operations are performed in a column-oriented way. 
The Data Dwarf heavily reduces the stored data volume without loss of information. It combines computational effort and I/O cost for improving efficiency. Furthermore, there exist many different index structures for specialized purposes [GBS+ 12]. Very well-known index structures for multi-dimensional purposes are the kd-Tree [Ben75] and R-Tree [Gut84]. Both mentioned indexes are candidates, which suffer especially from the curse of dimensionality. The curse of dimensionality is a property of large and sparsely populated high-dimensional spaces, which results in the effect that for tree-based indexes often large parts have to be taken into consideration for a query (e.g., because of node overlaps). To this end, several index structures, as the Pyramid technique [BBK98] or improved sequential scans, such as the VA-File [WB97] are proposed. In the following, we briefly explain some well-known indexes that, according to prior evaluations [SGS+ 13], result in a significant performance increase. A broader overview on index structures can be found in [GG98] or [BBK01]. Challenges regarding parameterization of index structures as well as implementation issues are discussed in [AKZ08, KSS14b, SGS+ 13]. 2.2.1 Column vs. Row Stores Traditionally, database systems store their data row-wise. That means that each tuple with all its attributes is stored and then the next tuple follows. By contrast, columnar storage means that all values of a column are stored sequentially and then the next column follows. Dependent on the access pattern of a classification algorithm, the traditional row-based storage should be replaced if, for instance, one dimension (column) is analyzed to find an optimal split in this dimension. In this case, we expect a significant performance benefit. 2.2.2 Data Dwarf The basic idea of the Data Dwarf storage structure is to use prefix and suffix redundancies for multi-dimensional points to compress the data. For instance, the three dimensional points A(1, 2, 3) and B(1, 2, 4) share the same pre-fix (1, 2, ). As a result, the Dwarf has two interesting effects that are able to speed-up classifications. Firstly, due to the compression, we achieve an increased caching performance. Secondly, the access path is stable, which means that we require exactly the number of dimension look-ups to find a point (e.g., three look-ups for three dimensional points). 24 2.2.3 kd-Tree A kd-Tree index is a multi-dimensional adaption of the well-known B-Tree cycling through the available dimensions. Per tree level, this index distributes the remaining points in the current subtree into two groups. One group in the left subtree where the points have a value smaller or equal than the separator value in the current dimension, while the remaining points belong to the right sub tree. The basic goal is to achieve logarithmic effort for exact match queries. In summary, this index structure can be used to efficiently access and analyze single dimensions in order to separate two classes. 2.2.4 VA-File Tree-based index structures suffer from the curse of dimensionality. This may result in the effect that they are slower than a sequential scan. To this end, improvements of the sequential scan are proposed. The basic idea of the Vector Approximation File is to use a compressed approximation of the existing data set that fits into the main memory (or caches). On this compressed data an initial filter step is performed in order to minimize actual point look-ups. 
In how far this technique can be applied to speed up classifications is currently unknown.

3 Case Study

As described in [HKDV14], the classification of contact-less scans of latent fingerprints is performed using a block-based approach. The following subsections summarize the application scenario, the data gathering process, and a description of the feature space. We depict this process in Fig. 2 and describe the steps in the following in more detail.

[Figure 2: Data acquisition process, processing, and classification — a contactless CWL sensor scan of the substrate is filtered (first- and second-order Sobel operators in X/Y combined, Sobel in X and Y separately, unsharp masking), segmented into blocks, and passed through feature extraction (statistical, structural, and semantic features, fingerprint ridge orientation semantics), feature selection, and classification into fingerprint (FP) and background (BG) blocks.]

3.1 Application scenario

The application scenario for this case study is the contact-less, non-invasive acquisition of latent fingerprints. The primary challenge of this technique is the inevitable acquisition of the substrate characteristics superimposing the fingerprint pattern. Depending on the substrate, the fingerprint can be rendered invisible. In order to allow for a forensic analysis of the fingerprint, it is necessary to differentiate between areas of the surface without fingerprint residue and others covered with fingerprint residue (fingerprint segmentation). For this first evaluation, we solely rely on a white furniture surface, because it provides a rather large difference between the substrate and the fingerprint. The classification accuracy achieved in a two-fold cross-validation based on 10 fingerprint samples is 93.1% for the J48 decision tree in [HKDV14]. The number of 10 fingerprints is sufficient for our evaluation, because we do not perform a biometric analysis. Due to the block-based classification, 1,003,000 feature vectors are extracted. For our extended 600-dimensional feature space (see Section 3.3), we achieve a classification accuracy of 90.008% based on 501,500 data sets for each of the two classes "fingerprint" and "substrate".

3.2 Data Gathering Process

The data gathering process utilizes an FRT CWL600 [Fri14] sensor mounted to an FRT MicroProf200 surface measurement device. This particular sensor exploits the effect of chromatic aberration of lenses to measure the distance and the intensity of the reflected light simultaneously. Due to this effect, the focal length differs between wavelengths. Thus, only one wavelength from the source of white light is focused at a time. This particular wavelength yields the highest intensity in the reflected light. Hence, it can be easily detected using a spectrometer by locating the maximum within the spectrum. The intensity value is derived from the amplitude of this peak within the value range [1; 4,095]. The wavelength of the peak can be translated into a distance between the sensor and the measured object using a calibration table. The achieved resolution for this distance is 20 nm. The data themselves are stored within a 16-bit integer array which can afterward be converted to a floating point distance value. The CWL600 is a point sensor which acquires the sample point-by-point while the sample is moved underneath. Thus, it is possible to select arbitrary lateral resolutions for the acquisition of the sample.
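For orientation, a chosen lateral dot distance translates into an equivalent scan resolution via the length of an inch (25,400 µm); the 10 µm spacing used in this case study (see below) therefore corresponds to roughly 2,540 ppi:

\text{resolution [ppi]} = \frac{25{,}400\ \mu\text{m/inch}}{\text{lateral dot distance}\ [\mu\text{m}]}, \qquad \frac{25{,}400\ \mu\text{m/inch}}{10\ \mu\text{m}} = 2{,}540\ \text{ppi} \approx 5 \times 500\ \text{ppi}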
In our case study, we use a lateral dot distance of 10 µm, which results in a resolution five times as high as the commonly used resolution of 500 ppi in biometric systems.

3.3 Data Description

The feature space in [HKDV14] contains statistical, structural, and fingerprint semantic features. The final feature space is extracted from the intensity and topography data (see Section 3.2) and preprocessed versions of these data sets. Table 1 summarizes the 50 features which are extracted from each data set.

Table 1: Overview of the extracted features
  Statistical Features: minimum value; maximum value; span; mean value; median value; variance; skewness; kurtosis; mean squared error; entropy; globally and locally normalized values of absolute min, max, median; globally and locally normalized values of relative min, max, span, median; globally normalized absolute and relative mean value of B
  Structural Features: covariance of upper and lower half of a block B; covariance of left and right half of the block B; line variance of a block B; column variance of a block B; most significant digit frequency derived from Benford's Law [Ben38] (9 features); Hu moments [Hu62] (7 features)
  Fingerprint Semantic Features: maximum standard deviation in BM after Gabor filtering; mean value of the block B for the highest Gabor response

All features are extracted from blocks with a size of 5×5 pixels, with the exception of the fingerprint semantic feature of the maximum standard deviation in BM after Gabor filtering. The fingerprint semantic features are motivated by fingerprint enhancement, e.g. [HWJ98], which utilizes Gabor filters for emphasizing the fingerprint pattern after determining the local ridge orientation and frequency. Since this filtering relies on a ridge-valley pattern, it requires larger blocks. In particular, we use a block size of 1.55 by 1.55 mm (155×155 pixels) as suggested in [HKDV14]. The features are extracted from the original and pre-processed data. In particular, the intensity and topography data are pre-processed using Sobel operators of first and second order in X and Y direction combined, Sobel operators of first order in X as well as Y direction separately, and unsharp masking (subtraction of a blurred version of the data). As a result, we get a 600-dimensional feature space. However, some of the features cannot be determined, e.g., due to a division by zero in case of the relative statistical features. Thus, either the classifier must be able to deal with missing values, or those features need to be excluded. To this end, we apply the J48 classifier, because it handles missing data.

4 Evaluation

In this section, we present the evaluation of the classification according to the J48 algorithm. We restrict this study to performance measurements of the computation time for building the model and for the evaluation of the model. According to Section 2.2, we identify the cardinality of the dimensions and the involved features as influences on the performance. Therefore, we investigate the model quality with respect to precision and recall. First, we present our evaluation setup. This is followed by the result presentation. Finally, we discuss our findings.

Table 2: Contingency table
                        Test outcome positive   Test outcome negative
  Condition positive    True Positive (TP)      False Negative (FN)
  Condition negative    False Positive (FP)     True Negative (TN)

4.1 Setup

Our data are preprocessed as described in Section 3. We use the implementation of C4.5 [Qui93] in WEKA, which is called J48.
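As a rough illustration of this setup, the following sketch trains a decision tree on one half of the labelled feature vectors and tests it on the other half. It is only an analogous stand-in: scikit-learn's CART implementation is used instead of WEKA's J48, and the synthetic data merely mimics the shape of the 600-dimensional block feature space; unlike J48, this estimator may require imputing missing feature values first.

# Analogous sketch of the evaluation setup; scikit-learn's CART tree stands in
# for WEKA's J48, and the synthetic data only imitates the 600-dimensional,
# two-class ("fingerprint" vs. "substrate") block feature space.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=600,
                           n_informative=50, random_state=0)

# Two equally sized halves, mirroring the two-fold cross-validation style.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy")   # C4.5-like impurity measure
tree.fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))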
To identify relationships between the included feature dimensions and feature cardinalities on the one hand and model build and evaluation times on the other, we use different performance measures for the model. We briefly describe these model performance measures in the following. In classification, the candidates can be classified correctly or incorrectly. Compared to the test population, four cases are possible, as presented in Table 2. In the following, we define measures that can be derived from the contingency table.

The recall (also called sensitivity or true positive rate) relates the correctly identified positive elements to all actually positive elements. This measure is defined as:

\text{Recall} = \frac{TP}{TP + FN}    (1)

Accuracy describes all correctly classified positive and negative elements compared to all elements. This measure assumes a non-skewed distribution of classes within the training as well as the test data. It is defined as:

\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}    (2)

Precision is also called positive prediction rate and relates the correctly identified positives to all elements classified as positive. It is defined as:

\text{Precision} = \frac{TP}{TP + FP}    (3)

Specificity is also called true negative rate and is a ratio comparing the correctly classified negative elements to all actually negative elements. It is defined as:

\text{Specificity} = \frac{TN}{FP + TN}    (4)

The measure Balanced Accuracy is applied in the case that the classes are not equally distributed and thus takes non-symmetric distributions into account. The balance is achieved by computing the arithmetic mean of Recall and Specificity, and it is defined as:

\text{Balanced Accuracy} = \frac{\text{Recall} + \text{Specificity}}{2} = \frac{1}{2} \cdot \left( \frac{TP}{TP + FN} + \frac{TN}{FP + TN} \right)    (5)

The F-Measure is the harmonic mean of precision and recall to deal with both interacting indicators at the same time. This results in:

\text{F-Measure} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}    (6)

Depending on the application scenario, a performance measure can be used for optimization. In Fig. 3, we depict all of the above performance measures for different filterings of our data set.

[Figure 3: Performance Measures for different filters of the test case — F-Measure, Accuracy, Balanced Accuracy, Precision, Recall, and Specificity plotted over the number of dimensions/features (300 to 600); quality values lie roughly between 0.85 and 0.90.]

In our evaluation, we investigate two different performance influences. On the one hand, we are interested in filtering out correlated data columns. On the other hand, we measure performance for a restricted data space domain. This is applied by a data discretization. Evaluation is based on three important aspects:
• building the model in terms of computation time,
• testing the model in terms of computation time, and
• quality of the model measured in model performance indicators.

From Fig. 3, it can be seen that the computed models have a higher specificity than recall. This also results in a lower F-Measure. Furthermore, it can be seen that the training data are not imbalanced and accuracy is very close to balanced accuracy. However, all values are close and in an acceptable range. Therefore, we use the F-Measure as the model performance measure for the remainder of our result presentation.

To reduce the data space, we secondly discretize each feature. This is computed in such a way that the variance within a feature is retained as well as possible.
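The variance-retaining discretization is not spelled out in more detail here; as one plausible reading, the following sketch restricts every feature to a fixed cardinality using quantile-based binning (the binning strategy is our assumption, not the paper's method).

# One possible discretization of each feature column to a fixed cardinality.
# Quantile-based binning via scikit-learn's KBinsDiscretizer is used here as
# an assumption; the paper's exact variance-retaining scheme may differ.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def discretize(X: np.ndarray, cardinality: int) -> np.ndarray:
    """Map every feature to at most `cardinality` ordinal values."""
    binner = KBinsDiscretizer(n_bins=cardinality, encode="ordinal",
                              strategy="quantile")
    return binner.fit_transform(X)

# Example: restrict a 600-dimensional feature space to 32 values per feature.
X_discrete = discretize(np.random.rand(1000, 600), cardinality=32)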
Currently, there are no data structures within the WEKA environment that use restricted data spaces efficiently. Therefore, we assume that model creation and model evaluation times are not significantly influenced. However, as a database system can be used in the future, the question arises which quality influence on model performance is achieved by discretization. Therefore, we conduct an evaluation series with discretized feature dimensions, where all feature dimensions are restricted to the following cardinalities:
• 8 values,
• 16 values,
• 32 values,
• 64 values,
• 128 values,
• 256 values,
• 512 values,
• 1,024 values,
• 2,048 values, and
• full cardinality.

4.2 Result Presentation

As a first investigation of our evaluation scenario, we present results regarding the elimination of features. For the feature elimination, we choose a statistical approach in which correlated data columns are eliminated from the data set (see the code sketch below). In Fig. 4, we present the dimensions that remain in the data set. On the x-axis, we show the correlation criterion that is used for elimination. For instance, a correlation criterion of 99% means that all data columns are eliminated from the data set that have a correlation of 0.99 to another feature within the data set. Note that we compare every feature column with every other column and, at an elimination decision, we keep the first one in the data set. Therefore, we prefer the earlier data columns within the data set. Furthermore, we also tested the feature reduction for discretized data sets. With a small cardinality, the feature reduction due to correlation is lower, which means that the dimensional space is higher compared to the others.

[Figure 4: Feature Reduction by Correlation — number of remaining dimensions (roughly 300 to 600) over the correlation criterion (0.5 to 1.0) for the full-cardinality data set and the discretized data sets (8 to 2,048 values).]

In the following, we evaluate the reduction of the feature space in terms of computational effort. We differentiate two cases for this effort: On the one hand, the model building time represents the computational performance for creating (learning) the model. As the amount of data tuples for learning the model, we use 501,500 elements. As a second measurement, we present evaluation times where 501,500 further elements are used in a testing phase of the model. This additionally leads to the quality indicators of the model presented in Section 4.1. We present this information afterward.

In Fig. 5, we present the model creation times for the different data sets. With a decrease of the feature space, the computation time reduces, too. However, some jumps are identifiable. These are related to the fact that the algorithm has a dynamic model complexity. This means that the number of nodes within the model is not restricted and therefore, smaller models can have a faster generation time. Nevertheless, we do not focus on optimization for our case study, but we derive a general relationship. From our data, we can derive that the decrease levels off for data that are not more than 85% correlated. This leads to a slower reduction in computation time. However, with this elimination of 85%-correlated values, the computational effort is reduced to approximately one third. An important result from the model generation: a restrictive discretization (cases 8 and 16) negatively influences the model building time.
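As referenced above, a minimal sketch of this correlation-based elimination rule, assuming the feature columns live in a pandas DataFrame whose column order encodes the preference for earlier features:

# Scan columns left to right and drop any column whose absolute Pearson
# correlation with an already kept column reaches the threshold, so the
# earlier (kept) column is preferred. Names and threshold are illustrative.
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return df[kept]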
Note that we do not use an optimized data structure in our evaluation, which would have a significant influence on the computational performance; see also Section 2.2. Although the underlying data structure is generic, a restriction of the feature cardinalities improves model building times for the cases of cardinality 32 and higher. For the evaluation times of the model, a similar behavior is identifiable. In Fig. 6, we present the evaluation times for the same data sets. Two major differences can easily be seen: On the one hand, the difference between the test cases is smaller and the slopes are smoother. On the other hand, the reduction of the evaluation time is optimal for cardinalities of 32 and 64. An increase of the cardinalities leads to a higher computational effort. This is attributed to the fact that the sequential searches within the data are quite important for the testing phase of a model. The usage of efficient data structures should therefore be in the focus of future studies.

Both evaluations presented above focus only on computation time. However, we have to respect the quality of the model at the same time. Within classification applications, an increased information usage (in terms of data attributes) can increase the model quality. A reduction of the information space might lead to a lower model quality.

[Figure 5: Model Build Time and Figure 6: Model Evaluation Time — time in seconds (roughly 2,000 to 10,000) over the erased correlations (0.5 to 1.0) for the full-cardinality data set and the discretized data sets (8 to 2,048 values).]

In Fig. 7, we show the relationship between erased correlations and the F-Measure. Note that an increase in the F-Measure is also possible for a reduced data space (e.g., in the case of full cardinality). With a reduced cardinality in the information space, a lower F-Measure is achieved. This is especially true for low cardinalities (e.g., 8 or 16). However, when the correlation criterion is reduced from 0.95 to 0.9, a higher decrease in the F-Measure is identifiable. A second significant reduction of the F-Measure occurs at the 0.7 correlation elimination level. In Fig. 8, we present the relationship between model build times and the model quality. Although a negative dependency is assumed, this trend is only applicable to some parts of the evaluation space. Regarding the joint optimization of model quality and computation time, the first large decrease in model quality occurs at an elimination of 0.95-correlated values. Further eliminations do not reduce the model build times to a similar degree. Overall, we have to state that our reduction of the data space is quite high compared to the reduction of the model quality in terms of the F-Measure. Note that the other model performance measures behave quite similarly.

4.3 Discussion

With our evaluation, we focus on the influences of the data space on model performance in terms of quality and computation times. Therefore, we reduce the information space in two ways. On the one hand, we restrict dimensionality by applying a feature reduction by correlation. This is also called canonical correlation analysis.
It can be computed in a very efficient way and is therefore much faster than other feature reduction techniques, e.g., principal component analysis or linear discriminant analysis. Furthermore, we restrict the cardinality of the feature spaces, too. We discretize the feature space and are interested in the influence on model quality. An influence on the model build times is not assumed, due to the fact that the underlying data structures are not optimized. We will focus on this in future work, cf. [BDKM14]. Due to the column-wise data processing of the classifiers, we assume that a change in the underlying storage structure, e.g., column stores or Data Dwarfs, leads to a significant computational performance increase. First analyses of the WEKA implementation reveal a high integration effort. However, the benefits are very promising.

[Figure 7: Model Quality and Reduction — F-Measure (roughly 0.82 to 0.90) over the erased correlations (0.5 to 1.0). Figure 8: Model Quality and Build Times — F-Measure over the model build time in seconds (4,000 to 10,000); both for the full-cardinality data set and the discretized data sets (8 to 2,048 values).]

5 Conclusion

We present some ideas on improving model quality and computational performance for a classification problem. This work is a starting point for enhancing the process with respect to optimizing computation times in a biometric scenario. Additional use cases, e.g., indicator simulation [KL05], other data mining techniques [HK00], or operations in a privacy-secure environment [DKK+ 14], can be applied to our main idea and have to be considered for filtering and reduction techniques. With our evaluation study, we show that performance with respect to computation times as well as model quality can be optimized. However, a trade-off between both targets has to be achieved due to their inter-dependencies. In future work, we want to improve the process by integrating and optimizing the different steps. We assume that an efficient data access structure is beneficial for model computation times and therefore broadens the application scenario. However, this computational improvement relies on the information space, especially on the dimensional cardinality and the number of involved dimensions. With an easy-to-apply algorithm, such data processing enables a fast transformation of the feature space and smooths the way for more efficient data mining in forensic scenarios.

Acknowledgment

The work in this paper has been funded in part by the German Federal Ministry of Education and Research (BMBF) through the Research Program "DigiDak+ SicherheitsForschungskolleg Digitale Formspuren" under Contract No. FKZ: 13N10818.

References

[AKZ08] Elke Achtert, Hans-Peter Kriegel, and Arthur Zimek. ELKI: A Software System for Evaluation of Subspace Clustering Algorithms. In SSDBM, LNCS (5069), pages 580–585. Springer, 2008.
[AMH08] Daniel J. Abadi, Samuel Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How different are they really? In Proceedings of the International Conference on Management of Data (SIGMOD), pages 967–980, Vancouver, BC, Canada, 2008.
[BBK98] Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The Pyramid-technique: Towards Breaking the Curse of Dimensionality. SIGMOD Rec., 27(2):142–153, 1998.
[BBK01] Christian Böhm, Stefan Berchtold, and Daniel A. Keim.
Searching in Highdimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. ACM Comput. Surv., 33(3):322–373, 2001. [BDKM14] David Broneske, Sebastian Dorok, Veit Köppen, and Andreas Meister. Software Design Approaches for Mastering Variability in Database Systems. In GvDB, 2014. [Ben38] Frank Benford. The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78(4):551–572, 1938. [Ben75] Jon Louis Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM, 18(9):509–517, 1975. [DKK+ 14] Jana Dittmann, Veit Köppen, Christian Krätzer, Martin Leuckert, Gunter Saake, and Claus Vielhauer. Performance Impacts in Database Privacy-Preserving Biometric Authentication. In Rainer Falk and Carlos Becker Westphall, editors, SECURWARE 2014: The Eighth International Conference on Emerging Security Information, Systems and Technologies, pages 111–117. IARA, 2014. [Fri14] Fries Research & Technology GmbH. Chromatic White Light Sensor CWL, 2014. http://www.frt-gmbh.com/en/chromatic-white-light-sensor-frt-cwl.aspx. [FSGM+ 98] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing Iceberg Queries Efficiently. In Proceedings of the 24rd International Conference on Very Large Data Bases, VLDB ’98, pages 299–310, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. [GBS+ 12] Alexander Grebhahn, David Broneske, Martin Schäler, Reimar Schröter, Veit Köppen, and Gunter Saake. Challenges in finding an appropriate multi-dimensional index structure with respect to specific use cases. In Ingo Schmitt, Sascha Saretz, and Marcel Zierenberg, editors, Proceedings of the 24th GI-Workshop ”Grundlagen von Datenbanken 2012”, pages 77–82. CEUR-WS, 2012. urn:nbn:de:0074-850-4. 34 [GG98] Volker Gaede and Oliver Günther. Multidimensional Access Methods. ACM Comput. Surv., 30:170–231, 1998. [Gut84] Antonin Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Rec., 14(2):47–57, 1984. [HFH+ 09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10 – 18, 2009. [HK00] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000. [HKDV14] Mario Hildebrandt, Stefan Kiltz, Jana Dittmann, and Claus Vielhauer. An enhanced feature set for pattern recognition based contrast enhancement of contact-less captured latent fingerprints in digitized crime scene forensics. In Adnan M. Alattar, Nasir D. Memon, and Chad D. Heitzenrater, editors, SPIE Proceedings: Media Watermarking, Security, and Forensics, volume 9028, pages 08/01–08/15, 2014. [Hu62] Ming-Kuei Hu. Visual pattern recognition by moment invariants. Information Theory, IRE Transactions on, 8(2):179–187, 1962. [HWJ98] Lin Hong, Yifei Wan, and A. Jain. Fingerprint image enhancement: algorithm and performance evaluation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):777 –789, aug 1998. [KL05] Veit Köppen and Hans-J. Lenz. Simulation of Non-linear Stochastic Equation Systems. In S.M. Ermakov, V.B. Melas, and A.N. Pepelyshev, editors, Proceeding of the Fifth Workshop on Simulation, pages 373–378, St. Petersburg, Russia, July 2005. NII Chemistry Saint Petersburg University Publishers. [KSS14a] Veit Köppen, Gunter Saake, and Kai-Uwe Sattler. Data Warehouse Technologien. MITP, 2 edition, Mai 2014. 
[KSS14b] Veit Köppen, Martin Schäler, and Reimar Schröter. Toward Variability Management to Tailor High Dimensional Index Implementations. In RCIS, pages 452–457. IEEE, 2014. [MKH+ 13] Andrey Makrushin, Tobias Kiertscher, Mario Hildebrandt, Jana Dittmann, and Claus Vielhauer. Visibility enhancement and validation of segmented latent fingerprints in crime scene forensics. In SPIE Proceedings: Media Watermarking, Security, and Forensics, volume 8665, 2013. [Qui86] John Ross Quinlan. Induction of Decision Trees. Mach. Learn., 1(1):81–106, 1986. [Qui93] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. [SDRK02] Yannis Sismanis, Antonios Deligiannakis, Nick Roussopoulos, and Yannis Kotidis. Dwarf: Shrinking the PetaCube. In SIGMOD, pages 464–475. ACM, 2002. [SGS+ 13] Martin Schäler, Alexander Grebhahn, Reimar Schröter, Sandro Schulze, Veit Köppen, and Gunter Saake. QuEval: Beyond high-dimensional indexing à la carte. PVLDB, 6(14):1654–1665, 2013. [WB97] Roger Weber and Stephen Blott. An Approximation-Based Data Structure for Similarity Search. Technical Report ESPRIT project, no. 9141, ETH Zürich, 1997. 35 Using Different Encryption Schemes for Secure Deletion While Supporting Queries Maik Schott, Claus Vielhauer, Christian Krätzer Department Informatics and Media Brandenburg University of Applied Sciences Magdeburger Str. 50 14770 Brandenburg an der Havel, Germany [email protected] [email protected] Department of Computer Science Otto-von-Guericke-University Magdeburg Universitaetsplatz 2 39106 Magdeburg, Germany [email protected] Abstract: As more and more private and confidential data is stored in databases and in the wake of cloud computing services hosted by third parties, the privacyaware and secure handling of such sensitive data is important. The security of such data needs not only be guaranteed during the actual life, but also at the point where they should be deleted. However, current common database management systems to not provide the means for secure deletion. As a consequence, in this paper we propose several means to tackle this challenge by means of encryption and how to handle the resulting shortcomings with regards to still allowing queries on encrypted data. We discuss a general approach on how to combine homomorphic encryption, order preserving encryption and partial encryption as means of depersonalization, as well as their use on client-side or server-side as system extensions. 1 Introduction and state of the art With the increase of data in general stored in databases especially its outsourcing into cloud services, privacy-related informations are also becoming more and more prevalent. Therefore there is an increasing need of maintaining the privacy, confidentiality, and in general security of such data. Additionally privacy is required by several national laws, like the Family Educational Rights and Privacy Act and the Health Insurance Portability and Accountability Act of the United States, the Federal Data Protection Act (Bundesdatenschutzgesetz) of Germany, or the Data Protection Directive (Directive 95/46/EC) of the European Union. All these legal regulations require the timely and guaranteed – in the sense that it is impossible to reconstruct – removal of private information. Such removal is called forensic secure deletion. 37 Aside from the regular challenges of this issue, e.g. 
the behavior of magnetic media to partially retain the state of their previous magnetization – leaving traces of data that is later overwritten with other data – the wear-levelling techniques of solid-state memory media [Gu96], RAM or swap memory copies, and remote backups, database systems add further complexity due to their nature of providing efficient and fast access to data by introducing several redundancies. A given piece of information is not only stored within its respective database table, but also in other locations, like indexes, logs, result caches, temporary relations or materialized views [SML07]. Deleted rows are often just flagged as deleted, without touching the actually stored data. Additionally, due to page-based storage mechanisms, changes to records that require a change to the layout of a page do not necessarily update this very page; instead, a new copy with the updated data is created in the unallocated parts of the file system, while the original page is left behind, flagged as unallocated space. The same applies to any kind of deletion operation. Essentially, old data is marked as deleted but remains present and is not immediately or intentionally destroyed. The latter happens only occasionally, when the unallocated space is later overwritten by a new page. An extensive study of this issue was done by Stahlberg et al. [SML07], who forensically investigated five different database storage engines – IBM DB2, InnoDB, MyISAM (both MySQL), PostgreSQL, and SQLite – with regard to traces of deleted data left within the table storage, transaction log, and indexes. They found that even after applying 25,000 operations and vacuuming, a large amount of deleted records could still be found. Furthermore, they investigated the cost of overwriting or encrypting (albeit using highly insecure algorithms) log entries for the InnoDB engine. Grebhahn et al. [GSKS13] especially focused on index structures and on what kind and amount of traces of deleted records can be reconstructed from the structure of indexes. Although they investigated not a real database system but a mock-up designed to thoroughly evaluate high-dimensional indexes, they achieved recovery rates from R-trees of up to 60% in single cases. As shown, although forensically secure deletion is required in many cases, actually removing data once it has been ingested into a database is still a difficult or even unsolved challenge. Therefore, the solution must be sought at an earlier point: the time the data first enters the database. As the deletion of data basically means rendering this data unreadable/illegible, it shares similarities with encrypting data without having knowledge of the proper key, as introduced by [BL96] for backup systems. Encryption can therefore be seen as a "preventive deletion" scheme in a forensically secure way. At the same time, the illegibility of encrypted data also hinders its widespread use in database systems, as most operations, and therefore queries, that are possible on plaintext data are not possible on the ciphertext. Therefore, in this paper we discuss a general approach on how to use several encryption schemes to provide additional security while still maintaining some of the advantageous properties of unencrypted data. A similar concept has been introduced in the context of CryptDB, e.g. in [PRZ+11], and more recently in [GHH+14].
2 Approach

Encryption schemes can generally be classified as symmetric or asymmetric, depending on whether the encryption and decryption processes use the same shared key or different keys (public and private key). As the sender and receiver of a message use different keys in asymmetric schemes, less trust is required between both parties with respect to key handling and secure storage; in this respect, asymmetric schemes are more secure. However, this severely affects performance; conversely, symmetric schemes are several orders of magnitude faster than asymmetric algorithms. In "traditional" (strong) cryptography, the goal is that the ciphertext does not reveal anything about the plaintext, i.e. for any two similar yet different plaintexts m1 and m2 the resulting ciphertexts are dissimilar and appear random. As such, any operation that makes use of a property of the plaintext is not applicable to the ciphertext. This is expressed by the cryptographic property of ciphertext indistinguishability, introduced by [GM84] as polynomial security: given n plaintexts and their n ciphertexts, determining which ciphertexts refer to which plaintexts has the same probability of success as random guessing. Similar is the property of non-malleability, introduced by [DDN00], which states that, given a ciphertext, an attacker must not be able to modify it in a way that yields a related plaintext; i.e., a malleable scheme would fulfil E(m ⊕ x) = E(m) ⊗ x', with E being the encryption operation, m the plaintext, x a value that changes the plaintext using the operation ⊕, and x'/⊗ their counterparts in the encrypted domain. However, there exist encryption schemes which intentionally give up these security properties in exchange for additional benefits, i.e. computational properties for mathematical operations in the encrypted domain. The first of these schemes is Homomorphic Encryption (HE), with its basic concept introduced by Rivest et al. [RAD78]: an encryption scheme allowing certain binary operations on the encrypted plaintexts to be carried out in the ciphertext domain by a related homomorphic operation, without any knowledge of the actual plaintext, i.e. E(m1 ∘ m2) = E(m1) ⊚ E(m2). However, most of the early homomorphic encryption schemes only allowed one operation (multiplication or addition), until Gentry [Ge09] introduced the first Fully Homomorphic Encryption (FHE) scheme, which allows additive as well as multiplicative arithmetic operations at the same time. As can easily be seen, homomorphic encryption does not fulfil the non-malleability criterion, since any adversary can combine two ciphertexts to create a valid new encrypted plaintext. The main drawback of this encryption scheme is that, although Gentry's approach generated a lot of interest in the scientific community, the scheme is very demanding with regard to computational time and space, with a plaintext-to-ciphertext expansion factor of thousands to millions [LN14], and the computation of complex/chained operations may take several seconds. The second encryption scheme is Order-Preserving Encryption (OPE), introduced by Agrawal et al. [AKSX04] and cryptographically proven by Boldyreva et al. [BCLO09], who also provided a cryptographic model based on the hypergeometric distribution. The idea of order-preserving encryption is to provide an encryption scheme that maintains the order of the plaintexts within their encrypted counterparts, i.e. ∀ m1 ≥ m2 : E(m1) ≥ E(m2).
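To make the additive homomorphism concrete, the following minimal Python sketch implements a textbook version of the Paillier cryptosystem [Pa99], which is also referred to in Section 3.3. It is a didactic illustration only: the primes are far too small for any real security, the helper names are ours rather than part of the paper's implementation, and the modular-inverse form pow(x, -1, n) (Python 3.8+) is assumed.

    import random
    from math import gcd

    def lcm(a, b):
        return a * b // gcd(a, b)

    def keygen():
        p, q = 293, 433                      # toy primes; real keys use large primes
        n = p * q
        g = n + 1                            # common simplification for the generator
        lam = lcm(p - 1, q - 1)
        # mu = (L(g^lam mod n^2))^-1 mod n, with L(x) = (x - 1) / n
        mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
        return (n, g), (lam, mu)

    def encrypt(pub, m):
        n, g = pub
        r = random.randrange(1, n)
        while gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

    def decrypt(pub, priv, c):
        n, _ = pub
        lam, mu = priv
        return ((pow(c, lam, n * n) - 1) // n * mu) % n

    def add_encrypted(pub, c1, c2):
        # Homomorphic addition: multiplying ciphertexts adds the plaintexts mod n.
        n, _ = pub
        return (c1 * c2) % (n * n)

    pub, priv = keygen()
    c1, c2 = encrypt(pub, 42), encrypt(pub, 58)
    assert decrypt(pub, priv, add_encrypted(pub, c1, c2)) == 100

Decrypting the product of the two ciphertexts yields the sum of the plaintexts, which is exactly the property exploited by the ADDHE function used later in Section 3.3.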
It can be seen that the OPE scheme does not fulfil the ciphertext indistinguishability criterion, as, by ordering the plaintexts and the ciphertexts, there is a high probability of knowing which ciphertext maps to which plaintext, except for equal plaintexts. A third encryption scheme is Partial Encryption (PE), which differs from the aforementioned schemes in that it is not a new encryption approach but a different application of existing common encryption schemes. The basic idea, as already discussed, e.g., in [MKD+11] and [SSM+11], is that in many complex data items only some parts are confidential. However, encrypting the complete data item may hinder the use of the data, as even persons or processes who only want to access the non-confidential parts would need to get the data item decrypted in some way, e.g. by being granted a higher security clearance, even though they should not need it. Using partial encryption, only the confidential parts of the data item are encrypted, and metadata is associated with the item specifying which parts are encrypted. As the data is complex, and thus HE and OPE schemes would not be usable, partial encryption would use strong cryptography. Combining these, our concept consists of two steps: The first step is to classify each data type with regard to its security level, in the sense of the maximum possible harm if the data gets misused or disclosed. This segmentation basically defines the amount of data to be protected as well as the type of security means needed to protect this data. Since the actual evaluation depends on the regulations of each organization, legislation, use cases and so on, it is not the focus of this paper and thus will not be described in detail. The second step is to evaluate what kind of operations are commonly applied to the data, in the sense of whether the queries mostly consist of arithmetic operations, comparison operations, or other kinds of operations. If data is mainly used for arithmetic operations, HE can be used to enable these kinds of operations on encrypted data. The same applies to data mainly used for comparison queries and OPE. If complex data types are present where only parts are sensitive and other parts contain useful information too, partial encryption can be used. It has to be stated here that this two-step concept is an extension of the basic approach discussed in [MKD+11]. However, as stated before, the HE and OPE schemes are weaker than strong encryption schemes. Therefore, based on the security level evaluated in step 1, highly confidential data that satisfies the criteria for either HE or OPE should still be encrypted using strong cryptography.

3 Realization in database systems

In this section we describe the implications of using our approach on the database management system by discussing whether the encryption, or parts thereof, should be provided server-side or client-side, as well as the necessary changes to queries. For two of the introduced schemes (HE and OPE) the integration into a query language is discussed only conceptually on the query level, while for the third scheme (PE) a realization of the required query language extension is presented. Our test MySQL database consists of actual forensic data acquired during the Digi-Dak project (http://omen.cs.uni-magdeburg.de/digi-dak/), consisting of fiber scans, synthetic fingerprints and metadata in the form of statistical, spectral and gradient features for fingerprints [MHF+12] in 3.4 million tuples, as well as diameter, length, perimeter, area, height and color as fiber features [AKV12].
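As a rough illustration of the two-step concept from Section 2, the following Python sketch encodes one possible selection routine. The security-level labels, operation categories and the default branch are illustrative assumptions of ours and are not prescribed by the approach above.

    def select_scheme(security_level: str, dominant_operation: str, complex_type: bool) -> str:
        """Sketch of the two-step scheme selection (illustrative labels only)."""
        # Step 1: highly confidential data is always strongly encrypted,
        # even if HE or OPE would be applicable in principle.
        if security_level == "high":
            return "strong encryption"
        # Step 2: choose according to the operations that dominate the workload.
        if complex_type:
            return "partial encryption"              # only the sensitive parts are encrypted
        if dominant_operation == "arithmetic":
            return "homomorphic encryption (HE)"
        if dominant_operation == "comparison":
            return "order-preserving encryption (OPE)"
        return "strong encryption"                   # no query support on ciphertext needed

    print(select_scheme("medium", "comparison", complex_type=False))   # -> OPE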
3.1 Server-side vs client-side

An important question is whether the encryption function E and the decryption function D should be provided client-side or server-side. On the client side, each client application would be required to implement both functions if it wants to create proper queries and interpret the results. However, this may be infeasible for heterogeneous environments with many different types of client software, especially if future updates to the employed encryption are considered. The advantage of this approach is that the keys never leave the client. If the crypto functions are centrally server-based, every client can make use of the encrypted data with only minor changes to the queries, as described in the following sections. An important issue here is the acquisition of the keys. The keys may either be provided by the client or by the server. For client-based key provision, they would be part of the query and thus transmitted over the server-client connection. In this case, the connection must be secured, e.g. by using SSL/TLS. For server-based key provision, the keys may either be part of the encryption or decryption functions themselves or be provided as part of a view, as described in Section 3.4. The disadvantage of server-based key provision using views is that the password needs to be stated in the view definition, and as such this encryption approach has at most the security level of the database access controls. It can thus protect confidential data against malicious clients or clients vulnerable to SQL injections and similar attacks, but not against attackers who have low-level (OS) access to the database. Therefore, client-based key provision has a higher security level, but as stated in Section 2, server-based key provision may also be feasible. Approaches like [GSK13] propose an extension of the SQL syntax to permit special forensic tables that automatically handle secure deletion, and [SML07] proposes the use of an internal stream cipher to automatically encrypt all data. Both approaches require extending or modifying the database system at the source code level, which is in most cases not applicable to database systems in a production environment. Therefore, our approach realizes secure deletion/encryption by making use of means already provided by the database system, in our case external User-Defined Functions (UDFs) – also called call specifications in Oracle or CLR functions in MSSQL – from external libraries, as complex cryptographic implementations are infeasible with stored procedures in SQL/PSM or related languages.

3.2 Order Preserving Encryption

As for the other encryption schemes, the data needs to be encrypted before it is stored in the database. This would be done by the client, who transforms an input item m to m' = EOPE(m, keyE), which is then INSERTed. On a SELECT, the returned m' needs to be transformed back as m = DOPE(m', keyD). As described in the previous section, EOPE() and DOPE() may be provided either server- or client-side, and the keys may be provided either explicitly or implicitly.
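To give an impression of what an EOPE-style function could look like, the following Python toy maps each non-negative integer plaintext to the cumulative sum of keyed pseudorandom gaps. It preserves order (and, being deterministic, equality), but it is a didactic stand-in under our own assumptions, not the scheme of [BCLO09] and not part of the implementation described here.

    import hashlib
    import hmac

    def keyed_gap(key: bytes, i: int) -> int:
        # Deterministic, keyed pseudorandom gap in [1, 1000].
        digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        return int.from_bytes(digest[:4], "big") % 1000 + 1

    def ope_encrypt(key: bytes, m: int) -> int:
        # Ciphertext = sum of the gaps 0..m, hence strictly monotone in the plaintext.
        return sum(keyed_gap(key, i) for i in range(m + 1))

    key = b"demo key"
    plaintexts = [3, 17, 17, 42]
    ciphertexts = [ope_encrypt(key, m) for m in plaintexts]
    # Order (and, since the mapping is deterministic, equality) is preserved:
    assert all((a <= b) == (ca <= cb)
               for a, ca in zip(plaintexts, ciphertexts)
               for b, cb in zip(plaintexts, ciphertexts))

Because the mapping is monotone, a comparison predicate evaluated on the ciphertexts yields the same result as on the plaintexts, which is what the rewritten WHERE clauses below rely on.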
Assuming the columns of a tuple where c2 contains confidential values mainly used for relational queries (for example with a fixed value v1), a query of the form

Query 1a: SELECT c1, c2 FROM t1 WHERE c2 < v1;

would be transformed into the following for OPE data:

Query 1b: SELECT c1, DOPE(c2, keyD) FROM t1 WHERE c2 < EOPE(v1, keyE);

In case the query result does not contain OPE data, the only overhead compared to non-encrypted data would be the encryption in the WHERE clause. Therefore, the time overhead depends on the number and complexity of OPE expressions in the SELECT clause, multiplied by the result count, and on the number and complexity of OPE expressions in the WHERE clause.

3.3 Homomorphic Encryption

For homomorphic encryption basically the same applies. Assuming the columns of a tuple where c1 and c2 contain confidential values mainly used for arithmetic operations, queries to retrieve a value for a client or to generate a value within the database:

Query 2a: SELECT c1 + c2 FROM t1 WHERE …;
Query 3a: INSERT INTO t2 (c3) SELECT c1 * c2 FROM t1 WHERE …;

would be transformed into the following for HE data:

Query 2b: SELECT DHE(ADDHE(c1, c2), keyD) FROM t1 WHERE …;
Query 3b: INSERT INTO t2 (c3) SELECT MULHE(c1, c2) FROM t1 WHERE …;

However, depending on the homomorphic operations and on whether the query result is stored in the database or returned to the client, the call to an additional function may not be needed. For example, in the Paillier cryptosystem [Pa99] the addition of two plaintexts is expressed by a multiplication of the ciphertexts: m1 + m2 mod n corresponds to m1' * m2' mod n^2.

3.4 Partial Encryption

In our implementation the actual library and user-defined functions were written in C#, as its runtime library provides a large amount of conveniently usable image processing and cryptography functions. As C# libraries use different export signatures, the Unmanaged Exports tool (MIT license) by Robert Giesecke (https://sites.google.com/site/robertgiesecke/Home/uploads/unmanagedexports) is used to automatically create proper C-style exports and unmarshalling. The pseudo-code of the decrypt UDF called DPE is as follows:

INPUT: blob, password
OUTPUT: image

(image, regions, cryptalg, cryptparams, blocksize) ← unpack(blob)
if password = ∅ then
    return image
endif
buffer ← ARRAY BYTE[1..blocksize], coord ← ARRAY POINT[1..blocksize]
b ← 0, h ← height(image), w ← width(image)
for y = 1 to h do
    for x = 1 to w do
        if (x, y) in regions then
            buffer[b] ← image[x, y], coord[b] ← (x, y)
            b ← b + 1
        endif
        if b = blocksize OR (y = h AND x = w) then
            buffer' ← decrypt(buffer, cryptalg, cryptparams, password)
            for i = 1 to blocksize do
                image[coord[i]] ← buffer'[i]
            endfor
            b ← 0
        endif
    endfor
endfor
return image

In our scenario, partial encryption is used for fingerprint scans. At a crime scene there are cases where latent fingerprints – sensitive data – may be superimposed with other, non-sensitive evidence like fiber traces. Depending on the investigation goal (fingerprint or fiber analysis) it may thus become necessary to make areas containing sensitive data inaccessible. Therefore, in our approach only the fingerprint parts of the scans are encrypted using AES and packed into a ZIP container along with the encryption metadata (algorithm, bit size, initialization vectors, password salt) and the outline of the partially encrypted regions, as shown in Figure 1.

Figure 1: (Synthetic) original and partially encrypted fingerprint. The example is taken from the Public Printed Fingerprint Data Set – Chromatic White Light Sensor – Basic Set V1.0.
The image was acquired using sensors from the Digi-Dak research project (http://omen.cs.uni-magdeburg.de/digi-dak/, 2013) sponsored by the German Federal Ministry of Education and Research; see Hildebrandt, M., Sturm, J., Dittmann, J., and Vielhauer, C.: Creation of a Public Corpus of Contact-Less Acquired Latent Fingerprints without Privacy Implications. Proc. CMS 2013, Springer, LNCS 8099, 2013, pp. 204–206. It uses privacy-implication-free fingerprint patterns generated with SFinGe, published in Cappelli, R.: Synthetic fingerprint generation. In: Maltoni, D., Maio, D., Jain, A.K., and Prabhakar, S. (eds.): Handbook of Fingerprint Recognition, 2nd edn., Springer London, 2009.

The database itself only stores the container. Additionally, for server-based key provision using views, there would be two views to provide standardized means of access at the database query language level: a) an "anonymized view" for general users who do not have the proper security clearance to access unencrypted fingerprints, returning the encrypted fingerprint scan from the container (Figure 1, right side):

Query 4: CREATE VIEW fingerprints_anon AS SELECT id, filename, DPE(scan) AS scan FROM fingerprints;

and b) a "deanonymized view" for privileged users like forensic fingerprint experts who do have the access rights to the actual fingerprints; it not only unpacks the image but also tries to decrypt it with the provided key (Figure 1, left side):

Query 5: CREATE VIEW fingerprints_deanon AS SELECT id, filename, DPE(scan, key) AS scan FROM fingerprints;

Regarding the performance on our test system (Intel Core i7-4610QM @ 2.3 GHz, 8 GB RAM), we observe in first experiments that querying 1000 tuples of containers takes 0.607 s, unpacking the encrypted scans (Query 4) 29.110 s, and returning the deanonymized scans (Query 5) 99.016 s. However, it should be noted that for the Query 5 task the computation time for image parsing is included in the given figure and makes up the main part of the execution time for this query.

4 Conclusion and future work

In this paper we presented a general concept for using different encryption schemes for the sake of secure deletion and de-personalization, also taking into account that the data at least partially remains usable for queries. The concept includes the classification of data by its security level and its later main usage, which determines the appropriate encryption scheme. We also showed how these encryption schemes can be used in common database systems without the need to directly modify the system, but rather by using the existing capabilities of a selected database system. In future work, more thorough research on key management for server-based key provision needs to be done, as well as on the applicability to other database systems. Furthermore, the actual performance impact of this general concept on practical systems has to be evaluated with large-scale experiments for relevant application scenarios, e.g. larger forensic databases and/or biometric authentication systems, where such a scheme could prevent information leakage as well as inter-system traceability.

5 Acknowledgements

The work in this paper has been funded by the German Federal Ministry of Education and Research (BMBF) within the Digi-Dak project under contract no. FKZ 13N10816.

References

[AKSX04] Agrawal, R.; Kiernan, J.; Srikant, R.; Xu, Y.: Order-preserving encryption for numeric data. In: SIGMOD, 2004; pages 563–574.
[AKV12] Arndt, C.; Kraetzer C.; Vielhauer, C.: First approach for a computer-aided textile fiber type determination based on template matching using a 3D laser scanning microscope. In: Proc. 14th ACM Workshop on Multimedia and Security, 2012. [BCLO09] Boldyreva, A., Chenette, N., Lee, Y., O’Neill, A.: Order-preserving symmetric encryption. In: EUROCRYPT, 2009, pages 224–241. [BL96] Boneh, D.; Lipton, R. J.: A revocable backup system. In: USENIX Security Symposium, 1996; pages 91–96. [DDN00] Dolev, D.; Dwork, C.; Naor, M.: Nonmalleable Cryptography. In: SIAM Journal on Computing 30 (2), 2000; pages 391–437. [Ge09] Gentry, C: Fully homomorphic encryption using ideal lattices. In: ACM symposium on Theory of computing, STOC ’09, New York, 2009; pages 169–178. [GM84] Goldwasser, S.; Micali, S.: Probabilistic encryption. In: Journal of Computer and System Sciences. 28 (2), 1984; pages 270–299. [GSK13] Grebhahn, A.; Schäler, M.; Köppen, V.: Secure Deletion: Towards Tailor-Made Privacy in Database Systems. In: Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS), BTW, Köllen-Verlag, 2013; pages 99–113. [GSKS13] Grebhahn, A.; Schäler, M.; Köppen, V.; Saake, G.: Privacy-Aware Multidimensional Indexing. In: BTW 2013; pages 133–147. [GHH+14] Grofig, P.; Hang, I.; Härterich, M.; Kerschbaum, F.; Kohler, M.; Schaad, A.; Schröpfer, A.; Tighzert, W.: Privacy by Encrypted Databases. Preneel, B.; Ikonomou, D. (eds.): Privacy Technologies and Policy. Springer International Publishing, Lecture Notes in Computer Science, 8450, ISBN: 978-3-319-06748-3, pp. 56-69, 2014. [Gu96] Gutmann, P.: Secure Deletion of Data from Magnetic and Solid-State Memory. In: USENIX Security Symposium, 1996. [MHF+12] Makrushin, A.; Hildebrandt, M.; Fischer, R.; Kiertscher, T.; Dittmann, J.; Vielhauer, C.: Advanced techniques for latent fingerprint detection and validation using a CWL device. In: Proc. SPIE 8436, 2012; pages 84360V. [MKD+11] Merkel, R.; Kraetzer, C.; Dittmann, J.; Vielhauer, C.: Reversible watermarking with digital signature chaining for privacy protection of optical contactless captured biometric fingerprints - a capacity study for forensic approaches. Proc. 17th International Conference on Digital Signal Processing (DSP), 2011. [LN14] Lepoint, T.; Nehring, A.: A Comparison of the Homomorphic Encryption Schemes FV and YASHE. Africacrypt 2014. [Pa99] Paillier, P.: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In: Eurocrypt 99, Springer Verlag, 1999; pages 223–238. [PRZ+11] Popa, R. A.; Redfield, C. M. S.; Zeldovich, N.; Balakrishnan, H.: CryptDB: Protecting Confidentiality with Encrypted Query Processing. Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011. [RAD78] Rivest, R. L.; Adleman, L.; Dertouzos, M. L.: On data banks and privacy homomorphisms. In: Foundations of Secure Computation, 1978. [SSM+11] Schäler, M.; Schulze, S.; Merkel, R.; Saake, G.; Dittmann, J.: Reliable Provenance Information for Multimedia Data Using Invertible Fragile Watermarks. In Fernandes, A. A. A.; Gray, A. J. G.; Belhajjame, K. (eds.): Advances in Databases. Springer Berlin Heidelberg, Lecture Notes in Computer Science, 7051, pp. 3-17, ISBN: 978-3642-24576-3, 2011. [SML07] Stahlberg, P.; Miklau, G.; Levine, B.N.: Threats to Privacy in the Forensic Analysis of Database Systems. In: SIGMOD, New York 2007, ACM; pages 91–102. 
Data Streams and Event Processing

Marco Grawunder ([email protected]), Universität Oldenburg
Daniela Nicklas ([email protected]), Universität Bamberg

The processing of continuous data sources has become an important paradigm of modern data processing and management, covering many applications and domains such as monitoring and controlling networks or complex production systems, as well as complex event processing in medicine, finance or compliance. Topics of interest include:

• Data streams
• Event processing
• Case Studies and Real-Life Usage
• Foundations
  – Semantics of Stream Models and Languages
  – Maintenance and Life Cycle
  – Metadata
  – Optimization
• Applications and Models
  – Statistical and Probabilistic Approaches
  – Quality of Service
  – Stream Mining
  – Provenance
• Platforms for event and stream processing, in particular
  – CEP Engines
  – DSMS
  – "Conventional" DBMS
  – Main memory databases
  – Sensor Networks
• Scalability
  – Hardware acceleration (GPU, FPGA, ...)
  – Cloud Computing
• Standardisation

In addition to regular workshop papers, we invite extended abstracts to cover hot topics, ongoing research and ideas that are ready to share and discuss, but maybe not ready to publish yet.

1 Workshop co-chairs

Marco Grawunder (Universität Oldenburg)
Daniela Nicklas (Universität Bamberg)

2 Program Committee

Andreas Behrend (Universität Bonn)
Klemens Boehm (Karlsruher Institut für Technologie)
Peter Fischer (Universität Freiburg)
Dieter Gawlick (Oracle)
Boris Koldehofe (Technische Universität Darmstadt)
Wolfgang Lehner (TU Dresden)
Richard Lenz (Universität Erlangen-Nürnberg)
Klaus Meyer-Wegener (Universität Erlangen)
Gero Mühl (Universität Rostock)
Kai-Uwe Sattler (Technische Universität Ilmenau)
Thorsten Schöler (Hochschule Augsburg)

Modulares Verteilungskonzept für Datenstrommanagementsysteme

Timo Michelsen, Michael Brand, H.-Jürgen Appelrath
Universität Oldenburg, Department für Informatik
Escherweg 2, 26129 Oldenburg
{timo.michelsen, michael.brand, appelrath}@uni-oldenburg.de

Abstract: Für die Verteilung kontinuierlicher Anfragen in verteilten Datenstrommanagementsystemen (DSMS) gibt es je nach Netzwerk-Architektur und Anwendungsfall unterschiedliche Strategien. Die Festlegung auf eine Strategie ist u. U. nachteilig, besonders wenn sich Netzwerk-Architektur oder Anwendungsfall ändern. In dieser Arbeit wird ein Ansatz für eine flexible und erweiterbare Anfrageverteilung in verteilten DSMSs vorgestellt. Der Ansatz umfasst drei Schritte: (1) Partitionierung, (2) Modifikation und (3) Allokation. Bei der Partitionierung wird eine kontinuierliche Anfrage in disjunkte Teilanfragen zerlegt. Die optionale Modifikation erlaubt es, Mechanismen wie Fragmentierung oder Replikation zu verwenden. Bei der Allokation werden die einzelnen Teilanfragen schließlich Knoten im Netzwerk zugewiesen, um dort ausgeführt zu werden. Für jeden der drei Schritte können unabhängige Strategien verwendet werden. Dieser modulare Aufbau ermöglicht zum einen eine individuelle Anfrageverteilung. Zum anderen können bereits vorhandene Strategien aus anderen Arbeiten und Systemen (z.B. eine Allokationsstrategie) integriert werden. In dieser Arbeit werden für jeden der drei Teilschritte beispielhafte Strategien vorgestellt. Außerdem zeigen zwei Anwendungsbeispiele die Vorteile des vorgestellten, modularen Ansatzes gegenüber einer festen Verteilungsstrategie.
1 Einleitung In verarbeitenden Systemen ist es häufig notwendig, den persistenten Teil des Systems mehrfach und verteilt vorzuhalten, um Ausfälle kompensieren zu können. In verteilten Datenbankmanagementsystemen (DBMS) werden persistente Daten disjunkt (Fragmentierung) oder redundant (Replikation) auf verschiedene Knoten des Netzwerks verteilt. Anfragen werden einmalig gestellt und greifen ausschließlich auf die Knoten zu, die die betreffenden Daten besitzen. Betrachtet man allerdings verteilte Datenstrommanagementsysteme (DSMS), so ist eine Verteilung der Datenstromelemente allein nicht zielführend, da diese flüchtig sind. Hier sind die Anfragen persistent, da sie theoretisch kontinuierlich ausgeführt werden. Eine solche, auf mehrere Knoten verteilte, kontinuierliche Anfrage wird verteilte, kontinuierliche Anfrage genannt. Eine verteilte, kontinuierliche Anfrage befindet sich auf mehreren Knoten, indem die einzelnen Operationen den Knoten zugeordnet und dort ausgeführt 51 werden (in sogenannten Teilanfragen). Zwischenergebnisse der Teilanfragen werden von Knoten zu Knoten gesendet, bis schließlich die Ergebnisse an den Nutzer gesendet werden. Für eine Zerlegung einer kontinuierlichen Anfrage in Teilanfragen gibt es allerdings viele Möglichkeiten (z.B. eine Teilanfrage für die gesamte kontinuierliche Anfrage oder je eine Teilanfrage pro Operation). Die Zerlegung in Teilanfragen ist nur ein Aspekt, der bereits aufzeigt, dass unterschiedliche Strategien für eine Anfrageverteilung in verteilten DSMSs existieren. Ein weiterer Aspekt ist die Netzwerk-Architektur (bspw. homogene Cluster, Client/Server-Architekturen und Peer-To-Peer (P2P)-Netzwerke). In einem homogenen Cluster verfügen alle Knoten i.d.R. über die gleichen Mengen an Systemressourcen. Daher muss bei einer Zuweisung von (Teil-) Anfragen an einen Knoten in homogenen Clustern keinerlei Heterogenität berücksichtigt werden. Kommt hingegen ein heterogenes P2P-Netzwerk zum Einsatz, kann es vorkommen, dass nicht jeder Knoten jede (Teil-) Anfrage ausführen kann (bspw. aufgrund komplexer Operatoren). In vielen verteilten DSMSs ist die Anfrageverteilung system-intern auf eine bestimmte Netzwerk-Architektur ausgelegt (z.B. Client/Server). Somit wird es aufwendig, die zugrunde liegende Netzwerk-Architektur nachträglich zu wechseln. Ein Grund für eine solche Einschränkung ist die Tatsache, dass keine optimale Verteilungsstrategie existiert, die für alle Netzwerk-Architekturen und Knoten eingesetzt werden kann. Neben unterschiedlichen Netzwerk-Architekturen können ebenso anwendungsspezifische Kenntnisse des Nutzers die Anfrageverteilung optimieren (z.B. die Identifikation von Operatoren, die eine hohe Systemlast erzeugen). Eine solche manuelle Optimierung ist allerdings nur möglich, wenn der Nutzer die Anfrageverteilung für eine konkrete kontinuierliche Anfrage konfigurieren kann. Diese Arbeit stellt ein Konzept für eine flexible und erweiterbare Anfrageverteilung in einem verteilten DSMS vor. Dazu wird die Verteilung in drei Phasen unterteilt. Jede Phase bietet eine eindeutige Schnittstelle mit Ein- und Ausgaben, die die Umsetzung mehrerer Strategien ermöglicht. Die Anfrageverteilung kann individuell für die zugrunde liegende Netzwerk-Architektur und den konkreten Anwendungsfall konfiguriert und ggfs. optimiert werden. Außerdem ist es möglich, bereits existierende Strategien aus anderen Quellen zu übernehmen. Es entsteht eine Sammlung an Strategien, die je nach Anwendungsfall individuell kombiniert werden können. 
Der Rest der Arbeit ist wie folgt gegliedert: Abschnitt 2 gibt einen Überblick über andere Systeme, die ebenfalls kontinuierliche Anfragen verteilen. Das modulare Verteilungskonzept wird in Abschnitt 3 vorgestellt. Dabei liegt der Fokus auf der Erläuterung des Konzeptes. Auf eine vollständige Übersicht aller zum jetzigen Zeitpunkt verfügbarer Strategien wird in dieser Arbeit verzichtet. Abschnitt 4 beinhaltet den aktuellen Stand der Implementierung und beispielhafte Anwendungsszenarien. In Abschnitt 5 wird die Arbeit abschließend zusammengefasst. 52 2 Verwandte Arbeiten In den DSMSs Borealis [CBB+ 03] und StreamGlobe [KSKR05] werden neue kontinuierliche Anfragen zunächst grob zerlegt und im Netzwerk verteilt. Häufig werden einzelne Operatoren Knoten zugeordnet. Nach der anfänglichen Verteilung werden Teile der kontinuierlichen Anfrage während der Verarbeitung verschoben, d.h., sie werden von Knoten zu Knoten übertragen. Dadurch können Kommunikationskosten gespart werden, indem bspw. Teilanfragen verschiedener Knoten zusammengefasst werden. Das DSMS Stormy [LHKK12] zerlegt keine kontinuierlichen Anfragen, sondern führt sie stets vollständig auf einem Knoten aus. Die Auswahl des Knotens geschieht mittels eines Hashwertes, welcher aus der kontinuierlichen Anfrage gebildet wird: Jeder Knoten übernimmt einen Teil des Wertebereichs, sodass kontinuierliche Anfragen mit bestimmten Hashwerten bestimmten Knoten zugeordnet werden. StreamCloud [GJPPM+ 12], Storm [TTS+ 14] und Stratosphere [WK09] nutzen CloudInfrastrukturen, um bei Bedarf verarbeitende, virtuelle Knoten in der Cloud zu erzeugen und Teilanfragen zuzuordnen. In Storm kann der Nutzer zusätzlich angeben, wie viele Instanzen für eine Operation erstellt werden sollen (bspw. für Fragmentierung oder Replikation). StreamCloud bietet unterschiedliche Zerlegungsstrategien, die später in dieser Arbeit aufgegriffen werden. In SPADE, einer Anfragesprache von System S [GAW09], werden kontinuierliche Anfragen nach einem Greedy-Algorithmus zerlegt. Die Zuweisung der so entstandenen Teilanfragen wird mittels eines Clustering-Ansatzes durchgeführt. Dadurch sollen die Teilanfragen möglichst wenigen Knoten zugeordnet werden. Daum et al. [DLBMW11] haben ein Verfahren für eine automatisierte, kostenoptimale Anfrageverteilung vorgestellt. Sie verfolgen damit andere Ziele als das in dieser Arbeit vorgestellte Konzept, bei dem es viel mehr um Flexibilität und Modularität als um Automatisierung geht. Jedes hier vorgestellte DSMS besitzt seine eigene Vorgehensweise (mit eigenen Vor- und Nachteilen), kontinuierliche Anfragen im Netzwerk zu verteilen. Jedoch bieten sie – im Gegensatz zu dem hier vorgestellten Konzept – nicht die Flexibilität und Modularität, um Verteilungsstrategien bei Bedarf zu wechseln. Sie erlauben häufig nur mit großem Aufwand neue, evtl. domänenspezifische Strategien zu implementieren und einzusetzen. 3 Konzept Häufig werden kontinuierliche Anfragen vom Nutzer deklarativ gestellt. Anschließend wird der Anfragetext i.d.R. in eine sprachenunabhängige Struktur übersetzt, die sich an Anfragepläne von DBMSs orientiert: die sogenannten logischen Operatorgraphen. Sie können als gerichtete, azyklische Graphen interpretiert werden, wobei die Knoten die Operatoren repräsentieren. Die Kanten stellen die Datenströme zwischen den Operatoren dar. Ein logischer Operator beschreibt, welche Operation auf einen Datenstrom ausgeführt 53 werden soll (bspw. Selektion, Projektion, Join). Sie beinhaltet jedoch nicht die konkrete Implementierung. 
Diese wird erst bei der tatsächlichen Ausführung der (Teil-) Anfragen auf einem Knoten eingesetzt. Aufgrund der Unabhängigkeit von der Sprache und von der Implementierung basiert das hier vorgestellte Konzept auf logischen Operatorgraphen. Es beschreibt also, wie die Verteilung eines logischen Operatorgraphen flexibel und erweiterbar durchgeführt werden kann. Wie bereits erwähnt, ist eine einzige, fest programmierte Vorgehensweise zur Anfrageverteilung häufig nicht praktikabel. Dementsprechend wird in dieser Arbeit eine mehrschrittige Vorgehensweise verfolgt: (1) Partitionierung, (2) Modifikation und (3) Allokation. Abbildung 1 soll den Zusammenhang der Phasen zueinander verdeutlichen. Für jede Phase werden mehrere Strategien zur Verfügung gestellt, aus denen der Nutzer für jede kontinuierliche Anfrage wählen kann. Als Alternative ist eine automatisierte Auswahl vorstellbar, diese wird jedoch in dieser Arbeit nicht weiter verfolgt.

Abbildung 1: Phasen der Verteilung kontinuierlicher Anfragen (logischer Operatorgraph, Partitionierung, Modifikation, Allokation, verteilte Teilgraphen auf den Knoten A–D).

Ausgehend von einem logischen Operatorgraphen wird in der ersten Phase, der Partitionierung, der Graph in Teilgraphen zerlegt, wobei keinerlei Änderungen am Graphen vorgenommen werden (z.B. neue Operatoren). Diese Teilgraphen werden dann modifiziert, um weitere Eigenschaften im Graphen sicherzustellen. Dabei können die Teilgraphen ergänzt, verändert oder auch entfernt werden. Beispielsweise wurde in der Abbildung der mittlere Teilgraph repliziert. Anschließend wird in der Allokation eine Zuordnung zwischen Teilgraph und ausführenden Knoten hergestellt und der Teilgraph schließlich übermittelt. In dieser Phase werden die Teilgraphen nicht mehr verändert.

1 #NODE_PARTITION <Partitionierungsstrategie> <Parameter>
2 #NODE_MODIFICATION <Modifikationsstrategie> <Parameter>
3 #NODE_ALLOCATE <Allokationsstrategie> <Parameter>
4
5 kontinuierliche Anfrage

Listing 1: Selektion der Strategien zur Verteilung einer kontinuierlichen Anfrage.

Die Auswahl der Strategien wird hier dem Nutzer überlassen. Ein Beispiel, wie der Nutzer eine kontinuierliche Anfrage im DSMS Odysseus [AGG+12] verteilen und entsprechende Strategien auswählen kann, ist in Listing 1 zu sehen. Zu Beginn gibt der Nutzer die Strategie für jede Phase an (Zeilen 1 bis 3). Mittels Parameter können die Strategien weiter verfeinert werden (bspw. Replikationsgrad). Anschließend wird die eigentliche Anfrage formuliert, welche anhand der gewählten Kombination der Strategien verteilt wird. Der Nutzer muss den Anfragetext nicht speziell anpassen, um diese verteilen zu können. Es müssen lediglich zuvor die eingesetzten Strategien spezifiziert werden. Für eine andere Anfrage können wiederum andere Strategien gewählt werden. In den folgenden Abschnitten werden die Phasen zur Verteilung genauer beschrieben und beispielhafte Strategien kurz erläutert.

3.1 Partitionierung

Aufgrund der Tatsache, dass eine gegebene kontinuierliche Anfrage in einem Netzwerk verteilt werden soll, werden einzelne Knoten häufig Teile der Anfrage erhalten. Dementsprechend muss zuvor entschieden werden, wie die Anfrage zerlegt werden soll. Das bedeutet in diesem Fall, dass der logische Operatorgraph in mehrere Teilgraphen zerlegt werden muss. Teilgraphen beschreiben somit, welche logischen Operatoren zusammen auf einem Knoten ausgeführt werden sollen.
Es ist dabei wichtig, dass die Teilgraphen zueinander disjunkt sind, d.h., jeder logischer Operator der kontinuierlichen Anfrage befindet sich in genau einem Teilgraphen. Die logischen Operatoren müssen innerhalb eines Teilgraphen jedoch nicht zusammenhängend sein. Das Finden einer geeigneten Zerlegung eines Graphen ist NP-hart [BMS+ 13]. Dementsprechend wird vorgeschlagen, mehrere Strategien zur Partitionierung anzubieten. Eine Partitionierungsstrategie erhält einen logischen Operatorgraphen und liefert eine Menge an disjunkten Teilgraphen. Einige Partitionierungsstrategien sind die folgenden: QueryCloud Der logische Operatorgraph wird als ein Teilgraph behandelt. Das bedeutet, dass keine Zerlegung durchgeführt wird. Dies ist nützlich, wenn es sich um eine Anfrage mit einer geringen Zahl an logischen Operatoren handelt und eine Zerlegung nicht praktikabel erscheint. OperatorCloud Jeder logischer Operator ist ein Teilgraph. Dies repräsentiert die maximale Zerlegung des Operatorgraphen. Diese Strategie ist für kontinuierliche Anfragen interessant, die wenige, jedoch sehr komplexe logische Operatoren beinhalten. OperatorSetCloud In vielen kontinuierlichen Anfragen verursachen die zustandsbehafteten Operatoren die meiste Systemlast, wie bspw. Aggregationen [GJPPM+ 12]. Die Partitionierungsstrategie OperatorSetCloud zerlegt den logischen Operatorgraphen, sodass jeder Teilgraph maximal einen solchen Operator enthält. Dadurch ist es möglich, die zustandsbehafteten Operatoren verschiedenen Knoten zuzuteilen, sodass die Systemlast besser im Netzwerk verteilt werden kann. Nutzerbasiert Falls es die Anfragesprache erlaubt, kann der Nutzer direkt angeben, welche Operatoren zusammen auf einem Knoten ausgeführt werden sollen. Diese Strategie ist besonders für Evaluationen praktisch, da damit bestimmte Szenarien der Verteilung nachgestellt und reproduziert werden können. 55 Auf Details wird im Rahmen dieser Arbeit verzichtet. Die ersten drei Strategien wurden dem Vorbild des DSMS StreamCloud nachempfunden und werden in [GJPPM+ 12] genauer erläutert. Es ist möglich, dass Entwickler weitere Strategien konzipieren und einsetzen. Dadurch können in konkreten Anwendungsszenarien bspw. spezielle Eigenschaften der Knoten und des Netzwerks ausgenutzt werden. 3.2 Modifikation In der zweiten Phase wird der logische Operatorgraph modifiziert, um weitere Eigenschaften in der kontinuierlichen Anfrage sicherzustellen. Dieser Schritt ist optional und kann übersprungen werden, wenn keine Änderungen am logischen Operatorgraphen notwendig sind. Sind jedoch mehrere Modifikationen notwendig, kann diese Phase mehrfach durchgeführt werden. An dieser Stelle sind ebenfalls unterschiedliche Möglichkeiten vorstellbar. Jede Modifikationsstrategien erhält als Eingabe die Menge an Teilgraphen aus der ersten Phase. Die Ausgabe beinhaltet eine modifizierte Menge an Teilgraphen. In dieser Phase ist ebenfalls vorgesehen, dass Entwickler weitere Strategien konzipieren und einsetzen. Jedoch wurden in der vorliegenden Arbeit folgende Modifikationsstrategien betrachtet: Replikation Jeder Teilgraph wird (u. U. mehrfach) repliziert. Dadurch kann jeder Teilgraph auf mehreren Knoten ausgeführt werden. Solange mindestens ein Knoten den Teilgraphen ausführt, können (Zwischen-) Ergebnisse berechnet und gesendet werden. Horizontale Fragmentierung Ähnlich zur Replikation wird jeder Teilgraph repliziert, jedoch empfängt jede Kopie nur einen Teil des Datenstroms. Die Ergebnisse werden am Ende wieder zusammengefasst. 
Damit kann die Verarbeitung parallelisiert werden, was bei besonders hohen Datenraten oder komplexen Berechnungen sinnvoll ist. Eine Illustration beider Strategien ist in Abbildung 2 zu sehen.

Abbildung 2: Verwendung der Replikations- (links) und der horizontalen Fragmentierungsstrategie (rechts).

Links ist der Einsatz der Replikation als Modifikationsstrategie zu sehen: Der Teilgraph wird kopiert, und die replizierten (Teil-) Ergebnisse werden mittels eines speziellen Merge-Operators vereinigt. Der Merge-Operator erkennt und entfernt Duplikate in den Datenströmen, sodass die Replikation die Verarbeitungsergebnisse nicht unnötig vervielfacht. Im Rahmen der horizontalen Fragmentierung werden Teilgraphen ebenfalls kopiert (in der Abbildung rechts). Der vorgelagerte Fragment-Operator zerlegt den eintreffenden Datenstrom in disjunkte Fragmente, die parallel verarbeitet werden (bspw. mittels Hashwerten der Datenstromelemente). Der Union-Operator vereinigt die Teilmengen schließlich zu einem Datenstrom. In der Modifikationsphase ist ebenfalls vorgesehen, dass Entwickler eigene Strategien entwickeln und einsetzen (z.B. vertikale Fragmentierung).

3.3 Allokation

Die dritte Phase – die Allokation – beinhaltet die Aufgabe, Teilanfragen den Knoten im Netzwerk zur Ausführung zuzuordnen. Auch hier sind in Abhängigkeit zum vorliegenden Netzwerk verschiedene Vorgehensweisen vorstellbar, sodass im Rahmen dieser Arbeit mehrere Strategien betrachtet werden. Jede Allokationsstrategie erhält als Eingabe die Menge an (ggfs. replizierten und/oder fragmentierten) Teilgraphen. Die Ausgabe umfasst eine 1:n-Zuordnung zwischen ausführenden Knoten und Teilanfragen. Das bedeutet, dass ein Knoten mehrere Teilgraphen erhalten kann, jedoch wird jeder Teilgraph genau einem Knoten zugeordnet. Folgende Allokationsstrategien wurden bisher verfolgt:

Nutzerbasiert: Der Nutzer gibt die Zuordnung vor (bspw. über eine grafische Oberfläche oder durch Annotationen im Anfragetext). Der Nutzer kann über Spezialwissen verfügen, das es ermöglicht, eine optimale Zuordnung anzugeben.

Round-Robin: Die Teilgraphen werden der Reihe nach an die Knoten verteilt.

Lastorientiert: Die Teilgraphen werden an die Knoten verteilt, welche aktuell die geringste Auslastung aufweisen. Durch diese Vorgehensweise kann die Systemlast im Netzwerk verteilt werden.

Contract-Net: Für jeden Teilgraphen wird eine Auktion ausgeschrieben, und jeder Knoten kann bei Interesse ein Gebot abgeben. Ein Gebot beschreibt die Bereitschaft, den Teilgraphen zu übernehmen. Dabei kann ein Gebot aus verschiedenen Faktoren zusammengesetzt werden. Beispielsweise spielt die Menge an verfügbaren Systemressourcen eine Rolle: Je mehr Ressourcen frei sind, desto besser kann der Teilgraph ausgeführt werden. Der Knoten mit dem höchsten Gebot erhält schlussendlich den Teilgraphen.

Ist das Netzwerk bekannt und sind alle Knoten homogen, kann Round-Robin für eine schnelle und einfache Verteilung der Anfrage genutzt werden. Sollen Auslastungen der Knoten berücksichtigt werden, ist die lastorientierte Strategie vorzuziehen. Sie kann auch eingesetzt werden, wenn die Knoten über unterschiedliche Leistungskapazitäten verfügen. Contract-Net sollte benutzt werden, wenn die Autonomie der Knoten zu berücksichtigen ist (d.h., die Knoten entscheiden selbstständig, welche Teilgraphen sie ausführen wollen).
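Zur Veranschaulichung skizziert der folgende Python-Ausschnitt zwei der genannten Allokationsstrategien (Round-Robin und lastorientiert). Die Datenstrukturen und Bezeichner sind frei gewählt und nicht Teil der Odysseus-Implementierung; die Skizze dient lediglich der Illustration der 1:n-Zuordnung zwischen Knoten und Teilgraphen.

    from itertools import cycle

    # Vereinfachte Skizze: Jeder Teilgraph wird genau einem Knoten zugeordnet,
    # ein Knoten kann mehrere Teilgraphen erhalten (1:n-Zuordnung).

    def allocate_round_robin(subgraphs, nodes):
        """Teilgraphen der Reihe nach auf die Knoten verteilen."""
        node_iter = cycle(nodes)
        return {sg: next(node_iter) for sg in subgraphs}

    def allocate_load_based(subgraphs, node_load):
        """Jeden Teilgraphen dem aktuell am wenigsten ausgelasteten Knoten zuweisen."""
        load = dict(node_load)                  # aktuelle Auslastung pro Knoten
        assignment = {}
        for sg, cost in subgraphs.items():      # cost: geschätzte Zusatzlast des Teilgraphen
            node = min(load, key=load.get)
            assignment[sg] = node
            load[node] += cost                  # Auslastung fortschreiben
        return assignment

    teilgraphen = {"TG1": 0.2, "TG2": 0.5, "TG3": 0.1}
    print(allocate_round_robin(teilgraphen, ["KnotenA", "KnotenB"]))
    print(allocate_load_based(teilgraphen, {"KnotenA": 0.3, "KnotenB": 0.1}))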
Das in dieser Arbeit vorgestellte Konzept sieht vor, dass Entwickler eigene Allokationsstrategien implementieren können, um bspw. domänenspezifisches Wissen einzusetzen oder spezielle Netzwerkstrukturen zu berücksichtigen.

4 Aktueller Stand

Das oben beschriebene Konzept wurde in Odysseus [AGG+12] als zusätzliche Komponente implementiert. Jede oben genannte Strategie ist verfügbar und kann in kontinuierlichen Anfragen eingesetzt werden. Dadurch kann Odysseus in unterschiedlichen Netzwerk-Architekturen eingesetzt werden, ohne dass umfangreiche Änderungen an der Verteilung durchgeführt werden müssen (es muss lediglich die Strategieauswahl angepasst werden). Im Folgenden werden zwei Anwendungsbeispiele von Odysseus vorgestellt, die aufzeigen sollen, wie das oben genannte Konzept flexibel eingesetzt werden kann.

Anwendungsfall 1: In einem Anwendungsfall wird Odysseus auf mehreren Knoten in einem heterogenen und autonomen P2P-Netzwerk eingesetzt, um Sportereignisse in Echtzeit auszuwerten. Die Daten werden mit Hilfe von aktiven Sensoren aufgenommen und an das Netzwerk gesendet. Die Sensoren sind dabei an spielrelevanten Entitäten wie den Spielern und dem Ball angebracht. Die Analyse geschieht mit zuvor verteilten kontinuierlichen Anfragen. Da zum einen ein solches Sensornetzwerk mehrere tausend Datenstromelemente pro Sekunde erzeugen kann und zum anderen die Analyse dieser Daten teuer ist, bietet sich eine Anfrageverteilung an, die in Listing 2 dargestellt ist.

1 #NODE_PARTITION operatorsetcloud
2 #NODE_MODIFICATION fragmentation horizontal hash n
3 #NODE_ALLOCATE contractnet
4
5 originale Analyse-Anfrage

Listing 2: Beispielhafte Verwendung der Anfrageverteilung für eine Sportanalyse.

Konkret sollen Operatoren, die viel Last erzeugen, auf unterschiedlichen Knoten ausgeführt werden (OperatorSetCloud-Partitionierungsstrategie). Dadurch werden unterschiedliche Sportanalysen von verschiedenen Knoten des Netzwerks übernommen. Zusätzlich soll der Datenstrom fragmentiert werden, um die Systemlast für einzelne Knoten zu verringern (Modifikationsstrategie hash-basierte horizontale Fragmentierung mit n Fragmenten). Da die Knoten heterogen und autonom sind, wird in diesem Fall die Contract-Net-Allokationsstrategie eingesetzt.

Anwendungsfall 2: In einem Windpark liefert jedes Windrad kontinuierlich Statusinformationen, die als Datenstrom interpretiert werden (z.B. die aktuell erzeugte Energie sowie Windrichtung und -geschwindigkeit). Diese Datenströme werden zur Überwachung und Kontrolle der Windräder benötigt und an ein homogenes Cluster aus Odysseus-Instanzen gesendet. Es können spezielle Datenstromelemente versendet werden, die Alarmmeldungen oder Störungen signalisieren. Aus diesem Grund ist es wichtig, dass jedes Datenstromelement (jede Alarmmeldung oder Störung) verarbeitet wird, auch wenn ein Knoten in dem verarbeitenden Cluster ausfällt. Listing 3 zeigt eine mögliche Anfrageverteilung für dieses Szenario unter der Verwendung eines homogenen Clusters.

1 #NODE_PARTITION querycloud
2 #NODE_MODIFICATION replication n
3 #NODE_ALLOCATE roundrobin
4
5 originale Überwachungs-Anfrage

Listing 3: Beispielhafte Verwendung der Anfrageverteilung für eine Windpark-Überwachung.

Konkret sollen alle Operatoren der Anfrage auf einem Knoten ausgeführt werden, da die Cluster-Knoten mit ausreichend Ressourcen ausgestattet sind (QueryCloud-Partitionierungsstrategie).
Zusätzlich wird der Datenstrom repliziert, um die Ausfallsicherheit zu erhöhen und um Alarmmeldungen nicht zu verlieren (Modifikationsstrategie Replikation mit n Replikaten). Als Allokator kommt die Round-Robin-Strategie zum Einsatz, da es sich um ein homogenes Cluster mit identischen Knoten handelt. 5 Zusammenfassung Für eine Anfrageverteilung in verteilten DSMSs gibt es je nach Netzwerk-Architektur und Anwendungsfall unterschiedliche Strategien. Viele Systeme haben sich auf eine Verteilungsstrategie spezialisiert, wodurch sie nur mit großem Aufwand an eine Änderung der Netzwerk-Architektur angepasst werden können. Außerdem ist die Anfrageverteilung in vielen Systemen nicht durch den Nutzer konfigurierbar, was es unmöglich macht anwendungsspezifische Kenntnisse einzubringen. Aus diesem Grund wurde in dieser Arbeit ein modularer Konzept für eine flexible und erweiterbare Anfrageverteilung in verteilten DSMSs vorgestellt. Das Konzept sieht dabei einen logischen Operatorgraphen als Eingabe vor und liefert verteilte Teilgraphen als Ausgabe. Strukturell umfasst er drei Schritte: (1) Partitionierung, (2) Modifikation und (3) Allokation. Bei der Partitionierung wird der logische Operatorgraph in disjunkte Teilgraphen zerlegt, um die Operatoren zu identifizieren, die gemeinsam auf einem Knoten im Netzwerk ausgeführt werden sollen. Die optionale Modifikation erlaubt es Mechanismen wie Fragmentierung oder Replikation zu verwenden, indem die Teilgraphen modifiziert werden. In der Allokationsphase werden die einzelnen (modifizierten) Teilgraphen Knoten im Netzwerk zugewiesen. Für jeden der drei Schritte gibt es Schnittstellen, wodurch unabhängige Strategien miteinander kombiniert werden können. Dieser modulare Aufbau ermöglicht zum einen eine individuelle Anfrageverteilung. Zum anderen können bereits vorhandene Strategien aus anderen Arbeiten und Systemen (z.B. eine Allokationsstrategie) integriert werden. In dieser Arbeit wurden für jeden der drei Teilschritte beispielhafte Strategien vorgestellt. Das Konzept wurde im DSMS Odysseus als zusätzliche Komponente implementiert und erfolgreich in verschiedenen Anwendungsszenarien eingesetzt. Zwei Anwendungsszena- 59 rien wurden in dieser Arbeit kurz vorgestellt: (1) Die Sportanalyse in Echtzeit mittels einem P2P-Netzwerk aus heterogenen Knoten und (2) die Überwachung eines Windparks. Odysseus musste in beiden Anwendungsfällen lediglich bei der Strategieauswahl angepasst werden. Dies zeigt, dass das oben beschriebene Konzept zur Verteilung kontinuierlicher Anfragen flexibel und erweiterbar ist. Literatur [AGG+ 12] H.-Jürgen Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: a highly customizable framework for creating efficient event stream management systems. DEBS ’12, Seiten 367–368. ACM, 2012. [BMS+ 13] Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders und Christian Schulz. Recent Advances in Graph Partitioning. CoRR, abs/1311.3144, 2013. [CBB+ 03] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Cetintemel, Ying Xing und Stan Zdonik. Scalable Distributed Stream Processing. In CIDR 2003 - First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, January 2003. [DLBMW11] Michael Daum, Frank Lauterwald, Philipp Baumgärtel und Klaus Meyer-Wegener. Kalibrierung von Kostenmodellen für föderierte DSMS. In BTW Workshops, Seiten 13–22, 2011. [GAW09] Buğra Gedik, Henrique Andrade und Kun-Lung Wu. 
A Code Generation Approach to Optimizing High-performance Distributed Data Stream Processing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, Seiten 847–856, New York, NY, USA, 2009. ACM. [GJPPM+ 12] Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patino-Martinez, Claudio Soriente und Patrick Valduriez. StreamCloud: An Elastic and Scalable Data Streaming System. IEEE Transactions on Parallel and Distributed Systems, 23(12):2351–2365, 2012. [KSKR05] Richard Kuntschke, Bernhard Stegmaier, Alfons Kemper und Angelika Reiser. Streamglobe: Processing and sharing data streams in grid-based p2p infrastructures. In Proceedings of the 31st international conference on Very large data bases, Seiten 1259–1262. VLDB Endowment, 2005. [LHKK12] Simon Loesing, Martin Hentschel, Tim Kraska und Donald Kossmann. Stormy: an elastic and highly available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, EDBT-ICDT ’12, Seiten 55–60, New York, NY, USA, 2012. ACM. [TTS+ 14] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal und Dmitriy V. Ryaboy. Storm@twitter. In SIGMOD Conference, Seiten 147–156, 2014. [WK09] Daniel Warneke und Odej Kao. Nephele: Efficient Parallel Data Processing in the Cloud. In Proceedings of the 2Nd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS ’09, Seiten 8:1–8:10, New York, NY, USA, 2009. ACM. 60 Placement-Safe Operator-Graph Changes in Distributed Heterogeneous Data Stream Systems Niko Pollner, Christian Steudtner, Klaus Meyer-Wegener Computer Science 6 (Data Management) Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) [email protected], [email protected], [email protected] Abstract: Data stream processing systems enable querying continuous data without first storing it. Data stream queries may combine data from distributed data sources like different sensors in an environmental sensing application. This suggests distributed query processing. Thus the amount of transferred data can be reduced and more processing resources are available. However, distributed query processing on probably heterogeneous platforms complicates query optimization. This article investigates query optimization through operator graph changes and its interaction with operator placement on heterogeneous distributed systems. Pre-distribution operator graph changes may prevent certain operator placements. Thereby the resource consumption of the query execution may unexpectedly increase. Based on the operator placement problem modeled as a task assignment problem (TAP), we prove that it is NP-hard to decide in general whether an arbitrary operator graph change may negatively influence the best possible TAP solution. We present conditions for several specific operator graph changes that guarantee to preserve the best possible TAP solution. 1 Introduction Data stream processing is a well suited technique for efficient analysis of streaming data. Possible application scenarios include queries on business data, the (pre-)processing of measurements gathered by environmental sensors or by logging computer network usage or online-services usage. In such scenarios data often originate from distributed sources. Systems based on different software and hardware platforms acquire the data. 
With distributed data acquisition, it is feasible to distribute query processing as well, instead of first sending all data to a central place. Some query operators can be placed directly on or near the data acquisition systems. This omits unnecessary transfer of data that are not needed to answer the queries, and partitions the processing effort. Thus querying high frequency or high volume data becomes possible that would otherwise require expensive hardware or could not be processed at all. Also data acquisition devices like wireless sensor nodes profit from early operator execution. They can save energy if data is filtered directly at the source. 61 Problem Statement Optimization of data stream queries for a distributed heterogeneous execution environment poses several challenges: The optimizer must decide for each operator on which processor it should be placed. Here and in the remainder of this article, the term processor stands for a system that is capable to execute operators on a data stream. A cost model is a generic base for the operator placement decision. It can be adapted to represent the requirements of specific application scenarios, so that minimal cost represents the best possible operator distribution. Resource restrictions on the processors and network links between them must also be considered. In a heterogeneous environment costs and capacities will vary among the available processors. The optimizer can optimize the query graph before and after the placement decision. Pre-placement changes of the query graph may however foil certain placements and a specific placement limits the possible post-placement algebraic optimization. For example, a change of the order of two operators can increase costs if the first operator of the original query was available directly on the data source and the now first operator in the changed query is not. This must be considered when using common rules and heuristics for query graph optimization. Contribution We investigate the influence of common algebraic optimization techniques onto a following operator placement that is modeled as a task assignment problem (TAP). We prove that the general decision whether a certain change of the query graph worsens the best possible TAP solution is NP-hard. We then present analysis of different common operator graph changes and state the conditions under which they guarantee not to harm the best possible placement. We do not study any special operator placement algorithm, but focus on preconditions for graph changes. Article Organization The following section gives an overview on related work from both the fields of classical database query optimization and data stream query optimization. Sect. 3 introduces the TAP model for the operator placement. It is the basis for the following sections. We prove the NP-hardness of the query-graph-change influence in Sect. 4 and present the preconditions for special graph changes in Sect. 5. The next section shows how to use the preconditions with an exemplary cost model for a realistic query. In the last section we conclude and present some ideas for further research. 2 Related Work This section presents related work on operator graph optimization from the domains of data base systems (DBS) and data stream systems (DSS). Due to space limitations, we are unfortunately only able to give a very rough overview. Query optimization in central [JK84] as well as in distributed [Kos00] DBS is a well studied field. 
Basic ideas like operator reordering are also applicable to DSS. Some operators, especially blocking operators, however, have different semantics. Other techniques like the optimization of data access have no direct match in DSS. Strict resource restrictions are also rarely considered with distributed DBS because they are not expected to run on highly restricted systems.

The authors of [HSS+ 14] present a catalog of data stream query optimizations. For each optimization, realistic examples, preconditions, its profitability, and dynamic variants are listed. Besides operator graph changes the article also presents other optimizations, like load shedding, state sharing, operator placement and more. They pose the question for future research in which order different optimizations should be performed. In the paper at hand, we take a first step in this direction by studying the influence of operator graph changes on subsequent placement decisions. We detail the impact of all five operator graph changes from [HSS+ 14]. We think that these changes cover the common query graph optimizations.

The articles [TD03] and [NWL+ 13] present different approaches to dynamic query optimization. The basic idea is that the order in which tuples visit operators is dynamically changed at runtime. The concept of distributed Eddies from [TD03] decides this on a per-tuple basis. It does not take the placement of operators into account. Query Mesh [NWL+ 13] precreates different routing plans and decides at runtime which plan to use for a set of tuples. It does not consider distributed query processing.

3 Operator Placement as Task Assignment Problem

The operator placement can be modeled as a TAP. Operators are represented as individual tasks. We use the following TAP definition, based on the definition in [DLB+ 11]. P is the set of all query processors. L is the set of all communication channels. A single communication channel l ∈ L is defined as l ⊆ {P × P}. A communication channel subsumes the communication between processors that share a common medium. T is the set of all operators. The data rate (in Byte) between two operators is given by rt1t2, t1, t2 ∈ T. The operators and rates represent the query graph. ctp are the processing costs of operator t on processor p. kp1p2 gives the cost of sending one Byte of data between processor p1 and processor p2. The costs are based on some cost model according to the optimization goal. Since cost models are highly system- and application-specific, we do not assume a specific cost model for our study of query-graph-change effects. Sect. 6 shows how to apply our findings to an exemplary cost model. [Dau11, 98–121] presents methods for the estimation of operator costs and data rates.

The distribution algorithm tries to minimize the overall cost. It does this by minimizing term (1), considering the constraints (2)–(5). The sought variables are xtp. xtp = 1 means that task t is executed on processor p. The first sum in equation (1) is the overall processing cost. The second sum is the total communication cost.

min  Σ_{t∈T} Σ_{p∈P} ctp · xtp  +  Σ_{t1∈T} Σ_{p1∈P} Σ_{t2∈T} Σ_{p2∈P} kp1p2 · rt1t2 · xt1p1 · xt2p2   (1)

subject to

Σ_{t∈T} ctp · xtp ≤ b(p),   ∀p ∈ P   (2)

Σ_{t1∈T} Σ_{t2∈T} Σ_{(p1,p2)∈l} rt1t2 · xt1p1 · xt2p2 ≤ d(l),   ∀l ∈ L   (3)

Σ_{p∈P} xtp = 1,   ∀t ∈ T   (4)

xtp ∈ {0, 1},   ∀p ∈ P, ∀t ∈ T   (5)

Constraint (2) limits the tuple processing cost of the operators on one processor to its capacity b(p).
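To make the formulation concrete before the remaining constraints are discussed, the following small sketch enumerates all placements of a toy query over two processors and picks the cheapest assignment that satisfies constraints (2) and (3); constraints (4) and (5) hold by construction of the enumeration. It only illustrates the TAP model and is not part of the paper or of any particular distribution algorithm; all operator names, rates and capacities are invented.

from itertools import product

T = ["source", "filter", "aggregate"]                     # operators (tasks)
P = ["sensor", "server"]                                  # processors
c = {("source", "sensor"): 1.0, ("source", "server"): 2.0,
     ("filter", "sensor"): 2.0, ("filter", "server"): 1.0,
     ("aggregate", "sensor"): 5.0, ("aggregate", "server"): 1.0}   # c_tp
r = {("source", "filter"): 100.0, ("filter", "aggregate"): 10.0}   # r_t1t2 in Byte
k = {("sensor", "server"): 0.01, ("server", "sensor"): 0.01,
     ("sensor", "sensor"): 0.0, ("server", "server"): 0.0}         # k_p1p2 per Byte
b = {"sensor": 4.0, "server": 10.0}                       # processing capacity b(p)
channels = [{("sensor", "server"), ("server", "sensor")}] # L, one shared channel
d = [150.0]                                               # channel capacity d(l) in Byte

def total_cost(x):
    # objective (1): processing costs plus communication costs
    processing = sum(c[t, x[t]] for t in T)
    communication = sum(k[x[t1], x[t2]] * rate for (t1, t2), rate in r.items())
    return processing + communication

def is_feasible(x):
    # constraint (2): processing load per processor must not exceed b(p)
    if any(sum(c[t, p] for t in T if x[t] == p) > b[p] for p in P):
        return False
    # constraint (3): data rate per communication channel must not exceed d(l)
    for l, cap in zip(channels, d):
        if sum(rate for (t1, t2), rate in r.items() if (x[t1], x[t2]) in l) > cap:
            return False
    return True   # constraints (4) and (5) hold by construction

placements = [dict(zip(T, choice)) for choice in product(P, repeat=len(T))]
best = min((x for x in placements if is_feasible(x)), key=total_cost)
print(best, total_cost(best))

Real distribution algorithms do not enumerate placements like this, of course; the sketch only shows how the objective and the constraints interact.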
Constraint (3) limits the communication rate on one communication channel to its capacity d(l). Constraints (4) and (5) make sure that each task is distributed to exactly one single processor. Our findings are solely based on the objective function together with the constraints. We do not assume any knowledge about the actual distribution algorithm. There exist different heuristics for solving a TAP. See e.g. [DLB+ 11] and [Lo88]. 4 Generic Operator-Graph-Change Influence Decision A single algebraic query transformation changes the TAP in numerous ways. For example an operator reordering changes multiple data rates, which are part of multiple equations inside the TAP. When some of those factors increase, it is hard to tell how it affects a following operator placement. The transformed query might even become impossible to execute. One way to determine the usefulness of a given transformation is to compare the minimum costs of both the original and the transformed query graph. If the transformed query graph has lower or equal cost for the optimal operator placement, i.e. lower or equal minimum cost, than the original query, the transformation has a non-negative effect. U (Q) denotes the query graph that results from applying a change U to the original query Q. Since the operator placement needs to solve a TAP, an NP-complete problem, it is not efficient to compute the placement for each possible transformation. A function CompareQuery(Q, U (Q)), that compares two queries and returns true iff U (Q) has smaller or equal minimal costs than Q would solve the problem. Sentence. CompareQuery(Q, U (Q)) is NP-hard. Definition. Utp is a transformation that allows task t only to be performed by processor p. All other aspects of Utp (Q) are identical to Q. Both Q and Utp (Q) have equal costs when operators are placed in the same way, i.e. as long as t is placed on p. Proof. Given CompareQuery(Q, U (Q)) and transformations Utp it is possible to compute the optimal distribution. For each task t it is possible to compare Q and Utp (Q) for 64 each processor p. If CompareQuery(Q, Utp (Q)) returns true Utp (Q) has the same minimum cost as Q. Thus the optimal placement of t is p. The algorithm in pseudo code: ComputeDistribution(Q) { foreach (t in Tasks) { foreach (p in Processors) { if (CompareQuery(Q, U_tp(Q)) == true) { DistributeTaskProcessor(t, p); // makes sure t will be distributed to p break; // needed if multiple distributions exist } } } } ComputeDistribution(Q) calls CompareQuery(Q, U (Q)) at most |T | · |P | times. This is a polynomial time reduction of ComputeDistribution(Q). To compute the optimal distribution it is necessary to solve the TAP, an NP-complete problem. This proves that CompareQuery(Q, U (Q)) is NP-hard. 5 Specific Query Graph Changes While the general determination of a query graph change’s impact is NP-hard, it can easily be determined for specific cases. If a transformation neither increases variables used for the TAP nor adds new variables to the TAP, it is trivial to see that all valid operator placement schemes are still valid after the transformation. For the transformed query exist operator placements with lower or equal costs than the original query’s costs: the original optimal placement is still valid and has lesser or equal costs. We establish preconditions for all the five operator graph changes from [HSS+ 14]. If the preconditions are met the transformation is safe. 
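As an aside, the reduction sketched in the proof of Sect. 4 can be transcribed into executable form. The following sketch assumes that CompareQuery is available as an oracle function compare_query(Q, U(Q)) and that u_tp(Q, t, p) builds the transformation Utp(Q); both are placeholder names for this illustration, not an implementation from the paper.

def compute_distribution(query, tasks, processors, compare_query, u_tp):
    # compare_query(q1, q2): assumed oracle, True iff q2 has minimal costs
    # smaller than or equal to those of q1 (the CompareQuery function above).
    # u_tp(q, t, p): assumed helper building U_tp(Q), which allows task t
    # to run only on processor p.
    placement = {}
    for t in tasks:
        for p in processors:
            pinned = u_tp(query, t, p)
            if compare_query(query, pinned):
                placement[t] = p   # equal minimum cost, so placing t on p is optimal
                break              # needed if multiple optimal distributions exist
    return placement               # uses at most |T| * |P| oracle calls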
That means for each valid operator placement scheme of the original query exists a valid scheme for the transformed query with equal costs. So the preconditions especially guarantee that the minimum costs do not rise. However, if heuristic algorithms are used for solving the TAP, they may fail to find an equally good solution for the transformed query as they did for the original query and vice versa, because local minima may change. Table 1 shows all preconditions at a glance. We now justify why these preconditions hold. Notation Most of the used notation directly follows from the TAP, especially ctp and rt1 t2 . The cost cAp of an operator A on the processor p depends on the input stream of A and thus on the overall query executed before that operator. Query graph changes affect the input streams of operators an thus also change the costs needed to execute those operators. In order to distinguish between the original and the changed query we use U (A) to indicate the operator A with the applied query change. U (A) and A behave in the same way, but may have different cost, since they work on different input streams. The costs cU (A)p are needed to execute U (A) on p and the following operator t receives an input stream with 65 the data rate rU (A)t . In addition rI denotes the input data stream and rO denotes the output data stream. Operator Reordering Operator reordering switches the order of two consecutive operators. The operator sequence A → B is transformed to U (B) → U (A). In the original query, operator A is placed on processor pA and operator B on pB . It is possible that pA is the same processor as pB , but it is not known whether both operators are on the same processor, so this cannot be assumed. pA = pB would result in a set of preconditions that are easier to fulfill than the preconditions we present. The transformed query can place the operators U (B) and U (A) on any of the processors pA and pB . Case 1: U (B) is placed on pA and U (A) is placed on pB . To insure the validity of all distributions, the transformed operators’ cost must not exceed the cost of the other original operator, which results in equations (6) and (7). Since the reordering affects the data rates between operators, precondition (8) must hold. Case 2: Both operators are placed on pA . This adds an internal communication inside pA to the operator graph. Equation (9) ensures that internal communication is not factored into the TAP constraints and cost function. The sum of the cost for both transformed operators must be smaller or equal than the cost of A, which is described by equation (10). The changed data rates are reflected in equation (11). Case 3: Both operators are placed on pB . This case is similar to case 2 and can be fulfilled with the preconditions given by equations (9), (12) and (13). Case 4: The remaining option, U (B) is placed on pB and U (A) is placed on pA , can be viewed as changed routing. Since the remaining distribution of the query is unknown, the changed routing can be problematic and this option is inherently not safe. It is possible that pA processes the operator that sends the input to A and that pB has an operator that processes the output stream of B. In this situation the changed routing causes increased communication cost, since the tuples must be send from pA to pB (applying B) to pA (applying A) to pB instead of only sending them once from pA to pB . If one of the operators has more than one input stream not all cases can be used. 
Even if the stream does not need to be duplicated, if A has additional input streams only case 2 is valid. The other cases are not safe anymore, because the transformation changes the routing of the second stream from destination pA to destination pB. Similarly, if B has additional input streams only case 3 is safe.

Redundancy Elimination This query change eliminates a redundant operator: the query graph has an operator A at two different positions processing the same input stream, duplicated by another operator. This change works by removing one of the instances of A and duplicating its output. The original query consists of three operators. Operator D (Dup Split in [HSS+ 14]) is placed on pD, while an instance of A is placed both on p1 and p2. The transformed query consists of the operators U(A) and U(D), with U(D) duplicating the output instead of the input. The only possibility to place the transformed query without changing routing is to place both U(A) and U(D) on pD. The additional internal communication, due to the additional operator on pD, again forces equation (9). To ensure that any processor can perform the transformed operators, equation (14) is necessary. In some situations (when pD is the same processor as p1 or p2) the change is safe as long as A does not increase the data rate. But since it is unknown how the operators will be placed, this requirement is not sufficient.

Transformation | Case | Preconditions (∀p ∈ P)
Operator reordering | Case 1: U(B) on pA, U(A) on pB | cAp ≥ cU(B)p (6); cBp ≥ cU(A)p (7); rAB ≥ rU(B)U(A) (8)
Operator reordering | Case 2: U(B) on pA, U(A) on pA | kpp = 0 ∧ ∀l ∈ L: (p, p) ∉ l (9); cAp ≥ cU(B)p + cU(A)p (10); rAB ≥ rO (11)
Operator reordering | Case 3: U(B) on pB, U(A) on pB | (9); cBp ≥ cU(B)p + cU(A)p (12); rAB ≥ rI (13)
Redundancy elimination | – | (9); cDp ≥ cU(A)p + cU(D)p (14)
Operator separation | – | (9); cAp ≥ cA1p + cA2p (15)
Fusion | Case 1: all on pA | cAp ≥ cCp (16); rAB ≥ rO (17)
Fusion | Case 2: all on pB | cBp ≥ cCp (18); rAB ≥ rI (19)
Fission | – | (9); cAp ≥ cSp + cMp + Σ_{U(A)} cU(A)p (20)

Table 1: Preconditions for safe query graph changes that must be fulfilled for all processors. If an operator is not available on some processors, the preconditions can be assumed fulfilled for these processors. It is sufficient that the preconditions of one case are fulfilled.

Operator Separation The operator separation splits an operator A into the two operators A1 → A2. Additional internal communication results in precondition (9). Equation (15) ensures that the separated operators' costs are together less than or equal to A's cost.

Fusion Fusion is the opposite transformation to operator separation. The two operators A → B are combined into the single operator C (a superbox in [HSS+ 14]). For the original query A is placed on pA and B on pB. The combined operator can be placed on either pA or pB. The cost of C must not exceed the cost of A on pA or of B on pB, respectively. In addition, the data rates are affected and thus also add preconditions. So either the fulfillment of equations (16) and (17) (if C is placed on pA) or of (18) and (19) (if C is placed on pB) guarantees the safety of this change. A special case of fusion is the elimination of an unneeded operator, i.e. removing the operator does not change the query result. Since the redundant operator can change the data rate of a stream (e.g.
a filter applied before a more restrictive filter) it still needs to fulfill the preconditions to be safe.

Fission The original query is only the single operator A. Fission replaces A by a partitioned version of it, by applying a split operator S, multiple versions of U(A), which can potentially be distributed across different processors, and finally a merge operator M to unify the streams again. Since it is unknown whether other processors exist that can share the workload profitably, the transformed operators must be placed on the processor that executed the original A. This is safe when preconditions (9) and (20) hold. These equations demand that the combined costs of the split, the merge and all parallel versions of U(A) can be executed by every processor with smaller or equal cost than the original A.

6 Application

Given a query and a DSS it is now possible to test whether a specific change is safe. Using an exemplary cost model we examine a simple example query.

Cost Model [Dau11, 91–98] presents a cost model that will be used for the following example. We use a filter and a map operator, which have the following costs:

CFilter = λi · CFil + λo · CAppendOut   (21)
CMap = λi · Cproj + λo · CAppendOut   (22)

CFil and Cproj are the costs associated with filtering respectively projecting an input tuple arriving at the operator. CAppendOut represents the cost of appending one tuple to the output stream. λi is the input stream tuple rate, while λo is the output stream tuple rate. For these operators λo is proportional to λi, and the equations (21) and (22) can be simplified to λi · fOp, where fOp is the cost factor of operator O on processor p for one tuple. Using these simplified equations and the assumptions that the tuple rate is proportional to the data rate and that costs and selectivities are non-zero, equations (6) to (8) can be rewritten as:

λI · fAp ≥ λI · fBp  ⇔  fAp / fBp ≥ 1   (23)
σA · λI · fBp ≥ σU(B) · λI · fAp  ⇔  (σA / σU(B)) · (fBp / fAp) ≥ 1   (24)
σA · λI ≥ σU(B) · λI  ⇔  σA / σU(B) ≥ 1   (25)

The equations for the other two cases shown in Table 1 can be rewritten similarly. Equations (23) to (25) show that there are relatively few values to compare: we need the ratio of the operator selectivities and, for each processor, the ratio of the operator costs.

Example We examine the simple query of a map operator M followed by a filter F applied to a stream containing image data monitoring conveyor belts transporting freshly produced items. The query supports judging the quality of the current production run. Operator M classifies each tuple (and thus each observed produced item) into one of several quality classes and is rather expensive. F filters the stream for one conveyor belt, because different conveyor belts transport different items and are observed by different queries. M does not change the data rate of the stream. It simply replaces the value unclassified already stored inside the input stream for each tuple with the correct classification and thus has a selectivity of 1. There are multiple types of processors available inside the production hall. Depending on the processor type the ratio fMp / fFp differs quite a bit, but overall M is more expensive: this ratio fluctuates between 2 and 10. Equations (23) to (25) show that the selection push-down is always safe if σU(F) is smaller than or equal to 0.1: in this case it is always possible that the two operators switch their places without violating additional constraints of the TAP. If σU(F) is greater than 0.1 this change is not necessarily safe.
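The decision rule derived above can be written down directly. The following sketch checks the rewritten preconditions (23)–(25) for the map/filter example; it is only an illustration, and the concrete cost-factor ratios are assumed values within the range of 2 to 10 mentioned in the text.

def reordering_safe_case1(f_A, f_B, sigma_A, sigma_UB):
    # preconditions (23)-(25) for one processor under the simplified cost model
    eq23 = f_A / f_B >= 1.0                            # from (6): cAp >= cU(B)p
    eq24 = (sigma_A / sigma_UB) * (f_B / f_A) >= 1.0   # from (7): cBp >= cU(A)p
    eq25 = sigma_A / sigma_UB >= 1.0                   # from (8): rAB >= rU(B)U(A)
    return eq23 and eq24 and eq25

# Example from the text: map M (selectivity 1) before filter F; the ratio
# fMp / fFp varies between 2 and 10 depending on the processor type.
sigma_M, sigma_UF = 1.0, 0.1
assumed_ratios = [2.0, 5.0, 10.0]                      # assumed fMp / fFp values
safe_on_all_processors = all(
    reordering_safe_case1(f_A=ratio, f_B=1.0, sigma_A=sigma_M, sigma_UB=sigma_UF)
    for ratio in assumed_ratios)
print(safe_on_all_processors)   # True as long as sigma_U(F) <= 0.1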
It is possible that the preconditions of one of the other two cases (both operators on the same processor) are fulfilled or another good distribution is possible, but the latter cannot be tested in a reasonable time as we discussed in Sect. 4. 69 7 Conclusion We presented our findings on the interaction between optimization through query graph changes and the placement of operators on different heterogeneous processing systems. We first motivated our research and defined the problem. Existing work on query optimization through operator graph changes in the context of DMS and DSS was presented, none of which studied the interaction with operator placement. The next section presented the TAP model of the distribution problem. We showed that it is NP-hard to decide in general if an arbitrary query graph change can negatively influence the best possible operator placement scheme. Based on a selection of common query graph changes from the literature, we deduced preconditions under which operator placement does not mind the changes. The last section showed the application of our findings with an exemplary cost model for a realistic query. The preconditions for safe operator graph changes are quite restrictive. They severely limit the possible changes if followed strictly. As with general query optimization, development of heuristics to loosen certain preconditions seems promising. The preconditions presented in this article are the basis for such future work. Another interesting field is the direct integration of query graph optimization in the usually heuristic distribution algorithms. Distribution algorithms could be extended to consider query graph changes in addition to the operator placement. We plan to investigate these ideas in our future research. References [Dau11] M. Daum. Verteilung globaler Anfragen auf heterogene Stromverarbeitungssysteme. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), 2011. [DLB+ 11] M. Daum, F. Lauterwald, P. Baumgärtel, N. Pollner, and K. Meyer-Wegener. Efficient and Cost-aware Operator Placement in Heterogeneous Stream-Processing Environments. In Proceedings of the 5th ACM International Conference on Distributed Event-Based Systems (DEBS), pages 393–394, New York, NY, USA, 2011. ACM. [HSS+ 14] M. Hirzel, R. Soulé, S. Schneider, B. Gedik, and R. Grimm. A Catalog of Stream Processing Optimizations. ACM Comput. Surv., 46(4):1–34, 2014. [JK84] M. Jarke and J. Koch. Query Optimization in Database Systems. ACM Comput. Surv., 16(2):111–152, 1984. [Kos00] D. Kossmann. The State of the Art in Distributed Query Processing. ACM Comput. Surv., 32(4):422–469, 2000. [Lo88] V. M. Lo. Heuristic Algorithms for Task Assignment in Distributed Systems. IEEE Transactions on Computers, 37(11):1384–1397, 1988. [NWL+ 13] R. V. Nehme, K. Works, C. Lei, E. A. Rundensteiner, and E. Bertino. Multi-route Query Processing and Optimization. J. Comput. System Sci., 79(3):312–329, 2013. [TD03] F. Tian and D. J. DeWitt. Tuple Routing Strategies for Distributed Eddies. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 333–344. VLDB Endowment, 2003. 70 Herakles: A System for Sensor-Based Live Sport Analysis using Private Peer-to-Peer Networks Michael Brand, Tobias Brandt, Carsten Cordes, Marc Wilken, Timo Michelsen University of Oldenburg, Dept. 
of Computer Science Escherweg 2, 26129 Oldenburg, Germany {michael.brand,tobias.brandt,carsten.cordes,marc.wilken,timo.michelsen}@uni-ol.de Abstract: Tactical decisions characterize team sports like soccer or basketball profoundly. Analyses of training sessions and matches (e.g., mileage or pass completion rate of a player) form more and more a crucial base for those tactical decisions. Most of the analyses are video-based, resulting in high operating expenses. Additionally, a highly specialized system with a huge amount of system resources like processors and memory is needed. Typically, analysts present the results of the video data in time-outs (e.g., in the half-time break of a soccer match). Therefore, coaches are not able to view statistics during the match. In this paper we propose the concepts and current state of Herakles, a system for live sport analysis which uses streaming sensor data and a Peer-to-Peer network of conventional and low-cost private machines. Since sensor data is typically of high volume and velocity, we use a distributed data stream management system (DSMS). The results of the data stream processing are intended for coaches. Therefore, the front-end of Herakles is an application for mobile devices (like smartphones or tablets). Each device is connected with the distributed DSMS, retrieves updates of the results and presents them in real-time. Therefore, Herakles enables the coach to analyze his team during the match and to react immediately with tactical decisions. 1 Introduction Sport is a highly discussed topic around the world. Fans, the media, teams and athletes discuss about performance, tactical decisions and mistakes. Statistics about sports are an important part of these discussions. Some statistics are calculated manually by people outside the playing field (e.g., counting ball contacts). Other statistics are retrieved automatically by computer-based analysis systems. However, such systems require expensive computers to calculate the statistics due to the huge amount of data to process. Not every team is able to buy and maintain them. Additionally, many systems need too much time to calculate the statistics, making it impossible to deliver them during the game. Many already existing sport analysis systems are using several cameras placed around the playing field. This increases the costs for buying and maintaining the system even further. In this paper, we propose Herakles, a live sport analysis system based solely on conventional low-cost hardware and software we are currently working on. Instead of using costly cameras, Herakles uses a network of sensors attached to all game-relevant objects, e.g., the players and the ball. Since sensor systems are usually not allowed during games and 71 opponents won’t wear the sensors anyway, Herakles focuses on the use for tryouts. The active sensors regularly send their actual position, acceleration and other values as data streams. Therefore, Herakles uses a distributed data stream management system (DSMS). With such a DSMS, it is possible to compute relevant sport statistics in real-time. In our work, real-time means ”near to the actual event”(rather than in timeouts or after a match). But using a single machine with a DSMS installed invokes the possibility of single-pointof-failures: (1) a single DSMS instance would run the risk of overload and (2) the live analysis stops completely if the single DSMS instance fails. 
Therefore, Herakles uses multiple machines to share the processing load and to increase the reliability of the entire system. Currently, many already existing distributed DSMSs are using computer clusters, grids, etc. (e.g., [vdGFW+ 11]). Since Herakles focuses on using a collection of smaller low-cost private machines like notebooks, a private Peer-to-Peer (P2P) network is more feasible. In our work, each peer is considered to be heterogeneous and autonomous. This makes it possible to use these types of private machines for distributed data stream processing of sensor data for live sport analysis. Although the peers are low-cost private machines, the system costs depend on the used sensor network. To show the calculated sport statistics in a convenient way for different users, Herakles focuses on mobile devices for the presentation (e.g., smartphones or tablets). With this, the statistics can be directly shown in real-time. Additionally, Herakles abstracts from specific sensors as well as from a specific sport. Statistics for different sports can be calculated using different sensors but based on the same system. Subsequently, these statistics have to be presented in a way that the technological background is encapsulated from the user, e.g., the user does not need to configure the P2P network. The remainder of this paper is structured as follows: Section 2 gives an overview of related sport analysis systems. The concept of Herakles is explained in Section 3. In Section 4 we give an overview on the current state of our implementation. Section 5 gives a prospect of future work and Section 6 concludes this paper. 2 Related Work There are many existing sport analysis systems developed and used. However, the majority of them is built on cameras. Many of them are commercial products like Synergy Sports Technology1 or Keemotion2 . Other projects are scientific. For instance, Baum [Bau00] provides a video-based system for performance analysis of players and objects. It uses high-speed video cameras and was tested in the context of baseball. As mentioned earlier, there are many problems related to video-based systems, especially the high acquisition and maintenance costs. Leo et al. [LMS+ 08] refer to a video-based soccer analysis of players and objects. In their approach, the ball and players are tracked by six video cameras. Then, the video data is 1 http://corp.synergysportstech.com/ 2 http://www.keemotion.com/ 72 received by six machines and processed by a central supervisor. The arising problems of this approach are the same as for Baum [Bau00]. Smeaton et al. [SDK+ 08] propose an analysis of the overall health of the sportsperson in the context of football games. Their sensor-based approach includes body sensors in combination with video recordings. The location is tracked by GPS, but the data can only be reviewed after the game and their approach currently handles only one person at a time. Von der Grün [vdGFW+ 11] proposes a sensor-based approach called RedFIR. It uses expensive hardware for the real-time analysis (e.g., SAP HANA for a german soccer team3 ), which leads to high acquisition and maintenance costs like in [Bau00]. Additionally, their data is processed in an in-memory database instead of a DSMS. To our best knowledge, there is no sport analysis system that uses a network of conventional, low-cost machines for processing streaming sensor data and that shows the results on a mobile device in real-time. 3 Concept In this section, we give an overview of Herakles. 
Its architecture is shown in figure 1 and is divided into three components: (1) sensor network, (2) P2P network of DSMSs and (3) mobile device.

Figure 1: Architecture of Herakles with sensors, a P2P network and a mobile device (sensor data streams flow from the position sensors via the position receivers into the P2P network of DSMSs; queries and sport statistics are exchanged with the mobile device).

We equip important game entities with sensors to form the sensor network. Each of these sensors sends the position and other information like speed and acceleration of the respective entity. This information is captured by receivers, which are placed around the playing field. These receivers are connected to the next component, the P2P network of conventional (private) machines with DSMSs installed. The DSMSs receive the sensor data stream. To be independent of a specific sport and sensor technology, Herakles first separates the incoming raw sensor data into multiple intermediate schemata (data abstraction). The DSMSs use continuous queries built on top of these intermediate schemata to continuously analyze the data and to calculate the sport statistics. With this, we can replace the sensors without changing the continuous queries.

3 http://tinyurl.com/o82mub5

The query results are streamed to the mobile device, the last component in the architecture of Herakles. The mobile device visualizes the statistics, and the user can decide at any time which statistics should be shown at the moment. Inside Herakles, each statistic is mapped to a continuous query, which is distributed, installed and executed in the P2P network. The distribution is necessary to make sure that no single peer processes a complete query (avoiding overloads).

In the following sections, we give a more detailed view of the components of Herakles. In section 3.1, we describe our data abstraction. The P2P network used and the distributed DSMS with its features are explained in section 3.2. Finally, the presentation on mobile devices is explained in section 3.3.

3.1 Data Abstraction

To improve reusability and flexibility, Herakles separates the sensor data into different layers. In doing so, we can avoid direct dependencies between calculated statistics and a sensor's technology and data format. An overview of Herakles' data abstraction is shown in figure 2.

Figure 2: Data abstraction layers ranging from the sensor schema up to sport-specific schemata (Raw Sensor Data, Intermediate Schema, Generic Movement Data, and sport-specific Events such as basketball- or soccer-specific events).

The sensor network sends its data as raw sensor data. Inside the P2P network, this data is converted into a common intermediate schema for standardization. This includes, for example, a unit conversion. The intermediate schema improves the interchangeability in Herakles: while the sensor's data format can change, the intermediate schema stays unchanged. When different sensors are used, only the adapter between raw sensor data and the intermediate schema has to be rewritten. Other parts of Herakles, including the continuous queries for the sport statistics, stay unchanged. In our opinion, generic movement information is interesting for more than one sport and should be reused. Therefore, on top of the intermediate schema, we differentiate between generic movement data (e.g., the player's position and running speed) and sport-specific events (e.g., shots on goal for soccer).
Sport-specific events need to be created for each sport individually (e.g., basketball and soccer). By separating generic movement data from sport-specific events, we can reuse parts of continuous queries in different sports. In ge- 74 neral, Herakles uses continuous queries built on each other to transform the data: one continuous query receives the raw sensor data, transforms it into the intermediate schema and provides the results as an artificial data source. This source can be reused from other continuous queries to determine the movement data, identifying sport-specific events etc. Another aspect of the data abstraction is the storage and the access to static values. For instance, the positions of the soccer-goals are needed in many statistics (e.g., for counting shots on goal or identifying goals itself). Another example is the identification, which sensor is attached to which player or ball. Since Herakles uses a decentralized and dynamic P2P network, a central server is not applicable. Additionally, it is not feasible to define the static values on each peer individually. Therefore, Herakles needs a decentralized way to share these static information across the network. The data abstraction of Herakles uses a so-called distributed data container (DDC). It reads information (e.g., from a file) and distributes them automatically to other peers in the network without further configuration. Then, each peer can use these values inside its own DSMS. Changes at one peer in the network are propagated to the other peers automatically. With this, the user has to define the static values only at one peer. 3.2 Data Stream Management System Herakles uses a distributed DSMS in a P2P network for processing the sensor data. Since the peers are private machines (e.g., notebooks or mobile devices), the peers are considered to be autonomous and heterogeneous [VLO09]. To use a distributed DSMS in such a P2P network with enough performance, reliability, availability and scalability for live sports analysis, multiple mechanisms must be in place. Query distribution makes it possible to use more than one peer for query processing (to share system loads). With fragmentation, the data streams can be split and processed in parallel on multiple peers. Replication increases the reliability of the system by executing a query on multiple peers at the same time. Therefore, results are available, even if a peer failure occurs [Mic14]. Recovery also reacts to peer failures in order to restore the previous state of the distributed system at run-time. Dynamic load balancing monitors the resource usage of the peers in the network at run-time and shifts continuous queries from one peer to another. In our opinion, a P2P network of DSMS instances with all features mentioned above can handle the challenges of processing and analyzing data from active sensors in the context of live sport analysis. 3.3 Presentation The coach wants an agile and lightweight device to access the data. Therefore, we chose mobile devices like smartphones or tables as front-end. These devices are connected to 75 the P2P network (e.g., via Wi-Fi) and receive the result streams of the continuous queries. The presentation of the calculated sport statistics should be fast and easy to support the coaches’ decisions during the game. Therefore, the results have to be aggregated and semantically linked to provide a simple and understandable access to information. We interviewed different coaches to identify the needed information. 
We classified this information into (1) player statistics, (2) team statistics and (3) global statistics. Player statistics provide information about a specific player (e.g., ball contacts or shots on goal). This view can be used to track the behavior of a single player in significant situations. Team statistics provide information about an entire team (e.g., ball possession or pass completion rate). An overview of the current game is provided by the global statistics view. All of these views are updated in real-time since the mobile device continuously receives the statistic values.

4 Implementation status

Currently, Herakles is under development and many functions and mechanisms are already in place. Therefore, no evaluation results are available yet. However, we tested Herakles with the analysis of a self-organized basketball game. In this section, we give a brief overview of the current state of our implementation. At first, we give a description of our sensor network and data abstraction (section 4.1). The distributed DSMS used and its relevant features are explained in section 4.2. Section 4.3 contains a description of the sport statistics used. Section 4.4 describes the presentation application, which is used by the coach to see the statistics in real-time.

4.1 Sensor networks and data abstraction

We already tested a GPS sensor network for outdoor and a Wi-Fi sensor network for indoor sports. In both sensor networks, position data is provided by mobile applications using sensors of the mobile device. That makes both sensor networks low-priced. To test the approach, we equipped 12 players and the ball with a device running the sensor application and let them compete in a short basketball game while recording the resulting data.

The DDC and the different layers of the data abstraction component are implemented. The DDC can be filled in two ways: (1) by reading a file or (2) by messages from other peers. The DDC generates messages to be sent to other DDC instances on other peers, resulting in a consistent state throughout the P2P network.

The intermediate schema for the sensor data is shown in listing 1. x, y and z have to be sent by the sensors (but can be converted if they are in a different unit, e.g., in centimeters), whereas ts can be calculated by the DSMS, as can v and a (using two subsequent elements sent by the same sensor). In our opinion, measuring the positions in millimeters and the time in microseconds is sufficient for most sport statistics. Currently, we have implemented adapters to wrap the data streams from the testing sensor networks mentioned above into our intermediate schema.

Listing 1: Intermediate schema.
sid - unique ID
ts  - timestamp [microseconds]
x   - x-position [mm]
y   - y-position [mm]
z   - z-position [mm]
v   - current absolute velocity [mm/s]
a   - current absolute acceleration [mm/s²]

4.2 Distributed Data Stream Management System

For our distributed DSMS, we use Odysseus, a highly customizable framework for creating DSMSs. Its architecture consists of easily extensible bundles, each of them encapsulating specific functions [AGG+ 12]. Odysseus provides all basic functions for data stream processing and has already been extended for distributed execution in a P2P network of heterogeneous and autonomous peers. Furthermore, mechanisms for query distribution, fragmentation and replication are already available [Mic14]. Because of its current functionality and its extensibility, we decided to use Odysseus for Herakles.
However, Odysseus did not have all the features we need to implement a reliable real-time sport analysis system. It did not support dynamic load balancing (for shifting continuous queries at run-time) and recovery. Therefore, we extended Odysseus with these mechanisms. For the dynamic load balancing, we implemented a communication protocol and a simple load balancing strategy. For recovery, we implemented a combination of active standby and upstream backup. Active standby means that a continuous query is executed multiple times on different peers (similar to replication). However, only the streaming results of one peer are used. But if that peer fails (or leaves the network on purpose), another peer with a copy of the query replaces it. With upstream backup, a peer saves the stream elements which have been sent to its subsequent peer until that peer indicates that it has processed said stream elements. If the subsequent peer fails, another peer can install the lost continuous query again and the processing can be redone with the previously saved stream elements.

4.3 Live Statistics

Analyzing the data and calculating useful statistics in terms of continuous queries is an essential part of Herakles. Typically, queries are described by a declarative query language. To avoid complex query declarations on the mobile devices, we implemented our own query language called SportsQL. It is a compact language especially designed for sport statistic queries in Herakles. If the user selects a statistic to show, the mobile application of Herakles generates a corresponding SportsQL query, which is sent to an Odysseus instance. The translation of the SportsQL query into an executable continuous query is done inside Odysseus. With this decision, the continuous queries and the mobile application are decoupled from each other: we can create and improve the complex continuous queries in Odysseus without changing the mobile application. Furthermore, any other device can also use SportsQL without changing Odysseus. An example of SportsQL is shown in listing 2.

Listing 2: Example of SportsQL for the shots on goal of a specific player with the id 8.
{
    "statisticType": "player",
    "gameType": "soccer",
    "entityId": 8,
    "name": "shotsongoal"
}

In this query, a mobile device requests a statistic about the current number of shots on goal for a specific player. The attribute name identifies the statistic to generate and the attribute gameType identifies the analyzed game (in this case soccer). The attribute statisticType differentiates between the team statistics, player statistics and global statistics mentioned in section 3.1. In this example, a player statistic is specified. Therefore, an entityId refers to the player for which this statistic should be created. It is possible to send further parameters within a SportsQL query, such as time and space parameters. This can be used to limit the query results, e.g., to a specific range of time or to a specific part of the game field.

4.4 Presentation

We implemented an extensible Android application intended for tablets and smartphones. Android has been chosen because it is common and allows the application to be run on different kinds of devices. Figure 3 shows a screenshot of our current application.

5 Future Work

In our current implementation of Herakles, we focus on soccer-specific statistics to show the proof of concept. But there are a lot of possibilities for more advanced statistics and views. Depending on the sport, we want to support additional statistics.
Consequently, we plan to enhance our mobile application, the SportsQL query language and the corresponding continuous queries in the P2P network. Herakles works with radio-based sensors reducing the costs compared to video-based systems. Therefore, it focuses on training sessions or friendly games. Nevertheless, the probability to get a license to use such a system within an official, professional match is low 78 Figure 3: Mobile application for the coach with statistics and a game topview. whereas video-based systems are more accepted. An extension could be to use a videobased system to get the player positions. Currently, we do not consider inaccuracies in sensor data streams. For example, some sensors send inaccurate data or no data at all for a certain time. Therefore, there can be anomalies in sensor data [ACFM14] and sport statistics, which should be considered in Herakles in the future. Additionally, we do not face the problems of security, which rise with the use of private P2P networks. Currently, we expect that each peer is cooperative and does not want to damage the system on purpose. But this assumption must be weakened in the future: sensor data streams have to be encrypted, peers need to be checked (e.g., web of trust [GS00]) and a distributed user management should be in place. An open issue is the lack of evaluation. Currently, we are implementing the last steps. We made a few tests with a GPS sensor network and are about to begin with the evaluation. We plan to measure how much data Herakles can process in real-time, how fast it can react to sport-events (like interruptions), how accurate the sport statistics are and how much reliability and availability the P2P network really provides. 6 Summary In professional sports, complex and expensive computer-based analysis systems are used to collect data, to setup statistics, and to compare players. Most of them are video-based and not every team has the opportunity to buy and maintain them. With Herakles, we proposed an alternative sport analysis system, which uses a decentralized and dynamic P2P network of conventional private computers. A sensor network placed on the playing field (e.g., sensors attached to the players and balls) is continuously sending position data to this P2P network. On each peer, a DSMS is installed. This collection of DSMSs is used to cooperatively process the streaming sensor data in real-time. To be independent 79 from specific sensors, we designed a data abstraction layer, which separates the sensors from our continuous queries generating the sport statistics. Herakles presents the statistics on a mobile device, where the user immediately sees those statistics during the game. Users can choose other statistics at any time and the P2P network adapts to these changes automatically. In the P2P network, we use Odysseus, a component-based framework for developing DSMSs. Despite the data stream processing, it supports continuous query distribution, replication and fragmentation. We added dynamic load balancing and recovery mechanisms to fulfill our requirements. Finally, Herakles uses Android-based mobile devices to show the calculated statistics in real-time. There are still many tasks to do: primarily, extensions to other sports and statistics. We are about to begin with the evaluation to measure performance of efficiency of Herakles. But we are confident: With Herakles, we show that it is possible to analyze sport events in real-time with commodity hardware like notebooks. 
References [ACFM14] Annalisa Appice, Anna Ciampi, Fabio Fumarola und Donato Malerba. Data Mining Techniques in Sensor Networks - Summarization, Interpolation and Surveillance. Springer Briefs in Computer Science. Springer, 2014. [AGG+ 12] H.-Jürgen Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: a highly customizable framework for creating efficient event stream management systems. DEBS ’12, Seiten 367–368. ACM, 2012. [Bau00] C.S. Baum. Sports analysis and testing system, 2000. US Patent 6,042,492. [GS00] T. Grandison und M. Sloman. A survey of trust in internet applications. Communications Surveys Tutorials, IEEE, 3(4):2–16, Fourth 2000. [LMS+ 08] Marco Leo, Nicola Mosca, Paolo Spagnolo, Pier Luigi Mazzeo, Tiziana D’Orazio und Arcangelo Distante. Real-time multiview analysis of soccer matches for understanding interactions between ball and players. In Proceedings of the 2008 international conference on Content-based image and video retrieval, Seiten 525–534. ACM, 2008. [Mic14] Timo Michelsen. Data stream processing in dynamic and decentralized peer-to-peer networks. In Proceedings of the 2014 SIGMOD PhD symposium, Seiten 1–5. ACM, 2014. [SDK+ 08] Alan F. Smeaton, Dermot Diamond, Philip Kelly, Kieran Moran, King-Tong Lau, Deirdre Morris, Niall Moyna, Noel E O’Connor und Ke Zhang. Aggregating multiple body sensors for analysis in sports. 2008. [vdGFW+ 11] Thomas von der Grün, Norbert Franke, Daniel Wolf, Nicolas Witt und Andreas Eidloth. A real-time tracking system for football match and training analysis. In Microelectronic Systems, Seiten 199–212. Springer, 2011. [VLO09] Quang Hieu Vu, Mihai Lupu und Beng Chin Ooi. Peer-to-Peer Computing: Principles and Applications. Springer, 2009. 80 Bestimmung von Datenunsicherheit in einem probabilistischen Datenstrommanagementsystem Christian Kuka SCARE-Graduiertenkolleg Universität Oldenburg D-26129 Oldenburg [email protected] Daniela Nicklas Universität Bamberg D-96047 Bamberg [email protected] Abstract: Für die kontinuierliche Verarbeitung von unsicherheitsbehafteten Daten in einem Datenstrommanagementsystem ist es notwendig das zugrunde liegende stochastische Modell der Daten zu kennen. Zu diesem Zweck existieren mehrere Ansätze, wie etwas das Erwartungswertmaximierungsverfahren oder die Kerndichteschätzung. In dieser Arbeit wird aufgezeigt, wie die genannten Verfahren in ein Datenstrommanagementsystem verwendet werden können, umso eine probabilistische Datenstromverarbeitung zu ermöglichen und wie sich die Bestimmung des stochastischen Modells auf die Latenz der Verarbeitung auswirkt. Zudem wird die Qualität der ermittelten stochastischen Modelle verglichen und aufgezeigt, welches Verfahren unter welchen Bedienungen bei der kontinuierlichen Verarbeitung von unsicherheitsbehafteten Daten am effektivsten ist. 1 Einführung Für die qualitätssensitive Verarbeitung von Sensordaten ist es notwendig die aktuelle Qualität der Daten zu kennen. Eine der hierbei häufig verwendeten Qualitätsdimensionen ist der statistische Fehler von Sensormessungen. In vielen Fällen wird hierbei die aus dem Datenblatt stammende Kennzahl für die Standardabweichung herangezogen um das stochastische Modell im Sinne einer Normalverteilung zu verwenden. Jedoch kann das Rauschen eines Sensors von vielen Kriterien abhängen und sich vor allem auch dynamisch zur Laufzeit ändern. Eine Form der Qualitätsbestimmung besteht darin, direkt das zugrunde liegende stochastische Modell der Sensormessungen kontinuierlich neu zu ermitteln. 
Vor allem im Bereich der kontinuierlichen Verarbeitung von hochfrequenten Sensordaten ist es hierbei notwendig die Speicherkapazitäten des Systems zu beachten und die Daten so schnell wie möglich zu verarbeiten. Für diese Form der Verarbeitung existiert mittlerweile eine Vielzahl von Systemen, welche unter dem Begriff Datenstrommanagementsystem zusammengefasst werden können. Im Rahmen von Datenstrommanagementsystemen hat sich für die Verarbeitung von Unsicherheiten der Begriff der probabilistischen Datenstromverarbeitung [TPD+ 12, JM07, KD09] etabliert. Ziel der Verarbeitung ist es nicht nur den reinen Messwert, sondern die zugrunde liegende Unsicherheit innerhalb der Verarbeitung in einem Datenstrommanagementsystem zu repräsentieren und zu verarbeiten, so dass der entstehende kontinuierliche 81 Ausgabestrom einer Anfrage auch immer die aktuelle Ergebnisunsicherheit enthält. Bei der Verarbeitung von Unsicherheiten kann dabei zwischen zwei Klassen unterschieden werden, der Verarbeitung von diskreten Wahrscheinlichkeitsverteilungen und der Verarbeitung von kontinuierlichen Wahrscheinlichkeitsverteilungen. Diskrete Verteilungen werden häufig dazu genutzt die Existenzunsicherheit von möglichen Welten darzustellen. Kontinuierliche Wahrscheinlichkeitsverteilungen dienen dagegen dazu, Unsicherheiten in der Sensorwahrnehmung, welche etwa durch das Messverfahren an sich oder Umwelteinflüsse induziert werden, zu beschreiben. Im Folgenden liegt der Fokus daher auf der Bestimmung von kontinuierlichen stochastischen Modellen. Die Bestimmung von stochastischen Modellen auf Basis von Datenströmen bei Filteroperationen wurde unter anderem in [ZCWQ03] behandelt. Hierbei war allerdings das Ziel, das stochastische Modell zu verwenden, um das Rauschen um einen Selektionsbereich innerhalb der Verarbeitung zu bestimmen. Ziel dieser Arbeit ist es aber das mehrdimensionale stochastische Modell der Daten selbst zu bestimmen, um eine probabilistische Verarbeitung der Daten, wie sie in [TPD+ 12] mit dem Mischtyp-Modell eingeführt wurde, zu ermöglichen. Das Modell hat den Vorteil, dass es sowohl die Unsicherheit über die Existenz einzelner Attribute, sowie auch die Unsicherheit über die Existenz ganzer Tupel repräsentieren kann. Zur Evaluation von verschiedenen Verfahren zur Bestimmung und Verarbeitung der mehrdimensionalen stochastischen Modelle wurde diese probabilistische Verarbeitung mit den Konzepten der deterministischen Verarbeitung mit Zeitintervallen aus [Krä07] kombiniert und in dem Datenstrommanagementsystem Odysseus [AGG+ 12] implementiert. 2 Verfahren zur Bestimmung von stochastischen Modellen Für die Bestimmung von mehrdimensionalen stochastischen Modellen, wie sie bei der probabilitischen Datenstromverarbeitung verwendet werden, existieren prinzipiell mehrere Möglichkeiten. Zu diesen Verfahren zählen etwa das Erwartungswertmaximierungsverfahren und die Kerndichteschätzung, welche im Folgenden näher erläutert werden. 2.1 Erwartungsmaximierungsverfahren Das Erwartungswertmaximierungsverfahren [DLR77] dient dazu die Parameter eines stochastischen Modells durch mehrere Iterationen an die Verteilung von Daten anzunähern. Hierzu wird versucht die Log-Likelihood L zwischen den zu bestimmenden Parametern und den zur Verfügung stehenden Daten in jeder Iteration t des Algorithmus zu maximieren. Als Parameter bieten sich hierfür die Parameter einer multivariaten Mischverteilung aus Gauß-Verteilungen mit Parameter θ = {wi , µi , Σi }m i=1 an. 
Eine multivariate Mischverteilung aus Gauß-Verteilungen über eine kontinuierliche Zufallsvariable X ist eine Menge von m gewichteten Gauß-Verteilungen X1, X2, . . . , Xm, wobei X die Wahrscheinlichkeitsdichtefunktion

fX(x) = Σ_{i=1}^{m} wi · fXi(x)   mit   fXi(x) = 1 / ((2π)^(k/2) · |Σi|^(1/2)) · exp(−(1/2) · (x − µi)^T Σi^(−1) (x − µi))

besitzt. Dabei gilt, dass 0 ≤ wi ≤ 1 und Σ_{i=1}^{m} wi = 1, k die Größe des Zufallsvektors ist und jede Mischverteilungskomponente Xi eine k-variate Gauß-Verteilung mit Erwartungswert µi und Kovarianz-Matrix Σi ist.

Zur Annäherung einer Gauß-Mischverteilung wird zunächst ein initiales stochastisches Modell mit m Gauß-Verteilungen bestimmt. Auf Basis des aktuellen Modells werden nun im E-Schritt die Erwartungswerte bestimmt, also die Wahrscheinlichkeiten, dass die aktuellen Werte aus dem aktuellen stochastischen Modell generiert wurden:

τij^(t) = wj^(t) · fXj(xi; θj^(t)) / Σ_{l=1}^{m} wl^(t) · fXl(xi; θl^(t)),   i = 1, . . . , n,  j = 1, . . . , m

γj^(t) = Σ_{i=1}^{n} τij^(t),   j = 1, . . . , m

Während des M-Schrittes werden die neuen Parameter für θ anhand der Ergebnisse aus dem E-Schritt bestimmt:

wj^(t+1) = γj^(t) / n,   j = 1, . . . , m

µj^(t+1) = (1 / γj^(t)) · Σ_{i=1}^{n} τij^(t) · xi,   j = 1, . . . , m

Σj^(t+1) = (1 / γj^(t)) · Σ_{i=1}^{n} τij^(t) · (xi − µj^(t+1))(xi − µj^(t+1))^T,   j = 1, . . . , m

Nach jedem EM-Schritt wird die Log-Likelihood berechnet und mit einem gegebenen Schwellwert verglichen. Ist die Differenz kleiner als der gegebene Schwellwert oder überschreitet die Anzahl der Iterationen die maximale Anzahl, werden die bestimmten Parameter für die Gewichte (w), den Erwartungswert (µ) sowie die Kovarianz-Matrix (Σ) der Mischverteilung zurückgeliefert.

2.2 Kerndichteschätzung

Im Gegensatz zum EM-Verfahren wird bei der Kerndichteschätzung (KDE) für jeden Messwert eine Komponente in einer Mischverteilung erstellt und eine Bandbreite bestimmt. Die Bandbreite dient dazu, eine Varianz-/Kovarianz-Matrix für alle Komponenten der Mischverteilung zu bilden und so das eigentliche zugrunde liegende Modell möglichst gut wiederzugeben. Zur Bestimmung der Bandbreite B haben sich mehrere Verfahren etabliert, wie etwa die Scott-Regel [Sco92]. Die Parameter der Komponenten der Mischverteilung lassen sich somit wie folgt berechnen:

wj = 1/n,   µj = xj,   Σj = Σ(x) · B

wobei Σ(x) die Varianz/Kovarianz der zugrunde liegenden Daten repräsentiert. Man sieht bereits, dass die KDE ohne mehrmalige Iterationen über die zugrunde liegenden Daten auskommt, da sowohl der Erwartungswert wie auch die Varianz/Kovarianz inkrementell bestimmt werden können.

Da bei der Kerndichteschätzung die Anzahl an Komponenten der Mischverteilung linear mit der Zahl der Messwerte steigt und das Ergebnis somit generell ungeeignet für eine Verarbeitung in einem Datenstrommanagementsystem ist, wird ein Verfahren zur Reduktion der Komponenten benötigt. In [ZCWQ03] stellen die Autoren ein Verfahren vor, welches das KDE-Verfahren auf einen eindimensionalen Strom anwendet und die dabei resultierende Mischverteilung durch ein Kompressionsverfahren auf eine geringere Anzahl von Verteilungen reduziert. Dieses Verfahren ist allerdings nicht für multivariate Verteilungen anwendbar. In [CHM12] wurden Selbstorganisierende Merkmalskarten (SOM) verwendet, um Cluster zu bilden und diese Cluster durch eine Verteilung darzustellen. SOMs haben allerdings allgemein den Nachteil, dass die Gefahr einer Überanpassung der Gewichtsvektoren besteht. Eine weitere Möglichkeit zur Reduktion der Komponenten besteht in dem Bregman Hard Clustering Verfahren [BMDG05].
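Bevor auf das Bregman Hard Clustering eingegangen wird, veranschaulicht die folgende Skizze (kein Bestandteil der Arbeit) eine minimale Kerndichteschätzung über einem Datenfenster; dabei wird angenommen, dass die Scott-Regel als skalarer Faktor B = n^(−2/(d+4)) auf die Kovarianz der Fensterdaten angewendet wird.

import numpy as np

def kde_mischverteilung(fenster):
    # fenster: Array der Form (n, d) mit den Messwerten eines Datenfensters
    x = np.asarray(fenster, dtype=float)
    n, d = x.shape
    B = n ** (-2.0 / (d + 4))              # Scott-Regel als skalarer Faktor (Annahme)
    sigma = np.cov(x, rowvar=False) * B    # Σj = Σ(x) · B für alle Komponenten
    gewichte = np.full(n, 1.0 / n)         # wj = 1/n
    mittelwerte = x.copy()                 # µj = xj
    return gewichte, mittelwerte, sigma

# Beispiel: 100 zweidimensionale Messwerte aus einem simulierten Datenfenster
w, mu, S = kde_mischverteilung(np.random.default_rng(0).normal(size=(100, 2)))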
Bei dem Bregman Hard Clustering Verfahren wird versucht, ähnliche Verteilungen innerhalb einer Mischverteilung durch die Bildung von Clustern zu vereinfachen. Hierbei werden zunächst Cluster mit je einem Repräsentanten gebildet und anschließend wird für jedes Cluster eine Minimierung ausgeführt mit dem Ziel, den Informationsverlust zwischen den Clusterzentren und den Komponenten zu minimieren. Das Verfahren kann als eine Generalisierung des Euklidischen k-Means-Verfahrens angesehen werden, wobei die Kullback-Leibler-Divergenz als Minimierungsziel verwendet wird. Um allerdings die Bestimmung des Integrals innerhalb der Kullback-Leibler-Divergenz zu umgehen, wird die Kullback-Leibler-Divergenz in eine Bregman-Divergenz umgewandelt. Die Bregman-Divergenz ist dabei definiert als:

DF(θj || θi) = F(θj) − F(θi) − ⟨θj − θi, ∇F(θi)⟩   (1)

Hierbei wird die Dichtefunktion einer Normalverteilung in die kanonische Dekomposition der jeweiligen Exponentialfamilie wie folgt umgeschrieben:

N(x; µ, σ²) = exp{⟨θ, t(x)⟩ − F(θ) + C(x)}   (2)

wobei θ = (θ1 = µ/σ², θ2 = −1/(2σ²)) die natürlichen Parameter, t(x) = (x, x²) die notwendige Statistik und F(θ) = −θ1²/(4θ2) + (1/2)·log(−π/θ2) die Log-Normalisierung für eine Normalverteilung darstellen. Unter der Bedingung, dass beide Verteilungen von der gleichen Exponentialfamilie stammen, lässt sich die Kullback-Leibler-Divergenz in die Bregman-Divergenz umformen:

KL(N(x; µi, σi²) || N(x; µj, σj²)) = DF(θj || θi)   (3)

so dass nun direkt die Bregman-Divergenz als Distanz innerhalb des k-Means-Verfahrens zur Clusterbildung angewendet werden kann.

3 Evaluation der Verfahren

Im Folgenden werden die Verfahren zur Bestimmung des stochastischen Modells der Daten eines Datenstroms hinsichtlich ihrer Latenz, aber auch hinsichtlich der Güte des stochastischen Modells evaluiert. Zu diesem Zweck wurden die Verfahren als Verarbeitungsoperatoren innerhalb des Odysseus-DSMS realisiert. Die Evaluation wurde dabei sowohl auf synthetischen Daten wie auch auf Daten aus einem Ultrabreitband-Positionierungssystem [WJKvC12] durchgeführt. Hierzu wurden 10.000 Messwerte aus einer Normalverteilung sowie aus einer logarithmischen Normalverteilung generiert, um einen Datenstrom aus Messwerten zu simulieren. Die Evaluation der Latenz und der Güte des Modells betrachtet dabei drei Szenarien mit Datenfenstern der Größe 10, 100 und 1000. Das Datenfenster definiert dabei die Anzahl an Messwerten, auf denen die Operatoren das stochastische Modell bestimmen sollen. Die Güte des Modells betrachtet das aktuell bestimmte stochastische Modell im Hinblick auf alle 10.000 Datensätze. Als Qualitätskriterium wird hierzu das Akaike-Informationskriterium (AIC) verwendet. Das AIC ist ein Maß für die relative Qualität eines stochastischen Modells für eine gegebene Datenmenge und ist definiert als:

AIC = 2k − 2 ln(L)   (4)

Der Parameter k repräsentiert hierbei die Anzahl der freien Parameter in dem stochastischen Modell und der Parameter L gibt die Likelihood zwischen dem stochastischen Modell und der gegebenen Datenmenge wieder. Dieses Informationskriterium ist für die Evaluation deshalb gut geeignet, da es sowohl die Nähe der generierten Mischverteilung aus den drei Verfahren zu den tatsächlichen Daten bewertet als auch die Anzahl der Komponenten innerhalb der Mischverteilungen in die Bewertung mit einfließen lässt.
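Die Berechnung des AIC nach Gleichung (4) lässt sich für eine Gauß-Mischverteilung etwa wie folgt skizzieren. Dies dient nur der Illustration und ist nicht der in der Arbeit verwendete Code; insbesondere ist die Zählung der freien Parameter eine Annahme für vollbesetzte Kovarianzmatrizen.

import numpy as np
from scipy.stats import multivariate_normal

def aic_mischverteilung(gewichte, mittelwerte, kovarianzen, daten):
    # AIC = 2k - 2 ln(L) nach Gleichung (4); L ist hier die Likelihood der Daten
    daten = np.asarray(daten, dtype=float)
    dichte = sum(w * multivariate_normal(mu, cov).pdf(daten)
                 for w, mu, cov in zip(gewichte, mittelwerte, kovarianzen))
    log_likelihood = np.sum(np.log(dichte))
    n, d = daten.shape
    m = len(gewichte)
    # freie Parameter (Annahme: vollbesetzte Kovarianzmatrizen):
    # (m - 1) Gewichte + m*d Erwartungswerte + m*d*(d+1)/2 Kovarianzeinträge
    k = (m - 1) + m * d + m * d * (d + 1) // 2
    return 2 * k - 2 * log_likelihood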
Die Nähe zu den tatsächlichen Daten ist wichtig für die Qualität der Verarbeitungsergebnisse, und die Komponentenanzahl der Mischverteilung hat eine Auswirkung auf die Latenz der Verarbeitung, da jede Komponente innerhalb einer Mischverteilung bei Operationen wie der Selektion oder dem Verbund mit einem Selektionskriterium bei einer probabilistischen Verarbeitung integriert werden muss. Um mögliche Ausreißer zu minimieren, wurde jede Evaluation 10-mal wiederholt. Als Testsystem diente ein Lenovo Thinkpad X240 mit Intel Core i7 und 8 GB RAM. Die verwendete Java-Laufzeitumgebung war ein OpenJDK Runtime Environment (IcedTea 2.5.2) (7u65-2.5.2-2) mit einer OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode). Bei dem Betriebssystem handelte es sich um ein Debian GNU/Linux mit einem 3.14 Kernel.

Das EM-Verfahren versucht, ein stochastisches Modell an die eingehenden Daten anzupassen. Dabei spielen neben der Datenfenstergröße die Anzahl der Iterationen, der Konvergenzschwellwert für die Veränderung der Log-Likelihood in jeder Iteration sowie die Komponentenanzahl der Mischverteilungen eine Rolle für die Latenz dieses Operators. Für die Evaluation wurde der Konvergenzschwellwert auf 0.001 gesetzt, die Anzahl an Iterationen auf 30 und die Zahl der Komponenten auf 2. Die gleiche Anzahl an Iterationen wird ebenfalls in der von V. Garcia bereitgestellten Java-Bibliothek jMEF (http://vincentfpgarcia.github.io/jMEF/) verwendet.

Abbildung 1: Latenz der Operatoren bei einem Datenfenster der Größe 100 für Daten aus einer logarithmischen Normalverteilung (Achsen: Elemente gegen Zeit in ms; Kurven: EM, Bregman, KDE)

Das KDE-Verfahren bestimmt für jeden Datenwert eine eigene Komponente in der resultierenden Mischverteilung. Der entwickelte Operator verwendet hierzu die Scott-Regel zur Bestimmung der Bandbreite der Kovarianzmatrix der Komponenten. Das Bregman Hard Clustering, welches in einem weiteren Schritt verwendet wird, um die Anzahl an Komponenten auf die gewünschte Zahl zu reduzieren, wurde mit einer maximalen Anzahl von 30 Iterationen konfiguriert. Um die Resultate vergleichbar zu halten, wurde der Operator so konfiguriert, dass er ebenfalls eine 2-komponentige Mischverteilung ermittelt, also zwei Cluster bildet. Der hier verwendete Konvergenzschwellwert für das Erwartungsmaximierungsverfahren liegt oberhalb des in der verwendeten Apache Commons Math3 Bibliothek (http://commons.apache.org/proper/commons-math) als Standardwert festgelegten Wertes von 0.00001, da sich in den Versuchen zeigte, dass bereits ein höherer Konvergenzschwellwert ausreichte, um die Verfahren hinsichtlich der Güte des stochastischen Modells und der gemessenen Latenz miteinander zu vergleichen.

3.1 Synthetische Sensordaten

Das Latenzverhalten der einzelnen Verfahren ist in Abb. 1 für Daten aus einer logarithmischen Normalverteilung für ein Datenfenster der Größe 100 dargestellt. Das EM-Verfahren weist hierbei eine gleichbleibend stabile Latenz von durchschnittlich ca. 200 Millisekunden auf.
Dies ist der mehrmaligen Iteration über die aktuell gültigen Daten zur Bestimmung der Log-Likelihood zwischen dem jeweils temporären stochastischen Modell und den Daten geschuldet. Im Gegensatz zum EM-Verfahren kann die Bandbreite bei der Kerndichteschätzung kontinuierlich bestimmt werden. Allerdings fällt auf, dass trotz mehrmaliger Wiederholung der Messung das Verfahren zum Bregman Hard Clustering eine deutlich höhere Latenz aufweist. Dieses Verhalten ist dabei unabhängig von der Art der Verteilung. Dies ist vor allem auf die Tatsache zurückzuführen, dass das Bregman Hard Clustering Verfahren in jeder Iteration die Bregman Divergenz zwischen den Clusterzentren und den einzelnen Komponenten bestimmen muss und zusätzlich noch den Zentroiden aus jedem Cluster in jeder Iteration neu ermitteln muss. Beim Vergleich der durchschnittlichen Latenz bei unterschiedlichen Größen von Datenfenstern zeigt sich, dass die Latenz des EM-Verfahrens konstant bleibt, während die Latenz des Bregman Hard Clusterings stark ansteigt.

Abbildung 2: Vergleich des AIC zwischen EM-Verfahren und KDE mit Bregman Hard Clustering bei unterschiedlichen Datensatzfenstergrößen (10, 100, 1000) für Werte aus einer Normalverteilung und einer logarithmischen Normalverteilung

Bei der Qualitätsbetrachtung des ermittelten stochastischen Modells fällt auf, dass das EM-Verfahren im Sinne des AIC bei Werten aus einer logarithmischen Normalverteilung deutlich besser abschneidet als das KDE-Verfahren in Kombination mit dem Bregman Hard Clustering. Bei Werten aus einer Normalverteilung dagegen unterscheidet sich der AIC-Wert der beiden Verfahren nur geringfügig. Ein gleiches Verhalten lässt sich auch bei Datensatzfenstern der Größe 1.000 beobachten. Ist allerdings die Anzahl an Datensätzen gering, ändert sich dieses Verhalten. Bei einem Datensatzfenster der Größe 10 zeigt sich unabhängig von dem zugrunde liegenden stochastischen Modell der Daten, dass die Kombination aus KDE und Bregman Hard Clustering das bessere Modell liefert. Zudem unterscheiden sich die Latenzen der beiden Verfahren bei dieser Datenmenge nur geringfügig.

3.2 Reale Sensordaten

Abbildung 3: Messwerte der Positionsbestimmung für die Positionen 1–8 (X-/Y-Position in mm)

Um zu zeigen, dass die Verfahren auch stochastische Modelle von echten Sensordaten erstellen können, wurden die Operatoren auf Sensordatenaufzeichnungen eines Ultrabreitband-Positionierungssystems [WJKvC12] angewendet. Insgesamt wurden 8 Positionen (vgl. Abbildung 3) bestimmt, von denen im Folgenden die Positionen 6 und 7 als repräsentative Positionen näher betrachtet werden. Hierbei wurde das stochastische Modell jeder Position mit dem EM-Verfahren und der Kombination aus KDE und Bregman Hard Clustering auf einem Datensatzfenster der Größe 10 und einem Datensatzfenster der Größe 100 bestimmt. Bei der Betrachtung der zeitlichen Bestimmung des stochastischen Modells in Abb. 4 fallen zunächst für die Position 6 anfängliche Ausreißer bei der Nähe zum Modell auf. Dies deutet auf eine anfängliche Anpassung der Positionierungsknoten der Anwendung hin.
In den darauf folgenden Messungen bleiben sowohl die Modellqualität des EM-Verfahrens, wie auch das resultierende Modell des Bregman Hard Clustering stabil. Wie bereits bei den synthetischen Daten ist auch bei realen Sensordaten das Phänomen erkennbar, dass die Kombination aus KDE mit Bregman Hard Clustering bei kleinen Datensatzfenstern im Vergleich zum EM-Verfahren bessere stochastische Modelle ermittelt. Dagegen ist bei größeren Datensatzfenstern das EM-Verfahren besser geeignet, um gute stochastische Modelle im Sinne des AIC zu bestimmen.

Abbildung 4: Qualität des stochastischen Modells über die Zeit bei einem Datensatzfenster der Größe 100 von Position 6 und 7 ((a) Position 6, (b) Position 7; Achsen: Messungen gegen AIC; Kurven: EM, Bregman)

4 Zusammenfassung und Ausblick

In dieser Arbeit wurden Verfahren zur kontinuierlichen Bestimmung des zugrunde liegenden mehrdimensionalen stochastischen Modells von Messwerten aus aktiven Datenquellen vorgestellt. Ziel ist es, diese mehrdimensionalen stochastischen Modelle in einem probabilistischen Datenstrommanagementsystem zu verarbeiten. Bei den Verfahren handelt es sich um das Erwartungsmaximierungsverfahren und die Kerndichteschätzung in Kombination mit dem Bregman Hard Clustering Ansatz. Zunächst wurden die Grundlagen der jeweiligen Verfahren aufgezeigt. Zur Repräsentation der Unsicherheiten wurde das in [Krä07] entwickelte Modell durch das Mischtyp-Modell [TPD+12] erweitert und in dem Odysseus DSMS realisiert. Bei der Evaluation der Verfahren wurde zunächst auf Basis von synthetischen Daten die Latenz der einzelnen Verfahren ermittelt. Hierbei zeigte sich, dass die Kombination aus Kerndichteschätzung und Bregman Hard Clustering aufgrund der mehrmaligen Iterationen über die Komponenten einer Mischverteilung eine wesentlich höhere Latenz als das Erwartungsmaximierungsverfahren aufweist. Zudem sind die resultierenden stochastischen Modelle im Sinne des Akaike-Informationskriteriums in den meisten Fällen schlechter als die angenäherten Modelle des Erwartungsmaximierungsverfahrens. Aus Sicht der Latenzoptimierung und angesichts der Qualität der bestimmten Modelle sollte daher das Erwartungsmaximierungsverfahren bei der Datenstromverarbeitung bevorzugt werden. Einzige Ausnahme sind Anwendungen, in denen nur geringe Mengen an Daten zur Verfügung stehen. Hier konnte die Kombination aus Kerndichteschätzung und Bregman Hard Clustering die besseren stochastischen Modelle bestimmen. Eine Evaluation auf Basis von Sensoraufzeichnungen von Ultrabreitband-Lokalisierungssensoren bestätigte die Resultate aus der Evaluation mit synthetischen Daten.

Danksagung

Die Autoren möchten Herrn Prof. Huibiao Zhu von der East China Normal University für seine Unterstützung danken. Diese Arbeit wurde durch die Deutsche Forschungsgemeinschaft im Rahmen des Graduiertenkollegs (DFG GRK 1765) SCARE (www.scare.uni-oldenburg.de) gefördert.

Literatur

[AGG+12] H.-J. Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: a highly customizable framework for creating efficient event stream management systems. In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, DEBS '12, Seiten 367–368, New York, NY, USA, 2012. ACM Press.
[BMDG05] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon und Joydeep Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[CHM12] Yuan Cao, Haibo He und Hong Man. SOMKE: Kernel density estimation over data streams by sequences of self-organizing maps. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1254–1268, 2012.
[DLR77] Arthur P. Dempster, Nan M. Laird und Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.
[JM07] T. S. Jayram und S. Muthukrishnan. Estimating statistical aggregates on probabilistic data streams. In ACM Symposium on Principles of Database Systems, Seiten 243–252, New York, NY, USA, 2007. ACM Press.
[KD09] Bhargav Kanagal und Amol Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In International Conference on Data Engineering, 2009.
[Krä07] Jürgen Krämer. Continuous Queries over Data Streams – Semantics and Implementation. Dissertation, Philipps-Universität Marburg, 2007.
[Sco92] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. 1992.
[TPD+12] Thanh T. L. Tran, Liping Peng, Yanlei Diao, Andrew McGregor und Anna Liu. CLARO: modeling and processing uncertain data streams. The VLDB Journal, 21(5):651–676, Oktober 2012.
[WJKvC12] Thorsten Wehs, Manuel Janssen, Carsten Koch und Gerd von Cölln. System architecture for data communication and localization under harsh environmental conditions in maritime automation. In Proceedings of the 10th IEEE International Conference on Industrial Informatics (INDIN), Seiten 1252–1257, Los Alamitos, CA, USA, 2012. IEEE Computer Society.
[ZCWQ03] Aoying Zhou, Zhiyuan Cai, Li Wei und Weining Qian. M-kernel merging: Towards density estimation over data streams. In 8th International Conference on Database Systems for Advanced Applications, Seiten 285–292. IEEE, 2003.

Kontinuierliche Evaluation von kollaborativen Recommender-Systemen in Datenstrommanagementsystemen – Extended Abstract –

Cornelius A. Ludmann, Marco Grawunder, Timo Michelsen, H.-Jürgen Appelrath
University of Oldenburg, Department of Computer Science
Escherweg 2, 26121 Oldenburg, Germany
{cornelius.ludmann, marco.grawunder, timo.michelsen, appelrath}@uni-oldenburg.de

Recommender-Systeme (RecSys) findet man in vielen Informationssystemen. Das Ziel eines RecSys ist es, das Interesse eines Benutzers an bestimmten Objekten (engl. item) vorherzusagen, um aus einer großen Menge an Objekten diejenigen dem Benutzer zu empfehlen, für die das vorhergesagte Interesse des Benutzers am größten ist. Die zu empfehlenden Objekte können zum Beispiel Produkte, Filme/Videos, Musikstücke, Dokumente, Points of Interest etc. sein. Das Interesse eines Benutzers an einem Objekt wird durch eine Bewertung (engl. rating) quantifiziert. Die Bewertung kann explizit durch den Benutzer angegeben (der Benutzer wird dazu aufgefordert, ein bestimmtes Objekt zu bewerten) oder implizit vom Verhalten des Benutzers abgeleitet werden (im einfachsten Fall durch eine binäre Bewertung: Objekt genutzt vs. nicht genutzt). Mit Methoden des maschinellen Lernens wird aus bekannten Bewertungen ein Modell trainiert, das unbekannte Bewertungen vorhersagen kann. Für die Bestimmung der Empfehlungsmenge werden die Bewertungen aller unbewerteten Objekte für einen Benutzer vorhergesagt und die bestbewerteten Objekte empfohlen. Im realen Einsatz eines RecSys entstehen kontinuierlich neue Bewertungen, die bei der Integration in das Modell die Vorhersagen für alle Benutzer verbessern können.
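Wie ein solches Modell bei kontinuierlich eintreffenden Bewertungen fortgeschrieben werden könnte, deutet die folgende, stark vereinfachte Skizze an (Annahmen: Matrixfaktorisierung mit stochastischem Gradientenabstieg; Faktordimension, Lernrate und Regularisierung sind frei gewählte Beispielwerte und nicht Teil des im Folgenden beschriebenen Aufbaus).

import numpy as np
from collections import defaultdict

RANK, LR, REG = 10, 0.01, 0.02                            # Beispielwerte (Annahmen)
rng = np.random.default_rng(0)
user_f = defaultdict(lambda: rng.normal(0, 0.1, RANK))    # latente Benutzerfaktoren
item_f = defaultdict(lambda: rng.normal(0, 0.1, RANK))    # latente Objektfaktoren

def predict(u, i):
    # Vorhergesagte Bewertung als Skalarprodukt der latenten Faktoren
    return float(user_f[u] @ item_f[i])

def update(u, i, r):
    # Ein SGD-Schritt pro eintreffendem Bewertungstupel (u, i, r)
    err = r - predict(u, i)
    p, q = user_f[u], item_f[i]
    user_f[u] = p + LR * (err * q - REG * p)
    item_f[i] = q + LR * (err * p - REG * q)

# Ein Strom aus Bewertungstupeln aktualisiert das Modell fortlaufend.
for u, i, r in [("u1", "film_a", 4.0), ("u1", "film_b", 2.0), ("u2", "film_a", 5.0)]:
    update(u, i, r)
print(predict("u2", "film_b"))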
Betrachtet man die Bewertungs- und Kontextdaten nicht als statische Lerndaten, sondern als kontinuierlich auftretende, zeitannotierte Events eines potenziell unendlichen Datenstroms, entspricht das eher der Situation, in der ein RecSys produktiv eingesetzt wird. Zur Umsetzung eines RecSys schlagen wir die Erweiterung eines Datenstrommanagementsystems (DSMS) vor, welches mit Hilfe von Datenstrom-Operatoren und kontinuierlichen Abfrageplänen Datenströme mit Lerndaten zum kontinuierlichen Lernen eines Modells nutzt. Die Anwendung des Modells zur Bestimmung der Empfehlungsmenge wird ebenso durch Events eines Datenstroms ausgelöst (zum Beispiel durch die Anzeige in einer Benutzerapplikation).

Integriert man ein RecSys in ein DSMS, so stellt sich die Frage, wie verschiedene Ansätze bzw. Implementierungen evaluiert und verglichen werden können. Bei der Evaluation werden die Daten i. d. R. in Lern- und Testdaten aufgeteilt. Die Evaluation von RecSys, bei denen der zeitliche Zusammenhang der Daten eine Rolle spielt, hat besondere Anforderungen: Die zeitliche Reihenfolge der Lerndaten muss erhalten bleiben und es dürfen keine Testdaten zur Evaluation genutzt werden, die zeitlich vor den genutzten Lerndaten liegen. Um den zeitlichen Verlauf zu berücksichtigen, wird ein Datenstrommodell genutzt, welches die Nutzdaten mit Zeitstempeln annotiert.

Zur Evaluierung eines DSMS-basierten RecSys schlagen wir den in Abbildung 1 dargestellten Aufbau vor. Als Eingabe erhält das DSMS als Bewertungsdaten die Tupel (u, i, r)_t mit der Bewertung r des Benutzers u für das Objekt i zum Zeitpunkt t sowie Anfragen für Empfehlungen für den Benutzer u zum Zeitpunkt t. Der Evaluationsaufbau gliedert sich grob in drei Teile: Continuous Learning nutzt die Lerndaten zum maschinellen Lernen des RecSys-Modells. Continuous Recommending wendet zu jeder Empfehlungsanfrage das Modell an, um für den entsprechenden Benutzer eine Empfehlungsmenge auszugeben. Continuous Evaluation teilt die Bewertungsdaten in Lern- und Testdaten auf und nutzt das gelernte Modell zur Evaluation. Dazu wird die Bewertung für ein Testtupel vorhergesagt und die Vorhersage mit der wahren Bewertung verglichen.

Abbildung 1: Aufbau eines RecSys mit kontinuierlicher Evaluation (Teilbereiche Continuous Learning, Continuous Recommending und Continuous Evaluating mit Operatoren wie route, window, train_recsys_model, get_unrated_items, predict_rating, test_prediction und recommend sowie einer Feedback-Schleife)

Dieser Aufbau wurde mit dem DSMS Odysseus [AGG+12] prototypisch umgesetzt und die Evaluation mit dem MovieLens-Datensatz (http://grouplens.org/datasets/movielens/) durchgeführt. Als nächste Schritte sollen weitere Evaluationsmethoden mit diesem Aufbau umgesetzt sowie Algorithmen zum maschinellen Lernen für RecSys für den Einsatz in einem DSMS optimiert werden.

Literatur

[AGG+12] H.-Jürgen Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: A Highly Customizable Framework for Creating Efficient Event Stream Management Systems. In DEBS'12, Seiten 367–368. ACM, 2012.

Using Data-Stream and Complex-Event Processing to Identify Activities of Bats
Extended Abstract

Sebastian Herbst, Johannes Tenschert, Klaus Meyer-Wegener
Data Management Group, FAU Erlangen-Nürnberg, Erlangen, Germany
E-Mail <firstname>.<lastname>@fau.de

1 Background and Motivation

Traditional tracking of bats uses telemetry [ADMW09]. This is very laborious for the biologists: At least two of them must run through the forest to get a good triangulation.
And, only one bat at a time can be tracked with this method. The Collaborative Research Center 1508 of the German Science Foundation (DFG) has been established to develop more sophisticated sensor nodes for the bats to carry (the sensor nodes are glued to the neck of the bats and fall off after two to four weeks). These mobile nodes must not be heavier than the telemetry senders used so far, but they offer much more processing capacity in addition to sending a beacon with an ID. Ground nodes receive the signals transmitted by the mobile nodes. They are also sensor nodes that run on batteries. Currently, all their detections are forwarded to a central base station, where a localization method integrates them into a position estimation for each bat [NKD+15]. The base station is a standard computer with sufficient power and energy. In the future, some parts of the localization may already be done on the ground nodes to reduce the data transmission and thus save some energy.

Output of the localization method is a position stream. Each element of this stream contains a timestamp, a bat ID, and x and y coordinates. While the biologists would like to have a z coordinate as well, it cannot be provided at the moment because of technological restrictions. Plans are to include it in the future. While filtering and smoothing have already been done, the precision of the coordinate values depends on the localization method used. It may only be in the order of meters to tens of meters, or (with much more effort) in the range of decimeters. Furthermore, some positions may be missing in the stream. A bat may be temporarily out of the range of the ground nodes, or its mobile node may be switched off to save energy. Hence, the subsequent processing must be robust with respect to imprecise position values.

2 Goals and Challenges

The goal of this work is to investigate the use of data-stream processing (DSP) and complex-event processing (CEP) to extract information meaningful for biologists from the position stream described. Biologists are interested in patterns of bat behavior, which are known to some extent, but have not been observed over longer periods of time, and have not been correlated in time for a group of bats. The elementary parts of the patterns can be expressed in terms of flight trajectories, which are a special case of semantic trajectories [PSR+13]. Biologists, however, are more interested in bat activities indicated by sequences of trajectories. Our idea is to identify these activities with the near-real-time processing that a combination of DSP and CEP can provide. This gives information on current activities earlier to the biologists, so they could go out and check by themselves what is happening. Also, they can experiment with the bats by providing extra food or emitting sounds. Furthermore, the mobile nodes on the bats can be configured to some extent, as can be the ground nodes. So when a particular behavior is reported to the biologists, they can adjust the localization method, e. g. switch to higher precision, even if that costs more energy.

In order to reach that goal, a first set of DSP queries and CEP rules has been defined. Activities are managed as objects for each bat. A change of activity is triggered by events and is recorded as an update of this object. Current activities as well as the previous activities of each bat are displayed for the biologists.
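As an illustration of the kind of rule involved, the following sketch derives a coarse activity from consecutive position-stream elements via a simple speed threshold. The threshold value, the activity labels, and the stream representation are simplifying assumptions for illustration only; they are not the actual DSP queries or CEP rules of the project.

from dataclasses import dataclass

SPEED_THRESHOLD = 2.0          # m/s, assumed example value

@dataclass
class Position:
    ts: float                  # timestamp in seconds
    bat_id: str
    x: float                   # coordinates in metres
    y: float

last_seen: dict = {}           # bat id -> last Position
activity: dict = {}            # bat id -> current activity label

def on_position(p: Position) -> None:
    # Update the per-bat activity object when a new position element arrives.
    prev = last_seen.get(p.bat_id)
    last_seen[p.bat_id] = p
    if prev is None or p.ts <= prev.ts:
        return                 # first observation or out-of-order element
    speed = ((p.x - prev.x) ** 2 + (p.y - prev.y) ** 2) ** 0.5 / (p.ts - prev.ts)
    new_act = "commuting" if speed > SPEED_THRESHOLD else "foraging"
    if activity.get(p.bat_id) != new_act:
        activity[p.bat_id] = new_act          # activity change triggered by the event
        print(p.ts, p.bat_id, "->", new_act)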
Ongoing work evaluates the activity detection by comparing the output with the known activities of bats in a simulation tool. It has already helped to adjust the substantial number of parameter values used in the DSP queries and CEP rules.

Acknowledgments. This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) under the grant of FOR 1508 for sub-project no. 3.

References

[ADMW09] Sybill K. Amelon, David C. Dalton, Joshua J. Millspaugh, and Sandy A. Wolf. Radiotelemetry; techniques and analysis. In Thomas H. Kunz and Stuart Parsons, editors, Ecological and behavioral methods for the study of bats, pages 57–77. Johns Hopkins University Press, Baltimore, 2009.
[NKD+15] Thorsten Nowak, Alexander Koelpin, Falko Dressler, Markus Hartmann, Lucila Patino, and Joern Thielecke. Combined Localization and Data Transmission in Energy-Constrained Wireless Sensor Networks. In Wireless Sensors and Sensor Networks (WiSNet), 2015 IEEE Topical Conference on, Jan 2015.
[PSR+13] Christine Parent, Stefano Spaccapietra, Chiara Renso, Gennady Andrienko, Natalia Andrienko, Vania Bogorny, Maria Luisa Damiani, Aris Gkoulalas-Divanis, Jose Macedo, Nikos Pelekis, Yannis Theodoridis, and Zhixian Yan. Semantic Trajectories Modeling and Analysis. ACM Computing Surveys, 45(4), August 2013. Article 42.

Streaming Analysis of Information Diffusion
Extended Abstract

Peter M. Fischer, Io Taxidou
Univ. of Freiburg, CS Department, 79110 Freiburg, Germany
{peter.fischer,taxidou}@informatik.uni-freiburg.de

1 Background and Motivation

Modern social media like Twitter or Facebook encompass a significant and growing share of the population, which actively uses them to create, share and exchange messages. This has a particularly profound effect on the way news and events are spreading. Given the relevance of social media both as a sensor of the real world (e.g., news detection) and its impact on the real world (e.g., shitstorms), there has been significant work on fast, scalable and thorough analyses, with a special emphasis on trend detection, event detection and sentiment analysis. To understand the relevance and trustworthiness of social media messages, deeper insights into Information Diffusion are needed: where and by whom a particular piece of information has been created, how it has been propagated and whom it may have influenced. Information diffusion has been a very active field of research, as recently described in a SIGMOD Record survey [GHFZ13]. The focus has been on developing models of information diffusion and targeted empirical studies. Given the complexity of most of these models, nearly all of the investigations have been performed on relatively small data sets, offline and in ad-hoc setups. Despite all this work, there is little effort to tackle the problem of real-time evaluation of information diffusion, which is needed to assess the relevance. These analyses need to deal with Volume and Velocity on both messages and social graphs. The combination of message streams and social graphs is scarcely investigated, while incomplete data and complex models make reliable results hard to achieve. Existing systems do not handle the challenges: Graph computation systems neither address the fast change rates nor the real-time interaction between the graph and other processing, while data stream systems fall short on the combination of streams and complex graphs.
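A minimal sketch of the stream-graph correlation referred to above: each incoming retweet is attributed to the most recent earlier poster of the same message whom the retweeting user follows. The stream format and the tiny in-memory follower sets are simplifying assumptions; this is not the reconstruction algorithm of [TF14].

from collections import defaultdict

followees_of = {                       # assumed toy social-graph fragment: user -> followees
    "bob": {"alice"},
    "carol": {"alice", "bob"},
}
recent_posters = defaultdict(list)     # message id -> [(timestamp, user), ...]

def on_message(ts, user, msg_id, is_retweet):
    # Correlate one stream element with the graph to estimate an influence edge.
    influencer = None
    if is_retweet:
        followees = followees_of.get(user, set())
        for prev_ts, prev_user in reversed(recent_posters[msg_id]):
            if prev_ts < ts and prev_user in followees:
                influencer = prev_user        # most recent earlier poster the user follows
                break
    recent_posters[msg_id].append((ts, user))
    return influencer

print(on_message(1.0, "alice", "m1", False))  # original post -> None
print(on_message(2.0, "bob", "m1", True))     # attributed to alice
print(on_message(3.0, "carol", "m1", True))   # attributed to bob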
2 Goals and Challenges

The goal of our research is to develop algorithms and systems to trace the spreading of information in social media that produce large-scale, rapid data. We identified three crucial building blocks for such a real-time tracing system:

1) Algorithms and systems to perform the tracing and influence assignment, in order to deliver the paths along which information most likely propagated
2) Classification of user roles, to provide support for assessing their impact on the information diffusion process
3) Predictions on the spreading rate, in order to allow estimations of information diffusion lifetime and of how representative evaluations on the current state will be.

Our first task is to design, implement and evaluate algorithms and systems that can trace information spreading and assign influence at global scale while producing the results in real-time, matching the volumes and rates of social media. This requires a correlation between the message stream and the social graph. While we already showed that real-time reconstruction of retweets is feasible when the social graph fragment is locally accessible [TF14], real-life social graphs contain hundreds of millions of users, which requires distributed storage and operation. Our approach keeps track of the (past) interactions and drives the partitioning on the communities that exist in this interaction graph. Additionally, since the information available for reconstruction is incomplete, either from lack of social graph information or from API limitations, we aim to develop and evaluate methods that infer missing path information in a low-overhead manner. In contrast to existing, model-based approaches, we rely on a lightweight, neighborhood-based approach.

Access to diffusion paths enables a broad range of analyses of the information cascades. Given this broad range, we are specifically focusing on features that provide the baselines for supporting relevance and trustworthiness, namely the identification of prominent user roles such as opinion leaders or bridges. An important aspect includes the interactions and connections among users that lead towards identifying prominent user roles. Our approach will rely on stream-aware clustering instead of fixed roles over limited data.

The process of information spreading varies significantly in speed and duration: most cascades end after a short period, others are quickly spreading for a short time, while yet other groups see multiple peaks of activity or stay active for longer periods of time. Understanding how long such a diffusion continues provides important insights on how relevant a piece of information is and how complete its observation is. Virality predictions from the start of a cascade are hard to achieve, while incremental, lightweight forecasts are more feasible. New observations can then be used to update and extend this forecast, incorporating temporal as well as structural features.

References

[GHFZ13] Adrien Guille, Hakim Hacid, Cécile Favre, and Djamel A. Zighed. Information diffusion in online social networks: a survey. SIGMOD Record, 42(2):17–28, 2013.
[TF14] Io Taxidou and Peter M. Fischer. Online Analysis of Information Diffusion in Twitter. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '14 Companion, 2014.
Towards a Framework for Sensor-based Research and Development Platform for Critical, Socio-technical Systems

Henrik Surm, Daniela Nicklas
Faculty of Information Systems and Applied Computer Sciences, University of Bamberg, Bamberg, [email protected]
Department for Computing Science, University Oldenburg, Oldenburg, [email protected]

1 Motivation

The complexity of critical systems, in which failures can either endanger human life or cause drastic economic losses, has dramatically increased over the last decades. More and more critical systems evolve into so-called "socio-technical systems": humans are integrated by providing and assessing information and by making decisions in otherwise semi-autonomous systems. Such systems depend heavily on situational awareness, which is obtained by processing data from multiple, heterogeneous sensors: raw sensor data is cleansed and filtered to obtain features or events, which are combined, enriched and interpreted to reach higher semantic levels, relevant for system decisions. Existing systems which use sensor data fusion often require an a-priori configuration which cannot be changed while the system is running, or require a human intervention to adapt to changed sensor sources, and provide no run-time extensibility for new sensor types [HT08]. In addition, analysis of data quality or query plan reliability is often not possible, and management of recorded data frequently is done by hand. Our goal is to support the research, development, evaluation, and demonstration of such systems. Thus, in the proposed talk we analyze requirements and challenges for the data management of sensor-based research environments, and we propose a data-stream-based architecture which fulfills these requirements.

2 Challenges and Requirements

To support the research, development, test, and demonstration of sensor-based, critical applications, we plan to address the following challenges and requirements:

Changing sensor configurations: during research and development, the sensor configuration might change often. In addition, when sensor data is delivered from moving objects, the available sensor sources may change during runtime.
Information quality: since sensors do not deliver exact data, quality aspects need to be considered at all levels of processing.
Validation of data management: since the sensor data processing is a vital part of the system, it should be validatable and deterministic.
Reproducibility of experiments and tests, intelligent archiving: in the process of research, development and test, application sensor data and additional data for ground truth (e.g., video streams) need to be archived and replayable.
Integration with simulation environments: before such systems are deployed in the real world, they need to be modeled and analyzed in various simulation tools. The transition from pure simulation to pure real-world execution should be easy.

In the project, we plan to address these challenges by a modular, extensible and comprehensive framework. While these challenges are similar to other data analysis and integration scenarios, they have to be addressed under the specific constraints of limited maritime communication channels and data standards. The main contribution will be a combined solution that addresses these challenges in a unified framework. This is why we will base our work on the Odysseus [AGGM12] framework: it has a clear formal foundation based on [K07] and is designed for extensibility, using the OSGi service platform.
Odysseus offers bundles for data integration, mining, storage and data quality representation, and allows installation and updates of bundles without restarting the system, which leads to a high flexibility and run-time adaptability. The framework will offer a unified sensor access and data fusion approach which allows a flexible use of the platform for the transition from simulation to real-world applications, reducing the expenditure of time for research and development.

3 Outlook

The framework will be used in two scenarios: a research port for the analysis of sensor-based support for port navigation, and driver state observation for cooperative, semi-autonomous driving applications. Further, we plan to explore that approach in other projects, covering cooperative e-navigation and smart city applications.

References

[AGGM12] APPELRATH, H.-JÜRGEN ; GEESEN, DENNIS ; GRAWUNDER, MARCO ; MICHELSEN, TIMO ; NICKLAS, DANIELA: Odysseus: a highly customizable framework for creating efficient event stream management systems. In: Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, DEBS '12. New York, NY, USA : ACM, 2012 — ISBN 978-1-4503-1315-5, S. 367–368
[HT08] HE, YINGJIE ; TULLY, ALAN: Query processing for mobile wireless sensor networks: State-of-the-art and research challenges. In: Third International Symposium on Wireless Pervasive Computing (ISWPC '08), S. 518–523. IEEE Computer Society 2008 — ISBN 978-1-4244-1652-3
[K07] KRÄMER, JÜRGEN: Continuous Queries over Data Streams – Semantics and Implementation. Dissertation, Universität Marburg, 2007

Dataflow Programming for Big Engineering Data
– extended abstract –

Felix Beier, Kai-Uwe Sattler, Christoph Dinh, Daniel Baumgarten
Technische Universität Ilmenau, Germany
{first.last}@tu-ilmenau.de

Nowadays, advanced sensing technologies are used in many scientific and engineering disciplines, e. g., in medical or industrial applications, enabling the usage of data-driven techniques to derive models. Measurements are collected, filtered, aggregated, and processed in a complex analytic pipeline, joining them with static models to perform high-level tasks like machine learning. Final results are usually visualized for gaining insights directly from the data, which in turn can be used to adapt the processes and their analyses iteratively to refine knowledge further. This task is supported by tools like R or MATLAB, which allow analytic pipelines to be developed quickly. However, they offer limited capabilities for processing very large data sets that require data management and processing in distributed environments – tasks that have been analyzed extensively in the context of database and data stream management systems. Although the latter provide very good abstraction layers for data storage, processing, and underlying hardware, they require a complex setup, provide only limited extensibility, and hence are hardly used in scientific or engineering applications [ABB+12]. As a consequence, many tools are developed, comprising optimized algorithms for specialized tasks, but burdening developers with the implementation of low-level data management code, usually in a language that is not common in their community. In this context, we analyzed the source localization problem for EEG/MEG signals (which can be used, e. g., to develop therapies for stroke patients) in order to develop an approach for bridging this gap between engineering applications and large-scale data management systems.
The source localization problem is challenging, since the problem is ill-posed and the signal-to-noise ratio (SNR) is very low. Another challenging problem is the computational complexity of inverse algorithms. While large data volumes (brain models and high sampling rates) need to be processed, low latency constraints must be met because interactions with the probands are necessary. The analytic processing chain is illustrated in Fig. 1.

Figure 1: Overview Source Localization Processing Chain

The Recursively Applied and Projected Multiple Signal Classification (RAP-MUSIC) algorithm is used for locating neural sources, i. e., activity inside a brain corresponding to a specific input. To this end, 366 MEG/EEG sensors are placed above the head which are continuously sampled at rates of 600–1250 Hz. The forward solution of the boundary element model (BEM) of the brain at uniformly distributed locations on the white matter surface is passed as second input. It is constructed once from a magnetic resonance imaging (MRI) scan and, depending on the requested accuracy, comprises tens of thousands of vertices representing different locations on the surface. RAP-MUSIC recursively identifies active neural regions with a complex pipeline for preprocessing signal measures and correlating them with the BEM. To meet the latency constraints, the RAP-MUSIC algorithm has been parallelized for GPUs, and a C++ library has been created in an analysis tool called MNE-CPP [DLS+13], including parsers for data formats used by vendors of medical sensing equipment, signal filter operators, transformation routines, etc. Although this library can be used to create larger analysis pipelines, implementing and evaluating new algorithms still requires a lot of low-level boilerplate code to be written, leading to significant development overheads. The latter can be avoided by using domain-specific languages which are specialized for signal processing, natively working on vectors or matrices as first-class data types, like MATLAB. But they offer less control over memory management and parallelization for custom algorithms, which are crucial for meeting the latency constraints. When large-scale data sets are to be processed, even a specialized tool quickly runs into performance problems, as distributed processing is mostly not supported because of large development overheads for cluster-scale algorithms.

To handle these problems, we propose to apply dataflow programming here. We implemented a multi-layered framework which allows analytic programs to be defined by an abstract flow of data, independent from its actual execution. This enables quick prototyping while letting the framework handle data management and parallelization issues. Similar to Pig for batch-oriented MapReduce jobs, a scripting language for stream-oriented processing, called PipeFlow, is provided as front-end. In the current version, PipeFlow allows primitives to be injected for partitioning dataflows, executing sub-flows in parallel on cluster nodes leveraging multi-core CPUs, and merging partial results. We plan to automatically parallelize flows in the future using static code analysis and a rule-based framework for exploiting domain-specific knowledge about data processing operators. The dataflow programs are optimized by applying graph rewriting rules, and code for an underlying execution backend is generated.
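The partition/parallel/merge primitives mentioned above can be pictured with a small, generic sketch in plain Python. This is deliberately not PipeFlow syntax; the operator functions and the toy sensor matrix are illustrative assumptions only.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def partition(samples, n_parts):
    # Split an incoming block of sensor samples into disjoint partitions.
    return np.array_split(samples, n_parts)

def sub_flow(block):
    # Per-partition sub-flow: naive baseline correction plus a partial aggregate.
    centered = block - block.mean(axis=0)
    return centered.T @ centered

def merge(partials):
    # Merge the partial results produced by the parallel sub-flows.
    return sum(partials)

if __name__ == "__main__":
    samples = np.random.default_rng(0).normal(size=(1200, 8))   # toy sensor matrix
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(sub_flow, partition(samples, 4)))
    print(merge(partials).shape)                                 # combined (8, 8) result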
As execution backend, the framework provides an engine called PipeFabric which offers a large C++ library of operator implementations with a focus on low-latency processing. One key aspect of PipeFabric is its extensibility for complex user-defined types and operations. Simple wrappers are sufficient to embed already existing domain-specific libraries. For our use case, processing of large matrices is required. Therefore, we used the Eigen library and are currently porting functions from MNE-CPP. We are also working on code generators for other back-ends like Spark, which will be useful for comparing capabilities of different frameworks for common analytic workloads – which, to the best of our knowledge, has not been done yet.

References

[ABB+12] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: efficient query execution on raw data files. In ACM SIGMOD, 2012.
[DLS+13] C. Dinh, M. Luessi, L. Sun, J. Haueisen, and M. S. Hämäläinen. MNE-X: MEG/EEG Real-Time Acquisition, Real-Time Processing, and Real-Time Source Localization Framework. Biomedical Engineering/Biomedizinische Technik, 2013.

Joint Workshop on Data Management for Science

Sebastian Dorok (1,5), Birgitta König-Ries (2), Matthias Lange (3), Erhard Rahm (4), Gunter Saake (5), Bernhard Seeger (6)
1 Bayer Pharma AG
2 Friedrich Schiller University Jena
3 Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben
4 University of Leipzig
5 Otto von Guericke University Magdeburg
6 Philipps University Marburg

Message from the chairs

The Workshop on Data Management for Science (DMS) is a joint workshop consisting of the two workshops Data Management for Life Sciences (DMforLS) and Big Data in Science (BigDS). BigDS focuses on addressing big data challenges in various scientific disciplines. In this context, DMforLS focuses especially on life sciences. In the following, we give short excerpts of the call for papers of both workshops:

Data Management for Life Sciences
In life sciences, scientists collect an increasing amount of data that must be stored, integrated, processed, and analyzed efficiently to make effective use of them. Thereby, not only the huge volume of available data raises challenges regarding storage space and analysis throughput, but also data quality issues, incomplete semantic annotation, long-term preservation, data access, and compliance issues, such as data provenance, make it hard to handle life science data. To address these challenges, advanced data management techniques and standards are required. Otherwise, the use of life science data will be limited. Thereby, one question is whether general-purpose techniques and methods for data management are suitable for life science use cases or whether specialized solutions tailored to life science applications must be developed.

Big Data in Science
The volume and diversity of available data has dramatically increased in almost all scientific disciplines over the last decade, e.g. in meteorology, genomics, complex physics simulations and biological and environmental research. This development is due to great advances in data acquisition (e.g. improvements in remote sensing) and data accessibility. On the one hand, the availability of such data masses leads to a rethinking in scientific disciplines on how to extract useful information and on how to foster research. On the other hand, researchers feel lost in the data masses because appropriate data management tools have not been available so far.
However, this is starting to change with the recent development of big data technologies that seem to be not only useful in business, but also offer great opportunities in science. 105 The joint workshop DMS brings together database researchers with scientists from various disciplines especially life sciences to discuss current findings, challenges, and opportunities of applying data management techniques and methods in data-intensive sciences. The joint workshop is held for the first time in conjunction with the 16th Conference on Database Systems, Technology, and Web (BTW 2015) at the University of Hamburg on March 03, 2015. The contributions were reviewed by three to four members of the respective program committee. Based on the reviews, we selected eight contributions for presentation at the joint workshop. We assigned each contribution to one of three different sessions covering different main topics. The first session comprises contributions related to information retrieval. The contribution Ontology-based retrieval of scientific data in LIFE by Uciteli and Kirsten presents an approach that utilizes ontologies to facilitate query formulation. Colmsee et al. make also use of ontologies, but use them for improving search results. In Improving search results in life science by recommendations based on semantic information, they describe and evaluate their approach that uses document similarities based on semantic information. To improve performance of sampling analyses using MapReduce, Schäfer et al. present an incremental approach. In Sampling with incremental MapReduce, the authors describe a way to limit data processing to updated data. In the next session, we consolidate contributions dealing with data provenance. In his position paper METIS in PArADISE, Heuer examines the importance of data provenance in the evaluation of sensor data, especially in assistance systems. In their contribution Extracting reproducible simulation studies from model repositories using the COMBINE archive toolkit, Scharm and Waltemath deal with reproducible simulation studies. The last session covers the topic data analysis. In Genome sequence analysis with MonetDB: a case study on Ebola virus diversity, Cijvat et al. present a case study on genome analysis using a relational main-memory database system as platform. In RightInsight: Open source architecture for data science, Bulut presents an approach based on Apache Spark to conduct general data analyses. In contrast, Authmann et al. focus on challenges in spatial applications and suggest an architecture to address them in their paper Rethinking spatial processing in data-intensive science. We are deeply grateful to everyone who made this workshop possible – the authors, the reviewers, the BTW team, and all participants. 
Program chairs Data Management for Life Sciences Gunter Saake (Otto von Guericke University Magdeburg) Uwe Scholz (IPK Gatersleben) Big Data in Science Birgitta König-Ries (Friedrich Schiller University Jena) Erhard Rahm (University of Leipzig) Bernhard Seeger (Philipps University Marburg) 106 Program committee Data Management for Life Sciences Sebastian Breß (TU Dortmund) Sebastian Dorok (Otto von Guericke University Magdeburg) Mourad Elloumi (University of Tunis El Manar, Tunisia) Ralf Hofestädt (Bielefeld University) Andreas Keller (Saarland University, University Hospital) Jacob Köhler (DOW AgroSciences, USA) Matthias Lange (IPK Gatersleben) Horstfried Läpple (Bayer HealthCare AG) Ulf Leser (Humboldt-Universität zu Berlin) Wolfgang Müller (HITS GmbH) Erhard Rahm (University of Leipzig) Can Türker (ETH Zürich, Switzerland) Big Data in Science Alsayed Algergawy (Friedrich Schiller University Jena) Peter Baumann (Jacobs Universität) Matthias Bräger (CERN) Thomas Brinkhoff (FH Oldenburg) Michael Diepenbroeck (Alfred-Wegner-Institut) Christoph Freytag (Humboldt Universität) Michael Gertz (Uni Heidelberg) Frank-Oliver Glöckner (MPI für Marine Mikrobiologie) Anton Güntsch (Botanischer Garten und Botanisches Museum, Berlin-Dahlem) Thomas Heinis (Imperial College, London) Thomas Hickler (Senckenberg) Jens Kattge (MPI für Biogeochemie) Alfons Kemper (TU München) Meike Klettke (Uni Rostock) Alex Markowetz (Uni Bonn) Thomas Nauss (Uni Marburg) Jens Nieschulze (Forschungsreferat für Datenmanagement der Uni Göttingen) Kai-Uwe Sattler (TU Ilmenau) Stefanie Scherzinger (OTH Regensburg) Myra Spiliopoulou (Uni Magdeburg) Uta Störl (HS Darmstadt) 107 Ontology-based Retrieval of Scientific Data in LIFE Alexandr Uciteli1,2, Toralf Kirsten2,3 1 Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig 2 LIFE Research Centre for Civilization Diseases, University of Leipzig 3 Interdisciplinary Centre for Bioinformatics, University of Leipzig Abstract: LIFE is an epidemiological study determining thousands of Leipzig inhabitants with a wide spectrum of interviews, questionnaires, and medical investigations. The heterogeneous data are centrally integrated into a research database and are analyzed by specific analysis projects. To semantically describe the large set of data, we have developed an ontological framework. Applicants of analysis projects and other interested people can use the LIFE Investigation Ontology (LIO) as central part of the framework to get insights, which kind of data is collected in LIFE. Moreover, we use the framework to generate queries over the collected scientific data in order to retrieve data as requested by each analysis project. A query generator transforms the ontological specifications using LIO to database queries which are implemented as project-specific database views. Since the requested data is typically complex, a manual query specification would be very timeconsuming, error-prone, and is, therefore, unsuitable in this large project. We present the approach, overview LIO and show query formulation and transformation. Our approach runs in production mode for two years in LIFE. 1 Introduction Epidemiological projects study the distribution, the causes and the consequences of health-related states and events in defined populations. The goal of such projects is to identify risk factors of (selected) diseases in order to establish and to optimize a preventive healthcare. 
LIFE is an epidemiological and multi-cohort study in the described context at the Leipzig Research Centre for Civilization Diseases (Univ. of Leipzig). The goal of LIFE is to determine the prevalence and causes of common civilization diseases including adiposity, depression, and dementia by examining thousands of Leipzig (Germany) inhabitants of different ages. Participants include pregnant women and children from 0 to 18 years as well as adults in separate cohorts. All participants are examined in a program of possibly several days with a selection out of currently more than 700 assessments. The assessments range from interviews and self-completed questionnaires to physical examinations, such as anthropometry, EKG and MRT, and laboratory analyses of taken specimens. Data is acquired for each assessment depending on the participant's investigation program using specific input systems and prepared input forms. All collected data is integrated and, thus, harmonized in a central research database. This database consists of data tables referring to assessments (i.e., investigations). Their complexity ranges from tables with a small number of columns to tables with a very high column number. For example, the table referring to the Structured Clinical Interview (SCID) consists of more than 900 columns, i.e., questions and sub-questions of the interview input form.

The collected data are analyzed in an increasing number of analysis projects; currently, there are more than 170 projects active. Each project is initially specified by a proposal documenting the analysis goal, plan and the required data. However, there are two key aspects that are challenging. Firstly, the applicant needs to find assessments (research database tables) of interest to specify the requested data in the project proposal. This process can be very difficult and time-consuming, in particular, when the scientist is looking for specific data items (columns), such as weight and height, without knowing the corresponding assessment. Secondly, current project proposals typically request data from up to 50 assessments which are then organized in project-specific views (according to the data requests). These views can be very complex. Usually, they combine data from multiple research database tables, several selection expressions and a multitude of projected columns out of the data tables. A manual specification of database queries to create such views for each analysis project would be a very error-prone and time-consuming process and is, therefore, nearly impossible. Hence, we make the following contributions.

- We developed an ontological framework. The framework utilizes the LIFE Investigation Ontology (LIO) which classifies and describes assessments, relations between them, and their items.
- We implemented ontology-based tools using LIO to generate database queries which are stored as project-specific analysis views within the central research database. The views allow scientists and us to easily access and to export the requested data of an analysis project.

Both LIO and the ontology-based tools have been running in production mode for two years. The rest of the paper is organized as follows. Section 2 describes the ontological framework and especially LIO. Section 3 deals with ontology-based query formulation and transformation, while Section 4 describes some implementation aspects. Section 5 concludes the paper.
2 Framework

The goal of the ontological framework is to semantically describe all integrated data of biomedical investigations in LIFE using an ontology. The ontology is utilized on the one hand by scientists to search for data items or complete investigations of interest or simply to browse the ontology to get information about the captured data of the investigations. On the other hand, the ontology helps to query and retrieve data of the research database by formulating queries on a much higher level than SQL.

The ontological framework consists of three interrelated layers (Fig. 1). The integrated data layer comprises all data elements (instance data) of the central research database providing data of several source systems in an integrated, preprocessed and cleaned fashion. The metadata layer describes all instance data of the research database on a very technical level. To these metadata belong the used table and column names, corresponding data types, but also the original question or measurement text and the code list when a predefined answer set has been originally associated to the data item. This metadata is stored in a dedicated metadata repository (MDR) and is inherently interrelated with the instance data.

Figure 1: Framework overview
Figure 2: Selection of the LIFE Investigation Ontology

Finally, the ontology layer is represented by the developed LIFE Investigation Ontology (LIO) [KK10] and its mapping to the collected metadata in the MDR. LIO utilizes the General Formal Ontology (GFO) [He10] as a top-level ontology and, thus, reuses defined fundamental categories of GFO, such as Category, Presential and Process. Fig. 2a gives a high-level overview over LIO. Subcategories of GFO:Presential refer to collected scientific data, participants and specimens. Scientific data is structured on a technical level by categories within the sub-tree of LIO:Data. Instances of these categories are concrete data files and database tables, e.g., of the research database. They are used to locate instance data for later querying. Subcategories of GFO:Process refer to processes of two different types, data acquisition and material analysis processes. Instances of these categories are documented, e.g., by specific states and conditions of the examination, the examiner conducting the investigation etc. This process documentation is specified in addition to the scientific data that these processes possibly generate. The documentation can be used for downstream analysis of the scientific data, to evaluate the process quality and for an impact analysis on the measurement process. Finally, subcategories of GFO:Category are utilized to semantically classify biomedical investigations in LIFE. Fig. 2b shows an overview of the main categories. Fundamentally, we differentiate between items (e.g., questions of a questionnaire) and item sets. This separation allows us to ontologically distinguish between data tables (item set) of the research database and their columns (item). Moreover, we are able to classify both investigation forms, as predefined and rather static item sets, and project-specific analysis views, which potentially include items from multiple investigation forms. The latter can be dynamically defined and provided by a user group. All biomedical investigations together with their contained items (i.e., questions and measurements) are associated to LIO categories using the instance_of relationship type.
Fig. 3 shows an example: the interview socio-demography consists of several questions, including country of birth, marital status and graduation. Both the interview and its items are associated to LIO:Interview and LIO:Item, respectively. Special relationships of type has_item represent the internal structure of the interview. The semantic classification, i.e., the association to LIO subclasses of LIO:Metadata, is manually specified by the investigator in an operational software application from which it is imported into the overall metadata repository (MDR). Moreover, the MDR captures and manages the structure of each investigation form, and, thus, its items, and their representation in the central research database. By reusing both kinds of specifications in LIO, the mappings between collected metadata in the MDR and LIO categories are inherently generated. This makes it easy to describe and classify new investigations (assessments) in LIO; it necessitates only an initial manual semantic classification and import of corresponding metadata into the MDR.

Figure 3: Utilization of item set and item categories to describe investigations

3 Ontology-based Query Formulation and Transformation

Scientific data of the central research database are analyzed in specific analysis projects. Each of them is specified by a project proposal, i.e., the applicant describes the analysis goal, the analysis plan and the data she requests. In the simplest case, the data request consists of a list of assessments. This is extended, in some cases, by defined inclusion or exclusion criteria. In complex scenarios, the applicant is interested in specific items instead of all items of an assessment. In all scenarios, data can be queried per single assessment. However, it is common to request data in a joined fashion, especially when the applicant focuses on specific items from multiple assessments. Currently, each data request is satisfied by specific analysis views which are implemented as database views within the relational research database. We use LIO to formulate queries over the scientific data which are finally transformed and stored as project-specific analysis views in the research database. Fig. 4 sketches the query formulation and transformation process.

Firstly, LIO is used to formulate queries for each analysis project. The applicant can search for assessments of interest by browsing along LIO's structure and the associated instances, i.e., concrete assessments or items. She can select complete assessments as predefined item sets and specific items of an assessment. These selections are used by the query generator to create the query projection, i.e., the items for which data should be retrieved. These items are firstly sorted by the selected assessment in alphabetic order and, secondly, by their rank within each assessment, i.e., with respect to their position on the corresponding input form. Inclusion and exclusion criteria can be specified on item level. The query generator interrelates single conditions by the logical operator AND and creates the selection expression of the resulting query for each assessment. Per default, the query generator produces one query for each selected assessment or item set of an assessment. Moreover, experienced users can create new item sets containing items from multiple assessments. These item sets result in join queries using patient identifiers and examination time points (due to recurrent visits) as join criteria.
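To make the generation step concrete, the following sketch turns a small, ontology-level selection into a joined view definition. All table, column, and view names used here (PAT_ID, EXAM_TS, the assessment tables, the view name) are hypothetical placeholders and not the real research-database schema or the actual query generator.

def build_view(view_name, item_sets, criteria):
    # item_sets: {assessment table: [selected item columns]}
    # criteria:  list of SQL predicates, combined with AND (inclusion/exclusion criteria)
    # Tables are joined on patient identifier and examination time point.
    tables = sorted(item_sets)                       # assessments in alphabetic order
    cols = [f"{t}.{c}" for t in tables for c in item_sets[t]]
    base, joins = tables[0], []
    for t in tables[1:]:
        joins.append(f"JOIN {t} ON {t}.PAT_ID = {base}.PAT_ID "
                     f"AND {t}.EXAM_TS = {base}.EXAM_TS")
    where = f"\nWHERE {' AND '.join(criteria)}" if criteria else ""
    return (f"CREATE VIEW {view_name} AS\nSELECT {', '.join(cols)}\n"
            f"FROM {base}\n" + "\n".join(joins) + where)

# Hypothetical example: two assessments, one inclusion criterion.
print(build_view("ANALYSIS_VIEW_P42",
                 {"ANTHROPOMETRY": ["HEIGHT", "WEIGHT"],
                  "SOCIO_DEMOGRAPHY": ["GRADUATION"]},
                 ["ANTHROPOMETRY.WEIGHT IS NOT NULL"]))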
4 Implementation

LIO currently consists of 33 categories, more than 700 assessments and ca. 120 analysis results (the latter two are instances in LIO), together with more than 39,000 items in total. The large and increasing number of assessments, the items they contain and their correspondences (mappings) to database table and column metadata are stored in the MDR, which is implemented in a relational database system. Assessments and items are loaded on demand from the MDR and are associated to LIO categories as instances. Therefore, new assessments can easily be added to LIO without modifying LIO's core structure or changing ontology files. We implemented a Protégé [NFM00, Sc11] plug-in loading LIO and the corresponding instances from the MDR to support an applicant when she specifies the required data for a project proposal. She can navigate along LIO's structure and, hence, is able to find and pick the items of interest for her proposal. On the other hand, the plug-in allows us to formulate ontology-based queries and to transform them into SQL queries, which are then stored as project-specific analysis views over the scientific data of the research database. These views can be accessed in two different ways. Firstly, the views can be used for further database-internal data processing using the database API and SQL. This is the preferred way for persons with database skills. Secondly, the plug-in includes options to propagate views to a web-based reporting software which wraps the database views in tabular reports. These reports can be executed by an applicant. The retrieved data are then available for download to continue data processing with special analysis tools, such as SPSS, R etc.
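As a rough illustration of the on-demand loading described above, the sketch below reads assessments and items from hypothetical MDR tables and returns them as (instance, LIO category) pairs that a plug-in could attach to the ontology. The table and column names are invented and do not reflect the actual MDR schema or the Protégé API.

```python
# Illustrative sketch only: the MDR schema (tables "assessment", "item") and the
# column names are hypothetical; the real plug-in attaches the instances via the
# Protégé API instead of returning plain tuples.
import sqlite3

def load_lio_instances(mdr_path):
    """Read assessments and items from the MDR and pair them with LIO categories."""
    con = sqlite3.connect(mdr_path)
    instances = []
    # assessments carry their semantic classification, e.g. 'LIO:Interview'
    for name, category in con.execute("SELECT name, lio_category FROM assessment"):
        instances.append((name, category))
    # items of all assessments become instances of LIO:Item
    for (name,) in con.execute("SELECT name FROM item"):
        instances.append((name, "LIO:Item"))
    con.close()
    return instances
```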
There are other approaches which are highly related to our ontology-based framework. i2b2 [Mu10] is a framework for analyzing data in a clinical context. In contrast to our approach, it utilizes a separate data management and, thus, necessitates additional data load and transformation processes. Moreover, the goal of i2b2 is primarily to find relevant patients and not to retrieve scientific data. Like LIO, the Search Ontology [Uc14] is used to formulate queries over data. Its focus is on queries for search engines, while our approach focuses on structured data in a relational database. Similar to LIO, the Ontology of Biomedical Investigations (OBI) [Br10] classifies and describes biomedical investigations. In contrast to OBI, LIO utilizes a core structure which is dynamically extended by assessments fully described in a dedicated metadata repository. Hence, our framework is able to generate queries over the data of the research database without having to describe each investigation in detail using OBI.

5 Conclusion

We introduced an ontology-based framework to query large and heterogeneous sets of scientific data. The framework consists of the developed LIFE Investigation Ontology (LIO) on the top level, which semantically describes the scientific data of the central research database (base level). Both levels are interrelated by (technical) metadata which are managed in a metadata repository. On the one hand, LIO provides insight into which data are available within the research database; on the other hand, it is used to formulate queries over the collected scientific data. The ontology-based queries are transformed into database queries which are stored as analysis-specific database views. By default, the queries include items of a single assessment. Moreover, join queries merging items from multiple assessments are also supported. Together, ontology-based querying simplifies data querying for end users and frees IT staff from implementing rather complex SQL queries. In the future, we will extend LIO and the query generator to overcome current limitations, e.g., regarding the specification and transformation of query conditions.

Acknowledgment: This work was supported by the LIFE project. The research project is funded by financial means of the European Union and of the Free State of Saxony. LIFE is the largest scientific project of the Saxon excellence initiative.

References

[Br10] Brinkman, R. R. et al.: Modeling biomedical experimental processes with OBI. In Journal of Biomedical Semantics, 2010, 1 Suppl 1; pp. S7.

[He10] Herre, H.: General Formal Ontology (GFO): A Foundational Ontology for Conceptual Modelling. In (Poli, R.; Healy, M.; Kameas, A. Eds.): Theory and Applications of Ontology: Computer Applications. Springer Netherlands, Dordrecht, 2010; pp. 297–345.

[KK10] Kirsten, T.; Kiel, A.: Ontology-based Registration of Entities for Data Integration in Large Biomedical Research Projects. In (Fähnrich, K.-P.; Franczyk, B. Eds.): Proceedings of the annual meeting of the GI. Köllen Druck+Verlag GmbH, Bonn, 2010; pp. 711–720.

[Mu10] Murphy, S. N. et al.: Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). In Journal of the American Medical Informatics Association (JAMIA), 2010, 17; pp. 124–130.

[NFM00] Noy, N. F.; Fergerson, R. W.; Musen, M. A.: The Knowledge Model of Protégé-2000: Combining Interoperability and Flexibility. In (Goos, G. et al. Eds.): Knowledge Engineering and Knowledge Management: Methods, Models, and Tools. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000; pp. 17–32.

[Sc11] Schalkoff, R. J.: Protégé, OO-Based Ontologies, CLIPS, and COOL. In: Intelligent Systems: Principles, Paradigms, and Pragmatics. Jones and Bartlett Publishers, Sudbury, Mass., 2011; pp. 266–272.

[Uc14] Uciteli, A. et al.: Search Ontology, a new approach towards Semantic Search. In (Plödereder, E. et al. Eds.): FoRESEE: Future Search Engines 2014 - 44. annual meeting of the GI, Stuttgart - GI Edition Proceedings P-232. Köllen, Bonn, 2014; pp. 667–672.
Improving Search Results in Life Science by Recommendations based on Semantic Information

Christian Colmsee1, Jinbo Chen1, Kerstin Schneider2, Uwe Scholz1, Matthias Lange1

1 Department of Cytogenetics and Genome Analysis, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstr. 3, 06466 Stadt Seeland, Germany, {colmsee,chenj,scholz,lange}@ipk-gatersleben.de

2 Department Automation / Computer Science, Harz University of Applied Sciences, Friedrichstr. 57-59, 38855 Wernigerode, Germany, {kschneider}@hs-harz.de

Abstract: The management and handling of big data is a major challenge in the area of life science. Besides data storage, information retrieval methods have to be adapted to huge data amounts as well. Therefore, we present an approach to improve search results in life science by recommendations based on semantic information. In detail, we determine relationships between documents by searching for shared database IDs as well as ontology identifiers. We have established a pipeline based on Hadoop allowing a distributed computation over large amounts of textual data. A comparison with the widely used cosine similarity has been performed; its results are presented in this work as well.

1 Introduction

Nowadays, the management and handling of big data is a major challenge in the field of informatics. Larger datasets are produced in less time. In particular, this aspect is intensively discussed in life science. At the technology level, new concepts and algorithms have to be developed to enable a seamless processing of huge data amounts. In the area of data storage, new database concepts are implemented, such as column-based storage or in-memory databases. With respect to data processing, distributed data storage and computation have been made available by new frameworks such as the Hadoop framework (http://hadoop.apache.org). Hadoop uses the MapReduce approach [DG08], which distributes tasks over the nodes of a cluster in the map phase and reduces the amount of data in the reduce phase. Furthermore, Hadoop is able to integrate extensions such as the column-oriented database HBase. Thus, the framework combines a distributed computation architecture with the advantages of a NoSQL database system. Hadoop has already been used in life science applications such as Hadoop-BAM [NKS+12] and Crossbow [LSL+09].

Besides these technological aspects, information retrieval (IR) plays an important role as well. In this context, search engines play a pivotal role for an integrative IR over widely spread and heterogeneous biological data. Search engines are complex software systems and have to fulfil various qualitative requirements to get accepted by the scientific community. Their major components are discussed in [LHM+14]:

- Linguistic (text and data decomposition, e.g. tokenization; language processing, e.g. stop words and synonyms)
- Indexing (efficient search, e.g. inverse text index)
- Query processing (fuzzy matching and query expansion, e.g. phonetic search, query suggestion, spelling correction)
- Presentation (intuitive user interface, e.g. faceted search)
- Relevance estimation (feature extraction and ranking, e.g. text statistics, text feature scoring and user pertinence)
- Recommender systems (semantic links between related documents, e.g. "page like this" and "did you mean")

The implementation of those components is part of the research project LAILAPS [ECC+14].
LAILAPS is an information retrieval system to link plant genomic data in the context of phenotypic attributes for detailed forward genetic research. The underlying search engine allows fuzzy querying for candidate genes linked to specific traits over a loosely integrated system of indexed and interlinked genome databases. Query assistance and an evidence-based annotation system enable a time-efficient and comprehensive information retrieval. The results are sorted by relevance using an artificial neural network incorporating user feedback and behaviour tracking. While the ranking algorithm of LAILAPS provides user-specific results, the user might also be interested in links from an entry of interest to other relevant database entries. Such a recommender system is still a missing LAILAPS feature but would have an enormous impact on the quality of search results. A scientist may search for a specific gene and retrieve all information relevant to this gene without a dedicated search in different databases. To realise such a goal, recommendation systems are a widely used method in information retrieval. This concept is already used in several life science applications. For example, EB-eye, the IR system for all databases hosted at the European Bioinformatics Institute (EBI), provides suggestions for alternative database records [VSG+10]. Other examples are PubMed-based IR systems for searching in biomedical abstracts [Lu11]. In this work, we describe a concept for providing recommendations in LAILAPS based on semantic information.

2 Results

When users search in LAILAPS for specific terms, the result is a list of relevant database entries. Besides the particular search result, the user would benefit from a list of related database entries that are potentially of interest. To implement this feature, it is necessary to measure the similarity between database entries. A widely used concept is the expression of a document as a vector of words (tokens) and the computation of document distances by cosine similarity. Within this approach, the tokens of the documents are compared, meaning that documents using similar words have a higher similarity to each other. But to get a more useful result, especially in the context of life science, it is necessary to integrate semantic information into the comparison of documents. Here we present a method allowing the estimation of semantic relationships between documents.

2.1 Get semantics with database identifiers

A widely used concept for providing semantic annotation in life science databases are ontologies such as the Gene Ontology (GO) [HCI+04]. Each GO term has a specific ID allowing the exact identification of a term. Besides the use of ontologies, annotation targets are repositories of gene functions. These databases, such as Uniprot [BAW+05], have in common that each database entry can be referenced by a unique identifier. With the help of such unique identifiers for ontologies and database entries, the documents in LAILAPS can be compared on a semantic level. If, for example, two documents share a GO identifier, this can be interpreted as a semantic connection between these documents. The final goal therefore is a recommendation system which determines this information and recommends database entries to the end user based on these unique identifiers. For the extraction of the above-mentioned identifiers, different methods can be applied. One method is the use of regular expressions with specific patterns; a token beginning with the letters GO, for example, is likely a GO term [BSL+11]. Another method is described by Mehlhorn et al. [MLSS12], where predictions are made with the support of a neural network. Feature extraction focuses on positions, symbols as well as word statistics to predict a database entry identifier. To include a very high number of database identifiers, we decided to use the neural-network-based approach, allowing the identification of IDs based on known ID patterns.
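For illustration, the regular-expression variant can be sketched in a few lines of Python; the patterns below (GO terms and EC numbers) are simplified stand-ins, and the pipeline described next actually uses the neural-network-based IDPredictor rather than such hand-written rules.

```python
# Simplified sketch of rule-based identifier extraction; the actual pipeline uses
# the neural-network-based IDPredictor [MLSS12] instead of hand-written patterns.
import re

PATTERNS = [
    re.compile(r"\bGO:\d{7}\b"),                 # Gene Ontology terms: GO: plus seven digits
    re.compile(r"\bEC\s?\d+\.\d+\.\d+\.\d+\b"),  # Enzyme Commission numbers
]

def extract_ids(text):
    """Return the set of candidate identifiers found in a document."""
    found = set()
    for pattern in PATTERNS:
        found.update(pattern.findall(text))
    return found

print(extract_ids("example text with GO:0001234 and EC 1.1.1.1"))
# {'GO:0001234', 'EC 1.1.1.1'}  (set order may vary)
```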
2.2 Determine document relations

We applied Hadoop to identify IDs in a high-throughput manner. The Hadoop pipeline has two MapReduce components (see Figure 1). The first MapReduce job has a database as its input file. Each database entry consists of a unique document ID as well as the document content. The mapper analyses each document and detects tokens that might be an ID, using the IDPredictor tool from Mehlhorn et al. [MLSS12]. The reducer then generates a list of pairs of a token and the documents including this token. The second MapReduce job determines the document relations. Here, the mapper builds pairs of documents having an ID in common. The reducer finally counts the number of shared IDs for each document pair. A high count of shared IDs means a high similarity between two documents. The source code of the pipeline is available at: http://dx.doi.org/10.5447/IPK/2014/18.

Figure 1: Hadoop pipeline including two MapReduce components as well as the ID prediction component
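The following single-machine Python sketch mimics the two jobs on toy data to show the data flow; the published pipeline runs these steps as Hadoop MapReduce jobs and detects IDs with IDPredictor, for which a trivial regular expression stands in here.

```python
# Single-machine sketch of the two MapReduce jobs (the published pipeline runs on
# Hadoop and uses IDPredictor for ID detection; a trivial regex stands in here).
import re
from collections import defaultdict
from itertools import combinations

ID_PATTERN = re.compile(r"\bGO:\d{7}\b|\bEC\s?\d+\.\d+\.\d+\.\d+\b")

docs = {  # document ID -> content (toy input)
    "doc1": "some annotation text GO:0001234 and EC 2.3.1.86",
    "doc2": "another entry mentioning GO:0001234 plus EC 2.3.1.86",
    "doc3": "unrelated entry with GO:0009999 only",
}

# Job 1: the mapper detects candidate IDs per document,
# the reducer groups documents per ID.
docs_per_id = defaultdict(set)
for doc_id, content in docs.items():                 # map phase
    for token in ID_PATTERN.findall(content):
        docs_per_id[token].add(doc_id)               # emit (ID, document ID)
# (the shuffle/reduce grouping is emulated here by the defaultdict)

# Job 2: the mapper builds document pairs per shared ID,
# the reducer counts the shared IDs per pair.
shared_ids = defaultdict(int)
for token, doc_ids in docs_per_id.items():           # map phase
    for a, b in combinations(sorted(doc_ids), 2):
        shared_ids[(a, b)] += 1                      # reduce: count per pair

print(dict(shared_ids))   # {('doc1', 'doc2'): 2}
```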
2.3 Cosine similarity versus ID prediction

As a benchmark, we computed documents from the Swissprot database and compared the ranking results with the cosine similarity mentioned in section 2. While the cosine similarity score between two documents is built upon word frequencies and results in a value between zero and one, the ID prediction score is an integer value based on shared IDs. To make both values comparable, we calculated z-scores for both ranking scores. To detect deviations in the ranking, the results were plotted in a scatterplot (see Figure 2). The plot illustrates that in most cases there are only small differences in the ranking. But there are some cases of large differences in the relative ranking, indicating that for specific document relations the semantic component leads to a completely different ranking compared to the simple approach of comparing words.

Figure 2: Scatterplot illustrating the different ranking results between cosine similarity and ID prediction score

When looking into specific results with strongly different rankings, semantic similarities could be detected. Picking up one example from Figure 2 (marked with a red circle), a document pair was ranked at place 1 by ID prediction and at place 148 by cosine similarity. When looking into these documents, we could determine that they share a lot of IDs such as EC (Enzyme Commission) numbers and GO terms. Both documents deal with fatty acid synthase in fungal species. A protein BLAST against Swissprot with the protein sequence of document A listed document B in the fourth position with a score of 1801 and an identity of 44%.

3 Discussion and Conclusion

In this work, we developed a system providing recommendations based on semantic information. With the support of a neural network, IDs were predicted. With this information, documents can be compared on a semantic level. To support big data in life science, we implemented the document distance computation as a Hadoop pipeline. The results of our approach have shown differences to cosine similarity in the rankings. The ID-prediction-based approach is able to detect semantic similarities between documents and to recommend this information to the users. However, to get a precise idea about the quality improvement, the new method should be applied to the LAILAPS frontend system to determine whether the users are more interested in this new information. To integrate the presented pipeline into LAILAPS, powerful systems such as ORACLE Big Data [Dj13] could be a solution. It supports multiple data sources including Hadoop, NoSQL as well as the ORACLE database itself. Although Hadoop is a powerful system, LAILAPS would also benefit from a more integrative approach such as using in-memory technology. Users who would like to install their own LAILAPS instance might not be able to set up their own Hadoop cluster. In-memory systems might allow a just-in-time computation of the available data as well. LAILAPS would benefit from further investigations in this field.

Acknowledgements

The expertise and support of Steffen Flemming regarding the establishment and maintenance of the Hadoop cluster is gratefully acknowledged.

References

[BAW+05] Bairoch, A.; Apweiler, R.; Wu, C.H.; Barker, W.C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M. et al.: The universal protein resource (UniProt). Nucleic Acids Research, 33 (suppl 1):D154-D159, 2005.

[BSL+11] Bachmann, A.; Schult, R.; Lange, M.; Spiliopoulou, M.: Extracting Cross References from Life Science Databases for Search Result Ranking. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, 2011.

[DG08] Dean, J.; Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

[Dj13] Djicks, J.: Oracle: Big Data for the Enterprise. Oracle White Paper, 2013.

[ECC+14] Esch, M.; Chen, J.; Colmsee, C.; Klapperstück, M.; Grafahrend-Belau, E.; Scholz, U.; Lange, M.: LAILAPS – The Plant Science Search Engine. Plant and Cell Physiology, Epub ahead of print, 2014.

[HCI+04] Harris, M.A.; Clark, J.; Ireland, A.; Lomax, J.; Ashburner, M.; Foulger, R.; Eilbeck, K.; Lewis, S.; Marshall, B.; Mungall, C. et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32 (Database issue):D258, 2004.

[LHM+14] Lange, M.; Henkel, R.; Müller, W.; Waltemath, D.; Weise, S.: Information Retrieval in Life Sciences: A Programmatic Survey. In M. Chen, R. Hofestädt (editors): Approaches in Integrative Bioinformatics. Springer, 2014, pp. 73-109.

[LSL+09] Langmead, B.; Schatz, M.C.; Lin, J.; Pop, M.; Salzberg, S.L.: Searching for SNPs with cloud computing. Genome Biology, 10(11):R134, 2009.

[Lu11] Lu, Z.: PubMed and beyond: a survey of web tools for searching biomedical literature. Database, Oxford University Press, 2011.

[MLSS12] Mehlhorn, H.; Lange, M.; Scholz, U.; Schreiber, F.: IDPredictor: predict database links in biomedical database. Journal of Integrative Bioinformatics, 9, 2012.

[NKS+12] Niemenmaa, M.; Kallio, A.; Schumacher, A.; Klemelä, P.; Korpelainen, E.; Heljanko, K.: Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6):876-877, 2012.

[VSG+10] Valentin, F.; Squizzato, S.; Goujon, M.; McWilliam, H.; Paern, J.; Lopez, R.: Fast and efficient searching of biological data resources using EB-eye. Briefings in Bioinformatics, 11(4):375-384, 2010.
Sampling with Incremental MapReduce

Marc Schäfer, Johannes Schildgen, Stefan Deßloch
Heterogeneous Information Systems Group, Department of Computer Science, University of Kaiserslautern, D-67653 Kaiserslautern, Germany
{m schaef,schildgen,dessloch}@cs.uni-kl.de

Abstract: The goal of this paper is to increase the computation speed of MapReduce jobs by reducing the accuracy of the result. Often, timely processing is more important than the precision of the result. Hadoop has no built-in functionality for such an approximation technique, so the user has to implement sampling techniques manually. We introduce an automatic system for computing arithmetic approximations. The sampling is based on techniques from statistics, and the extrapolation is done generically. This system is also extended by an incremental component which enables the reuse of already computed results to enlarge the sampling size. This can be used iteratively to further increase the sampling size and thereby the precision of the approximation. We present a transparent incremental sampling approach, so the developed components can be integrated into the Hadoop framework in a non-invasive manner.

1 Introduction

Over the last ten years, MapReduce [DG08] has become an often-used programming model for analyzing Big Data. Hadoop (http://hadoop.apache.org) is an open-source implementation of the MapReduce framework and supports executing jobs on large clusters. Different from traditional relational database systems, MapReduce focuses on the three characteristics ("The 3 Vs") of Big Data, namely volume, velocity and variety [BL12]. Thus, efficient computations on very large, fast-changing and heterogeneous data are an important goal. One benefit of MapReduce is that it scales. So, it is well-suited for the KIWI approach ("Kill It With Iron"): if a computation is too slow, one can simply upgrade to better hardware ("Scale Up") or add more machines to a cluster ("Scale Out"). In this paper, we focus on a third dimension in addition to resources and time, namely computation accuracy. The dependencies of the dimensions can be depicted in a time-resources-accuracy triangle. It says that one cannot perform Big-Data analyses in a short time with few resources and perfect accuracy; the area of the triangle is constant. Thus, if one wants to be accurate and fast, more resources are needed (KIWI approach). If a hundred-percent accuracy is not mandatory, a job can run fast and without upgrading the hardware. On the one hand, most work regarding Big-Data analysis, i.e. frameworks and algorithms, is 100% precise. On the other hand, these approaches often give up the ACID properties and claim: eventual consistency ("BASE") is enough. So, let us add this: for many computations, a ninety-percent accuracy is enough. One example: who cares if the number of your friends' friends' friends in a social network is displayed as 1,000,000 instead of 1,100,000? Some people extend the definition of Big Data by a fourth "V": veracity [Nor13]. This means that the data sources differ in their quality. Data may be inaccurate, outdated or just wrong. So, in many cases, Big-Data analyses are already inaccurate. When using sampling, the accuracy of the result decreases again, but the computation time improves. Sampling means that only a part of the data is analyzed and the results are extrapolated at the end.
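A minimal sketch of this idea, independent of Hadoop and of the Marimba-based system described below: only a random fraction of the input is aggregated, and the result is extrapolated by the inverse sampling fraction. The function name and the toy data are illustrative only.

```python
# Minimal sketch of sampling-based approximation (not the Marimba implementation):
# analyze only a random sample of the input and extrapolate the aggregate.
import random

def approximate_sum(records, fraction=0.1, seed=42):
    """Estimate sum(records) from a Bernoulli sample of the given fraction."""
    rng = random.Random(seed)
    sample_sum = sum(r for r in records if rng.random() < fraction)
    return sample_sum / fraction          # extrapolate to the full data set

data = list(range(1_000_000))             # toy data set
print(approximate_sum(data, fraction=0.1))  # close to the exact value
print(sum(data))                            # exact value: 499999500000
```

In this picture, the incremental component described in the abstract would keep the already aggregated sample and add further sampled records to it in later iterations instead of starting over, which enlarges the sample and improves the precision of the estimate.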
Within this work, we extended the Marimba framework (see section 4.1) by a sampling component to execute existing Hadoop jo